AAC 6510 Dataset Project

Author

Angela Qian, Nerina Campos, Krystal Lee, Selina Im, Karima Ajahiou

1 Quarto

options(repos = c(CRAN = "https://cran.rstudio.com/"))

2 Introduction to our Dataset

This study explores the most important factors that influence the adoption of shelter animals. By understanding the key characteristics that impact adoption decisions, this analysis aims to provide valuable insights to help animal shelters improve their strategies and increase adoption rates.

Dataset Link

2.1 Research Question

How do various animal traits—such as breed, age, sex, intake condition, color, and intake type—affect the likelihood of being adopted from shelters?

2.2 Dependent Variables

The dependent variable in this study is:

  • Likelihood of Adoption

    • This is a binary outcome that indicates whether an animal was adopted or not:

      • 1 = The animal was adopted.

      • 0 = The animal was not adopted (e.g., transferred to another shelter).

This variable is the central focus of our analysis, as we aim to understand which animal characteristics (independent variables) are associated with a higher likelihood of adoption. By identifying these key factors, we can provide insights to help shelters improve adoption strategies and outcomes.

2.3 Independent Variables

The independent variables in this study are the characteristics of the animals that may influence their likelihood of being adopted. These variables were carefully chosen based on shelter data and include the following:

1. Intake Type - Describes how the animal came into the shelter.

  • Stray: The animal was found without an owner.

  • Public Assist: The animal was brought in through public assistance.

  • Owner Surrender: The previous owner gave up the animal.

2. Intake Condition - The physical or medical state of the animal at the time of intake:

  • Normal: The animal appeared healthy.

  • Injured: The animal had visible injuries.

  • Aged: The animal was considered older.

3. Age Upon Intake - The age of the animal when it entered the shelter, measured in years.

5. Breed Type - The breed classification of the animal:

  • Popular Family Dog: Includes breeds like Labrador Retrievers and Golden Retrievers that are often considered family-friendly.

  • Protective Breed: Breeds known for their protective instincts, such as German Shepherds and Doberman Pinschers.

  • High-Energy Breed: Breeds like Huskies and Malamutes that are known for being active.

  • Small Cuddly Breed: Small-sized breeds like Chihuahuas and Yorkshire Terriers.

6. Intake Year - The year the animal was taken into the shelter, which helps identify trends over time.

7. Intake as Stray or Public Assist - Binary variables indicating if the animal was a stray or brought in through public assistance.

By analyzing these variables, the study identifies which traits—like breed, age, or condition—are most strongly associated with adoption outcomes. This helps shelters focus on improving the adoption chances for animals that might be overlooked.

3 Loading Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(rio)
library(dplyr)

4 Running Code

DataAdoptionPets <- import('aac_intakes_outcomes.csv') |>
   clean_names("upper_camel")
  
head(DataAdoptionPets)
  AgeUponOutcome AnimalIdOutcome   DateOfBirth OutcomeSubtype     OutcomeType
1       10 years         A006100   7/9/07 0:00                Return to Owner
2        7 years         A006100   7/9/07 0:00                Return to Owner
3        6 years         A006100   7/9/07 0:00                Return to Owner
4       10 years         A047759   4/2/04 0:00        Partner        Transfer
5       16 years         A134067 10/16/97 0:00                Return to Owner
6       15 years         A141142   6/1/98 0:00                Return to Owner
  SexUponOutcome AgeUponOutcomeDays AgeUponOutcomeYears AgeUponOutcomeAgeGroup
1  Neutered Male               3650                  10            (7.5, 10.0]
2  Neutered Male               2555                   7             (5.0, 7.5]
3  Neutered Male               2190                   6             (5.0, 7.5]
4  Neutered Male               3650                  10            (7.5, 10.0]
5  Neutered Male               5840                  16           (15.0, 17.5]
6  Spayed Female               5475                  15           (12.5, 15.0]
  OutcomeDatetime OutcomeMonth OutcomeYear OutcomeMonthyear OutcomeWeekday
1   12/7/17 14:07           12        2017          2017-12       Thursday
2  12/20/14 16:35           12        2014          2014-12       Saturday
3    3/8/14 17:10            3        2014          2014-03       Saturday
4    4/7/14 15:12            4        2014          2014-04         Monday
5  11/16/13 11:54           11        2013          2013-11       Saturday
6  11/17/13 11:40           11        2013          2013-11         Sunday
  OutcomeHour OutcomeNumber DobYear DobMonth DobMonthyear AgeUponIntake
1           0             1    2007        7      2017-12      10 years
2          16             2    2007        7      2014-12       7 years
3          17             3    2007        7      2014-03       6 years
4          15             1    2004        4      2014-04      10 years
5          11             1    1997       10      2013-11      16 years
6          11             1    1998        6      2013-11      15 years
  AnimalIdIntake AnimalType                       Breed        Color
1        A006100        Dog        Spinone Italiano Mix Yellow/White
2        A006100        Dog        Spinone Italiano Mix Yellow/White
3        A006100        Dog        Spinone Italiano Mix Yellow/White
4        A047759        Dog                   Dachshund     Tricolor
5        A134067        Dog           Shetland Sheepdog  Brown/White
6        A141142        Dog Labrador Retriever/Pit Bull  Black/White
                                  FoundLocation IntakeCondition      IntakeType
1 Colony Creek And Hunters Trace in Austin (TX)          Normal           Stray
2             8700 Research Blvd in Austin (TX)          Normal   Public Assist
3                  8700 Research in Austin (TX)          Normal   Public Assist
4                                   Austin (TX)          Normal Owner Surrender
5            12034 Research Blvd in Austin (TX)         Injured   Public Assist
6                                   Austin (TX)            Aged           Stray
  SexUponIntake Count AgeUponIntakeDays AgeUponIntakeYears
1 Neutered Male     1              3650                 10
2 Neutered Male     1              2555                  7
3 Neutered Male     1              2190                  6
4 Neutered Male     1              3650                 10
5 Neutered Male     1              5840                 16
6 Spayed Female     1              5475                 15
  AgeUponIntakeAgeGroup IntakeDatetime IntakeMonth IntakeYear IntakeMonthyear
1           (7.5, 10.0]   12/7/17 0:00          12       2017         2017-12
2            (5.0, 7.5] 12/19/14 10:21          12       2014         2014-12
3            (5.0, 7.5]   3/7/14 14:26           3       2014         2014-03
4           (7.5, 10.0]   4/2/14 15:55           4       2014         2014-04
5          (15.0, 17.5]  11/16/13 9:02          11       2013         2013-11
6          (12.5, 15.0] 11/16/13 14:46          11       2013         2013-11
  IntakeWeekday IntakeHour IntakeNumber             TimeInShelter
1      Thursday         14            1 0 days 14:07:00.000000000
2        Friday         10            2 1 days 06:14:00.000000000
3        Friday         14            3 1 days 02:44:00.000000000
4     Wednesday         15            1 4 days 23:17:00.000000000
5      Saturday          9            1 0 days 02:52:00.000000000
6      Saturday         14            1 0 days 20:54:00.000000000
  TimeInShelterDays
1         0.5881944
2         1.2597222
3         1.1138889
4         4.9701389
5         0.1194444
6         0.8708333

5 Directed Acrylical Graph (DAG)

library(ggdag)

Attaching package: 'ggdag'
The following object is masked from 'package:stats':

    filter
library(ggplot2)
set.seed(123)
dag <- dagify(
  IsAdopted ~ Breed + IntakeCondition + IntakeYear + IntakeType + AgeUponIntakeYears,
  IntakeCondition ~ IntakeType, 
  OutcomeType ~ IntakeCondition + IntakeType + Breed,
  outcome = "IsAdopted"
)
ggdag_status(dag, node_size = 22, text_size = 2) +
  theme_void()

6 Cleaning Data

# Cleaning Data for Selected Variables

DataAdoptionPetsYes <- DataAdoptionPets %>% select(OutcomeType, AnimalType, Breed, IntakeCondition, IntakeType, Color, AgeUponIntakeYears, IntakeYear)

# Filtering for Dog's only
filtered_data <- DataAdoptionPetsYes %>% filter(grepl("Dog", AnimalType))

filtered_data <- DataAdoptionPetsYes %>%
  filter(str_trim(OutcomeType) %in% c("Adoption", "Transfer"))

# View the filtered data
head(filtered_data)
  OutcomeType AnimalType                     Breed IntakeCondition
1    Transfer        Dog                 Dachshund          Normal
2    Transfer        Cat    American Shorthair Mix          Normal
3    Adoption        Dog             Dachshund Mix          Normal
4    Adoption        Dog       Chihuahua Shorthair            Aged
5    Adoption        Dog Labrador Retriever/Beagle          Normal
6    Transfer        Dog           Rat Terrier Mix          Normal
       IntakeType             Color AgeUponIntakeYears IntakeYear
1 Owner Surrender          Tricolor                 10       2014
2           Stray Brown Tabby/White                 14       2014
3 Owner Surrender         Black/Tan                 16       2013
4   Public Assist         Tan/Black                 17       2016
5           Stray       Sable/White                 15       2015
6           Stray        Gray/White                 15       2016

7 Selecting and Creating New Variables

#Filtering Breeds 

library(dplyr)

DogDataYes <- filtered_data %>%
  mutate(
    is_popular_family_dog = ifelse(Breed %in% c("Labrador Retriever", "Labrador Retriever Mix", "Golden Retriever", "Beagle", "Pug", "French Bulldog"), 1, 0),
    is_protective_breed = ifelse(Breed %in% c("German Shepherd", "Doberman Pinscher", "Rottweiler"), 1, 0),
    is_high_energy_breed = ifelse(Breed %in% c("Husky, Siberian", "Alaskan Malamute", "Samoyed"), 1, 0),
    is_small_cuddly_breed = ifelse(Breed %in% c("Chihuahua", "Yorkshire Terrier", "Poodle, Toy"), 1, 0),
  )
library(dplyr)
library(tidyr)


DogYesData <- DogDataYes %>%
  mutate(is_normal = ifelse(IntakeCondition == "Normal", 1, 0),
         is_injured = ifelse(IntakeCondition == "Injured", 1, 0),
          is_aged = ifelse(IntakeCondition == "Aged", 1, 0))
library(dplyr)
library(tidyr)

IntakeType <- DogYesData %>%
  mutate(is_stray = ifelse(IntakeType == "Stray", 1, 0),
         is_publicAsst = ifelse(IntakeType == "Public Assist", 1, 0),
         is_surrender = ifelse(IntakeType == "Surrender", 1, 0)
)
library(dplyr)
library(tidyr)
library(janitor)


DogDataYes <- DogDataYes %>% clean_names(case = "upper_camel")

FilteredData <- DogDataYes %>%
  clean_names(case = "upper_camel") %>%
  mutate(
    is_adopted = ifelse(OutcomeType == "Adoption", 1,
                 ifelse(OutcomeType == "Transfer", 0, NA))
  ) %>%
  filter(
    !is.na(is_adopted),                        
    AnimalType == "Dog",                      
    IntakeType %in% c("Stray", "Public Assist", "Owner Surrender"),
    IntakeCondition %in% c("Normal", "Injured", "Aged", "Sick"),
    Breed %in% c(                             
      "Labrador Retriever", "Labrador Retriever Mix", "Golden Retriever",
      "Beagle", "Pug", "French Bulldog", "German Shepherd", "Doberman Pinscher",
      "Rottweiler", "Husky, Siberian", "Alaskan Malamute", "Samoyed",
      "Chihuahua", "Yorkshire Terrier", "Poodle, Toy"
    )
  ) %>%
  select(
    AnimalType, Breed, IntakeCondition, IntakeType, 
    AgeUponIntakeYears, IntakeYear, OutcomeType
  ) %>%
  mutate(
    IsStray = ifelse(IntakeType == "Stray", 1, 0),
    IsPublicAssist = ifelse(IntakeType == "Public Assist", 1, 0),
    IsOwnerSurrender = ifelse(IntakeType == "Owner Surrender", 1, 0),
    IsEuthanasiaReq = ifelse(IntakeType == "Euthanasia Request", 1, 0),
    IsNormal = ifelse(IntakeCondition == "Normal", 1, 0),
    IsInjured = ifelse(IntakeCondition == "Injured", 1, 0),
    IsAged = ifelse(IntakeCondition == "Aged", 1, 0),
    IsSick = ifelse(IntakeCondition == "Sick", 1, 0),
    IsAdopted = ifelse(OutcomeType == "Adoption", 1, 0),
    IsTransferred = ifelse(OutcomeType == "Transfer", 1, 0),
    IsPopularFamilyDog = ifelse(Breed %in% c("Labrador Retriever", "Labrador Retriever Mix", "Golden Retriever", "Beagle", "Pug", "French Bulldog"), 1, 0),
    IsProtectiveBreed = ifelse(Breed %in% c("German Shepherd", "Doberman Pinscher", "Rottweiler"), 1, 0),
    IsHighEnergyBreed = ifelse(Breed %in% c("Husky", "Siberian", "Alaskan Malamute", "Samoyed"), 1, 0),
    IsSmallCuddlyBreed = ifelse(Breed %in% c("Chihuahua", "Yorkshire Terrier", "Poodle, Toy"), 1, 0),
  )


head(FilteredData)
  AnimalType                  Breed IntakeCondition      IntakeType
1        Dog Labrador Retriever Mix          Normal           Stray
2        Dog Labrador Retriever Mix          Normal Owner Surrender
3        Dog Labrador Retriever Mix          Normal Owner Surrender
4        Dog Labrador Retriever Mix            Sick Owner Surrender
5        Dog Labrador Retriever Mix          Normal Owner Surrender
6        Dog Labrador Retriever Mix          Normal Owner Surrender
  AgeUponIntakeYears IntakeYear OutcomeType IsStray IsPublicAssist
1                 15       2015    Transfer       1              0
2                 12       2013    Adoption       0              0
3                 13       2017    Adoption       0              0
4                 11       2016    Adoption       0              0
5                 10       2014    Transfer       0              0
6                  9       2014    Adoption       0              0
  IsOwnerSurrender IsEuthanasiaReq IsNormal IsInjured IsAged IsSick IsAdopted
1                0               0        1         0      0      0         0
2                1               0        1         0      0      0         1
3                1               0        1         0      0      0         1
4                1               0        0         0      0      1         1
5                1               0        1         0      0      0         0
6                1               0        1         0      0      0         1
  IsTransferred IsPopularFamilyDog IsProtectiveBreed IsHighEnergyBreed
1             1                  1                 0                 0
2             0                  1                 0                 0
3             0                  1                 0                 0
4             0                  1                 0                 0
5             1                  1                 0                 0
6             0                  1                 0                 0
  IsSmallCuddlyBreed
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0

8 Logistic Regression

8.1 Attempt 1

LogModel <- glm( IsAdopted ~ IsStray + IsPublicAssist + IsOwnerSurrender + IsEuthanasiaReq + IsNormal + IsInjured + IsAged + IsSick + AgeUponIntakeYears + IsNeuteredMale + IsSpayedFemale + IsPopularFamilyDog + IsProtectiveBreed + IsHighEnergyBreed + IsSmallCuddlyBreed + IsTransferred, data = FilteredData, family = binomial() ) summary(LogModel)

9 Finalized Attempt

LogModel <- glm(
  IsAdopted ~ IsStray + IsPublicAssist + IsNormal + IsInjured + IsAged + 
    AgeUponIntakeYears + 
    IsPopularFamilyDog + IsProtectiveBreed + IsHighEnergyBreed + IsTransferred,
  data = FilteredData,
  family = binomial()
)
Warning: glm.fit: algorithm did not converge
summary(LogModel)

Call:
glm(formula = IsAdopted ~ IsStray + IsPublicAssist + IsNormal + 
    IsInjured + IsAged + AgeUponIntakeYears + IsPopularFamilyDog + 
    IsProtectiveBreed + IsHighEnergyBreed + IsTransferred, family = binomial(), 
    data = FilteredData)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)         2.657e+01  9.030e+04   0.000    1.000
IsStray             7.190e-10  1.421e+04   0.000    1.000
IsPublicAssist      4.806e-10  3.578e+04   0.000    1.000
IsNormal           -1.276e-10  4.003e+04   0.000    1.000
IsInjured          -3.221e-10  5.359e+04   0.000    1.000
IsAged             -5.529e-10  1.004e+05   0.000    1.000
AgeUponIntakeYears  5.783e-11  2.606e+03   0.000    1.000
IsPopularFamilyDog  8.393e-10  8.054e+04   0.000    1.000
IsProtectiveBreed   1.157e-09  8.450e+04   0.000    1.000
IsHighEnergyBreed   6.005e-10  2.643e+05   0.000    1.000
IsTransferred      -5.313e+01  1.289e+04  -0.004    0.997

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.5762e+03  on 3714  degrees of freedom
Residual deviance: 2.1553e-08  on 3704  degrees of freedom
AIC: 22

Number of Fisher Scoring iterations: 25

Interpretation of Logistic Regression

  • The logistic regression results show that many variables in the dataset are highly correlated with each other, causing issues with the model. This is known as multicollinearity, where strong relationships between predictors make it difficult to determine their individual effects.

  • In our first attempt, some predictor variables resulted in NA values for their coefficients and p-values due to redundancy. These variables were removed to improve the model. However, multicollinearity still caused unstable coefficients and insignificant p-values.

  • The result of multicollinearity makes p-values unreliable in this model, even for important variables. Coefficients become unstable, showing inconsistent effects.

10 Correlation Matrix

library(corrplot)
corrplot 0.95 loaded
library(tidyverse)
library(janitor)

DataForCorAnalysis <- FilteredData %>%
  mutate(
    OutcomeType_Adoption = ifelse(OutcomeType == "Adoption", 1, 0), 
  ) %>%
  select(
    OutcomeType_Adoption, 
    IsStray, IsPublicAssist, IsOwnerSurrender, IsNormal,
    AgeUponIntakeYears, IntakeYear
  ) %>%
  select(where(is.numeric))

CorMatrix <- cor(DataForCorAnalysis, use = "complete.obs")

corrplot(CorMatrix, method = "shade", type = "full",
         addCoef.col = "black", number.cex = 0.7, order = "original", diag = FALSE)

Analysis of Selected Important Variables

  • AgeUponIntakeYears and IsNormal:

    These two variables show a strong negative correlation, suggesting that animals classified as “normal” tend to be older at the time of intake. This may reduce their likelihood of adoption, as potential adopters often prefer younger animals.

  • IsNormal and OutcomeType_Adoption:

    There is a weak positive correlation between these variables, indicating that animals categorized as “normal” have a slightly higher chance of adoption. This implies that a “normal” intake condition, while not strongly impactful, still slightly favors adoption outcomes.

  • IsPublicAssist and OutcomeType_Adoption:

    A strong negative correlation exists between these two variables. This indicates that animals coming from public assist programs are less likely to be adopted. Possible reasons for this trend could include health conditions or behavioral issues associated with animals surrendered through such programs.

11 EDA Report

library(SmartEDA)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
ExpReport(FilteredData, op_file = "EDAReport")


processing file: report_tmp_1.Rmd

  |                                                                 
  |                                                           |   0%
  |                                                                 
  |..                                                         |   3%           
  |                                                                 
  |....                                                       |   6% [setup]   
  |                                                                 
  |......                                                     |  10%           
  |                                                                 
  |........                                                   |  13% [od_1]    
  |                                                                 
  |..........                                                 |  16%           
  |                                                                 
  |...........                                                |  19% [od_2]    
  |                                                                 
  |.............                                              |  23%           
  |                                                                 
  |...............                                            |  26% [od_3]    
  |                                                                 
  |.................                                          |  29%           
  |                                                                 
  |...................                                        |  32% [od_32]   
  |                                                                 
  |.....................                                      |  35%           
  |                                                                 
  |.......................                                    |  39% [snv_all] 
  |                                                                 
  |.........................                                  |  42%           
  |                                                                 
  |...........................                                |  45% [snv_2]   
  |                                                                 
  |.............................                              |  48%           
  |                                                                 
  |..............................                             |  52% [snv_2.1] 
  |                                                                 
  |................................                           |  55%           
  |                                                                 
  |..................................                         |  58% [snv2_new]
  |                                                                 
  |....................................                       |  61%           
  |                                                                 
  |......................................                     |  65% [snv2]    
  |                                                                 
  |........................................                   |  68%           
  |                                                                 
  |..........................................                 |  71% [snv22_1] 
  |                                                                 
  |............................................               |  74%           
  |                                                                 
  |..............................................             |  77% [eda_4]   
  |                                                                 
  |................................................           |  81%           
  |                                                                 
  |.................................................          |  84% [e4.1]    
  |                                                                 
  |...................................................        |  87%           
  |                                                                 
  |.....................................................      |  90% [e4.1.1]  
  |                                                                 
  |.......................................................    |  94%           
  |                                                                 
  |.........................................................  |  97% [bp1]     
  |                                                                 
  |...........................................................| 100%           
                                                                                                     
output file: /Users/angelaqian/6510 Customer Analytics  2/IBMProject/report_tmp_1.knit.md
/Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/pandoc +RTS -K512m -RTS '/Users/angelaqian/6510 Customer Analytics  2/IBMProject/report_tmp_1.knit.md' --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output pandoc54886b99f73.html --lua-filter /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 6 --template /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=cerulean --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /var/folders/nf/rbjfhmsn43d63srpp2byb0qr0000gn/T//RtmpeYztIP/rmarkdown-str54883b6394bb.html 

Output created: EDAReport.html


Report is generated at "/Users/angelaqian/6510 Customer Analytics  2/IBMProject/EDAReport".

EDA Link

12 EDA Analysis:

  • IntakeYear

    The bar plot indicates that the year 2017 had a significantly higher intake rate at 28%, compared to other years. This spike could be attributed to an increase in owner surrenders, potentially due to behavioral issues with the dogs.

  • Stray

    The Stray category shows a high adoptability rate of 72%, which may suggest that being a stray does not necessarily imply behavioral or health issues. This perception likely correlates with higher adoption rates, as many potential owners prefer to adopt pets without pre-existing problems.

  • IsHighEnergyBreed

    The bar plot for IsHighEnergyBreed reveals a 100% lower likelihood of adoption (0). This outcome is likely because high-energy breeds are often associated with behavioral challenges, such as aggression or excessive activity, which can deter potential adopters.

  • IsSmallCuddlyBreed

    The IsSmallCuddlyBreed plot shows a 99% lower likelihood of adoption (0). This may be due to the fact that smaller breeds are more prone to health issues, such as respiratory problems, particularly in short-nosed and flatter-faced breeds.

13 Conclusion

The dependent variable, IsAdopted, appears to be highly imbalanced, with the model predominantly focusing on the majority class of adoption. This imbalance results in independent variables being highly correlated or multicollinearity, which dilutes their individual impact. As a result, many variables become insignificant when included in the model, despite being significant predictors in isolation.