options(repos = c(CRAN = "https://cran.rstudio.com/"))AAC 6510 Dataset Project
1 Quarto
2 Introduction to our Dataset
This study explores the most important factors that influence the adoption of shelter animals. By understanding the key characteristics that impact adoption decisions, this analysis aims to provide valuable insights to help animal shelters improve their strategies and increase adoption rates.
2.1 Research Question
How do various animal traits—such as breed, age, sex, intake condition, color, and intake type—affect the likelihood of being adopted from shelters?
2.2 Dependent Variables
The dependent variable in this study is:
Likelihood of Adoption
This is a binary outcome that indicates whether an animal was adopted or not:
1 = The animal was adopted.
0 = The animal was not adopted (e.g., transferred to another shelter).
This variable is the central focus of our analysis, as we aim to understand which animal characteristics (independent variables) are associated with a higher likelihood of adoption. By identifying these key factors, we can provide insights to help shelters improve adoption strategies and outcomes.
2.3 Independent Variables
The independent variables in this study are the characteristics of the animals that may influence their likelihood of being adopted. These variables were carefully chosen based on shelter data and include the following:
1. Intake Type - Describes how the animal came into the shelter.
Stray: The animal was found without an owner.
Public Assist: The animal was brought in through public assistance.
Owner Surrender: The previous owner gave up the animal.
2. Intake Condition - The physical or medical state of the animal at the time of intake:
Normal: The animal appeared healthy.
Injured: The animal had visible injuries.
Aged: The animal was considered older.
3. Age Upon Intake - The age of the animal when it entered the shelter, measured in years.
5. Breed Type - The breed classification of the animal:
Popular Family Dog: Includes breeds like Labrador Retrievers and Golden Retrievers that are often considered family-friendly.
Protective Breed: Breeds known for their protective instincts, such as German Shepherds and Doberman Pinschers.
High-Energy Breed: Breeds like Huskies and Malamutes that are known for being active.
Small Cuddly Breed: Small-sized breeds like Chihuahuas and Yorkshire Terriers.
6. Intake Year - The year the animal was taken into the shelter, which helps identify trends over time.
7. Intake as Stray or Public Assist - Binary variables indicating if the animal was a stray or brought in through public assistance.
By analyzing these variables, the study identifies which traits—like breed, age, or condition—are most strongly associated with adoption outcomes. This helps shelters focus on improving the adoption chances for animals that might be overlooked.
3 Loading Packages
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(rio)
library(dplyr)4 Running Code
DataAdoptionPets <- import('aac_intakes_outcomes.csv') |>
clean_names("upper_camel")
head(DataAdoptionPets) AgeUponOutcome AnimalIdOutcome DateOfBirth OutcomeSubtype OutcomeType
1 10 years A006100 7/9/07 0:00 Return to Owner
2 7 years A006100 7/9/07 0:00 Return to Owner
3 6 years A006100 7/9/07 0:00 Return to Owner
4 10 years A047759 4/2/04 0:00 Partner Transfer
5 16 years A134067 10/16/97 0:00 Return to Owner
6 15 years A141142 6/1/98 0:00 Return to Owner
SexUponOutcome AgeUponOutcomeDays AgeUponOutcomeYears AgeUponOutcomeAgeGroup
1 Neutered Male 3650 10 (7.5, 10.0]
2 Neutered Male 2555 7 (5.0, 7.5]
3 Neutered Male 2190 6 (5.0, 7.5]
4 Neutered Male 3650 10 (7.5, 10.0]
5 Neutered Male 5840 16 (15.0, 17.5]
6 Spayed Female 5475 15 (12.5, 15.0]
OutcomeDatetime OutcomeMonth OutcomeYear OutcomeMonthyear OutcomeWeekday
1 12/7/17 14:07 12 2017 2017-12 Thursday
2 12/20/14 16:35 12 2014 2014-12 Saturday
3 3/8/14 17:10 3 2014 2014-03 Saturday
4 4/7/14 15:12 4 2014 2014-04 Monday
5 11/16/13 11:54 11 2013 2013-11 Saturday
6 11/17/13 11:40 11 2013 2013-11 Sunday
OutcomeHour OutcomeNumber DobYear DobMonth DobMonthyear AgeUponIntake
1 0 1 2007 7 2017-12 10 years
2 16 2 2007 7 2014-12 7 years
3 17 3 2007 7 2014-03 6 years
4 15 1 2004 4 2014-04 10 years
5 11 1 1997 10 2013-11 16 years
6 11 1 1998 6 2013-11 15 years
AnimalIdIntake AnimalType Breed Color
1 A006100 Dog Spinone Italiano Mix Yellow/White
2 A006100 Dog Spinone Italiano Mix Yellow/White
3 A006100 Dog Spinone Italiano Mix Yellow/White
4 A047759 Dog Dachshund Tricolor
5 A134067 Dog Shetland Sheepdog Brown/White
6 A141142 Dog Labrador Retriever/Pit Bull Black/White
FoundLocation IntakeCondition IntakeType
1 Colony Creek And Hunters Trace in Austin (TX) Normal Stray
2 8700 Research Blvd in Austin (TX) Normal Public Assist
3 8700 Research in Austin (TX) Normal Public Assist
4 Austin (TX) Normal Owner Surrender
5 12034 Research Blvd in Austin (TX) Injured Public Assist
6 Austin (TX) Aged Stray
SexUponIntake Count AgeUponIntakeDays AgeUponIntakeYears
1 Neutered Male 1 3650 10
2 Neutered Male 1 2555 7
3 Neutered Male 1 2190 6
4 Neutered Male 1 3650 10
5 Neutered Male 1 5840 16
6 Spayed Female 1 5475 15
AgeUponIntakeAgeGroup IntakeDatetime IntakeMonth IntakeYear IntakeMonthyear
1 (7.5, 10.0] 12/7/17 0:00 12 2017 2017-12
2 (5.0, 7.5] 12/19/14 10:21 12 2014 2014-12
3 (5.0, 7.5] 3/7/14 14:26 3 2014 2014-03
4 (7.5, 10.0] 4/2/14 15:55 4 2014 2014-04
5 (15.0, 17.5] 11/16/13 9:02 11 2013 2013-11
6 (12.5, 15.0] 11/16/13 14:46 11 2013 2013-11
IntakeWeekday IntakeHour IntakeNumber TimeInShelter
1 Thursday 14 1 0 days 14:07:00.000000000
2 Friday 10 2 1 days 06:14:00.000000000
3 Friday 14 3 1 days 02:44:00.000000000
4 Wednesday 15 1 4 days 23:17:00.000000000
5 Saturday 9 1 0 days 02:52:00.000000000
6 Saturday 14 1 0 days 20:54:00.000000000
TimeInShelterDays
1 0.5881944
2 1.2597222
3 1.1138889
4 4.9701389
5 0.1194444
6 0.8708333
5 Directed Acrylical Graph (DAG)
library(ggdag)
Attaching package: 'ggdag'
The following object is masked from 'package:stats':
filter
library(ggplot2)
set.seed(123)
dag <- dagify(
IsAdopted ~ Breed + IntakeCondition + IntakeYear + IntakeType + AgeUponIntakeYears,
IntakeCondition ~ IntakeType,
OutcomeType ~ IntakeCondition + IntakeType + Breed,
outcome = "IsAdopted"
)
ggdag_status(dag, node_size = 22, text_size = 2) +
theme_void()6 Cleaning Data
# Cleaning Data for Selected Variables
DataAdoptionPetsYes <- DataAdoptionPets %>% select(OutcomeType, AnimalType, Breed, IntakeCondition, IntakeType, Color, AgeUponIntakeYears, IntakeYear)
# Filtering for Dog's only
filtered_data <- DataAdoptionPetsYes %>% filter(grepl("Dog", AnimalType))
filtered_data <- DataAdoptionPetsYes %>%
filter(str_trim(OutcomeType) %in% c("Adoption", "Transfer"))
# View the filtered data
head(filtered_data) OutcomeType AnimalType Breed IntakeCondition
1 Transfer Dog Dachshund Normal
2 Transfer Cat American Shorthair Mix Normal
3 Adoption Dog Dachshund Mix Normal
4 Adoption Dog Chihuahua Shorthair Aged
5 Adoption Dog Labrador Retriever/Beagle Normal
6 Transfer Dog Rat Terrier Mix Normal
IntakeType Color AgeUponIntakeYears IntakeYear
1 Owner Surrender Tricolor 10 2014
2 Stray Brown Tabby/White 14 2014
3 Owner Surrender Black/Tan 16 2013
4 Public Assist Tan/Black 17 2016
5 Stray Sable/White 15 2015
6 Stray Gray/White 15 2016
7 Selecting and Creating New Variables
#Filtering Breeds
library(dplyr)
DogDataYes <- filtered_data %>%
mutate(
is_popular_family_dog = ifelse(Breed %in% c("Labrador Retriever", "Labrador Retriever Mix", "Golden Retriever", "Beagle", "Pug", "French Bulldog"), 1, 0),
is_protective_breed = ifelse(Breed %in% c("German Shepherd", "Doberman Pinscher", "Rottweiler"), 1, 0),
is_high_energy_breed = ifelse(Breed %in% c("Husky, Siberian", "Alaskan Malamute", "Samoyed"), 1, 0),
is_small_cuddly_breed = ifelse(Breed %in% c("Chihuahua", "Yorkshire Terrier", "Poodle, Toy"), 1, 0),
)library(dplyr)
library(tidyr)
DogYesData <- DogDataYes %>%
mutate(is_normal = ifelse(IntakeCondition == "Normal", 1, 0),
is_injured = ifelse(IntakeCondition == "Injured", 1, 0),
is_aged = ifelse(IntakeCondition == "Aged", 1, 0))library(dplyr)
library(tidyr)
IntakeType <- DogYesData %>%
mutate(is_stray = ifelse(IntakeType == "Stray", 1, 0),
is_publicAsst = ifelse(IntakeType == "Public Assist", 1, 0),
is_surrender = ifelse(IntakeType == "Surrender", 1, 0)
)library(dplyr)
library(tidyr)
library(janitor)
DogDataYes <- DogDataYes %>% clean_names(case = "upper_camel")
FilteredData <- DogDataYes %>%
clean_names(case = "upper_camel") %>%
mutate(
is_adopted = ifelse(OutcomeType == "Adoption", 1,
ifelse(OutcomeType == "Transfer", 0, NA))
) %>%
filter(
!is.na(is_adopted),
AnimalType == "Dog",
IntakeType %in% c("Stray", "Public Assist", "Owner Surrender"),
IntakeCondition %in% c("Normal", "Injured", "Aged", "Sick"),
Breed %in% c(
"Labrador Retriever", "Labrador Retriever Mix", "Golden Retriever",
"Beagle", "Pug", "French Bulldog", "German Shepherd", "Doberman Pinscher",
"Rottweiler", "Husky, Siberian", "Alaskan Malamute", "Samoyed",
"Chihuahua", "Yorkshire Terrier", "Poodle, Toy"
)
) %>%
select(
AnimalType, Breed, IntakeCondition, IntakeType,
AgeUponIntakeYears, IntakeYear, OutcomeType
) %>%
mutate(
IsStray = ifelse(IntakeType == "Stray", 1, 0),
IsPublicAssist = ifelse(IntakeType == "Public Assist", 1, 0),
IsOwnerSurrender = ifelse(IntakeType == "Owner Surrender", 1, 0),
IsEuthanasiaReq = ifelse(IntakeType == "Euthanasia Request", 1, 0),
IsNormal = ifelse(IntakeCondition == "Normal", 1, 0),
IsInjured = ifelse(IntakeCondition == "Injured", 1, 0),
IsAged = ifelse(IntakeCondition == "Aged", 1, 0),
IsSick = ifelse(IntakeCondition == "Sick", 1, 0),
IsAdopted = ifelse(OutcomeType == "Adoption", 1, 0),
IsTransferred = ifelse(OutcomeType == "Transfer", 1, 0),
IsPopularFamilyDog = ifelse(Breed %in% c("Labrador Retriever", "Labrador Retriever Mix", "Golden Retriever", "Beagle", "Pug", "French Bulldog"), 1, 0),
IsProtectiveBreed = ifelse(Breed %in% c("German Shepherd", "Doberman Pinscher", "Rottweiler"), 1, 0),
IsHighEnergyBreed = ifelse(Breed %in% c("Husky", "Siberian", "Alaskan Malamute", "Samoyed"), 1, 0),
IsSmallCuddlyBreed = ifelse(Breed %in% c("Chihuahua", "Yorkshire Terrier", "Poodle, Toy"), 1, 0),
)
head(FilteredData) AnimalType Breed IntakeCondition IntakeType
1 Dog Labrador Retriever Mix Normal Stray
2 Dog Labrador Retriever Mix Normal Owner Surrender
3 Dog Labrador Retriever Mix Normal Owner Surrender
4 Dog Labrador Retriever Mix Sick Owner Surrender
5 Dog Labrador Retriever Mix Normal Owner Surrender
6 Dog Labrador Retriever Mix Normal Owner Surrender
AgeUponIntakeYears IntakeYear OutcomeType IsStray IsPublicAssist
1 15 2015 Transfer 1 0
2 12 2013 Adoption 0 0
3 13 2017 Adoption 0 0
4 11 2016 Adoption 0 0
5 10 2014 Transfer 0 0
6 9 2014 Adoption 0 0
IsOwnerSurrender IsEuthanasiaReq IsNormal IsInjured IsAged IsSick IsAdopted
1 0 0 1 0 0 0 0
2 1 0 1 0 0 0 1
3 1 0 1 0 0 0 1
4 1 0 0 0 0 1 1
5 1 0 1 0 0 0 0
6 1 0 1 0 0 0 1
IsTransferred IsPopularFamilyDog IsProtectiveBreed IsHighEnergyBreed
1 1 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 1 1 0 0
6 0 1 0 0
IsSmallCuddlyBreed
1 0
2 0
3 0
4 0
5 0
6 0
8 Logistic Regression
8.1 Attempt 1
LogModel <- glm( IsAdopted ~ IsStray + IsPublicAssist + IsOwnerSurrender + IsEuthanasiaReq + IsNormal + IsInjured + IsAged + IsSick + AgeUponIntakeYears + IsNeuteredMale + IsSpayedFemale + IsPopularFamilyDog + IsProtectiveBreed + IsHighEnergyBreed + IsSmallCuddlyBreed + IsTransferred, data = FilteredData, family = binomial() ) summary(LogModel)
9 Finalized Attempt
LogModel <- glm(
IsAdopted ~ IsStray + IsPublicAssist + IsNormal + IsInjured + IsAged +
AgeUponIntakeYears +
IsPopularFamilyDog + IsProtectiveBreed + IsHighEnergyBreed + IsTransferred,
data = FilteredData,
family = binomial()
)Warning: glm.fit: algorithm did not converge
summary(LogModel)
Call:
glm(formula = IsAdopted ~ IsStray + IsPublicAssist + IsNormal +
IsInjured + IsAged + AgeUponIntakeYears + IsPopularFamilyDog +
IsProtectiveBreed + IsHighEnergyBreed + IsTransferred, family = binomial(),
data = FilteredData)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.657e+01 9.030e+04 0.000 1.000
IsStray 7.190e-10 1.421e+04 0.000 1.000
IsPublicAssist 4.806e-10 3.578e+04 0.000 1.000
IsNormal -1.276e-10 4.003e+04 0.000 1.000
IsInjured -3.221e-10 5.359e+04 0.000 1.000
IsAged -5.529e-10 1.004e+05 0.000 1.000
AgeUponIntakeYears 5.783e-11 2.606e+03 0.000 1.000
IsPopularFamilyDog 8.393e-10 8.054e+04 0.000 1.000
IsProtectiveBreed 1.157e-09 8.450e+04 0.000 1.000
IsHighEnergyBreed 6.005e-10 2.643e+05 0.000 1.000
IsTransferred -5.313e+01 1.289e+04 -0.004 0.997
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.5762e+03 on 3714 degrees of freedom
Residual deviance: 2.1553e-08 on 3704 degrees of freedom
AIC: 22
Number of Fisher Scoring iterations: 25
Interpretation of Logistic Regression
The logistic regression results show that many variables in the dataset are highly correlated with each other, causing issues with the model. This is known as multicollinearity, where strong relationships between predictors make it difficult to determine their individual effects.
In our first attempt, some predictor variables resulted in NA values for their coefficients and p-values due to redundancy. These variables were removed to improve the model. However, multicollinearity still caused unstable coefficients and insignificant p-values.
The result of multicollinearity makes p-values unreliable in this model, even for important variables. Coefficients become unstable, showing inconsistent effects.
10 Correlation Matrix
library(corrplot)corrplot 0.95 loaded
library(tidyverse)
library(janitor)
DataForCorAnalysis <- FilteredData %>%
mutate(
OutcomeType_Adoption = ifelse(OutcomeType == "Adoption", 1, 0),
) %>%
select(
OutcomeType_Adoption,
IsStray, IsPublicAssist, IsOwnerSurrender, IsNormal,
AgeUponIntakeYears, IntakeYear
) %>%
select(where(is.numeric))
CorMatrix <- cor(DataForCorAnalysis, use = "complete.obs")
corrplot(CorMatrix, method = "shade", type = "full",
addCoef.col = "black", number.cex = 0.7, order = "original", diag = FALSE)Analysis of Selected Important Variables
AgeUponIntakeYears and IsNormal:
These two variables show a strong negative correlation, suggesting that animals classified as “normal” tend to be older at the time of intake. This may reduce their likelihood of adoption, as potential adopters often prefer younger animals.
IsNormal and OutcomeType_Adoption:
There is a weak positive correlation between these variables, indicating that animals categorized as “normal” have a slightly higher chance of adoption. This implies that a “normal” intake condition, while not strongly impactful, still slightly favors adoption outcomes.
IsPublicAssist and OutcomeType_Adoption:
A strong negative correlation exists between these two variables. This indicates that animals coming from public assist programs are less likely to be adopted. Possible reasons for this trend could include health conditions or behavioral issues associated with animals surrendered through such programs.
11 EDA Report
library(SmartEDA)Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
ExpReport(FilteredData, op_file = "EDAReport")
processing file: report_tmp_1.Rmd
|
| | 0%
|
|.. | 3%
|
|.... | 6% [setup]
|
|...... | 10%
|
|........ | 13% [od_1]
|
|.......... | 16%
|
|........... | 19% [od_2]
|
|............. | 23%
|
|............... | 26% [od_3]
|
|................. | 29%
|
|................... | 32% [od_32]
|
|..................... | 35%
|
|....................... | 39% [snv_all]
|
|......................... | 42%
|
|........................... | 45% [snv_2]
|
|............................. | 48%
|
|.............................. | 52% [snv_2.1]
|
|................................ | 55%
|
|.................................. | 58% [snv2_new]
|
|.................................... | 61%
|
|...................................... | 65% [snv2]
|
|........................................ | 68%
|
|.......................................... | 71% [snv22_1]
|
|............................................ | 74%
|
|.............................................. | 77% [eda_4]
|
|................................................ | 81%
|
|................................................. | 84% [e4.1]
|
|................................................... | 87%
|
|..................................................... | 90% [e4.1.1]
|
|....................................................... | 94%
|
|......................................................... | 97% [bp1]
|
|...........................................................| 100%
output file: /Users/angelaqian/6510 Customer Analytics 2/IBMProject/report_tmp_1.knit.md
/Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/pandoc +RTS -K512m -RTS '/Users/angelaqian/6510 Customer Analytics 2/IBMProject/report_tmp_1.knit.md' --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output pandoc54886b99f73.html --lua-filter /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 6 --template /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=cerulean --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /var/folders/nf/rbjfhmsn43d63srpp2byb0qr0000gn/T//RtmpeYztIP/rmarkdown-str54883b6394bb.html
Output created: EDAReport.html
Report is generated at "/Users/angelaqian/6510 Customer Analytics 2/IBMProject/EDAReport".
12 EDA Analysis:
IntakeYear
The bar plot indicates that the year 2017 had a significantly higher intake rate at 28%, compared to other years. This spike could be attributed to an increase in owner surrenders, potentially due to behavioral issues with the dogs.
Stray
The Stray category shows a high adoptability rate of 72%, which may suggest that being a stray does not necessarily imply behavioral or health issues. This perception likely correlates with higher adoption rates, as many potential owners prefer to adopt pets without pre-existing problems.
IsHighEnergyBreed
The bar plot for IsHighEnergyBreed reveals a 100% lower likelihood of adoption (0). This outcome is likely because high-energy breeds are often associated with behavioral challenges, such as aggression or excessive activity, which can deter potential adopters.
IsSmallCuddlyBreed
The IsSmallCuddlyBreed plot shows a 99% lower likelihood of adoption (0). This may be due to the fact that smaller breeds are more prone to health issues, such as respiratory problems, particularly in short-nosed and flatter-faced breeds.
13 Conclusion
The dependent variable, IsAdopted, appears to be highly imbalanced, with the model predominantly focusing on the majority class of adoption. This imbalance results in independent variables being highly correlated or multicollinearity, which dilutes their individual impact. As a result, many variables become insignificant when included in the model, despite being significant predictors in isolation.