The purpose of this analysis is to find out the key factors that determines whether a prospect patient will end up booking/keeping the appointments. The end goal of the analysis is to train a classification model to correctly predict whether a patient will book an appointment based on certain information.
The dataset used in this analysis is from SQL database, tables used are:
The data points includes some patient demographic information such as:
It also includes other key factors might play important role, such as:
Various techniques are used in the analysis, steps as follow:
Loading data and review top 1000 rows.
library(readr)
library(ggplot2)
library(dplyr)
library(reshape2)
library(MASS)
library(DT)
library(gridExtra)
library(stargazer)
library(stringr)
library(yardstick)
Patient_Demographic_Analysis <- read_csv("Patient Demographic Analysis.csv")
datatable(Patient_Demographic_Analysis[1:1000,-1],extensions = 'Scroller', rownames = FALSE,
options = list(deferRender = TRUE,scroller = TRUE, scrollX = TRUE,scrollY = "500px")) %>%
formatDate(c("CreateDate","date_of_birth"))
The data needs to be cleaned before further analysis. Data cleanning includes the following:
After cleaning the data, we selected columns that needed for the analysis
Patient_Demographic_Analysis_cleaned <- Patient_Demographic_Analysis %>%
dplyr::select(person_nbr,age,language,gender,Marketing_Source,Distance,Nextgen_Financial_Class,Booked,Kept) %>%
mutate(gender = as.factor(gender),
language = as.factor(language),
Marketing_Source = as.factor(Marketing_Source),
Nextgen_Financial_Class = as.factor(Nextgen_Financial_Class),
Booked = as.factor(Booked),
Kept = as.factor(Kept)) %>%
filter(Distance <= 50 & age < 100 & age > 10 & Nextgen_Financial_Class != "WC / Auto")
datatable(Patient_Demographic_Analysis_cleaned[1:1000,],extensions = 'Scroller', rownames = FALSE,
options = list(deferRender = TRUE,scroller = TRUE, scrollX = TRUE,scrollY = "500px"))
Descriptive Statistics of the data
summary(Patient_Demographic_Analysis_cleaned)
## person_nbr age language gender
## Min. :555464 Min. :11.0 English:69312 F:60792
## 1st Qu.:588730 1st Qu.:48.0 Other : 1761 M:19448
## Median :642886 Median :58.0 Spanish: 9167
## Mean :650328 Mean :58.1
## 3rd Qu.:702926 3rd Qu.:69.0
## Max. :771811 Max. :99.0
##
## Marketing_Source Distance Nextgen_Financial_Class
## Internet Website : 9988 Min. : 0.000 BCBS :20878
## Other : 8966 1st Qu.: 2.460 Medicare :18256
## Referral Friend / Family: 6381 Median : 4.660 Medicaid :10613
## Referral Physician :50604 Mean : 7.155 Commercial: 9411
## SignatureForum : 4301 3rd Qu.: 8.470 UHC : 5979
## Max. :49.990 Aetna : 5370
## (Other) : 9733
## Booked Kept
## 0: 7985 0:17390
## 1:72255 1:62850
##
##
##
##
##
Since we are trying to predict whether a patient will book appointment or not, we want to evaluate how the other variables changes when patient booked and when patient didn’t book appointment.
We start with numeric variables: age and distance
We can see that the distance is highly right skewed, which means that there are more patients towards the lower value.
none <- element_blank()
formatter <- function(...){
function(x) format(round(x, 0), ...)
}
ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Distance)) +
geom_histogram() +
labs(title="Patient distance to clinic Histogram",
x ="Distance", y = "Count") +
theme(legend.position = "bottom",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
)
Then we want to explore how much the meadian of distance is different when patient booked appointment or when they don’t.
For this, we are using boxplot to visualize the data. Below is a brief instruction on how to read a boxplot
Definitions
Median: The median (middle quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts. Half the scores are greater than or equal to this value and half are less.
Inter-quartile range: The middle “box” represents the middle 50% of scores for the group. The range of scores from lower to upper quartile is referred to as the inter-quartile range. The middle 50% of scores fall within the inter-quartile range.
Upper quartile: Seventy-five percent of the scores fall below the upper quartile.
Lower quartile: Twenty-five percent of scores fall below the lower quartile.
Whiskers: The upper and lower whiskers represent scores outside the middle 50%. Whiskers often (but not always) stretch over a wider range of scores than the middle quartile groups.
Outliers: Points outside the whiskers, which is either greater or smaller than 1.5 Inter-quartile range
ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Booked, y = Distance)) +
geom_boxplot() +
labs(title="Patient distance to clinic By Booked",
x ="Booked or Not", y = "Distance") +
theme(legend.position = "bottom",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
) +
scale_x_discrete(labels = c("Not Booked","Booked"))
From the boxplot above, we can see that there are not much difference in distance for those who booked appointments and those who doesn’t.
Then we look at the same for age.
First we look at the distribution of patient age, we can see from the histogram that age follows a normal distribution with a mean in the high 50s.
ggplot(Patient_Demographic_Analysis_cleaned,aes(x = age)) +
geom_histogram() +
labs(title="Patient age Histogram",
x ="Age", y = "Count") +
theme(legend.position = "bottom",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
)
ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Booked, y = age)) +
geom_boxplot() +
labs(title="Patient age By Booked",
x ="Booked or Not", y = "Age") +
theme(legend.position = "bottom",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
) +
scale_x_discrete(labels = c("Not Booked","Booked"))
Similarly, we can see that elder people are more likely to book appointments.
The next step would be to look at all the categorical variables using stacked bar plot. We want to see for each factor such as gender, language, how the number of patients in each bucket different for booked and not booked.
p1 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) +
geom_bar(aes(y = (..count..),fill = language),position="fill") +
labs(title="Count % of Language by Booked",
x ="Booked or Not", y = "%") +
theme(legend.position = "right",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
) +
scale_x_discrete(labels = c("Not Booked","Booked")) +
scale_fill_discrete(name = "Language") +
scale_y_continuous(labels = scales::percent)
p2 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) +
geom_bar(aes(y = (..count..),fill = gender),position="fill") +
labs(title="Count % of Gender by Booked",
x ="Booked or Not", y = "%") +
theme(legend.position = "right",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
) +
scale_x_discrete(labels = c("Not Booked","Booked")) +
scale_fill_discrete(name = "Gender") +
scale_y_continuous(labels = scales::percent)
p3 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) +
geom_bar(aes(y = (..count..),fill = Marketing_Source),position="fill") +
labs(title="Count % of Marketing Source by Booked",
x ="Booked or Not", y = "%") +
theme(legend.position = "right",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
) +
scale_x_discrete(labels = c("Not Booked","Booked")) +
scale_fill_discrete(name = "Source") +
scale_y_continuous(labels = scales::percent)
p4 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) +
geom_bar(aes(y = (..count..),fill = Nextgen_Financial_Class),position="fill") +
labs(title="Count % of Insurance by Booked",
x ="Booked or Not", y = "%") +
theme(legend.position = "right",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
) +
scale_x_discrete(labels = c("Not Booked","Booked")) +
scale_fill_discrete(name = "Insurance") +
scale_fill_brewer(palette="Spectral") +
scale_y_continuous(labels = scales::percent)
grid.arrange(p1, p2, p3, p4, ncol=2)
From the barplot above, we can see that:
Now that we have a better understanding of the data, we can start to build the model. The goal of the model is to use the training data to build a logistic regression model to predict whether a patient will book a appointment or not based on the demographic information given in the data.
set.seed(100)
trainingRows <- sample(1:nrow(Patient_Demographic_Analysis_cleaned), 0.7*nrow(Patient_Demographic_Analysis_cleaned))
training <- Patient_Demographic_Analysis_cleaned[trainingRows, ]
test <- Patient_Demographic_Analysis_cleaned[-trainingRows, ]
model_booked <- glm(Booked ~ age + language + gender + Marketing_Source + Distance + Nextgen_Financial_Class,
data = training,family = "binomial")
output_booked <- summary(model_booked)
output_booked
##
## Call:
## glm(formula = Booked ~ age + language + gender + Marketing_Source +
## Distance + Nextgen_Financial_Class, family = "binomial",
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4245 0.2275 0.3354 0.3742 2.5424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.800022 0.105171 26.624 < 2e-16
## age 0.007137 0.001287 5.544 2.95e-08
## languageOther -1.681126 0.071296 -23.579 < 2e-16
## languageSpanish 1.156234 0.069273 16.691 < 2e-16
## genderM -0.157622 0.037482 -4.205 2.61e-05
## Marketing_SourceOther -2.598264 0.065301 -39.789 < 2e-16
## Marketing_SourceReferral Friend / Family 0.703692 0.121381 5.797 6.74e-09
## Marketing_SourceReferral Physician -0.659061 0.063347 -10.404 < 2e-16
## Marketing_SourceSignatureForum -1.689350 0.075421 -22.399 < 2e-16
## Distance 0.009063 0.002129 4.256 2.08e-05
## Nextgen_Financial_ClassBCBS 0.137032 0.064916 2.111 0.034780
## Nextgen_Financial_ClassCigna 0.132267 0.097003 1.364 0.172714
## Nextgen_Financial_ClassCommercial 0.268118 0.074985 3.576 0.000349
## Nextgen_Financial_ClassMedicaid 0.770113 0.082679 9.314 < 2e-16
## Nextgen_Financial_ClassMedicare 0.089951 0.069658 1.291 0.196587
## Nextgen_Financial_ClassSelf Pay -1.839504 0.071916 -25.579 < 2e-16
## Nextgen_Financial_ClassTricare 0.288482 0.135888 2.123 0.033759
## Nextgen_Financial_ClassUHC 0.160851 0.081702 1.969 0.048983
##
## (Intercept) ***
## age ***
## languageOther ***
## languageSpanish ***
## genderM ***
## Marketing_SourceOther ***
## Marketing_SourceReferral Friend / Family ***
## Marketing_SourceReferral Physician ***
## Marketing_SourceSignatureForum ***
## Distance ***
## Nextgen_Financial_ClassBCBS *
## Nextgen_Financial_ClassCigna
## Nextgen_Financial_ClassCommercial ***
## Nextgen_Financial_ClassMedicaid ***
## Nextgen_Financial_ClassMedicare
## Nextgen_Financial_ClassSelf Pay ***
## Nextgen_Financial_ClassTricare *
## Nextgen_Financial_ClassUHC *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36489 on 56167 degrees of freedom
## Residual deviance: 29306 on 56150 degrees of freedom
## AIC: 29342
##
## Number of Fisher Scoring iterations: 6
# stargazer(model_booked,type = "text")
From the model output, we can see that most of the variables are statistically significant as the p values are less than 0.01.
For numeric variables, we can see that:
For categorical variables, we want to plot bar plot to see how each level compare to their base level
For gender, since there’s only two levels and the base level is Female and the coefficient for male is negative, we can conclude that male has a lower probability to book appointment compare to female
marketing_source <- as.data.frame(model_booked$coefficients[6:9]) %>%
tibble::rownames_to_column("Marketing_Source") %>%
rename(value = 2) %>%
mutate(Marketing_Source = substr(Marketing_Source,17,nchar(Marketing_Source)))
# View(marketing_source)
p1 <- ggplot(marketing_source,aes(x = reorder(Marketing_Source,-value), y = value)) +
geom_bar(stat="identity",fill = "blue") +
labs(title="Marketing source coefficient importance",
x ="Marketing Source", y = "") +
theme(legend.position = "right",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
)
insurance <- as.data.frame(model_booked$coefficients[11:18]) %>%
tibble::rownames_to_column("Insurance") %>%
rename(value = 2) %>%
mutate(Insurance = substr(Insurance,24,nchar(Insurance)))
# View(marketing_source)
p2 <- ggplot(insurance,aes(x = reorder(Insurance,-value), y = value)) +
geom_bar(stat="identity",fill = "red") +
labs(title="Insurance coefficient importance",
x ="Insurance", y = "") +
theme(legend.position = "right",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
)
grid.arrange(p1,p2,ncol = 2)
From the barplot above, we can see that
After we train the model, we want to see how the model perform using the test dataset.
pred_booked <- predict(model_booked, test,type="response")
pred_booked <- ifelse(pred_booked > 0.5, 1, 0)
actual_booked <- as.data.frame(test$Booked)
cm_booked <- cbind(pred_booked,actual_booked) %>%
rename(predicted = 1,
actual = 2) %>%
mutate(predicted = as.factor(predicted))
cm_booked <- conf_mat(cm_booked, actual, predicted)
# pred_booked[1:20]
# test$Booked[1:20]
# cm_booked <- as.data.frame(table(pred_booked, test$Booked)) %>%
# rename(Predicted = 1,
# Actual = 2)
tab_book <- table(pred_booked, test$Booked)
autoplot(cm_booked, type = "heatmap") +
scale_fill_gradient(low="#D6EAF8",high = "#2E86C1") +
labs(title="Prediction vs Actual Confusion Matrix",
x ="Actual", y = "Prediction") +
theme(legend.position = "right",
strip.text = element_text(face = "bold",size = 10),
panel.background = none,
legend.title = element_text(face = "bold",size = 9),
legend.text = element_text(size = 9),
legend.background = none,
axis.title=element_text(face = "bold",size=9),
axis.text.x = element_text(face = "bold",size = 9),
axis.text.y = element_text(face = "bold",size = 9),
plot.title = element_text(face = "bold",size=9,hjust=0.5),
plot.caption = element_text(hjust = 0)
)
## misclassification rate
print(paste0("Misclassification Rate: ",100 - round(sum(diag(tab_book))/sum(tab_book)*100),"%"))
## [1] "Misclassification Rate: 9%"
print(paste0("Misclassification Rate for not booked patients: ",100 - round(tab_book[1,1]/sum(tab_book[1,])*100),"%"))
## [1] "Misclassification Rate for not booked patients: 34%"
print(paste0("Misclassification Rate for booked patients: ",100 - round(tab_book[2,2]/sum(tab_book[2,])*100),"%"))
## [1] "Misclassification Rate for booked patients: 9%"
From the plot above, we can see that overall, the model has a 91% accuracy to classify patients into booked and not booked group. The model does a better job in classifying booked patients than not booked patients. For predicting not booked patients, the misclassification rate is 34%, for booked patients, it’s only 8%.
From the analysis above, we can see that patient demographics plays an important role in whether a new patient will book appointment or not. Older female are the main targets for potential booking patients. Additionally, patients from marketing source such as referral friends / family and Internet or from insurance such as Medicaid and Tricare are more likely to book appointments. This information can be used to better target potential customers or perform better outbound call efforts by reaching out to the correct group of patients.