Introduction

The purpose of this analysis is to find out the key factors that determines whether a prospect patient will end up booking/keeping the appointments. The end goal of the analysis is to train a classification model to correctly predict whether a patient will book an appointment based on certain information.

Data Source

The dataset used in this analysis is from SQL database, tables used are:

The data points includes some patient demographic information such as:

It also includes other key factors might play important role, such as:

Methodology

Various techniques are used in the analysis, steps as follow:

Data Load

Loading data and review top 1000 rows.

library(readr)
library(ggplot2)
library(dplyr)
library(reshape2)
library(MASS)
library(DT)
library(gridExtra)
library(stargazer)
library(stringr)
library(yardstick)

Patient_Demographic_Analysis <- read_csv("Patient Demographic Analysis.csv")
datatable(Patient_Demographic_Analysis[1:1000,-1],extensions = 'Scroller', rownames = FALSE,
          options = list(deferRender = TRUE,scroller = TRUE, scrollX = TRUE,scrollY = "500px")) %>%
  formatDate(c("CreateDate","date_of_birth"))

Data Cleanning

The data needs to be cleaned before further analysis. Data cleanning includes the following:

After cleaning the data, we selected columns that needed for the analysis

Patient_Demographic_Analysis_cleaned <- Patient_Demographic_Analysis %>%
  dplyr::select(person_nbr,age,language,gender,Marketing_Source,Distance,Nextgen_Financial_Class,Booked,Kept) %>%
  mutate(gender = as.factor(gender),
         language = as.factor(language),
         Marketing_Source = as.factor(Marketing_Source),
         Nextgen_Financial_Class = as.factor(Nextgen_Financial_Class),
         Booked = as.factor(Booked),
         Kept = as.factor(Kept)) %>%
  filter(Distance <= 50 & age < 100 & age > 10 & Nextgen_Financial_Class != "WC / Auto")
datatable(Patient_Demographic_Analysis_cleaned[1:1000,],extensions = 'Scroller', rownames = FALSE,
          options = list(deferRender = TRUE,scroller = TRUE, scrollX = TRUE,scrollY = "500px"))

Exploratory Data Analysis

Descriptive Statistics of the data

summary(Patient_Demographic_Analysis_cleaned)
##    person_nbr          age          language     gender   
##  Min.   :555464   Min.   :11.0   English:69312   F:60792  
##  1st Qu.:588730   1st Qu.:48.0   Other  : 1761   M:19448  
##  Median :642886   Median :58.0   Spanish: 9167            
##  Mean   :650328   Mean   :58.1                            
##  3rd Qu.:702926   3rd Qu.:69.0                            
##  Max.   :771811   Max.   :99.0                            
##                                                           
##                  Marketing_Source    Distance      Nextgen_Financial_Class
##  Internet Website        : 9988   Min.   : 0.000   BCBS      :20878       
##  Other                   : 8966   1st Qu.: 2.460   Medicare  :18256       
##  Referral Friend / Family: 6381   Median : 4.660   Medicaid  :10613       
##  Referral Physician      :50604   Mean   : 7.155   Commercial: 9411       
##  SignatureForum          : 4301   3rd Qu.: 8.470   UHC       : 5979       
##                                   Max.   :49.990   Aetna     : 5370       
##                                                    (Other)   : 9733       
##  Booked    Kept     
##  0: 7985   0:17390  
##  1:72255   1:62850  
##                     
##                     
##                     
##                     
## 

Since we are trying to predict whether a patient will book appointment or not, we want to evaluate how the other variables changes when patient booked and when patient didn’t book appointment.

We start with numeric variables: age and distance

We can see that the distance is highly right skewed, which means that there are more patients towards the lower value.

none <- element_blank()
formatter <- function(...){
  function(x) format(round(x, 0), ...)
}
ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Distance)) + 
  geom_histogram() +   
  labs(title="Patient distance to clinic Histogram",
       x ="Distance", y = "Count") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  )

Then we want to explore how much the meadian of distance is different when patient booked appointment or when they don’t.

For this, we are using boxplot to visualize the data. Below is a brief instruction on how to read a boxplot

Definitions

ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Booked, y = Distance)) + 
  geom_boxplot() +   
  labs(title="Patient distance to clinic By Booked",
       x ="Booked or Not", y = "Distance") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked"))

From the boxplot above, we can see that there are not much difference in distance for those who booked appointments and those who doesn’t.

Then we look at the same for age.

First we look at the distribution of patient age, we can see from the histogram that age follows a normal distribution with a mean in the high 50s.

ggplot(Patient_Demographic_Analysis_cleaned,aes(x = age)) + 
  geom_histogram() +   
  labs(title="Patient age Histogram",
       x ="Age", y = "Count") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  )

ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Booked, y = age)) + 
  geom_boxplot() +   
  labs(title="Patient age By Booked",
       x ="Booked or Not", y = "Age") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked"))

Similarly, we can see that elder people are more likely to book appointments.

The next step would be to look at all the categorical variables using stacked bar plot. We want to see for each factor such as gender, language, how the number of patients in each bucket different for booked and not booked.

p1 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = language),position="fill") +   
  labs(title="Count % of Language by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Language") +
  scale_y_continuous(labels = scales::percent)
  
p2 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = gender),position="fill") +   
  labs(title="Count % of Gender by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Gender") +
  scale_y_continuous(labels = scales::percent)

p3 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = Marketing_Source),position="fill") +   
  labs(title="Count % of Marketing Source by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Source") +
  scale_y_continuous(labels = scales::percent)


p4 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = Nextgen_Financial_Class),position="fill") +   
  labs(title="Count % of Insurance by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Insurance") + 
  scale_fill_brewer(palette="Spectral") +
  scale_y_continuous(labels = scales::percent)

grid.arrange(p1, p2, p3, p4, ncol=2)

From the barplot above, we can see that:

Model Training

Now that we have a better understanding of the data, we can start to build the model. The goal of the model is to use the training data to build a logistic regression model to predict whether a patient will book a appointment or not based on the demographic information given in the data.

set.seed(100)
trainingRows <- sample(1:nrow(Patient_Demographic_Analysis_cleaned), 0.7*nrow(Patient_Demographic_Analysis_cleaned))
training <- Patient_Demographic_Analysis_cleaned[trainingRows, ]
test <- Patient_Demographic_Analysis_cleaned[-trainingRows, ]
model_booked <- glm(Booked ~ age + language + gender + Marketing_Source + Distance + Nextgen_Financial_Class, 
                         data = training,family = "binomial") 
output_booked <- summary(model_booked)
output_booked
## 
## Call:
## glm(formula = Booked ~ age + language + gender + Marketing_Source + 
##     Distance + Nextgen_Financial_Class, family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4245   0.2275   0.3354   0.3742   2.5424  
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               2.800022   0.105171  26.624  < 2e-16
## age                                       0.007137   0.001287   5.544 2.95e-08
## languageOther                            -1.681126   0.071296 -23.579  < 2e-16
## languageSpanish                           1.156234   0.069273  16.691  < 2e-16
## genderM                                  -0.157622   0.037482  -4.205 2.61e-05
## Marketing_SourceOther                    -2.598264   0.065301 -39.789  < 2e-16
## Marketing_SourceReferral Friend / Family  0.703692   0.121381   5.797 6.74e-09
## Marketing_SourceReferral Physician       -0.659061   0.063347 -10.404  < 2e-16
## Marketing_SourceSignatureForum           -1.689350   0.075421 -22.399  < 2e-16
## Distance                                  0.009063   0.002129   4.256 2.08e-05
## Nextgen_Financial_ClassBCBS               0.137032   0.064916   2.111 0.034780
## Nextgen_Financial_ClassCigna              0.132267   0.097003   1.364 0.172714
## Nextgen_Financial_ClassCommercial         0.268118   0.074985   3.576 0.000349
## Nextgen_Financial_ClassMedicaid           0.770113   0.082679   9.314  < 2e-16
## Nextgen_Financial_ClassMedicare           0.089951   0.069658   1.291 0.196587
## Nextgen_Financial_ClassSelf Pay          -1.839504   0.071916 -25.579  < 2e-16
## Nextgen_Financial_ClassTricare            0.288482   0.135888   2.123 0.033759
## Nextgen_Financial_ClassUHC                0.160851   0.081702   1.969 0.048983
##                                             
## (Intercept)                              ***
## age                                      ***
## languageOther                            ***
## languageSpanish                          ***
## genderM                                  ***
## Marketing_SourceOther                    ***
## Marketing_SourceReferral Friend / Family ***
## Marketing_SourceReferral Physician       ***
## Marketing_SourceSignatureForum           ***
## Distance                                 ***
## Nextgen_Financial_ClassBCBS              *  
## Nextgen_Financial_ClassCigna                
## Nextgen_Financial_ClassCommercial        ***
## Nextgen_Financial_ClassMedicaid          ***
## Nextgen_Financial_ClassMedicare             
## Nextgen_Financial_ClassSelf Pay          ***
## Nextgen_Financial_ClassTricare           *  
## Nextgen_Financial_ClassUHC               *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36489  on 56167  degrees of freedom
## Residual deviance: 29306  on 56150  degrees of freedom
## AIC: 29342
## 
## Number of Fisher Scoring iterations: 6
# stargazer(model_booked,type = "text")

From the model output, we can see that most of the variables are statistically significant as the p values are less than 0.01.

For numeric variables, we can see that:

For categorical variables, we want to plot bar plot to see how each level compare to their base level

For gender, since there’s only two levels and the base level is Female and the coefficient for male is negative, we can conclude that male has a lower probability to book appointment compare to female

marketing_source <- as.data.frame(model_booked$coefficients[6:9]) %>%
  tibble::rownames_to_column("Marketing_Source") %>%
  rename(value = 2) %>%
  mutate(Marketing_Source = substr(Marketing_Source,17,nchar(Marketing_Source)))
# View(marketing_source)
p1 <- ggplot(marketing_source,aes(x = reorder(Marketing_Source,-value), y = value)) + 
  geom_bar(stat="identity",fill = "blue") + 
  labs(title="Marketing source coefficient importance",
       x ="Marketing Source", y = "") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) 
insurance <- as.data.frame(model_booked$coefficients[11:18]) %>%
  tibble::rownames_to_column("Insurance") %>%
  rename(value = 2) %>%
  mutate(Insurance = substr(Insurance,24,nchar(Insurance)))
# View(marketing_source)
p2 <- ggplot(insurance,aes(x = reorder(Insurance,-value), y = value)) + 
  geom_bar(stat="identity",fill = "red") + 
  labs(title="Insurance coefficient importance",
       x ="Insurance", y = "") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) 
grid.arrange(p1,p2,ncol = 2)

From the barplot above, we can see that

Model testing and prediction

After we train the model, we want to see how the model perform using the test dataset.

pred_booked <- predict(model_booked, test,type="response")
pred_booked <- ifelse(pred_booked > 0.5, 1, 0)
actual_booked <- as.data.frame(test$Booked)
cm_booked <- cbind(pred_booked,actual_booked) %>%
  rename(predicted = 1,
         actual = 2) %>%
mutate(predicted = as.factor(predicted))
cm_booked <- conf_mat(cm_booked, actual, predicted)
# pred_booked[1:20]
# test$Booked[1:20]
# cm_booked <- as.data.frame(table(pred_booked, test$Booked)) %>%
#   rename(Predicted = 1,
#          Actual = 2)
tab_book <- table(pred_booked, test$Booked)
autoplot(cm_booked, type = "heatmap") +
  scale_fill_gradient(low="#D6EAF8",high = "#2E86C1") + 
  labs(title="Prediction vs Actual Confusion Matrix",
       x ="Actual", y = "Prediction") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) 

## misclassification rate
print(paste0("Misclassification Rate: ",100 - round(sum(diag(tab_book))/sum(tab_book)*100),"%"))
## [1] "Misclassification Rate: 9%"
print(paste0("Misclassification Rate for not booked patients: ",100 - round(tab_book[1,1]/sum(tab_book[1,])*100),"%"))
## [1] "Misclassification Rate for not booked patients: 34%"
print(paste0("Misclassification Rate for booked patients: ",100 - round(tab_book[2,2]/sum(tab_book[2,])*100),"%"))
## [1] "Misclassification Rate for booked patients: 9%"

From the plot above, we can see that overall, the model has a 91% accuracy to classify patients into booked and not booked group. The model does a better job in classifying booked patients than not booked patients. For predicting not booked patients, the misclassification rate is 34%, for booked patients, it’s only 8%.

Conclusion

From the analysis above, we can see that patient demographics plays an important role in whether a new patient will book appointment or not. Older female are the main targets for potential booking patients. Additionally, patients from marketing source such as referral friends / family and Internet or from insurance such as Medicaid and Tricare are more likely to book appointments. This information can be used to better target potential customers or perform better outbound call efforts by reaching out to the correct group of patients.