Patient Demographic Analysis

Introduction

The purpose of this analysis is to find out the key factors that determines whether a prospect patient will end up booking/keeping the appointments. The end goal of the analysis is to train a classification model to correctly predict whether a patient will book an appointment based on certain information.

Data Source

The dataset used in this analysis is from SQL database, tables used are:

person table
appointment table
location
ZipCode Census (use to get closest distance for a given zip code)
person source
person primary payer (get patient insurance)

The data points includes some patient demographic information such as:

Gender
Age
Language

It also includes other key factors might play important role, such as:

Patient marketing source
Distance to closest clinic
Patient insurance

Methodology

Various techniques are used in the analysis, steps as follow:

Exploratory Data Analysis
Data Visualization
Logistic Regression Model Training
Testing and Validation
Prediction

Data Load

Loading data and review top 1000 rows.

library(readr)
library(ggplot2)
library(dplyr)
library(reshape2)
library(MASS)
library(DT)
library(gridExtra)
library(stargazer)
library(stringr)
library(yardstick)

Patient_Demographic_Analysis <- read_csv("Patient Demographic Analysis.csv")
datatable(Patient_Demographic_Analysis[1:1000,-1],extensions = 'Scroller', rownames = FALSE,
          options = list(deferRender = TRUE,scroller = TRUE, scrollX = TRUE,scrollY = "500px")) %>%
  formatDate(c("CreateDate","date_of_birth"))

Data Cleanning

The data needs to be cleaned before further analysis. Data cleanning includes the following:

Transfer categorical variables such as gender, language, booked, kept to factor
Excludes age that is less than 10 yrs old and greater than 100 yrs old as these are likely entered by mistake
Excludes distance that is greater than 50 miles, these are outliers we need to exclude from the analysis

After cleaning the data, we selected columns that needed for the analysis

Patient_Demographic_Analysis_cleaned <- Patient_Demographic_Analysis %>%
  dplyr::select(person_nbr,age,language,gender,Marketing_Source,Distance,Nextgen_Financial_Class,Booked,Kept) %>%
  mutate(gender = as.factor(gender),
         language = as.factor(language),
         Marketing_Source = as.factor(Marketing_Source),
         Nextgen_Financial_Class = as.factor(Nextgen_Financial_Class),
         Booked = as.factor(Booked),
         Kept = as.factor(Kept)) %>%
  filter(Distance <= 50 & age < 100 & age > 10 & Nextgen_Financial_Class != "WC / Auto")
datatable(Patient_Demographic_Analysis_cleaned[1:1000,],extensions = 'Scroller', rownames = FALSE,
          options = list(deferRender = TRUE,scroller = TRUE, scrollX = TRUE,scrollY = "500px"))

Exploratory Data Analysis

Descriptive Statistics of the data

summary(Patient_Demographic_Analysis_cleaned)

##    person_nbr          age          language     gender   
##  Min.   :555464   Min.   :11.0   English:69312   F:60792  
##  1st Qu.:588730   1st Qu.:48.0   Other  : 1761   M:19448  
##  Median :642886   Median :58.0   Spanish: 9167            
##  Mean   :650328   Mean   :58.1                            
##  3rd Qu.:702926   3rd Qu.:69.0                            
##  Max.   :771811   Max.   :99.0                            
##                                                           
##                  Marketing_Source    Distance      Nextgen_Financial_Class
##  Internet Website        : 9988   Min.   : 0.000   BCBS      :20878       
##  Other                   : 8966   1st Qu.: 2.460   Medicare  :18256       
##  Referral Friend / Family: 6381   Median : 4.660   Medicaid  :10613       
##  Referral Physician      :50604   Mean   : 7.155   Commercial: 9411       
##  SignatureForum          : 4301   3rd Qu.: 8.470   UHC       : 5979       
##                                   Max.   :49.990   Aetna     : 5370       
##                                                    (Other)   : 9733       
##  Booked    Kept     
##  0: 7985   0:17390  
##  1:72255   1:62850  
##                     
##                     
##                     
##                     
##

Since we are trying to predict whether a patient will book appointment or not, we want to evaluate how the other variables changes when patient booked and when patient didn’t book appointment.

We start with numeric variables: age and distance

We can see that the distance is highly right skewed, which means that there are more patients towards the lower value.

none <- element_blank()
formatter <- function(...){
  function(x) format(round(x, 0), ...)
}
ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Distance)) + 
  geom_histogram() +   
  labs(title="Patient distance to clinic Histogram",
       x ="Distance", y = "Count") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  )

Then we want to explore how much the meadian of distance is different when patient booked appointment or when they don’t.

For this, we are using boxplot to visualize the data. Below is a brief instruction on how to read a boxplot

Definitions

Median: The median (middle quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts. Half the scores are greater than or equal to this value and half are less.
Inter-quartile range: The middle “box” represents the middle 50% of scores for the group. The range of scores from lower to upper quartile is referred to as the inter-quartile range. The middle 50% of scores fall within the inter-quartile range.
Upper quartile: Seventy-five percent of the scores fall below the upper quartile.
Lower quartile: Twenty-five percent of scores fall below the lower quartile.
Whiskers: The upper and lower whiskers represent scores outside the middle 50%. Whiskers often (but not always) stretch over a wider range of scores than the middle quartile groups.
Outliers: Points outside the whiskers, which is either greater or smaller than 1.5 Inter-quartile range

ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Booked, y = Distance)) + 
  geom_boxplot() +   
  labs(title="Patient distance to clinic By Booked",
       x ="Booked or Not", y = "Distance") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked"))

From the boxplot above, we can see that there are not much difference in distance for those who booked appointments and those who doesn’t.

Then we look at the same for age.

First we look at the distribution of patient age, we can see from the histogram that age follows a normal distribution with a mean in the high 50s.

ggplot(Patient_Demographic_Analysis_cleaned,aes(x = age)) + 
  geom_histogram() +   
  labs(title="Patient age Histogram",
       x ="Age", y = "Count") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  )

ggplot(Patient_Demographic_Analysis_cleaned,aes(x = Booked, y = age)) + 
  geom_boxplot() +   
  labs(title="Patient age By Booked",
       x ="Booked or Not", y = "Age") +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked"))

Similarly, we can see that elder people are more likely to book appointments.

The next step would be to look at all the categorical variables using stacked bar plot. We want to see for each factor such as gender, language, how the number of patients in each bucket different for booked and not booked.

p1 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = language),position="fill") +   
  labs(title="Count % of Language by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Language") +
  scale_y_continuous(labels = scales::percent)
  
p2 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = gender),position="fill") +   
  labs(title="Count % of Gender by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Gender") +
  scale_y_continuous(labels = scales::percent)

p3 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = Marketing_Source),position="fill") +   
  labs(title="Count % of Marketing Source by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Source") +
  scale_y_continuous(labels = scales::percent)


p4 <- ggplot(Patient_Demographic_Analysis_cleaned, aes(Booked)) + 
    geom_bar(aes(y = (..count..),fill = Nextgen_Financial_Class),position="fill") +   
  labs(title="Count % of Insurance by Booked",
       x ="Booked or Not", y = "%") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) + 
  scale_x_discrete(labels = c("Not Booked","Booked")) +
  scale_fill_discrete(name = "Insurance") + 
  scale_fill_brewer(palette="Spectral") +
  scale_y_continuous(labels = scales::percent)

grid.arrange(p1, p2, p3, p4, ncol=2)

From the barplot above, we can see that:

Spanish patients are more likely to book appointments as more % of Spanish speaking patients are in the booked group than not booked group
Although female is the majority of patient population, looks like more male are booking appointment than they are not
physician referral, referral friend/family and internet have seen an increase % in booked group
Most self pay patients are not booking appointments,more medicare patients are booking appointments. On the other hand, Medicare, Commercial have more booking patients then not booking patients

Model Training

Now that we have a better understanding of the data, we can start to build the model. The goal of the model is to use the training data to build a logistic regression model to predict whether a patient will book a appointment or not based on the demographic information given in the data.

set.seed(100)
trainingRows <- sample(1:nrow(Patient_Demographic_Analysis_cleaned), 0.7*nrow(Patient_Demographic_Analysis_cleaned))
training <- Patient_Demographic_Analysis_cleaned[trainingRows, ]
test <- Patient_Demographic_Analysis_cleaned[-trainingRows, ]
model_booked <- glm(Booked ~ age + language + gender + Marketing_Source + Distance + Nextgen_Financial_Class, 
                         data = training,family = "binomial") 
output_booked <- summary(model_booked)
output_booked

## 
## Call:
## glm(formula = Booked ~ age + language + gender + Marketing_Source + 
##     Distance + Nextgen_Financial_Class, family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4245   0.2275   0.3354   0.3742   2.5424  
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               2.800022   0.105171  26.624  < 2e-16
## age                                       0.007137   0.001287   5.544 2.95e-08
## languageOther                            -1.681126   0.071296 -23.579  < 2e-16
## languageSpanish                           1.156234   0.069273  16.691  < 2e-16
## genderM                                  -0.157622   0.037482  -4.205 2.61e-05
## Marketing_SourceOther                    -2.598264   0.065301 -39.789  < 2e-16
## Marketing_SourceReferral Friend / Family  0.703692   0.121381   5.797 6.74e-09
## Marketing_SourceReferral Physician       -0.659061   0.063347 -10.404  < 2e-16
## Marketing_SourceSignatureForum           -1.689350   0.075421 -22.399  < 2e-16
## Distance                                  0.009063   0.002129   4.256 2.08e-05
## Nextgen_Financial_ClassBCBS               0.137032   0.064916   2.111 0.034780
## Nextgen_Financial_ClassCigna              0.132267   0.097003   1.364 0.172714
## Nextgen_Financial_ClassCommercial         0.268118   0.074985   3.576 0.000349
## Nextgen_Financial_ClassMedicaid           0.770113   0.082679   9.314  < 2e-16
## Nextgen_Financial_ClassMedicare           0.089951   0.069658   1.291 0.196587
## Nextgen_Financial_ClassSelf Pay          -1.839504   0.071916 -25.579  < 2e-16
## Nextgen_Financial_ClassTricare            0.288482   0.135888   2.123 0.033759
## Nextgen_Financial_ClassUHC                0.160851   0.081702   1.969 0.048983
##                                             
## (Intercept)                              ***
## age                                      ***
## languageOther                            ***
## languageSpanish                          ***
## genderM                                  ***
## Marketing_SourceOther                    ***
## Marketing_SourceReferral Friend / Family ***
## Marketing_SourceReferral Physician       ***
## Marketing_SourceSignatureForum           ***
## Distance                                 ***
## Nextgen_Financial_ClassBCBS              *  
## Nextgen_Financial_ClassCigna                
## Nextgen_Financial_ClassCommercial        ***
## Nextgen_Financial_ClassMedicaid          ***
## Nextgen_Financial_ClassMedicare             
## Nextgen_Financial_ClassSelf Pay          ***
## Nextgen_Financial_ClassTricare           *  
## Nextgen_Financial_ClassUHC               *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 36489  on 56167  degrees of freedom
## Residual deviance: 29306  on 56150  degrees of freedom
## AIC: 29342
## 
## Number of Fisher Scoring iterations: 6

# stargazer(model_booked,type = "text")

From the model output, we can see that most of the variables are statistically significant as the p values are less than 0.01.

For numeric variables, we can see that:

age has a positive coefficient, which means the older the patient, the more likely they are going to book the appointment
distance also has a positive coefficient, which means the further the patient lives, the more likely they are going to book the appointment (debatable)

For categorical variables, we want to plot bar plot to see how each level compare to their base level

For gender, since there’s only two levels and the base level is Female and the coefficient for male is negative, we can conclude that male has a lower probability to book appointment compare to female

marketing_source <- as.data.frame(model_booked$coefficients[6:9]) %>%
  tibble::rownames_to_column("Marketing_Source") %>%
  rename(value = 2) %>%
  mutate(Marketing_Source = substr(Marketing_Source,17,nchar(Marketing_Source)))
# View(marketing_source)
p1 <- ggplot(marketing_source,aes(x = reorder(Marketing_Source,-value), y = value)) + 
  geom_bar(stat="identity",fill = "blue") + 
  labs(title="Marketing source coefficient importance",
       x ="Marketing Source", y = "") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) 
insurance <- as.data.frame(model_booked$coefficients[11:18]) %>%
  tibble::rownames_to_column("Insurance") %>%
  rename(value = 2) %>%
  mutate(Insurance = substr(Insurance,24,nchar(Insurance)))
# View(marketing_source)
p2 <- ggplot(insurance,aes(x = reorder(Insurance,-value), y = value)) + 
  geom_bar(stat="identity",fill = "red") + 
  labs(title="Insurance coefficient importance",
       x ="Insurance", y = "") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  ) 
grid.arrange(p1,p2,ncol = 2)

From the barplot above, we can see that

For marketing source, with Internet as baseline (value = 0), top 3 marketing sources are
- Referral Friend/Family
- Internet Website
- Referral Physician
For insurance, with Aetna as baseline (value = 0), top 3 insurances are
- Medicaid
- Tricare
- Commercial

Model testing and prediction

After we train the model, we want to see how the model perform using the test dataset.

pred_booked <- predict(model_booked, test,type="response")
pred_booked <- ifelse(pred_booked > 0.5, 1, 0)
actual_booked <- as.data.frame(test$Booked)
cm_booked <- cbind(pred_booked,actual_booked) %>%
  rename(predicted = 1,
         actual = 2) %>%
mutate(predicted = as.factor(predicted))
cm_booked <- conf_mat(cm_booked, actual, predicted)
# pred_booked[1:20]
# test$Booked[1:20]
# cm_booked <- as.data.frame(table(pred_booked, test$Booked)) %>%
#   rename(Predicted = 1,
#          Actual = 2)
tab_book <- table(pred_booked, test$Booked)
autoplot(cm_booked, type = "heatmap") +
  scale_fill_gradient(low="#D6EAF8",high = "#2E86C1") + 
  labs(title="Prediction vs Actual Confusion Matrix",
       x ="Actual", y = "Prediction") +
  theme(legend.position = "right",
        strip.text = element_text(face = "bold",size = 10),
        panel.background = none,
        legend.title = element_text(face = "bold",size = 9),
        legend.text = element_text(size = 9),
        legend.background = none,
        axis.title=element_text(face = "bold",size=9),
        axis.text.x = element_text(face = "bold",size = 9),
        axis.text.y = element_text(face = "bold",size = 9),
        plot.title = element_text(face = "bold",size=9,hjust=0.5),
        plot.caption = element_text(hjust = 0)
  )

## misclassification rate
print(paste0("Misclassification Rate: ",100 - round(sum(diag(tab_book))/sum(tab_book)*100),"%"))

## [1] "Misclassification Rate: 9%"

print(paste0("Misclassification Rate for not booked patients: ",100 - round(tab_book[1,1]/sum(tab_book[1,])*100),"%"))

## [1] "Misclassification Rate for not booked patients: 34%"

print(paste0("Misclassification Rate for booked patients: ",100 - round(tab_book[2,2]/sum(tab_book[2,])*100),"%"))

## [1] "Misclassification Rate for booked patients: 9%"

From the plot above, we can see that overall, the model has a 91% accuracy to classify patients into booked and not booked group. The model does a better job in classifying booked patients than not booked patients. For predicting not booked patients, the misclassification rate is 34%, for booked patients, it’s only 8%.

Conclusion

From the analysis above, we can see that patient demographics plays an important role in whether a new patient will book appointment or not. Older female are the main targets for potential booking patients. Additionally, patients from marketing source such as referral friends / family and Internet or from insurance such as Medicaid and Tricare are more likely to book appointments. This information can be used to better target potential customers or perform better outbound call efforts by reaching out to the correct group of patients.