The data for this analysis comes from the Kaggle New York City Airbnb Open Data dataset, which describes listing activity and metrics in New York City for 2019. It includes information such as the location of each listing, its neighbourhood, room type, price, minimum nights required, customer reviews, and availability.
The purpose of this analysis is to perform exploratory data analysis and visualization to understand how different factors influence the demand for listings on Airbnb, and ultimately to use machine learning techniques to predict the availability of the listings.
In particular, the analysis addresses how location, price, room type, and minimum-stay requirements affect demand, and how well availability can be predicted from these factors.
We start the analysis by importing the necessary libraries and loading the data. The libraries used in this analysis are:
# Load necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn')
from wordcloud import WordCloud
%matplotlib inline
# set default plot size
plt.rcParams["figure.figsize"] = (15,8)
# Load and preview data
ab_nyc = pd.read_csv("/Users/leo/Personal/GitHub/Airbnb NYC 2019 Data/AB_NYC_2019.csv")
ab_nyc.head()
After loading the data, we use describe() to get summary statistics for the numeric columns. We can see that the data needs some cleaning: there are outliers in price and minimum_nights, and other columns such as number_of_reviews and calculated_host_listings_count are highly right-skewed, so we might need to transform them into categorical variables.
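As a quick supplementary check, not in the original notebook, pandas' skew() quantifies the right skew mentioned above; large positive values confirm it:
# skewness of the count-like columns; values well above 0 indicate a long right tail
ab_nyc[['number_of_reviews','calculated_host_listings_count','minimum_nights']].skew()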
First we need to drop a few columns from the dataset that are not useful.
# drop the id, name, host_id, and host_name columns
ab_nyc.drop(['id','name','host_id','host_name'],axis=1,inplace = True)
ab_nyc.describe()
Then we check each column for missing values. Most columns have no NAs; the exceptions are last_review and reviews_per_month, which are missing for listings that have no reviews.
# Check each column for NAs
ab_nyc.isnull().sum()
Then we want to remove the outliers in the price and minimum_nights columns. We calculate the z-score (the number of standard deviations from the mean) for both columns and remove all records with a z-score greater than 3.
For the highly skewed columns minimum_nights and calculated_host_listings_count, we transform the numeric values into categorical variables. Based on the summary statistics, the 25th percentile of minimum_nights is 1, the median is 3, and the 75th percentile is 5, so a reasonable categorization is one night, two nights, three nights, four nights, and five nights or more. We then check that each group has enough, and reasonably evenly distributed, values using groupby() and size(), as shown after the cleaning code below.
We do the same for calculated_host_listings_count, transforming the column into one listing, two listings, and more than two listings.
# data cleaning
# remove outliers in the price and minimum_nights columns
from scipy import stats
ab_nyc['z_price'] = np.abs(stats.zscore(ab_nyc['price']))
ab_nyc['z_min_nights'] = np.abs(stats.zscore(ab_nyc['minimum_nights']))
# remove records with a z-score greater than 3 (and implausibly low prices)
ab_nyc_final = ab_nyc[ab_nyc['z_price'] < 3]
ab_nyc_final = ab_nyc_final[ab_nyc_final['price'] > 3]
ab_nyc_final = ab_nyc_final[ab_nyc_final['z_min_nights'] < 3]
# convert numeric variables into categorical variables
ab_nyc_final['minimum_nights_group'] = 'Others'
ab_nyc_final.loc[ab_nyc_final['minimum_nights'] == 1, 'minimum_nights_group'] = 'one night'
ab_nyc_final.loc[ab_nyc_final['minimum_nights'] == 2, 'minimum_nights_group'] = 'two nights'
ab_nyc_final.loc[ab_nyc_final['minimum_nights'] == 3, 'minimum_nights_group'] = 'three nights'
ab_nyc_final.loc[ab_nyc_final['minimum_nights'] == 4, 'minimum_nights_group'] = 'four nights'
ab_nyc_final.loc[ab_nyc_final['minimum_nights'] > 4, 'minimum_nights_group'] = 'five nights or more'
ab_nyc_final['calculated_host_listings_count_group'] = 'Others'
ab_nyc_final.loc[ab_nyc_final['calculated_host_listings_count'] == 1, 'calculated_host_listings_count_group'] = 'one listing'
ab_nyc_final.loc[ab_nyc_final['calculated_host_listings_count'] == 2, 'calculated_host_listings_count_group'] = 'two listings'
ab_nyc_final.loc[ab_nyc_final['calculated_host_listings_count'] > 2, 'calculated_host_listings_count_group'] = 'more than two listings'
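As mentioned above, we verify that each new group is populated and reasonably balanced by printing the group sizes:
# check group sizes for the new categorical variables
print(ab_nyc_final.groupby('minimum_nights_group').size())
print(ab_nyc_final.groupby('calculated_host_listings_count_group').size())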
After removing the intermediate columns, we get the final dataset for exploratory data analysis and visualization.
# remove unused columns
ab_nyc_final.drop(['z_price','z_min_nights','minimum_nights','last_review','neighbourhood',
'calculated_host_listings_count','reviews_per_month'],
axis = 1,inplace = True)
ab_nyc_final.head()
ab_nyc_final.describe()
For data visualization, we start by computing the correlation matrix of the numeric variables with the corr() function, then plot it as a seaborn heatmap.
ab_nyc_cor = ab_nyc_final.drop(['latitude','longitude'],axis=1).corr()
ab_nyc_cor
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(ab_nyc_cor, dtype=bool))
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(15, 8))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(ab_nyc_cor, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5},annot = True)
From the plot, we can see that there is no strong correlation between the numeric variables; no correlation coefficient exceeds 0.7.
Then we plot all the listings on a map to see where most properties are located and how their prices differ.
For this task, we use plotly's density_mapbox, passing in latitude, longitude, and price.
From the plot, we can see that, as expected, most properties are in Manhattan south of Central Park, and in northern Brooklyn around the Williamsburg Bridge. These locations offer the most convenient transportation for tourists, since they are either in Manhattan or close to it. They also have the highest prices.
import plotly.express as px
lat = np.mean(ab_nyc_final['latitude'])
lon = np.mean(ab_nyc_final['longitude'])
fig = px.density_mapbox(ab_nyc_final, lat='latitude', lon='longitude', z='price', radius=2,
center=dict(lat = lat, lon = lon), zoom=10,
mapbox_style="carto-positron")
fig.show()
To further explore how location affects listings, the following code calculates the count, mean, and median of price, number_of_reviews, and availability_365 by neighbourhood group.
We can see that Manhattan has the most listings and the highest median price, followed by Brooklyn. However, Brooklyn has higher demand than Manhattan, as its mean and median availability_365 are the lowest.
ab_nyc_final.groupby(['neighbourhood_group'])[['price','number_of_reviews','availability_365']].agg(['count', 'mean','median'])
# ab_nyc_final.groupby(['neighbourhood_group']).agg({
# 'price': ['mean', 'count', 'median'],
# 'number_of_reviews': ['mean', 'count', 'median'],
# 'availability_365': ['mean', 'count', 'median']
# })
Then we use seaborn's boxplot to visualize how price and availability differ between groups of the categorical variables.
Boxplot of neighbourhood group and room type with price
# boxplot of neighbourhood group and price
# entire home/apt have the highest median price over other room types
# manhattan entire home/apt have the highest median price
sns.boxplot(x="neighbourhood_group", y="price",hue = "room_type",data=ab_nyc_final)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Boxplot of neighbourhood group and room type with availability
# manhattan and brooklyn private room have the lowest availability/highest demand overall
sns.boxplot(x="neighbourhood_group", y="availability_365",hue = "room_type",data=ab_nyc_final)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Boxplot of minimum nights group with availability
A four-night minimum is the most popular overall.
# four nights minimum are the most popular
sns.boxplot(x="minimum_nights_group", y="availability_365",data=ab_nyc_final,
order = ['one night','two nights','three nights','four nights','five nights or more'])
Boxplot of minimum nights group with price
A one-night minimum is the cheapest overall.
sns.boxplot(x="minimum_nights_group", y="price",data=ab_nyc_final,
order = ['one night','two nights','three nights','four nights','five nights or more'])
For the last visualization, we use seaborn's pairplot to explore the distributions of, and pairwise relationships between, all numeric variables. As we saw earlier, there is no strong linear correlation between any pair of variables; however, there may be some non-linear relationship between price and number_of_reviews.
sns.pairplot(ab_nyc_final.drop(['latitude','longitude'],axis=1))
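One quick way to probe that possible non-linear relationship, as an exploratory sketch not in the original notebook, is to plot number_of_reviews against price with a log-scaled price axis:
# scatter of number_of_reviews vs. price; log-scaling price spreads out the skewed values
fig, ax = plt.subplots()
ax.scatter(ab_nyc_final['price'], ab_nyc_final['number_of_reviews'], s=2, alpha=0.3)
ax.set_xscale('log')
ax.set_xlabel('price')
ax.set_ylabel('number_of_reviews')
plt.show()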
The ultimate goal of this analysis is to predict the availability of a listing based on its location, price, and other metrics. We will try different regression models to find which one best fits the data and makes the most accurate predictions.
The first regression method used is multiple linear regression.
Before fitting the model, we need to transform the categorical variables into binary indicator variables using one-hot encoding.
ab_nyc_model = ab_nyc_final.drop(['latitude','longitude'],axis = 1)
ab_nyc_model.head()
# Building the model
# first convert categorical variables to dummy variables using one hot encoding
categorical_var = ['neighbourhood_group','room_type','minimum_nights_group','calculated_host_listings_count_group']
# create dummy variables for all the other categorical variables
for variable in categorical_var:
    # create dummy variables for the given column
    dummies = pd.get_dummies(ab_nyc_model[variable], prefix=variable)
    # append the dummy columns and drop the original
    ab_nyc_model = pd.concat([ab_nyc_model, dummies], axis=1)
    ab_nyc_model.drop([variable], axis=1, inplace=True)
ab_nyc_model.head()
This will be our final dataset to fit the model.
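For reference, pandas can produce the same encoding in a single call; a compact alternative to the loop above (run on the un-encoded data):
# equivalent one-liner for the encoding loop
ab_nyc_model = pd.get_dummies(ab_nyc_model, columns=categorical_var)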
Then we split the dataset into training and testing sets with a 70/30 split. We'll use the regression modules in both statsmodels and sklearn.
x = ab_nyc_model.drop(['availability_365'], axis=1)
y = ab_nyc_model['availability_365'].astype(float)
# split train and test dataset
train_x, test_x, train_y, test_y = train_test_split(x,y , test_size=0.3, random_state=42)
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)
After fitting the model, we can see that the R-squared is 0.23, which is not very high, even though all of the variables are significant with p-values less than 0.05.
Looking at the coefficients, both price and number_of_reviews are positive, which means the higher these values are, the higher the availability, and hence the lower the demand for the listing. Manhattan and Brooklyn have negative coefficients, meaning properties in these two boroughs have higher demand. Additionally, entire home/apt is the most popular room type.
# training using statsmodels
linear_model_sm = sm.OLS(train_y,sm.tools.add_constant(train_x).astype(float))
results_sm = linear_model_sm.fit()
print(results_sm.summary())
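To read the fitted coefficients directly rather than scanning the summary table, the params attribute of the statsmodels results can be sorted:
# view fitted coefficients sorted by value
results_sm.params.sort_values()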
Then we fit the regression using sklearn; the result is similar, with a low model score of 0.23. This suggests that the data may not follow a linear relationship, since the predictions deviate substantially from the actual values. We may need to try other, non-linear regression techniques.
# using sklearn
linear_model_sk = LinearRegression()
linear_model_sk.fit(train_x, train_y)
linear_model_sk.score(test_x, test_y)
pred_y = linear_model_sk.predict(test_x)
df = pd.DataFrame({'Actual': test_y, 'Predicted': pred_y})
df.head(30)
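To quantify that deviation, we can compute standard error metrics on the test set using the metrics module imported earlier; a minimal sketch:
# error metrics for the linear model on the test set
print('MAE: ', metrics.mean_absolute_error(test_y, pred_y))
print('MSE: ', metrics.mean_squared_error(test_y, pred_y))
print('RMSE:', np.sqrt(metrics.mean_squared_error(test_y, pred_y)))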
Random forest is a popular machine learning technique that can be used not only for classification but also for regression. What makes it useful here is that it can fit non-linear relationships.
We first create an instance of the model with arbitrarily chosen parameters; these parameters will later be tuned to improve the results.
# random forest regressor for non-linear regression
rf_regressor = RandomForestRegressor(n_estimators=100,random_state=0)
rf_regressor.fit(train_x,train_y)
The random forest has a training score of 0.77, which is significantly higher than linear regression.
rf_regressor.score(train_x,train_y)
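The training score tends to be optimistic; as a sanity check not in the original notebook, the same model can be scored on the held-out test set:
# R^2 on the held-out test set, for comparison with the training score
rf_regressor.score(test_x, test_y)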
To see which features have the most influence, the bar plot below ranks the features by importance, from the most significant to the least.
We can see that price and number_of_reviews have the most influence on the availability of a listing.
feature_importance = pd.Series(rf_regressor.feature_importances_,index=x.columns)
feature_importance = feature_importance.sort_values()
feature_importance.plot(kind='barh')
# parameter tuning
# tune the number of trees (n_estimators)
results_rf = []
n_estimator_options = [30,50,100,200,500,1000,2000]
for trees in n_estimator_options:
    model = RandomForestRegressor(trees, oob_score=True, n_jobs=-1, random_state=42)
    model.fit(train_x, train_y)
    print(trees, " trees")
    score = model.score(train_x, train_y)
    print(score)
    results_rf.append(score)
    print("")
pd.Series(results_rf,n_estimator_options).plot()
# use 500 trees
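Since oob_score=True is set, the out-of-bag score offers a less biased alternative to the training score when comparing tree counts; a minimal variant of the loop above:
# compare tree counts by out-of-bag R^2 instead of the training score
for trees in n_estimator_options:
    model = RandomForestRegressor(trees, oob_score=True, n_jobs=-1, random_state=42)
    model.fit(train_x, train_y)
    print(trees, 'trees, OOB score:', model.oob_score_)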
Next we tune the maximum number of features considered at each split; based on the results below, we will choose the 'auto' option.
# max_features parameter tuning
results_rf = []
max_features_options = ['auto',None,'sqrt','log2',0.9,0.2]
for max_features in max_features_options:
    model = RandomForestRegressor(n_estimators=500, oob_score=True, n_jobs=-1,
                                  random_state=42, max_features=max_features)
    model.fit(train_x, train_y)
    print(max_features, " option")
    score = model.score(train_x, train_y)
    print(score)
    results_rf.append(score)
    print("")
pd.Series(results_rf,max_features_options).plot(kind='barh')
# use auto option
Based on the parameter tuning, we will use n_estimators=500 and max_features='auto' to fit the random forest regressor. The final model has a training score of 0.78.
# final model using the tuned parameters
rf_regressor = RandomForestRegressor(n_estimators=500,oob_score=True,n_jobs=-1,
random_state=42,max_features='auto')
rf_regressor.fit(train_x,train_y)
rf_regressor.score(train_x,train_y)
The following output shows the actual and predicted values side by side.
pred_y = rf_regressor.predict(test_x)
df = pd.DataFrame({'Actual': test_y, 'Predicted': pred_y})
df.head(30)
fig, ax = plt.subplots()
ax.scatter(test_y, pred_y)
ax.plot([test_y.min(), test_y.max()], [test_y.min(), test_y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
Through this analysis, we have a better idea of the key factors that influence the demand for an Airbnb listing. Tourists prefer locations close to downtown, lower prices, and entire homes/apartments, which offer more privacy while touring the city. Airbnb hosts can take all of these factors into consideration when posting their properties online.