In this project, I used the K-Nearest Neighbors (KNN) algorithm to predict listing prices. KNN is one of the most commonly used machine learning algorithms for house price prediction.
The first step is to select the features to use for the price prediction.
# Extracting the numeric columns in dataset
numeric_listing_df = listings_df._get_numeric_data()
numeric_listing_df.info()
# Dropping irrelevant columns
numeric_listing_df = numeric_listing_df.drop(['longitude', 'latitude', 'thumbnail_url', 'medium_url', 'xl_picture_url', 'host_acceptance_rate', 'neighbourhood_group_cleansed', 'square_feet'], axis=1)
We want to pick the features most relevant to the prediction, so we used a heatmap of the correlation coefficients between the candidate features and the target, the listing price.
Based on that heatmap, we picked the number of bathrooms, bedrooms, and beds, together with the number of guests a listing accommodates, as the modeling features.
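As a rough illustration of this selection step, the sketch below ranks features by the absolute value of their Pearson correlation with the price. The data is synthetic and the names `toy_df` and `corr_with_price` are placeholders of mine, not the project's actual variables.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric listing columns (not the real data).
rng = np.random.default_rng(0)
n = 500
bedrooms = rng.integers(1, 6, size=n)
toy_df = pd.DataFrame({
    "bedrooms": bedrooms,
    "beds": bedrooms + rng.integers(0, 2, size=n),
    "number_of_reviews": rng.integers(0, 300, size=n),
})
# Price depends on bedrooms plus noise; reviews are unrelated by construction.
toy_df["price"] = 80 * toy_df["bedrooms"] + rng.normal(0, 40, size=n)

# Absolute Pearson correlation of every feature with the target, strongest first.
corr_with_price = (
    toy_df.corr()["price"]
    .drop("price")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_price)
```

The full matrix returned by `toy_df.corr()` is what a correlation heatmap visualizes; the ranked column is just a compact way to read off the strongest candidates.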
# Test and Train set split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
# Import KNN model and begin with a trial value for K around 3
from sklearn.neighbors import KNeighborsRegressor
print(pred)
# Plotting the distribution plot
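The fitting step itself is not shown in the snippet above. A minimal sketch of how a KNN regressor with K = 3 could be fitted and used for prediction follows; the features and prices are synthetic, and the `random_state` values are my additions for reproducibility, so treat this as an assumption rather than the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic features standing in for bathrooms, bedrooms, beds, accommodates.
rng = np.random.default_rng(42)
x = rng.integers(1, 7, size=(200, 4)).astype(float)
# Hypothetical price: a weighted sum of the features plus noise.
y = x @ np.array([30.0, 50.0, 10.0, 15.0]) + rng.normal(0, 10, size=200)

# Same 60/40 split as in the text; random_state is added for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)

# Trial value of K = 3, as in the text.
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(x_train, y_train)
pred = knn.predict(x_test)
print(pred[:5])
```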
Although the prediction with K = 3 looked precise, I still wanted to find the optimal value of K.
# Calculate RMSE for finding the optimal value of K
from sklearn.metrics import mean_squared_error
from sklearn import neighbors
from math import sqrt
for k in mylist:
    print('RMSE test value for k= ', k, 'is:', error, '\nRMSE train value for k is = ', error1)
Minimum R squared test error= 188.09725302410044
Minimum R squared train error= 76.31311262752186
Optimal K value= 33
According to the result above, the optimal value of K is 33. Note that the reported errors are RMSE values, despite the "R squared" label in the printed output.
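The body of the search loop is not shown above. One way such a search could look is sketched below: fit a model for each candidate K, record the train and test RMSE, and keep the K with the lowest test RMSE. The data is synthetic, and the names `k_values`, `rmse_train`, `rmse_test`, and `best_k` are placeholders of mine.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (not the Airbnb listings).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(300, 2))
y = 20 * x[:, 0] + 5 * x[:, 1] + rng.normal(0, 15, size=300)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)

k_values = list(range(1, 41))
rmse_train, rmse_test = [], []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train, y_train)
    # RMSE is the square root of the mean squared error.
    rmse_train.append(sqrt(mean_squared_error(y_train, model.predict(x_train))))
    rmse_test.append(sqrt(mean_squared_error(y_test, model.predict(x_test))))

# Pick the K with the lowest RMSE on the held-out test set.
best_k = k_values[int(np.argmin(rmse_test))]
print("Optimal K value =", best_k, "with test RMSE", min(rmse_test))
```

The train RMSE is lowest at K = 1 because each training point is its own nearest neighbor, which is why the test RMSE, not the train RMSE, should drive the choice of K.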
First, let’s take a look at an overview of the listing prices:
Minimum price per listing is $0.
Maximum price per listing is $25,000.
Average price per listing is $213.
It is not surprising that listing prices span such a wide range. However, it is still interesting that the minimum price can be as low as $0, which does not seem reasonable to me. I am certainly curious about the Airbnb page of this listing.
The graph clearly shows that the majority of listing prices are concentrated between $0 and $100. Nearly 80% of reservation prices are below $300, which is very affordable for tourists (like me).
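Summary figures like these can be computed directly from the price column. The sketch below uses a handful of made-up prices (the `prices` series is hypothetical, not the real dataset) to show the minimum, maximum, mean, and the share of listings below $300.

```python
import pandas as pd

# Hypothetical prices in USD, for illustration only.
prices = pd.Series([0, 45, 60, 75, 90, 120, 150, 250, 400, 25000], dtype=float)

print("Minimum price per listing:", prices.min())
print("Maximum price per listing:", prices.max())
print("Average price per listing:", prices.mean())

# A boolean mask averaged over all rows gives the share of listings below $300.
share_below_300 = (prices < 300).mean()
print(f"Share below $300: {share_below_300:.0%}")
```

With a few extreme values in the data, the mean is pulled well above what a typical listing costs, so the share below a threshold is often the more informative number.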
To determine which parts of Los Angeles have the most expensive listings, we take a deeper look at how the listing price depends on the neighborhood.
Venice, Hollywood, Downtown, Long Beach, and Santa Monica are the top 5 areas with the most listings available.
Rolling Hills, Bel-Air, Malibu, Beverly Crest, and Hollywood Hills West are the top 5 areas with the most expensive listings. These are costly places in Los Angeles, so it is no wonder that reservation prices are high as well.
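Rankings like these can be produced with a count and a grouped mean. The following is a small sketch on made-up rows; the column name `neighbourhood_cleansed` and the frame `toy_listings` are assumptions of mine and may differ from the columns actually used.

```python
import pandas as pd

# Made-up listings; the column names are assumed, not taken from the project.
toy_listings = pd.DataFrame({
    "neighbourhood_cleansed": [
        "Venice", "Venice", "Venice", "Malibu", "Malibu", "Hollywood",
    ],
    "price": [150.0, 200.0, 250.0, 900.0, 1100.0, 120.0],
})

# Top areas by number of listings.
listing_counts = toy_listings["neighbourhood_cleansed"].value_counts().head(5)

# Top areas by average listing price.
mean_prices = (
    toy_listings.groupby("neighbourhood_cleansed")["price"]
    .mean()
    .sort_values(ascending=False)
    .head(5)
)
print(listing_counts)
print(mean_prices)
```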
To explore the listing situation further, I used folium maps to analyze how listings are spread across Los Angeles and how that spread relates to the neighborhoods.
# Import folium heatmap library
import folium
from folium import plugins
from folium.plugins import HeatMap
# Plot heatmap of listings
m = folium.Map(location=[34.05, -118.24], zoom_start=11)
heat_data = [[row['latitude'], row['longitude']] for index, row in
             listings_df[['latitude', 'longitude']].iterrows()]
hh = HeatMap(heat_data).add_to(m)
For a clearer view of the listing heatmap, please check out my GitHub blog.
To find the busiest times of the year, I decided to look at how listing prices changed during 2019. Intuitively, reservation prices rise when demand for listings is higher.
According to the diagram above, listing prices rise significantly in December, and there is also a smaller peak during the summer. This is probably due to the summer and winter holidays. Winter is therefore clearly one of the busiest times to visit Los Angeles.
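A price-over-time view like this can be built by averaging prices per month. The sketch below assumes a calendar-style table with a `date` column and a `price` column stored as currency strings; both the table `toy_calendar` and that format are assumptions of mine, and the rows are made up.

```python
import pandas as pd

# Made-up calendar rows; the real data source and format may differ.
toy_calendar = pd.DataFrame({
    "date": [
        "2019-06-15", "2019-07-04", "2019-07-20",
        "2019-10-10", "2019-12-24", "2019-12-31",
    ],
    "price": ["$100.00", "$150.00", "$170.00", "$90.00", "$1,200.00", "$300.00"],
})

toy_calendar["date"] = pd.to_datetime(toy_calendar["date"])
# Strip the currency symbol and thousands separators before converting.
toy_calendar["price"] = (
    toy_calendar["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Average price per calendar month (1 = January, 12 = December).
monthly_mean = toy_calendar.groupby(toy_calendar["date"].dt.month)["price"].mean()
print(monthly_mean)
```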
As I mentioned at the beginning, I want to find the features that matter most in determining listing prices. Looking at the correlations between price and the other features in the dataset shows which of them are most relevant to rental prices.
From the correlation heatmap, it is quite clear that the price is correlated with the basic attributes of the accommodation: the number of bathrooms, bedrooms, and beds.
In this project, I focused on three business problems and predicted reservation prices with the KNN machine learning algorithm, using the Los Angeles Airbnb dataset from September 2019.
Based on the graph of price dependence on the neighborhood, Venice, Hollywood, Downtown, Long Beach, and Santa Monica are the top 5 areas with the most listings available. Rolling Hills, Bel-Air, Malibu, Beverly Crest, and Hollywood Hills West are the top 5 areas with the most expensive listings.
The busiest times of the year for visiting Los Angeles are the summer and winter holidays. This is reflected in how listing prices change during the year.
The basic features of a listing, namely the number of bathrooms, bedrooms, and beds, are the features most correlated with the listing price.
If you are interested in how these results were coded, please check out my code on GitHub.