Filipe Sampaio Campos
5 min read · Jul 20, 2023
In this project, we conduct an exploratory data analysis and predictive modeling of real estate prices in California based on a real estate dataset. The goal is to understand the features that influence property prices and build regression models to predict prices based on those features.
This is a classic regression problem commonly used in introductions to machine learning and data science.
The dataset was obtained from a CSV file named housing.csv. It contains information about real estate properties, including attributes such as the total number of rooms and bedrooms per district, the median income of the area, proximity to the ocean, and other relevant factors.
We perform an exploratory data analysis to understand the distribution of variables and identify potential relationships between attributes and property prices. We utilize Pandas, NumPy, and Seaborn libraries to visualize the information and derive insights.
df.describe().T
The first thing I did was look at the overall statistics of my variables in the dataset, just to have a general overview.
Afterward, I divided it into training and testing sets and plotted a histogram for all the variables.
from sklearn.model_selection import train_test_split

X = df.drop(['median_house_value'], axis=1)
y = df['median_house_value']
display(X, y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
train_data = X_train.join(y_train)
display(train_data)
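The per-variable histograms mentioned above can be drawn directly from the DataFrame with pandas' built-in plotting. A minimal sketch on synthetic data (the column names are assumptions based on the dataset):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for train_data, just to illustrate the call.
rng = np.random.default_rng(0)
train_data = pd.DataFrame({
    'total_rooms': rng.lognormal(7, 1, 500),
    'population': rng.lognormal(7, 1, 500),
    'median_income': rng.normal(4, 1.5, 500),
})

# One histogram per numeric column, laid out on a shared figure.
axes = train_data.hist(figsize=(15, 8), bins=50)
```

Skewed, long-tailed shapes in these plots are what motivate the log transform applied next.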
As we can see, some variables have highly skewed distributions. To reduce the skew, I log-transformed them (adding 1 to avoid taking the log of zero).
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1)
train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] + 1)
train_data['population'] = np.log(train_data['population'] + 1)
train_data['households'] = np.log(train_data['households'] + 1)
I plotted a correlation matrix to analyze the features, and from here on, the entire project is focused on analysis based on the correlation matrix.
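A correlation matrix like the one described can be rendered as a Seaborn heatmap. A minimal sketch on synthetic data (in the project this would be called on train_data):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the numeric training data.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=['total_rooms', 'population', 'median_income', 'median_house_value'],
)

# Pairwise correlations between numeric columns, drawn as an annotated heatmap.
corr = data.corr(numeric_only=True)
plt.figure(figsize=(15, 8))
sns.heatmap(corr, annot=True, cmap='YlGnBu')
```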
In this first matrix, we can see strong correlations between the following columns: total_rooms, total_bedrooms, population, and households. Additionally, there is a correlation between our target variable, median_house_value, and the variable median_income.
After analyzing the correlation matrix, I hypothesized that the ocean_proximity column would be highly relevant for both model training and analysis, as houses near the ocean are typically more expensive than those farther away. I therefore one-hot encoded it into binary indicator columns for training and further analysis.
train_data = train_data.join(pd.get_dummies(train_data['ocean_proximity'])).drop(['ocean_proximity'], axis = 1)
display(train_data)
And another correlation map.
Now we can see that our target variable is correlated with proximity to the ocean: the closer a house is to the coast, the more expensive it is on average.
plt.figure(figsize = (15,8))
sns.scatterplot(x = 'longitude', y= 'latitude', data = train_data, hue = 'median_house_value', palette = 'coolwarm')
As we can see in this map of California, the points represent houses colored by their values, and we notice that the houses closer to the coast tend to be more expensive. This observation confirms the correlation we observed earlier between the target variable (median house value) and the proximity to the coast.
import folium
import geopandas as gpd

california_shapefile = 'tl_2019_06_cousub.shp'
california_data = gpd.read_file(california_shapefile)

california_map = folium.Map(location=[36.7783, -119.4179], zoom_start=6)
folium.GeoJson(data=california_data, name='geojson').add_to(california_map)
california_map.save('california_map.html')
california_map
It seems that the most expensive houses are located in San Francisco and Long Beach in Los Angeles. These areas are known for their desirable locations, amenities, and attractions, which often lead to higher property values. This finding aligns with the general trend of real estate prices being influenced by factors such as location, proximity to the coast, and the overall desirability of the neighborhood.
Before training the models, we preprocess the data to handle missing values and adjust the scales of variables. We also transform the categorical variable ‘ocean_proximity’ into binary variables so that it can be used in the regression models.
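In this dataset, the column with missing values is total_bedrooms; one common approach, sketched here on synthetic data (the values are made up for illustration), is to impute the column median:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with a few missing bedroom counts.
train_data = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0, 1274.0],
    'total_bedrooms': [129.0, np.nan, 190.0, np.nan],
})

# Fill missing bedroom counts with the column median.
median_bedrooms = train_data['total_bedrooms'].median()
train_data['total_bedrooms'] = train_data['total_bedrooms'].fillna(median_bedrooms)

print(train_data['total_bedrooms'].isna().sum())  # → 0
```

Median imputation is robust to the skew seen in the histograms earlier; whatever statistic is used, it should be computed on the training split only and reused for the test split.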
train_data['bedrooms_ratio'] = train_data['total_bedrooms'] / train_data['total_rooms']
train_data['household_rooms'] = train_data['total_rooms'] / train_data['households']
display(train_data)
We employ three different models to predict real estate prices:
- Linear Regression: A simple linear regression model that establishes a linear relationship between features and property prices.
- Random Forest: A machine learning model based on decision trees that creates multiple trees and combines their predictions to obtain a more accurate result.
- XGBoost: A gradient boosting machine learning algorithm known for its efficiency and accuracy in regression problems.
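All three models share scikit-learn's fit/predict interface, so training them looks essentially the same. A minimal sketch on synthetic data (variable names follow the post; the data here is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
model_RF = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# XGBoost is fit the same way, but needs the separate `xgboost` package:
# from xgboost import XGBRegressor
# model_XGB = XGBRegressor().fit(X_train, y_train)

print(model.score(X_test, y_test))  # R² on held-out data
```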
After training the models, we evaluate their performance using error metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and the coefficient of determination R². By comparing the results, we can determine which model best fits the data and generalizes to unseen examples.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

models = [model, model_RF, model_XGB]

for m in models:
    print(f'{m}:')
    y_train_pred = m.predict(X_train)
    y_test_pred = m.predict(X_test)

    mse_train = mean_squared_error(y_train, y_train_pred)
    mae_train = mean_absolute_error(y_train, y_train_pred)
    r2_train = r2_score(y_train, y_train_pred)

    mse_test = mean_squared_error(y_test, y_test_pred)
    mae_test = mean_absolute_error(y_test, y_test_pred)
    r2_test = r2_score(y_test, y_test_pred)

    print('Training MSE:', mse_train)
    print('Validation MSE:', mse_test)
    print('Training MAE:', mae_train)
    print('Validation MAE:', mae_test)
    print('Training R²:', r2_train)
    print('Validation R²:', r2_test)
    print()
print('Scores')
print('LinearRegression',model.score(X_test, y_test))
print('RandomForestRegressor',model_RF.score(X_test, y_test))
print('XGBRegressor',model_XGB.score(X_test, y_test))
Through this project, we were able to explore and analyze real estate data, perform appropriate data preprocessing, and build regression models to predict real estate prices in California. The knowledge gained can be applied in other areas and projects involving data analysis and modeling for value prediction.
I hope this project has been helpful in understanding the process of real estate data analysis and modeling and inspires you to conduct your own analyses in future projects. Thank you for reading, and until next time!