Master Explainable AI with SHAP: Solving Kaggle's House Prices Dataset

Welcome to our blog! Today, we'll take a deep dive into solving Kaggle's House Prices - Advanced Regression Techniques dataset. Our focus will be on applying Explainable AI concepts using SHAP (SHapley Additive exPlanations) values. By the end of this tutorial, you'll understand how to interpret model predictions and make them more transparent and understandable.

What are SHAP Values?

SHAP values are a game-changing concept in Explainable AI, derived from Shapley values in cooperative game theory. Introduced by Lloyd Shapley in 1953, Shapley values provide a fair way to distribute gains (or costs) among the players in a coalition game. Applied to machine learning, each feature in the dataset is treated as a 'player', and the 'gain' to be distributed is the difference between the model's prediction for a given instance and the average prediction. The SHAP value of each feature is its contribution to that difference.
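
Formally, the Shapley value of feature i is a weighted average of its marginal contributions over all subsets of the remaining features:

\[
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left( v(S \cup \{i\}) - v(S) \right)
\]

where N is the full set of features and v(S) is the model's expected prediction when only the features in S are known. Computing this exactly is exponential in the number of features, which is why the SHAP library ships fast model-specific approximations such as its tree explainer.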

Key Properties of SHAP Values

  • Efficiency: The sum of the SHAP values for all features equals the difference between the actual prediction and the baseline prediction (the model's average prediction over the background data).
  • Symmetry: Identical contributions from two features result in identical SHAP values.
  • Dummy: A feature with no effect on the prediction has a SHAP value of zero.
  • Additivity: For combined models, SHAP values for each feature are the sum of their SHAP values in the individual models.

Implementing SHAP Values

Let's move on to the practical implementation of SHAP values using the Kaggle House Prices dataset and an XGBoost model.

Step 1: Import Libraries and Load Data

First, we import the necessary libraries and load the dataset.

# Import the necessary libraries
import pandas as pd
import numpy as np
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the dataset
train_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
# Separate features and target variable; drop 'Id' since it is just a row identifier
X = train_df.drop(['Id', 'SalePrice'], axis=1)
y = train_df['SalePrice']

Step 2: Encode Categorical Features

Since XGBoost requires numerical input, we need to encode the categorical features. Label encoding maps each category to an arbitrary integer, which is acceptable for tree-based models like XGBoost because they split on thresholds rather than assuming a linear scale.

# Encode categorical features
# Cast to str first so missing values (NaN) become their own category
# instead of raising a mixed-type error inside LabelEncoder
for column in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[column] = le.fit_transform(X[column].astype(str))
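
As a side note, recent XGBoost releases (1.6 or later) can handle categorical columns natively, letting you skip manual encoding. Below is a minimal sketch of that alternative, assuming the 'hist' tree method; the names X_native and model_native are illustrative only, and we stick with label encoding in this tutorial so the SHAP explainer sees purely numeric input.

# Alternative: native categorical support (assumes XGBoost >= 1.6)
X_native = train_df.drop(['Id', 'SalePrice'], axis=1)
for column in X_native.select_dtypes(include=['object']).columns:
    X_native[column] = X_native[column].astype('category')
model_native = xgb.XGBRegressor(tree_method='hist', enable_categorical=True)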

Step 3: Split the Data and Train the Model

We then split the data into training and testing sets and train the XGBoost model.

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = xgb.XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
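
Before explaining the model, it's worth a quick sanity check that it actually fits the held-out data. Here's a minimal evaluation sketch; the exact score will vary with library versions and the random split.

# Evaluate the model on the held-out set
from sklearn.metrics import mean_squared_error

preds = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f'Test RMSE: {rmse:,.0f}')  # in the same units as SalePrice (dollars)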

Step 4: Explain Predictions with SHAP Values

Next, we use SHAP to explain the model's predictions.

# Initialize the SHAP explainer
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)
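
This is also a good place to see the efficiency property from earlier in action: for any row, the baseline plus the sum of the SHAP values should reconstruct the model's prediction. A minimal check (agreement is up to floating-point error):

# Efficiency check: baseline + sum of SHAP values == model prediction
pred = model.predict(X_test.iloc[[0]])[0]
reconstructed = shap_values.base_values[0] + shap_values.values[0].sum()
print(pred, reconstructed)  # the two numbers should agree closely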

Visualizing SHAP Values

Summary Plot

A SHAP summary plot shows the distribution of SHAP values for each feature across all samples.

# Create a summary plot
shap.summary_plot(shap_values, X_test)

The summary plot provides a comprehensive view of feature importance. Features are ranked top to bottom on the y-axis, with higher-ranked features contributing more to the model's predictions. The color coding reflects feature values: red for high and blue for low.

For example, in our plot, 'OverallQual' and 'GrLivArea' are the most important features, with higher values pushing the predicted house prices higher.
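
If you prefer a single global importance number per feature instead of the full distribution, SHAP's built-in bar plot shows the mean absolute SHAP value per feature, computed from the same Explanation object:

# Global feature importance: mean |SHAP value| across the test set
shap.plots.bar(shap_values)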

Waterfall Plot

A waterfall plot provides a detailed breakdown of a single prediction.

# Choose an instance to explain
instance_index = 0

# Create a waterfall plot
shap.plots.waterfall(shap_values[instance_index])

This plot starts at the baseline prediction and shows how each feature's SHAP value pushes it toward the final prediction. For the instance we chose, 'GrLivArea' and 'OverallQual' significantly decrease the predicted price, whereas 'OverallCond' and 'YearRemodAdd' increase it.
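
To dig deeper into how one feature drives predictions across the whole test set, a dependence scatter plot pairs each feature value with its SHAP value. Here's a short sketch for 'GrLivArea', one of the top features in our summary plot; passing the full Explanation as the color argument lets SHAP pick the strongest interacting feature automatically.

# Dependence plot for GrLivArea, colored by the strongest interacting feature
shap.plots.scatter(shap_values[:, 'GrLivArea'], color=shap_values)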

Conclusion

In this tutorial, we covered the theory behind SHAP values and demonstrated their application in explaining model predictions. By using SHAP values, we can make machine learning models more transparent, trustworthy, and interpretable.

Stay tuned for more tutorials and insights into the world of AI and Data Science. See you next time!

Alister George Luiz

Data Scientist
Dubai, UAE