Machine learning models are great: they take input data for a classification or regression task, learn from it, and give a prediction when they encounter new data. Models range in complexity from a simple Linear Regression, which is used to estimate real values such as house prices, to a black box Neural Network, which is used to solve complex tasks such as natural language understanding.
In this post, I will discuss how to get the kind of interpretability a simple Linear Regression model offers from a complex tree-based model, such as a Random Forest or XGBoost classifier, by using SHAP (SHapley Additive exPlanations).
Random Forest/XGBoost Model Output
Many aspiring data scientists, myself included, have difficulty interpreting the results from complex models such as a Random Forest or XGBoost. These models provide something called feature importance, which measures how much a feature contributes to the model’s predictions. This feature importance can then be graphed into something similar to the plot below.
In the above example, it can be seen that ‘Updates’ is the most impactful feature, followed by ‘Facebook Shares’, then ‘Goal’. What is not known is how these features actually impact the model. For example, is having more Facebook shares beneficial to the output, or is it the other way around? That cannot be determined by just looking at the feature importance plot. This is the gap that SHAP was created to fill.
“SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions” (SHAP Documentation).
SHAP is an open source module that can be tied to machine learning classifiers to audit the results of such models. It does so by taking the model’s prediction, adding or changing a variable’s value, and seeing how that affects the prediction. It can then calculate which features are the most and least helpful to the model, and to what degree, either positive or negative.
Below I’ll give you a hands-on example of how SHAP can be used to improve a model’s interpretability.
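To make the idea of "adding or changing a variable’s value" more concrete, here is a minimal brute-force sketch of Shapley values for a single observation. Everything in it is an assumption made for illustration: `toy_model`, its coefficients, the observation `x`, and the single `baseline` row are made up and have nothing to do with the Kickstarter model. The SHAP library itself uses much faster, tree-specific algorithms and a background dataset rather than one baseline row, so treat this only as a picture of the underlying game theory.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one observation x of a small model.

    A feature that is "absent" from a coalition is set to its baseline value.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        # Features in the coalition keep their real value, the rest use the baseline.
        z = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(z)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Classic Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Marginal contribution of feature i when it joins the coalition.
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi


# Hypothetical toy model with made-up coefficients (NOT the Kickstarter model).
def toy_model(z):
    goal, shares, updates = z
    return 0.5 - 0.03 * goal + 0.002 * shares + 0.04 * updates + 0.001 * shares * updates


x = [10.0, 50.0, 5.0]         # observation to explain (hypothetical)
baseline = [20.0, 10.0, 2.0]  # reference observation (hypothetical)

phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)
prediction = toy_model(x)

for name, contribution in zip(["goal", "shares", "updates"], phi):
    print(f"{name}: {contribution:+.3f}")
print(f"base value + contributions = {base_value + sum(phi):.3f}, prediction = {prediction:.3f}")
```

The contributions are signed, and together with the base value they add up to the prediction. That additive, signed breakdown is what a plain feature importance plot cannot give you.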
Predicting a startup’s success using Tree Classifiers
The model that will be used in this example is one that I’ve made to accurately predict if a Kickstarter startup will be successful.
The dataset being analyzed in this example is called “Kickstarter dataset” and was taken from www.kaggle.com/tayoaki/kickstarter-dataset.
This is a balanced dataset which consists of 18,142 observations and 35 unique features.
To start, below are the libraries used:
import pandas as pd
import datetime as dt
import xgboost as xgb
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import shap
and this is what the dataset looks like:
It includes the following features: ‘Id’, ‘Name’, ‘Url’, ‘State’, ‘Currency’, ‘Top Category’, ‘Category’, ‘Creator’, ‘Location’, ‘Updates’, ‘Comments’, ‘Rewards’, ‘Goal’, ‘Pledged’, ‘Backers’, ‘Start’, ‘End’, ‘Duration in Days’, ‘Facebook Connected’, ‘Facebook Friends’, ‘Facebook Shares’, ‘Has Video’, ‘Latitude’, ‘Longitude’, ‘Start Timestamp (UTC)’, ‘End Timestamp (UTC)’, ‘Creator Bio’, ‘Creator Website’, ‘Creator — # Projects Created’, ‘Creator — # Projects Backed’, ‘# Videos’, ‘# Images’, ‘# Words (Description)’, ‘# Words (Risks and Challenges)’, ‘# FAQs’
Right off the bat, it can be seen that some columns will not make good predictors and some columns will need to be engineered to make the data useful.
Dropped features: ‘Id’, ‘Name’, ‘Url’, ‘Creator’, ‘Latitude’, ‘Longitude’, ‘Start Timestamp (UTC)’, ‘End Timestamp (UTC)’, ‘Creator Bio’, ‘Location’, ‘Currency’, ‘Creator Website’, ‘Pledged’, ‘Backers’, ‘Start’, ‘End’.
Engineered features: ‘Has_Creator_Website’, ‘Average_Pledge_per_Backer’, ‘Start_Month’, ‘End_Month’, ‘Was_missing_#_Projects_Backed’
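As a rough sketch of how engineered columns like these could be derived with pandas: the rows below are hypothetical, and the exact definitions (a website flag, pledged amount divided by backers, calendar months, a missing-value flag) are my assumptions based on the column names, not necessarily the formulas used in the notebook.

```python
import numpy as np
import pandas as pd

# Column name as listed in the dataset (it contains a long dash, written as \u2014).
backed_col = "Creator \u2014 # Projects Backed"

# Hypothetical rows that only mimic the Kickstarter columns named above.
df = pd.DataFrame({
    "Creator Website": ["http://example.com", None, "http://example.org"],
    "Pledged": [1200.0, 0.0, 560.0],
    "Backers": [24, 0, 7],
    "Start": ["2014-01-15", "2014-06-01", "2014-11-20"],
    "End": ["2014-02-14", "2014-07-01", "2014-12-20"],
    backed_col: [3.0, np.nan, 0.0],
})

# 1 if the creator listed a website, else 0 (assumed definition).
df["Has_Creator_Website"] = df["Creator Website"].notna().astype(int)

# Pledged amount per backer; campaigns with zero backers get 0 (assumed handling).
backers = df["Backers"].where(df["Backers"] > 0)
df["Average_Pledge_per_Backer"] = (df["Pledged"] / backers).fillna(0.0)

# Calendar month of the campaign start and end dates.
df["Start_Month"] = pd.to_datetime(df["Start"]).dt.month
df["End_Month"] = pd.to_datetime(df["End"]).dt.month

# Flag missing values before imputing them, so the model can still see the gap.
df["Was_missing_#_Projects_Backed"] = df[backed_col].isna().astype(int)

print(df[["Has_Creator_Website", "Average_Pledge_per_Backer",
          "Start_Month", "End_Month", "Was_missing_#_Projects_Backed"]])
```

Creating the missing-value flag before imputation keeps the information that a value was absent, which a tree model can then use as a signal of its own.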
After some cleaning, the data was ready for modeling. The model that will be shown here is a vanilla XGBoost model, which did surprisingly well in predicting if a startup will be successful or not. Below are the model’s results and feature importance plot:
As previously stated, it can be seen that ‘Updates’ is the most impactful feature, followed by ‘Facebook Shares’, then ‘Goal’. What is not known is how these features actually impact the model. Now let’s see what SHAP can tell us.
Setting up SHAP is very simple: the explainer only needs to be initialized by feeding it the fitted classifier, and it can then be used to get the SHAP values.
# CALLING SHAP TREE EXPLAINER
explainer = shap.TreeExplainer(final_clf)
# GETTING SHAP VALUES
shap_values = explainer.shap_values(X_train)
# INITIATING SHAP (loads the JavaScript used by SHAP's interactive notebook plots)
shap.initjs()
Now that we have the SHAP Values, we can start making some plots! Let’s begin by comparing the feature importance plots.
It can be seen that the two plots are different. SHAP determined that the model’s most important feature is ‘Goal’, followed by ‘Average_Pledge_per_Backer’ and ‘Facebook Shares’. Now let’s see how these features actually impact the model. The SHAP summary plot can further show the positive and negative relationships of the predictors with the target variable. Each dot in this plot is one observation from the training data. The plot conveys the following information:
- Feature importance: Variables are ranked in descending order of importance.
- Impact: The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.
- Original value: Color shows whether that variable is high (in red) or low (in blue) for that observation.
plt.title('FEATURE IMPORTANCE USING SHAP',fontsize = 15)
The following conclusions can be taken from the plot above.
Things that increase a Startup’s success rate:
- The more money an individual backer gives, the higher the chance of success. The number of individual backers isn’t as important.
- The more marketing and social media exposure/involvement (Number of Facebook shares and comments) the better.
- High number of projects created. An experienced entrepreneur has a higher chance than an amateur.
- The more rewards the better.
Things that decrease a Startup’s success rate:
- The higher the Goal, the lower the chance of a startup being successful.
- Long duration Campaigns.
- Campaigns that end close to the end of the year.
- Games and Fashion Categories.
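Conclusions like the ones above come from reading the plot by eye, but a similar reading can be roughed out numerically. In the sketch below, the ranking by mean absolute SHAP value is what SHAP’s bar-style importance plot shows. The direction check, which takes the sign of the correlation between a feature’s values and its SHAP values, is only a heuristic of mine and will miss non-monotonic effects. The feature values and SHAP values are hypothetical numbers chosen to mirror two of the bullets, not results from the real model.

```python
import numpy as np

feature_names = ["Goal", "Rewards"]  # names reused from the post

# Hypothetical feature values for five observations (NOT real data).
X = np.array([
    [1000.0, 12.0],
    [5000.0, 8.0],
    [20000.0, 5.0],
    [50000.0, 3.0],
    [100000.0, 2.0],
])

# Hypothetical SHAP values for the same five observations (NOT real results).
shap_values = np.array([
    [0.9, 0.3],
    [0.4, 0.1],
    [-0.2, -0.1],
    [-0.7, -0.2],
    [-1.1, -0.3],
])

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)

# Rough direction: sign of the correlation between feature value and SHAP value.
direction = [
    float(np.sign(np.corrcoef(X[:, j], shap_values[:, j])[0, 1]))
    for j in range(X.shape[1])
]

for name, imp, sign in zip(feature_names, importance, direction):
    effect = "raises" if sign > 0 else "lowers"
    print(f"{name}: mean |SHAP| = {imp:.2f}, a higher value {effect} the prediction")
```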
SHAP also offers another plot called Dependence Plot. Let’s look at an example below.
plt.title('GOAL DEPENDENCE PLOT',fontsize = 15)
After taking a closer look at the ‘Goal Dependence Plot’, it can be clearly seen how high goals negatively impact the model’s output. A good goal limit for a campaign is around $10,000.
There are many more plots, such as Waterfall, Partial Dependence, and Scatter plots, that I encourage you to look at. For more information on SHAP, visit its documentation website at https://shap.readthedocs.io/en/latest/index.html.
If you are interested in taking a deeper look at the example above, feel free to take a peek at my notebook at https://github.com/gabrieljarosi/dsc-mod-3-project-v2-1-onl01-dtsc-pt-041320