This series of posts is a digest of the AAAI 2021 Tutorial: Explaining Machine Learning Predictions: State-of-the-art, Challenges, and Opportunities by Julius Adebayo (MIT), Hima Lakkaraju (Harvard University), and Sameer Singh (UC Irvine). Link: https://explainml-tutorial.github.io/aaai21
In the previous post, we looked at types of explanations and the motivations behind them. Here we'll introduce post hoc explanations.
What are post hoc explanations?
Models that are too complex for humans to interpret easily can be explained by passing them through explainers. These produce explanations in the form of linear models, shallow decision trees, or even visualizations that help stakeholders understand how the model works. Post hoc explanations approximate the behavior of a black box by extracting relationships between feature values and the predictions.
Local explanations give an interpretable description of the model's behavior in a local neighborhood. They explain individual predictions.
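Note. Several of the following methods are model-agnostic, meaning they do not need access to the internal structure of the model. As a result, these methods are not restricted to specific models or libraries. Gradient-based methods such as saliency maps are the exception, since they need the model's gradients.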
- Feature Importances. Identify the important dimensions of the input and present their relative importance. One method for feature importance is LIME: it builds a mini-dataset of perturbations of the input and records the effect each one has on the classification output. On this dataset, you train a sparse linear regression whose weights show which parts of the input contributed most to the classifier's output. In a visualization, the less important areas are greyed out. The fitted regressor is our explanation: it tells us which features the output is based on and, consequently, whether the output is based on the wrong features.
- Rule Based. Look at decision surfaces and find the sufficient conditions for the prediction to stay the same. As long as those conditions hold, the output will not change. Anchors, one example of a rule-based technique, gives us very clear rules on which the decision is based and, in turn, lets us understand what to do if we need the output to change.
- Saliency Maps. What parts of the input are most relevant for the model's prediction? Methods include: input gradient, the derivative of the logit score for a given class with respect to the input (this gradient is the same size as the input and can be visualized as a heatmap, but it can be visually noisy and difficult to interpret); SmoothGrad, which averages the input gradients of noisy copies of the example; integrated gradients, which compute a path integral from a baseline all the way to the input that you want to explain; and modified backprop approaches, which compute feature relevance by modifying back-propagation via positive aggregation.
- Prototype Approaches. Explain a model with synthetic or natural input examples. These approaches help to gain insight into the kind of input the model is most likely to misclassify, the input examples that are mislabeled, and the kind of input that activates an internal neuron. Example methods are: training point ranking via influence functions, which answers the question "given a new input, which training points have the most influence on its test loss?" (these approaches can be difficult to scale); and activation maximization, which identifies examples, synthetic or natural, that strongly activate a function (neuron) of interest.
- Counterfactuals. Capture which features need to be changed, and by how much, to flip a model's prediction, i.e., to reverse an unfavorable outcome. They help provide recourse for individuals who are affected by an algorithm's decision. Methods include: minimum distance counterfactuals, where the choice of distance metric dictates what kind of counterfactuals are chosen; feasible and least cost counterfactuals, where end users can input constraints in order to generate feasible counterfactuals; and causally feasible counterfactuals, which account for feature interactions when generating counterfactuals.
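To make the minimum distance counterfactual idea a little more concrete, here is a minimal sketch. It is not from the tutorial. It assumes a hypothetical linear classifier (`w`, `b`, and `predict` are made up for illustration) and uses Euclidean distance, in which case the closest point with a different label is found by projecting the input onto the decision boundary and stepping slightly past it. A different distance metric would give a different counterfactual.

```python
import numpy as np

def predict(x, w, b):
    """Hypothetical linear classifier: class 1 if w.x + b > 0, else class 0."""
    return int(np.dot(w, x) + b > 0)

def min_distance_counterfactual(x, w, b, margin=1e-3):
    """Closest point to x (in L2 distance) that the linear model labels differently.

    Assumes x does not already lie exactly on the decision boundary.
    """
    score = np.dot(w, x) + b
    # Project x onto the hyperplane w.x + b = 0, overshooting by a small
    # margin so that the sign of the score (and thus the label) flips.
    step = (score / np.dot(w, w)) * (1.0 + margin)
    return x - step * w

# Illustrative (made-up) model and input.
w = np.array([2.0, -1.0])
b = -1.0
x = np.array([0.0, 1.0])

x_cf = min_distance_counterfactual(x, w, b)
change = x_cf - x  # how much each feature has to move to flip the outcome

print("original prediction:      ", predict(x, w, b))
print("counterfactual prediction:", predict(x_cf, w, b))
print("required feature changes: ", change)
```

The `change` vector is the recourse: it says which features to move and by how much. For a real black-box model there is usually no closed form, and the counterfactual would have to be found by search or optimization, possibly under user-supplied feasibility constraints.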
Post hoc explanations approximate the behavior of a black box by extracting relationships between feature values and the predictions. Several local explanation methods are model-agnostic, meaning they do not need access to the internal structure of the model. Among the local methods are feature importances, rule-based explanations, saliency maps, prototype approaches, and counterfactuals.
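As an illustration of approximating a black box locally, here is a rough LIME-style sketch. It is not the `lime` library and not code from the tutorial. The `black_box` function, the Gaussian perturbations, the kernel width, and the sample count are all assumptions made for the example. For brevity it fits a plain weighted least-squares model rather than the sparse regression LIME uses, so treat it as the perturb, weight, and fit idea only.

```python
import numpy as np

def black_box(X):
    """Hypothetical black-box model: feature 0 matters most, feature 2 not at all."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 0.5 * X[:, 1])))

def lime_style_explanation(x, predict_fn, n_samples=500, scale=0.5,
                           kernel_width=1.0, seed=0):
    """Fit a local linear surrogate around x and return its feature weights."""
    rng = np.random.default_rng(seed)
    # 1. Build a mini-dataset of perturbations around the instance.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    y = predict_fn(Z)
    # 2. Weight each perturbation by its proximity to the instance.
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Weighted least squares with an intercept column
    #    (a stand-in for LIME's sparse linear regression).
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # drop the intercept, keep per-feature weights

x = np.zeros(3)  # the instance whose prediction we want to explain
importance = lime_style_explanation(x, black_box)

# Rank features by the magnitude of their local weight.
ranking = np.argsort(-np.abs(importance))
print("local feature weights:", importance)
print("features ranked by importance:", ranking)
```

The surrogate's weights are the explanation: a large weight on a feature that should be irrelevant would be a hint that the model's output is based on the wrong features.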
We will look at global explanations in the next part of this series.