Explainable and Trustworthy AI in production

Machine learning systems are growing more complex over time. GPT-3, for instance, is a model with 175 billion parameters that requires a cluster of machines to run. Many of these systems are “black boxes” for regular users, and their internal functioning is understood only by experienced data scientists.

A machine learning system that is deployed to production and whose predictions may affect a person’s life has to be trustworthy and make its decision process transparent to users. Trustworthiness in machine learning covers many aspects: privacy, safety, robustness, fairness, explainability, transparency, value alignment, and social good.

This article focuses on deploying Explainability for machine learning systems in production.

The need for Explainable AI

Defined simply, Explainability is the extent to which the internal mechanics of an ML system can be explained in human terms. It is literally about explaining what is happening inside a model. Explainability is desirable for multiple reasons. In fact, by allowing users to verify the factors contributing to certain predictions, Explainability

Furthermore, the widespread use of pre-trained models (especially in Computer Vision and, lately, in Natural Language Processing) can introduce harmful bias into a model that is fine-tuned for a specific downstream task (e.g. image or text classification). This is because the data the original model was pre-trained on is not controlled or curated by the downstream user. For example, Word2Vec, a popular set of pre-trained word embeddings, is known to exhibit serious gender bias and, if used improperly, can lead to serious discrimination.

Explainability goes hand in hand with other ML monitoring techniques such as anomaly or drift detection (learn more about model monitoring). They complement each other: for example, when a model input is flagged as an outlier, explainability techniques can be used to assess the trustworthiness of the model prediction on that input.

Explainability techniques

The field of explainable AI is rich with approaches and techniques, and not all of them were created equal. Some are suited to specific kinds of models, while others apply to any model (from neural networks to tree-based models). The data modality and the prediction task (e.g. regression vs. classification) also influence the choice of Explainability technique.

To choose the right technique, it is also important to understand its heuristic nature and the assumptions (e.g. background values) it makes while producing an explanation. These techniques also differ in their output and in how they work, and some require heavier computation than others.

Each technique has its strengths and pitfalls, but one can combine multiple approaches to provide a holistic explanation that sheds light on both the impact of the training data (e.g. size or class imbalance) and relative feature importance. The latter attempts to discover which key features are needed to maintain the original prediction, and by how much they can be perturbed before the model changes its prediction.

Impact of the training data

Explanation techniques based on influence functions highlight which instances from the training set had the most impact on a specific prediction at inference time. An example of an influential instance is an outlier (see the following diagram).

Such techniques allow the user to check whether the most impactful training instances contain features that are relevant to the instance we are trying to explain in production.

A linear model with one feature. Trained once on the full data and once without the influential instance. Removing the influential instance changes the fitted slope (weight/coefficient) drastically - source.
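The experiment in the caption above can be reproduced in a few lines. The following is a minimal sketch, not an influence-function implementation: it uses brute-force leave-one-out retraining, which is the quantity influence functions are designed to approximate without retraining. The toy data and the function names are made up for illustration.

```python
def fit_slope_intercept(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_influence(xs, ys):
    """Absolute change in the fitted slope when each training point is removed."""
    full_slope, _ = fit_slope_intercept(xs, ys)
    influences = []
    for i in range(len(xs)):
        slope_i, _ = fit_slope_intercept(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        influences.append(abs(full_slope - slope_i))
    return influences


# Toy training set (made up): y is roughly 2 * x, plus one outlier at the end.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 0.0]

influences = leave_one_out_influence(xs, ys)
most_influential = max(range(len(xs)), key=influences.__getitem__)
print("most influential training index:", most_influential)
print("slope with / without it:",
      fit_slope_intercept(xs, ys)[0],
      fit_slope_intercept(xs[:most_influential] + xs[most_influential + 1:],
                          ys[:most_influential] + ys[most_influential + 1:])[0])
```

Retraining once per training point is only feasible for tiny models like this one, which is why approximations are used in practice.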

Feature importance

One way to assess feature importance is to use Anchor explanations, which look for the features that are key to the final model prediction for a given instance, regardless of the values of the other features.
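To make the "regardless of the values of the other features" idea concrete, here is a rough sketch that only scores a candidate anchor; it does not perform the search over candidates that a full Anchor algorithm would. The classifier `toy_model`, the feature layout (income, age, number of accounts), the background data, and the function `anchor_precision` are all hypothetical.

```python
import random


def toy_model(features):
    """Hypothetical black-box classifier: approves when income is high."""
    income, age, num_accounts = features
    return "approve" if income >= 50 else "reject"


def anchor_precision(predict, instance, anchored, background, n_samples=500, seed=0):
    """Fraction of perturbed samples that keep the original prediction when the
    anchored features are held fixed and all other features are resampled."""
    rng = random.Random(seed)
    original = predict(instance)
    kept = 0
    for _ in range(n_samples):
        donor = rng.choice(background)  # source of values for non-anchored features
        sample = [instance[i] if i in anchored else donor[i]
                  for i in range(len(instance))]
        if predict(sample) == original:
            kept += 1
    return kept / n_samples


# Made-up background data covering a range of incomes, ages, and account counts.
background = [[income, age, accounts]
              for income in range(10, 100, 10)
              for age in (25, 45, 65)
              for accounts in (1, 3)]
instance = [80, 35, 2]

precision_income = anchor_precision(toy_model, instance, {0}, background)
precision_age = anchor_precision(toy_model, instance, {1}, background)
print("precision when anchoring income:", precision_income)
print("precision when anchoring age:", precision_age)
```

In this toy setup, holding income fixed preserves the prediction in every sample, so income alone is a good anchor, whereas holding only age fixed is not.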

Another way is to use feature attribution techniques, which evaluate relative feature importance with respect to a model prediction. One example is perturbing the original instance to find the minimal change that alters the model prediction while still respecting the class-conditional data distribution.

Such techniques include:

Explainability in Production

As described earlier, not all Explainability techniques were created equal. Some require access to the model internals (e.g. Integrated Gradients requires access to the model gradients for a given input), hence the name white-box approaches. Others require nothing more than access to a prediction API, hence the name black-box approaches.
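To illustrate why a white-box method needs the model internals, here is a minimal sketch of Integrated Gradients on a tiny logistic regression whose gradient is written out by hand. The weights, the input, the all-zeros baseline, and the number of integration steps are assumptions chosen for illustration; a real deployment would obtain gradients from the ML framework instead.

```python
import math

WEIGHTS = [1.5, -2.0, 0.5]  # hypothetical model parameters
BIAS = 0.1


def model(x):
    """Logistic regression: the 'white box' whose internals we can read."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def model_gradient(x):
    """Analytic gradient of the model output with respect to the input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients:
    average the gradient along the straight path from baseline to x,
    then scale by (x - baseline) per feature."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_gradient(point)
        for i in range(len(x)):
            avg_grad[i] += grad[i] / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
print("attributions:", attributions)
# Completeness check: attributions should sum to model(x) - model(baseline).
print("sum:", sum(attributions), "vs", model(x) - model(baseline))
```

Note that `model_gradient` is exactly the piece a black-box explainer cannot call, since a prediction API only exposes the equivalent of `model`.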

The latter techniques are more convenient for production deployment as the model to explain is usually deployed in isolation as a service with a well-defined API (e.g. URL, request/response bodies).

To explain a model deployed in production, a black-box explainer repeatedly queries the model with slightly perturbed versions of the original input instance, building up an approximation of the model's inference behavior around that instance. How the queries are constructed depends on the input instance and its perturbed versions, as well as on the kind of explanation the black-box explainer outputs.

In a production environment, such a setup can be deployed by having two different endpoints:

For scale reasons, it is advisable that:

The following sequence diagram illustrates such deployment and interactions between the endpoints.


Notice how, in the explanation loop, the explainer perturbs the original input data xyz using the ? character (e.g. the modified version x?z) until the model predicts a different label, def, than the original one, abc.
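The explanation loop can be sketched in code as follows. This is only an illustration under stated assumptions: `predict` is a stand-in for the call to the deployed model's prediction endpoint (here a local toy function that returns abc whenever the input contains the character y), and `explain_by_masking` is a hypothetical explainer that masks positions exhaustively, smallest sets first.

```python
from itertools import combinations


def predict(text):
    """Stand-in for the model's prediction endpoint (hypothetical).
    In production this would be a request to the deployed model service."""
    return "abc" if "y" in text else "def"


def explain_by_masking(predict_fn, text, mask="?"):
    """Query the model with masked versions of `text`, trying the smallest
    sets of positions first, until the predicted label changes."""
    original_label = predict_fn(text)
    queries = 1
    for size in range(1, len(text) + 1):
        for positions in combinations(range(len(text)), size):
            perturbed = "".join(mask if i in positions else ch
                                for i, ch in enumerate(text))
            new_label = predict_fn(perturbed)
            queries += 1
            if new_label != original_label:
                return {
                    "original_label": original_label,
                    "new_label": new_label,
                    "perturbed": perturbed,
                    "masked_positions": positions,
                    "queries": queries,
                }
    return None  # no masking changed the prediction


result = explain_by_masking(predict, "xyz")
print(result)
```

The explainer only ever calls `predict_fn`, so it works with any model behind a prediction API. The exhaustive search grows combinatorially with the input length, so practical explainers typically sample perturbations rather than enumerate them.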