Challenges of monitoring ML models in production

[Figure: monitoring dashboard]

To ensure service continuity and a minimum SLA (Service Level Agreement), traditional applications are deployed along with a monitoring system. Such a system logs metrics like request frequency, latency, and server load, and takes actions such as raising alerts when the service is interrupted.

Similarly, as part of an MLOps paradigm, Machine Learning deployments need to be monitored to keep track of the model's health and to take action when performance degrades. We should not lose sight of the fact that a trained model's performance metrics are computed on offline datasets, which do not guarantee its performance once it goes live. Unfortunately, the task of monitoring models is very challenging, as there is a lack of tools, systems, and even a common understanding among the MLOps community of what an ML monitoring system should look like.

However, some tools that ML practitioners use during training can also be used during deployment, for instance model performance metrics and model explainability techniques. This is not enough, though, as another dimension that needs to be monitored is the data itself that the model receives to generate predictions. Take as an example a model which was trained on cat pictures but during deployment suddenly starts receiving dog pictures (a problem known as data drift). Furthermore, the presence of outliers in the new data can significantly degrade the deployed model's performance.

The following sections discuss key challenges of monitoring Machine Learning models in production.

Performance metrics

Labelling the data can be challenging since it is usually a manual task that requires domain knowledge (e.g. labelling medical images) and is therefore time-consuming and expensive. But even when labels can be made available (e.g. in a time series forecasting task, where the true value is eventually observed), it is still challenging to use them to calculate the model's performance:
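
As a simple illustration of the case where labels arrive with a delay (such as forecasting), here is a minimal sketch, not tied to any particular monitoring tool, of a rolling error metric that is updated whenever the true value for a past prediction becomes available; the class name and window size are illustrative choices:

```python
# A minimal sketch (not tied to any particular monitoring tool): whenever the
# delayed label for a past prediction becomes available, log the error and
# report a rolling mean absolute error over the most recent predictions.
from collections import deque

import numpy as np


class RollingMAE:
    """Tracks the mean absolute error over the last `window` labelled predictions."""

    def __init__(self, window: int = 500):
        self.errors = deque(maxlen=window)

    def update(self, y_pred: float, y_true: float) -> float:
        self.errors.append(abs(y_pred - y_true))
        return float(np.mean(self.errors))


# Usage: call update() as soon as the actual value of a forecasted quantity is observed.
monitor = RollingMAE(window=100)
print(f"rolling MAE: {monitor.update(y_pred=10.2, y_true=9.7):.3f}")
```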

Proxy metrics

Unfortunately, it is not always possible to monitor the model's performance in a live environment, as this requires access to labels, which can be impractical due to their operational or financial cost. In this case, the statistical characteristics of the model's input and output data can be monitored instead, as a proxy for the model's performance.
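
A minimal sketch of this idea, assuming the per-feature means and standard deviations of the training data were saved as a baseline; the threshold and the statistics chosen here are illustrative, not prescriptive:

```python
# A minimal sketch, assuming the per-feature means and standard deviations of
# the training data were saved as a baseline: flag features whose live mean
# has moved more than `threshold` baseline standard deviations away from it.
import numpy as np


def check_input_stats(live_batch: np.ndarray,
                      baseline_mean: np.ndarray,
                      baseline_std: np.ndarray,
                      threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask of features whose live mean deviates too far."""
    live_mean = live_batch.mean(axis=0)
    z_scores = np.abs(live_mean - baseline_mean) / (baseline_std + 1e-12)
    return z_scores > threshold


# Example: the last two of five features have shifted upwards in the live batch.
rng = np.random.default_rng(0)
baseline_mean, baseline_std = np.zeros(5), np.ones(5)
live_batch = rng.normal(loc=[0, 0, 0, 4, 4], scale=1.0, size=(256, 5))
print(check_input_stats(live_batch, baseline_mean, baseline_std))
```

The same approach applies to the model's outputs, for example tracking the distribution of predicted classes or scores over time.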

Outlier values

Poor generalization is a well-known problem that causes ML models to perform badly on unseen data. Outliers in the input data are a serious problem and should be flagged as anomalies. Choosing the right outlier detector for a specific application depends on:

Furthermore, the choice of outlier detector has implications for how it will be deployed:
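
As one illustration, here is a minimal sketch that uses scikit-learn's IsolationForest as the outlier detector, fitted offline on the training data and used to score each live instance before it reaches the model; the detector choice and contamination rate are assumptions for the example:

```python
# A minimal sketch using scikit-learn's IsolationForest as the outlier detector;
# the detector is fitted offline on the training data and scores each live
# instance before it reaches the model. The contamination rate is illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))            # data the model was trained on
detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)

x_live = rng.normal(loc=8.0, size=(1, 4))       # an instance far from the training data
if detector.predict(x_live)[0] == -1:           # -1 means "anomaly"
    print("instance flagged as an outlier; the model's prediction may be unreliable")
```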

Distribution shift

In contrast to outliers, which usually refer to individual instances, data drift (or shift) detection uses statistical hypothesis tests to detect whether two samples are drawn from the same underlying distribution. In our case, the drift detector tries to identify when the distribution of the input data to the deployed model starts to diverge from that of the training data, making the model's predictions unreliable. One useful application of this is to help decide when the model in production needs to be retrained.
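
A minimal sketch of such a test for tabular data, using a per-feature Kolmogorov-Smirnov two-sample test (via scipy) between a reference sample drawn from the training data and a window of live data; the significance level and the Bonferroni correction across features are illustrative choices:

```python
# A minimal sketch of a two-sample drift test for tabular data: run a
# Kolmogorov-Smirnov test per feature between a reference (training) sample and
# a window of live data, with a Bonferroni correction across features.
import numpy as np
from scipy.stats import ks_2samp


def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> bool:
    n_features = reference.shape[1]
    corrected_alpha = alpha / n_features  # Bonferroni correction across features
    p_values = [ks_2samp(reference[:, i], live[:, i]).pvalue for i in range(n_features)]
    return any(p < corrected_alpha for p in p_values)


rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 3))
live = rng.normal(loc=[0.0, 0.0, 1.5], size=(500, 3))  # the third feature has shifted
print("drift detected:", detect_drift(reference, live))
```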

Drift detectors can be classified into one of the following categories:

In cases where the data is high-dimensional (e.g. images), one could apply a dimensionality reduction step before running the hypothesis test.
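
A minimal sketch of that reduction step, assuming image-like inputs flattened to vectors: fit a PCA projection on the reference (training) sample, project both samples, and run the same kind of per-component two-sample test on the low-dimensional representation; the number of components is an illustrative choice:

```python
# A minimal sketch of the dimensionality reduction step, assuming image-like
# inputs flattened to vectors: fit PCA on the reference (training) sample,
# project both samples, and run a per-component KS test on the reduced data.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 32 * 32))         # e.g. flattened 32x32 images
live = rng.normal(loc=0.3, size=(500, 32 * 32))      # globally shifted live images

pca = PCA(n_components=10).fit(reference)            # reduction fitted on the reference only
ref_low, live_low = pca.transform(reference), pca.transform(live)

alpha = 0.05 / ref_low.shape[1]                      # Bonferroni-corrected significance level
p_values = [ks_2samp(ref_low[:, i], live_low[:, i]).pvalue for i in range(ref_low.shape[1])]
print("drift detected after reduction:", any(p < alpha for p in p_values))
```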