Sunday, January 29, 2023

Persist [more] securely with `skops` and avoid `pickle`

In Python, the most common way to persist objects is to use the `pickle` module, and the format has been used by the machine learning community to store, share, and load trained machine learning models for many years. You can imagine one team building models, and another team handling the infrastructure and the devops/mlops of deploying the model to put it in production.

However, loading pickle files poses serious security risks and can lead to arbitrary code execution [1, 2]. Despite that, using pickle files has worked well for many years for two reasons:

  • There wasn’t much of a need to load pickle files coming from untrusted sources
  • The issue was less known and hence not so commonly exploited
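To make the arbitrary code execution risk concrete, here is a minimal and deliberately harmless sketch using only the standard library. The `Payload` class is a hypothetical stand-in for a malicious object; a real attack would swap `eval("6 * 7")` for something like a shell command.

```python
import pickle


class Payload:
    """Harmless stand-in for a malicious object (hypothetical name)."""

    def __reduce__(self):
        # Instead of the object's state, pickle stores "call eval('6 * 7')".
        # An attacker could return any importable callable and arguments here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading never rebuilds a Payload: it calls eval with the stored argument
# and hands back whatever that call returns.
result = pickle.loads(blob)
print(type(result).__name__, result)  # int 42
```

The point is that the call happens inside `pickle.loads` itself, before the caller has any chance to inspect what was in the file.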

However, as machine learning has advanced and become mainstream across a wide variety of use cases, sharing persisted models has become far more common. The vulnerabilities have also become better known and are now exploited more frequently.

There have been several efforts in the community to create alternative storage formats to store machine learning models. One of those efforts is happening in the `skops` library, which can be used to store scikit-learn, XGBoost, LightGBM, and CatBoost models. Since version 0.3, `skops` allows users to store and load trained machine learning estimators without using `pickle` [3].

Users who are familiar with the topic might wonder how `skops` compares to `cloudpickle`, another library that allows users to store Python objects. The important distinction is that `cloudpickle` extends `pickle` and allows users to store objects which cannot be stored with vanilla `pickle`. `skops`, on the contrary, imposes limits on what can be stored and loaded with the format. This has allowed us to implement security measures which we couldn’t have implemented using the `pickle` format. Also note that `joblib` builds on `pickle` and therefore has the same security characteristics, but it is faster in certain cases and is commonly used in multiprocess / distributed settings.

When using `skops`, the library knows of a limited set of types and functions. Users have to explicitly allow any other types and functions to be loaded; otherwise the load function fails. If a certain file is coming from a trusted source, it can be loaded by explicitly setting a flag.

Now let’s have a look at what the API for saving and loading models looks like.

Developers can simply replace calls to `dump`/`dumps`/`load`/`loads` from the `pickle` module with the ones available in the `skops.io` module, with the exception that they need to state what should be trusted when loading a file or an object.

If the source of a file is not trusted, one can first get the list of unknown types it contains, without loading the object itself, using `get_untrusted_types` from `skops.io`.

After reviewing it, this list of unknown types can then be passed to `load` or `loads` through the `trusted` argument.

For users who want to quickly check whether they can convert their existing pickle files, we have created a web application in the form of a Space on the Hugging Face Hub, available at: https://huggingface.co/spaces/adrin/pickle-to-skops

The application uses Gradio [4]. From the link above, users can also access its source code and run it locally or on their own infrastructure if they wish.

We hope this helps users share and accept files with more peace of mind, and we are more than happy to hear your thoughts, issues, and feature requests on our issue tracker [5].

[1] https://peps.python.org/pep-0307/#security-issues

[2] https://github.com/moreati/pickle-fuzz

[3] https://skops.readthedocs.io/en/stable/persistence.html

[4] https://www.gradio.app/

[5] https://github.com/skops-dev/skops/issues

 
