MLOps

For many researchers, work on an ML model ends with handing over a Jupyter notebook. In the best case it has been cleaned up and can be turned directly into a script; in the worst case it is cluttered with exploratory analysis and its cells must be run in a specific order. But who guarantees that such code will work? That it performs better than the previous version? How will it be deployed? How do we catch a model error such as data drift (for example, a model trained on young respondents at Masaryk University that suddenly fails on seniors)? And if such an error is detected, how do we roll back to the old version?

Deploying a new model into production involves several steps, especially if you want to follow industry standards. A proper implementation should answer the questions raised in the previous paragraph.

Model Handoff Phase

The model code is handed over in a minimal form, ideally without commented-out or dead sections. The dependencies, typically Python packages in the ML world, are kept minimal, free of redundancies, and pinned to specific versions. The code is loaded and an evaluation script is run over validation data; the agreed-upon metric should match or exceed the previous numbers. Accuracy is not always the only metric: sometimes we deliberately trade a little of it for higher speed or lower running costs.
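The acceptance check described above can be sketched as a small gate function. The name `passes_handoff` and the score values are my own illustrative assumptions, not part of any specific framework:

```python
def passes_handoff(new_score: float, baseline_score: float,
                   tolerance: float = 0.0) -> bool:
    """A new model is accepted only if its validation metric matches
    or beats the baseline, within an optional tolerance for cases
    where we deliberately trade accuracy for speed or running costs."""
    return new_score >= baseline_score - tolerance


# Hypothetical scores obtained by running the evaluation script
# on the shared validation set:
baseline = 0.91   # metric of the model currently in production
candidate = 0.93  # metric of the newly handed-over model

accepted = passes_handoff(candidate, baseline)            # metric improved
rejected = passes_handoff(0.85, baseline)                 # metric dropped
trade_off = passes_handoff(0.90, baseline, tolerance=0.02)  # agreed trade-off
```

In practice this gate would sit in a CI job that refuses to merge or deploy the handed-over model when the check fails.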

Accuracy itself can be calculated in various ways, but for simplicity I will refer the reader to other sources.

Deployment

One of the big problems in the ML field is the reproducibility of builds. A major weakness of neural networks is their sensitivity to minor input changes, where chaos theory runs wild. I have seen neural networks change their outputs simply because the minor version of a supporting mathematical library, NumPy, changed. In practice, code with loosely specified dependencies starts producing unrecognizable outputs a year or two after its creation. Pinning packages to specific versions and building a so-called container with Docker help prevent this problem. Such a container ships with its own libraries and operating system, isolated from any software on the host system that might affect it.
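A cheap extra safeguard, sketched below, is to have the service verify at startup that the runtime actually matches the versions the model was validated with. The pinned version and the helper name `check_pins` are my own illustrative choices:

```python
from importlib import metadata


# Illustrative pins; in the NumPy incident above, even a minor
# version bump was enough to change model outputs.
PINNED = {
    "numpy": "1.26.4",
}


def check_pins(pins: dict) -> list:
    """Compare pinned versions against the running environment and
    return a human-readable list of mismatches (empty means OK)."""
    problems = []
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} not installed (want {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{package}=={installed}, pinned {wanted}")
    return problems
```

Inside a properly pinned Docker image `check_pins(PINNED)` should always return an empty list; a non-empty result means the image was built from drifted dependencies.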

The Docker container can be built automatically after the previous phase (or in parallel with it), for instance by CI/CD tools. It should then be tested as well, since library versions inside it can still differ from older builds. Dockerization is also a fairly reliable way to make GPUs work if we need them.

Our finished Docker container can be combined with others, such as a web interface or a general API, each shipped in its own container. This combining is called orchestration and is managed by tools like Kubernetes. Whether running locally or in the Google Cloud environment, Kubernetes exposes the combined services to the end user. It can also scale individual services as needed: if ten times more traffic arrives at lunchtime, additional container replicas are spun up and later scaled back down.

Containerization, unfortunately, is not a cure-all: it does not solve the long-term availability of old libraries on the internet or hardware-dependent differences in numerical results, but those are topics for longer winter evenings. Currently, it is still the most reliable way to ensure repeatable and trouble-free deployment into production.

Operations Monitoring

The work doesn't end after deployment. The system needs logging and monitoring to make sure it hasn't crashed or run into subtler issues, for example a house on the outskirts of Brno being valued at 140 million. Google Cloud (and similar platforms) includes tools for logging and monitoring where we can inspect logs and set up threshold-based alerts, so outliers like that overpriced house are detectable.
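A minimal sketch of such a threshold alert follows; the price bounds (in CZK) and the logger name are assumptions of mine, and on Google Cloud the same idea would typically be expressed as a log-based alerting policy rather than application code:

```python
import logging

logger = logging.getLogger("model-monitoring")

# Illustrative sanity bounds for predicted house prices, in CZK:
PRICE_MIN, PRICE_MAX = 500_000, 50_000_000


def check_prediction(price: float) -> bool:
    """Log a warning and return False for predictions outside the
    plausible range, such as a house on the outskirts of Brno
    suddenly valued at 140 million."""
    if not PRICE_MIN <= price <= PRICE_MAX:
        logger.warning("Suspicious prediction: %.0f CZK", price)
        return False
    return True
```

The warning log line is what the monitoring stack then counts; an alarm fires once such entries exceed an agreed rate.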

Versioning and Problem Resolution

The model itself, the Docker container, and the deployment within Kubernetes can all be versioned. If monitoring detects a problem, we should be able to perform a rollback and return to a previous version.
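The rollback idea can be sketched with a toy in-memory registry; this stands in for whatever your platform actually provides (Docker image tags, a model registry, Kubernetes rollout history), and all names here are illustrative:

```python
from typing import List, Optional


class ModelRegistry:
    """Toy stand-in for a versioned deployment history."""

    def __init__(self):
        self._versions: List[str] = []  # e.g. Docker image tags, oldest first

    def deploy(self, version: str) -> str:
        self._versions.append(version)
        return version

    @property
    def current(self) -> Optional[str]:
        return self._versions[-1] if self._versions else None

    def rollback(self) -> str:
        """Drop the latest version and return to the previous one."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        return self._versions[-1]


registry = ModelRegistry()
registry.deploy("model:1.0.0")
registry.deploy("model:1.1.0")  # monitoring flags a problem...
registry.rollback()             # ...so we return to model:1.0.0
```

The important property is that every deployed artifact stays addressable by an immutable version, so returning to it is a single operation rather than a rebuild.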


Conclusion

MLOps is a young field that begins where the research team’s work ends. And I dare say there aren’t many of us yet, at least not with five or more years of experience, during which I have run into many problems that current materials don’t prepare anyone for. If you are interested, I can help set up these processes or even migrate an entire system to the Google Cloud environment. The advantages of such a system are simpler development and reproducible, scalable production.