tl;dr:
This series provides a comprehensive overview of how to operationalize your work as a data scientist. It is structured around different scenarios, so you can find guidance on when to do what, and how. Find your scenario in the list below or jump directly to a specific technology.
- Scenario 1: The Startup
- Scenario 2: The Manufacturer
- Tooling:
Why This Series
To create real value as a data scientist, at some point you will probably face the challenge of making your work accessible and usable for the world outside of your workstation. This may include your team members, other departments of your company, or even your customers, who, for example, benefit directly from your product recommendation model. Depending on the type of your organization (startup or big company, manufacturing or online business, etc.), the existing IT infrastructure, security requirements (including roles and rights), and so on, the landscape of possible workflows and tools can be completely overwhelming. Maybe you come up with questions like: How do I design a real service around my work? How do I manage deployments of different versions of my model, possibly with breaking changes to the input parameters? How do I guarantee near-zero downtime, even under load or during updates? How do I manage who is allowed to use my work?
And that's only half of the story. Imagine what happens when your ideas grow from a single ad-hoc analysis in a Jupyter Notebook into a company-wide service with possibly thousands or millions of stakeholders (internal and external customers who rely on your work). How does this affect your development workflow? For example, how do you manage versions of your 500 MB machine learning model binary? How do you keep track of model performance when you work in a larger team?
In the software engineering world there are well-established best practices you have probably heard of or even used yourself, such as version control, unit and integration tests, and continuous integration and deployment, summarized under the term DevOps. The goal is to bridge the gap between development and operations while ensuring higher software quality and faster development. While some of these concepts can and should be applied to the data science development process as is, there are specific needs in our field that require their own approaches, like the mentioned version control for machine learning model binaries. Spoiler: there are better ways than committing each new version to your code repository.
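To make that spoiler slightly more concrete, here is a minimal sketch of one possible alternative (not necessarily the approach used later in this series): push the large binary to object storage, addressed by its content hash, and commit only a small pointer file to Git. The bucket name, file names, and the boto3-based upload are assumptions for illustration; dedicated tools like DVC or MLflow automate this pattern for you.

```python
import hashlib
import json
from pathlib import Path

import boto3  # assumes AWS credentials are configured locally

MODEL_PATH = Path("model.bin")          # hypothetical 500 MB model binary
BUCKET = "my-model-registry"            # hypothetical S3 bucket
POINTER_PATH = Path("model.bin.json")   # small pointer file that goes into Git


def content_hash(path: Path) -> str:
    """Return the SHA-256 hash of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish_model(path: Path, bucket: str) -> None:
    """Upload the binary to object storage and write a small pointer file.

    Only the pointer file (a few bytes) is committed to Git; the large
    binary lives in the bucket, addressed by its content hash.
    """
    digest = content_hash(path)
    key = f"models/{digest}/{path.name}"

    s3 = boto3.client("s3")
    s3.upload_file(str(path), bucket, key)

    POINTER_PATH.write_text(
        json.dumps({"bucket": bucket, "key": key, "sha256": digest}, indent=2)
    )


if __name__ == "__main__":
    publish_model(MODEL_PATH, BUCKET)
```

The pointer file is tiny and diff-friendly, and it uniquely identifies the binary, so your Git history stays small while every model version remains retrievable.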
This is the introduction to a complete series on “DevOps for Data Science”. We will cover topics such as how to set up a proper development environment, how to write an API for a machine learning model, how to deploy your services at scale, and many more. Since there are so many use cases, there is no single right way of doing it. Therefore, this series is organized by representative scenarios that cover a wide variety of situations you may find yourself in as a data scientist. A small private project, a startup that wants to improve its services with machine learning, a big company with a preexisting (legacy) IT infrastructure: I have you covered. We will also look at multiple options for a specific use case and discuss their pros and cons depending on your prerequisites, for example hosting a machine learning model on a serverless platform versus deploying a wrapper API on a self-maintained cloud machine. And even if no scenario suits you 100%, you should be able to cherry-pick the necessary parts for your case or combine different stories.
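As a tiny preview of what a wrapper API around a model can look like, here is a minimal sketch using FastAPI; the framework choice, the request schema, and the dummy predict function are placeholder assumptions, not a commitment for the rest of the series.

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Model wrapper API (sketch)")


class PredictionRequest(BaseModel):
    # Placeholder schema; a real model defines its own input features.
    features: List[float]


class PredictionResponse(BaseModel):
    score: float


def predict(features: List[float]) -> float:
    # Stand-in for a real model call, e.g. loading a trained model at
    # startup and calling model.predict(...) here.
    return sum(features) / max(len(features), 1)


@app.post("/predict", response_model=PredictionResponse)
def predict_endpoint(request: PredictionRequest) -> PredictionResponse:
    return PredictionResponse(score=predict(request.features))
```

Assuming the file is called `main.py`, you can run it locally with `uvicorn main:app --reload`; how such a service is then deployed, secured, and scaled is exactly what the scenarios in this series are about.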