cnvrg.io Blog

Docker for machine learning and reproducible data science

Written by Maya Perry | Jun 11, 2019 10:42:58 AM

There is no denying that Docker has become an invaluable part of machine learning development. But what makes Docker so well suited to data science? The Docker hype doesn’t come from nothing. Ask any data scientist or data engineer, and they will tell you that data science without Docker is like a fish swimming upstream. For data scientists, it makes development faster because they have a consistent environment for experimentation. For data engineers, it makes the data science workflow much easier to push towards production. In this post, we will tell you how Docker can be used for data science, and how it can streamline your machine learning pipeline.

What is Docker?

Developers have long used Docker to develop, deploy and run applications. Docker was first released in 2013 as a tool for software development, but it was quickly adopted by data engineers, and more recently by data scientists. Data scientists with a background in development or data engineering were already familiar with Docker and have used it to develop, deploy and run machine learning models as well. How does Docker do this? It packages all the dependencies, frameworks, tools and libraries needed to run your project into a single environment. It’s portable, scalable, and stackable. That means you can share it with others, replicate it, and build services on top of it. What more could a data scientist need? While it may be an obvious solution, not all data scientists are fluent enough with Docker to make it work for their workflow. We’ll explain specific ways Docker can be used to improve data science workflows and streamline machine learning.

Bridges work from science to engineering

Data scientists do not come in one shape or size. Not all come from the world of engineering or software development. Some great data scientists come from the world of science or mathematics. In other words, while they can provide excellent research, limited experience with software installation and setup can increase the time they spend on projects. Data scientists can write Jupyter Notebooks or scripts and bake them, along with their results, into a Docker image. With a little bit of time, you can write a Dockerfile that lists every instruction needed to build and run your project. From there, you can create your custom Docker image and share it with anyone on your team responsible for engineering tasks. Since many engineers are pros at Docker, they won’t have any issue using your Docker image. With Docker, you can transport all of the research and run it in exactly the way the data scientist intended.
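
As a rough illustration of what such a Dockerfile might contain, the sketch below uses Python to write one into a project folder. The base image tag, `requirements.txt`, and `train.py` are assumptions made for this example, not part of any particular project.

```python
from pathlib import Path

# Hypothetical project layout: requirements.txt lists the dependencies,
# and train.py is the script the data scientist wants others to run.
DOCKERFILE = """\
# Start from an official Python base image (tag chosen for illustration)
FROM python:3.10-slim
WORKDIR /app
# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the notebooks, scripts and any saved results into the image
COPY . .
CMD ["python", "train.py"]
"""


def write_dockerfile(project_dir: str) -> Path:
    """Write the Dockerfile into project_dir and return its path."""
    path = Path(project_dir) / "Dockerfile"
    path.write_text(DOCKERFILE)
    return path
```

From the project folder, `docker build -t my-project .` would then build the image and `docker run my-project` would run it, where `my-project` is just a placeholder name.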

Deploy models to production

Once your model is complete, you can wrap it up into an API and place it in a Docker container to send to DevOps to deploy. Or, better yet, you can use Kubernetes and deploy your machine learning model without DevOps. This simple guide can help you deploy machine learning models with Kubernetes: https://blog.cnvrg.io/deploy-models-with-kubernetes. While it’s possible to deploy without Docker, many data scientists prefer to use Docker for smoother, more reliable deployments. Additionally, it makes the deployment more portable for future use.
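
To make "wrap it up into an API" a little more concrete, here is a minimal sketch that uses only the Python standard library. The `predict` function is a stand-in for a real trained model, and the port and JSON field names are assumptions for this example; in practice you would likely reach for a web framework instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body such as {"features": [1.0, 2.0, 3.0]}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


def serve(host="0.0.0.0", port=8080):
    """Start the API; inside a container this is what the CMD would run."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` from the container's entry point and publishing the port with `docker run -p 8080:8080 ...` would expose the model to whoever deploys it.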

Reproduce results

One major capability of Docker is its ability to easily reproduce your working environment. It enables data scientists to build environments once and then ship their training and deployment quickly and easily. Think of it as something like a lightweight virtual machine that you can deploy across servers or on a personal computer without worrying about dependencies. This is key to solving the data science reproducibility problem. It’s quite easy to build an environment using pre-built Docker images, which can be found on Docker Hub. The Docker Hub community has built a huge number of Docker images to use as is, or to customize. Data scientists can also build their own custom environments, and there are plenty of resources on the web that show how.

Not only that, but Docker can serve as a reproducibility tool with a little additional setup. You can persist any type of metadata and version your model by storing the Docker image tag alongside that metadata.
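
One possible way to do this is to write a small JSON record for every run. The sketch below is only an illustration: the function name, the field names, and the example image tag are all assumptions, and a real setup might store the record in an experiment tracker instead of a file.

```python
import json
import time
from pathlib import Path


def save_run_metadata(path, image_tag, params, metrics):
    """Write one JSON record tying a model run to the Docker image it ran in."""
    record = {
        "image_tag": image_tag,  # e.g. "my-team/train:1.4.2" (hypothetical)
        "params": params,
        "metrics": metrics,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    Path(path).write_text(json.dumps(record, indent=2))
    return record
```

Anyone who later wants to reproduce the run can pull the image named in `image_tag` and rerun the training with the recorded parameters.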

Install Docker

If you don’t have Docker installed on your machine, you can follow the instructions in this post “How to Setup Docker and Nvidia-Docker 2.0 on Ubuntu 18.04