Docker Data Science Pipeline

At ING we needed a way to implement Data science models from exploration into production. I will do this talk from my experience on the exploration and production Hadoop environment as a senior Ops engineer. For this we are using OpenShift to run Docker containers that connect to the big data Hadoop environment.

During this talk I will explain why we need this and how this is done at ING. Also how to set up a docker container running a data science model using Hive, Python, and Spark. I’ll explain how to use Docker files to build Docker images, add all the needed components inside the Docker image, and how to run different versions of software in different containers.

In the end I will also give a demo of how it runs and is automated using Git with webhook connecting to Jenkins and start the docker service that will connect to a big data Hadoop environment.

This is going to be a great technical talk for engineers and data scientist.