Create an MLOps Pipeline With GitHub and Docker Hub in Minutes


Photo by Tom Fisk: https://www.pexels.com/photo/birds-eye-view-photo-of-freight-containers-2226458/

Can you create an MLOps pipeline with GitHub and Docker Hub in minutes? Definitely yes! In this tutorial, you will use the power of Docker within an automated MLOps process.

If you find yourself saying:

  • "But it was working on my machine"
  • "The software is different"
  • "Maybe the operating system difference is the reason"
  • "My configuration is different"
  • "There is a software mismatch"

You are at the right place.

If I say an app or model works in my local environment, it should carry global reliability. Any one of my teammates in India, Germany, Singapore, Australia, etc., should easily get the same app or model working in their local environment.

Docker will package your app and its dependencies into a reusable Docker image. By using the MLOps process defined in this tutorial, you will:

  • Automate packaging your app and its dependencies into a Docker image
  • Push the code to GitHub
  • Push the image to Docker Hub to register it
  • Let you or your colleagues use it reliably in different environments

Easy-to-follow, step-by-step explanations will be your guide on this journey.

Let’s start the journey.

Introduction

In the previous two articles, we discussed end-to-end MLOps pipeline implementation using GitHub Actions and Heroku with Flask API and FastAPI.

In this article, we will discuss the end-to-end MLOps pipeline implementation using GitHub Actions and Docker Hub.

First, let's look at the definitions of both services, and then we will move on to the implementation.

GitHub Actions

GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub. Make code reviews, branch management, and issue triaging work the way you want. With GitHub Actions, you can automate your workflow from idea to production.

Docker Hub

Docker Hub is the world's largest library and community for container images. One can browse over 100,000 container images from software vendors, open-source projects, and the community.

Image by Author

CI (Continuous Integration)

For the CI part, I used the heart-failure dataset. One of the main causes of heart failure is cardiovascular disease. The dataset contains features that may increase the risk of cardiovascular disease, and I developed a machine learning model using these features.

Such a model can support early disease detection and help plan and manage risk factors.

You can find a detailed discussion of the EDA and the ML models here.

Based on the final CatBoost model, a basic Flask app has been developed.

from flask import Flask, render_template, request
import pickle
import pandas as pd


app = Flask(__name__)
model = pickle.load(open('catboost_model-2.pkl', 'rb'))


def model_pred(features):
    # Helper for single-record predictions (not used by the routes below)
    test_data = pd.DataFrame([features])
    prediction = model.predict(test_data)
    return int(prediction[0])


@app.route('/', methods=['GET'])
def Home():
    return render_template('index.html')


@app.route("/predict", methods=['POST'])
def predict():
    if request.method == 'POST':
        # Read the form fields and cast them to the types the model expects
        Age = int(request.form['Age'])
        RestingBP = int(request.form['RestingBP'])
        Cholesterol = int(request.form['Cholesterol'])
        Oldpeak = float(request.form['Oldpeak'])
        FastingBS = int(request.form['FastingBS'])
        MaxHR = int(request.form['MaxHR'])
        prediction = model.predict([[Age, RestingBP, Cholesterol,
                                     FastingBS, MaxHR, Oldpeak]])

        if prediction[0] == 1:
            return render_template('index.html', prediction_text="Kindly make an appointment with the doctor!")
        else:
            return render_template('index.html', prediction_text="You are well. No worries :)")
    else:
        return render_template('index.html')


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)

In addition to the Flask app, I also added:

  • requirements.txt
  • test.py (a minimal sketch is shown below)
  • index.html
  • catboost_model-2.pkl (the pickled CatBoost model)
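
The test file isn't shown in full here; a minimal smoke test along these lines would satisfy the pytest step later in the pipeline (the route names come from the app above, but the exact test values are only illustrative):

# test.py -- minimal smoke tests for the Flask app (illustrative sketch)
from app import app


def test_home_page():
    # The home route should render the index page successfully
    response = app.test_client().get('/')
    assert response.status_code == 200


def test_predict_endpoint():
    # Posting a plausible set of form values should return a rendered page
    form = {'Age': 55, 'RestingBP': 130, 'Cholesterol': 220,
            'Oldpeak': 1.0, 'FastingBS': 0, 'MaxHR': 150}
    response = app.test_client().post('/predict', data=form)
    assert response.status_code == 200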

So we can move on to the GitHub part.

I created a new GitHub repo and connected it to my local folder, roughly as sketched below.
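
For reference, connecting a local folder to a brand-new repo typically looks something like this (the URL is a placeholder for your own repository):

git init
git remote add origin https://github.com/<your-username>/<your-repo>.git
git add .
git commit -m "Add Flask app, model, and tests"
git branch -M main
git push -u origin main

Only one final touch needs to be done before starting the Continuous Integration part, which is the YAML file. The YAML file lets GitHub know exactly what needs to be done at each step of the workflow.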

ci_pipeline:
  runs-on: ubuntu-latest

  steps:
    - uses: actions/checkout@v1
      with:
        fetch-depth: 0

    - name: Set up Python 3.9
      uses: actions/setup-python@v1
      with:
        python-version: 3.9

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

    - name: Format
      run: |
        black app.py

    - name: Lint
      run: |
        pylint --disable=R,C app.py

    - name: Test
      run: |
        python -m pytest -vv test.py

Let’s push it to see Continuous Integration in action.

Image by Author

So far, everything is OK with the CI part of our MLOps pipeline, and we got our badge.

Image by Author

So we are ready to move on to the second part of our MLOps pipeline.

CD (Continuous Deployment)

For this part, we will use Docker Hub. We will build a Docker image and push it with the relevant tag to our Docker Hub repo.

We need to follow several steps to finalize our MLOps pipeline.

  • Create secret keys and store them in the GitHub repo:
  1. Docker Hub username
  2. Docker Hub password
  3. Docker Hub repo name
  • Create environment variables for the secret keys
  • Write a Dockerfile
  • Add the Continuous Deployment part to the workflow YAML file
  • Create a .dockerignore file
  • Finalize the YAML file
  • Push everything and check the Docker Hub repo

Secret Keys

We can't put our password or other sensitive information directly into the Dockerfile or any other file. One solution is to use GitHub Actions secrets and create secret keys for later use.

GitHub repository secrets allow us to encrypt sensitive information in our organization or repository. Secrets are encrypted environment variables, which we will use in the GitHub workflow.

“GitHub uses a libsodium sealed box to help ensure that secrets are encrypted before they reach GitHub and remain encrypted until you use them in a workflow.” (https://docs.github.com/en/actions/security-guides/encrypted-secrets)

In this tutorial, we have three secrets: pieces of sensitive information that we don't want to put into our code directly.

Instead of using this sensitive information unmasked in our code, we will use GitHub repository secrets to encrypt it.

Let's create the secret keys in GitHub.

Image by Author

Select "New repository secret" and add all three keys.

Image by Author

Environment Variables

Let’s create environment variables for the repository secrets.

env:
  DOCKER_USER: ${{secrets.DOCKER_USER}}
  DOCKER_PASSWORD: ${{secrets.DOCKER_PASSWORD}}
  REPO_NAME: ${{secrets.REPO_NAME}}

Dockerfile

A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image (docs.docker.com).

Let’s write down our basic Dockerfile.

FROM python:3.9-slim

# Working Directory
WORKDIR /app

# Copy source code to working directory
COPY . app.py /app/

# Install packages from requirements.txt
RUN pip install --no-cache-dir --upgrade pip &&\
    pip install --no-cache-dir --trusted-host pypi.python.org -r requirements.txt

CMD python app.py

Let’s look at each line of the code in detail.

FROM python:3.9-slim

The Dockerfile first pulls the Python 3.9 slim base image.

# Working Directory
WORKDIR /app

Then it sets /app as the working directory, creating it if it doesn't already exist.

# Copy source code to working directory
COPY . app.py /app/

It copies the source files into the /app directory.

# Install packages from requirements.txt
RUN pip install --no-cache-dir --upgrade pip &&\
    pip install --no-cache-dir --trusted-host pypi.python.org -r requirements.txt

It upgrades pip and installs the required libraries from requirements.txt.

CMD python app.py

Finally, it defines the command that starts the app with Python when the container runs.
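
Before wiring the Dockerfile into the pipeline, it can be sanity-checked locally; the image name below is just an example:

docker build -t heart-failure-app .
docker run -p 5000:5000 heart-failure-app

The app should then be reachable at http://localhost:5000 on the host machine.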


Continuous Deployment Part of the YAML File

First let’s look at the CD part of the YAML file.

cd_pipeline:
  runs-on: ubuntu-latest
  needs: [ci_pipeline]

  steps:
    - uses: actions/checkout@v2

    - name: docker login
      run: | # log into docker hub account
        docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

    - name: Get current date # get the date of the build
      id: date
      run: echo "::set-output name=date::$(date +'%Y-%m-%d--%M-%S')"

    - name: Build the Docker image # tag it with user, repo, and date
      run: docker build . --file Dockerfile --tag $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

    - name: Docker Push
      run: docker push $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

Let’s look at each one of them in detail:

cd_pipeline:
  runs-on: ubuntu-latest
  needs: [ci_pipeline]

It is the second part of the MLOps CI-CD pipeline. We named it cd_pipeline, and it runs on the latest Ubuntu runner.

It won't start until the CI part of the pipeline has finished successfully. This is important: we don't want two independent jobs in our pipeline; they have to be connected.

steps:
  - uses: actions/checkout@v2

  - name: docker login
    run: | # log into docker hub account
      docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

The first step in the CD pipeline is to log in to Docker Hub.

GitHub Actions uses our environment variables to access the secrets and logs in to Docker Hub with them.

- name: Get current date # get the date of the build
  id: date
  run: echo "::set-output name=date::$(date +'%Y-%m-%d--%M-%S')"

We want to track our model by its tag. Whenever we push a new modification to GitHub, the date and time of the build become the tag of the image.
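
A side note: GitHub has since deprecated the ::set-output workflow command. If you run this pipeline on current runners, the equivalent step writes to the GITHUB_OUTPUT file instead, and the rest of the workflow stays unchanged:

- name: Get current date # get the date of the build
  id: date
  run: echo "date=$(date +'%Y-%m-%d--%M-%S')" >> "$GITHUB_OUTPUT"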

- name: Build the Docker image # tag it with user, repo, and date
  run: docker build . --file Dockerfile --tag $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

Using the Dockerfile, GitHub Actions will build the Docker image.
The name of the image consists of the username, repo name, and date tag.

- name: Docker Push
  run: docker push $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

After everything is finished, GitHub will push the newly created image to the defined Docker Hub repo.

The .dockerignore File

We don't want to include everything in our Docker container.
The whole point of the container is to run a single application.
The container only needs the app files and relevant dependencies, nothing else.

With a .dockerignore file, we can exclude the unneeded files when we copy everything else into the working directory of the Docker container.

In this project, I excluded test.py, the README file, and the hidden GitHub files.

test.py
readme.md
.git*

Finalize the YAML file

Let's combine the Continuous Integration and Continuous Deployment parts of the pipeline into one YAML file, which lives under .github/workflows/ so GitHub Actions can pick it up.

name: Github-Docker Hub MLOps pipeline - KB

env:
  DOCKER_USER: ${{secrets.DOCKER_USER}}
  DOCKER_PASSWORD: ${{secrets.DOCKER_PASSWORD}}
  REPO_NAME: ${{secrets.REPO_NAME}}

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:

  ci_pipeline:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v1
        with:
          fetch-depth: 0

      - name: Set up Python 3.9
        uses: actions/setup-python@v1
        with:
          python-version: 3.9

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

      - name: Format
        run: |
          black app.py

      - name: Lint
        run: |
          pylint --disable=R,C app.py

      - name: Test
        run: |
          python -m pytest -vv test.py

  cd_pipeline:
    runs-on: ubuntu-latest
    needs: [ci_pipeline]

    steps:
      - uses: actions/checkout@v2

      - name: docker login
        run: | # log into docker hub account
          docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

      - name: Get current date # get the date of the build
        id: date
        run: echo "::set-output name=date::$(date +'%Y-%m-%d--%M-%S')"

      - name: Build the Docker image # tag it with user, repo, and date
        run: docker build . --file Dockerfile --tag $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

      - name: Docker Push
        run: docker push $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

We are ready to push our files and follow the GitHub workflow. The workflow starts with the first part of our MLOps CI-CD pipeline.

As you can see, the second part (CD) hasn't started yet; it is waiting for the CI part to finish.

Image by Author

Cool. The pipeline is working. Whenever the first part of the pipeline finishes, the second part starts to work.

Image by Author

The pipeline finishes successfully.

Image by Author

Let's check it on Docker Hub. We should have a new repo and an image with a date tag.

Image by Author

OK, we have a repo. Let's see the image with the date tag.

Image by Author

Docker Image in the local environment

Let’s pull the image and work with it in the local environment.
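
The screenshots below show the pull; in command form it looks like this, with the user, repo, and date tag replaced by your own values:

docker pull <docker-user>/<repo-name>:<date-tag>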

Image by Author

Let’s see what we have in the container.
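
One quick way to check the contents (again with placeholder names) is to override the default command and list the working directory:

docker run --rm <docker-user>/<repo-name>:<date-tag> ls -la /app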

Image by Author

.dockerignore did its part: the excluded files are not in the container.

Let’s run the image and see our application.
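
In command form, that is (placeholder names again):

docker run -p 5000:5000 <docker-user>/<repo-name>:<date-tag>

Port 5000 in the container is published to port 5000 on the host, so the app is available at http://localhost:5000.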

Image by Author
Image by Author

Conclusion

That's all, folks. We have successfully constructed an end-to-end MLOps pipeline with GitHub Actions and Docker Hub.

With Docker, we packaged our application and its dependencies into a Docker image that can be used and reused in different environments without any technical problems.

When you say that an app or model works in your local environment, it now carries global reliability. Any one of your teammates from anywhere in the world can get the same app or model working in their local environment.

After this tutorial, you can easily share your apps and services with your colleagues without worrying about operating system differences, software mismatches, or any of the other everyday file-sharing problems data scientists and ML engineers face.

You can easily automate your MLOps pipeline to put your app or service into a Docker image registered on Docker Hub, and you can share that image with your colleagues, who can use it without any dependency problems.

The code can be downloaded from here.

By the way, if you like the topic, you can show it with your support 👏

Feel free to leave a comment. Thanks for your time.

All the best 🤘

If you enjoy reading my content, please consider following me. Also, you can support me and other writers by subscribing to Medium. Using my referral link will not cost you extra.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
