Create an MLOps Pipeline With GitHub and Docker Hub in Minutes


Photo by Tom Fisk: https://www.pexels.com/photo/birds-eye-view-photo-of-freight-containers-2226458/

Can you create an MLOps pipeline with GitHub and Docker Hub in minutes? Definitely yes! In this tutorial, you will use the power of Docker within an automated MLOps process.

If you find yourself saying:

  • "But it was working on my machine"
  • "The software is different"
  • "Maybe the operating system difference is the reason"
  • "My configuration is different"
  • "There is a software mismatch"

You are at the right place.

If I say an app or model works in my local environment, it should carry global reliability. Any one of my teammates in India, Germany, Singapore, Australia, etc., should easily get the same app or model working in their local environment.

Docker will package your app and its dependencies into a reusable Docker image. By using the MLOps process defined in this tutorial, you will:

  • Automate packaging your app and its dependencies into a Docker image
  • Push the code to GitHub
  • Push the image to Docker Hub to register it
  • Let you or your colleagues use it reliably in different environments

Easy-to-follow, step-by-step explanations will be your guide on this journey.

Let’s start the journey.

Introduction

In the previous two articles, we discussed end-to-end MLOps pipeline implementation using GitHub Actions and Heroku with Flask API and FastAPI.

In this article, we will discuss the end-to-end MLOps pipeline implementation using GitHub Actions and Docker Hub.

First, let's look at the definitions of both services, and then we will move on to the implementation.

GitHub Actions

GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub. Make code reviews, branch management, and issue triaging work the way you want. With GitHub Actions, you can automate your workflow from idea to production.

Docker Hub

Docker Hub is the world's largest library and community for container images. One can browse over 100,000 container images from software vendors, open-source projects, and the community.

Image by Author

CI (Continuous Integration)

For the CI part, I used the heart-failure dataset. One of the main causes of heart failure is cardiovascular disease. The dataset contains features that may increase the risk of cardiovascular disease, and I developed a machine learning model using these features.

Such a model can support early disease detection and help plan and manage risk factors.

You can find a detailed discussion of the EDA and the ML models here.

Based on the final CatBoost model, a basic Flask app has been developed.

from flask import Flask, render_template, request
import pickle
import pandas as pd


app = Flask(__name__)
model = pickle.load(open('catboost_model-2.pkl', 'rb'))


def model_pred(features):
    # Helper for single-record predictions (not used by the routes below)
    test_data = pd.DataFrame([features])
    prediction = model.predict(test_data)
    return int(prediction[0])


@app.route('/', methods=['GET'])
def Home():
    return render_template('index.html')


@app.route("/predict", methods=['POST'])
def predict():
    if request.method == 'POST':
        # Read the form fields and cast them to the types the model expects
        Age = int(request.form['Age'])
        RestingBP = int(request.form['RestingBP'])
        Cholesterol = int(request.form['Cholesterol'])
        Oldpeak = float(request.form['Oldpeak'])
        FastingBS = int(request.form['FastingBS'])
        MaxHR = int(request.form['MaxHR'])
        prediction = model.predict([[Age, RestingBP, Cholesterol,
                                     FastingBS, MaxHR, Oldpeak]])

        if prediction[0] == 1:
            return render_template('index.html', prediction_text="Kindly make an appointment with the doctor!")
        else:
            return render_template('index.html', prediction_text="You are well. No worries :)")
    else:
        return render_template('index.html')


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)

In addition to the Flask app, I also added:

  • requirements.txt
  • test.py (a minimal sketch is shown below)
  • index.html
  • catboost_model-2.pkl (the pickled CatBoost model)
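
The test file isn't shown in full here; a minimal smoke test along these lines would satisfy the pytest step later in the pipeline (the route names come from the app above, but the exact test values are only illustrative):

# test.py -- minimal smoke tests for the Flask app (illustrative sketch)
from app import app


def test_home_page():
    # The home route should render the index page successfully
    response = app.test_client().get('/')
    assert response.status_code == 200


def test_predict_endpoint():
    # Posting a plausible set of form values should return a rendered page
    form = {'Age': 55, 'RestingBP': 130, 'Cholesterol': 220,
            'Oldpeak': 1.0, 'FastingBS': 0, 'MaxHR': 150}
    response = app.test_client().post('/predict', data=form)
    assert response.status_code == 200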

So we can move on to the GitHub part.

I created a new GitHub repo and connected it to my local folder, roughly as sketched below.
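
For reference, connecting a local folder to a brand-new repo typically looks something like this (the URL is a placeholder for your own repository):

git init
git remote add origin https://github.com/<your-username>/<your-repo>.git
git add .
git commit -m "Add Flask app, model, and tests"
git branch -M main
git push -u origin main

Only one final touch needs to be done before starting the Continuous Integration part, which is the YAML file. The YAML file lets GitHub know exactly what needs to be done at each step of the workflow.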

ci_pipeline:
  runs-on: ubuntu-latest

  steps:
    - uses: actions/checkout@v1
      with:
        fetch-depth: 0

    - name: Set up Python 3.9
      uses: actions/setup-python@v1
      with:
        python-version: 3.9

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

    - name: Format
      run: |
        black app.py

    - name: Lint
      run: |
        pylint --disable=R,C app.py

    - name: Test
      run: |
        python -m pytest -vv test.py

Let’s push it to see Continuous Integration in action.

Image by Author

So far, everything is OK with the CI part of our MLOps pipeline, and we got our badge.

Image by Author

So we are ready to move on to the second part of our MLOps pipeline.

CD (Continuous Deployment)

For this part, we will use Docker Hub. We will build a Docker image and push it with the relevant tag to our Docker Hub repo.

We need to follow several steps to finalize our MLOps pipeline.

  • Create secret keys and store them in the GitHub repo:
  1. Docker Hub username
  2. Docker Hub password
  3. Docker Hub repo name
  • Create environment variables for the secret keys
  • Write a Dockerfile
  • Add the Continuous Deployment part to the workflow YAML file
  • Create a .dockerignore file
  • Finalize the YAML file
  • Push everything and check the Docker Hub repo

Secret Keys

We can't put our password or other sensitive information directly into the Dockerfile or any other file. One solution is to use GitHub Actions secrets and create secret keys for later use.

GitHub repository secrets allow us to encrypt sensitive information in our organization or repository. Secrets are encrypted environment variables, which we will use in the GitHub workflow.

“GitHub uses a libsodium sealed box to help ensure that secrets are encrypted before they reach GitHub and remain encrypted until you use them in a workflow.” (https://docs.github.com/en/actions/security-guides/encrypted-secrets)

In this tutorial, we have three secrets: pieces of sensitive information that we don't want to put into our code directly.

Instead of using this sensitive information unmasked in our code, we will use GitHub repository secrets to encrypt it.

Let's create the secret keys in GitHub.

Image by Author

Select "New repository secret" and add all three keys.

Image by Author

Environment Variables

Let’s create environment variables for the repository secrets.

env:
  DOCKER_USER: ${{secrets.DOCKER_USER}}
  DOCKER_PASSWORD: ${{secrets.DOCKER_PASSWORD}}
  REPO_NAME: ${{secrets.REPO_NAME}}

Dockerfile

A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image (docs.docker.com).

Let’s write down our basic Dockerfile.

FROM python:3.9-slim

# Working Directory
WORKDIR /app

# Copy source code to working directory
COPY . app.py /app/

# Install packages from requirements.txt
RUN pip install --no-cache-dir --upgrade pip &&\
    pip install --no-cache-dir --trusted-host pypi.python.org -r requirements.txt

CMD python app.py

Let’s look at each line of the code in detail.

FROM python:3.9-slim

The Dockerfile first pulls the Python 3.9 slim base image.

# Working Directory
WORKDIR /app

Then it sets /app as the working directory, creating it if it doesn't already exist.

# Copy source code to working directory
COPY . app.py /app/

It copies the source files into the /app directory.

# Install packages from requirements.txt
RUN pip install --no-cache-dir --upgrade pip &&\
    pip install --no-cache-dir --trusted-host pypi.python.org -r requirements.txt

It upgrades pip and installs the required libraries from requirements.txt.

CMD python app.py

Finally, it defines the command that starts the app with Python when the container runs.
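
Before wiring the Dockerfile into the pipeline, it can be sanity-checked locally; the image name below is just an example:

docker build -t heart-failure-app .
docker run -p 5000:5000 heart-failure-app

The app should then be reachable at http://localhost:5000 on the host machine.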


Continuous Deployment Part of the YAML File

First let’s look at the CD part of the YAML file.

cd_pipeline:
  runs-on: ubuntu-latest
  needs: [ci_pipeline]

  steps:
    - uses: actions/checkout@v2

    - name: docker login
      run: | # log into docker hub account
        docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

    - name: Get current date # get the date of the build
      id: date
      run: echo "::set-output name=date::$(date +'%Y-%m-%d--%M-%S')"

    - name: Build the Docker image # tag it with user, repo, and date
      run: docker build . --file Dockerfile --tag $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

    - name: Docker Push
      run: docker push $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

Let’s look at each one of them in detail:

cd_pipeline:
  runs-on: ubuntu-latest
  needs: [ci_pipeline]

It is the second part of the MLOps CI-CD pipeline. We named it cd_pipeline, and it runs on the latest Ubuntu runner.

It won't start until the CI part of the pipeline has finished successfully. This is important: we don't want two independent jobs in our pipeline; they have to be connected.

steps:
  - uses: actions/checkout@v2

  - name: docker login
    run: | # log into docker hub account
      docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

The first step in the CD pipeline is to log in to Docker Hub.

GitHub Actions uses our environment variables to access the secrets and logs in to Docker Hub with them.

- name: Get current date # get the date of the build
  id: date
  run: echo "::set-output name=date::$(date +'%Y-%m-%d--%M-%S')"

We want to track our model by its tag. Whenever we push a new modification to GitHub, the date and time of the build become the tag of the image.
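
A side note: GitHub has since deprecated the ::set-output workflow command. If you run this pipeline on current runners, the equivalent step writes to the GITHUB_OUTPUT file instead, and the rest of the workflow stays unchanged:

- name: Get current date # get the date of the build
  id: date
  run: echo "date=$(date +'%Y-%m-%d--%M-%S')" >> "$GITHUB_OUTPUT"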

- name: Build the Docker image # tag it with user, repo, and date
  run: docker build . --file Dockerfile --tag $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

Using the Dockerfile, GitHub Actions will build the Docker image.
The name of the image consists of the username, repo name, and date tag.

- name: Docker Push
  run: docker push $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

After everything is finished, GitHub will push the newly created image to the defined Docker Hub repo.

The .dockerignore File

We don't want to include everything in our Docker container.
The whole point of the container is to run a single application.
The container only needs the app files and relevant dependencies, nothing else.

With a .dockerignore file, we can exclude the unneeded files when we copy everything else into the working directory of the Docker container.

In this project, I excluded test.py, the README file, and the hidden GitHub files.

test.py
readme.md
.git*

Finalize the YAML file

Let's combine the Continuous Integration and Continuous Deployment parts of the pipeline into one YAML file, which lives under .github/workflows/ so GitHub Actions can pick it up.

name: Github-Docker Hub MLOps pipeline - KB

env:
  DOCKER_USER: ${{secrets.DOCKER_USER}}
  DOCKER_PASSWORD: ${{secrets.DOCKER_PASSWORD}}
  REPO_NAME: ${{secrets.REPO_NAME}}

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:

  ci_pipeline:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v1
        with:
          fetch-depth: 0

      - name: Set up Python 3.9
        uses: actions/setup-python@v1
        with:
          python-version: 3.9

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

      - name: Format
        run: |
          black app.py

      - name: Lint
        run: |
          pylint --disable=R,C app.py

      - name: Test
        run: |
          python -m pytest -vv test.py

  cd_pipeline:
    runs-on: ubuntu-latest
    needs: [ci_pipeline]

    steps:
      - uses: actions/checkout@v2

      - name: docker login
        run: | # log into docker hub account
          docker login -u $DOCKER_USER -p $DOCKER_PASSWORD

      - name: Get current date # get the date of the build
        id: date
        run: echo "::set-output name=date::$(date +'%Y-%m-%d--%M-%S')"

      - name: Build the Docker image # tag it with user, repo, and date
        run: docker build . --file Dockerfile --tag $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

      - name: Docker Push
        run: docker push $DOCKER_USER/$REPO_NAME:${{ steps.date.outputs.date }}

We are ready to push our files and follow the GitHub workflow. The workflow starts with the first part of our MLOps CI-CD pipeline.

As you can see, the second part (CD) hasn't started yet; it is waiting for the CI part to finish.

Image by Author

Cool. The pipeline is working. Whenever the first part of the pipeline finishes, the second part starts to work.

Image by Author

The pipeline finishes successfully.

Image by Author

Let's check it on Docker Hub. We should have a new repo and an image with a date tag.

Image by Author

OK, we have a repo. Let's see the image with the date tag.

Image by Author

Docker Image in the local environment

Let’s pull the image and work with it in the local environment.
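
The screenshots below show the pull; in command form it looks like this, with the user, repo, and date tag replaced by your own values:

docker pull <docker-user>/<repo-name>:<date-tag>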

Image by Author

Let’s see what we have in the container.
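
One quick way to check the contents (again with placeholder names) is to override the default command and list the working directory:

docker run --rm <docker-user>/<repo-name>:<date-tag> ls -la /app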

Image by Author

.dockerignore did its part: the excluded files are not in the container.

Let’s run the image and see our application.
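
In command form, that is (placeholder names again):

docker run -p 5000:5000 <docker-user>/<repo-name>:<date-tag>

Port 5000 in the container is published to port 5000 on the host, so the app is available at http://localhost:5000.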

Image by Author
Image by Author

Conclusion

That's all, folks. We have successfully constructed an end-to-end MLOps pipeline with GitHub Actions and Docker Hub.

With Docker, we packaged our application and its dependencies into a Docker image that can be used and reused in different environments without any technical problems.

When you say that an app or model works in your local environment, it now carries global reliability. Any one of your teammates from anywhere in the world can get the same app or model working in their local environment.

After this tutorial, you can easily share your apps and services with your colleagues without worrying about operating system differences, software mismatches, or any of the other everyday file-sharing problems data scientists and ML engineers face.

You can easily automate your MLOps pipeline to put your app or service into a Docker image registered on Docker Hub, and you can share that image with your colleagues, who can use it without any dependency problems.

The code can be downloaded from here.

By the way, if you like the topic, you can show it with your support 👏

Feel free to leave a comment. Thanks for your time.

All the best 🤘

If you enjoy reading my content, please consider following me. Also, you can support me and other writers by subscribing to Medium. Using my referral link will not cost you extra.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
