Heartbeat

Comet is a machine learning platform helping data scientists, ML engineers, and deep learning engineers build better models faster

Data Science Project Workflows


Photo by Campaign Creators on Unsplash

In 2017, LinkedIn reported that Data Science was the fastest-growing job, with 650% growth in Data Scientist roles since 2012. Data Science jobs appear in top-ranked best-jobs and highest-paying-jobs lists, so it’s safe to say the field has taken every industry by storm.

Data is the new oil: according to SeedScientific, the amount of data in the world was estimated at 44 zettabytes in 2020. Seagate estimates that by 2025 the amount of data generated annually will reach 163 zettabytes, roughly ten times the amount produced in 2017.

So we can understand why there is such a rise in Data Scientists; they are the people who know what to do with data. People want to adapt to the change in demand while keeping their jobs, so they aim to develop their skills to fit the current market. It is a competitive market, and we are seeing more and more people building an interest in Data Science, Machine Learning, and related fields due to the continuous rise in the use of data. There are thousands of online courses, bootcamps, and Master’s degrees available in the sector.

Understanding a Data Science workflow is important if you are interested in getting your foot in the door, or if you’re already in but need some help managing your projects more effectively and maintaining consistency across the team.

Data Science Workflow

Let’s start with what ‘workflow’ actually means. If you have a look at Wikipedia, their definition for ‘Workflow’ is:

“An orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence of operations, declared as work of a person or group, an organization of staff, or one or more simple or complex mechanisms.”

I don’t know about you, but those are pretty wordy sentences. In layman’s terms, workflow is the way people get work done. It can be shown as a sequence of steps that need to be completed in order to move on to the next step, helping you go from start to finish. You can think of it as literally flowing through your work from one stage to the next, whether that includes communicating with a colleague or using a specific tool.

So, using the definition of ‘workflow,’ let’s understand what a Data Science workflow is. A Data Science workflow consists of the stages (or steps) needed to complete a data science project. A well-defined, efficient Data Science workflow gives the data team better organization and a clearer understanding of the project process.

General Stages of a Data Science Workflow:

The challenge with Data Science is that, as much as we wish to start with a specific hypothesis, it may not be the same theory by the end. Depending on the problem you are trying to solve with the data you have, there can be several solutions. Ultimately, it is down to the data team to define a structure that is suitable and efficient for them.

That doesn’t mean that each data team has a completely different workflow. There are very common approaches to tackle different problems, regardless of the data at hand.

Defining the Problem

This step applies everywhere, in the workplace and throughout our day-to-day lives. Defining a problem well is much harder than it seems, as there are many factors to take into consideration, each of which can affect the problem differently. Questions to dive into whilst defining the problem are:

  • What is the problem we are trying to solve?
  • Why are we facing this or other current problems?
  • How is this problem affecting our customers, sales, and finances?
  • Can we solve this problem?

Being able to answer these questions will give you the ability to clearly state your problem and is the first step before applying data to a Data Science project.

Data

Although there is a lot of data out there, it is rare to have exactly the data you want and need. This leads us to acquire the data ourselves; this is called Data Acquisition.

Wikipedia defines Data Acquisition as “the process of sampling signals that measure real-world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer.”

There are different ways to collect data: open data sources, surveys, web scraping, and APIs, for example.

When acquiring data, you have to be careful and take into consideration the reliability of certain data. For example, is the open data source a government website that conducted demographic findings through a Census, or was the data collected via survey by a person interested in people’s characteristics?
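Once acquired, data typically gets loaded into a tabular structure for downstream work. As a minimal sketch using pandas (the article names no specific library, and the inline CSV below is a toy stand-in for a real vetted source), acquisition plus a quick sanity check might look like:

```python
import io

import pandas as pd

# In practice this would be a vetted open-data URL or file path;
# here a small inline CSV stands in for the acquired data.
raw_csv = io.StringIO(
    "region,age,income\n"
    "north,34,52000\n"
    "south,29,48000\n"
    "east,41,61000\n"
)
df = pd.read_csv(raw_csv)

# Sanity-check what was acquired: size and schema.
print(df.shape)
print(df.columns.tolist())
```

Checking the shape and column names immediately after loading is a cheap way to catch a malformed or incomplete download before it propagates into later stages.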

According to this Forbes survey, 19% of your time will involve the collection of data.

Exploring the Data

At this point, we should have sufficient, reliable data that is accessible to the Data Scientists, who will explore it to understand its limitations, anomalies, and patterns.

At this point, the data team will refer back to the problem at hand and decide on a hypothesis based on looking at the data. You will also want to determine the type of problem being solved. Is the problem a supervised learning or unsupervised learning task? Are we trying to make a prediction? Is it simply a correlation between specific variables? Or is it more complex like a classification or regression task?

The overall aim at this stage of the workflow is to get a good grasp of the data and develop a hypothesis before moving on to the next stage.
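This kind of first pass can be sketched in a few lines of pandas; the dataset below is a toy stand-in, and the column names are illustrative:

```python
import pandas as pd

# Toy dataset standing in for the acquired data.
df = pd.DataFrame({
    "age": [34, 29, 41, 55, 23],
    "income": [52000, 48000, 61000, 75000, 39000],
})

# Summary statistics reveal ranges, spread, and potential anomalies.
print(df.describe())

# Missing-value counts per column highlight data limitations.
print(df.isna().sum())

# Pairwise correlations hint at relationships worth hypothesizing about.
print(df.corr())
```

A strong correlation spotted here might become the hypothesis carried into modeling, or a wall of missing values might send the team back to data acquisition.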

Data Wrangling

Data is rarely “clean,” even if it came from the best source. There may be mistakes or missing values, or the data may be recorded in a particular unit, such as degrees Celsius, when your task needs it in Fahrenheit.

Data needs to be cleaned before feeding it into a model. This stage may include:

  • NAs and NaNs — fixing missing or incorrect values
  • Remapping data — converting categorical data into numerical data
  • Images — rescaling, rotating, smoothing, and other transformations
  • Natural language — normalizing case and removing punctuation and stop words
  • Audio — filtering or de-noising audio files

This stage of the Data Science workflow will consume most of your time as there will be a lot of manual checks to identify problems and resolve them. Referring back to the Forbes survey, 60% of your time will involve cleaning and organizing the data in the right format before modeling.
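A few of the wrangling steps above, including the Celsius-to-Fahrenheit example, can be sketched with pandas on a toy table (column names and category codes are illustrative):

```python
import numpy as np
import pandas as pd

# Toy dataset with a missing value and a categorical column.
df = pd.DataFrame({
    "temp_c": [21.5, np.nan, 19.0, 23.2],
    "category": ["low", "high", "low", "medium"],
})

# Fix missing values: fill NaNs with the column median.
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].median())

# Unit conversion: Celsius to Fahrenheit.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Remap categorical data into numerical codes.
df["category_code"] = df["category"].map({"low": 0, "medium": 1, "high": 2})

print(df)
```

Median imputation and an explicit category mapping are only two of many possible choices; the right fix always depends on why the values are missing and what the model downstream expects.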

Below are a few recommended data cleaning tools to get the most out of your data and reduce the amount of time during this stage:

  • OpenRefine: Formerly known as Google Refine, this free, open-source tool helps you clean and transform data, letting you wrangle big datasets more simply and saving a lot of time.
  • Trifacta Wrangler: Built by the makers of the Data Wrangler project, this is another free tool for cleaning and transforming data, well known for its focus on speeding up data analysis.
  • TIBCO Clarity: This data cleaning tool aims to help people spend less time preparing and cleaning their data and more time using it effectively. It helps users identify trends quickly so they can make better decisions.

Modeling

After the data has been explored comprehensively, we have a solid understanding of the problem and have developed one or more hypotheses. In the modeling stage, there is a lot of experimentation and trial and error to help us decide which solution to move forward with.

The experiment stage will include:

  • Building — This involves choosing a model type and constructing it around the problem and the available features.
  • Fitting — This involves training the model so that it learns and generalizes from the training data.
  • Validation — This involves splitting the data into train and test sets and measuring the model’s ability to generalize to unseen data.

The Forbes survey states that the modeling process, which includes data mining, building training sets, and refining algorithms, will consume 16% of your time.
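The build–fit–validate loop above can be sketched with scikit-learn (a common choice, though the article names no specific library), using one of its built-in datasets:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Validation setup: hold out unseen data by splitting into train and test.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Building and fitting: choose a model and train it on the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Validation: measure generalization on the held-out test data.
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

In a real project this loop repeats many times, swapping models, features, and hyperparameters until the test-set performance supports (or refutes) the working hypothesis.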

Presenting your outputs

This stage is discussed the least, yet it is very important. Many Data Scientists spend hours on end working on Machine Learning models but forget the first stage: defining the problem. Only if the outputs clearly address the problem you defined is your task complete.

Your next step is communicating your findings and outputs to various stakeholders, most of whom have little or no knowledge of Data Science concepts. They are primarily interested in the results and how they can use them to make the right decisions. Communication is therefore an important soft skill for Data Scientists, who may need to adapt how they present a project depending on their audience.

Data Science Workflow Taught at Harvard

Harvard’s introductory Data Science course uses the Blitzstein & Pfister workflow, crafted by Joe Blitzstein and Hanspeter Pfister for Harvard’s CS 109. The goal of this framework is to introduce students to the overall process of a data science investigation, from start to finish.

The five phases are:

  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Communicate and visualize the results
Source: CS109

CRISP-DM Workflow Framework

CRISP-DM is an acronym that stands for CRoss-Industry Standard Process for Data Mining and is a well-known framework used to define the process of Data Science Workflows.

CRISP-DM workflow is made up of six iterative phases:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Particular phases can loop back to the previous stages, as shown in the diagram below. Each stage has a defined task and a set of deliverables (including documentation and reports).

Source: Wikipedia

Many other workflows state the need to loop back to a previous stage; however, they don’t explicitly describe when a team should loop back or when it is okay to progress to the next phase.

In order to determine the best workflow for your team, you first need to assess the members of the team. Understanding your team is a good way to figure out what is required, for example, their roles in the data team, and their method of working. This will help you come up with a good set of practices that resonate with your team’s values and the overall goal of the company.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.


Written by Nisha Arya Ahmed

