Data Pre-processing and Visualization for Machine Learning Models

Natasha Sharma
Published in Heartbeat
12 min read · Jun 7, 2018


The objective of a data science project is to make sense of data for the people who are interested in its insights. A Data Scientist or Machine Learning Engineer follows multiple steps to deliver these results. Data pre-processing (cleaning, formatting, scaling, and normalization) and data visualization through different plots are two very important steps that help in building more accurate machine learning models.

Introduction

The idea of this post is to explain these terms and their roles in machine learning modeling, and to discuss their impacts on various business applications.

We’ll be using the Chocolate Bar Dataset (sounds yummy, right?). This dataset includes chocolate ratings, origins, percentage of cocoa, the variety of beans used, and where the beans were grown.

The dataset has so much information; I bet most of you are thinking, what should we do with this data, and what kind of information can be obtained from it? There is a lot we can do, but for this particular exercise, we’ll explore the data to answer the following questions, using different visualization tools like distribution plots, box plots, KDE plots, and violin plots:

  1. What is the average rating for blended and pure chocolates?
  2. Which countries produce the highest-rated chocolate bars?
  3. Find the distribution of cocoa percentage throughout the dataset (different data points).

Before finding the answers to the above questions, we need to perform some data pre-processing steps — cleaning, formatting, etc. — in order to visualize the data more clearly.

Data Preparation: Cleaning & Formatting Data

The data pipeline starts with collecting the data and ends with communicating the results. The process is not as easy as it sounds. There are multiple steps involved, and one of the most important is data pre-processing.

Data pre-processing itself has multiple steps and the number of steps depends on the type of data file, nature of the data, different value types, and more.

Meet Data Pre-processing

Wikipedia definition:

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing. Data preprocessing is used in database-driven applications such as customer relationship management and rule-based applications (like neural networks).

What is it, then, that makes data pre-processing so important in machine learning or in any data science project?

Importance of Data Pre-processing

Let’s take a simple example: A couple goes into a hospital for a pregnancy test — both the man and woman have to go through the test. Once the pregnancy results return, they suggest that the man is pregnant. Pretty weird, right?

Now try and relate this to a machine learning problem — classification. We have 1000+ couples’ pregnancy test data, and for 60% of the data, we know who’s pregnant. For the remaining 40%, we need to predict the results on the basis of the previously recorded tests. Let’s say that, out of this 60%, 1% of the results suggest that the man is pregnant.

While building a machine learning model, if we haven’t done any pre-processing like correcting outliers, handling missing values, normalization and scaling of data, or feature engineering, we might end up considering those 1% of results that are false.
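The scaling and normalization mentioned above never appear later in this post, so here’s a minimal sketch of min-max scaling with plain pandas. The values and the scaled column name are hypothetical, purely for illustration:

```python
import pandas as pd

# Toy column (hypothetical values): min-max scaling squeezes a numeric
# feature into [0, 1] so no single feature dominates model training.
df = pd.DataFrame({"CocoaPercent": [0.60, 0.70, 0.75, 0.88, 1.00]})

col = df["CocoaPercent"]
df["CocoaScaled"] = (col - col.min()) / (col.max() - col.min())

print(df["CocoaScaled"].tolist())  # values now span 0.0 to 1.0
```

Standardization (subtracting the mean and dividing by the standard deviation) is the other common choice; which one you use depends on the model.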

The machine learning model is nothing but a piece of code; an engineer or data scientist makes it smart through training with data. So if you give garbage to the model, you will get garbage in return, i.e. the trained model will provide false or wrong predictions for the people (40%) whose results are unknown.

This is just one example of incorrect data. People might also end up collecting inappropriate values (e.g. a negative salary for an employee), or missing values. All of this can lead to misleading predictions for the unknowns.


Getting Started with Data Pre-processing

Data pre-processing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present well-known algorithms for each step of data pre-processing.

Let’s load our chocolate data and explore if it needs any data pre-processing.

#Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#Load the chocolate data - keep the data file in the same folder as your Python code
chocolate_data = pd.read_csv("flavors_of_cacao.csv")

#Have a look at the data
chocolate_data.head()
Chocolate Data
# Let's have a look at how many values are missing
chocolate_data.isnull().sum()
Missing values by columns

It seems like we can ignore the single missing value in the Bean Type column, so no imputation (inserting values) is required.
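If the missing value did matter, the two usual options look like this — a sketch on a hypothetical mini-frame, not the real dataset:

```python
import pandas as pd

# Toy frame mirroring the single missing Bean Type found above
df = pd.DataFrame({"BeanType": ["Criollo", None, "Trinitario"]})

filled = df["BeanType"].fillna("Unknown")   # option 1: impute a placeholder
dropped = df.dropna(subset=["BeanType"])    # option 2: drop the row

print(filled.isnull().sum())  # 0 missing after the fill
print(len(dropped))           # 2 rows survive the drop
```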

Let’s pause here and look at the column names in the above image. Specifically, we’re looking at the structure of the dataset:

#Let's have a look at the data and identify object/categorical values and continuous values
chocolate_data.dtypes
Structure of data

The column names contain \n, which will cause errors during data analysis. Let’s format the column names:

original_col = chocolate_data.columns
new_col = ['Company', 'Species', 'REF', 'ReviewDate', 'CocoaPercent',
           'CompanyLocation', 'Rating', 'BeanType', 'Country']
chocolate_data = chocolate_data.rename(columns=dict(zip(original_col, new_col)))
chocolate_data.head()
Updated column names

The column CocoaPercent contains a % sign, which will also cause errors later, so we need to format this, too.

#Remove % sign from CocoaPercent column
chocolate_data['CocoaPercent'] = chocolate_data['CocoaPercent'].str.replace('%', '').astype(float) / 100
chocolate_data.head()
Formatted data

Let’s create a new column, BlendNotBlend. This column will provide information on whether the chocolate is made with a mixture of flavors or is pure. We’ll talk about the reason behind creating this column in the next section.

is_blend = (chocolate_data['Species'].str.lower().str.contains(',|blend|;')
            | (chocolate_data['Country'].str.len() == 1)
            | chocolate_data['Country'].str.lower().str.contains(','))
chocolate_data['BlendNotBlend'] = np.where(is_blend, 1, 0)
chocolate_data.head()
Data with new column

We’ve cleaned and formatted the data. Now we want to see the presentation of this data using some visualization tools and answer the questions we discussed in the introduction.

Data Visualization

Data visualization is an integral part of any data science project. Understanding insights from Excel spreadsheets or raw files becomes more difficult as the size of the dataset increases, and it’s certainly not fun to scroll up and down to do an analysis. Let’s understand visualization and its importance in machine learning modeling. We’ll also try to explore the chocolate bar dataset using a few of these tools.

Visualize the data

Wikipedia definition:

Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data. To communicate information clearly and efficiently, data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message.

In data visualization, we use different graphs and plots to visualize complex data to ease the discovery of data patterns. How does this visualization help in machine learning modeling, or even before we start modeling?

Importance of Visualization

CSV data (pandas DataFrames) can be really difficult to approach if you want to get some insights, whether or not the data is formatted correctly. According to SAS’s data visualization webpage:

Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner — and you can experiment with different scenarios by making slight adjustments.

Data visualization also helps identify areas that need attention, e.g. outliers, which can later impact our machine learning model. It also helps us understand which factors have more impact on the results: for example, in house price prediction, the price will be impacted more by the size of the house than by its style.
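As a concrete sketch of spotting outliers before (or instead of) eyeballing a plot, the common 1.5 × IQR rule takes only a few lines. The ratings below are hypothetical, not from the dataset:

```python
import pandas as pd

# Toy ratings with one suspiciously low value
ratings = pd.Series([3.0, 3.25, 3.5, 3.5, 3.75, 4.0, 1.0])

# Flag anything beyond 1.5 * IQR from the quartiles
q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
iqr = q3 - q1
outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]

print(outliers.tolist())  # [1.0]
```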

Visualization doesn’t just help before modeling; it helps after it, too. For instance, it can help in identifying different clusters in a dataset, which is very difficult to see in plain files without proper visualization.

Visualization impacts modeling in many ways, but it’s especially handy in the EDA (Exploratory Data Analysis) phase, where you try to understand patterns in the data. For this particular exercise, we’ll visualize the distribution of chocolate bar data using some popular techniques.

Visualization Tools

The chocolate bar dataset has different kinds of values — Categorical and Continuous/Numeric. We’ll only be focusing on visualizing the distribution of continuous variables. Let’s jump into plotting.

1. Histogram Plot

Wikipedia definition:

A histogram is an accurate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable).

The main question here is: which data should we pick, and whose distribution should we check? After reading the above definition, one might say, “Oh! Except for object or categorical variables, we can plot a histogram for anything.” That’s a valid point, but are we certain that all continuous values tell a meaningful story?

Let’s start with the Rating column.

#Let's see the distribution of continuous variables
sns.distplot(chocolate_data['Rating'], kde=False)
plt.show()
Rating histogram

The number of times each rating was given is counted and plotted. The bars are displayed next to each other because the variable being measured is continuous and sits on the x-axis. What’s the story behind this plot? We can see that around 390 chocolate bars received a rating of 3.5.
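You can verify what any histogram bar is counting with value_counts(); with the real data this would be chocolate_data['Rating'].value_counts(). A toy sketch:

```python
import pandas as pd

# Hypothetical ratings; a histogram bar's height is just this count
ratings = pd.Series([3.5, 3.5, 3.0, 4.0, 3.5, 2.75])

counts = ratings.value_counts()
print(counts.loc[3.5])  # 3 ratings of 3.5 in this toy sample
```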

Now, the REF column,

sns.distplot(chocolate_data['REF'], kde=False)
plt.show()
REF Histogram

The REF column is a reference number for when the review was entered; a higher reference number means a more recent review.

The next continuous variable is CocoaPercent. A lot of people like dark chocolate (I don’t), so we want to see the distribution of how dark the chocolates are.

sns.distplot(chocolate_data['CocoaPercent'], kde=False)
plt.show()
Cocoa percentage distribution — histogram

2. Box Plot

Wikipedia definition:

In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles.

Box plots give an impression of the underlying distribution. But that’s what histograms do, too, so why do we need box plots? When you compare many distributions with histograms, they don’t overlay well and take up a lot of space when shown side-by-side.
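The numbers a box plot draws can also be computed directly. A minimal sketch on toy ratings; with the real data you’d pass chocolate_data['Rating']:

```python
import pandas as pd

# The five-number summary behind a box plot: min, Q1, median, Q3, max
ratings = pd.Series([2.5, 3.0, 3.25, 3.5, 3.75, 4.0])

summary = ratings.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary.tolist())
```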

Here we’re going to create a box plot for chocolate manufacturing facilities and the ratings given by customers.

# Look at a boxplot over the countries, even blends
fig, ax = plt.subplots(figsize=[6, 16])
sns.boxplot(data=chocolate_data, y='Country', x='Rating')
ax.set_title('Boxplot, Rating for countries (+blends)')
Chocolate places and Given Rating

In the above plot, you can clearly see the ratings given to chocolate bars from each individual country. This visualization helps us understand the distribution of ratings across the dataset for each country, and can further help in finding which countries are more popular than others.

It also suggests which countries are more profitable for sellers and which regions are worth targeting. We could further calculate the average rating and sort the data before box plotting, but we won’t go into that much detail in this post.
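If you did want that average-and-sort step, it could look like this — a sketch on a toy frame with hypothetical values; with the real data you’d group chocolate_data by Country:

```python
import pandas as pd

# Toy frame standing in for chocolate_data
df = pd.DataFrame({
    "Country": ["Peru", "Peru", "Ghana", "Ghana", "Brazil"],
    "Rating": [3.5, 4.0, 3.0, 3.25, 3.8],
})

# Average rating per country, sorted so the best-rated come first
avg = df.groupby("Country")["Rating"].mean().sort_values(ascending=False)
print(avg.index[0])  # Brazil tops this toy sample
```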

3. Violin plot

Recently I came across violin plots and yes, they do resemble the instrument. Let’s see what they can tell us about the data.

Wikipedia definition:

A violin plot is a method of plotting numeric data. It is similar to a box plot with a rotated kernel density plot on each side.

Pretty complicated, right? In order to simplify this, let’s try and plot in steps.

Remember how earlier we created the column BlendNotBlend? Well, here we’re going to use it. We’re going to see how blended and pure chocolates did by comparing the ratings they received.

  1. Box plot (a small one, unlike the box plot above): The plot below shows that blended chocolate did better than pure chocolate. So it seems from the data that more people like chocolate with a mixture of different flavors.
ax = sns.boxplot(data=chocolate_data, x='BlendNotBlend', y='Rating')
ax.set_title('Boxplot, Rating by Blend/Pure')
Boxplot for blend/pure vs rating

2. KDE (kernel density) plot: Let’s try and plot the same thing using a KDE plot.

Blended = chocolate_data.loc[chocolate_data.BlendNotBlend == 1]
NotBlended = chocolate_data.loc[chocolate_data.BlendNotBlend == 0]
ax = sns.kdeplot(Blended.Rating, shade=True, shade_lowest=False, label="Blend")
ax = sns.kdeplot(NotBlended.Rating, shade=True, shade_lowest=False, label="Pure")

Wikipedia definition:

A KDE is a non-parametric method to estimate a probability density function of a variable. A histogram can be thought of as a simplistic non-parametric density estimate. Here, a rectangle is used to represent each observation and it gets bigger the more observations are made.

So in the above plot, the curve rises where the observations are denser and covers more area where there are more data points. The rationale behind this is that each value can be thought of as being representative of a greater number of nearby observations. Summing all of the kernels gives a smoothed distribution.
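To make “summing the kernels” concrete, here’s a minimal hand-rolled Gaussian KDE in NumPy. The observations and bandwidth are toy values; seaborn chooses the bandwidth for you:

```python
import numpy as np

# Place one Gaussian "bump" on each observation and sum them up --
# the smoothed curve a KDE plot draws.
obs = np.array([3.0, 3.25, 3.5, 3.5, 4.0])   # toy ratings
bandwidth = 0.2
grid = np.linspace(2.0, 5.0, 301)

def gaussian(x, mu, h):
    return np.exp(-0.5 * ((x - mu) / h) ** 2) / (h * np.sqrt(2 * np.pi))

# Average of the kernels: a smoothed estimate of the distribution
density = sum(gaussian(grid, mu, bandwidth) for mu in obs) / len(obs)

# A proper density should integrate to ~1 (simple Riemann-sum check)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 2))  # ~1.0
```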

3. Violin Plot- We will now put together the box plot and KDE plot.

ax = sns.violinplot(x="BlendNotBlend", y="Rating", data=chocolate_data, hue="BlendNotBlend")
Violin plot

The violin plot shows a clear, smooth curve: the combination of the box and KDE plots. With the above plot you can easily see that the “Blend” violin covers a larger area of ratings, i.e. it received more reviews, and a wider spread of ratings, than pure bars. The benefit of using this plot is that there’s no need to read a lot of plot points to make sense of the data.

Summary

Throughout this post, we’ve explored how data preprocessing and data visualization can impact the complex machine learning model building phase. We learned about different data pre-processing techniques and tried out a few on the chocolate bar dataset.

With respect to this data, imagine we want to learn more about the distribution of current and future ratings/reviews so that companies can improve their production and bar-making strategy. If we don’t handle missing values or correct incorrect/corrupted data, we’ll end up making inaccurate decisions during the modeling phase.

We also explored a few data visualization tools and discussed how visualization can impact modeling itself. Each visualization tool has its own significance in storytelling, and it’s important to understand which ones can be used with particular types of data.

References

  1. Violin plot
  2. Kaggle Dataset
  3. Motivation — Blazing fast EDA
  4. GitHub repo
  5. SAS Visualization
  6. Data Pre-processing
