Towards Data-Centric AI

“There are things known and there are things unknown, and in between are the doors of perception.” — Aldous Huxley

Ayyüce Kızrak, Ph.D.
Heartbeat

--

When we want to solve a problem using machine learning or deep learning methods, we cannot specify the solution “directly” (as we would in traditional software engineering). Instead, we proceed step by step, iteratively converging on a good solution through experimentation.

Photo by Hunter Harritt on Unsplash

It would not be wrong to say that the prevailing trend among ML engineers is a model-centric approach to ML development. This approach has given us incredible progress in mathematical models and optimization methods, among other advancements.

And the widespread use of open-source tools and platforms such as GitHub and Kaggle in almost every sector helps accelerate these developments further.

Government incentives, private financial investment, and open-source contributions from major tech firms also bolster efforts to solve real-life problems with ML. In these ways, experts are motivated to put effort into developing state-of-the-art (SOTA) AI models. This win-win dynamic further reinforces model-centric AI.

However, the competitive environment created both in academia and in the private sector highlights the need for something else in addition to modeling work: large, robust datasets.

In short: Everyone wants to do the modeling work, not the data work!

Data cascades in high-stakes AI. Cascades are opaque and protracted, with multiplied, negative impacts. Cascades are triggered in the upstream (e.g., data collection) and have impacts on the downstream (e.g., model deployment). Thick red arrows represent the compounding effects after data cascades start to become visible; dotted red arrows represent abandoning or restarting of the ML data process. Indicators are mostly visible in model evaluation, as system metrics, and as malfunctioning or user feedback.

“The model and the code for many applications are basically a solved problem, now that the models have advanced to a certain point, we got to make the data work as well.” — Andrew Ng

Although AI models are getting better day by day, they still do not generalize well enough, so working with data is a must!

What we can do with low-quality data will always be limited and sometimes misleading. Contrary to the general trend, in this article I will try to show that a data-centric approach (rather than a model-centric one) is quite effective at improving model performance.

AI System = Model/Algorithm + Data

First of all, it’s important to keep in mind that an AI system consists of (at a bare minimum) two components: the model/algorithm, and data.

However, an AI system does not consist of only these two components. It also involves verification and validation across the entire development lifecycle, covering productization, operation and maintenance, and information sharing between stakeholders. We call this the AI system lifecycle.

Examining a sample of recent publications revealed that 99% of the papers were model-centric with only 1% being data-centric. — Andrew Ng

What is a Model-Centric Approach?

Updates are made to the model itself to improve performance. The focus is on finding the most suitable configuration by improving the model architecture and training process: tuning hyperparameters, adjusting model weights, applying compression and optimization, and so on.
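
As a minimal sketch of what one model-centric iteration can look like (the dataset, model, and parameter grid below are purely illustrative), we hold the data fixed and search over model configurations:

```python
# Model-centric iteration: the dataset stays fixed; we search over
# model configurations. All names and values here are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 30],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```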

What is a Data-Centric Approach?

Instead of focusing on the model itself, the dataset is systematically improved to increase accuracy and other target metrics. This philosophical approach focuses on the factors that affect label accuracy, precision, and quality in the dataset.

With Data-Centric Artificial Intelligence (DCAI), we can make our AI systems more efficient and sustainable. In fact, as we said before, higher-quality data practices can provide the generalization that a model-centric approach could not.

The key challenge here is to democratize data engineering, increasing reusability while accelerating the creation of sustainable and consistent datasets.


The Importance of Data and a Systematic Approach to DCAI

If we are going to use AI to solve a problem, an approach like the one below is the most general and widely accepted. Within the scope of the project definition, data is collected, the model is trained, and updates are made on both the model and the data side, using the feedback received during productization.

Human participation in the productization process is just as important. This is commonly called human-in-the-loop AI.

MLOps: Ensuring consistently high-quality data

Making improvements based on model-centric error analysis does not prevent all possible problems. Spurious correlations in your training set may also appear in the test set, inflating your results. To avoid this, we need a more proactive, data-first approach to model robustness.
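
As a minimal sketch of this idea (the feature, label, and group arrays below are synthetic stand-ins), we can make the evaluation split respect the structure of the data, so that samples from the same source never leak into both train and test:

```python
# Leakage-aware evaluation: samples sharing a group (e.g., the same machine)
# never appear in both the training and test folds. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))               # e.g., sensor-derived features
y = rng.integers(0, 2, size=600)             # e.g., fault / no fault
machine_id = rng.integers(0, 30, size=600)   # which machine each row came from

cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(), X, y,
                         groups=machine_id, cv=cv)
print("Mean grouped CV accuracy:", scores.mean())
```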

For this, we need to examine the data we have in more detail!
As an example, let’s highlight the use case of a base AI model for predictive maintenance or defect detection, one of the important problems faced in the energy, textile, and automotive industries.

It could be a computer vision problem, or it could be a situation where we process arrays of data from other sensors, such as vibration, sound, or pressure readings.
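
For the sensor case, a minimal sketch (with an entirely synthetic signal and an illustrative window size) of turning a raw 1-D vibration stream into per-window features might look like this:

```python
# Turn a raw 1-D sensor signal into simple per-window features
# (RMS, mean, standard deviation). Signal and window size are illustrative.
import numpy as np

def window_features(signal: np.ndarray, window: int = 256) -> np.ndarray:
    """Split the signal into non-overlapping windows and compute
    simple statistics for each window."""
    n = len(signal) // window
    chunks = signal[: n * window].reshape(n, window)
    rms = np.sqrt((chunks ** 2).mean(axis=1))
    return np.column_stack([rms, chunks.mean(axis=1), chunks.std(axis=1)])

# A synthetic "vibration" signal: a sine wave plus noise.
signal = np.sin(np.linspace(0, 100, 10_000)) + np.random.normal(0, 0.1, 10_000)
X = window_features(signal)
print(X.shape)  # (39, 3): one feature row per window
```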

When the model-centric and data-centric approaches are implemented separately and compared against the baseline, we see that model improvements contribute less to accuracy than data improvements.

The accuracy results obtained across different tasks are compared in the table below. There, we can see concretely the importance of using higher-quality data.

Improving the code vs. the data

Data cleaning and label/annotation optimization is a straightforward process, and the performance gain is very clear. In the model-centric approach, by contrast, operations such as hyperparameter optimization, activation function selection, and increasing the number of model layers were applied to improve learning. When the contribution of these actions is zero or close to zero, it likely takes AI experts a long time to find further ways to improve model performance.

To summarize: trust that the quality of your data will empower your solution, rather than focusing simply on amassing an abundance of data.

So, in a data-centric approach, what are the key elements that we should pay attention to?

1. Volume of Data

How is it that humans can learn to drive a car in about 20 hours of practice with very little supervision, while fully autonomous driving still eludes our best AI systems trained with thousands of hours of data from human drivers? — Yann LeCun

As we all know, the amount of data is very important for AI systems; data is essentially considered the fuel of these systems.

AI models are mathematical machines with low bias and high variance. To avoid the variance problem, we train our models with more data, taking diversity into account.

But beware: blindly collecting more data will not get you anywhere. Data collection is often the most costly and time-intensive part of the ML lifecycle, so when collecting data, you also need to correctly determine what kind of data and labels are needed.

2. Consistency of Data

Consistency in data labels is essential. Inconsistencies mean you are training your model to no avail. This can happen not only in a dataset you collect yourself, but even in widely used datasets with hundreds of benchmark results.

“For example, approximately 6% of the ImageNet validation set (2,916 labels) was found to be erroneous. Likewise, approximately 4% of the Amazon Reviews dataset (roughly 390,000 examples) was determined to be incorrectly labeled, and the average error rate across 10 different datasets, including MNIST, was revealed to be 3.4%.” I recommend reading the details in Başak Buluz Kömeçoğlu’s article (in Turkish). (Ref. paper: Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, in English.)

We are faced with the fact that all studies built on these datasets may need to be reviewed. This is why we need consistently labeled datasets for better training and reliable evaluation!
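
As a minimal sketch of one common way to surface suspect labels (the heuristic behind confident-learning tools such as cleanlab; the model, dataset, and number of flagged samples below are illustrative), we can look at out-of-sample predicted probabilities and review the samples whose given label the model finds least plausible:

```python
# Flag suspicious labels: get out-of-sample predicted probabilities via
# cross-validation, then review samples whose given label gets low probability.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=2000), X, y, cv=5, method="predict_proba"
)

# Probability the model assigns to each sample's *given* label.
given_label_prob = pred_probs[np.arange(len(y)), y]
suspects = np.argsort(given_label_prob)[:20]  # the 20 least plausible labels
print("Indices to review by hand:", suspects)
```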

In the image below, you can see how easily inconsistency in human labeling can find its way into your dataset.

The “human” labels are all technically correct (each box does contain humans), but they are inconsistent with each other. For instance, do the people in the image need to be detected individually or as a group? In other words, arranging labels consistently according to our problem definition is essential for realistic performance.

To avoid these problems, cross-labeling and cross-checks should be performed, and the consistency of the labels should be ensured. In addition, collecting multiple labels per sample and choosing the final label by voting is a commonly used way to ensure consistency, as in the sketch below.
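
A minimal sketch of the voting step (the annotation matrix here is made up: rows are samples, columns are annotators):

```python
# Resolve multi-annotator labels by majority vote and flag low-agreement rows.
import numpy as np

annotations = np.array([
    [0, 0, 1],   # two annotators say class 0, one says class 1
    [1, 1, 1],   # unanimous
    [2, 1, 2],
])

consensus = np.array([np.bincount(row).argmax() for row in annotations])
agreement = np.array([(row == label).mean()
                      for row, label in zip(annotations, consensus)])

print(consensus)   # [0 1 2] -> final labels
print(agreement)   # low agreement marks samples worth re-reviewing
```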

3. Quality of Data

We want our data to be diverse and randomly sampled, covering the variations of the problem we’re trying to solve. Several factors need to be considered before we can say that data is of good quality:

Image Source

Spurious correlations: Let’s take a look at the cow images on the left.

When working on an image classification problem, the association of a non-causal attribute with the label (here, the background changes while the cow stays the same in all three images) adversely affects the classification result, even though the object is identical. The model tends to call the cow in the desert a camel, and the cow in the snow a polar bear. These spurious correlations are an undesirable situation.

Image Source

Lack of variation: When a non-causal attribute such as image brightness does not vary sufficiently in the dataset, the model may fit too closely to the distribution of that attribute and fail to generalize. This is what we call overfitting.

For example, models trained on daytime data do not perform well in the dark, and vice versa. In addition to collecting new data with more variation, data augmentation is a good strategy for mitigating both spurious correlations and lack of variation.
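
As a minimal sketch using torchvision (the transform values here are illustrative, not tuned), brightness jitter and other augmentations can inject the variation the raw dataset lacks:

```python
# Augment for lighting variation: each epoch sees randomly brightened,
# flipped, and cropped versions of the same images. Values are illustrative.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.ColorJitter(brightness=0.5),          # simulate lighting changes
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
# Pass `train_transforms` as the `transform` argument of an image Dataset
# (e.g., torchvision.datasets.ImageFolder) so augmentation happens on the fly.
```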

BONUS: Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency

Abstract: Training datasets fundamentally affect the performance of AI systems. Any implicit or explicit bias that arises during training is often reflected in the behavior of the system, which can lead to questions about its accuracy and to a loss of trust.

However, information about training data is rarely communicated to stakeholders. The study referenced here investigates the concept of Data-Centric Explanations for AI systems, which explain the training data to end users.

Through a formative study, the potential benefits of such an approach are identified, including the kinds of information about the training data that participants found most compelling. The results show that data-centric explanations have the potential to influence how users evaluate the reliability of a system and to assist users in assessing its fairness. The findings inform how to design explanations that support users’ perceptions of AI systems.

BONUS: Data-Centric Sessions at Leading Conferences/Workshops

The main goal of this workshop is to transform the data-centric AI community into a vibrant, interdisciplinary space that addresses practical data issues. Some of these problems are:

  • Data collection
  • Data labeling
  • Data preprocessing/augmentation
  • Data quality assessment
  • Data cost
  • Data governance

Many of these areas are newly developing, and this workshop, held in December 2021 (the NeurIPS DCAI Workshop), aims to further their development by bringing them together into a coherent whole.

We need a framework for excellence in data engineering, and it does not yet exist. In the first-to-market rush of data-focused projects, the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. By working to reverse this line of thinking and to highlight examples, case studies, and methodologies for excellence in data collection, such workshops will receive more attention in the near future.

Building an active research community oriented towards data-centric AI is an important part of identifying key issues and creating ways to measure AI progress through data quality tasks and workflows.

I would like to thank Başak Buluz Kömeçoğlu for her feedback on this blog post.

Feel free to follow me on GitHub and Twitter for more content!

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.

--


AI Specialist @Digital Transformation Office, Presidency of the Republic of Türkiye | Academics @Bahçeşehir University | http://www.ayyucekizrak.com/