Data-centric AI: Building Stronger Models through Better Data

The data-centric approach to AI leverages your domain expertise to create better machine learning models


Data plays a pivotal role in machine learning (ML), AI’s hottest sub-discipline. After all, the term machine learning describes a group of algorithms that build models on the basis of sets of labeled or unlabeled data.

Despite the importance of data to ML, AI experts have focused primarily on improving models through innovative architectures. In language modeling, for example, we have seen the emergence of the Transformer, whose abilities for natural language processing rival the language skills of humans. Likewise, when pushing for more powerful and efficient models, researchers have directed most of their attention toward devising techniques that make model architectures smaller and toward finding the ideal hyperparameter settings.

That model-centric approach is increasingly being complemented by a focus on the quality and veracity of the data itself, an approach that has come to be called “data-centric AI.”

What Is Data-centric AI?

Anyone who’s ever deployed a machine learning model knows that it requires much more than code. In particular, curated, often labeled data is required at every step of a successful ML project: for training the model, testing and evaluating it, and monitoring it even after deployment.

Training an AI model for your use case usually amounts to tweaking a model’s hyperparameters and fine-tuning an existing pre-trained model on your data. While adjusting a model’s hyperparameters can give you an edge over the original model, what usually influences its quality the most is the data used to fine-tune it. The field of data-centric AI therefore focuses on techniques for creating, maintaining and continuously monitoring new and existing data.

At deepset, we often see the “model-centric bias” in action. Many teams spend hours trying to improve their model’s quality by subjecting it to longer training runs and experimenting with different hyperparameter settings, while paying little attention to the quality of their data. When those teams come to us, we always advise them to invest in improving their data rather than their model — we’ve yet to see a case where that tip has not increased the quality of a system many times over.

Data-centric AI formalizes that approach by pushing the need for systematically engineered data to the forefront of AI projects. It emphasizes the importance of treating your data with as much care as you would your code, and proposes techniques and frameworks to ensure that teams are working with high-quality data that takes their AI project to the next level.

What being data-centric adds to your ML project

An attractive and democratizing aspect of data-centric AI is that it places the control over the quality of your model firmly in your own hands. As the domain expert of your own data, you have the best insights into what makes high-quality data for your use case. In prioritizing the creation, curation and monitoring of data, data-centric AI leverages your own expertise for the creation of better models.

Data-centric AI isn’t only promising in terms of results, but also more sustainable than many model-centric practices. By investing in the improvement of your data, you can reduce the number of training runs your model needs, and therefore the budget and energy spent on computation. When you fine-tune on high-quality, carefully curated datasets, your model’s hyperparameters will require little to no adjustment.

Being Data-centric Addresses AI’s Most Pressing Problems

Transformers, model sharing, knowledge distillation — as far as models are concerned, the field of AI is in great shape. Most problems in AI today revolve around data:

  • In deep learning, the branch of machine learning with the biggest impact right now, more is more in terms of data. These models have large architectures that often need millions of examples to learn from. Traditional datasets might not contain enough data points for training those models.
  • With large datasets, the very people handling them often don’t know exactly what they contain. For many teams, their datasets are therefore something of a black box.
  • Data can go stale. Ideally, models are trained on data that reflects their real-world use case most accurately. However, it’s very common for data “in the wild” to actually change its distribution over time. The data that the model was trained on is then no longer up to date, and the model will decay in quality.
  • Creating new datasets is expensive and requires training people. Teams often find it hard to set up an efficient workflow around the annotation of data.

The data-centric approach offers remedies for each of the above issues. In the next section, we’ll highlight our favorite tips from the data-centric AI resource hub.

Make Data-centric AI Work For You

On the resource hub, you’ll find detailed articles and research papers on data labeling, data in deployment, and other topics. What they all have in common is their practical approach that describes various methods for improving the state of your data. With these tips, you can leverage your domain expertise to build better datasets — and, in turn, train more accurate models.

Document your datasets

Code that is not properly documented is much harder to use, and the same is true for datasets. How was your dataset created? Who were the annotators? What were the annotation guidelines and conditions? The answers to these questions are impossible to reconstruct later from the dataset alone, which makes it all the more important to document them in an accompanying datasheet.
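
A datasheet does not need special tooling. As a minimal sketch, the questions above could be captured in a small structured record and saved as JSON next to the data. The class name, field names and example values below are hypothetical and only illustrate the idea; they are not a standard datasheet schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Datasheet:
    """Hypothetical record answering the documentation questions above."""
    dataset_name: str
    creation_process: str        # How was the dataset created?
    annotators: str              # Who were the annotators?
    annotation_guidelines: str   # What guidelines did they follow?
    annotation_conditions: str   # Under which conditions did they work?


def datasheet_to_json(sheet: Datasheet) -> str:
    """Serialize the datasheet so it can be stored alongside the dataset."""
    return json.dumps(asdict(sheet), indent=2, ensure_ascii=False)


# Illustrative values only; replace them with your own project's details.
example_sheet = Datasheet(
    dataset_name="support-tickets-v1",
    creation_process="Sampled from anonymized support tickets",
    annotators="Three members of the support team",
    annotation_guidelines="See guidelines.md in the same folder",
    annotation_conditions="Annotated in-house during working hours",
)
example_json = datasheet_to_json(example_sheet)
```

Keeping the datasheet in a machine-readable format means it can be versioned together with the dataset and checked for missing fields automatically.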

Explore your datasets before use

Data science is the science (or art) of making sense of big datasets. It’s not feasible to read through an Excel sheet with thousands or millions of rows, but using simple statistical techniques, you can generate an overview that lets you get a feel for your dataset — and combat the impression that you’re working with a “black box” of data.

How many empty rows are there? How many duplicates? Is there any recurring “boilerplate” text that could skew your data? These and many other questions can be answered by subjecting your dataset to a data analysis.
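
The questions above can be answered with very little code. The following sketch uses only the Python standard library and assumes that your dataset is a list of text rows. The function name, the naive sentence splitting on periods, and the `boilerplate_min_rows` parameter are illustrative assumptions rather than a recommended method.

```python
from collections import Counter


def profile_texts(rows, boilerplate_min_rows=2):
    """Count empty rows, duplicate rows and recurring sentences."""
    # Treat None and whitespace-only rows as empty.
    normalized = [(row or "").strip() for row in rows]
    non_empty = [text for text in normalized if text]

    # A row that appears n times contributes n - 1 duplicates.
    row_counts = Counter(non_empty)
    duplicates = sum(count - 1 for count in row_counts.values())

    # Count in how many distinct rows each sentence occurs. Splitting on
    # "." is a deliberately naive stand-in for a real sentence splitter.
    sentence_counts = Counter()
    for text in row_counts:
        sentences = {part.strip() for part in text.split(".") if part.strip()}
        sentence_counts.update(sentences)

    boilerplate = sorted(
        sentence
        for sentence, count in sentence_counts.items()
        if count >= boilerplate_min_rows
    )
    return {
        "total_rows": len(normalized),
        "empty_rows": len(normalized) - len(non_empty),
        "duplicate_rows": duplicates,
        "boilerplate_candidates": boilerplate,
    }
```

For real datasets you would likely load the rows from a file and use a proper sentence splitter, but even a rough report like this one shows where cleaning is needed.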

Integrate checks for the timeliness and quality of your data into your system

Just because your training data is up to date today doesn’t mean that the same will be true tomorrow. Most real-world use cases evolve with time — for instance, the data that your users send to your model could change its underlying distribution, or your labels might, over time, not suffice to accurately cover all your use cases.

When the difference between your training data and the real data that your deployed model encounters during its application becomes too large, model decay ensues. To combat model decay and data going stale, you should check the quality of your data periodically and systematically. You could, for instance, devise stress tests using particularly difficult data points, such as edge cases, and see how well your system can handle them.
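
One simple way to make such a periodic check concrete is to compare how often each category (for example, a predicted label or a query topic) occurs in a reference sample from training time and in a recent sample from production. The sketch below uses total variation distance for that comparison. The function names and the 0.2 threshold are assumptions chosen for illustration; a suitable threshold depends on your use case.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each category in `values`."""
    if not values:
        raise ValueError("Cannot compute shares for an empty sample")
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(reference, current):
    """Half the summed absolute difference between two distributions.

    The result is 0.0 for identical distributions and 1.0 when the two
    samples share no category at all.
    """
    ref_shares = category_shares(reference)
    cur_shares = category_shares(current)
    categories = set(ref_shares) | set(cur_shares)
    return 0.5 * sum(
        abs(ref_shares.get(c, 0.0) - cur_shares.get(c, 0.0))
        for c in categories
    )


def has_drifted(reference, current, threshold=0.2):
    """Flag drift when the distance exceeds an (assumed) threshold."""
    return total_variation_distance(reference, current) > threshold
```

A check like this could run on a schedule and alert the team when it fires, at which point fresh data can be inspected and, if necessary, annotated.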

Invest in the annotation process and a streamlined workflow

Few people enjoy annotating data — until they see the impact that high-quality annotated data can have on their model’s quality. As this article on labeling and crowdsourcing describes, the annotation task is surprisingly complex, making it all the more important to plan your annotation workflow carefully. Here are some tips for creating top-notch data annotations that will help propel your project forward:

  • Label some data points yourself before asking anyone else to do it. You may find that your data looks very different from what you expected!
  • Write clear guidelines. If people don’t understand them, they probably need to be rewritten. Plus, annotation guidelines can become part of a dataset’s documentation later on.
  • Verify the quality of your annotations by checking the agreement between different annotators, or through automated plausibility checks.
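
Agreement between two annotators is often quantified with Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The following is a minimal standard-library sketch, assuming that both annotators labeled the same items in the same order; the function name is our own.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each annotator labels according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 means perfect agreement, while values near 0 mean the annotators agree no more often than chance would predict, which is usually a sign that the guidelines need another look.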

A platform like our Annotation Tool helps teams to get into a smoother workflow. When the annotation process is easy, accessible, and manageable, your team will have more fun doing it, resulting in higher-quality datasets.

Look into data augmentation methods

Even with a tool like deepset's annotation platform, the creation of new data can be prohibitively expensive for teams with small budgets. Data augmentation describes a set of techniques that create new data points by applying some creative methods to existing data. For example, data augmentation for textual data could consist of replacing words with synonyms, or using machine translation. (To get an idea of how automatically generated labels can help efficiently train a model, have a look at our research paper on pseudo-labels.)
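
To illustrate the synonym-replacement idea, here is a minimal sketch that creates one new sentence for every available synonym of every word. The synonym dictionary is a hand-written, hypothetical example, and the whitespace tokenization ignores punctuation; in practice you would draw synonyms from a thesaurus or lexical resource suited to your domain.

```python
def synonym_variants(sentence, synonyms):
    """Return copies of `sentence` with exactly one word swapped for a synonym."""
    tokens = sentence.split()
    variants = []
    for index, token in enumerate(tokens):
        # Look up synonyms case-insensitively; unknown words yield none.
        for alternative in synonyms.get(token.lower(), []):
            swapped = tokens[:index] + [alternative] + tokens[index + 1:]
            variants.append(" ".join(swapped))
    return variants


# Hypothetical, hand-written synonym dictionary for illustration only.
EXAMPLE_SYNONYMS = {
    "quick": ["fast", "rapid"],
    "help": ["assist"],
}
```

Each variant inherits the label of the original sentence, so a single annotated example can yield several training examples. It is worth reading a sample of the generated sentences, since a synonym that fits one context can change the meaning in another.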

Make the Move to Data-centric AI with Haystack

With Haystack, we’ve been supporting data-centric approaches to AI since before the term even existed! Our framework for applied NLP lets you build pipelines with modules for document search, question answering, summarization, translation, and many others, all using publicly available pre-trained language models.

In Haystack, fine-tuning models on your own data is as simple as possible, so that you can use your own data expertise to the greatest advantage!

To get started with Haystack, visit our GitHub repository. If you’re looking for a managed solution, join us in deepset Cloud. And if you’d like to connect with like-minded people and get advice from our team on data-centric AI, join our Discord community.