NLP Resources Beyond English

Our guide to natural language processing in the world’s languages

At first glance, there is no shortage of resources in natural language processing (NLP). The field boasts an active research community that is constantly building and refining models for use cases such as question answering, named entity recognition, sentiment analysis, and many more. Large datasets — a prerequisite for any Transformer-based language model — exist for plenty of different domains, and new ones are popping up at amazing speed.

But a closer look reveals that when it comes to specific languages, NLP resources are anything but evenly distributed. The majority of models and datasets that we have today are produced for English speakers. At deepset, we are committed to seeing this change. Most people communicate in languages other than English, and NLP resources should reflect the richness and diversity of the world’s linguistic landscape. In this article, we provide an overview of the current NLP landscape for languages other than English. Read on as we show you how to create your own NLP resources and share our experience with dataset and model creation.

Overview of the Multilingual NLP Landscape

We use the term “NLP resources” to describe both datasets and models. Like all models in machine learning (ML), NLP models need data to learn from. A general language model like BERT needs a large unlabeled corpus to learn a representation of a language. Models for specific tasks, like question answering or sentiment analysis, on the other hand, require special, labeled datasets. We’ll talk more about labeling later, but let us first look at the state of NLP resources for the world’s languages.

Datasets

Any collection of data can be a dataset. However, Transformer-based language models require very large amounts of data. For example, to train a language-specific BERT model, the entire Wikipedia in that language is used. Wikipedia is an accessible and relatively balanced general-purpose dataset, which makes it quite suitable for the task. But data scarcity issues make it harder to train language models for underrepresented languages.

When we look at the distribution of Wikipedia articles across languages, we see that many of the top languages are European, followed by Asian languages, while Egyptian Arabic is the only language from the African continent among them. When it comes to labeled datasets, which require much greater resources to create, the situation is similarly unbalanced, with English taking the lead in all categories.

Models

Training a large Transformer model for NLP is costly — not only in terms of data. These models also need heavy computational resources (read: several GPUs) to be able to learn. Thankfully, Transformer-based language models are often open-sourced by their developing organizations and may be reused by anyone — whether directly for inference or as a basis for fine-tuning.

The same trend that we’ve seen with regard to datasets continues when it comes to trained models. BERT models exist for all languages for which there are sufficient resources: English, German, Chinese, Japanese, and many others. There are even BERT models for domain-specific applications, like ClinicalBERT for medical (English) terminology. As yet, there is no Transformer-based language model for an indigenous African language.

So far we’ve talked about monolingual models — that is, models that specialize in one language only. Multilingual models like Google’s M-BERT and Facebook AI’s XLM-RoBERTa, on the other hand, are trained on many languages at once. The idea behind the multilingual setup is that low-resource languages will benefit from higher-resource ones through cross-lingual transfer.

Question Answering Datasets and Models

Question answering (QA) and semantic search are at the center of the Haystack framework for composable NLP. Many QA datasets follow the standard set by the Stanford Question Answering Dataset (SQuAD). While the latest version of the English-language SQuAD consists of 150,000 labeled examples, QA datasets for other languages are typically much smaller.

To create a SQuAD dataset, annotators come up with questions and annotate the respective answer passages in an article from Wikipedia. In SQuAD 2.0, it is also possible to ask a question that is not answerable by the article at hand. This allows the model to learn that some questions cannot be answered by a given text passage. A general-purpose Transformer language model can be fine-tuned on SQuAD, resulting in a question answering model. For example, roberta-base-squad2 is a RoBERTa model fine-tuned on SQuAD 2.0.
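As a quick illustration, here is a minimal sketch of how such a fine-tuned model can be queried with the Hugging Face transformers library. We use the full hub identifier deepset/roberta-base-squad2 for the model mentioned above; swap in a QA model for your language of choice.

```python
# Minimal sketch: extractive QA with a SQuAD-2.0-style model from the Hugging Face hub.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="What can a model learn from unanswerable questions?",
    context=(
        "SQuAD 2.0 includes questions that the given article cannot answer, "
        "so a model fine-tuned on it learns to abstain when no answer is present."
    ),
)
print(result["answer"], result["score"])
```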

SQuAD-like resources have been created for various languages, like Korean and Turkish (about 70,000 and 9,000 labeled examples, respectively). Later in this article, we’ll look at the creation of SQuAD-style datasets for German and French, as well as the models trained on them.

Where to Find NLP Resources in All Languages

When it comes to finding NLP models for the world’s languages, Hugging Face’s model hub is a good place to start. It currently hosts more than 22,000 models that you can filter by language and application. In addition to question answering, HF’s models cover use cases like summarization, text generation, and translation. These models work out of the box: simply plug one into, say, a Haystack pipeline, and it is ready to use.
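To make the “plug it in” part concrete, here is a minimal sketch of loading a hub model into a Haystack Reader and querying it against a document. It assumes the Haystack 1.x API (FARMReader and Document); class names may differ slightly in your version.

```python
# Minimal sketch: plug a Hugging Face QA model into a Haystack Reader (Haystack 1.x API).
from haystack import Document
from haystack.nodes import FARMReader

# Any extractive QA model from the Hugging Face hub can be passed here by name.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

docs = [Document(content="Haystack is a framework for composable NLP built by deepset.")]
prediction = reader.predict(query="Who built Haystack?", documents=docs, top_k=1)
print(prediction["answers"][0].answer)
```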

As an example, let’s look at the question answering category on Hugging Face (HF). It lists 736 models for various languages, among them European minority languages like Catalan, and Asian languages like Bahasa Indonesia, Hindi, and Vietnamese.

Datasets are more scattered across the Internet (as we’ve stated earlier, anything can be a dataset — even, say, the collection of emails on your hard drive). If you’re looking for ready-made datasets, good starting points are platforms like Hugging Face, Kaggle, and Google’s dataset search engine. For a comprehensive list of parallel corpora (used in machine translation), check out the OPUS project. If you’re interested in African languages specifically, have a look at the work of Masakhane, an organization for African NLP.
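Datasets hosted on the Hugging Face hub can usually be pulled with a single call to the datasets library. Here is a minimal sketch; the dataset identifier below is an example we believe to be on the hub, so browse the hub for the exact name you need.

```python
# Minimal sketch: load a QA dataset from the Hugging Face hub with the datasets library.
from datasets import load_dataset

# "deepset/germanquad" is used as an example identifier; check the hub for others.
dataset = load_dataset("deepset/germanquad", split="train")
print(dataset[0]["question"])
```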

Wikipedia and newspaper corpora are great sources of large amounts of monolingual data. The structured Wikidata knowledge base, on the other hand, can be mined for the creation of parallel or named entity corpora. In certain cases, social media platforms can be real gold mines for natural language data. For instance, you could use Yelp reviews to train a sentiment classifier, or collect Tweets via the Twitter API and use them to train your NLP models.

How to Create Your Own NLP Resources

If you cannot find the dataset that’s right for your use case, or find the existing ones lacking, you can create your own annotated resources. That option has some obvious advantages, like full control over data selection and annotation. On the flip side, data annotation can be tedious and costly. In this section, we want to share our own experience with dataset and model creation, and show you how you can use Haystack to easily create your own NLP resources.

Example: German

To advance the state of German-language NLP, models for the German language are required, and those models need German training data. Over the course of four months, we therefore annotated a German question answering dataset of nearly 14,000 labeled examples, which we called GermanQuAD.

One of the perks of creating your own resources is that you make your own rules. While we mostly followed SQuAD annotation conventions, we made slight changes in places where we saw fit. For example, we paid extra attention to creating answer passages of different lengths to make sure the models could learn to flexibly adapt the lengths of their answers to the query. What’s more, we worked with an in-house team of annotators (rather than remote workers from Amazon’s Mechanical Turk platform), which allowed for close collaboration and regular check-ins.

Our annotators’ hard work resulted in two datasets (GermanQuAD for question answering and GermanDPR for document retrieval). We used them to train models for QA, DPR, and document reranking, which we uploaded to the Hugging Face model hub. Our experiments show that a model trained on the monolingual GermanQuAD dataset performs better than a comparable multilingual model. We see this as a strong argument for organizations to prioritize the creation of monolingual NLP resources. To learn more about the annotation process and model performance, see our blog post on enabling German neural search and QA.

Example: French

We’ve already gone over two methods for annotating large datasets: outsourcing to paid crowd-workers (SQuAD) and training an in-house team (GermanQuAD). Our friends at Etalab took yet another road. For their PIAF Q&A dataset for French (close to 4,000 annotated examples), they built an annotation platform to facilitate the work of crowd-sourced volunteers. Anyone who knows French and wants to donate time to create NLP resources for the language is welcome to join the platform.

Similarly to our own results for German, Etalab’s experiments have shown that monolingual French NLP models (like CamemBERT or FlauBERT) fine-tuned on the PIAF dataset perform better than a multilingual model that has been trained on multiple languages at once. You may also want to check out our article on Etalab’s approach to semantic search.

Haystack Tools to Create NLP Resources

To assist you in creating your own datasets, Haystack offers a range of preprocessing tools. Use the PDFToTextConverter, the DocxToTextConverter, or even the ImageToTextConverter to extract text from the respective file formats. You can also use the newly implemented AzureConverter to extract both text and tables.
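As a minimal sketch, converting a PDF and a Word file could look like this. It assumes the Haystack 1.x API, and the file paths are placeholders; exact signatures and return types may vary between versions.

```python
# Minimal sketch: extract text from files with Haystack's converters (Haystack 1.x API).
from haystack.nodes import DocxToTextConverter, PDFToTextConverter

pdf_converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de"])
docx_converter = DocxToTextConverter()

# In recent 1.x versions, convert() returns a list of Haystack Document objects.
pdf_docs = pdf_converter.convert(file_path="data/report.pdf", meta={"source": "report.pdf"})
docx_docs = docx_converter.convert(file_path="data/notes.docx", meta={"source": "notes.docx"})
```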

If, on the other hand, you still have to collect texts for your corpus, you can use the Crawler. Simply pass it a list of URLs and the desired crawler depth, and it will return documents that are ready for further processing. To further clean your documents, use the Preprocessor. It gets rid of unwanted boilerplate material and splits your texts into uniformly sized passages. Finally, we have a special clean_wiki_text function for cleaning text from Wikipedia.
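Below is a minimal sketch of this crawling and cleaning flow, again assuming the Haystack 1.x API. The URL and splitting parameters are placeholders, and the return type of crawl() varies between versions.

```python
# Minimal sketch: crawl, clean, and split texts with Haystack (Haystack 1.x API).
from haystack import Document
from haystack.nodes import Crawler, PreProcessor
from haystack.utils import clean_wiki_text

# Crawl a list of URLs; depending on the version, crawl() returns Documents
# or paths to the files it writes into output_dir.
crawler = Crawler(output_dir="crawled_files", crawler_depth=1)
crawled = crawler.crawl(urls=["https://haystack.deepset.ai/overview/intro"])

# Clean boilerplate and split texts into uniformly sized passages.
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
)

# clean_wiki_text strips Wikipedia-specific markup from a raw text string.
wiki_doc = Document(content=clean_wiki_text("== Heading ==\nSome Wikipedia article text..."))
passages = preprocessor.process([wiki_doc])
```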

Haystack Annotation Tool

Note: The annotation tool has been deprecated. You have the option of self-hosting it instead.

Annotation can be hard work, so we’ve developed the Annotation Tool to make it go as smoothly as possible. The tool, which can be run in the browser or locally, allows you to organize and distribute your annotation tasks. Especially when a group of annotators is dealing with many data points, it can be hard to keep track of everyone’s work. In our own experience, and based on feedback from our users, the Annotation Tool helps you keep an overview of your team’s annotation process and speed up the work considerably. For more information, check out our article on labeling data with the Annotation Tool.

Finally, to further speed up the process of question annotation, we’ve implemented a node for question generation. The QuestionGenerator reads your texts and outputs possible questions. You can even combine QuestionGenerator and Reader in a pipeline that creates not only questions, but also answers. For more detail, read our article on best practices for automating your annotation workflow.
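Here is a minimal sketch of such a pipeline, assuming the Haystack 1.x API (QuestionAnswerGenerationPipeline and FARMReader); the reader model and the example document are placeholders.

```python
# Minimal sketch: generate question-answer pairs from your own texts (Haystack 1.x API).
from haystack import Document
from haystack.nodes import FARMReader, QuestionGenerator
from haystack.pipelines import QuestionAnswerGenerationPipeline

question_generator = QuestionGenerator()
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# The pipeline generates questions for each document and lets the reader answer them,
# giving you draft question-answer pairs to review instead of annotating from scratch.
pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
results = pipeline.run(
    documents=[Document(content="GermanQuAD is a German question answering dataset created by deepset.")]
)
```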

Tricks for NLP Dataset Creation in Low-Resource Settings

What if you simply don’t have the time or resources to create your own datasets? Don’t worry — other smart people have faced the same problem, and have devised crafty solutions. A simple method for dataset creation is to translate existing datasets. In fact, the first question answering model for French (before PIAF and FQuAD) was trained on a machine-translated version of the original SQuAD. Of course, this technique only works for languages with high-quality machine translation models and therefore excludes true low-resource languages.
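As a sketch of the idea, a public translation model can be used to translate the questions and contexts of an existing dataset. We use one of the Helsinki-NLP OPUS-MT checkpoints as an example; note that re-aligning the answer spans in the translated context is the tricky part and usually requires extra heuristics or manual checks.

```python
# Minimal sketch: translate SQuAD-style examples with an OPUS-MT model via transformers.
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

context = "The Normans were the people who in the 10th and 11th centuries gave their name to Normandy."
question = "Who gave their name to Normandy?"

translated_context = translate(context)[0]["translation_text"]
translated_question = translate(question)[0]["translation_text"]
# Answer spans still have to be located again in the translated context.
```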

If you have a dataset that is too small for training a model, you can use data augmentation techniques to create new datapoints. Have a look at Amit Chaudhary’s excellent visual blog post on data augmentation for NLP. You will be surprised by how creative some of the listed methods are!
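To give a flavor of how simple some of these techniques are, here is a minimal sketch of two of them, random swap and random deletion, in the spirit of the well-known “Easy Data Augmentation” recipes.

```python
# Minimal sketch: two simple text augmentations, random swap and random deletion.
import random

def random_swap(tokens, n_swaps=1):
    # Swap the positions of two randomly chosen tokens, n_swaps times.
    tokens = tokens.copy()
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    # Drop each token with probability p, but never return an empty example.
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

sentence = "this movie was surprisingly good".split()
augmented = [" ".join(random_swap(sentence)), " ".join(random_deletion(sentence))]
print(augmented)
```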

Join the Haystack NLP Community

In this blog post, we covered the need for NLP resources that go beyond the English language. Join our community and help us collect more resources for more languages! We would be happy to see you in our Discord channel, where you can connect with other Haystack users as well as the deepset engineering team.

Last but not least, head over to our GitHub repository to check out the Haystack framework for composable NLP — and while you’re there, make sure to give us a star :)