Best Practices in Deploying NLP Models
Read about best practices of deploying NLP models and establishing an effective, iterative style of NLP development.
In the world of modern software design, there are whole careers dedicated to optimizing development processes. Why, then, do our natural language processing projects seem to so often go over budget, end up stuck in permanent development cycles, or result in unusable end products?
This drag on NLP development seems to occur among even the most professional teams. In the course of advising on dozens of high-priority NLP projects, we at deepset have found that most such projects suffer not from a lack of motivated, competent engineers, but rather an absence of mature development processes.
Although many fields of software have refined their production strategies over decades, applied NLP is still a relatively young field, and models are often developed by scientists whose process is more experimental than optimized. We often see clients using long, linear development processes that take months to complete and then need to be repeated to refine the product. By introducing the small, constant modifications, frequent feedback, and rapid prototyping detailed in this article, you can give your NLP project the best chance of success.
The following best practices for deploying NLP models have been adapted from a talk that is also freely available on our website. First, we will look at some common mistakes in the development of NLP systems for a production setting. We will then dive into our most recommended practices for NLP development.
Where Teams Often Go Wrong
Most problems in NLP projects boil down to over-engineered solutions and insufficient testing. This can occur in a number of ways: flooding your model with inadequately annotated data, gathering too little end-user feedback too late in the development cycle, or building advanced deployment tooling before the prototype is even useful.
Let’s look at a few different categories where problems can crop up.
Misunderstanding user requirements
For many ML engineers, getting a fresh dataset is like opening presents on your birthday. It can be quite exciting to dive into an unfamiliar archive and imagine all the clever cleanup scripts and synthetic metrics that you might use to create an impressive AI tool.
However, if your team dives into working with the data before thoroughly understanding the use case, the whole project can begin on the wrong foot. The first step, before any data work begins, should be to talk to end users. This not only gives your team a clear view of the users’ goals, but also lets you verify that the available data is sufficient to solve the problem.
One-shot development cycles
The “one-shot” development cycle refers to a system where prototypes are completely developed before intensive testing. This is a brittle development strategy, where any single wrong choice can ruin months of work.
No matter how diligent and well tested your model training is, if the model doesn’t cover a new use case that has emerged since you began development, the whole training process will be wasted. Similarly, complex deployments like incremental learning work only when the initial prototype is functional enough for test users to generate new data by relying on the prototype.
If your development consists of long stretches without sufficient, realistic testing, it becomes expensive to rework individual parts, especially when there are downstream dependencies. NLP projects tend to require a lot of resources: budget, labor, and time. It is therefore vital to ensure that the prototype and the course of development can be corrected without redoing too much work.
Insufficient feedback, testing, and evaluation
It is difficult to get reliable user feedback. That doesn’t mean it’s not important! Just as you might arduously track your code development with unit testing and code review, it is vital that your model is constantly subjected to reality checks from end users, and from a high-quality evaluation set. Just as with the bugs in your code, your model is certain to fail in a variety of unforeseen ways — but the better your ability to track and respond to problems, the faster and more successful your development cycle will be.
Measure Twice, Also Cut Twice
We have now seen how the high cost of an NLP project clashes with inadequately tested prototypes, or prototypes that don’t adapt to those tests. But how can you proactively design your development process to avoid these pitfalls? By implementing efficient cycles of feedback and response, of course!
First, you need to ensure you have a relevant, high-quality, and frequently updated view of your prototype. Although each stage of development might require different types of feedback, you should ensure that you have the best available view of your model’s direction (while planning) and its quality (as the prototypes are rolled out).
Getting enough feedback
Having established the importance of feedback, let’s look at some areas where we can employ it for maximum development quality. In order to keep your development on track and responsive to feedback, you should have visibility into your project throughout development:
- Initial planning should include a complete vision from the business team, UX researchers, and clients. Most importantly, it should involve representatives from the intended group of end users.
- As prototypes are generated, you should have test users from whom you can gather an organic view of the model.
- The better the quality of your evaluation set, the more confident you can be in the scores your model gets during training.
- Pipelines should be modular and accessible, so that different models, parameters, and datasets are easy to compare.
- Final deployment should include some ongoing monitoring to ensure the system performs as designed.
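As a concrete illustration of the evaluation-set point above, here is a minimal sketch of scoring predictions against labeled examples. The field names (`query`, `expected`) and the exact-match metric are assumptions for illustration, not the API of any particular framework.

```python
# Illustrative sketch: exact-match accuracy against a labeled evaluation set.
# Field names ("query", "expected") are hypothetical.

def exact_match_accuracy(predictions, eval_set):
    """Fraction of eval examples whose prediction matches the gold label."""
    correct = sum(
        1 for pred, example in zip(predictions, eval_set)
        if pred.strip().lower() == example["expected"].strip().lower()
    )
    return correct / len(eval_set)

eval_set = [
    {"query": "Who wrote Faust?", "expected": "Goethe"},
    {"query": "Capital of France?", "expected": "Paris"},
]
predictions = ["Goethe", "Lyon"]

print(exact_match_accuracy(predictions, eval_set))  # 0.5
```

The higher the quality of the labels in `eval_set`, the more this single number can be trusted during training runs.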
When beginning to develop an NLP model, it is important that your project team has a full awareness of the intended use of the model, the available dataset, and the expectations of the end users. Having all these in view helps organize the strategic direction of the development, and improves your team’s insight into the feedback that you will receive later.
Once the vision for the project is established, it is important to rapidly develop a testable prototype. This doesn’t have to be anything fancy. Nowadays, most models can perform well enough to demonstrate a prototype out of the box. And because you aren’t trying to solve the whole problem in one go, you don’t need to invest too much effort in the initial model.
It’s more important to design a prototype platform that lets you gather maximum usage data. This data doesn’t have to be explicit, like ratings or direct feedback: unobtrusive logs of user experience, tracking pain points and edge cases, are just as valuable. Even the queries themselves can be turned into useful reflection on the model, and further insight into its intended use. Where convenient, you can also ask users to review the quality of the model’s output, or to give more explicit feedback.
Rather than serving as further training data, the feedback from your users will tell you how well your model is doing. It serves to inform you qualitatively about the performance of the model, so that your team can best reflect on new directions.
Remember that your prototypes don’t need to be perfect, but they should be accessible enough to incentivize test users to participate, generating valuable feedback in the development cycle.
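The kind of lightweight interaction logging described above can be sketched in a few lines. This is an illustrative design, not a specific library's API; the record fields and the JSON-lines format are assumptions.

```python
import json
import time

def log_interaction(path, query, model_answer, rating=None):
    """Append one user interaction to a JSON-lines log file.

    rating is optional: implicit usage logs are valuable even
    when the user gives no explicit feedback.
    """
    record = {
        "timestamp": time.time(),
        "query": query,
        "answer": model_answer,
        "rating": rating,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction("feedback_log.jsonl", "Who wrote Faust?", "Goethe", rating=5)
```

An append-only log like this is easy to analyze later for pain points, common query patterns, and edge cases, without interrupting the user.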
The training set
In order to train the model itself, you will probably require a labeled training set. The quality of the training set is very important. Good documentation of annotation procedures, an accessible annotation tool, and proper data storage all contribute towards quality annotation.
What many teams overlook is the quality of the annotators themselves. It may be appealing to hire clickworkers to create a large, low-quality dataset, but this carries long-term costs for your development cycle. Most significantly, low-quality labels lead to a low-quality evaluation set, which damages your single best metric for measuring model performance.
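One common way to check annotator quality is to have two annotators label the same examples and measure their agreement. Below is a minimal sketch of Cohen's kappa in plain Python; the example labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

A low kappa on a shared sample is an early warning that your labels, and therefore your evaluation set, may not be trustworthy.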
Pipelines and iteration
As you gain new insights into the route to success with your system, you will need to test all sorts of models, training parameters, user interfaces, and perhaps even annotation schemes. To enable rapid cycling through these options, and to ensure that your experimental results remain organized, you should use a pipeline tool. Tools like Hugging Face pipelines or deepset’s own Haystack will break up production of the prototype into modular parts that are easy to swap out while experimenting.
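The modularity idea can be sketched in a few lines: if each stage is a plain callable with a common interface, swapping a retriever, reader, or model is a one-line change. This is an illustrative toy, not the Haystack or Hugging Face API, and the stage functions are invented for the example.

```python
# Minimal sketch of a modular pipeline: stages are interchangeable callables.

class Pipeline:
    def __init__(self, *stages):
        self.stages = stages

    def run(self, data):
        # Pass the output of each stage into the next.
        for stage in self.stages:
            data = stage(data)
        return data

# Two toy stages; either could be swapped out during experimentation.
def keyword_retriever(query):
    return {"query": query, "docs": ["doc_about_" + query.split()[0].lower()]}

def uppercase_reader(state):
    return state["docs"][0].upper()

pipe = Pipeline(keyword_retriever, uppercase_reader)
print(pipe.run("Paris travel tips"))  # DOC_ABOUT_PARIS
```

Because each stage only depends on its input and output, comparing two retrievers or two models is just a matter of constructing two pipelines and running the same evaluation set through both.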
As your model and prototype platform develop through several iterations, you will hopefully be building a genuinely useful product and gaining organic engagement from your beta testers. This virtuous cycle provides a sense of security for your MLOps team (machine learning operations, responsible for deployment and maintenance of the product) when it decides to invest in expensive polish like incremental training and complex CI/CD automation.
Avoid Getting Stuck with Feedback and Tooling
We have outlined two broad strategies for developing an NLP project. A one-shot approach picks a direction from the start and pushes forward until the original vision is achieved. An iterative approach emphasizes constant reassessment of the users’ needs and the model’s capacity.
The iterative approach we recommend requires roughly the same amount of labor and investment as a large, one-shot development cycle. But while each round of tinkering increases an iterative project’s chances of success, the one-shot investment has little opportunity to change course and improve its odds. And if either strategy fails, it is easier to restart from a checkpoint in an iterative process, whereas a one-shot solution costs as much on every failed round as it did on the first.
In short: keep your steps bite-sized and your prototypes flexible, and always get user feedback!
Come Talk Pipelines and Development
Haystack is a great tool for keeping your development prototypes agile. We at deepset are proud to have developed it, informed by our own experiences with iterative NLP development.