
Benchmarking & Evaluation

How do you determine whether one NLP system that suggests edits or completions is better than another system?

One way is to try both out on a dozen hand-picked sentences and see which system catches the most errors or makes the most accurate predictions. However, we'll argue that this is not a good way to perform an evaluation.

There are two key components to evaluating an NLP system. The first is the evaluation dataset. How is the dataset gathered and annotated? How much data is there, and what are the sources? The second is the metric. Given a dataset, there is usually a standard metric for evaluating system performance, so this choice tends to be the more straightforward one.


For evaluation, there needs to be both sufficient quality and sufficient quantity of data.

Sports Analogies

Why are both quality and quantity of data so important? Let's take two analogies from sports.

🏃 If a coach was choosing runners for the Olympic 10,000m team, she would at the very least time the runners over the 10,000m distance. Imagine she instead chose to measure their 100m dash times. The 100m results (where reaction time and power are critical) may not even correlate with 10,000m performance (where endurance is critical) at any competitive level. Similarly, NLP systems should be evaluated on data that tests the right metric and is representative of the real-world, production setting.

🏀 When evaluating whether a basketball player is good or not, we would never simply evaluate them on a single 12-minute quarter. We instead need to evaluate them over a period of a dozen or so games, maybe even over half of an 82-game season. Evaluating them based on a single quarter is akin to evaluating an NLP system off of only a dozen or so examples — there is too much variance with such a small sample size.

Take grammar and spelling as an example. It's tempting to generate obvious, glaring errors such as "hy howz it goying". However, these types of errors are rarely seen in practice and favor simple dictionary lookups. High-quality data should come from errors made in real-world usage. This will capture mistakes that stem from keyboard slips, phonetic mixups, and users for whom English (or whatever language you're checking) is not their first language. Once you gather such data, the difference between these errors and artificially constructed ones will be stark.

As another example, consider autocorrect. Autocorrect must have an extremely low rate of false positives, or it becomes too annoying and users will turn it off. It's much better to have an autocorrect that makes fewer corrections accurately than one that makes many corrections inaccurately.
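This tradeoff can be made concrete with precision, the fraction of corrections made that were actually right. The sketch below compares two hypothetical autocorrect systems; the counts are invented purely for illustration.

```python
# Hypothetical counts for two autocorrect systems (illustrative numbers only).
# "correct" = the suggested fix was right; "wrong" = a false positive.
system_a = {"correct": 40, "wrong": 1}    # conservative: few corrections, few mistakes
system_b = {"correct": 90, "wrong": 30}   # aggressive: many corrections, many mistakes

def precision(counts):
    """Fraction of corrections made that were actually right."""
    return counts["correct"] / (counts["correct"] + counts["wrong"])

print(f"System A precision: {precision(system_a):.3f}")  # 0.976
print(f"System B precision: {precision(system_b):.3f}")  # 0.750
```

System B fixes more errors in absolute terms, but its 30 false positives are exactly the kind of behavior that drives users to disable the feature.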

In terms of data quantity, we suggest at least a hundred sentences containing at least a hundred errors. This helps ensure that differences in the metrics we describe below are statistically meaningful.
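To see why sample size matters, consider how uncertain a measured rate is at different dataset sizes. A rough normal-approximation confidence interval (a simplification that is itself shaky at very small n) makes the point:

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width of a 95% normal-approximation confidence interval
    for a proportion p measured on n examples."""
    return z * math.sqrt(p * (1 - p) / n)

# Suppose a system catches 70% of errors. How precise is that estimate?
for n in (12, 100, 1000):
    hw = ci_half_width(0.7, n)
    print(f"n={n:4d}: 0.70 ± {hw:.2f}")
```

With a dozen examples the estimate is roughly 0.70 ± 0.26, which can't distinguish a mediocre system from a great one; at a hundred examples it tightens to about ± 0.09.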

Automated Metrics

Automated metrics can be computed immediately after running a system on the evaluation data. One example is the F-score, which combines precision (the fraction of flagged items that are true errors) and recall (the fraction of true errors that are flagged), both computed from a system's true positives, false positives, and false negatives. We would like true positives to be as high as possible while avoiding too many false positives.
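A minimal F1 implementation looks like this (the example counts for the grammar checker are invented for illustration):

```python
def f1_score(tp, fp, fn):
    """F1: the harmonic mean of precision and recall.
    Note that true negatives do not appear anywhere in the formula."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical grammar checker: 80 errors caught, 10 false alarms, 20 misses.
print(f"F1 = {f1_score(tp=80, fp=10, fn=20):.3f}")  # 0.842
```

Because F1 is a harmonic mean, it punishes a system that is lopsided: high recall cannot compensate for poor precision, which matches the autocorrect intuition above.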

There are also probabilistic metrics that are frequently used — contact us to learn more.

Usage-Based and Qualitative Metrics

Metrics based on usage and user feedback take longer to collect but are ultimately what we wish to measure.

Again taking grammar checking as an example, usage-based metrics may include number of accepted suggestions, number of ignored suggestions, and total number of suggestions made.
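From raw suggestion logs, these usage metrics reduce to simple counting. The log format and field names below are assumptions for illustration, not a real product schema:

```python
# Hypothetical suggestion log (field names are assumptions for illustration).
log = [
    {"suggestion": "their -> there",    "action": "accepted"},
    {"suggestion": "recieve -> receive", "action": "accepted"},
    {"suggestion": "affect -> effect",  "action": "ignored"},
    {"suggestion": "its -> it's",       "action": "ignored"},
    {"suggestion": "teh -> the",        "action": "accepted"},
]

total = len(log)
accepted = sum(1 for entry in log if entry["action"] == "accepted")

print(f"total suggestions: {total}")
print(f"acceptance rate:   {accepted / total:.0%}")  # 60%
```

The acceptance rate is a useful proxy for suggestion quality, though it takes real usage time to accumulate, unlike the automated metrics above.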

Further downstream, we may wish to measure language quality, time saved, and impact on the customer (e.g., by running an A/B test).