Custom Models

Language Variety

Just as different regions of the world use different languages, and different English-speaking regions use different varieties of English, different companies use company- and industry-specific language. Beyond differing lexicons, different settings and industries also have differing syntactic and stylistic preferences.

One common example is an informal (chat, texting, colloquial) vs. formal (email, memo, business) style.

Informal                Formal
hey wuts up             Hi, how are you?
brb                     I'll be with you in a moment.
cya tomorrow            Speak with you tomorrow.
that photo is bomb      That's a wonderful photo.
dude, i'm lost          Sorry, could you please clarify?

Example industries that have industry-specific styles:

  • Academic publishing
  • Healthcare
  • Legal
  • Finance
  • Speech transcripts

These industries usually have writers and scribes who specialize in the conventions and vernacular of their industry.

Approaches

While a marketing or content manager can sit down and create a style guide that serves as a good starting point, a manually specified guide is unlikely to capture the long tail of possible scenarios and language that users will encounter and generate. This is because language is rich and complex, and although rules of thumb can be helpful, they inevitably miss many phenomena.

Instead, machine learning (ML) allows systems to learn from data the patterns and nuances of particular industries and language varieties. The more data is provided, the better ML systems perform. Sapling uses this approach.

Prerequisites

While people can acquire a sense of a specific style from perhaps a dozen or so pages, ML systems require far more data. We recommend that custom models be trained on at least 200K sentences' worth of data, though the more data, the better (Sapling systems are usually trained on at least a few million sentences or sentence pairs). This ensures that the system sees a representative sample of text that will be similar enough to the future text it encounters (in ML terms, generalizing well and avoiding overfitting).
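As a rough illustration, the Python sketch below estimates whether a corpus clears the 200K-sentence mark. The regex-based splitter and the corpus.txt path are simplifications; a real pipeline would use a proper sentence tokenizer.

    import re

    def count_sentences(path):
        """Roughly count sentences in a plain-text corpus file."""
        count = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Naive split on sentence-ending punctuation followed by whitespace.
                count += len([s for s in re.split(r"(?<=[.!?])\s+", line.strip()) if s])
        return count

    total = count_sentences("corpus.txt")  # placeholder path
    print(f"{total:,} sentences (recommended minimum: 200,000)")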

In addition, training large ML models in a reasonable amount of time requires that Sapling use graphics processing units (GPUs). Each training run requires roughly $2,000 in compute spend, and developing a model may require multiple runs to benchmark performance with different settings.

Evaluation Set

Prior to training a model, we first set aside a few thousand sentences in what is referred to as an evaluation or validation set.

We use the validation set to benchmark the performance of the model once it is trained. In the case where we finetune an existing model, we can use the validation set to measure performance before and after finetuning to determine the performance gains.
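As a minimal sketch of this step, the snippet below shuffles a preprocessed corpus and sets aside a few thousand sentences as the validation set. The file name and validation size are illustrative.

    import random

    def split_corpus(sentences, valid_size=2000, seed=0):
        """Shuffle and carve off a validation set before training.

        Returns (train, validation); a valid_size of a few thousand
        matches the guideline described above.
        """
        sentences = list(sentences)
        random.Random(seed).shuffle(sentences)
        return sentences[valid_size:], sentences[:valid_size]

    # Hypothetical corpus file with one preprocessed sentence per line.
    with open("corpus.txt", encoding="utf-8") as f:
        data = [line.strip() for line in f if line.strip()]
    train, valid = split_corpus(data)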

The validation set should meet the following guidelines:

  1. It should be similar to / reflect the production data the customer expects. It should also have gone through the same preprocessing pipeline as the production data.
  2. It should be at least several hundred (if not more) examples.
  3. It should not intersect with the training set (overlap between the two is known as data contamination; see the sketch below).

Please read our evaluation guide for more details.
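To make guideline 3 concrete, here is a small sketch that flags overlap between the two sets. The one-sentence-per-line file layout is an assumption about how the data is stored.

    def find_contamination(train_sentences, valid_sentences):
        """Return validation sentences that also appear in the training set.

        Any overlap (data contamination) inflates validation scores,
        so the ideal result is an empty list.
        """
        train_set = set(train_sentences)
        return [s for s in valid_sentences if s in train_set]

    # Hypothetical one-sentence-per-line files.
    with open("train.txt", encoding="utf-8") as f:
        train = [line.strip() for line in f if line.strip()]
    with open("valid.txt", encoding="utf-8") as f:
        valid = [line.strip() for line in f if line.strip()]

    overlap = find_contamination(train, valid)
    if overlap:
        print(f"Warning: {len(overlap)} validation sentences also appear in the training set")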

Input Format

For most applications, we only need to be able to take input data (whether it be in XML, JSON, HTML, or TXT format) and extract the plain text. The text should not include formatting characters or markup (unless we are specifically training a formatting model).

Because of challenges in text extraction, we usually cannot work with PDF files.

However, once the text is in one of the text-based formats described above, it is usually straightforward to convert it to a format the Sapling team can use to train or finetune a custom model.
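As an illustration of this kind of extraction, the sketch below pulls plain text out of HTML using Python's standard-library parser and out of a list of JSON records. The JSON field name is an assumption; it depends on the customer's schema.

    import json
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the text nodes of an HTML document."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())

    def html_to_text(html):
        parser = TextExtractor()
        parser.feed(html)
        return "\n".join(parser.chunks)

    def json_to_text(raw, field="text"):
        """Pull a text field out of a list of JSON records; the field name is assumed."""
        records = json.loads(raw)
        return "\n".join(r[field] for r in records if r.get(field))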

Results

Once a model is trained, there are two steps to evaluate performance.

  1. Evaluate performance on the validation set. If the model does not exhibit significantly better performance, it is highly unlikely to perform better once deployed. The metric that we usually use to evaluate performance with the validation set is F-score.
  2. Deploy the model to a test group that works with production data. This is the real test of whether deploying a custom model has helped. Here the metric is usually the change in the number of suggestions accepted or ignored by users, or a more qualitative survey.

(Read more about evaluation in this article.)
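For reference, the F-score combines precision and recall over the model's suggestions. A minimal sketch of the computation is below; the counts in the example are made up.

    def f_score(tp, fp, fn, beta=1.0):
        """F-beta score from suggestion-level counts.

        precision = tp / (tp + fp), recall = tp / (tp + fn);
        beta = 1.0 gives the standard F1 score.
        """
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # Illustrative counts: 80 correct suggestions, 20 spurious, 40 missed -> F1 of about 0.73
    print(f_score(tp=80, fp=20, fn=40))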

We've seen significant gains for businesses in healthcare, publishing, and even direct-to-consumer sales. Contact us to learn more.


Frequently Asked Questions

Q: How long does it take to develop a custom model?

A: Typically a few weeks for data collection, then another 1-2 weeks for deployment. Data collection usually takes the most time.

Q: How often should models be updated?

A: We recommend reviewing performance quarterly and updating as needed.

Q: How can users provide feedback?

A: Sapling's SDK provides an accept/ignore button for each suggestion. Sapling can use these actions to iteratively improve the model.
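As one illustration of how such feedback could be aggregated, the sketch below computes an acceptance rate from accept/ignore events. The event format is hypothetical, not the actual SDK payload.

    from collections import Counter

    def acceptance_rate(events):
        """Fraction of suggestions accepted, given accept/ignore events.

        The {"action": "accept"} / {"action": "ignore"} format is illustrative only.
        """
        counts = Counter(e["action"] for e in events)
        total = counts["accept"] + counts["ignore"]
        return counts["accept"] / total if total else 0.0

    # Hypothetical before/after comparison for a deployed custom model.
    baseline = acceptance_rate([{"action": "accept"}, {"action": "ignore"}, {"action": "ignore"}])
    custom = acceptance_rate([{"action": "accept"}, {"action": "accept"}, {"action": "ignore"}])
    print(f"baseline: {baseline:.2f}, custom model: {custom:.2f}")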