Since the release of OpenAI's ChatGPT in November of 2022, hundreds of millions of users have used it to generate billions of tokens per day.
ChatGPT, like other large language models (LLMs), has a distinctive style, and in the initial months after its release it would frequently use the phrase "as a large language model" as a preface to its responses.
Since then, this tic appears to have diminished, but it led us to wonder: what other phrases are strongly associated with ChatGPT's style but not with the writing of humans? And beyond the statistics, what does this tell us about its training?
Our approach was to take question-answer (QA) data with responses from both humans and ChatGPT. We then determined which phrases frequently occur in ChatGPT responses but infrequently occur in human responses. While this could be done by manually inspecting outputs and making observations, we instead formed all possible subsequences of words in the texts (up to length 5) and counted the frequency of each.
To perform an empirical analysis of ChatGPT's favorite phrases, we used the HC3 dataset. This dataset contains ~24K examples of human and AI-generated responses to prompts from Reddit, WikiQA, StackExchange, and other sources of question-answer pairs.
We note that this is just one snapshot of responses, and while we may expect the human styles and phrasings to hold steady (at least over the course of a few years), LLMs such as ChatGPT are evolving and changing by the week; the datasets described may need to be updated.
After some simple preprocessing of the data, we computed the 5-grams in each split (human, AI) in the dataset. This resulted in 6,036,620 human n-grams and 3,631,482 ChatGPT n-grams: the first difference we observe is that human-generated text remains more diverse than AI-generated text.
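The post doesn't include code, but the counting step can be sketched as follows. This is a minimal version under our own assumptions (whitespace tokenization, punctuation stripping, and the function names are ours; HC3's actual preprocessing may differ):

```python
from collections import Counter

def ngrams(tokens, n=5):
    """Return all contiguous n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(texts, n=5):
    """Count n-gram occurrences across a list of texts.

    Each text is lowercased, stripped of punctuation, and wrapped in
    <s>/</s> start- and end-of-text markers before extracting n-grams.
    """
    counts = Counter()
    for text in texts:
        cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
        tokens = ["<s>"] + cleaned.split() + ["</s>"]
        counts.update(ngrams(tokens, n))
    return counts
```

Running `count_ngrams` over the human and ChatGPT splits separately yields the two n-gram counters compared below.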
We would expect certain n-grams ("at the end of the", "in whole or in part") to appear frequently in both splits, but others to appear much more frequently in human-generated text, and still others much more frequently in AI-generated text. More precisely, we would not expect LLMs to generate certain specific sequences of text, due to the methods by which they've been trained and fine-tuned.
Based on the analysis above, here are the top 25 phrases most indicative of human-generated text:
| Phrase | Log Probability Delta |
|---|---|
| the end of the trading | -2.3243 |
| put a lot of strain | -2.0574 |
| what you want to achieve | -2.0332 |
| to give you a basic | -2.0118 |
| for the most part its | -2.0069 |
| a bit of a mystery | -1.9875 |
| i am not a medical | -1.9823 |
| is a matter of personal | -1.9750 |
| a few of the many | -1.9646 |
| i hope that helps let | -1.9407 |
| it is up to each | -1.9109 |
| the money in the account | -1.9042 |
| if you are a nonresident | -1.8966 |
| the ratio of the circumference | -1.8936 |
| to do this is by | -1.8752 |
| are some of the main | -1.8651 |
| to the united states constitution | -1.8348 |
| the tip of the [redacted] | -1.8265 |
| come up with a plan | -1.7752 |
| as a matter of law | -1.7748 |
| the best of my ability | -1.7566 |
| the length of the side | -1.7447 |
| the energy that is released | -1.7409 |
| the result of a combination | -1.7016 |
| there is also the risk | -1.6976 |
Here are the top 25 phrases most likely to appear in ChatGPT-generated text and not in human-generated text:
| Phrase | Log Probability Delta |
|---|---|
| <s> there are a couple | 2.9191 |
| its also important to think | 2.3774 |
| is always a good thing | 2.1334 |
| it is also important that | 2.0646 |
| it is important to know | 2.0253 |
| can vary depending on what | 2.0024 |
| i hope this helps </s> | 1.9463 |
| you may want to check | 1.9204 |
| <s> another reason is because | 1.9133 |
| i hope that helps </s> | 1.8907 |
| best course of action is | 1.8803 |
| it is generally considered impolite | 1.8401 |
| it is important to study | 1.8371 |
| to keep in mind is | 1.8360 |
| i hope this helps you | 1.8343 |
| a good idea to go | 1.8296 |
| <s> there are a lot | 1.8218 |
| by a variety of methods | 1.8128 |
| <s> there are many theories | 1.8114 |
| to keep in mind the | 1.8098 |
| one way to think of | 1.8061 |
| it can be difficult if | 1.7998 |
| a good idea to put | 1.7873 |
| be able to determine that | 1.7555 |
| is important to remember however | 1.7528 |
<s> and </s> are special tokens indicating the start and end of a text. The number following each n-gram is the difference in log-probability between that phrase appearing in ChatGPT- vs. human-generated text, so positive values indicate ChatGPT-leaning phrases and negative values human-leaning ones.
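This delta can be computed from the two n-gram counters. A minimal sketch, assuming add-alpha smoothing so that n-grams absent from one split don't produce infinite deltas (the smoothing choice and function names are our assumptions, not necessarily what was used for the tables above):

```python
import math
from collections import Counter

def log_prob_deltas(ai_counts, human_counts, alpha=1.0):
    """Log-probability delta for each n-gram.

    delta = log P(n-gram | ChatGPT) - log P(n-gram | human),
    with add-alpha smoothing over the combined vocabulary.
    Positive deltas flag ChatGPT-leaning phrases, negative deltas
    human-leaning ones.
    """
    vocab = set(ai_counts) | set(human_counts)
    ai_total = sum(ai_counts.values()) + alpha * len(vocab)
    human_total = sum(human_counts.values()) + alpha * len(vocab)
    deltas = {}
    for gram in vocab:
        log_p_ai = math.log((ai_counts[gram] + alpha) / ai_total)
        log_p_human = math.log((human_counts[gram] + alpha) / human_total)
        deltas[gram] = log_p_ai - log_p_human
    return deltas
```

Sorting the result by value in either direction recovers the two ranked lists above.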
There are clear patterns in the phrases associated with ChatGPT:

- Emphasizing important points ("it is important...", "is always a good thing", ...)
- Expressing uncertainty ("can vary depending", "you may want to check", ...)
- Presenting multiple options and viewpoints ("its also important to...", "it is important to remember...", ...)

Many of these patterns are likely a result of the human examples and preferences that ChatGPT has been tuned on.
It's more difficult to immediately identify patterns in the human n-grams. Perhaps they show a slightly lower degree of formality ("to give you a basic", "for the most part its"), and they cluster around topics such as medical and financial advice, where LLMs tend to be more strict and formulaic given the gravity of the subject.
The phrases we've identified demonstrate patterns in ChatGPT's responses that are a reflection of the data that was used to train it. But how strong a signal are they of whether a text was AI-generated or not?
AI-generated content is proliferating, and it can be helpful to have surefire "signatures" (like "as a large language model") indicating that a text was produced by an LLM. But this is only effective in cases where the user was sloppy and left these telltale signs in the generated text.
Many have tried to create watermarks that indicate whether an image is AI-generated; such markers can also be placed in an image's metadata to indicate its provenance. However, the same is much harder to bake into easily editable text. Some researchers have proposed probabilistic watermarks based on partitioning vocabularies; however, these require the participation of all major LLM developers.
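The vocabulary-partitioning idea can be illustrated with a toy sketch. Here a hash of the previous token splits the vocabulary into a "green" half (favored during generation) and a "red" half; detection simply measures the fraction of green tokens. All names below are our own, and real schemes bias the model's sampling distribution rather than post-processing text:

```python
import hashlib

def green_list(prev_token, vocab):
    """Deterministically select half the vocabulary as 'green' based on
    a hash of the previous token. A watermarking sampler would favor
    these tokens at each generation step."""
    ranked = sorted(
        vocab,
        key=lambda w: hashlib.sha256(f"{prev_token}:{w}".encode()).hexdigest(),
    )
    return set(ranked[: len(ranked) // 2])

def green_fraction(tokens, vocab):
    """Fraction of tokens drawn from their green list. Unwatermarked
    text should hover near 0.5; watermarked text sits well above it."""
    hits = sum(
        1 for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab)
    )
    return hits / max(len(tokens) - 1, 1)
```

Because detection needs only the hashing scheme, not the model, anyone holding the key can check a text; but as noted above, the scheme only covers models whose developers opt in, and paraphrasing can dilute the signal.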