Chunking
Also referred to as segmentation, chunking text allows you to parallelize
the processing of a piece of text or reduce it into pieces that fall under the
length limit of a particular system (e.g., under the context length of a Large Language Model).
The first step is to identify delimiters where the text can be chunked.
For many use cases, this might be a double newline indicating the start of a new paragraph.
In other instances, you may want chunks of roughly equal length; in that case you can
instead chunk on the closest sentence-ending punctuation marker (e.g., ".", "?", "!").
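As an illustrative sketch (these helpers are ours, not part of the Sapling API), paragraph- and sentence-level chunking might look like:

```python
import re

def chunk_paragraphs(text):
    """Split text on blank lines (double newlines), one chunk per paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_sentences(text):
    """Split text after sentence-ending punctuation (., ?, !).

    Naive: this will also split after abbreviations such as "e.g.".
    """
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
```

The sentence splitter uses a lookbehind so the punctuation stays attached to the chunk it ends; production sentence segmentation usually needs abbreviation handling on top of this.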
If latency is critical for your use of Sapling, you'll want to chunk longer documents into much shorter texts so that you can process them in parallel. Some Sapling endpoints, such as the AI detection endpoint, also impose a length limit, and you may need to process text whose length exceeds that limit. What follows are some notes on how we would approach chunking; contact us if you need help doing this.
Levels:
- At the highest level, you have a corpus of documents. You may have heard of Common Crawl as one such corpus of documents across the web, but there are also corpora based on just Wikipedia, or Reddit, or internal documents belonging to a corporation.
- Beneath the corpus level, we have documents. Documents represent a cohesive piece of text and are a familiar concept--think Microsoft Word or Google Docs. For news or magazine publications, article may be a more accurate term than document. In some cases, it may be unclear if text should be grouped into one or many documents--e.g., in the case of book chapters--but regardless, in these cases it's straightforward to segment the text in either fashion.
- Within a document, there is a greater variety of segments that may work best. We present one that's drawn from the HTML DOM standard: a document is composed of sections, where each section contains headings and paragraphs. The paragraphs are further grouped into sentences, which consist of tokens. Tokens can be words, or subwords/morphemes, or characters, or even longer text segments when lexicons are built using information-theoretic approaches.
Hence we arrive at a structure that looks like the following (in JSON format):
{
  "corpus": [
    {
      "title": "Test Document 1",
      "body": [
        {
          "title": "Section 1",
          "body": [
            {
              "heading": "Heading 1"
            },
            {
              "paragraph": "Paragraph text"
            },
            {
              "paragraph": "More paragraph text"
            }
          ]
        }
      ]
    }
  ]
}
To help with the tasks above, Sapling offers endpoints for performing simple length-based chunking.
Given an input text or HTML document, Sapling will break the text into blocks of length at most max_length.
When splitting the text, the API uses the following preference stack:
page breaks > paragraph breaks > line breaks > tabs > punctuation > all other whitespace
This endpoint is free of charge.
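As a rough local approximation of this behavior (a hypothetical sketch; Sapling's actual implementation may differ), a preference-stack splitter could look like:

```python
# Delimiters ordered from most to least preferred; "\f" stands in for a page break.
SPLIT_PREFERENCES = ["\f", "\n\n", "\n", "\t", ".", "?", "!", " "]

def split_text(text, max_length, step_size=50):
    """Greedily emit chunks of at most max_length characters, cutting at the
    highest-preference delimiter found within the trailing step_size window."""
    chunks = []
    while len(text) > max_length:
        start = max(0, max_length - step_size)
        window = text[start:max_length]
        for delim in SPLIT_PREFERENCES:
            idx = window.rfind(delim)
            if idx != -1:
                cut = start + idx + len(delim)  # cut just after the delimiter
                break
        else:
            cut = max_length  # no delimiter in window: hard split
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```

This also illustrates the step_size trade-off described above: a larger window is more likely to contain a high-preference delimiter, but the cut point can land anywhere inside it, so chunk sizes vary more.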
Chunk Text POST
Request Parameters
https://api.sapling.ai/api/v1/ingest/chunk_text
HTTP method: POST
key: String
32-character API key.
text: String
Text to chunk.
max_length: Integer
Maximum length of text segments.
step_size: Optional Integer
Size of window to look for split points. The larger the step size, the greater
the variance in chunk sizes, but the higher the chance of splitting on
a preferred split type (see preference stack above).
Response Parameters
A JSON object will be returned with the field chunks, a list of strings representing the segmented text.
{
"chunks": [
"First chunk",
"Second chunk"
]
}
Chunk HTML POST
Request Parameters
https://api.sapling.ai/api/v1/ingest/chunk_html
HTTP method: POST
key: String
32-character API key.
html: String
HTML to extract text from and chunk.
max_length: Integer
Maximum length of text segments.
step_size: Optional Integer
Size of window to look for split points. The larger the step size, the greater
the variance in chunk sizes, but the higher the chance of splitting on
a preferred split type (see preference stack above).
Response Parameters
This endpoint not only breaks up the HTML but also discards all HTML tags, resulting in plain-text chunks. Some HTML fields that you wish to keep may also be discarded; contact us if this is the case so we can better support your use case.
A JSON object will be returned with the field chunks, a list of strings representing the segmented text contained within the HTML.
{
"chunks": [
"First chunk",
"Second chunk"
]
}
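Calling chunk_html follows the same pattern as chunk_text; the only difference in the request body is that the input field is html rather than text. A sketch (again using only the standard library, with helper names of our choosing):

```python
import json
import urllib.request

CHUNK_HTML_URL = "https://api.sapling.ai/api/v1/ingest/chunk_html"

def build_html_payload(key, html, max_length, step_size=None):
    """Assemble the JSON request body for the chunk_html endpoint."""
    payload = {"key": key, "html": html, "max_length": max_length}
    if step_size is not None:  # step_size is optional
        payload["step_size"] = step_size
    return payload

def chunk_html(key, html, max_length, step_size=None):
    """POST an HTML document to chunk_html and return the plain-text chunks."""
    req = urllib.request.Request(
        CHUNK_HTML_URL,
        data=json.dumps(build_html_payload(key, html, max_length, step_size)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["chunks"]
```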