
Chunking

Also referred to as segmentation, chunking text allows you to parallelize the processing of a piece of text or to reduce it into pieces that fall under the length limit of a particular system (e.g., the context length of a Large Language Model). The first step is to identify delimiters where the text can be split. For many use cases, this might be a double newline indicating the start of a new paragraph. In other cases, you may want chunks of roughly equal length; there you can instead split at the nearest sentence-ending punctuation mark (e.g., ., ?, !).
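As an illustration (independent of Sapling's API), the hypothetical helper below first splits on paragraph breaks and falls back to sentence-ending punctuation when a paragraph is still too long:

import re

def naive_chunk(text, max_length=1000):
    """Greedy chunking sketch: split on blank lines first, then fall back
    to sentence-ending punctuation if a paragraph is still too long."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        if len(paragraph) <= max_length:
            chunks.append(paragraph)
            continue
        current = ""
        # Split the long paragraph at sentence-ending punctuation (., ?, !).
        # A single sentence longer than max_length is left as-is in this sketch.
        for sentence in re.split(r"(?<=[.?!])\s+", paragraph):
            if current and len(current) + 1 + len(sentence) > max_length:
                chunks.append(current)
                current = sentence
            else:
                current = (current + " " + sentence).strip()
        if current:
            chunks.append(current)
    return chunks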

If latency is critical for your use of Sapling, you'll want to chunk longer documents into much shorter texts so that you can process them in parallel. Sapling also provides certain endpoints, such as the AI detection endpoint, that have a length limit, and you may want to process a piece of text whose length exceeds that limit. What follows are some notes on how we would approach chunking; contact us if you need help doing this.
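Once the text is chunked, the pieces can be processed concurrently. Below is a rough sketch using Python's standard library, where process_chunk is a hypothetical stand-in for whatever per-chunk work you do (e.g., a Sapling API call):

from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Hypothetical stand-in for a per-chunk API call or other processing.
    return len(chunk)

chunks = ["First chunk of text.", "Second chunk of text."]

# Threads are a reasonable default when each chunk triggers an I/O-bound API call.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(process_chunk, chunks))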

Levels:

  • At the highest level, you have a corpus of documents. You may have heard of Common Crawl as one such corpus of documents across the web, but there are also corpora based on just Wikipedia, or Reddit, or internal documents belonging to a corporation.
  • Beneath the corpus level, we have documents. Documents represent a cohesive piece of text and are a familiar concept--think Microsoft Word or Google Docs. For news or magazine publications, article may be a more accurate term than document. In some cases, it may be unclear if text should be grouped into one or many documents--e.g., in the case of book chapters--but regardless, in these cases it's straightforward to segment the text in either fashion.
  • Within a document, there is a greater variety of segmentation schemes that may work best. We present one drawn from the HTML DOM standard: a document is composed of sections, where each section contains headings and paragraphs. The paragraphs are further grouped into sentences, which consist of tokens. Tokens can be words, or subwords/morphemes, or characters, or even longer text segments when lexicons are built using information-theoretic approaches.

Hence we arrive at a structure that looks like the following (in JSON format):

{
  "corpus": [
    {
      "title": "Test Document 1",
      "body": [
        {
          "title": "Section 1",
          "body": [
            {
              "heading": "Heading 1"
            },
            {
              "paragraph": "Paragraph text"
            },
            {
              "paragraph": "More paragraph text"
            }
          ]
        }
      ]
    }
  ]
}
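As an illustration, a structure like this can be walked with nested loops; the sketch below assumes the exact field names shown above and yields each paragraph along with its document and section titles:

def iter_paragraphs(data):
    # Field names ("corpus", "body", "title", "paragraph") are assumed to
    # match the illustrative structure above.
    for document in data["corpus"]:
        for section in document["body"]:
            for item in section["body"]:
                if "paragraph" in item:
                    yield document["title"], section["title"], item["paragraph"]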

To help with the tasks above, Sapling offers endpoints for performing simple length-based chunking. Given an input text or HTML document, Sapling will break the text into blocks of length at most max_length. When splitting the text, the API uses the following preference stack:

page breaks > paragraph breaks > line breaks > tabs > punctuation > all other whitespace
tip

This endpoint is free of charge.

Chunk Text POST

Request Parameters

https://api.sapling.ai/api/v1/ingest/chunk_text

HTTP method: POST

key: String
32-character API key.

text: String
Text to chunk.

max_length: Integer
Maximum length of text segments.

step_size: Optional Integer
Size of the window used to look for split points. The larger the step size, the greater the variance in chunk sizes, but the higher the chance of splitting on a preferred split type (see the preference stack above).

Response Parameters

A JSON object will be returned with the field chunks containing a list of strings representing the segmented text.

{
  "chunks": [
    "First chunk",
    "Second chunk"
  ]
}
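Putting the request and response together, a call might look like the following sketch (assuming the parameters are sent as a JSON request body; replace the placeholder key with your own):

import requests

resp = requests.post(
    "https://api.sapling.ai/api/v1/ingest/chunk_text",
    json={
        "key": "<your-32-character-api-key>",
        "text": "First paragraph.\n\nSecond paragraph that runs on for a while...",
        "max_length": 500,
    },
)
resp.raise_for_status()
# Each returned chunk is at most max_length characters long.
for chunk in resp.json()["chunks"]:
    print(len(chunk), repr(chunk))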

Chunk HTML POST

Request Parameters

https://api.sapling.ai/api/v1/ingest/chunk_html

HTTP method: POST

key: String
32-character API key.

html: String
HTML to extract text from and chunk.

max_length: Integer
Maximum length of text segments.

step_size: Optional Integer
Size of the window used to look for split points. The larger the step size, the greater the variance in chunk sizes, but the higher the chance of splitting on a preferred split type (see the preference stack above).

Response Parameters

info

This endpoint not only breaks up the HTML but also discards all HTML tags, returning plain text. Some HTML content that you wish to keep may also be discarded; contact us if this is the case so we can better support your use case.

A JSON object will be returned with the field chunks containing a list of strings representing the segmented text extracted from the HTML.

{
  "chunks": [
    "First chunk",
    "Second chunk"
  ]
}