Preprocessing Overview
Ingestion + Extraction
By extraction, we mean the process of converting formatted documents (such as PDFs, Word documents, and webpages) into plain or structured text that you can send to the Sapling API.
Sapling currently supports rich HTML for some of its endpoints (including the Edits endpoint) and also offers extractors for PDF and DOCX files. We may add extractors for other input types in the future; contact us if you need support for a format we don't yet handle.
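As a minimal sketch of an extraction-then-API workflow, the Python snippet below pulls plain text out of a PDF with the pypdf library and posts it to the Edits endpoint. The endpoint URL, the key parameter, and the session_id value are assumptions for illustration; substitute your own API key and file.

```python
import requests
from pypdf import PdfReader

# Extract plain text from every page of a PDF using pypdf.
reader = PdfReader("report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Post the extracted text to the Edits endpoint. The URL, "key",
# and "session_id" fields here are assumptions for illustration;
# replace YOUR_API_KEY with your actual API key.
response = requests.post(
    "https://api.sapling.ai/api/v1/edits",
    json={"key": "YOUR_API_KEY", "text": text, "session_id": "pdf-extraction-demo"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```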
Chunking / Segmentation
Please see our separate Chunking page.
Tokenization
Tokenization is the splitting of a piece of text into smaller units called tokens. This is usually done at a finer granularity than chunking: tokens can be characters, words, or even wordpieces (subwords). For more background, please refer to this Stanford NLP article. Tokenization is usually part of an NLP system's preprocessing pipeline.
The purpose of tokenization is to split the text into pieces that are easier to work with. For example, suppose you want to find sentences in a document that exceed a certain length. This first requires sentence tokenization (also called sentence segmentation). As another example, suppose you want to build a word cloud visualization. You would first need to split the text into words before counting the frequency of each word and creating the visualization. Tokenization is usually not as simple as splitting on whitespace: in English, for instance, should a contraction like "don't" count as one word or two?
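To make both examples concrete, here is a short sketch using NLTK (recommended below). It assumes the Punkt tokenizer models have been downloaded; note how the contraction "Don't" is split into two tokens rather than on whitespace.

```python
from collections import Counter

import nltk

# Download the Punkt sentence tokenizer models (newer NLTK releases
# may require "punkt_tab" instead).
nltk.download("punkt")

text = (
    "Sapling focuses on language models for business applications. "
    "Tokenization splits text into manageable pieces. Don't split naively!"
)

# Example 1: sentence tokenization to find sentences over a word-length threshold.
for sentence in nltk.sent_tokenize(text):
    if len(nltk.word_tokenize(sentence)) > 6:
        print("Long sentence:", sentence)

# Example 2: word tokenization to count frequencies for a word cloud.
# "Don't" comes out as ["Do", "n't"], not as a single whitespace-delimited token.
words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
print(Counter(words).most_common(5))
```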
There are many tokenization schemes for splitting text into sentences and words. As hinted at above, there are even methods to segment words into subwords based on information-theoretic principles; this is especially useful for agglutinative languages.
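As an illustration of subword segmentation, the sketch below uses a pretrained tokenizer from Hugging Face's transformers library. The model name is just an illustrative choice, and the snippet assumes the library is installed with network access to fetch the model files.

```python
from transformers import AutoTokenizer

# Load a pretrained subword tokenizer; "bert-base-uncased" uses WordPiece,
# and the model name here is only an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or morphologically complex words get segmented into subword pieces;
# a "##" prefix marks a piece that continues the previous token.
print(tokenizer.tokenize("untranslatability"))
```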
Sapling currently does not offer APIs for tokenization. However, we recommend NLTK and Stanza for your tokenization needs. If you're an existing customer and need help, please reach out.
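For example, here is a minimal Stanza pipeline that performs sentence and word tokenization; it assumes the stanza package is installed and that the English models can be downloaded.

```python
import stanza

# Fetch the English models and build a pipeline with only the tokenizer.
stanza.download("en", processors="tokenize")
nlp = stanza.Pipeline(lang="en", processors="tokenize")

doc = nlp("Sapling's API checks your writing. It doesn't tokenize text for you yet.")
for i, sentence in enumerate(doc.sentences, start=1):
    print(f"Sentence {i}:", [token.text for token in sentence.tokens])
```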