When you connect a knowledge source (a PDF, a wiki article, a SharePoint file), Moveworks breaks it into searchable pieces called snippets or chunks. It does this in two stages, parsing and then chunking, so that search results are relevant, appropriately sized, and structured around your content’s natural layout.
Before any chunking happens, Moveworks reads and interprets the document based on its file type.
PDF text is extracted using one of three engines (selected automatically or set by configuration):
., !, ?) followed by a capitalized new sentence.
Limits that apply to all PDFs:
Moveworks walks the HTML structure and maps it to a document tree:
Confluence-specific macros (tabs, panels, etc.) are converted to standard HTML before processing.
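To make the idea of mapping HTML onto a document tree more concrete, here is a minimal sketch using Python's standard `html.parser`. It assumes, purely for illustration, that sections nest by heading level (`<h1>` to `<h6>`); the `Node` and `TreeBuilder` names are hypothetical and the actual Moveworks mapping may differ.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """One section of the tree: a heading plus the text directly under it."""
    title: str
    level: int  # 0 = document root, 1-6 = <h1>..<h6>
    text: list = field(default_factory=list)
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Hypothetical parser that nests sections by heading level."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.root = Node(title="", level=0)
        self._stack = [self.root]  # path from the root to the open section
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            level = int(tag[1])
            # Close any open sections at the same or a deeper level.
            while self._stack[-1].level >= level:
                self._stack.pop()
            node = Node(title="", level=level)
            self._stack[-1].children.append(node)
            self._stack.append(node)
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        current = self._stack[-1]
        if self._in_heading:
            current.title = (current.title + " " + data).strip()
        else:
            current.text.append(data)


def parse_html(html: str) -> Node:
    """Parse an HTML string and return the root of the section tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A tree like this keeps each paragraph attached to the heading it belongs to, which is the information a structure-aware chunker needs later on.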
Once the document is parsed into a structured representation, it is divided into snippets — the units that get indexed and returned in search results. Two chunking strategies are available.
Used for PDFs, plain text, and knowledge base articles.
Text is split using a cascading hierarchy of splitters, from coarse to fine. A splitter is only applied if the current chunk still exceeds the token limit after the previous level:
Token limit: 200 tokens per chunk by default (configurable). Token counting uses the same tokenizer as GPT-3.5 Turbo.
Segments are then greedily packed — consecutive segments are joined together until the next one would push the chunk over the limit.
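The cascade and the greedy packing step can be sketched as follows. This is an illustration under stated assumptions, not the Moveworks implementation: the `SPLITTERS` list (paragraph, line, sentence, word) is a hypothetical hierarchy, and `count_tokens` is a whitespace stand-in for the real tokenizer so the example runs without extra dependencies.

```python
import re
from typing import Callable, List

# Hypothetical splitter cascade, coarse to fine; the real hierarchy may differ.
SPLITTERS = [
    r"\n\s*\n",        # paragraph breaks
    r"\n",             # line breaks
    r"(?<=[.!?])\s+",  # whitespace after sentence-ending punctuation
    r"\s+",            # individual words
]


def count_tokens(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())


def split_cascade(text: str, limit: int, level: int = 0,
                  count: Callable[[str], int] = count_tokens) -> List[str]:
    """Apply finer splitters only to pieces that still exceed the limit."""
    text = text.strip()
    if not text:
        return []
    if count(text) <= limit or level >= len(SPLITTERS):
        return [text]
    segments: List[str] = []
    for part in re.split(SPLITTERS[level], text):
        segments.extend(split_cascade(part, limit, level + 1, count))
    return segments


def pack_greedy(segments: List[str], limit: int,
                count: Callable[[str], int] = count_tokens) -> List[str]:
    """Join consecutive segments until the next one would exceed the limit."""
    chunks: List[str] = []
    current = ""
    for seg in segments:
        # Segments are rejoined with a single space (a simplification).
        candidate = f"{current} {seg}" if current else seg
        if current and count(candidate) > limit:
            chunks.append(current)
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text: str, limit: int = 200,
               count: Callable[[str], int] = count_tokens) -> List[str]:
    """Split with the cascade, then pack the segments into chunks."""
    return pack_greedy(split_cascade(text, limit, count=count), limit, count)
```

For example, `chunk_text("One two three. Four five six. Seven eight nine.\n\nTen eleven.", limit=6)` splits the first paragraph into sentences and then packs the four resulting segments into two chunks. Passing a different `count` callable lets you swap in a real tokenizer.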
Used for HTML documents when structure-aware mode is enabled.
Instead of splitting blindly by token count, this strategy uses the document’s own structure to find natural chunk boundaries, prioritized in tiers:
Token thresholds:
If a structural block still exceeds the hard maximum, it is recursively split further.
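The recursive fallback can be pictured with a small sketch. It assumes a hypothetical `Block` tree (a block's own text plus nested child blocks) and a caller-supplied `hard_max`, and it counts whitespace-separated words instead of real tokens. The tier ordering and threshold values used by Moveworks are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical structural block: own text plus nested child blocks."""
    text: str = ""
    children: List["Block"] = field(default_factory=list)

    def full_text(self) -> str:
        """Text of this block followed by the text of all descendants."""
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


def count_tokens(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())


def split_words(text: str, hard_max: int) -> List[str]:
    """Last-resort split of unstructured text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + hard_max])
            for i in range(0, len(words), hard_max)]


def chunk_block(block: Block, hard_max: int) -> List[str]:
    """Keep a block whole if it fits; otherwise recurse into its children."""
    whole = block.full_text()
    if not whole:
        return []
    if count_tokens(whole) <= hard_max:
        return [whole]
    chunks: List[str] = []
    if block.text:
        chunks.extend(split_words(block.text, hard_max))
    for child in block.children:
        chunks.extend(chunk_block(child, hard_max))
    return chunks
```

The design choice illustrated here is that a block is only broken apart when it cannot fit as a unit, so chunk boundaries follow the document's structure wherever possible and fall back to plain splitting only at the leaves.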