Document Chunking and Snippetization Overview
When you connect a knowledge source — a PDF, a wiki article, a SharePoint file — Moveworks breaks it into searchable pieces called snippets or chunks. This two-stage process ensures that search results are relevant, appropriately sized, and structured around your content’s natural layout.
Stage 1: Parsing Your Document
Before any chunking happens, Moveworks reads and interprets the document based on its file type.
PDFs
PDF text is extracted using one of three engines (selected automatically or per-configuration):
- PDFMiner *(default)* — Splits the document into page ranges and extracts text in parallel for speed.
- PyPDF — Extracts text one page at a time.
- PDFium — A high-fidelity parser backed by Google’s PDF rendering engine. After extraction, it identifies paragraph boundaries by detecting sentence-ending punctuation (`.`, `!`, `?`) followed by a capitalized new sentence.
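The paragraph-boundary heuristic used after PDFium extraction can be sketched as follows. This is a simplified illustration, not the actual engine: it assumes extraction yields a list of text lines, and starts a new paragraph when the previous line ends a sentence and the next begins with a capital letter.

```python
def split_paragraphs(lines: list[str]) -> list[str]:
    """Join extracted lines into paragraphs; start a new paragraph when
    the previous line ends with sentence-ending punctuation (., !, ?)
    and the next line begins with a capitalized word."""
    paragraphs, current = [], []
    for line in lines:
        if (current
                and current[-1].rstrip().endswith((".", "!", "?"))
                and line.lstrip()[:1].isupper()):
            paragraphs.append(" ".join(current))
            current = []
        current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Note that this heuristic intentionally keeps a line like `jumps over the fence` attached to the previous line, since a paragraph break is only declared when both conditions hold.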
Limits that apply to all PDFs:
- Maximum file size: 25 MB
- Maximum pages processed: 100
- Certain PDF generators (e.g., ArchiCAD) produce non-standard files and are skipped with an error.
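The limits above amount to a pre-flight check before parsing. A minimal sketch, with the caveat that the exact behavior (hard error on size, truncation on page count) is an assumption about how the limits are enforced:

```python
MAX_BYTES = 25 * 1024 * 1024  # 25 MB file-size limit
MAX_PAGES = 100               # page-count cap

def preflight(file_size: int, page_count: int) -> int:
    """Illustrative enforcement of the PDF limits: reject oversized
    files, and process at most the first MAX_PAGES pages."""
    if file_size > MAX_BYTES:
        raise ValueError("PDF exceeds the 25 MB size limit")
    return min(page_count, MAX_PAGES)
```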
HTML / Web Articles / Knowledge Base Pages
Moveworks walks the HTML structure and maps it to a document tree.
Confluence-specific macros (tabs, panels, etc.) are converted to standard HTML before processing.
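Walking HTML into structural blocks can be sketched with Python's standard-library parser. The set of block tags below is an assumption for illustration; the real mapping to a document tree is richer than this flat block list.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "h3", "p", "li", "table"}  # assumed block set

class BlockCollector(HTMLParser):
    """Collect (tag, text) blocks while walking an HTML document."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append([tag, []])

    def handle_data(self, data):
        if self._stack and data.strip():
            self._stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            t, parts = self._stack.pop()
            self.blocks.append((t, " ".join(parts).strip()))

parser = BlockCollector()
parser.feed("<h2>Setup</h2><p>Install the agent.</p>")
```

Because Confluence macros are normalized to standard HTML first, a walker like this only ever has to handle ordinary tags.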
Other File Types
- PowerPoint (PPTX) — Each slide is treated as its own unit.
- Word documents (DOCX) — Extracted with heading structure preserved.
- Plain text — Processed directly without structural parsing.
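The file-type handling above is effectively a dispatch on extension. A sketch, where the strategy names are placeholders rather than real Moveworks identifiers:

```python
STRATEGIES = {
    "pdf": "pdf_engine",      # PDFMiner (default), PyPDF, or PDFium
    "html": "html_tree",
    "htm": "html_tree",
    "pptx": "per_slide",      # each slide is its own unit
    "docx": "heading_aware",  # heading structure preserved
}

def pick_parser(filename: str) -> str:
    """Choose a parsing strategy from the file extension;
    anything unrecognized falls back to plain-text processing."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return STRATEGIES.get(ext, "plain_text")
```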
Stage 2: Chunking into Snippets
Once the document is parsed into a structured representation, it is divided into snippets — the units that get indexed and returned in search results. Two chunking strategies are available.
Strategy A: Fixed-Size Chunking
Used for PDFs, plain text, and knowledge base articles.
Text is split using a cascading hierarchy of splitters, from coarse to fine. A splitter is only applied if the current chunk still exceeds the token limit after the previous level.
Token limit: 200 tokens per chunk by default (configurable). Token counting uses the same tokenizer as GPT-3.5 Turbo.
Segments are then greedily packed — consecutive segments are joined together until the next one would push the chunk over the limit.
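The greedy packing step can be sketched as follows. Whitespace tokenization stands in for the real GPT-3.5 Turbo tokenizer (tiktoken's `cl100k_base`), purely to keep the example self-contained:

```python
def count_tokens(text: str) -> int:
    # Whitespace proxy; the real pipeline uses the GPT-3.5 Turbo tokenizer.
    return len(text.split())

def pack_segments(segments: list[str], limit: int = 200) -> list[str]:
    """Greedily join consecutive segments until adding the next one
    would push the chunk over the token limit."""
    chunks, current = [], ""
    for seg in segments:
        candidate = f"{current} {seg}".strip() if current else seg
        if current and count_tokens(candidate) > limit:
            chunks.append(current)  # close the chunk, start a new one
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Greedy packing means a segment is never split across two chunks; only the cascading splitters above ever cut inside a segment.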
Strategy B: Structure-Aware Dynamic Chunking
Used for HTML documents when structure-aware mode is enabled.
Instead of splitting blindly by token count, this strategy uses the document’s own structure to find natural chunk boundaries, prioritized in tiers.
Token thresholds:
- Minimum chunk size: 8 tokens (smaller chunks are merged)
- Target chunk size: 256 tokens
- Hard maximum: 512 tokens (tables and lists: 1,024 tokens)
If a structural block still exceeds the hard maximum, it is recursively split further.
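The recursive fallback for oversized blocks can be sketched like this. Splitting at the sentence midpoint is an illustrative choice; the real implementation splits along the structural tiers described above.

```python
MIN_TOKENS = 8      # chunks below this are merged with a neighbor
TARGET_TOKENS = 256
MAX_TOKENS = 512    # hard maximum (1,024 for tables and lists)

def tokens(text: str) -> int:
    return len(text.split())  # whitespace proxy for the real tokenizer

def split_block(text: str, limit: int = MAX_TOKENS) -> list[str]:
    """Recursively split a structural block until every piece fits
    under the hard maximum; here we halve at sentence boundaries."""
    if tokens(text) <= limit:
        return [text]
    sentences = text.split(". ")
    mid = len(sentences) // 2
    left = ". ".join(sentences[:mid])
    right = ". ".join(sentences[mid:])
    if not left or not right:  # no sentence boundary left to split on
        return [text]
    return split_block(left, limit) + split_block(right, limit)
```

A table would call the same routine with `limit=1024`, reflecting the looser cap for tables and lists.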
What Happens After Chunking
Guardrails
- A per-request timeout is enforced throughout the pipeline. If parsing or chunking takes too long, the request fails gracefully rather than hanging.
- Empty documents, password-protected PDFs, and oversized files all return specific error codes rather than silently producing empty results.
- The strategy that produces the most snippets wins when multiple strategies are eligible — maximizing coverage of your document’s content.
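The tie-breaking rule in the last point reduces to a one-liner. A sketch, assuming each eligible strategy has already produced its candidate snippet list:

```python
def choose_snippets(candidates: dict[str, list[str]]) -> list[str]:
    """When multiple chunking strategies are eligible, keep the output
    of whichever one produced the most snippets."""
    return max(candidates.values(), key=len)
```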