Document Chunking and Snippetization Overview

When you connect a knowledge source (a PDF, a wiki article, a SharePoint file), Moveworks breaks it into searchable pieces called snippets or chunks. The process has two stages, parsing and chunking, which together keep search results relevant, appropriately sized, and structured around your content’s natural layout.

Stage 1: Parsing Your Document

Before any chunking happens, Moveworks reads and interprets the document based on its file type.

PDFs

PDF text is extracted using one of three engines, selected automatically or by configuration:

PDFMiner (default) — Splits the document into page ranges and extracts text in parallel for speed.
PyPDF — Extracts text one page at a time.
PDFium — A high-fidelity parser backed by Google’s PDF rendering engine. After extraction, it identifies paragraph boundaries by detecting sentence-ending punctuation (., !, ?) followed by a capitalized new sentence.
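
The paragraph-boundary rule described for PDFium can be sketched with a regular expression. This is only an illustration of the idea, not the actual parser: the function name `split_paragraphs` is hypothetical, and it assumes that a boundary is a line break preceded by ., ! or ? and followed by an uppercase letter, with all other line breaks treated as soft wraps.

```python
import re

# Assumed rule: a line break is a paragraph boundary when the text before it
# ends in '.', '!' or '?' and the text after it starts with an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])[ \t]*\n\s*(?=[A-Z])")


def split_paragraphs(text: str) -> list[str]:
    """Split extracted text into paragraphs and unwrap soft line breaks."""
    paragraphs = []
    for part in _BOUNDARY.split(text):
        # Any newline left inside a part is treated as a soft wrap.
        cleaned = " ".join(part.split())
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


sample = (
    "Reset your password from the portal.\n"
    "The link expires after one hour\n"
    "and can be requested again.\n"
    "Contact IT if the email never arrives!"
)
paragraphs = split_paragraphs(sample)
for p in paragraphs:
    print(p)
```

A rule like this misfires on abbreviations ("e.g.\nThe") and on headings without punctuation, which is one reason a production parser would likely combine it with layout information.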

Limits that apply to all PDFs:

Maximum file size: 25 MB
Maximum pages processed: 100
Files produced by certain PDF generators (e.g., ArchiCAD) are non-standard and are skipped with an error.
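
As a rough sketch, the limits above could be checked before parsing begins. Everything named here (`PdfInfo`, `check_pdf`, the reason strings, the producer denylist) is hypothetical. The sketch also assumes that "25 MB" means 25 MiB and that pages beyond the cap are ignored rather than causing a rejection; the source does not say which applies.

```python
from dataclasses import dataclass

MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB
MAX_PAGES = 100

# Hypothetical denylist; the source names ArchiCAD only as an example.
UNSUPPORTED_PRODUCERS = ("archicad",)


@dataclass
class PdfInfo:
    size_bytes: int
    page_count: int
    producer: str = ""


def check_pdf(info: PdfInfo) -> tuple[bool, str, int]:
    """Return (accepted, reason, pages_to_process) for a PDF."""
    if info.size_bytes > MAX_FILE_BYTES:
        return False, "file_too_large", 0
    if any(name in info.producer.lower() for name in UNSUPPORTED_PRODUCERS):
        return False, "unsupported_generator", 0
    # Assumption: only the first MAX_PAGES pages are processed.
    return True, "ok", min(info.page_count, MAX_PAGES)


print(check_pdf(PdfInfo(size_bytes=2_000_000, page_count=250)))
```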

HTML / Web Articles / Knowledge Base Pages

Moveworks walks the HTML structure and maps it to a document tree:

HTML Element	How It’s Treated
Headings (H1–H6)	Recognized as section titles
Paragraphs, divs, sections	Become structural groupings
Lists (ordered & unordered)	Preserved as list structures
Tables	Preserved as table structures
Code / pre-formatted blocks	Preserved with exact formatting
Links	Extracted with their target URL
Images	Represented by their alt text
Navigation, footers, scripts, forms	Skipped entirely

Confluence-specific macros (tabs, panels, etc.) are converted to standard HTML before processing.
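
To make the mapping in the table concrete, here is a minimal sketch built on Python's standard `html.parser` module. It is not the Moveworks parser: the class name `OutlineParser` is hypothetical, it produces a flat list of (kind, text) items instead of a real document tree, and it covers only headings, links, images, and skipped elements.

```python
from html.parser import HTMLParser

SKIPPED = {"nav", "footer", "script", "form"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collects (kind, text) items; a flat stand-in for a document tree."""

    def __init__(self) -> None:
        super().__init__()
        self.items: list[tuple[str, str]] = []
        self._skip_depth = 0   # greater than 0 while inside a skipped element
        self._kind = "text"    # kind assigned to the next run of text
        self._href = None      # target URL of the link currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIPPED:
            self._skip_depth += 1
        elif self._skip_depth:
            return
        elif tag in HEADINGS:
            self._kind = "heading"
        elif tag == "a":
            self._href = attrs.get("href")
        elif tag == "img":
            # Images are represented by their alt text.
            self.items.append(("image", attrs.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in HEADINGS:
            self._kind = "text"
        elif tag == "a":
            self._href = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self._skip_depth:
            return
        if self._href is not None:
            self.items.append(("link", f"{text} -> {self._href}"))
        else:
            self.items.append((self._kind, text))


html = """
<nav><a href="/home">Home</a></nav>
<h1>VPN Setup</h1>
<p>Install the client, then read the <a href="https://example.com/faq">FAQ</a>.</p>
<img src="vpn.png" alt="VPN login screen">
<script>track();</script>
<footer>Internal use only</footer>
"""
parser = OutlineParser()
parser.feed(html)
parser.close()
items = parser.items
for kind, text in items:
    print(kind, "|", text)
```

The navigation link, the script body, and the footer text never reach `items`, while the heading, the link target, and the image alt text do.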

Other File Types

PowerPoint (PPTX) — Each slide is treated as its own unit.
Word documents (DOCX) — Extracted with heading structure preserved.
Plain text — Processed directly without structural parsing.

Stage 2: Chunking into Snippets

Once the document is parsed into a structured representation, it is divided into snippets — the units that get indexed and returned in search results. Two chunking strategies are available.

Strategy A: Fixed-Size Chunking

Used for PDFs, plain text, and knowledge base articles.

Text is split using a cascading hierarchy of splitters, from coarse to fine. A splitter is only applied if the current chunk still exceeds the token limit after the previous level:

Level	Splits on	Applied when
1 — Paragraph	Blank lines / double newlines	Chunk exceeds limit
2 — Sentence	Sentence boundaries	Chunk still exceeds limit
3 — Line	Single newlines	Chunk still exceeds limit
4 — Word	Word boundaries	Chunk still exceeds limit
5 — Character	Hard character cutoff	Last resort only

Token limit: 200 tokens per chunk by default (configurable). Token counting uses the same tokenizer as GPT-3.5 Turbo.

Segments are then greedily packed — consecutive segments are joined together until the next one would push the chunk over the limit.
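
The cascade and the greedy packing step can be sketched together as follows. This is an illustration under stated assumptions, not the production chunker: `count_tokens` is a crude stand-in that assumes roughly 4 characters per token instead of the GPT-3.5 Turbo tokenizer, the function names are hypothetical, and packed segments are joined with a single space.

```python
import math
import re


def count_tokens(text: str) -> int:
    # Stand-in tokenizer: about 4 characters per token (an assumption).
    return math.ceil(len(text) / 4)


# Coarse-to-fine splitters for levels 1 to 4 of the table.
SPLITTERS = [
    lambda t: re.split(r"\n\s*\n", t),        # 1: paragraph
    lambda t: re.split(r"(?<=[.!?])\s+", t),  # 2: sentence
    lambda t: t.split("\n"),                  # 3: line
    lambda t: t.split(),                      # 4: word
]


def split_to_fit(text: str, limit: int, level: int = 0) -> list[str]:
    """Split only as finely as needed for every segment to fit the limit."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= limit:
        return [text]  # already fits, so no finer splitter is applied
    if level == len(SPLITTERS):
        # 5: hard character cutoff, the last resort.
        width = limit * 4
        return [text[i:i + width] for i in range(0, len(text), width)]
    segments: list[str] = []
    for piece in SPLITTERS[level](text):
        segments.extend(split_to_fit(piece, limit, level + 1))
    return segments


def pack(segments: list[str], limit: int) -> list[str]:
    """Greedily join consecutive segments until the next one would overflow."""
    chunks: list[str] = []
    current = ""
    for seg in segments:
        candidate = f"{current} {seg}" if current else seg
        if current and count_tokens(candidate) > limit:
            chunks.append(current)
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text: str, limit: int = 200) -> list[str]:
    return pack(split_to_fit(text, limit), limit)


sample = (
    "Alpha beta gamma. Delta epsilon zeta.\n\n"
    "Eta theta iota kappa lambda mu nu xi omicron pi rho sigma tau."
)
chunks = chunk_text(sample, limit=10)
for c in chunks:
    print(count_tokens(c), repr(c))
```

With the small limit used here, the first paragraph fits as a single segment and is left intact, while the second has to fall through to the word level before being repacked.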

Strategy B: Structure-Aware Dynamic Chunking

Used for HTML documents when structure-aware mode is enabled.

Instead of splitting blindly by token count, this strategy uses the document’s own structure to find natural chunk boundaries, prioritized in tiers:

Priority	Boundary Type	Examples
Highest	Headings, horizontal rules, sections	`<h2>`, `<hr>`, `<section>`
High	Generic containers	`<div>`
Medium	Paragraphs	`<p>`
Lower	Tables, lists, code blocks	`<table>`, `<ul>`, `<pre>`

Token thresholds:

Minimum chunk size: 8 tokens (smaller chunks are merged)
Target chunk size: 256 tokens
Hard maximum: 512 tokens (tables and lists: 1,024 tokens)

If a structural block still exceeds the hard maximum, it is recursively split further.
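
One way to picture the tiers and thresholds working together is the sketch below. It is a simplified model, not the actual algorithm: the document is assumed to be a flat list of hypothetical `Block` objects, tokens are approximated by whitespace-separated words, an oversized block is split by words as a stand-in for recursive splitting, and undersized chunks are merged forward into the next chunk (the source does not state the merge direction).

```python
from dataclasses import dataclass

MIN_TOKENS, TARGET_TOKENS = 8, 256
HARD_MAX, HARD_MAX_TABLE_LIST = 512, 1024

# Boundary tiers from highest to lowest priority, following the table.
TIERS = [
    {"h1", "h2", "h3", "h4", "h5", "h6", "hr", "section"},
    {"div"},
    {"p"},
    {"table", "ul", "ol", "pre"},
]


@dataclass
class Block:
    tag: str
    text: str


def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer


def total(blocks: list[Block]) -> int:
    return sum(count_tokens(b.text) for b in blocks)


def split_at_tier(blocks: list[Block], tier: int = 0) -> list[list[Block]]:
    """Split an oversized run of blocks at the highest-priority boundaries first."""
    if total(blocks) <= TARGET_TOKENS or len(blocks) == 1 or tier == len(TIERS):
        return [blocks]
    groups: list[list[Block]] = [[]]
    for block in blocks:
        if block.tag in TIERS[tier] and groups[-1]:
            groups.append([])  # a boundary element starts a new group
        groups[-1].append(block)
    result: list[list[Block]] = []
    for group in groups:
        result.extend(split_at_tier(group, tier + 1))
    return result


def enforce_hard_max(group: list[Block]) -> list[str]:
    """Word-split any group that still exceeds its hard maximum."""
    only_tables_or_lists = all(b.tag in {"table", "ul", "ol"} for b in group)
    cap = HARD_MAX_TABLE_LIST if only_tables_or_lists else HARD_MAX
    words = " ".join(b.text for b in group).split()
    return [" ".join(words[i:i + cap]) for i in range(0, len(words), cap)]


def merge_small(chunks: list[str]) -> list[str]:
    """Carry chunks under the minimum size forward into the next chunk."""
    merged: list[str] = []
    carry = ""
    for chunk in chunks:
        chunk = f"{carry} {chunk}".strip()
        if count_tokens(chunk) < MIN_TOKENS:
            carry = chunk
        else:
            merged.append(chunk)
            carry = ""
    if carry:  # a small trailing chunk joins the previous one, if any
        if merged:
            merged[-1] = f"{merged[-1]} {carry}"
        else:
            merged.append(carry)
    return merged


def chunk_blocks(blocks: list[Block]) -> list[str]:
    chunks: list[str] = []
    for group in split_at_tier(blocks):
        chunks.extend(enforce_hard_max(group))
    return merge_small(chunks)
```

For example, a document with two `<h2>` sections is first divided at the headings; only a section that still exceeds the target is divided again at its paragraphs, and a heading left on its own is merged into the text that follows it.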

What Happens After Chunking

Optional Step	What It Does
Language detection	Identifies the language of the document for multilingual search routing
Sentence annotation	Annotates each snippet with its individual sentences for more precise highlighting

Guardrails

A per-request timeout is enforced throughout the pipeline. If parsing or chunking takes too long, the request fails gracefully rather than hanging.
Empty documents, password-protected PDFs, and oversized files all return specific error codes rather than silently producing empty results.
When multiple chunking strategies are eligible, the one that produces the most snippets is used, which maximizes coverage of your document’s content.
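
The selection rule in the last guardrail can be sketched in a few lines. The names `pick_best`, `by_paragraph`, and `by_line`, and the `empty_document` error string, are hypothetical; the real pipeline's strategies and error codes are not given in this overview.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def pick_best(text: str, strategies: dict[str, Chunker]) -> tuple[str, list[str]]:
    """Run every eligible strategy and keep the one with the most snippets."""
    if not text.strip():
        # Fail loudly instead of returning an empty result.
        raise ValueError("empty_document")  # hypothetical error code
    results = {name: chunk(text) for name, chunk in strategies.items()}
    # On a tie, max() keeps the first strategy in insertion order.
    best = max(results, key=lambda name: len(results[name]))
    return best, results[best]


def by_paragraph(text: str) -> list[str]:
    return [p for p in text.split("\n\n") if p.strip()]


def by_line(text: str) -> list[str]:
    return [line for line in text.splitlines() if line.strip()]


strategies = {"by_paragraph": by_paragraph, "by_line": by_line}
name, snippets = pick_best("first line\nsecond line\n\nthird line", strategies)
print(name, len(snippets))
```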