Document Chunking and Snippetization Overview

When you connect a knowledge source (a PDF, a wiki article, a SharePoint file), Moveworks breaks it into searchable pieces called snippets or chunks. It does this in two stages, parsing and then chunking, so that search results are relevant, appropriately sized, and structured around your content’s natural layout.

Stage 1: Parsing Your Document

Before any chunking happens, Moveworks reads and interprets the document based on its file type.

PDFs

PDF text is extracted using one of three engines (selected automatically or set in configuration):

  • PDFMiner (default) — Splits the document into page ranges and extracts text in parallel for speed.
  • PyPDF — Extracts text one page at a time.
  • PDFium — A high-fidelity parser backed by Google’s PDF rendering engine. After extraction, it identifies paragraph boundaries by detecting sentence-ending punctuation (., !, ?) followed by a capitalized new sentence.
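The paragraph-boundary heuristic described for PDFium can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the function name `split_paragraphs` is hypothetical, and the sketch assumes that a boundary also requires a line break between the punctuation and the capitalized word, which the description above does not state.

```python
import re

# Assumed boundary: sentence-ending punctuation, then a line break,
# then an uppercase letter starting the next sentence.
_BOUNDARY = re.compile(r"(?<=[.!?])[ \t]*\n\s*(?=[A-Z])")


def split_paragraphs(text: str) -> list:
    """Split extracted text where a sentence ends and a capitalized line begins."""
    parts = _BOUNDARY.split(text)
    # Collapse line wraps that fall mid-sentence inside each paragraph.
    return [" ".join(part.split()) for part in parts if part.strip()]


extracted = (
    "The policy applies to all staff.\n"
    "Contractors are covered\n"
    "under a separate plan.\n"
    "Questions? Ask HR!"
)
paragraphs = split_paragraphs(extracted)
```

Here the wrapped line "Contractors are covered" is not treated as a boundary because it does not end in sentence punctuation, and "Questions? Ask HR!" stays together because no line break follows the question mark.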

Limits that apply to all PDFs:

  • Maximum file size: 25 MB
  • Maximum pages processed: 100
  • Certain PDF generators (e.g., ArchiCAD) produce non-standard files and are skipped with an error.
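As a rough illustration of how the size and page limits could be applied before extraction, consider the sketch below. The names `PdfTooLargeError` and `pages_to_process` are hypothetical, and two behaviors are assumptions rather than documented facts: that "25 MB" means binary megabytes, and that pages beyond the first 100 are ignored rather than causing the whole file to be rejected.

```python
MAX_PDF_BYTES = 25 * 1024 * 1024  # 25 MB; binary megabytes are an assumption
MAX_PDF_PAGES = 100


class PdfTooLargeError(ValueError):
    """Hypothetical error type for files over the size limit."""


def pages_to_process(size_bytes: int, page_count: int) -> range:
    """Return the page indices to extract, or raise if the file is too large."""
    if size_bytes > MAX_PDF_BYTES:
        raise PdfTooLargeError(f"{size_bytes} bytes exceeds {MAX_PDF_BYTES}")
    # Assumption: only the first MAX_PDF_PAGES pages are read.
    return range(min(page_count, MAX_PDF_PAGES))
```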

HTML / Web Articles / Knowledge Base Pages

Moveworks walks the HTML structure and maps it to a document tree:

| HTML Element | How It’s Treated |
| --- | --- |
| Headings (H1–H6) | Recognized as section titles |
| Paragraphs, divs, sections | Become structural groupings |
| Lists (ordered & unordered) | Preserved as list structures |
| Tables | Preserved as table structures |
| Code / pre-formatted blocks | Preserved with exact formatting |
| Links | Extracted with their target URL |
| Images | Represented by their alt text |
| Navigation, footers, scripts, forms | Skipped entirely |

Confluence-specific macros (tabs, panels, etc.) are converted to standard HTML before processing.
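The element mapping in the table above can be approximated with Python's standard `html.parser` module. This is a simplified sketch, not the Moveworks parser: the class name `OutlineParser`, the flat `(kind, text)` output, and the exact set of skipped tags are illustrative assumptions, and only headings, paragraphs, image alt text, and skipped regions are handled.

```python
from html.parser import HTMLParser

SKIPPED = {"nav", "footer", "script", "form"}  # assumed tag names
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []       # (kind, text) tuples in document order
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._kind = None     # kind of the element currently collecting text
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif self._skip_depth:
            return
        elif tag in HEADINGS:
            self._kind, self._buf = "heading", []
        elif tag == "p":
            self._kind, self._buf = "paragraph", []
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.items.append(("image", alt))

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth and self._kind and (tag in HEADINGS or tag == "p"):
            text = " ".join("".join(self._buf).split())
            if text:
                self.items.append((self._kind, text))
            self._kind = None

    def handle_data(self, data):
        if not self._skip_depth and self._kind:
            self._buf.append(data)


def parse_outline(html: str) -> list:
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

A production parser would build a nested tree and also preserve lists, tables, code blocks, and link targets; the flat list keeps the sketch short.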

Other File Types

  • PowerPoint (PPTX) — Each slide is treated as its own unit.
  • Word documents (DOCX) — Extracted with heading structure preserved.
  • Plain text — Processed directly without structural parsing.

Stage 2: Chunking into Snippets

Once the document is parsed into a structured representation, it is divided into snippets — the units that get indexed and returned in search results. Two chunking strategies are available.

Strategy A: Fixed-Size Chunking

Used for PDFs, plain text, and knowledge base articles.

Text is split using a cascading hierarchy of splitters, from coarse to fine. A splitter is only applied if the current chunk still exceeds the token limit after the previous level:

| Level | Splits on | Applied when |
| --- | --- | --- |
| 1. Paragraph | Blank lines / double newlines | Chunk exceeds limit |
| 2. Sentence | Sentence boundaries | Chunk still exceeds limit |
| 3. Line | Single newlines | Chunk still exceeds limit |
| 4. Word | Word boundaries | Chunk still exceeds limit |
| 5. Character | Hard character cutoff | Last resort only |

Token limit: 200 tokens per chunk by default (configurable). Token counting uses the same tokenizer as GPT-3.5 Turbo.

Segments are then greedily packed — consecutive segments are joined together until the next one would push the chunk over the limit.
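The cascade and the greedy packing step can be sketched together. This is an illustrative approximation under stated assumptions: `count_tokens` is a whitespace word count standing in for the real tokenizer, the regular expressions are simplified guesses at each splitter, the character-level fallback is modeled as repeated halving, and all function names are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in counter: whitespace-separated words, not a real tokenizer.
    return len(text.split())


SPLITTERS = [
    lambda t: re.split(r"\n\s*\n", t),        # 1. paragraph
    lambda t: re.split(r"(?<=[.!?])\s+", t),  # 2. sentence
    lambda t: t.split("\n"),                  # 3. line
    lambda t: t.split(),                      # 4. word
]


def split_to_fit(text, limit, count=count_tokens, level=0):
    """Split text into segments that each fit the limit, coarse levels first."""
    text = text.strip()
    if not text:
        return []
    if count(text) <= limit or len(text) <= 1:
        return [text]
    if level == len(SPLITTERS):
        # 5. character: last resort, cut the string in half and retry.
        mid = len(text) // 2
        return (split_to_fit(text[:mid], limit, count, level)
                + split_to_fit(text[mid:], limit, count, level))
    segments = []
    for piece in SPLITTERS[level](text):
        # A finer splitter is applied only to pieces that are still too large.
        segments.extend(split_to_fit(piece, limit, count, level + 1))
    return segments


def pack(segments, limit, count=count_tokens):
    """Greedily join consecutive segments until the next would exceed the limit."""
    chunks, current = [], ""
    for seg in segments:
        candidate = f"{current} {seg}" if current else seg
        if current and count(candidate) > limit:
            chunks.append(current)
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text, limit=200, count=count_tokens):
    return pack(split_to_fit(text, limit, count), limit, count)
```

With a limit of 5 words, `chunk_text("One two three. Four five six seven.\n\nEight nine ten eleven twelve thirteen.", limit=5)` splits the first paragraph by sentence and the second by word, then packs the pieces back up to the limit. Joining with a single space discards the original paragraph breaks, which a real implementation might preserve.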

Strategy B: Structure-Aware Dynamic Chunking

Used for HTML documents when structure-aware mode is enabled.

Instead of splitting blindly by token count, this strategy uses the document’s own structure to find natural chunk boundaries, prioritized in tiers:

| Priority | Boundary Type | Examples |
| --- | --- | --- |
| Highest | Headings, horizontal rules, sections | `<h2>`, `<hr>`, `<section>` |
| High | Generic containers | `<div>` |
| Medium | Paragraphs | `<p>` |
| Lower | Tables, lists, code blocks | `<table>`, `<ul>`, `<pre>` |

Token thresholds:

  • Minimum chunk size: 8 tokens (smaller chunks are merged)
  • Target chunk size: 256 tokens
  • Hard maximum: 512 tokens (tables and lists: 1,024 tokens)

If a structural block still exceeds the hard maximum, it is recursively split further.
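One way to picture the tiered boundaries and thresholds is the sketch below. It is a loose approximation under several assumptions: the document is modeled as a flat list of `(tag, text)` blocks rather than a tree, tokens are counted as whitespace-separated words, undersized chunks are merged forward into the following chunk (the merge direction is not specified above), and the hard-maximum split of a single oversized block is omitted. All names are hypothetical.

```python
MIN_TOKENS, TARGET_TOKENS = 8, 256

# Boundary tiers from highest to lowest priority.
TIERS = [
    {"h1", "h2", "h3", "h4", "h5", "h6", "hr", "section"},
    {"div"},
    {"p"},
    {"table", "ul", "ol", "pre"},
]


def tokens(blocks):
    # Stand-in counter: whitespace-separated words.
    return sum(len(text.split()) for _, text in blocks)


def split_at_boundaries(blocks, target=TARGET_TOKENS, tier=0):
    """Split at the highest-priority boundaries first, recursing only if needed."""
    if tokens(blocks) <= target or len(blocks) == 1 or tier == len(TIERS):
        return [blocks]
    groups, current = [], []
    for block in blocks:
        if block[0] in TIERS[tier] and current:
            groups.append(current)  # a boundary tag starts a new group
            current = []
        current.append(block)
    groups.append(current)
    chunks = []
    for group in groups:
        chunks.extend(split_at_boundaries(group, target, tier + 1))
    return chunks


def merge_small(groups, minimum=MIN_TOKENS):
    """Merge undersized groups into the next one (assumed direction)."""
    merged, carry = [], []
    for group in groups:
        group = carry + group
        carry = []
        if tokens(group) < minimum:
            carry = group
        else:
            merged.append(group)
    if carry:
        if merged:
            merged[-1].extend(carry)
        else:
            merged.append(carry)
    return merged


def chunk_document(blocks, target=TARGET_TOKENS, minimum=MIN_TOKENS):
    if not blocks:
        return []
    groups = merge_small(split_at_boundaries(blocks, target), minimum)
    return ["\n".join(text for _, text in group) for group in groups]
```

Merging forward keeps a short heading attached to the content it introduces, which is why that direction was chosen for the sketch.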

What Happens After Chunking

| Optional Step | What It Does |
| --- | --- |
| Language detection | Identifies the language of the document for multilingual search routing |
| Sentence annotation | Each snippet is further annotated with individual sentences for more precise highlighting |

Guardrails

  • A per-request timeout is enforced throughout the pipeline. If parsing or chunking takes too long, the request fails gracefully rather than hanging.
  • Empty documents, password-protected PDFs, and oversized files all return specific error codes rather than silently producing empty results.
  • The strategy that produces the most snippets wins when multiple strategies are eligible — maximizing coverage of your document’s content.
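The last guardrail, choosing the strategy that yields the most snippets, amounts to a simple comparison. The sketch below is illustrative only: `pick_strategy` and the two toy strategies are hypothetical, and breaking ties in favor of the first listed strategy is an assumption.

```python
def pick_strategy(text, strategies):
    """Run every eligible strategy and keep the one producing the most snippets.

    `strategies` maps a name to a function that returns a list of snippets.
    """
    results = {name: chunk(text) for name, chunk in strategies.items()}
    # max() returns the first maximal key, so ties go to the earliest strategy.
    best = max(results, key=lambda name: len(results[name]))
    return best, results[best]


# Toy strategies for demonstration only.
toy_strategies = {
    "by_paragraph": lambda t: [p for p in t.split("\n\n") if p.strip()],
    "by_line": lambda t: [line for line in t.split("\n") if line.strip()],
}
name, snippets = pick_strategy("a\nb\n\nc", toy_strategies)
```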