---
title: Document Chunking and Snippetization Overview
slug: ai-assistant/enterprise-search/document-chunking-and-snippetization-overview
---

When you connect a knowledge source — a PDF, a wiki article, a SharePoint file — Moveworks breaks it into searchable pieces called **snippets** (or chunks). This two-stage process ensures that search results are relevant, appropriately sized, and structured around your content's natural layout.

## Stage 1: Parsing Your Document

Before any chunking happens, Moveworks reads and interprets the document based on its file type.

### PDFs

PDF text is extracted using one of three engines (selected automatically or per configuration):

* **PDFMiner** *(default)* — Splits the document into page ranges and extracts text in parallel for speed.
* **PyPDF** — Extracts text one page at a time.
* **PDFium** — A high-fidelity parser backed by Google's PDF rendering engine.

After extraction, the parser identifies paragraph boundaries by detecting sentence-ending punctuation (`.`, `!`, `?`) followed by a new sentence that starts with a capital letter.

**Limits that apply to all PDFs:**

* Maximum file size: **25 MB**
* Maximum pages processed: **100**
* Certain PDF generators (e.g., ArchiCAD) produce non-standard files and are skipped with an error.
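The boundary-detection heuristic above can be sketched with a regular expression. This is a minimal approximation for illustration only; the exact rule the parser applies is internal, and `split_paragraphs` is a hypothetical name:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split wherever sentence-ending punctuation (., !, ?) is followed by
    # whitespace and a capitalized new sentence, per the heuristic above.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

segments = split_paragraphs("Reset your password first. Then contact IT! Done?")
```

Note that this heuristic will over-split on abbreviations like "e.g." followed by a capitalized word, which is one reason real parsers layer additional checks on top.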
### HTML / Web Articles / Knowledge Base Pages

Moveworks walks the HTML structure and maps it to a document tree:

| **HTML Element** | **How It's Treated** |
| --- | --- |
| Headings (H1–H6) | Recognized as section titles |
| Paragraphs, divs, sections | Become structural groupings |
| Lists (ordered & unordered) | Preserved as list structures |
| Tables | Preserved as table structures |
| Code / pre-formatted blocks | Preserved with exact formatting |
| Links | Extracted with their target URL |
| Images | Represented by their alt text |
| Navigation, footers, scripts, forms | **Skipped entirely** |

Confluence-specific macros (tabs, panels, etc.) are converted to standard HTML before processing.

### Other File Types

* **PowerPoint (PPTX)** — Each slide is treated as its own unit.
* **Word documents (DOCX)** — Extracted with heading structure preserved.
* **Plain text** — Processed directly without structural parsing.

## Stage 2: Chunking into Snippets

Once the document is parsed into a structured representation, it is divided into snippets — the units that get indexed and returned in search results. Two chunking strategies are available.

### Strategy A: Fixed-Size Chunking

*Used for PDFs, plain text, and knowledge base articles.*

Text is split using a **cascading hierarchy of splitters**, from coarse to fine.
A splitter is only applied if the current chunk still exceeds the token limit after the previous level:

| **Level** | **Splits on** | **Applied when** |
| --- | --- | --- |
| 1 — Paragraph | Blank lines / double newlines | Chunk exceeds limit |
| 2 — Sentence | Sentence boundaries | Chunk still exceeds limit |
| 3 — Line | Single newlines | Chunk still exceeds limit |
| 4 — Word | Word boundaries | Chunk still exceeds limit |
| 5 — Character | Hard character cutoff | Last resort only |

**Token limit:** **200 tokens** per chunk by default (configurable). Token counting uses the same tokenizer as GPT-3.5 Turbo.

Segments are then **greedily packed** — consecutive segments are joined together until the next one would push the chunk over the limit.

### Strategy B: Structure-Aware Dynamic Chunking

*Used for HTML documents when structure-aware mode is enabled.*

Instead of splitting blindly by token count, this strategy uses the **document's own structure** to find natural chunk boundaries, prioritized in tiers:

| **Priority** | **Boundary Type** | **Examples** |
| --- | --- | --- |
| Highest | Headings, horizontal rules, sections | `<h1>`–`<h6>`, `<hr>`, `<section>` |
| High | Generic containers | `<div>` |
| Medium | Paragraphs | `<p>` |
| Lower | Tables, lists, code blocks | `<table>`, `<ul>`, `<pre>` |
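Strategy A's cascade and greedy packing can be sketched as follows. This is a simplified illustration, not the production implementation: the real pipeline counts tokens with the GPT-3.5 Turbo tokenizer, while here a whitespace word count stands in, and all function names are hypothetical:

```python
import re

# Coarse-to-fine separators mirroring the cascade: paragraph, sentence,
# line, word. (The character-level cutoff is the last resort below.)
SPLITTERS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\n", r"\s+"]

def count_tokens(text: str) -> int:
    # Stand-in for the real tokenizer: one "token" per whitespace word.
    return len(text.split())

def split_recursive(text: str, limit: int, level: int = 0) -> list[str]:
    if count_tokens(text) <= limit:
        return [text]
    if level == len(SPLITTERS):
        # Last resort: hard character cutoff.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    segments: list[str] = []
    for part in re.split(SPLITTERS[level], text):
        if part.strip():
            # Only descend to a finer splitter if this piece is still too big.
            segments.extend(split_recursive(part.strip(), limit, level + 1))
    return segments

def pack_greedily(segments: list[str], limit: int) -> list[str]:
    # Join consecutive segments until the next one would exceed the limit.
    chunks: list[str] = []
    current = ""
    for seg in segments:
        candidate = (current + " " + seg).strip()
        if current and count_tokens(candidate) > limit:
            chunks.append(current)
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def chunk(text: str, limit: int = 200) -> list[str]:
    return pack_greedily(split_recursive(text, limit), limit)
```

With a limit of 3 "tokens", `chunk("one two three four five six", limit=3)` packs the words back into two three-word chunks, showing how the cascade splits only as finely as needed and packing re-joins the pieces.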
    
**Token thresholds:**

* Minimum chunk size: **8 tokens** (smaller chunks are merged)
* Target chunk size: **256 tokens**
* Hard maximum: **512 tokens** (tables and lists: **1,024 tokens**)

If a structural block still exceeds the hard maximum, it is recursively split further.
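The threshold rules can be sketched as below. This is an illustrative sketch under stated assumptions, not the production logic: token counting is again stood in by a word count, the halving strategy for oversized blocks is an assumption, and all names are hypothetical:

```python
MIN_TOKENS = 8          # chunks below this are merged into a neighbor
HARD_MAX = {"default": 512, "table": 1024, "list": 1024}

def count_tokens(text: str) -> int:
    # Stand-in tokenizer: one "token" per whitespace word.
    return len(text.split())

def split_oversized(text: str, limit: int) -> list[str]:
    # Recursively split a block that exceeds its hard maximum.
    words = text.split()
    if len(words) <= limit:
        return [text]
    mid = len(words) // 2
    return (split_oversized(" ".join(words[:mid]), limit)
            + split_oversized(" ".join(words[mid:]), limit))

def merge_small(chunks: list[str]) -> list[str]:
    # Absorb chunks under the minimum size into the preceding chunk.
    merged: list[str] = []
    for chunk in chunks:
        if merged and count_tokens(merged[-1]) < MIN_TOKENS:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged

def enforce_limits(chunks: list[str], kind: str = "default") -> list[str]:
    limit = HARD_MAX.get(kind, HARD_MAX["default"])
    out: list[str] = []
    for chunk in chunks:
        out.extend(split_oversized(chunk, limit))
    return merge_small(out)
```

Note how tables and lists get the looser 1,024-token ceiling via `kind`, matching the thresholds above.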
    
## What Happens After Chunking

| **Optional Step** | **What It Does** |
| --- | --- |
| Language detection | Identifies the language of the document for multilingual search routing |
| Sentence annotation | Each snippet is further annotated with individual sentences for more precise highlighting |
    
## Guardrails

* A **per-request timeout** is enforced throughout the pipeline. If parsing or chunking takes too long, the request fails gracefully rather than hanging.
* Empty documents, password-protected PDFs, and oversized files all return specific error codes rather than silently producing empty results.
* The strategy that produces the **most snippets wins** when multiple strategies are eligible — maximizing coverage of your document's content.
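The "most snippets wins" rule amounts to a simple selection over candidate strategies. A minimal sketch, where the two toy strategies and the function name are purely illustrative:

```python
from typing import Callable

def chunk_document(text: str,
                   strategies: list[Callable[[str], list[str]]]) -> list[str]:
    # Run every eligible strategy and keep whichever yields the most snippets.
    results = [strategy(text) for strategy in strategies]
    return max(results, key=len)

# Toy stand-ins: one strategy splits on sentence-ish boundaries,
# the other returns the document whole.
fixed = lambda text: [s for s in text.split(". ") if s]
whole = lambda text: [text]

snippets = chunk_document("A. B. C", [fixed, whole])
```

Here `fixed` produces three snippets against `whole`'s one, so its output is selected, maximizing how much of the document is individually searchable.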