Chunkr provides flexible chunking strategies to combine document segments into optimal chunks for retrieval-augmented generation (RAG) and embedding systems.

Understanding Chunks vs Segments

Segments are individual layout elements (paragraphs, tables, images) detected during analysis. Chunks are groups of segments combined for embedding.
  • Segment: A single structural element (e.g., one paragraph, one table)
  • Chunk: One or more segments grouped together based on your target_length
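To make the distinction concrete, here is an illustrative chunk object. The field names (chunk_id, chunk_length, segments) follow the Complete Example later on this page; the values themselves are invented:

```python
# Invented example data; the shape mirrors the task output used later on this page.
chunk = {
    "chunk_id": "chunk-0",
    "chunk_length": 42,  # measured with the configured tokenizer
    "segments": [        # whole layout elements, never split across chunks
        {"segment_type": "SectionHeader", "content": "1. Introduction"},
        {"segment_type": "Text", "content": "Retrieval-augmented generation..."},
    ],
}
```

Two segments, one chunk: both elements fit within the configured target_length, so they are embedded together.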

Chunk Processing Configuration

Configure chunking behavior through the chunk_processing parameter:
Python
import requests

response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": "https://example.com/document.pdf",
        "chunk_processing": {
            "target_length": 512,
            "ignore_headers_and_footers": true,
            "tokenizer": "Word"
        }
    }
)

Target Length

Controls the approximate size of each chunk:
{
  "chunk_processing": {
    "target_length": 512
  }
}
Best for: Standard RAG applications and most embedding models
Chunkr never splits a segment across chunks; segments always remain intact. The target_length is an upper bound: Chunkr packs as many complete segments into each chunk as possible without exceeding it.
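The packing rule can be sketched as a greedy loop. This is my approximation of the documented behavior, not Chunkr's actual implementation; lengths are in whatever units the configured tokenizer produces:

```python
def pack_segments(segment_lengths, target_length):
    """Greedily group whole segments so each chunk stays within target_length."""
    chunks, current, current_len = [], [], 0
    for length in segment_lengths:
        # Start a new chunk when adding this segment would exceed the target.
        if current and current_len + length > target_length:
            chunks.append(current)
            current, current_len = [], 0
        current.append(length)  # segments are never split
        current_len += length
    if current:
        chunks.append(current)
    return chunks

pack_segments([200, 200, 200, 600], 512)
# [[200, 200], [200], [600]] - the 600-unit segment stays whole in its own chunk
```

Note that a single segment longer than target_length still forms its own chunk, consistent with segments never being broken apart.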

Tokenizer Selection

Choose how text length is measured:
{
  "chunk_processing": {
    "tokenizer": "Word"
  }
}
Word splits text on word boundaries; it is simple and fast. To count tokens the way OpenAI embedding models do, use Cl100kBase instead (see the examples below).
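For intuition, the Word measurement plausibly reduces to whitespace splitting. This is an assumption on my part; Chunkr's exact word-boundary rules may differ:

```python
def word_count(text: str) -> int:
    """Approximate the "Word" tokenizer: count whitespace-delimited words."""
    return len(text.split())

word_count("Chunkr groups segments into chunks.")  # 5
```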

Ignore Headers and Footers

Headers and footers can break reading order across pages. It’s recommended to ignore them.
{
  "chunk_processing": {
    "ignore_headers_and_footers": true
  }
}
Default: true. When enabled, page headers and footers are excluded from chunking but remain available in the output if needed.
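In effect, the setting filters out the PageHeader and PageFooter segment types before packing. A sketch of that filtering (my approximation, using the segment type names listed later on this page):

```python
def drop_headers_footers(segments):
    """Exclude page furniture from chunking, as ignore_headers_and_footers does."""
    skip = {"PageHeader", "PageFooter"}
    return [s for s in segments if s["segment_type"] not in skip]
```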

Segment Processing

Control how individual segments are processed and what content is included in chunks.

Content Format Selection

Each segment type can be configured with a preferred output format:
Python
response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": "https://example.com/document.pdf",
        "segment_processing": {
            "Text": {
                "format": "Markdown",
                "strategy": "Auto"
            },
            "Table": {
                "format": "Html",
                "strategy": "LLM"
            },
            "Picture": {
                "format": "Markdown",
                "strategy": "Auto"
            }
        }
    }
)

Format Options

{
  "format": "Markdown"
}
A text-based format suited to most content, and the default for most segment types. Tables default to Html instead.

Generation Strategy

{
  "strategy": "Auto"
}
Auto uses heuristic-based generation and is fast and efficient. The alternative, LLM, has a model generate the content; it is the default strategy for tables, formulas, and pictures (see the segment type table below).

Embed Sources

Control which content is included in the chunk’s embed field:
Python
{
    "segment_processing": {
        "Text": {
            "format": "Markdown",
            "embed_sources": ["Content", "LLM"]
        },
        "Table": {
            "format": "Html",
            "embed_sources": ["Content"]
        }
    }
}
Available sources:
  • Content: The generated content (HTML or Markdown based on format)
  • LLM: Custom LLM-generated output (when configured)
  • HTML: ⚠️ Deprecated - use Content with format: Html
  • Markdown: ⚠️ Deprecated - use Content with format: Markdown
The order of sources in the array determines the sequence in the embed field. For example, ["Content", "LLM"] means content appears first, followed by LLM output.
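A sketch of how the embed field is plausibly assembled from embed_sources. The segment keys content and llm here are my assumptions for illustration, not documented output field names:

```python
def build_embed(segment, embed_sources):
    """Concatenate the requested sources in the order they are listed."""
    parts = []
    for source in embed_sources:
        if source == "Content":
            parts.append(segment["content"])  # generated HTML or Markdown
        elif source == "LLM" and segment.get("llm"):
            parts.append(segment["llm"])      # custom LLM output, if configured
    return "\n".join(parts)

build_embed({"content": "A revenue table.", "llm": "Quarterly revenue summary."},
            ["Content", "LLM"])
# "A revenue table.\nQuarterly revenue summary."
```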

Complete Example

Python
import requests
import time

config = {
    "file": "https://example.com/research-paper.pdf",
    "segmentation_strategy": "LayoutAnalysis",
    "ocr_strategy": "All",
    "chunk_processing": {
        "target_length": 512,
        "ignore_headers_and_footers": true,
        "tokenizer": "Cl100kBase"  # For OpenAI embeddings
    },
    "segment_processing": {
        "Title": {
            "format": "Markdown",
            "strategy": "Auto",
            "embed_sources": ["Content"]
        },
        "Text": {
            "format": "Markdown",
            "strategy": "Auto",
            "embed_sources": ["Content"]
        },
        "Table": {
            "format": "Html",
            "strategy": "LLM",  # Use VLM for better table processing
            "embed_sources": ["Content"]
        },
        "Picture": {
            "format": "Markdown",
            "strategy": "Auto",
            "crop_image": "All",
            "embed_sources": ["Content"]
        },
        "Formula": {
            "format": "Markdown",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        }
    }
}

# Create task
response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json=config
)
task = response.json()
task_id = task["task_id"]

# Poll for completion
while True:
    response = requests.get(
        f"https://api.chunkr.ai/api/v1/task/{task_id}",
        headers={"Authorization": "YOUR_API_KEY"}
    )
    task = response.json()
    
    if task["status"] == "Succeeded":
        break
    elif task["status"] == "Failed":
        raise Exception("Task failed")
    
    time.sleep(2)

# Process chunks
for chunk in task["output"]["chunks"]:
    print(f"\nChunk {chunk['chunk_id']}")
    print(f"Length: {chunk['chunk_length']} tokens")
    print(f"Segments: {len(chunk['segments'])}")
    print(f"\nEmbed content:\n{chunk['embed'][:200]}...")

Segment Types

All available segment types you can configure:
Segment Type    Default Format  Default Strategy  Description
Title           Markdown        Auto              Document titles
SectionHeader   Markdown        Auto              Section headings
Text            Markdown        Auto              Body paragraphs
ListItem        Markdown        Auto              Bullet/numbered lists
Table           Html            LLM               Tables and grids
Picture         Markdown        LLM               Images and figures
Caption         Markdown        Auto              Image/table captions
Formula         Markdown        LLM               Mathematical formulas
Footnote        Markdown        Auto              Footnotes
PageHeader      Markdown        Auto              Page headers
PageFooter      Markdown        Auto              Page footers
Page            Markdown        LLM               Full page (when using Page strategy)

Chunking Strategies by Use Case

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  }
}
Balanced chunks for good context and retrieval precision.
{
  "chunk_processing": {
    "target_length": 1024,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  }
}
Larger chunks preserve more context.
{
  "chunk_processing": {
    "target_length": 0,
    "tokenizer": "Word",
    "ignore_headers_and_footers": false
  }
}
One segment per chunk for maximum granularity.
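The three presets above can be kept as a small lookup for reuse across requests. The preset names ("balanced", "context_rich", "per_segment") are mine; the values are copied from the JSON examples:

```python
# Preset names are illustrative; the config values come from the examples above.
CHUNK_CONFIGS = {
    "balanced": {        # standard RAG
        "target_length": 512,
        "tokenizer": "Cl100kBase",
        "ignore_headers_and_footers": True,
    },
    "context_rich": {    # larger chunks, more context per chunk
        "target_length": 1024,
        "tokenizer": "Cl100kBase",
        "ignore_headers_and_footers": True,
    },
    "per_segment": {     # one segment per chunk
        "target_length": 0,
        "tokenizer": "Word",
        "ignore_headers_and_footers": False,
    },
}
```

Pass one of these dicts as the chunk_processing value in the parse request.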

Best Practices

  1. Match the tokenizer to your embedding model - Use Cl100kBase for OpenAI models; for other providers, use the tokenizer their models expect
  2. Start with defaults - The default 512 tokens works well for most cases
  3. Consider your retrieval strategy - Smaller chunks for precise retrieval, larger for context-rich answers
  4. Use segment types strategically - Configure different formats for different content (HTML for tables, Markdown for text)
  5. Test with your data - Optimal settings vary by document type and use case

Next Steps