Chunkr provides flexible chunking strategies to combine document segments into optimal chunks for retrieval-augmented generation (RAG) and embedding systems.

Understanding Chunks vs Segments

Segments are individual layout elements (paragraphs, tables, images) detected during analysis. Chunks are groups of segments combined for embedding.
  • Segment: A single structural element (e.g., one paragraph, one table)
  • Chunk: One or more segments grouped together based on your target_length
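To make the distinction concrete, here is an illustrative chunk object. The field names (chunk_id, chunk_length, segments) follow the Complete Example later on this page; the values themselves are invented:

```python
# Invented example data; the shape mirrors the task output used later on this page.
chunk = {
    "chunk_id": "chunk-0",
    "chunk_length": 42,  # measured with the configured tokenizer
    "segments": [        # whole layout elements, never split across chunks
        {"segment_type": "SectionHeader", "content": "1. Introduction"},
        {"segment_type": "Text", "content": "Retrieval-augmented generation..."},
    ],
}
```

Two segments, one chunk: both elements fit within the configured target_length, so they are embedded together.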

Chunk Processing Configuration

Configure chunking behavior through the chunk_processing parameter:
Python
import requests

response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": "https://example.com/document.pdf",
        "chunk_processing": {
            "target_length": 512,
            "ignore_headers_and_footers": true,
            "tokenizer": "Word"
        }
    }
)

Target Length

Controls the approximate size of each chunk:
{
  "chunk_processing": {
    "target_length": 512
  }
}
Best for: Standard RAG applications and most embedding models
Chunkr never splits a segment across chunks; segments always remain intact. The target_length is an upper bound: Chunkr packs as many complete segments into each chunk as possible without exceeding it.
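The packing rule can be sketched as a greedy loop. This is my approximation of the documented behavior, not Chunkr's actual implementation; lengths are in whatever units the configured tokenizer produces:

```python
def pack_segments(segment_lengths, target_length):
    """Greedily group whole segments so each chunk stays within target_length."""
    chunks, current, current_len = [], [], 0
    for length in segment_lengths:
        # Start a new chunk when adding this segment would exceed the target.
        if current and current_len + length > target_length:
            chunks.append(current)
            current, current_len = [], 0
        current.append(length)  # segments are never split
        current_len += length
    if current:
        chunks.append(current)
    return chunks

pack_segments([200, 200, 200, 600], 512)
# [[200, 200], [200], [600]] - the 600-unit segment stays whole in its own chunk
```

Note that a single segment longer than target_length still forms its own chunk, consistent with segments never being broken apart.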

Tokenizer Selection

Choose how text length is measured:
{
  "chunk_processing": {
    "tokenizer": "Word"
  }
}
Word splits text on word boundaries; it is simple and fast. To count tokens the way OpenAI embedding models do, use Cl100kBase instead (see the examples below).
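For intuition, the Word measurement plausibly reduces to whitespace splitting. This is an assumption on my part; Chunkr's exact word-boundary rules may differ:

```python
def word_count(text: str) -> int:
    """Approximate the "Word" tokenizer: count whitespace-delimited words."""
    return len(text.split())

word_count("Chunkr groups segments into chunks.")  # 5
```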

Ignore Headers and Footers

Headers and footers can break reading order across pages. It’s recommended to ignore them.
{
  "chunk_processing": {
    "ignore_headers_and_footers": true
  }
}
Default: true. When enabled, page headers and footers are excluded from chunking but remain available in the output if needed.
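In effect, the setting filters out the PageHeader and PageFooter segment types before packing. A sketch of that filtering (my approximation, using the segment type names listed later on this page):

```python
def drop_headers_footers(segments):
    """Exclude page furniture from chunking, as ignore_headers_and_footers does."""
    skip = {"PageHeader", "PageFooter"}
    return [s for s in segments if s["segment_type"] not in skip]
```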

Segment Processing

Control how individual segments are processed and what content is included in chunks.

Content Format Selection

Each segment type can be configured with a preferred output format:
Python
response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": "https://example.com/document.pdf",
        "segment_processing": {
            "Text": {
                "format": "Markdown",
                "strategy": "Auto"
            },
            "Table": {
                "format": "Html",
                "strategy": "LLM"
            },
            "Picture": {
                "format": "Markdown",
                "strategy": "Auto"
            }
        }
    }
)

Format Options

{
  "format": "Markdown"
}
A text-based format suited to most content, and the default for most segment types. Tables default to Html instead.

Generation Strategy

{
  "strategy": "Auto"
}
Auto uses heuristic-based generation and is fast and efficient. The alternative, LLM, has a model generate the content; it is the default strategy for tables, formulas, and pictures (see the segment type table below).

Embed Sources

Control which content is included in the chunk’s embed field:
Python
{
    "segment_processing": {
        "Text": {
            "format": "Markdown",
            "embed_sources": ["Content", "LLM"]
        },
        "Table": {
            "format": "Html",
            "embed_sources": ["Content"]
        }
    }
}
Available sources:
  • Content: The generated content (HTML or Markdown based on format)
  • LLM: Custom LLM-generated output (when configured)
  • HTML: ⚠️ Deprecated - use Content with format: Html
  • Markdown: ⚠️ Deprecated - use Content with format: Markdown
The order of sources in the array determines the sequence in the embed field. For example, ["Content", "LLM"] means content appears first, followed by LLM output.
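A sketch of how the embed field is plausibly assembled from embed_sources. The segment keys content and llm here are my assumptions for illustration, not documented output field names:

```python
def build_embed(segment, embed_sources):
    """Concatenate the requested sources in the order they are listed."""
    parts = []
    for source in embed_sources:
        if source == "Content":
            parts.append(segment["content"])  # generated HTML or Markdown
        elif source == "LLM" and segment.get("llm"):
            parts.append(segment["llm"])      # custom LLM output, if configured
    return "\n".join(parts)

build_embed({"content": "A revenue table.", "llm": "Quarterly revenue summary."},
            ["Content", "LLM"])
# "A revenue table.\nQuarterly revenue summary."
```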

Complete Example

Python
import requests
import time

config = {
    "file": "https://example.com/research-paper.pdf",
    "segmentation_strategy": "LayoutAnalysis",
    "ocr_strategy": "All",
    "chunk_processing": {
        "target_length": 512,
        "ignore_headers_and_footers": true,
        "tokenizer": "Cl100kBase"  # For OpenAI embeddings
    },
    "segment_processing": {
        "Title": {
            "format": "Markdown",
            "strategy": "Auto",
            "embed_sources": ["Content"]
        },
        "Text": {
            "format": "Markdown",
            "strategy": "Auto",
            "embed_sources": ["Content"]
        },
        "Table": {
            "format": "Html",
            "strategy": "LLM",  # Use VLM for better table processing
            "embed_sources": ["Content"]
        },
        "Picture": {
            "format": "Markdown",
            "strategy": "Auto",
            "crop_image": "All",
            "embed_sources": ["Content"]
        },
        "Formula": {
            "format": "Markdown",
            "strategy": "LLM",
            "embed_sources": ["Content"]
        }
    }
}

# Create task
response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json=config
)
task = response.json()
task_id = task["task_id"]

# Poll for completion
while True:
    response = requests.get(
        f"https://api.chunkr.ai/api/v1/task/{task_id}",
        headers={"Authorization": "YOUR_API_KEY"}
    )
    task = response.json()
    
    if task["status"] == "Succeeded":
        break
    elif task["status"] == "Failed":
        raise Exception("Task failed")
    
    time.sleep(2)

# Process chunks
for chunk in task["output"]["chunks"]:
    print(f"\nChunk {chunk['chunk_id']}")
    print(f"Length: {chunk['chunk_length']} tokens")
    print(f"Segments: {len(chunk['segments'])}")
    print(f"\nEmbed content:\n{chunk['embed'][:200]}...")

Segment Types

All available segment types you can configure:
Segment Type    Default Format  Default Strategy  Description
Title           Markdown        Auto              Document titles
SectionHeader   Markdown        Auto              Section headings
Text            Markdown        Auto              Body paragraphs
ListItem        Markdown        Auto              Bullet/numbered lists
Table           Html            LLM               Tables and grids
Picture         Markdown        LLM               Images and figures
Caption         Markdown        Auto              Image/table captions
Formula         Markdown        LLM               Mathematical formulas
Footnote        Markdown        Auto              Footnotes
PageHeader      Markdown        Auto              Page headers
PageFooter      Markdown        Auto              Page footers
Page            Markdown        LLM               Full page (when using Page strategy)

Chunking Strategies by Use Case

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  }
}
Balanced chunks for good context and retrieval precision.
{
  "chunk_processing": {
    "target_length": 1024,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  }
}
Larger chunks preserve more context.
{
  "chunk_processing": {
    "target_length": 0,
    "tokenizer": "Word",
    "ignore_headers_and_footers": false
  }
}
One segment per chunk for maximum granularity.
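The three presets above can be kept as a small lookup for reuse across requests. The preset names ("balanced", "context_rich", "per_segment") are mine; the values are copied from the JSON examples:

```python
# Preset names are illustrative; the config values come from the examples above.
CHUNK_CONFIGS = {
    "balanced": {        # standard RAG
        "target_length": 512,
        "tokenizer": "Cl100kBase",
        "ignore_headers_and_footers": True,
    },
    "context_rich": {    # larger chunks, more context per chunk
        "target_length": 1024,
        "tokenizer": "Cl100kBase",
        "ignore_headers_and_footers": True,
    },
    "per_segment": {     # one segment per chunk
        "target_length": 0,
        "tokenizer": "Word",
        "ignore_headers_and_footers": False,
    },
}
```

Pass one of these dicts as the chunk_processing value in the parse request.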

Best Practices

  1. Match the tokenizer to your embedding model - Use Cl100kBase for OpenAI models; for other providers, use the tokenizer their models expect
  2. Start with defaults - The default 512 tokens works well for most cases
  3. Consider your retrieval strategy - Smaller chunks for precise retrieval, larger for context-rich answers
  4. Use segment types strategically - Configure different formats for different content (HTML for tables, Markdown for text)
  5. Test with your data - Optimal settings vary by document type and use case

Next Steps