Processing Documents

Chunkr’s document processing API converts PDFs, PowerPoint presentations, Word documents, and images into structured, RAG-ready chunks with layout analysis, OCR, and semantic processing.

Quick Start

Upload a Document

Send a POST request to /api/v1/task/parse with your document:

curl -X POST https://api.chunkr.ai/api/v1/task/parse \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/document.pdf",
    "ocr_strategy": "All",
    "segmentation_strategy": "LayoutAnalysis"
  }'

The API returns a task object with a task_id for polling.

Poll for Completion

Use the task ID to check processing status:

curl https://api.chunkr.ai/api/v1/task/{task_id} \
  -H "Authorization: YOUR_API_KEY"
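A polling loop around this request might look like the sketch below. The helper name `poll_until_done`, its parameters, and the choice of Succeeded and Failed as the terminal statuses are assumptions made for illustration; they are not part of the Chunkr API.

```python
import time
from typing import Callable, Dict, Iterable


def poll_until_done(
    fetch_task: Callable[[], Dict],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
    terminal_statuses: Iterable[str] = ("Succeeded", "Failed"),
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_task repeatedly until the task reports a terminal status.

    fetch_task is any zero-argument callable that returns the task as a dict,
    for example a wrapper around the GET request shown above.
    """
    terminal = set(terminal_statuses)
    for attempt in range(max_attempts):
        task = fetch_task()
        if task.get("status") in terminal:
            return task
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Task did not finish after {max_attempts} attempts")
```

Passing the request as a callable keeps the loop independent of the HTTP client; with requests, fetch_task could be a small function that issues the GET call and returns response.json().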

Retrieve Results

Once status is Succeeded, access the processed output:

Python

# `task` is the parsed JSON body returned by the polling request above.
for chunk in task["output"]["chunks"]:
    print(f"Chunk ID: {chunk['chunk_id']}")
    print(f"Chunk Length: {chunk['chunk_length']} tokens")

    # Each chunk contains one or more segments
    for segment in chunk["segments"]:
        print(f"Type: {segment['segment_type']}")
        print(f"Content: {segment['content']}")
        print(f"Text: {segment['text']}")

Configuration Options

Segmentation Strategy

Controls how the document is analyzed and segmented.

LayoutAnalysis (default)

Analyzes document layout and detects different element types:

{
  "segmentation_strategy": "LayoutAnalysis"
}

Detects:

Title, SectionHeader, Text, ListItem
Table, Picture, Caption
Formula, Footnote
PageHeader, PageFooter

Best for: Most documents requiring accurate structure detection

Page

Treats each page as a single segment:

{
  "segmentation_strategy": "Page"
}

Best for: Simple documents or when page-level processing is sufficient

OCR Strategy

Controls optical character recognition processing.

All (default)

Processes all pages with OCR:

{
  "ocr_strategy": "All"
}

Adds about 0.5 seconds of latency per page.

Auto

Selectively applies OCR only when needed:

{
  "ocr_strategy": "Auto"
}

Uses the existing text layer when available and applies OCR only to pages with missing or low-quality text.

High Resolution Processing

Enables high-resolution images for better quality cropping and post-processing:

{
  "high_resolution": true
}

Adds about 7 seconds of latency per page but significantly improves image quality.
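The per-page figures quoted in this section (about 0.5 seconds for OCR on all pages, about 7 seconds for high resolution) can be turned into a rough estimate of added latency. The sketch below assumes the two overheads are linear in page count and simply add up, which the documentation does not state; the function name is hypothetical.

```python
# Approximate per-page overheads quoted in this guide.
OCR_ALL_SECONDS_PER_PAGE = 0.5
HIGH_RES_SECONDS_PER_PAGE = 7.0


def estimate_added_latency(
    page_count: int,
    ocr_all: bool = True,
    high_resolution: bool = False,
) -> float:
    """Return a rough estimate, in seconds, of the latency the options add.

    Assumption: overheads scale linearly with pages and are additive.
    The Auto OCR strategy is treated as adding nothing, so the result is
    a lower bound when Auto ends up applying OCR to some pages.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    per_page = 0.0
    if ocr_all:
        per_page += OCR_ALL_SECONDS_PER_PAGE
    if high_resolution:
        per_page += HIGH_RES_SECONDS_PER_PAGE
    return page_count * per_page
```

Under these assumptions, a 10-page document with both options enabled adds roughly 10 × (0.5 + 7) = 75 seconds.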

Advanced Examples

Complete Configuration

Python

import requests
import base64

with open("document.pdf", "rb") as f:
    file_data = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": f"data:application/pdf;base64,{file_data}",
        "file_name": "document.pdf",
        "ocr_strategy": "All",
        "segmentation_strategy": "LayoutAnalysis",
        "high_resolution": True,  # Python booleans are capitalized
        "expires_in": 86400,  # 24 hours
        "chunk_processing": {
            "target_length": 512,
            "ignore_headers_and_footers": True,
            "tokenizer": "Word"
        }
    }
)

task = response.json()
print(f"Task ID: {task['task_id']}")
print(f"Status: {task['status']}")
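The example above hard-codes the application/pdf prefix. For other file types, the data URI could be built with a small helper like the one below. The helper name `to_data_uri` is hypothetical, and whether the API accepts every MIME type produced this way (including the application/octet-stream fallback) is an assumption that should be checked against the API reference.

```python
import base64
import mimetypes


def to_data_uri(file_name: str, data: bytes) -> str:
    """Encode raw file bytes as a base64 data URI.

    The MIME type is guessed from the file extension using the standard
    library; unknown extensions fall back to application/octet-stream.
    """
    mime_type, _ = mimetypes.guess_type(file_name)
    if mime_type is None:
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed as the "file" value in the request body, in place of the inline f-string used above.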

Updating Task Configuration

You can update a finished task to reprocess it with different settings.

The task must have status Succeeded or Failed before it can be updated.

Python

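# Assumes `import requests` and a `task_id` from an earlier response.
response = requests.patch(
    f"https://api.chunkr.ai/api/v1/task/{task_id}/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "ocr_strategy": "Auto",
        "high_resolution": False  # Python booleans are capitalized
    }
)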

updated_task = response.json()

Deleting Tasks

Python

response = requests.delete(
    f"https://api.chunkr.ai/api/v1/task/{task_id}",
    headers={"Authorization": "YOUR_API_KEY"}
)

Canceling Tasks

Cancel a task that hasn’t started processing:

Python

response = requests.get(
    f"https://api.chunkr.ai/api/v1/task/{task_id}/cancel",
    headers={"Authorization": "YOUR_API_KEY"}
)

The task must have status Starting to be cancelled.
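The status requirements for updating and cancelling can be checked on the client before a request is sent. This is a minimal sketch based only on the statuses named in this guide; the helper `allowed_actions` is hypothetical, and the server remains the authority on what is permitted.

```python
from typing import List

# Status requirements as stated in this guide.
UPDATABLE_STATUSES = {"Succeeded", "Failed"}
CANCELLABLE_STATUSES = {"Starting"}


def allowed_actions(status: str) -> List[str]:
    """Return which of "update" and "cancel" the guide permits for a status."""
    actions = []
    if status in UPDATABLE_STATUSES:
        actions.append("update")
    if status in CANCELLABLE_STATUSES:
        actions.append("cancel")
    return actions
```

A caller could skip the PATCH or cancel request when the action is not in the returned list, which avoids a round trip that would be rejected.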

Error Handling

Error Handling Strategy

Control how errors are handled during processing:

Fail (default)

Stops processing on any error:

{
  "error_handling": "Fail"
}

Continue

Attempts to continue despite non-critical errors:

{
  "error_handling": "Continue"
}

Useful for handling LLM refusals or partial processing failures.
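Because the strategy options are case-sensitive strings, a client-side check can catch typos before a request is sent. The sketch below validates only the values listed in this guide; the function `validate_config` is hypothetical, and the API may accept options that are not covered here.

```python
from typing import Dict, List

# Option values documented in this guide.
ALLOWED_VALUES = {
    "segmentation_strategy": {"LayoutAnalysis", "Page"},
    "ocr_strategy": {"All", "Auto"},
    "error_handling": {"Fail", "Continue"},
}


def validate_config(config: Dict) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        if key in config and config[key] not in allowed:
            problems.append(
                f"{key}: {config[key]!r} is not one of {sorted(allowed)}"
            )
    # high_resolution must be a real boolean, not the string "true".
    if "high_resolution" in config and not isinstance(config["high_resolution"], bool):
        problems.append("high_resolution: expected a boolean")
    return problems
```

Keys that are absent are not reported, since the guide describes a default for each option.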

Common Error Responses

Status Code	Error	Description
400	Bad Request	Invalid configuration or file format
404	Not Found	Task not found or expired
413	Payload Too Large	File size exceeds limits
429	Too Many Requests	Usage limit exceeded
500	Internal Server Error	Processing failed
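The table above can be mirrored in client code to produce readable error messages. In this sketch, treating 429 and 500 as retryable is an assumption, not something the API documentation states, and `describe_error` is a hypothetical helper.

```python
from typing import Dict

# Descriptions copied from the table above.
ERROR_DESCRIPTIONS = {
    400: "Invalid configuration or file format",
    404: "Task not found or expired",
    413: "File size exceeds limits",
    429: "Usage limit exceeded",
    500: "Processing failed",
}

# Assumption: only rate limiting and server-side failures are worth retrying.
RETRYABLE_STATUS_CODES = {429, 500}


def describe_error(status_code: int) -> Dict:
    """Summarize an error status code using the table above."""
    return {
        "status_code": status_code,
        "description": ERROR_DESCRIPTIONS.get(status_code, "Unknown error"),
        "retryable": status_code in RETRYABLE_STATUS_CODES,
    }
```

With requests, the input would typically be response.status_code, checked when the response is not successful.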

Response Structure

See core/src/routes/task.rs:20-48 for the complete response schema.

{
  "task_id": "uuid",
  "status": "Succeeded",
  "created_at": "2024-01-01T00:00:00Z",
  "started_at": "2024-01-01T00:00:01Z",
  "finished_at": "2024-01-01T00:00:15Z",
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://...",
  "output": {
    "chunks": [
      {
        "chunk_id": "uuid",
        "chunk_length": 256,
        "segments": [
          {
            "segment_id": "uuid",
            "segment_type": "Text",
            "content": "Generated content (HTML or Markdown)",
            "text": "OCR extracted text",
            "html": "HTML representation",
            "markdown": "Markdown representation",
            "bbox": {"left": 0, "top": 0, "width": 100, "height": 50},
            "page_number": 1,
            "page_width": 612,
            "page_height": 792
          }
        ],
        "embed": "Text to be embedded"
      }
    ]
  }
}
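Given a response of this shape, the text to embed can be collected together with the pages each chunk spans. The sketch below relies only on the fields shown above (chunk_id, embed, segments, page_number); the function name `embedding_records` and the output record layout are illustrative choices.

```python
from typing import Dict, List


def embedding_records(task: Dict) -> List[Dict]:
    """Build one record per chunk: its ID, embed text, and sorted page numbers."""
    records = []
    # "output" may be missing or null while a task is still running.
    output = task.get("output") or {}
    for chunk in output.get("chunks", []):
        pages = sorted({seg["page_number"] for seg in chunk.get("segments", [])})
        records.append(
            {
                "chunk_id": chunk["chunk_id"],
                "text": chunk["embed"],
                "pages": pages,
            }
        )
    return records
```

Each record can then be passed to an embedding model and stored with its page numbers for citation in a RAG pipeline.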

Best Practices

Choose appropriate strategies

Use LayoutAnalysis for complex documents with tables, images, and varied layouts
Use Page strategy for simple text-only documents
Use Auto OCR strategy to optimize speed when documents have good text layers

Optimize for performance

Set high_resolution: false for documents without important images
Use reasonable target_length values (512-1024 tokens)
Configure expires_in to automatically clean up old tasks

Handle task lifecycle

Poll tasks with exponential backoff to avoid rate limits
Store task IDs for later retrieval
Delete tasks when no longer needed to free resources
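The exponential backoff mentioned under task lifecycle can be expressed as a generator of wait times. This is a minimal sketch; the base delay, growth factor, and cap are illustrative values, not recommendations from the API documentation.

```python
from typing import Iterator


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
) -> Iterator[float]:
    """Yield an endless sequence of delays that grow by `factor` up to `cap`."""
    delay = min(base, cap)
    while True:
        yield delay
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap)
```

A polling loop would take one value from the generator after each unfinished status check and sleep for that long before the next request. Capping the delay keeps long-running tasks from being checked too rarely.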

Next Steps

Learn about custom chunking strategies
Configure VLM processing for enhanced content generation
Review the migration guide for API changes
