Chunkr’s document processing API converts PDFs, PowerPoint presentations, Word documents, and images into structured, RAG-ready chunks with layout analysis, OCR, and semantic processing.

Quick Start

1. Upload a Document

Send a POST request to /api/v1/task/parse with your document:
curl -X POST https://api.chunkr.ai/api/v1/task/parse \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/document.pdf",
    "ocr_strategy": "All",
    "segmentation_strategy": "LayoutAnalysis"
  }'
The API returns a task object with a task_id for polling.

2. Poll for Completion

Use the task ID to check processing status:
curl https://api.chunkr.ai/api/v1/task/{task_id} \
  -H "Authorization: YOUR_API_KEY"
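
The polling step can be wrapped in a small loop. The sketch below is one possible approach, not part of any Chunkr client: `fetch_task` and `wait_for_task` are hypothetical helper names, the fetch call is injected so the loop can be exercised without network access, and it assumes that any status other than Succeeded or Failed means the task is still in progress.

```python
import json
import time
import urllib.request


def fetch_task(task_id, api_key, base_url="https://api.chunkr.ai/api/v1"):
    """GET the task object, mirroring the curl call above."""
    request = urllib.request.Request(
        f"{base_url}/task/{task_id}",
        headers={"Authorization": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_task(fetch, max_attempts=10, base_delay=1.0, max_delay=30.0,
                  sleep=time.sleep):
    """Poll until the task is Succeeded or Failed, doubling the delay each time.

    `fetch` is any zero-argument callable that returns the task dict.
    """
    delay = base_delay
    for _ in range(max_attempts):
        task = fetch()
        # Assumption: every other status means the task is still in progress.
        if task["status"] in ("Succeeded", "Failed"):
            return task
        sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError(f"task not finished after {max_attempts} attempts")
```

A call might look like `task = wait_for_task(lambda: fetch_task(task_id, "YOUR_API_KEY"))`. Capping the delay keeps long-running tasks from being polled too rarely.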

3. Retrieve Results

Once status is Succeeded, access the processed output:
Python
# Access the processed chunks
for chunk in task["output"]["chunks"]:
    print(f"Chunk ID: {chunk['chunk_id']}")
    print(f"Chunk Length: {chunk['chunk_length']} tokens")
    
    # Access segments within the chunk
    for segment in chunk["segments"]:
        print(f"Type: {segment['segment_type']}")
        print(f"Content: {segment['content']}")
        print(f"Text: {segment['text']}")

Configuration Options

Segmentation Strategy

Controls how the document is analyzed and segmented.
Default strategy - Analyzes document layout and detects different element types:
{
  "segmentation_strategy": "LayoutAnalysis"
}
Detects:
  • Title, SectionHeader, Text, ListItem
  • Table, Picture, Caption
  • Formula, Footnote
  • PageHeader, PageFooter
Best for: Most documents requiring accurate structure detection
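
Once a task has finished, the detected element types can be used to pull out specific parts of a document. The following is a minimal sketch, assuming the response shape used elsewhere in this guide; `segments_of_type` is a hypothetical helper and the sample data is made up for illustration.

```python
def segments_of_type(task, wanted_types):
    """Yield (chunk_id, segment) pairs whose segment_type is in wanted_types."""
    wanted = set(wanted_types)
    for chunk in task["output"]["chunks"]:
        for segment in chunk["segments"]:
            if segment["segment_type"] in wanted:
                yield chunk["chunk_id"], segment


# Illustrative data only, shaped like the task output in this guide.
sample_task = {
    "output": {
        "chunks": [
            {
                "chunk_id": "c1",
                "segments": [
                    {"segment_type": "SectionHeader", "content": "Results"},
                    {"segment_type": "Table", "content": "<table>...</table>"},
                ],
            },
            {
                "chunk_id": "c2",
                "segments": [
                    {"segment_type": "PageFooter", "content": "Page 2"},
                ],
            },
        ]
    }
}

# Collect the content of every Table segment.
tables = [seg["content"] for _, seg in segments_of_type(sample_task, ["Table"])]
print(tables)
```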

OCR Strategy

Controls optical character recognition processing.
Default - Processes all pages with OCR:
{
  "ocr_strategy": "All"
}
Adds roughly 0.5 seconds of latency per page.

High Resolution Processing

Enables high-resolution images for better quality cropping and post-processing:
{
  "high_resolution": true
}
Adds roughly 7 seconds of latency per page but significantly improves image quality.
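
To get a feel for how these per-page figures add up, here is a rough back-of-the-envelope sketch. It assumes the two approximate costs quoted above simply add per page and scale linearly with page count, which may not hold if pages are processed in parallel; `estimate_added_latency` is a hypothetical helper.

```python
OCR_ALL_SECONDS_PER_PAGE = 0.5   # approximate figure quoted above
HIGH_RES_SECONDS_PER_PAGE = 7.0  # approximate figure quoted above


def estimate_added_latency(page_count, ocr_all=True, high_resolution=False):
    """Return a rough estimate, in seconds, of the latency added by these options."""
    per_page = 0.0
    if ocr_all:
        per_page += OCR_ALL_SECONDS_PER_PAGE
    if high_resolution:
        per_page += HIGH_RES_SECONDS_PER_PAGE
    return page_count * per_page


# A 10-page document with both options enabled: 10 * (0.5 + 7.0)
print(estimate_added_latency(10, ocr_all=True, high_resolution=True))  # 75.0
```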

Advanced Examples

Complete Configuration

Python
import requests
import base64

with open("document.pdf", "rb") as f:
    file_data = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.chunkr.ai/api/v1/task/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "file": f"data:application/pdf;base64,{file_data}",
        "file_name": "document.pdf",
        "ocr_strategy": "All",
        "segmentation_strategy": "LayoutAnalysis",
        "high_resolution": True,  # Python booleans are capitalized
        "expires_in": 86400,  # 24 hours
        "chunk_processing": {
            "target_length": 512,
            "ignore_headers_and_footers": True,
            "tokenizer": "Word"
        }
    }
)

task = response.json()
print(f"Task ID: {task['task_id']}")
print(f"Status: {task['status']}")

Updating Task Configuration

To reprocess a task with different settings, send a PATCH request. The task must have status Succeeded or Failed before it can be updated:
Python
response = requests.patch(
    f"https://api.chunkr.ai/api/v1/task/{task_id}/parse",
    headers={"Authorization": "YOUR_API_KEY"},
    json={
        "ocr_strategy": "Auto",
        "high_resolution": False
    }
)

updated_task = response.json()

Deleting Tasks

Python
response = requests.delete(
    f"https://api.chunkr.ai/api/v1/task/{task_id}",
    headers={"Authorization": "YOUR_API_KEY"}
)

Canceling Tasks

Cancel a task that hasn’t started processing:
Python
response = requests.get(
    f"https://api.chunkr.ai/api/v1/task/{task_id}/cancel",
    headers={"Authorization": "YOUR_API_KEY"}
)
The task must have status Starting to be cancelled.
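
Because updating and canceling each require a specific task status, it can help to check the status locally before sending the request. This is only a sketch built from the status rules stated in this guide; `can_update` and `can_cancel` are hypothetical helpers, and the API remains the final authority on what is allowed.

```python
UPDATABLE_STATUSES = {"Succeeded", "Failed"}  # statuses that allow an update
CANCELLABLE_STATUSES = {"Starting"}           # status that allows a cancel


def can_update(task):
    """True if the task dict is in a status that permits reprocessing."""
    return task["status"] in UPDATABLE_STATUSES


def can_cancel(task):
    """True if the task dict has not started processing yet."""
    return task["status"] in CANCELLABLE_STATUSES


task = {"task_id": "example", "status": "Starting"}
print(can_cancel(task), can_update(task))  # True False
```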

Error Handling

Error Handling Strategy

Control how errors are handled during processing:
Default - Stops processing on any error:
{
  "error_handling": "Fail"
}

Common Error Responses

| Status Code | Error | Description |
|---|---|---|
| 400 | Bad Request | Invalid configuration or file format |
| 404 | Not Found | Task not found or expired |
| 413 | Payload Too Large | File size exceeds limits |
| 429 | Too Many Requests | Usage limit exceeded |
| 500 | Internal Server Error | Processing failed |
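
Client code often needs to decide whether a failed request is worth retrying. The sketch below maps the status codes listed above to their descriptions. Treating 429 and 500 as retryable is an assumption made for illustration, not documented API behavior, and `describe_error` and `should_retry` are hypothetical helpers.

```python
ERROR_DESCRIPTIONS = {
    400: "Invalid configuration or file format",
    404: "Task not found or expired",
    413: "File size exceeds limits",
    429: "Usage limit exceeded",
    500: "Processing failed",
}

# Assumption: rate limiting and server-side failures may succeed on a later try.
RETRYABLE_STATUS_CODES = {429, 500}


def describe_error(status_code):
    """Return the description for a known status code, or a generic fallback."""
    return ERROR_DESCRIPTIONS.get(status_code, f"Unexpected status {status_code}")


def should_retry(status_code):
    """True if the request is worth retrying under the assumption above."""
    return status_code in RETRYABLE_STATUS_CODES


for code in (400, 429):
    print(code, describe_error(code), "retry" if should_retry(code) else "do not retry")
```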

Response Structure

See core/src/routes/task.rs:20-48 for the complete response schema.
{
  "task_id": "uuid",
  "status": "Succeeded",
  "created_at": "2024-01-01T00:00:00Z",
  "started_at": "2024-01-01T00:00:01Z",
  "finished_at": "2024-01-01T00:00:15Z",
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://...",
  "output": {
    "chunks": [
      {
        "chunk_id": "uuid",
        "chunk_length": 256,
        "segments": [
          {
            "segment_id": "uuid",
            "segment_type": "Text",
            "content": "Generated content (HTML or Markdown)",
            "text": "OCR extracted text",
            "html": "HTML representation",
            "markdown": "Markdown representation",
            "bbox": {"left": 0, "top": 0, "width": 100, "height": 50},
            "page_number": 1,
            "page_width": 612,
            "page_height": 792
          }
        ],
        "embed": "Text to be embedded"
      }
    ]
  }
}
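
For retrieval pipelines, the response above can be flattened into one record per chunk. The following is a minimal sketch, assuming the field names shown in the response structure; `chunks_to_records` and the record keys are hypothetical choices, not part of the API.

```python
def chunks_to_records(task):
    """Flatten a finished task into one record per chunk for embedding."""
    records = []
    for chunk in task["output"]["chunks"]:
        # Collect the distinct pages this chunk's segments appear on.
        pages = sorted({segment["page_number"] for segment in chunk["segments"]})
        records.append({
            "id": chunk["chunk_id"],
            "text": chunk["embed"],
            "length": chunk["chunk_length"],
            "pages": pages,
            "source": task["file_name"],
        })
    return records


# Illustrative data only, trimmed to the fields the helper reads.
sample_task = {
    "file_name": "document.pdf",
    "output": {
        "chunks": [
            {
                "chunk_id": "c1",
                "chunk_length": 256,
                "embed": "Text to be embedded",
                "segments": [
                    {"segment_type": "Text", "page_number": 2},
                    {"segment_type": "Table", "page_number": 1},
                    {"segment_type": "Text", "page_number": 2},
                ],
            }
        ]
    },
}

records = chunks_to_records(sample_task)
print(records[0]["pages"])  # [1, 2]
```

Keeping the page numbers and source file name alongside the text makes it possible to cite where a retrieved chunk came from.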

Best Practices

  • Use LayoutAnalysis for complex documents with tables, images, and varied layouts
  • Use the Page strategy for simple text-only documents
  • Use the Auto OCR strategy to optimize speed when documents have good text layers
  • Set high_resolution: false for documents without important images
  • Use reasonable target_length values (512-1024 tokens)
  • Configure expires_in to automatically clean up old tasks
  • Poll tasks with exponential backoff to avoid rate limits
  • Store task IDs for later retrieval
  • Delete tasks when no longer needed to free resources

Next Steps