Create Task

Endpoint

POST /api/v1/task/parse

Authentication

This endpoint requires API key authentication via the Authorization header:

Authorization: Bearer YOUR_API_KEY

Request Body

The request body must be JSON with the following schema:

file

string

required

The file to be uploaded. Can be a URL or a base64 encoded file.

file_name

string

The name of the file to be uploaded. If not set, a name will be generated.

ocr_strategy

enum

default:"All"

Controls the Optical Character Recognition (OCR) strategy:

All: Processes all pages with OCR. (Latency penalty: ~0.5 seconds per page)
Auto: Selectively applies OCR only to pages with missing or low-quality text. When text layer is present, the bounding boxes from the text layer are used.

segmentation_strategy

enum

default:"LayoutAnalysis"

Controls the segmentation strategy:

LayoutAnalysis: Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking.
Page: Treats each page as a single segment. Faster processing, but without layout element detection and only simple chunking.

high_resolution

boolean

default:true

Whether to use high-resolution images for cropping and post-processing. (Latency penalty: ~7 seconds per page)

expires_in

integer

The number of seconds until task is deleted. Expired tasks cannot be updated, polled, or accessed via web interface.

error_handling

enum

default:"Fail"

Controls how errors are handled during processing:

Fail: Stops processing and fails the task when any error occurs
Continue: Attempts to continue processing despite non-critical errors (e.g., LLM refusals)

chunk_processing

object

Controls the settings for chunking and post-processing of each chunk.

Show properties

target_length

integer

default:512

The target number of words in each chunk. If 0, each chunk will contain a single segment.

ignore_headers_and_footers

boolean

default:true

Whether to ignore headers and footers in the chunking process. This is recommended as headers and footers break reading order across pages.

tokenizer

string | enum

default:"Word"

The tokenizer to use for the chunking process. Can be:

Enum values: Word, Cl100kBase, xlm-roberta-base, bert-base-uncased
String: Any Hugging Face tokenizer model ID (e.g., “Qwen/Qwen-tokenizer”, “facebook/bart-large”)

segment_processing

object

Controls the post-processing of each segment type. Allows generation of HTML and Markdown from Chunkr models.

Show properties

Each segment type (Title, SectionHeader, Text, ListItem, Table, Picture, Caption, Formula, Footnote, PageHeader, PageFooter, Page) can be configured with:

crop_image

enum

Controls whether to crop images to the segment’s bounding box:

Auto: Only crop when needed for post-processing
All: Always crop (for Picture segments)

format

enum

default:"Markdown"

Output format: Html or Markdown

strategy

enum

Content generation strategy:

Auto: Use heuristics (default for most segments)
LLM: Use Chunkr fine-tuned models (default for Table, Formula, Page)

llm

string

Custom prompt for LLM model processing

embed_sources

array

default:["Content"]

Array of content sources to include in the chunk’s embed field: Content, LLM

extended_context

boolean

default:false

Use the full page image as context for LLM generation

llm_processing

object

Controls the LLM used for the task.

Show properties

model_id

string

The ID of the model to use for the task. If not provided, the default model will be used.

fallback_strategy

enum | object

default:"Default"

Fallback strategy for LLM processing:

None: No fallback
Default: System default fallback model
Model(string): Specific model ID as fallback

max_completion_tokens

integer

Maximum number of tokens to generate

temperature

number

Temperature for LLM generation

Response

task_id

string

The unique identifier for the task

status

enum

The status of the task: Starting, Processing, Succeeded, Failed, or Cancelled

created_at

string

The date and time when the task was created and queued (ISO 8601 format)

started_at

string

The date and time when the task started processing (ISO 8601 format)

finished_at

string

The date and time when the task was finished (ISO 8601 format)

expires_at

string

The date and time when the task will expire (ISO 8601 format)

message

string

A message describing the task’s status or any errors that occurred

task_url

string

The presigned URL of the task

configuration

object

The task configuration including all processing settings and the input file URL

output

object

Output data (only present when task is complete)

Show properties

chunks

array

Array of processed chunks with segments

file_name

string

Original file name

page_count

integer

Number of pages in the document

pdf_url

string

Presigned URL to access the processed PDF

Status Codes

200: Task created successfully
400: Bad request (invalid file, invalid base64 data, unsupported file type)
429: Usage limit exceeded
500: Internal server error

Examples

curl -X POST https://api.chunkr.ai/api/v1/task/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/document.pdf",
    "ocr_strategy": "Auto",
    "segmentation_strategy": "LayoutAnalysis",
    "high_resolution": true,
    "chunk_processing": {
      "target_length": 512,
      "ignore_headers_and_footers": true,
      "tokenizer": "Word"
    }
  }'

Notes

The returned task will typically be in a Starting or Processing state
Use the GET /task/ endpoint to poll for completion
Tasks have a status progression: Starting → Processing → Succeeded or Failed
If a task expires, it cannot be accessed, updated, or viewed via the web interface

Overview

Tasks

Models

Endpoint

Authentication

Request Body

Response

Status Codes

Examples

Notes

Overview

Tasks

Models

Documentation Index

​Endpoint

​Authentication

​Request Body

​Response

​Status Codes

​Examples

​Notes

Endpoint

Authentication

Request Body

Response

Status Codes

Examples

Notes