PDFBench

PDFBench is a benchmark suite of 100 PDF form-filling tasks designed to evaluate computer-use agents. Each task requires the agent to fill out a PDF form in Chromium based on conversational instructions.

Overview

100 Tasks

10 form types with 10 variants each

Real-World Forms

Tax forms, medical forms, legal documents

Partial Credit

Granular scoring based on field accuracy

Reproducible

Deterministic verification of filled fields

Quick Start

Build the PDFBench base image

docker build -t pdfbench-base -f docker/Dockerfile.pdfbench .

Run a single task

helios tasks/pdfbench/pdfbench_eyemed_x001 --watch

Watch the agent

Open http://localhost:8080 to see the agent fill out the form.

Form Types

PDFBench includes 10 form types, each with 10 variants:

Form Type	Description	Fields
`eyemed`	Vision enrollment forms	Personal info, plan selection
`w-9`	IRS W-9 tax forms	TIN, certification, address
`hipaa`	HIPAA authorization forms	Patient info, permissions
`invoice-template`	Invoice templates	Line items, totals, dates
`medical-consent`	Medical consent forms	Procedures, signatures
`medical-plan`	Medical plan enrollment	Coverage options, dependents
`nda`	Non-disclosure agreements	Parties, terms, dates
`rental-lease`	Rental/lease agreements	Property, terms, signatures
`prescription`	Prescription forms	Medication, dosage, patient
`pediatric-immunization`	Pediatric immunization records	Vaccines, dates, provider

Task Structure

Each PDFBench task follows this structure:

tasks/pdfbench/
├── pdfbench_eyemed_x001/
│   ├── instruction.md         # Conversational form-filling instructions
│   ├── task.toml              # Task config (gui=true)
│   ├── environment/
│   │   ├── Dockerfile         # FROM pdfbench-base + COPY files
│   │   ├── pdfs/eyemed.pdf    # The PDF form to fill
│   │   ├── solution.json      # Expected field values
│   │   └── bbox_verifier.py   # Verification script
│   └── tests/
│       └── test.sh            # Runs verification, outputs reward
└── ... (99 more tasks)

instruction.md Example

Please fill out the vision enrollment form with the following information:

- Enrollee Name: John Smith
- Date of Birth: 03/15/1985
- Social Security Number: 123-45-6789
- Address: 123 Main Street, Apt 4B, San Francisco, CA 94102
- Plan Type: Premium Vision
- Effective Date: 01/01/2025

Sign and date the form.

solution.json Example

{
  "enrollee_name": "John Smith",
  "dob": "03/15/1985",
  "ssn": "123-45-6789",
  "address": "123 Main Street, Apt 4B",
  "city": "San Francisco",
  "state": "CA",
  "zip": "94102",
  "plan_type": "Premium Vision",
  "effective_date": "01/01/2025"
}

Running the Benchmark

Full Benchmark

Run all 100 tasks:

helios batch tasks/pdfbench/ -n 4 -m claude-sonnet-4-20250514 -o results/pdfbench/

By Form Type

Run specific form types:

# Run all eyemed tasks
helios batch tasks/pdfbench/ -p "**/pdfbench_eyemed*/task.toml" -n 4

# Run all W-9 tasks
helios batch tasks/pdfbench/ -p "**/pdfbench_w-9*/task.toml" -n 4
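To preview which tasks a pattern such as `**/pdfbench_eyemed*/task.toml` would select before launching a batch, the glob can be reproduced with Python's `pathlib`. This is a minimal sketch: it assumes `helios batch -p` follows standard glob semantics, and `tasks_for_form_type` is a hypothetical helper, not part of helios.

```python
from pathlib import Path


def tasks_for_form_type(root: str, form_type: str) -> list[str]:
    """Return the names of task directories under `root` for one form type.

    Hypothetical helper: mirrors the pattern "**/pdfbench_<form_type>*/task.toml"
    with pathlib's glob, which may differ from how helios matches patterns.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    pattern = f"**/pdfbench_{form_type}*/task.toml"
    return sorted(path.parent.name for path in root_path.glob(pattern))


if __name__ == "__main__":
    for name in tasks_for_form_type("tasks/pdfbench", "eyemed"):
        print(name)
```

Because the pattern ends in a wildcard, a prefix such as `medical` matches both `medical-consent` and `medical-plan` tasks; pass the full form type name to select only one of them.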

At Scale with Daytona

For large-scale runs:

helios batch tasks/pdfbench/ -n 20 --provider daytona -o results/

Verification

PDFBench verifies each task in four steps:

Agent saves the PDF

The agent fills out and saves the PDF form

Text extraction

The verifier extracts text from form field bounding boxes

Comparison

Extracted values are compared against the solution.json

Scoring

Partial credit is awarded based on percentage of correct fields
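The comparison step can be pictured as a per-field check of the extracted text against `solution.json`. The sketch below is illustrative only: `compare_fields` is a hypothetical function, and the whitespace normalization is an assumption, since the matching rules used by `bbox_verifier.py` are not described here.

```python
def compare_fields(expected: dict[str, str], extracted: dict[str, str]) -> dict[str, bool]:
    """Return, for each expected field, whether the extracted text matches.

    Assumption: values match after collapsing runs of whitespace; the real
    verifier may be stricter or more lenient.
    """
    def normalize(value: str) -> str:
        return " ".join(value.split())

    return {
        name: normalize(extracted.get(name, "")) == normalize(value)
        for name, value in expected.items()
    }


expected = {"enrollee_name": "John Smith", "state": "CA", "zip": "94102"}
extracted = {"enrollee_name": "John  Smith ", "state": "CA", "zip": "94120"}
print(compare_fields(expected, extracted))
# {'enrollee_name': True, 'state': True, 'zip': False}
```

A field that is missing from the extracted values is treated as empty text and therefore counts as incorrect.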

Scoring

Score	Meaning
1.0	All fields correct
0.5	Half the fields correct
0.0	No fields correct

The reward is calculated as:

reward = correct_fields / total_fields
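As a worked example of the formula, a task with 9 expected fields of which 7 are filled correctly earns a reward of 7/9 ≈ 0.778. The sketch below assumes per-field results are available as booleans; `reward` is a hypothetical helper, and returning 0.0 for a task with no fields is an assumption.

```python
def reward(field_results: dict[str, bool]) -> float:
    """Compute correct_fields / total_fields from per-field pass/fail results."""
    total_fields = len(field_results)
    if total_fields == 0:
        return 0.0  # Assumption: a task with no fields earns no reward.
    correct_fields = sum(field_results.values())
    return correct_fields / total_fields


# Nine fields, as in the solution.json example; two were filled incorrectly.
results = {
    "enrollee_name": True, "dob": True, "ssn": True,
    "address": True, "city": True, "state": True,
    "zip": False, "plan_type": True, "effective_date": False,
}
print(f"{reward(results):.3f}")  # 0.778
```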

Results Analysis

After running the benchmark, analyze results:

import json
from collections import defaultdict
from pathlib import Path

# Load the summary written by the batch run (-o results/pdfbench/)
results_dir = Path("results/pdfbench/")
results = json.loads((results_dir / "batch_results.json").read_text())

print(f"Total tasks: {results['total_tasks']}")
print(f"Mean reward: {results['mean_reward']:.3f}")
print(f"Passed (1.0): {results['passed']}")

# Group rewards by form type. Task names look like "pdfbench_eyemed_x001",
# so the form type is the second underscore-separated component.
form_types = defaultdict(list)
for task in results['tasks']:
    form_type = task['name'].split('_')[1]
    form_types[form_type].append(task['reward'])

# Report the mean reward for each form type
for form_type, rewards in sorted(form_types.items()):
    avg = sum(rewards) / len(rewards)
    print(f"{form_type}: {avg:.3f}")

Comparing Models

Run PDFBench with different models:

# Claude Sonnet
helios batch tasks/pdfbench/ -n 4 -m claude-sonnet-4-20250514 -o results/claude-sonnet/

# Claude Opus
helios batch tasks/pdfbench/ -n 4 -m bedrock/global.anthropic.claude-opus-4-5-20251101-v1:0 -o results/claude-opus/

# Gemini
helios batch tasks/pdfbench/ -n 4 -m gemini/gemini-2.5-computer-use-preview-10-2025 -o results/gemini/

# OpenAI
helios batch tasks/pdfbench/ -n 4 -m openai/computer-use-preview -o results/openai/
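Once each run has finished, the per-model results can be placed side by side. This is a minimal sketch that assumes every output directory contains a `batch_results.json` with the `mean_reward`, `passed`, and `total_tasks` keys used in the Results Analysis script; `summarize_runs` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_runs(run_dirs: dict[str, str]) -> list[tuple[str, float, int, int]]:
    """Return (label, mean_reward, passed, total_tasks) per run, best first.

    Runs whose batch_results.json is missing are skipped.
    """
    rows = []
    for label, run_dir in run_dirs.items():
        results_file = Path(run_dir) / "batch_results.json"
        if not results_file.exists():
            continue
        results = json.loads(results_file.read_text())
        rows.append(
            (label, results["mean_reward"], results["passed"], results["total_tasks"])
        )
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    runs = {
        "claude-sonnet": "results/claude-sonnet/",
        "claude-opus": "results/claude-opus/",
        "gemini": "results/gemini/",
        "openai": "results/openai/",
    }
    for label, mean_reward, passed, total in summarize_runs(runs):
        print(f"{label:15s} mean={mean_reward:.3f} passed={passed}/{total}")
```

Mean reward alone can hide differences between form types, so it is worth pairing this summary with the per-form-type breakdown for each run.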

Tips for Good Performance

Use GUI-capable models

PDFBench requires models with strong vision capabilities and computer-use tools.

Allow sufficient time

Form filling takes time. The default timeout of 300 seconds per task should be sufficient:

[agent]
timeout_sec = 300.0  # 5 minutes per task

Use adequate resources

GUI tasks need more resources:

[environment]
cpus = 2
memory_mb = 4096

Watch for common errors

Clicking wrong fields
Typos in data entry
Missing required fields
Not saving the PDF

Troubleshooting

pdfbench-base image not found

Build the base image first:

docker build -t pdfbench-base -f docker/Dockerfile.pdfbench .

PDF not opening

The pdfbench-base image includes Chromium. If PDFs don’t open, check that the task Dockerfile properly extends pdfbench-base.

Low scores on all tasks

Check that:

The agent is actually filling out fields
The PDF is being saved correctly
Verification is running (check test.sh output)

Verification errors

If verification fails:

Check that the PDF was saved
Verify the solution.json matches the PDF fields
Check bbox_verifier.py for errors

Creating New PDF Tasks

To add your own PDF form-filling tasks:

Create the task directory

mkdir -p tasks/pdfbench/pdfbench_myform_x001/{environment,tests}

Add the PDF

Place your PDF at environment/pdfs/myform.pdf

Create solution.json

Map field names to expected values:

{
  "field_name": "expected_value",
  "another_field": "another_value"
}

Write instructions

Create instruction.md with conversational instructions.

Configure task.toml

version = "1.0"

[agent]
timeout_sec = 300.0

[environment]
docker_image = "pdfbench-base"
gui = true
cpus = 2
memory_mb = 4096
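The steps above can also be scripted. The sketch below scaffolds the documented text files for a new task; `scaffold_task` is a hypothetical helper, and it deliberately leaves out the PDF, the Dockerfile, `bbox_verifier.py`, and `tests/test.sh`, which still need to be added by hand (for example, by copying them from an existing task).

```python
import json
from pathlib import Path

# Same settings as the task.toml shown above.
TASK_TOML = """\
version = "1.0"

[agent]
timeout_sec = 300.0

[environment]
docker_image = "pdfbench-base"
gui = true
cpus = 2
memory_mb = 4096
"""


def scaffold_task(root: str, form: str, variant: int, instruction: str,
                  solution: dict[str, str]) -> Path:
    """Create the directory layout and text files for one new task."""
    # Follows the naming of existing tasks, e.g. pdfbench_eyemed_x001.
    task_dir = Path(root) / f"pdfbench_{form}_x{variant:03d}"
    (task_dir / "environment" / "pdfs").mkdir(parents=True, exist_ok=True)
    (task_dir / "tests").mkdir(exist_ok=True)
    (task_dir / "instruction.md").write_text(instruction)
    (task_dir / "task.toml").write_text(TASK_TOML)
    solution_file = task_dir / "environment" / "solution.json"
    solution_file.write_text(json.dumps(solution, indent=2) + "\n")
    return task_dir
```

For example, `scaffold_task("tasks/pdfbench", "myform", 1, instruction_text, {"field_name": "expected_value"})` creates `tasks/pdfbench/pdfbench_myform_x001/`, after which the PDF can be placed at `environment/pdfs/myform.pdf`.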


Next Steps

Batch Execution

Daytona Cloud
