PDFBench is a benchmark suite of 100 PDF form-filling tasks designed to evaluate computer-use agents. Each task requires the agent to fill out a PDF form in Chromium based on conversational instructions.

Overview

100 Tasks

10 form types with 10 variants each

Real-World Forms

Tax forms, medical forms, legal documents

Partial Credit

Granular scoring based on field accuracy

Reproducible

Deterministic verification of filled fields

Quick Start

1. Build the PDFBench base image

docker build -t pdfbench-base -f docker/Dockerfile.pdfbench .

2. Run a single task

helios tasks/pdfbench/pdfbench_eyemed_x001 --watch

3. Watch the agent

Open http://localhost:8080 to see the agent fill out the form.

Form Types

PDFBench includes 10 different form types, each with 10 variations:
| Form Type | Description | Fields |
| --- | --- | --- |
| eyemed | Vision enrollment forms | Personal info, plan selection |
| w-9 | IRS W-9 tax forms | TIN, certification, address |
| hipaa | HIPAA authorization forms | Patient info, permissions |
| invoice-template | Invoice templates | Line items, totals, dates |
| medical-consent | Medical consent forms | Procedures, signatures |
| medical-plan | Medical plan enrollment | Coverage options, dependents |
| nda | Non-disclosure agreements | Parties, terms, dates |
| rental-lease | Rental/lease agreements | Property, terms, signatures |
| prescription | Prescription forms | Medication, dosage, patient |
| pediatric-immunization | Pediatric immunization records | Vaccines, dates, provider |

Task Structure

Each PDFBench task follows this structure:
tasks/pdfbench/
├── pdfbench_eyemed_x001/
│   ├── instruction.md         # Conversational form-filling instructions
│   ├── task.toml              # Task config (gui=true)
│   ├── environment/
│   │   ├── Dockerfile         # FROM pdfbench-base + COPY files
│   │   ├── pdfs/eyemed.pdf    # The PDF form to fill
│   │   ├── solution.json      # Expected field values
│   │   └── bbox_verifier.py   # Verification script
│   └── tests/
│       └── test.sh            # Runs verification, outputs reward
└── ... (99 more tasks)
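
As a quick sanity check on this layout, the sketch below reports which of the files from the tree are missing in a task directory. The helper name `missing_task_files` and the `EXPECTED_FILES` list are illustrative and not part of PDFBench; the list simply mirrors the tree shown here.

```python
from pathlib import Path

# Files named in the task tree, relative to the task directory.
EXPECTED_FILES = [
    "instruction.md",
    "task.toml",
    "environment/Dockerfile",
    "environment/solution.json",
    "environment/bbox_verifier.py",
    "tests/test.sh",
]


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the expected files that are absent from task_dir (hypothetical helper)."""
    missing = [rel for rel in EXPECTED_FILES if not (task_dir / rel).is_file()]
    # Each task also ships one form under environment/pdfs/; its name varies by form type.
    pdf_dir = task_dir / "environment" / "pdfs"
    if not (pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))):
        missing.append("environment/pdfs/*.pdf")
    return missing


if __name__ == "__main__":
    # Scan every task directory and print only the incomplete ones.
    for task_dir in sorted(Path("tasks/pdfbench").glob("pdfbench_*")):
        problems = missing_task_files(task_dir)
        if problems:
            print(f"{task_dir.name}: missing {', '.join(problems)}")
```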

instruction.md Example

Please fill out the vision enrollment form with the following information:

- Enrollee Name: John Smith
- Date of Birth: 03/15/1985
- Social Security Number: 123-45-6789
- Address: 123 Main Street, Apt 4B, San Francisco, CA 94102
- Plan Type: Premium Vision
- Effective Date: 01/01/2025

Sign and date the form.

solution.json Example

{
  "enrollee_name": "John Smith",
  "dob": "03/15/1985",
  "ssn": "123-45-6789",
  "address": "123 Main Street, Apt 4B",
  "city": "San Francisco",
  "state": "CA",
  "zip": "94102",
  "plan_type": "Premium Vision",
  "effective_date": "01/01/2025"
}
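
Note that instruction.md gives the address on a single line, while solution.json stores the street, city, state, and ZIP code as separate fields. How task authors derive one from the other is not specified; the sketch below is one possible way, assuming the address always has the shape "street[, unit], city, ST ZIP". The function `split_address` is hypothetical.

```python
def split_address(one_line: str) -> dict[str, str]:
    """Split 'street[, unit], city, ST ZIP' into the four address fields.

    Hypothetical helper: assumes comma-separated parts with the state and
    ZIP code together in the last part.
    """
    parts = [part.strip() for part in one_line.split(",")]
    if len(parts) < 3:
        raise ValueError(f"expected 'street, city, ST ZIP', got: {one_line!r}")
    state, zip_code = parts[-1].split()
    return {
        "address": ", ".join(parts[:-2]),  # street plus any unit/apartment part
        "city": parts[-2],
        "state": state,
        "zip": zip_code,
    }


fields = split_address("123 Main Street, Apt 4B, San Francisco, CA 94102")
print(fields["address"])  # 123 Main Street, Apt 4B
print(fields["city"], fields["state"], fields["zip"])  # San Francisco CA 94102
```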

Running the Benchmark

Full Benchmark

Run all 100 tasks:
helios batch tasks/pdfbench/ -n 4 -m claude-sonnet-4-20250514 -o results/pdfbench/

By Form Type

Run specific form types:
# Run all eyemed tasks
helios batch tasks/pdfbench/ -p "**/pdfbench_eyemed*/task.toml" -n 4

# Run all W-9 tasks
helios batch tasks/pdfbench/ -p "**/pdfbench_w-9*/task.toml" -n 4
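
Before launching a filtered run, it can help to preview which tasks a pattern selects. The sketch below uses Python's `pathlib` globbing and assumes the `-p` pattern follows similar glob semantics; that is an assumption, not documented behavior of helios. The function `matching_tasks` is illustrative.

```python
from pathlib import Path


def matching_tasks(root: Path, pattern: str) -> list[str]:
    """Return the names of task directories whose task.toml matches pattern."""
    # Each match is a task.toml file; its parent directory is the task.
    return sorted(path.parent.name for path in root.glob(pattern))


if __name__ == "__main__":
    for name in matching_tasks(Path("tasks/pdfbench"), "**/pdfbench_w-9*/task.toml"):
        print(name)
```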

At Scale with Daytona

For large-scale runs:
helios batch tasks/pdfbench/ -n 20 --provider daytona -o results/

Verification

PDFBench uses a specialized verification system:
1. Agent saves the PDF: The agent fills out and saves the PDF form
2. Text extraction: The verifier extracts text from form field bounding boxes
3. Comparison: Extracted values are compared against the solution.json
4. Scoring: Partial credit is awarded based on percentage of correct fields
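
The comparison step can be sketched as a per-field check. This is not the logic of bbox_verifier.py, whose matching rules are not described here; in particular, the whitespace and case normalization in `normalize` is an assumption, and both function names are hypothetical.

```python
def normalize(value: str) -> str:
    """Collapse whitespace and ignore case (assumed; the real verifier may differ)."""
    return " ".join(value.split()).casefold()


def compare_fields(extracted: dict[str, str], solution: dict[str, str]) -> dict[str, bool]:
    """Map every expected field to True if the extracted value matches it."""
    return {
        name: normalize(extracted.get(name, "")) == normalize(expected)
        for name, expected in solution.items()
    }


solution = {"enrollee_name": "John Smith", "state": "CA", "zip": "94102"}
extracted = {"enrollee_name": "John  Smith ", "state": "CA", "zip": "94012"}
print(compare_fields(extracted, solution))
# {'enrollee_name': True, 'state': True, 'zip': False}
```

Fields missing from the extracted values count as mismatches, so an unsaved or partially filled form lowers the score instead of raising an error.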

Scoring

| Score | Meaning |
| --- | --- |
| 1.0 | All fields correct |
| 0.0 | No fields correct |
| 0.5 | Half the fields correct |
The reward is calculated as:
reward = correct_fields / total_fields
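
As a worked example of this formula, the sketch below scores a run of the nine-field eyemed solution in which two fields were entered incorrectly. The function name `reward` and the choice to return 0.0 for an empty field set are assumptions for illustration.

```python
def reward(matches: dict[str, bool]) -> float:
    """Fraction of fields that matched: correct_fields / total_fields."""
    if not matches:
        return 0.0  # assumed behavior when there are no fields to check
    return sum(matches.values()) / len(matches)


# Nine fields from the solution.json example; "ssn" and "zip" were wrong.
matches = {
    "enrollee_name": True,
    "dob": True,
    "ssn": False,
    "address": True,
    "city": True,
    "state": True,
    "zip": False,
    "plan_type": True,
    "effective_date": True,
}
print(f"{reward(matches):.3f}")  # 7 / 9 -> 0.778
```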

Results Analysis

After running the benchmark, analyze results:
import json
from pathlib import Path

results_dir = Path("results/pdfbench/")
results = json.loads((results_dir / "batch_results.json").read_text())

print(f"Total tasks: {results['total_tasks']}")
print(f"Mean reward: {results['mean_reward']:.3f}")
print(f"Passed (1.0): {results['passed']}")

# By form type
form_types = {}
for task in results['tasks']:
    form_type = task['name'].split('_')[1]
    if form_type not in form_types:
        form_types[form_type] = []
    form_types[form_type].append(task['reward'])

for form_type, rewards in form_types.items():
    avg = sum(rewards) / len(rewards)
    print(f"{form_type}: {avg:.3f}")

Comparing Models

Run PDFBench with different models:
# Claude Sonnet
helios batch tasks/pdfbench/ -n 4 -m claude-sonnet-4-20250514 -o results/claude-sonnet/

# Claude Opus
helios batch tasks/pdfbench/ -n 4 -m bedrock/global.anthropic.claude-opus-4-5-20251101-v1:0 -o results/claude-opus/

# Gemini
helios batch tasks/pdfbench/ -n 4 -m gemini/gemini-2.5-computer-use-preview-10-2025 -o results/gemini/

# OpenAI
helios batch tasks/pdfbench/ -n 4 -m openai/computer-use-preview -o results/openai/
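
Once several runs have finished, their summaries can be compared side by side. The sketch below assumes each output directory contains a batch_results.json with the same `total_tasks`, `mean_reward`, and `passed` keys used in the Results Analysis script; the function names are illustrative.

```python
import json
from pathlib import Path


def load_summary(results_dir: Path) -> dict:
    """Read the headline numbers from one run's batch_results.json."""
    results = json.loads((results_dir / "batch_results.json").read_text())
    return {
        "model": results_dir.name,  # e.g. "claude-sonnet", taken from the -o directory
        "total_tasks": results["total_tasks"],
        "mean_reward": results["mean_reward"],
        "passed": results["passed"],
    }


def ranked_summaries(results_root: Path) -> list[dict]:
    """Summaries for every run under results_root, best mean reward first."""
    summaries = [
        load_summary(run_dir)
        for run_dir in sorted(results_root.iterdir())
        if (run_dir / "batch_results.json").is_file()
    ]
    return sorted(summaries, key=lambda s: s["mean_reward"], reverse=True)


if __name__ == "__main__":
    for s in ranked_summaries(Path("results")):
        print(f"{s['model']:<16} mean={s['mean_reward']:.3f} "
              f"passed={s['passed']}/{s['total_tasks']}")
```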

Tips for Good Performance

PDFBench requires models with strong vision capabilities and computer-use tools.
Form filling takes time. Default timeouts should be sufficient:
[agent]
timeout_sec = 300.0  # 5 minutes per task
GUI tasks need more resources:
[environment]
cpus = 2
memory_mb = 4096
Common failure modes to watch for:
  • Clicking wrong fields
  • Typos in data entry
  • Missing required fields
  • Not saving the PDF

Troubleshooting

If a task fails to build because the pdfbench-base image is missing, build the base image first:
docker build -t pdfbench-base -f docker/Dockerfile.pdfbench .
The pdfbench-base image includes Chromium. If PDFs don’t open, check that the task Dockerfile properly extends pdfbench-base.
If the reward is 0 or lower than expected, check that:
  1. The agent is actually filling out fields
  2. The PDF is being saved correctly
  3. Verification is running (check test.sh output)
If verification fails:
  1. Check that the PDF was saved
  2. Verify the solution.json matches the PDF fields
  3. Check bbox_verifier.py for errors

Creating New PDF Tasks

To add your own PDF form-filling tasks:
1. Create the task directory

mkdir -p tasks/pdfbench/pdfbench_myform_x001/{environment/pdfs,tests}

2. Add the PDF

Place your PDF at environment/pdfs/myform.pdf

3. Create solution.json

Map field names to expected values:
{
  "field_name": "expected_value",
  "another_field": "another_value"
}

4. Write instructions

Create instruction.md with conversational instructions.

5. Configure task.toml

version = "1.0"

[agent]
timeout_sec = 300.0

[environment]
docker_image = "pdfbench-base"
gui = true
cpus = 2
memory_mb = 4096
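
The steps above can be scripted. The following is a minimal sketch, not an official PDFBench tool: `scaffold_task` is a hypothetical helper that creates the directories, solution.json, a draft instruction.md, and the task.toml shown above. It does not add the PDF itself or the Dockerfile, bbox_verifier.py, and test.sh listed under Task Structure, and the generated instruction.md is only a starting point to rewrite conversationally.

```python
import json
from pathlib import Path

# Same settings as the task.toml in step 5.
TASK_TOML = """version = "1.0"

[agent]
timeout_sec = 300.0

[environment]
docker_image = "pdfbench-base"
gui = true
cpus = 2
memory_mb = 4096
"""


def scaffold_task(root: Path, form: str, variant: str, solution: dict[str, str]) -> Path:
    """Create a task skeleton under root and return its directory."""
    task_dir = root / f"pdfbench_{form}_{variant}"
    (task_dir / "environment" / "pdfs").mkdir(parents=True, exist_ok=True)
    (task_dir / "tests").mkdir(exist_ok=True)

    # solution.json maps field names to expected values.
    solution_path = task_dir / "environment" / "solution.json"
    solution_path.write_text(json.dumps(solution, indent=2) + "\n")

    # Draft instructions: one bullet per field, to be reworded by the task author.
    bullets = "\n".join(f"- {name}: {value}" for name, value in solution.items())
    (task_dir / "instruction.md").write_text(
        "Please fill out the form with the following information:\n\n" + bullets + "\n"
    )

    (task_dir / "task.toml").write_text(TASK_TOML)
    return task_dir


if __name__ == "__main__":
    created = scaffold_task(
        Path("tasks/pdfbench"), "myform", "x001", {"field_name": "expected_value"}
    )
    print(f"Created {created}; now copy the PDF to {created / 'environment' / 'pdfs'}")
```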
