Overview
100 Tasks
10 form types with 10 variants each
Real-World Forms
Tax forms, medical forms, legal documents
Partial Credit
Granular scoring based on field accuracy
Reproducible
Deterministic verification of filled fields
Quick Start
1. Build the PDFBench base image
2. Run a single task
3. Watch the agent
Open http://localhost:8080 to see the agent fill out the form.
Form Types
PDFBench includes 10 form types, each with 10 variations:

| Form Type | Description | Fields |
|---|---|---|
| eyemed | Vision enrollment forms | Personal info, plan selection |
| w-9 | IRS W-9 tax forms | TIN, certification, address |
| hipaa | HIPAA authorization forms | Patient info, permissions |
| invoice-template | Invoice templates | Line items, totals, dates |
| medical-consent | Medical consent forms | Procedures, signatures |
| medical-plan | Medical plan enrollment | Coverage options, dependents |
| nda | Non-disclosure agreements | Parties, terms, dates |
| rental-lease | Rental/lease agreements | Property, terms, signatures |
| prescription | Prescription forms | Medication, dosage, patient |
| pediatric-immunization | Pediatric immunization records | Vaccines, dates, provider |
Task Structure
Each PDFBench task follows this structure:
instruction.md Example
solution.json Example
Running the Benchmark
Full Benchmark
Run all 100 tasks:
By Form Type
Run specific form types:
At Scale with Daytona
For large-scale runs:
Verification
PDFBench verifies each filled form in four steps:
1. Agent saves the PDF
   The agent fills out and saves the PDF form.
2. Text extraction
   The verifier extracts text from the form field bounding boxes.
3. Comparison
   Extracted values are compared against the expected values in solution.json.
4. Scoring
   Partial credit is awarded based on the percentage of correct fields.
Scoring
| Score | Meaning |
|---|---|
| 1.0 | All fields correct |
| 0.0 | No fields correct |
| 0.5 | Half the fields correct |
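
The comparison and scoring steps can be illustrated with a minimal sketch. It is not the code in bbox_verifier.py: the function names, the whitespace and case normalization, and the sample field names are all assumptions made for illustration. It only shows how a score equal to the fraction of correct fields yields the values in the table above.

```python
def normalize(value) -> str:
    """Collapse whitespace and ignore case (an assumed normalization)."""
    return " ".join(str(value).split()).casefold()


def score_fields(extracted: dict, solution: dict) -> float:
    """Return the fraction of solution fields whose extracted text matches."""
    if not solution:
        return 0.0
    correct = sum(
        1
        for name, expected in solution.items()
        # A field missing from the extracted data counts as incorrect.
        if normalize(extracted.get(name, "")) == normalize(expected)
    )
    return correct / len(solution)


# Hypothetical field names and values, for illustration only.
solution = {"name": "Jane Doe", "city": "Austin", "zip": "78701", "state": "TX"}
extracted = {"name": "jane  doe", "city": "Austin", "zip": "78710", "state": ""}

print(score_fields(extracted, solution))  # 0.5 (two of four fields match)
```

Under these assumptions, a typo in one field or a field left blank lowers the score by one field's share rather than failing the whole task.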
Results Analysis
After running the benchmark, analyze results:
Comparing Models
Run PDFBench with different models:
Tips for Good Performance
Use GUI-capable models
PDFBench requires models with strong vision capabilities and computer-use tools.
Allow sufficient time
Form filling takes time. Default timeouts should be sufficient:
Use adequate resources
GUI tasks need more resources:
Watch for common errors
- Clicking wrong fields
- Typos in data entry
- Missing required fields
- Not saving the PDF
Troubleshooting
pdfbench-base image not found
Build the base image first:
PDF not opening
The pdfbench-base image includes Chromium. If PDFs don’t open, check that the task Dockerfile properly extends pdfbench-base.
Low scores on all tasks
Check that:
- The agent is actually filling out fields
- The PDF is being saved correctly
- Verification is running (check test.sh output)
Verification errors
If verification fails:
- Check that the PDF was saved
- Verify the solution.json matches the PDF fields
- Check bbox_verifier.py for errors
Creating New PDF Tasks
To add your own PDF form-filling tasks:
1. Create the task directory
2. Add the PDF
   Place your PDF at environment/pdfs/myform.pdf
3. Create solution.json
   Map field names to expected values:
4. Write instructions
   Create instruction.md with conversational instructions.
5. Configure task.toml
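
The task-creation steps above can be roughly sketched in Python. This is a scaffold under stated assumptions, not an official PDFBench tool. The helper name `scaffold_task`, the placement of solution.json and instruction.md at the task root, and the sample field names and instruction text are all hypothetical. Only the environment/pdfs/ location comes from the steps above. The sketch does not write task.toml because its schema is not described here.

```python
import json
import tempfile
from pathlib import Path


def scaffold_task(task_dir: Path, pdf_name: str, solution: dict, instruction: str) -> Path:
    """Create a task skeleton and return the path where the PDF should be placed."""
    task_dir = Path(task_dir)
    pdf_dir = task_dir / "environment" / "pdfs"
    pdf_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: solution.json and instruction.md live at the task root.
    (task_dir / "solution.json").write_text(
        json.dumps(solution, indent=2) + "\n", encoding="utf-8"
    )
    (task_dir / "instruction.md").write_text(
        instruction.rstrip() + "\n", encoding="utf-8"
    )
    return pdf_dir / pdf_name


if __name__ == "__main__":
    # Hypothetical field names and instruction text, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = scaffold_task(
            Path(tmp) / "myform-1",
            "myform.pdf",
            {"full_name": "Jane Doe", "date": "2024-01-15"},
            "Hi! Please fill out this form for Jane Doe, dated 2024-01-15.",
        )
        print(pdf_path.relative_to(tmp))
```

After running a scaffold like this, copy the PDF to the returned path and configure task.toml by hand.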