Helios supports running multiple tasks concurrently, perfect for benchmarks, regression testing, and batch operations.
Basic Usage
Run all tasks in a directory:
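helios batch tasks/ -n 4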
This discovers all tasks in tasks/ and runs them with 4 concurrent containers.
Output
╭─────────────┬────────────────────────────────────────────────╮
│ Tasks found │ 10                                             │
│ Concurrent  │ 4                                              │
│ Model       │ gemini/gemini-2.5-computer-use-preview-10-2025 │
│ Output      │ output                                         │
╰─────────────┴────────────────────────────────────────────────╯
4/10 Passed: 3 | Failed: 1 ━━━━━━━━━━━━━━ 00:45 00:01:30
╭──────────────────── Batch Results ────────────────────╮
│ Total tasks │ 10                                       │
│ Passed      │ 8                                        │
│ Failed      │ 2                                        │
│ Mean reward │ 0.800                                    │
│ Duration    │ 180.5s                                   │
╰─────────────────────────────────────────────────────────╯
Options
Option        Short  Default       Description
--concurrent  -n     2             Number of tasks to run in parallel
--n-attempts  -k     1             Number of attempts per task (for pass@k)
--model       -m     default       Model to use for all tasks
--output      -o     output        Directory to save all outputs
--quiet       -q     false         Show only aggregate progress
--pattern     -p     **/task.toml  Glob pattern for finding tasks
--provider           docker        Environment provider
Examples
Basic Batch
# Run all tasks with 4 concurrent containers
helios batch tasks/ -n 4

With Model
# Use a specific model
helios batch tasks/ -n 4 -m claude-sonnet-4-20250514

Custom Output
# Save results to a specific directory
helios batch tasks/ -n 4 -o results/experiment-001/

Quiet Mode
# Minimal output (just progress bar)
helios batch tasks/ -n 4 --quiet

Custom Pattern
# Find tasks matching a specific pattern
helios batch benchmarks/ -p "**/pdfbench*/task.toml" -n 8

Cloud Execution
# Run on Daytona cloud sandboxes
helios batch tasks/ -n 10 --provider daytona

pass@k Evaluation
# Run each task 3 times (pass@3)
helios batch tasks/ -n 4 -k 3
pass@k Evaluation
The pass@k metric measures whether a task can be solved in k attempts. A task “passes” if any of its k attempts succeeds. This is useful for:
Benchmarking: Measure model capability more fairly by accounting for non-determinism
Reliability testing: Understand how often tasks succeed across multiple runs
Research: Compare models using standard pass@k metrics
Basic Usage
# Run each task 3 times
helios batch tasks/ -n 4 -k 3
Output with pass@k
╭────────────────────────────────────────╮
│ ☀ HELIOS v0.1.0 · Batch Mode           │
│                                        │
│ ▸ Tasks       5 discovered             │
│ ▸ Concurrency 4 workers                │
│ ▸ Attempts    3 per task (pass@3)      │
│ ▸ Total Runs  15                       │
╰────────────────────────────────────────╯
╭──────────────────── Batch Results ────────────────────╮
│ pass@3: 80.0% (4/5 tasks)                              │
│ Raw: 10/15 (66.7%)                                     │
│ Mean reward: 0.667                                     │
│ Duration: 245.3s                                       │
╰─────────────────────────────────────────────────────────╯
How pass@k Works
Task Expansion: Each task is expanded into k separate runs
Distributed Execution: Attempts are interleaved across tasks (task1-attempt1, task2-attempt1, task1-attempt2, …) to spread rate limits
Result Aggregation: A task passes if ANY of its k attempts succeeded (see the sketch below)
Metrics: Reports both the raw pass rate and the pass@k percentage
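The aggregation step is easy to picture in a few lines of plain Python. This is an illustrative sketch of the rollup, not the Helios implementation:
from collections import defaultdict

# Toy per-attempt results (not real Helios output): (task_name, attempt, passed)
attempts = [
    ("task-a", 1, False), ("task-a", 2, True),  ("task-a", 3, False),
    ("task-b", 1, True),  ("task-b", 2, True),  ("task-b", 3, False),
    ("task-c", 1, False), ("task-c", 2, False), ("task-c", 3, False),
]

by_task = defaultdict(list)
for task, _, passed in attempts:
    by_task[task].append(passed)

# Raw pass rate counts every attempt; pass@k counts a task once if ANY attempt passed
raw = sum(p for runs in by_task.values() for p in runs) / len(attempts)
pass_at_k = sum(any(runs) for runs in by_task.values()) / len(by_task)

print(f"Raw: {raw:.1%}")        # 3/9 attempts passed
print(f"pass@3: {pass_at_k:.1%}")  # 2/3 tasks had at least one passing attempt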
Output Structure with pass@k
output/batch_20250112_143022/
├── batch_summary.json
├── 000_task-a/
│   ├── attempt_001/
│   │   ├── agent/
│   │   └── verifier/
│   ├── attempt_002/
│   │   ├── agent/
│   │   └── verifier/
│   └── attempt_003/
│       ├── agent/
│       └── verifier/
├── 001_task-b/
│   ├── attempt_001/
│   ├── attempt_002/
│   └── attempt_003/
└── ...
Single Task with pass@k
You can also run a single task multiple times:
# Run one task 5 times
helios run tasks/create-hello-file -k 5
The -k option is incompatible with --watch and --interactive modes since those require a single execution context.
Concurrency Guidelines
Choose concurrency based on your system resources:
System Recommended -n Laptop (16GB RAM) 2-4 Desktop (32GB RAM) 4-8 Server (64GB+ RAM) 8-16 Cloud (Daytona) 10-50+
GUI tasks require more resources. For GUI-heavy batches, use lower concurrency (e.g., -n 2 for GUI tasks vs -n 8 for headless).
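For example (these task directories are illustrative):
# GUI-heavy tasks: keep concurrency low
helios batch tasks/gui/ -n 2

# Headless tasks tolerate more parallelism
helios batch tasks/headless/ -n 8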
Output Structure
Batch runs create organized output:
output/batch_20250101_120000/
├── batch_summary.json          # Aggregate results
├── 001_create-hello-file/
│   ├── agent/
│   │   └── trajectory.json
│   ├── verifier/
│   ├── config.json
│   └── result.json
├── 002_web-scraping-task/
│   ├── agent/
│   │   └── trajectory.json
│   ├── verifier/
│   ├── config.json
│   └── result.json
└── 003_gui-browser-task/
    ├── agent/
    │   └── trajectory.json
    ├── verifier/
    ├── config.json
    └── result.json
For -k runs, each task folder includes attempt_001/, attempt_002/, etc.
batch_summary.json
{
  "total_tasks": 10,
  "passed": 8,
  "failed": 2,
  "mean_reward": 0.8,
  "duration_seconds": 180.5,
  "model": "claude-sonnet-4-20250514",
  "tasks": [
    {
      "name": "create-hello-file",
      "status": "passed",
      "reward": 1.0,
      "duration": 12.3
    },
    // ...
  ]
}
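Because the summary is plain JSON, results can be post-processed without importing Helios. A minimal sketch that flags failing tasks, assuming the field names shown above (the batch directory is an example):
import json
from pathlib import Path

summary = json.loads(Path("output/batch_20250101_120000/batch_summary.json").read_text())

print(f"{summary['passed']}/{summary['total_tasks']} passed, "
      f"mean reward {summary['mean_reward']:.3f} in {summary['duration_seconds']:.1f}s")

for task in summary["tasks"]:
    if task["status"] != "passed":
        print(f"  FAILED: {task['name']} (reward {task['reward']})")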
Programmatic Usage
Use Helios as a Python library for custom batch workflows:
import asyncio
from helios import ParallelRunner, discover_tasks

async def main():
    # Discover all tasks in a directory
    task_paths = discover_tasks("tasks/")

    # Create a parallel runner
    runner = ParallelRunner(
        task_paths=task_paths,
        n_concurrent=4,
        n_attempts=3,  # pass@3 evaluation
        model="claude-sonnet-4-20250514",
        output_dir="results/",
    )

    # Run all tasks
    result = await runner.run()

    # Access results
    print(f"Passed: {result.passed}/{result.total_tasks}")
    print(f"Mean reward: {result.mean_reward:.3f}")
    print(f"Duration: {result.total_duration_sec:.1f}s")

    # pass@k metrics (when n_attempts > 1)
    if result.pass_at_k is not None:
        print(f"pass@{result.n_attempts}: {result.pass_at_k:.1f}%")
        print(f"Unique tasks: {result.n_unique_tasks}")

    # Iterate over individual results
    for task_result in result.results:
        if not task_result.passed:
            print(f"Failed: {task_result.task_name} (attempt {task_result.attempt})")

asyncio.run(main())
Custom Task Discovery
from pathlib import Path
from helios import ParallelRunner

# Custom task selection
task_paths = [
    Path("tasks/easy-task"),
    Path("tasks/medium-task"),
    Path("tasks/hard-task"),
]

runner = ParallelRunner(
    task_paths=task_paths,
    n_concurrent=3,
    model="claude-sonnet-4-20250514",
)
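Discovery and hand-picking can also be combined. A rough sketch of filtering discovered tasks by directory name (this assumes discover_tasks returns pathlib.Path objects, as the examples above suggest):
from helios import ParallelRunner, discover_tasks

# Hypothetical filter: skip any task whose directory name mentions "gui"
task_paths = [p for p in discover_tasks("tasks/") if "gui" not in p.name]

runner = ParallelRunner(task_paths=task_paths, n_concurrent=4)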
Progress Callbacks
async def on_task_complete(task_name: str, result):
    print(f"Completed: {task_name} - {result.status}")

runner = ParallelRunner(
    task_paths=task_paths,
    n_concurrent=4,
    on_complete=on_task_complete,
)
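The same hook works for durable logging, for example appending one JSON line per finished task; only the result.status field used above is assumed here:
import json

async def log_task(task_name: str, result):
    # Append a record as each task finishes
    with open("progress.jsonl", "a") as fh:
        fh.write(json.dumps({"task": task_name, "status": str(result.status)}) + "\n")

runner = ParallelRunner(
    task_paths=task_paths,
    n_concurrent=4,
    on_complete=log_task,
)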
Benchmark Workflows
Running PDFBench
# Build the base image first
docker build -t pdfbench-base -f docker/Dockerfile.pdfbench .
# Run all 100 PDF tasks
helios batch tasks/pdfbench/ -n 4 -m claude-sonnet-4-20250514 -o results/pdfbench/
# Run a subset
helios batch tasks/pdfbench/ -p "**/pdfbench_eyemed*/task.toml" -n 4
Comparing Models
# Run same tasks with different models
helios batch tasks/benchmark/ -n 4 -m claude-sonnet-4-20250514 -o results/claude-sonnet/
helios batch tasks/benchmark/ -n 4 -m openai/computer-use-preview -o results/openai/
helios batch tasks/benchmark/ -n 4 -m gemini/gemini-2.5-computer-use-preview-10-2025 -o results/gemini/
# Compare results
python compare_results.py results/
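compare_results.py is your own analysis script rather than part of Helios; one possible shape for it, assuming each run directory contains a batch_summary.json as documented above:
# compare_results.py - illustrative, not shipped with Helios
import json
import sys
from pathlib import Path

root = Path(sys.argv[1] if len(sys.argv) > 1 else "results/")

for summary_path in sorted(root.glob("**/batch_summary.json")):
    data = json.loads(summary_path.read_text())
    label = str(summary_path.parent.relative_to(root))
    print(f"{label:40s} passed {data['passed']}/{data['total_tasks']}  "
          f"mean reward {data['mean_reward']:.3f}")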
CI/CD Integration
# .github/workflows/benchmark.yml
name: Run Benchmarks

on:
  schedule:
    - cron: '0 0 * * *'  # Daily

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install Helios
        run: pip install -e .

      - name: Run Benchmarks
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          helios batch tasks/regression/ -n 4 -o results/

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results/
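To use the benchmark as a gate, a follow-up step can inspect batch_summary.json and fail the job when tasks regress. A sketch of such a check (not a Helios feature; adjust paths and thresholds to your setup):
# scripts/check_batch.py - illustrative CI gate
import json
import sys
from pathlib import Path

summaries = sorted(Path("results/").glob("**/batch_summary.json"))
if not summaries:
    sys.exit("No batch_summary.json found under results/")

data = json.loads(summaries[-1].read_text())
if data["failed"] > 0:
    sys.exit(f"{data['failed']} of {data['total_tasks']} tasks failed")
print(f"All {data['total_tasks']} tasks passed")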
Troubleshooting
Reduce concurrency or increase system swap:
# Lower concurrency
helios batch tasks/ -n 2
For large batches, Docker Hub may rate-limit image pulls. Use authenticated pulls:
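# Authenticate with Docker Hub to raise the pull rate limit (username is a placeholder)
docker login -u <your-username>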
Some tasks may be non-deterministic. Use pass@k for fair evaluation:
# Recommended: use pass@k to account for non-determinism
helios batch tasks/ -n 4 -k 3

# Alternative: run multiple times and average
for i in {1..3}; do
  helios batch tasks/ -n 4 -o results/run-$i/
done
Use more specific patterns:
# Instead of searching everything
helios batch . -p "**/task.toml"
# Be specific
helios batch tasks/pdfbench/ -n 4
Next Steps