Harbor – a framework for evaluating and optimizing agents and language models (github.com)

🤖 AI Summary
Harbor is a new evaluation and optimization framework from the makers of Terminal-Bench that standardizes benchmarking, orchestration, and RL-ready rollout generation for agents and language models. It can evaluate arbitrary agents (e.g., Claude Code, OpenHands, Codex CLI), host and share custom benchmarks and environments, and run experiments at scale by dispatching thousands of parallel runs to cloud providers like Daytona or Modal. Harbor is the official harness for Terminal-Bench 2.0 and supports third-party suites (SWE-Bench, Aider Polyglot, etc.), making it easy to compare models and agent implementations against shared, versioned datasets.

Technically, Harbor is installable via "uv tool install harbor" or "pip install harbor" and provides a simple CLI workflow: specify dataset, agent, model, and concurrency (e.g., harbor run --dataset terminal-bench@2.0 --agent claude-code --model anthropic/claude-opus-4-1 --n-concurrent 4). It launches locally with Docker or can target cloud providers with an --env flag and provider API keys (e.g., DAYTONA_API_KEY) to scale to hundreds of concurrent runs.

Key implications: Harbor lowers friction for reproducible, large-scale agent benchmarking, accelerates RL fine-tuning by producing rollouts, and centralizes tooling so researchers and engineers can more easily compare, iterate, and optimize agent architectures and model-policy combinations.
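A rough sketch of the workflow described above, based only on the commands and flags quoted in the summary; the value passed to --env and the scaled-up concurrency count are illustrative assumptions, so check the Harbor docs for exact provider names:

  # Install the Harbor CLI (either installer works)
  uv tool install harbor
  # or: pip install harbor

  # Evaluate an agent locally with Docker, running 4 tasks concurrently
  harbor run --dataset terminal-bench@2.0 --agent claude-code --model anthropic/claude-opus-4-1 --n-concurrent 4

  # Scale out to a cloud provider instead of local Docker.
  # The provider API key is read from the environment; the "--env daytona"
  # value and the concurrency of 100 are assumptions for illustration.
  export DAYTONA_API_KEY=...
  harbor run --dataset terminal-bench@2.0 --agent claude-code --model anthropic/claude-opus-4-1 --env daytona --n-concurrent 100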