Pricing

Free to find out — paid to stay current.

The first answer is free or cheap. The ongoing watch — keeping you on the cheapest model that still clears your bar as the market moves — is the subscription, and where the value sits.

Standard
Free
Bring your own task. Run a real bake-off and get a verdict on one clear bar: accuracy.
  • Plain-English accuracy bar
  • Responsive shortlist from the full catalog
  • Cloud + self-hosted, side by side
  • Per-example grading detail
  • First verdict always free
Try it free →
Advanced run
$1 / run
Define "good enough" across every metric that matters — not just accuracy.
  • Worst-class accuracy & p95 latency
  • Format validity & consistency
  • Cost ceiling & hallucination rate
  • Multi-metric verdict, on the chart
Try advanced →
The real value
Monitoring
$29 / mo
Stay on the cheapest model that clears your bars as the market keeps moving.
  • Auto re-tests as new models ship
  • Alerts when a cheaper model clears your bars
  • Weekly report + full run history
  • Solo $29 · Team $199 (API, shared workspace)
Start free trial
Enterprise · On-prem
Custom
The whole comparison engine running inside your own environment.
  • On-premise / in-VPC deployment
  • Data never leaves your environment
  • Compliance & residency controls
  • Distilled task-specific models you own
  • SSO, security review, SLAs
Talk to us

Add-on: a shareable detailed report of any run (per-example breakdown, every metric, recommendation) for $9. Included with Monitoring.

What's the difference between Standard and Advanced?

Standard judges on accuracy alone: it's free, and enough for many tasks. Advanced lets you set a bar across worst-class accuracy, p95 latency, format validity, consistency, cost and hallucination rate, so "good enough" means what it actually means for you. It's $1 per run.
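To make the multi-metric bar concrete, here is a minimal sketch of how such a check could work. It is not our actual implementation: the `Bar` class, its field names and the units are hypothetical, chosen only to mirror the six metrics listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    # Hypothetical thresholds: field names and units are assumptions for illustration.
    min_worst_class_accuracy: float
    max_p95_latency_ms: float
    min_format_validity: float
    min_consistency: float
    max_cost_per_1k_usd: float
    max_hallucination_rate: float


def failed_metrics(measured: dict, bar: Bar) -> list:
    """Return the names of the metrics that miss the bar (empty list means pass)."""
    checks = {
        "worst_class_accuracy": measured["worst_class_accuracy"] >= bar.min_worst_class_accuracy,
        "p95_latency_ms": measured["p95_latency_ms"] <= bar.max_p95_latency_ms,
        "format_validity": measured["format_validity"] >= bar.min_format_validity,
        "consistency": measured["consistency"] >= bar.min_consistency,
        "cost_per_1k_usd": measured["cost_per_1k_usd"] <= bar.max_cost_per_1k_usd,
        "hallucination_rate": measured["hallucination_rate"] <= bar.max_hallucination_rate,
    }
    return [name for name, ok in checks.items() if not ok]


def clears_bar(measured: dict, bar: Bar) -> bool:
    """A model clears the bar only if every metric passes."""
    return not failed_metrics(measured, bar)
```

In this sketch a model has to pass every threshold at once, and `failed_metrics` names the ones it missed, which is the kind of detail a multi-metric verdict needs to show.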

Why is monitoring the paid tier, not the verdict?

A one-off verdict goes stale the week a new model ships. The value isn't the first answer — it's staying on the cheapest model that still clears your bars over time, without you re-checking. That's the subscription.
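The idea of "the cheapest model that still clears your bars" can be sketched in a few lines. This is an illustration under stated assumptions, not the monitoring service itself: the result dictionaries and their keys (`model`, `cost_per_1k_usd`, `clears_bar`) are hypothetical.

```python
from typing import Optional


def cheapest_passing(results: list) -> Optional[dict]:
    """Pick the lowest-cost result among those that clear the bar, or None."""
    passing = [r for r in results if r["clears_bar"]]
    return min(passing, key=lambda r: r["cost_per_1k_usd"], default=None)


def should_alert(current: dict, results: list) -> bool:
    """Alert only when a passing model is strictly cheaper than the current one."""
    best = cheapest_passing(results)
    return best is not None and best["cost_per_1k_usd"] < current["cost_per_1k_usd"]
```

Re-running this selection whenever a new model ships is what keeps the answer from going stale: a cheaper model that fails the bar never triggers an alert, and one that passes does.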

How is grading "free"?

For objectively gradable tasks such as classification and extraction, we grade programmatically against your labelled examples. There's no expensive LLM judge in the loop, so re-testing costs us almost nothing and we can do it continuously.
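As a rough illustration of what programmatic grading can look like for a classification task, here is a minimal sketch. It assumes exact-match scoring after trimming whitespace and lowercasing; the `grade` function and that normalisation rule are assumptions made for this example, not a description of our grader.

```python
from collections import defaultdict


def grade(predictions: list, labels: list) -> dict:
    """Score predictions against labelled examples with no LLM judge involved."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need one prediction per labelled example")

    per_class = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for pred, gold in zip(predictions, labels):
        gold_norm = gold.strip().lower()
        hit = pred.strip().lower() == gold_norm  # assumed exact-match rule
        per_class[gold_norm][0] += int(hit)
        per_class[gold_norm][1] += 1

    correct = sum(c for c, _ in per_class.values())
    return {
        "accuracy": correct / len(labels),
        # Accuracy on the class the model handles worst.
        "worst_class_accuracy": min(c / t for c, t in per_class.values()),
    }
```

A check like this is just string comparison and counting, which is why repeating it on every new model costs next to nothing.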

What makes on-prem different?

The whole comparison-and-inference stack runs inside your environment on open models, with the eval grounded in your own data — and, at scale, a small distilled model fine-tuned to your task that you run and own. Pricing's bespoke; start a conversation.

Find out for free

Run your first bake-off, see the verdict, then decide if staying current is worth $29 a month.

Try the demo →