Modeljury — pricing

Standard

Free

Bring your own task. Run a real bake-off and get a verdict on one clear bar: accuracy.

Plain-English accuracy bar
Responsive shortlist from the full catalog
Cloud + self-hosted, side by side
Per-example grading detail
First verdict always free

Try it free →

Advanced run

$1 / run

Define "good enough" across every metric that matters — not just accuracy.

Worst-class accuracy & p95 latency
Format validity & consistency
Cost ceiling & hallucination rate
Multi-metric verdict, on the chart

Try advanced →

The real value

Monitoring

$29 / mo

Stay on the cheapest model that clears your bars as the market keeps moving.

Auto re-tests as new models ship
Alerts when a cheaper model clears your bars
Weekly report + full run history
Solo $29 · Team $199 (API, shared workspace)

Start free trial

Enterprise · On-prem

Custom

The whole comparison engine running inside your own environment.

On-premise / in-VPC deployment
Data never leaves your environment
Compliance & residency controls
Distilled task-specific models you own
SSO, security review, SLAs

Talk to us

Add-on: a shareable detailed report of any run — per-example breakdown, every metric, recommendation — $9. Included with Monitoring.

What's the difference between Standard and Advanced?

Standard judges on accuracy alone — free, and enough for many tasks. Advanced lets you set a bar across worst-class accuracy, latency, format validity, consistency, cost and hallucination rate, so "good enough" means what it actually means for you. It's $1 per run.
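As a rough illustration of what a multi-metric bar could look like, here is a minimal Python sketch. The metric names, units, and the `clears_bars` helper are assumptions made for this example, not Modeljury's actual API: a model passes only if every metric you set a bar for is on the right side of that bar.

```python
# Hypothetical metric names; which direction counts as "better" is an assumption.
HIGHER_IS_BETTER = {"worst_class_accuracy", "format_validity", "consistency"}
LOWER_IS_BETTER = {"p95_latency_ms", "cost_per_1k_usd", "hallucination_rate"}


def clears_bars(metrics, bars):
    """Return (passed, failed_metric_names) for one model's measured metrics."""
    failures = []
    for name, bar in bars.items():
        value = metrics[name]
        if name in HIGHER_IS_BETTER:
            ok = value >= bar
        elif name in LOWER_IS_BETTER:
            ok = value <= bar
        else:
            raise KeyError(f"unknown metric: {name}")
        if not ok:
            failures.append(name)
    return (not failures, failures)


# Illustrative numbers only, not real results.
bars = {"worst_class_accuracy": 0.90, "p95_latency_ms": 800, "hallucination_rate": 0.02}
measured = {"worst_class_accuracy": 0.93, "p95_latency_ms": 950, "hallucination_rate": 0.01}
passed, failed = clears_bars(measured, bars)
```

In this made-up case the model is accurate enough but too slow, so it would not clear the bar even though a Standard, accuracy-only verdict would pass it.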

Why is monitoring the paid tier, not the verdict?

A one-off verdict goes stale the week a new model ships. The value isn't the first answer — it's staying on the cheapest model that still clears your bars over time, without you re-checking. That's the subscription.

How is grading "free"?

For objectively gradable tasks such as classification and extraction, we grade programmatically against your labelled examples. There is no expensive LLM judge in the loop, so re-testing costs us almost nothing and we can do it continuously.
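To make "programmatic grading" concrete, here is a minimal sketch in Python, assuming a classification task graded by case-insensitive exact match. The `grade_exact_match` function and its normalization rule are hypothetical; the real grader may compare outputs differently.

```python
def grade_exact_match(predictions, labels):
    """Grade model outputs against labelled examples without any LLM judge.

    Returns (accuracy, per_example), where per_example is a list of booleans
    in the same order as the inputs.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    # Assumed normalization: ignore surrounding whitespace and letter case.
    per_example = [
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, labels)
    ]
    accuracy = sum(per_example) / len(per_example) if per_example else 0.0
    return accuracy, per_example


# Illustrative data only.
accuracy, detail = grade_exact_match(["Spam", "ham ", "spam"], ["spam", "ham", "ham"])
```

Because a check like this is plain string comparison, re-running it on a new model costs only the inference itself, which is why continuous re-testing is cheap.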

What makes on-prem different?

The whole comparison-and-inference stack runs inside your environment on open models, with the eval grounded in your own data — and, at scale, a small distilled model fine-tuned to your task that you run and own. Pricing's bespoke; start a conversation.

Free to find out — paid to stay current.

Find out for free