Modeljury — glossary

Core

Bake-off

A head-to-head test: every shortlisted model runs your examples, and each answer is graded against the expected label. The output is a ranking of accuracy versus cost.

Core

Quality bar

The line a model has to clear to count as "good enough" for you. In Standard mode that's a single accuracy threshold; in Advanced mode it's several metrics at once.

Metric

Accuracy

The share of your examples a model gets exactly right. 90% means it matched the expected answer on 9 of every 10 examples.

Metric

Worst-class accuracy

A model's accuracy on its weakest category, not its average. Catches a model that aces the common case (say, "billing") but quietly fails a rare-but-important one (say, "fraud").

Metric

Latency (p95)

How long responses take, at the 95th percentile — meaning 95% of calls come back faster than this number. A truer measure of "slow moments" than an average.

Metric

Format validity

For extraction tasks: the share of outputs that are well-formed — valid JSON, all required fields present — so your downstream system can actually use them.

Metric

Consistency

Whether a model gives the same answer when the same input is run more than once. Low consistency means flaky behaviour that's hard to build on.

Metric

Cost ceiling

A hard cap on price per 1,000 calls. Any model above it is excluded regardless of how good it is — useful when budget is the binding constraint.

Metric

Hallucination rate

The share of outputs containing unsupported or made-up content. Lower is better; it matters most where a confident wrong answer is costly.

Selection

Shortlist

The models worth actually testing for your task. Modeljury reads the task's difficulty and only shortlists models capable enough to have a real chance.

Selection

Task difficulty

An automatic read of how hard your task is — from the task type, number of categories, language nuance and input length. Harder tasks raise the capability bar a model must clear to make the shortlist.

Selection

Dominated

A model is "dominated" when another candidate is at least as capable and costs the same or less. There's no reason to test a dominated model, so it's eliminated automatically.

Deployment

Cloud vs self-hosted

Cloud models run via a provider's API — zero ops, latest models. Self-hosted (open-weight) models run on your own hardware — often cheaper at scale, and your data never leaves your environment.

Deployment

Compliance & residency

Rules that prune candidates before cost matters — for example, "data can't leave our environment" (excludes cloud APIs) or "no non-domestic providers" (excludes certain origins).

Core

Monitoring

The ongoing watch. New models ship constantly; monitoring re-runs your bake-off as they do and alerts you the moment a cheaper model clears your bars — so you stay on the best-value option without re-checking.

Plain-English definitions.

Bake-off

Quality bar

Accuracy

Worst-class accuracy

Latency (p95)

Format validity

Consistency

Cost ceiling

Hallucination rate

Shortlist

Task difficulty

Dominated

Cloud vs self-hosted

Compliance & residency

Monitoring

See the terms in action