Every metric and term Modeljury uses, in one place — so "good enough" always means something specific.
A head-to-head test: every shortlisted model runs your examples, and each answer is graded against the expected label. The output is a ranking of accuracy versus cost.
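As a rough illustration of that loop, the sketch below runs two stand-in "models" (plain Python functions, not real model calls) over a handful of labelled examples and ranks them. The names `run_bake_off`, `keyword_model` and `always_billing`, the costs, and the sort order (accuracy first, then cost) are all assumptions made for the example, not Modeljury's actual implementation.

```python
def run_bake_off(models, examples):
    """Grade every model on every example and rank the results.

    models:   name -> (predict_fn, cost_per_1k_calls)   # hypothetical shape
    examples: list of (input_text, expected_label)
    """
    if not examples:
        raise ValueError("need at least one example")
    results = []
    for name, (predict, cost) in models.items():
        correct = sum(predict(text) == label for text, label in examples)
        results.append({
            "model": name,
            "accuracy": correct / len(examples),
            "cost_per_1k_calls": cost,
        })
    # Assumed ordering: most accurate first, cheaper model wins ties.
    results.sort(key=lambda r: (-r["accuracy"], r["cost_per_1k_calls"]))
    return results


def keyword_model(text):
    # Toy stand-in for a capable model.
    if "stole" in text:
        return "fraud"
    if "parcel" in text:
        return "shipping"
    return "billing"


def always_billing(text):
    # Toy stand-in for a cheap model that only knows the common case.
    return "billing"


examples = [
    ("refund my card", "billing"),
    ("someone stole my card", "fraud"),
    ("where is my parcel", "shipping"),
    ("charged twice", "billing"),
]
models = {
    "keyword-model": (keyword_model, 2.0),
    "always-billing": (always_billing, 0.5),
}
ranking = run_bake_off(models, examples)
for row in ranking:
    print(row)
```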
The line a model has to clear to count as "good enough" for you. In Standard mode that's a single accuracy threshold; in Advanced mode it's several metrics at once.
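One way to picture the two modes is as a set of thresholds that must all hold at once. In this sketch a bar is a dictionary of metric name to a (direction, threshold) pair; the metric names, the numbers and the `clears_bar` helper are illustrative assumptions, not Modeljury's configuration format.

```python
def clears_bar(metrics, bar):
    """Return True only if every threshold in the bar is met.

    bar maps a metric name to ("min", x) for higher-is-better metrics
    or ("max", x) for lower-is-better ones. Hypothetical structure.
    """
    for name, (kind, threshold) in bar.items():
        value = metrics[name]
        if kind == "min":
            if value < threshold:
                return False
        elif kind == "max":
            if value > threshold:
                return False
        else:
            raise ValueError(f"unknown threshold kind: {kind!r}")
    return True


# Standard mode: a single accuracy threshold.
standard_bar = {"accuracy": ("min", 0.90)}

# Advanced mode: several metrics at once.
advanced_bar = {
    "accuracy": ("min", 0.90),
    "worst_category_accuracy": ("min", 0.80),
    "p95_latency_ms": ("max", 1500),
}

model_metrics = {
    "accuracy": 0.93,
    "worst_category_accuracy": 0.71,
    "p95_latency_ms": 900,
}

print(clears_bar(model_metrics, standard_bar))  # True
print(clears_bar(model_metrics, advanced_bar))  # False
```

The same model passes the Standard bar and fails the Advanced one, because its weakest category falls below the 0.80 threshold.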
The share of your examples a model gets exactly right. 90% means it matched the expected answer on 9 of every 10 examples.
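In code, exact-match accuracy is just a count of matches divided by the number of examples. A minimal sketch, with made-up labels:

```python
def accuracy(predictions, expected):
    """Share of examples where the prediction exactly matches the expected label."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        raise ValueError("need at least one example")
    correct = sum(p == e for p, e in zip(predictions, expected))
    return correct / len(expected)


expected = ["billing", "fraud", "billing", "shipping", "billing",
            "fraud", "billing", "shipping", "billing", "billing"]
predictions = list(expected)
predictions[1] = "billing"  # one wrong answer out of ten

print(accuracy(predictions, expected))  # 0.9
```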
A model's accuracy on its weakest category, not its average. Catches a model that aces the common case (say, "billing") but quietly fails a rare-but-important one (say, "fraud").
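The sketch below shows how a model can score 90% overall while getting only half of the rare category right. Grouping by the expected label is an assumption about how categories are defined; the function names are illustrative.

```python
from collections import defaultdict


def per_category_accuracy(predictions, expected):
    """Accuracy computed separately for each expected category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for p, e in zip(predictions, expected):
        totals[e] += 1
        correct[e] += int(p == e)
    return {cat: correct[cat] / totals[cat] for cat in totals}


def worst_category_accuracy(predictions, expected):
    """Return (category, accuracy) for the weakest category."""
    by_cat = per_category_accuracy(predictions, expected)
    if not by_cat:
        raise ValueError("need at least one example")
    worst = min(by_cat, key=by_cat.get)
    return worst, by_cat[worst]


# Eight common "billing" examples, two rare "fraud" examples.
expected = ["billing"] * 8 + ["fraud"] * 2
predictions = ["billing"] * 8 + ["billing", "fraud"]  # one fraud case missed

overall = sum(p == e for p, e in zip(predictions, expected)) / len(expected)
print(overall)                                         # 0.9
print(worst_category_accuracy(predictions, expected))  # ('fraud', 0.5)
```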
How long responses take, at the 95th percentile — meaning 95% of calls come back faster than this number. A truer measure of "slow moments" than an average.
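To see why the 95th percentile tells you more than the average, consider 100 calls where 10 are very slow. This sketch uses the nearest-rank percentile method, which is an assumption; Modeljury may compute the percentile differently.

```python
def p95_latency(latencies_ms):
    """95th-percentile latency using the nearest-rank method (assumed)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# 90 fast calls and 10 slow ones (made-up numbers, in milliseconds).
latencies = [200] * 90 + [2000] * 10

mean_latency = sum(latencies) / len(latencies)
print(mean_latency)            # 380.0
print(p95_latency(latencies))  # 2000
```

The average suggests calls take about 0.4 seconds; the p95 shows that the slow calls users will actually notice take 2 seconds.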
For extraction tasks: the share of outputs that are well-formed — valid JSON, all required fields present — so your downstream system can actually use them.
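A minimal version of that check parses each output as JSON and confirms the required fields are present. The field names and sample outputs here are invented for illustration, and a real check might also validate types or a full schema.

```python
import json


def is_well_formed(raw_output, required_fields):
    """True if the output parses as a JSON object with every required field."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and all(f in parsed for f in required_fields)


def format_validity(raw_outputs, required_fields):
    """Share of outputs that are well-formed."""
    if not raw_outputs:
        raise ValueError("need at least one output")
    valid = sum(is_well_formed(o, required_fields) for o in raw_outputs)
    return valid / len(raw_outputs)


outputs = [
    '{"invoice_id": "A1", "total": 42.0}',
    '{"invoice_id": "A2"}',              # missing "total"
    '{"invoice_id": "A3", "total": ',    # truncated, not valid JSON
    '{"invoice_id": "A4", "total": 7}',
]

print(format_validity(outputs, ["invoice_id", "total"]))  # 0.5
```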
Whether a model gives the same answer when the same input is run more than once. Low consistency means flaky behaviour that's hard to build on.
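One simple way to put a number on this, assumed here for illustration, is the share of inputs for which every repeated run returned the same answer. Other definitions (for example, pairwise agreement) are equally plausible.

```python
def consistency(runs_per_input):
    """Share of inputs whose repeated runs all produced the same answer.

    runs_per_input: one list of answers per input, one answer per run.
    """
    if not runs_per_input:
        raise ValueError("need at least one input")
    stable = sum(len(set(answers)) == 1 for answers in runs_per_input)
    return stable / len(runs_per_input)


# Four inputs, each run three times (made-up answers).
runs = [
    ["billing", "billing", "billing"],
    ["fraud", "billing", "fraud"],        # flaky: the answer changed between runs
    ["shipping", "shipping", "shipping"],
    ["fraud", "fraud", "fraud"],
]

print(consistency(runs))  # 0.75
```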
A hard cap on price per 1,000 calls. Any model above it is excluded regardless of how good it is — useful when budget is the binding constraint.
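As a filter, the cap is applied before accuracy is even considered. In this sketch the `Candidate` class, model names and prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float
    cost_per_1k_calls: float


def apply_cost_cap(candidates, cap_per_1k_calls):
    """Keep only candidates at or below the cap, however accurate the rest are."""
    return [c for c in candidates if c.cost_per_1k_calls <= cap_per_1k_calls]


candidates = [
    Candidate("model-a", accuracy=0.95, cost_per_1k_calls=12.0),
    Candidate("model-b", accuracy=0.91, cost_per_1k_calls=4.0),
    Candidate("model-c", accuracy=0.88, cost_per_1k_calls=1.5),
]

affordable = apply_cost_cap(candidates, cap_per_1k_calls=5.0)
print([c.name for c in affordable])  # ['model-b', 'model-c']
```

The most accurate candidate, model-a, is dropped because it costs more than the cap.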
The share of outputs containing unsupported or made-up content. Lower is better; it matters most where a confident wrong answer is costly.
The models worth actually testing for your task. Modeljury reads the task's difficulty and only shortlists models capable enough to have a real chance.
An automatic read of how hard your task is — from the task type, number of categories, language nuance and input length. Harder tasks raise the capability bar a model must clear to make the shortlist.
A model is "dominated" when another candidate is at least as capable and costs the same or less. There's no reason to test a dominated model, so it's eliminated automatically.
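The elimination rule can be sketched as a pairwise comparison over capability and cost. One assumption is added here that the definition above leaves open: two models that tie on both capability and cost do not eliminate each other, so a model must be strictly better on at least one of the two to dominate. The `Model` class and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capability: float  # higher is better (hypothetical score)
    cost: float        # lower is better


def dominates(a, b):
    """True if a is at least as capable and no more expensive than b,
    and strictly better on at least one of the two (assumed tie rule)."""
    at_least_as_good = a.capability >= b.capability and a.cost <= b.cost
    strictly_better = a.capability > b.capability or a.cost < b.cost
    return at_least_as_good and strictly_better


def prune_dominated(models):
    """Drop every model that some other candidate dominates."""
    return [
        m for m in models
        if not any(dominates(other, m) for other in models if other is not m)
    ]


models = [
    Model("small", capability=60, cost=1.0),
    Model("mid", capability=75, cost=4.0),
    Model("pricey-mid", capability=70, cost=6.0),  # weaker and dearer than "mid"
    Model("large", capability=90, cost=12.0),
]

print([m.name for m in prune_dominated(models)])  # ['small', 'mid', 'large']
```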
Cloud models run via a provider's API — zero ops, latest models. Self-hosted (open-weight) models run on your own hardware — often cheaper at scale, and your data never leaves your environment.
Rules that prune candidates before cost matters — for example, "data can't leave our environment" (excludes cloud APIs) or "no non-domestic providers" (excludes certain origins).
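A sketch of how such rules might prune a candidate list before any cost comparison. The field names (`deployment`, `provider_origin`), the country codes and the model names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    deployment: str       # "cloud" or "self_hosted"
    provider_origin: str  # hypothetical country code of the provider


def apply_constraints(models, data_must_stay_internal=False, allowed_origins=None):
    """Remove models that break a hard rule, regardless of price or quality."""
    kept = []
    for m in models:
        if data_must_stay_internal and m.deployment != "self_hosted":
            continue  # cloud APIs would send data outside the environment
        if allowed_origins is not None and m.provider_origin not in allowed_origins:
            continue  # provider origin is not on the allowed list
        kept.append(m)
    return kept


models = [
    Model("cloud-a", deployment="cloud", provider_origin="US"),
    Model("cloud-b", deployment="cloud", provider_origin="XX"),
    Model("open-c", deployment="self_hosted", provider_origin="US"),
]

internal_only = apply_constraints(models, data_must_stay_internal=True)
print([m.name for m in internal_only])  # ['open-c']

domestic_only = apply_constraints(models, allowed_origins={"US"})
print([m.name for m in domestic_only])  # ['cloud-a', 'open-c']
```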
The ongoing watch. New models ship constantly; monitoring re-runs your bake-off as they do and alerts you the moment a cheaper model clears your bars — so you stay on the best-value option without re-checking.
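The alert condition can be pictured as a check run after each re-run of the bake-off: is there a model that clears the bar and costs less than the one you use today? This is a simplified sketch with a single accuracy bar; the function name, result shape and numbers are assumptions, and scheduling and notification are left out.

```python
def find_cheaper_winner(current, results, min_accuracy):
    """Return the cheapest result that clears the bar and undercuts the
    current choice on cost, or None if there is no such model."""
    better = [
        r for r in results
        if r["accuracy"] >= min_accuracy
        and r["cost_per_1k_calls"] < current["cost_per_1k_calls"]
    ]
    return min(better, key=lambda r: r["cost_per_1k_calls"], default=None)


current = {"model": "model-a", "accuracy": 0.93, "cost_per_1k_calls": 8.0}

# Results from a fresh re-run that includes newly released models (made up).
latest_results = [
    {"model": "model-b", "accuracy": 0.91, "cost_per_1k_calls": 5.0},
    {"model": "model-c", "accuracy": 0.85, "cost_per_1k_calls": 1.0},  # below the bar
    {"model": "model-d", "accuracy": 0.95, "cost_per_1k_calls": 9.0},  # not cheaper
]

winner = find_cheaper_winner(current, latest_results, min_accuracy=0.90)
if winner is not None:
    print(f"Alert: {winner['model']} clears the bar at a lower cost")
```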
Run a bake-off on your own task and watch the metrics decide the verdict.