Bring your own task — get the cheapest model that clears your bar

Describe your task. We pick the models worth testing.

Paste or upload your examples. Modeljury reads how hard the task is, builds an evaluation tailored to it, shortlists the models capable enough (and tells you why it cut the rest), then bakes off the survivors to find the cheapest one that clears your bar.
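The final step above, choosing the cheapest model that clears your bar, can be sketched as a filter followed by a minimum. This is only an illustration, not Modeljury's actual implementation: the `Candidate` type, the model names, the costs, and the accuracies are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str            # hypothetical model label
    cost_per_run: float  # hypothetical cost, in arbitrary units
    accuracy: float      # fraction of examples graded correct, 0.0 to 1.0


def cheapest_clearing_bar(candidates: list[Candidate], bar: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose accuracy meets the bar, or None."""
    passing = [c for c in candidates if c.accuracy >= bar]
    return min(passing, key=lambda c: c.cost_per_run, default=None)


# Made-up figures, for illustration only.
candidates = [
    Candidate("model-a", cost_per_run=4.00, accuracy=0.97),
    Candidate("model-b", cost_per_run=0.60, accuracy=0.92),
    Candidate("model-c", cost_per_run=0.10, accuracy=0.81),
]

winner = cheapest_clearing_bar(candidates, bar=0.90)
print(winner.name if winner else "no model clears the bar")
```

With these invented numbers, the most accurate model is not the pick: a cheaper one also clears the 90% bar, and the cheapest one of all falls short of it.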

Run a bake-off

⚡ Live · runs real models

Describe what you want in plain English, by typing or by voice. Next, add a few labelled examples below: generate them with AI, upload a file, or type your own. Finally, check the task difficulty reading underneath.

Labelled examples: none yet

The bake-off grades each model against examples in the form input | expected, one per line. Add them in any of three ways: generate them with AI, upload a file, or type your own.
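To make the input | expected format concrete, here is a minimal sketch of how such lines could be read into pairs. The function name is hypothetical, and splitting on the last pipe (so that an input may itself contain a pipe) is an assumption; the tool's real parsing rules are not documented here.

```python
def parse_examples(text: str) -> list[tuple[str, str]]:
    """Parse 'input | expected' lines into (input, expected) pairs.

    Assumption: the LAST '|' on a line separates input from expected output.
    Blank lines are skipped; a line without '|' is rejected.
    """
    examples = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if "|" not in line:
            raise ValueError(f"line {number}: expected 'input | expected'")
        source, expected = line.rsplit("|", 1)
        examples.append((source.strip(), expected.strip()))
    return examples


sample = """great phone, battery lasts days | positive

broke in a week | negative
"""
print(parse_examples(sample))
```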

Task difficulty (auto-read)
Easy · Moderate · Challenging · Very hard

This is what makes Modeljury different: it builds an evaluation tailored to your task — the test cases, what counts as correct, and how it's graded. Review it and add anything that's missing.

Read your task in step 1 first, then build the evaluation here.

Constraints prune the candidates before cost matters.

How good is good enough?
90% ≈ at most 1 wrong in 10
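The arithmetic behind a bar like this one can be sketched for any number of examples. The helper below is a hypothetical illustration: it assumes a model passes when its share of correct answers is at least the bar, and it uses integer math to avoid floating-point rounding surprises.

```python
def max_wrong_allowed(bar_percent: int, n_examples: int) -> int:
    """How many examples a model may miss and still meet the bar.

    Assumption: passing means correct / n_examples >= bar_percent / 100.
    """
    # Smallest count of correct answers that meets the bar (integer ceiling).
    needed_correct = (bar_percent * n_examples + 99) // 100
    return n_examples - needed_correct


for n in (5, 10, 25, 100):
    print(n, "examples:", max_wrong_allowed(90, n), "wrong allowed")
```

Under that assumption, a 90% bar allows 1 miss in 10 examples but none in 5, so a very small example set makes the bar stricter than the percentage suggests.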
Standard run · free