MMLU, HumanEval, Arena Elo: how we actually measure whether a model is good. Try the benchmarks yourself.