
CEO Bench

An open benchmark that measures how well language models handle executive leadership tasks.

CEO Bench evaluates how reliably large language models can respond to complex leadership scenarios. The benchmark generates realistic executive-level questions, collects model answers, and scores them against an automatic rubric so teams can compare performance over time. Everything is transparent: the prompts, responses, and scoring scripts live in the open.
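The evaluation loop can be pictured roughly like the sketch below. The file layout, rubric format, and keyword-based scoring here are illustrative assumptions for the sake of the example, not the benchmark's actual code.

```python
# Minimal sketch of a benchmark loop: load scenarios, query a model, score answers.
# File names, the rubric structure, and the scoring rule are assumptions.
import json
from typing import Callable


def score_response(response: str, rubric: dict) -> float:
    """Score an answer against a simple keyword rubric (illustrative only)."""
    criteria = rubric["criteria"]
    hits = sum(1 for criterion in criteria if criterion.lower() in response.lower())
    return hits / len(criteria)


def run_benchmark(prompts_path: str, ask_model: Callable[[str], str]) -> list[dict]:
    """Run each executive-scenario prompt through a model and record rubric scores."""
    with open(prompts_path) as f:
        scenarios = json.load(f)  # e.g. [{"prompt": "...", "rubric": {"criteria": [...]}}]

    results = []
    for scenario in scenarios:
        answer = ask_model(scenario["prompt"])
        results.append({
            "prompt": scenario["prompt"],
            "answer": answer,
            "score": score_response(answer, scenario["rubric"]),
        })
    return results
```

Because the prompts, model answers, and scores are all written out as plain data, runs from different models or different dates can be compared side by side.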

The project started as a tongue-in-cheek take on the idea of CEOs using AI to replace workers' jobs, but it turned into a genuinely useful way to evaluate LLMs.

I then instructed a combination of the best-performing LLMs to write a paper evaluating the results.