
CEO Bench

An open benchmark that measures how well language models handle executive leadership tasks.

CEO Bench evaluates how reliably large language models can respond to complex leadership scenarios. The benchmark generates realistic executive-level questions, collects model answers, and scores them against an automatic rubric so teams can compare performance over time. Everything is transparent: the prompts, responses, and scoring scripts live in the open.
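The evaluation loop can be pictured roughly like the sketch below. The file layout, rubric format, and keyword-based scoring here are illustrative assumptions for the sake of the example, not the benchmark's actual code.

```python
# Minimal sketch of a benchmark loop: load scenarios, query a model, score answers.
# File names, the rubric structure, and the scoring rule are assumptions.
import json
from typing import Callable


def score_response(response: str, rubric: dict) -> float:
    """Score an answer against a simple keyword rubric (illustrative only)."""
    criteria = rubric["criteria"]
    hits = sum(1 for criterion in criteria if criterion.lower() in response.lower())
    return hits / len(criteria)


def run_benchmark(prompts_path: str, ask_model: Callable[[str], str]) -> list[dict]:
    """Run each executive-scenario prompt through a model and record rubric scores."""
    with open(prompts_path) as f:
        scenarios = json.load(f)  # e.g. [{"prompt": "...", "rubric": {"criteria": [...]}}]

    results = []
    for scenario in scenarios:
        answer = ask_model(scenario["prompt"])
        results.append({
            "prompt": scenario["prompt"],
            "answer": answer,
            "score": score_response(answer, scenario["rubric"]),
        })
    return results
```

Because the prompts, model answers, and scores are all written out as plain data, runs from different models or different dates can be compared side by side.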

The project started as a tongue-in-cheek take on the idea of CEOs using AI to replace workers' jobs, but it turned into a genuinely useful way to evaluate LLMs.

I then instructed a combination of the best-performing LLMs to write a paper evaluating the results.