A benchmark created by Scale AI and the Center for AI Safety (CAIS) to evaluate the reasoning and knowledge capabilities of AI systems at the frontier of human expertise.