MLCommons introduces AILuminate, a first-of-its-kind benchmark for measuring the safety of LLMs
MLCommons is announcing AILuminate, a first-of-its-kind safety test for large language models (LLMs) designed collaboratively by AI researchers and industry experts. The v1.0 benchmark, which provides a series of safety grades for the most widely used LLMs, builds on MLCommons’ track record of producing trusted AI performance benchmarks, and offers a scientific, independent analysis of LLM risk that can be immediately incorporated into company decision-making.
“Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety,” said Peter Mattson, founder and president of MLCommons. “Just like other complex technologies like cars or planes, AI models require industry-standard testing to guide responsible development. We hope this benchmark will assist developers in improving the safety of their systems and will give companies better clarity about the safety of the systems they use.”
The AILuminate benchmark assesses LLM responses to over 24,000 test prompts across twelve categories of hazards. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research and ensures an empirical analysis that can be trusted by industry and academia alike.
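For readers who want a concrete picture of the evaluation flow, the sketch below shows how a held-out safety benchmark of this kind can be structured: prompts grouped by hazard category are sent to the system under test, a separate evaluator model grades each response, and per-category scores are aggregated. This is a minimal illustration only; the function names (query_system_under_test, grade_response) and the pass/fail grading scheme are hypothetical stand-ins, not AILuminate’s actual implementation.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two models involved. In a real harness,
# these would call the model under test and an independently held evaluator
# model that the system under test cannot access.
def query_system_under_test(prompt: str) -> str:
    return "model response to: " + prompt  # placeholder

def grade_response(prompt: str, response: str) -> bool:
    return True  # placeholder: evaluator judges the response safe/unsafe

def run_safety_benchmark(prompts_by_hazard: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of responses graded safe for each hazard category.

    The test prompts are held out from the system under test, mirroring the
    independence described above: the model sees each prompt only at
    evaluation time and never sees the evaluator.
    """
    safe_counts: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for hazard, prompts in prompts_by_hazard.items():
        for prompt in prompts:
            response = query_system_under_test(prompt)
            if grade_response(prompt, response):
                safe_counts[hazard] += 1
            totals[hazard] += 1
    return {hazard: safe_counts[hazard] / totals[hazard] for hazard in totals}

if __name__ == "__main__":
    # Toy example with two illustrative hazard categories (the real
    # benchmark spans twelve categories and over 24,000 prompts).
    demo_prompts = {
        "violent_crimes": ["example prompt 1", "example prompt 2"],
        "privacy": ["example prompt 3"],
    }
    print(run_safety_benchmark(demo_prompts))
```

In this toy setup the per-category score is simply the fraction of responses the evaluator grades safe; the real benchmark converts such measurements into letter-style safety grades.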
“With roots in well-respected research institutions, an open and transparent process, and buy-in across the industry, MLCommons is uniquely equipped to advance a global baseline on AI risk and reliability,” said Rebecca Weiss, executive director of MLCommons. “We are proud to release our v1.0 benchmark, which marks a major milestone in our work to build a harmonized approach to safer AI. By making AI more transparent, more reliable, and more trusted, we can ensure its positive use in society and move the industry forward.”
This benchmark was developed by the MLCommons AI Risk and Reliability working group: a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven; civil society representatives; and technical experts from Google, Intel, NVIDIA, Meta, Microsoft, Qualcomm Technologies, Inc., and other industry leaders committed to a standardized approach to AI safety.
The working group plans to release ongoing updates as AI technologies continue to advance.
Cognizant that AI safety requires a coordinated global approach, MLCommons collaborated with international organizations such as the AI Verify Foundation to design the v1.0 AILuminate benchmark. The v1.0 benchmark is initially available in English, with French, Chinese, and Hindi versions to follow in early 2025.
For more information, visit https://mlcommons.org.