MLCommons introduces AILuminate, a first-of-its-kind benchmark for measuring the safety of LLMs
MLCommons is announcing AILuminate, a first-of-its-kind safety test for large language models (LLMs) designed collaboratively by AI researchers and industry experts. The v1.0 benchmark, which provides a series of safety grades for the most widely used LLMs, builds on MLCommons’ track record of producing trusted AI performance benchmarks, and offers a scientific, independent analysis of LLM risk that can be immediately incorporated into company decision-making.
“Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety,” said Peter Mattson, founder and president of MLCommons. “Just like other complex technologies like cars or planes, AI models require industry-standard testing to guide responsible development. We hope this benchmark will assist developers in improving the safety of their systems and will give companies better clarity about the safety of the systems they use.”
The AILuminate benchmark assesses LLM responses to over 24,000 test prompts across twelve categories of hazards. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research and ensures an empirical analysis that can be trusted by industry and academia alike.
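For readers who want a concrete picture of the evaluation flow, the sketch below shows how a held-out safety benchmark of this kind can be structured: prompts grouped by hazard category are sent to the system under test, a separate evaluator model grades each response, and per-category scores are aggregated. This is a minimal illustration only; the function names (query_system_under_test, grade_response) and the pass/fail grading scheme are hypothetical stand-ins, not AILuminate’s actual implementation.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two models involved. In a real harness,
# these would call the model under test and an independently held evaluator
# model that the system under test cannot access.
def query_system_under_test(prompt: str) -> str:
    return "model response to: " + prompt  # placeholder

def grade_response(prompt: str, response: str) -> bool:
    return True  # placeholder: evaluator judges the response safe/unsafe

def run_safety_benchmark(prompts_by_hazard: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of responses graded safe for each hazard category.

    The test prompts are held out from the system under test, mirroring the
    independence described above: the model sees each prompt only at
    evaluation time and never sees the evaluator.
    """
    safe_counts: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for hazard, prompts in prompts_by_hazard.items():
        for prompt in prompts:
            response = query_system_under_test(prompt)
            if grade_response(prompt, response):
                safe_counts[hazard] += 1
            totals[hazard] += 1
    return {hazard: safe_counts[hazard] / totals[hazard] for hazard in totals}

if __name__ == "__main__":
    # Toy example with two illustrative hazard categories (the real
    # benchmark spans twelve categories and over 24,000 prompts).
    demo_prompts = {
        "violent_crimes": ["example prompt 1", "example prompt 2"],
        "privacy": ["example prompt 3"],
    }
    print(run_safety_benchmark(demo_prompts))
```

In this toy setup the per-category score is simply the fraction of responses the evaluator grades safe; the real benchmark converts such measurements into letter-style safety grades.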
“With roots in well-respected research institutions, an open and transparent process, and buy-in across the industry, MLCommons is uniquely equipped to advance a global baseline on AI risk and reliability,” said Rebecca Weiss, executive director of MLCommons. “We are proud to release our v1.0 benchmark, which marks a major milestone in our work to build a harmonized approach to safer AI. By making AI more transparent, more reliable, and more trusted, we can ensure its positive use in society and move the industry forward.”
This benchmark was developed by the MLCommons AI Risk and Reliability working group: a team of leading AI researchers from institutions including Stanford University, Columbia University, and TU Eindhoven; civil society representatives; and technical experts from Google, Intel, NVIDIA, Meta, Microsoft, Qualcomm Technologies, Inc., and other industry leaders committed to a standardized approach to AI safety.
The working group plans to release ongoing updates as AI technologies continue to advance.
Cognizant that AI safety requires a coordinated global approach, MLCommons collaborated with international organizations such as the AI Verify Foundation to design the v1.0 AILuminate benchmark. The v1.0 benchmark is initially available in English, with French, Chinese, and Hindi versions to follow in early 2025.
For more information, visit https://mlcommons.org.