Large language models are no longer experimental tools confined to research labs. They now power smart chatbots and virtual ...
RiskRubric provides a six-pillar framework to quantify AI model risk, guiding secure, compliant adoption with evidence-based ...
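The snippet doesn't show how RiskRubric actually computes its scores; as a minimal sketch, assuming each of the six pillars yields a normalized 0-100 score and the composite is a weighted average, the aggregation might look like this (the pillar names, weights, and scale below are hypothetical, not RiskRubric's published spec):

```python
# Hypothetical six-pillar risk aggregation; pillar names, weights, and the
# 0-100 scale are assumptions, not RiskRubric's actual methodology.
PILLARS = {
    "transparency": 0.15,
    "reliability": 0.20,
    "security": 0.20,
    "privacy": 0.15,
    "safety": 0.20,
    "reputation": 0.10,
}

def composite_risk_score(pillar_scores: dict[str, float]) -> float:
    """Weighted average of per-pillar scores (0 = worst, 100 = best)."""
    missing = PILLARS.keys() - pillar_scores.keys()
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    return sum(PILLARS[p] * pillar_scores[p] for p in PILLARS)

# Example: an evidence-based scorecard for a single model.
print(composite_risk_score({
    "transparency": 72, "reliability": 85, "security": 64,
    "privacy": 78, "safety": 81, "reputation": 90,
}))
```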
Wang, S. (2025). A Review of Agent Data Evaluation: Status, Challenges, and Future Prospects as of 2025. Journal of Software ...
New joint safety testing from the UK-based nonprofit Apollo Research and OpenAI set out to reduce covert behaviors such as scheming in AI models. What researchers found could complicate promising ...
For a long time, training large models has relied heavily on the guidance of a "teacher": either human-annotated gold-standard answers, which are time-consuming and labor-intensive to produce, or ...
On September 13, at GOSIM HANGZHOU 2025, an event hosted by the GOSIM Global Open Source Innovation Conference and organized by ...
Sebastian Crossa is the Co-founder of ZeroEval (YC S25), a platform to measure and optimize the quality of AI agents.
OpenAI recently revealed that AI models may resort to deceiving users, a behavior it calls 'AI scheming'. But what is this ...
The Federal Aviation Administration (FAA) and MITRE are introducing a new benchmark for evaluating large language models (LLMs) on aerospace tasks. Given the ...
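The snippet doesn't describe the benchmark's tasks or metrics; as a generic sketch of how such an evaluation harness typically scores an LLM, assuming a caller-supplied `query_model` callable and exact-match grading (both hypothetical, not the FAA/MITRE design):

```python
# Generic benchmark-harness sketch; the task set, the query_model callable,
# and exact-match grading are assumptions, not the FAA/MITRE benchmark.
from typing import Callable

# Hypothetical aerospace-flavored items; a real benchmark would load these from disk.
TASKS = [
    {"prompt": "What does METAR stand for?",
     "answer": "Meteorological Aerodrome Report"},
    {"prompt": "What is the standard sea-level pressure in inHg?",
     "answer": "29.92"},
]

def evaluate(query_model: Callable[[str], str]) -> float:
    """Return the fraction of tasks the model answers exactly right."""
    correct = sum(
        query_model(t["prompt"]).strip() == t["answer"] for t in TASKS
    )
    return correct / len(TASKS)

# Usage with a stub model that always returns the same string:
print(evaluate(lambda prompt: "29.92"))  # 0.5 on this toy task set
```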
China's Zeekr could follow its Tesla Model Y competitor in Australia with a sleek electric wagon to undercut German marques.
US car safety regulators have opened an investigation into the door handles of Tesla's flagship Model Y following ...
None of the most widely used large language models (LLMs), which are rapidly upending how humanity acquires knowledge, has ...