Large language models are no longer experimental tools confined to research labs. They now power smart chatbots and virtual ...
RiskRubric provides a six-pillar framework to quantify AI model risk, guiding secure, compliant adoption with evidence-based ...
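The snippet doesn't show how RiskRubric actually computes its scores; as a minimal sketch, assuming each of the six pillars yields a normalized 0-100 score and the composite is a weighted average, the aggregation might look like this (the pillar names, weights, and scale below are hypothetical, not RiskRubric's published spec):

```python
# Hypothetical six-pillar risk aggregation; pillar names, weights, and the
# 0-100 scale are assumptions, not RiskRubric's actual methodology.
PILLARS = {
    "transparency": 0.15,
    "reliability": 0.20,
    "security": 0.20,
    "privacy": 0.15,
    "safety": 0.20,
    "reputation": 0.10,
}

def composite_risk_score(pillar_scores: dict[str, float]) -> float:
    """Weighted average of per-pillar scores (0 = worst, 100 = best)."""
    missing = PILLARS.keys() - pillar_scores.keys()
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    return sum(PILLARS[p] * pillar_scores[p] for p in PILLARS)

# Example: an evidence-based scorecard for a single model.
print(composite_risk_score({
    "transparency": 72, "reliability": 85, "security": 64,
    "privacy": 78, "safety": 81, "reputation": 90,
}))
```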
Wang, S. (2025). A Review of Agent Data Evaluation: Status, Challenges, and Future Prospects as of 2025. Journal of Software ...
New joint safety testing from the UK-based nonprofit Apollo Research and OpenAI set out to reduce covert behaviors such as scheming in AI models. What researchers found could complicate promising ...
For a long time, training large models has relied heavily on the guidance of a "teacher": either human-annotated gold-standard answers, which are time-consuming and labor-intensive to produce, or ...
On September 13, at GOSIM HANGZHOU 2025, an event hosted by the GOSIM Global Open Source Innovation Conference and organized by ...
Sebastian Crossa is the Co-founder of ZeroEval (YC S25), a platform to measure and optimize the quality of AI agents.
OpenAI recently revealed that AI models may resort to deceiving users, a behavior it calls 'AI scheming'. But what is this ...
The Federal Aviation Administration (FAA) and MITRE are introducing a new benchmark for evaluating large language models (LLMs) on aerospace tasks. Given the ...
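The snippet doesn't describe the benchmark's tasks or metrics; as a generic sketch of how such an evaluation harness typically scores an LLM, assuming a caller-supplied `query_model` callable and exact-match grading (both hypothetical, not the FAA/MITRE design):

```python
# Generic benchmark-harness sketch; the task set, the query_model callable,
# and exact-match grading are assumptions, not the FAA/MITRE benchmark.
from typing import Callable

# Hypothetical aerospace-flavored items; a real benchmark would load these from disk.
TASKS = [
    {"prompt": "What does METAR stand for?",
     "answer": "Meteorological Aerodrome Report"},
    {"prompt": "What is the standard sea-level pressure in inHg?",
     "answer": "29.92"},
]

def evaluate(query_model: Callable[[str], str]) -> float:
    """Return the fraction of tasks the model answers exactly right."""
    correct = sum(
        query_model(t["prompt"]).strip() == t["answer"] for t in TASKS
    )
    return correct / len(TASKS)

# Usage with a stub model that always returns the same string:
print(evaluate(lambda prompt: "29.92"))  # 0.5 on this toy task set
```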
China's Zeekr could follow its Tesla Model Y competitor in Australia with a sleek electric wagon to undercut German marques.
US car safety regulators have opened an investigation into the door handles of Tesla's flagship Model Y following ...
None of the most widely used large language models (LLMs), which are rapidly upending how humanity acquires knowledge, has ...