Microsoft Unveils ASSERT: AI Behavior Testing Framework

The Lead

Microsoft has launched ASSERT, an open-source framework designed to make evaluating application-specific AI behavior easier. The framework uses AI to turn natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests.

How ASSERT Works

ASSERT takes plain-language descriptions of an AI model's expected behavior and policies, turns them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the paths the AI system takes, including intermediate actions and tool calls, allowing developers to inspect where failures happen.

The Data Analysis

By providing system context, tools, and constraints, developers can further customize what the evaluations cover. For instance, a developer could specify that a document research AI agent shouldn't send emails to people outside the company and should limit confidential information to C-level executives. ASSERT will use those rules to generate test cases that check whether the system follows those rules on an ongoing basis.

The Impact Analysis

Sarah Bird, chief product officer of Responsible AI at Microsoft, emphasized the importance of evaluations in making good decisions. 'If you don't understand the behavior of the AI system, it's really hard to know if it's meeting your organization's bar,' she said. ASSERT fills a gap that broader, more general evaluations cannot when AI models are intended to behave in a manner shaped by an application's context, policies, and tools.

The Prediction

The release of ASSERT comes amidst a broader shift in the AI industry towards repeatable testing and regression checks. As models grow more capable, researchers are focusing on evaluating systems when they're being built, after deployment, and even for continuous monitoring. With ASSERT, Microsoft aims to provide a tool that can be used throughout the AI development lifecycle to ensure trustworthy systems.