How to Test and Evaluate AI?

AI Testing and Assurance Framework for Public Sector

Evaluation-driven approach to building safe, fair and reliable AI

Table of Contents

  1. Executive Summary
  2. Introduction
  3. Core AI Quality Attributes
  4. Testing and Evaluation Principles for AI
  5. Continuous Defensive Assurance Model
  6. Modular AI Testing Approach
  7. Tools and Resources for Testing
  8. Conclusion
  9. Review Log

Executive Summary

AI is playing a growing role across public services, from decision support to automation to frontline service delivery. As these systems become more capable and more embedded in critical processes, it is essential that we understand how well they work, how fair they are, and what happens when they go wrong.

This framework sets out a practical, shared approach to the testing, evaluation and assurance of AI systems across the public sector. It focuses on helping teams test systems, evaluate AI models and assure quality, trustworthiness, and risk, whether the AI in question is a traditional rule-based system, a machine learning model, or a newer generative or agentic system.

In this framework, testing refers to checking the whole AI-enabled system (infrastructure, APIs, integrations, user interfaces, data flows). Evaluation focuses on the AI model layer, assessing how well it performs against agreed quality attributes. Testing generates the evidence, evaluation interprets it in context, and assurance gives confidence that systems are safe, fair and accountable.

Testing AI is not the same as testing traditional software. AI behaves probabilistically, learns from data that may shift over time, and may act autonomously or opaquely. That means we need new methods, new language, and new standards. This framework provides the foundations to support that shift.

Importantly, this is not a checklist or a rigid standard. It is a living tool to help organisations ask better questions, design better tests and assurance strategies, and share what works. Departments can tailor the framework to fit their context, and contribute their own learning back into the shared understanding of how to evaluate AI in the public sector responsibly.

Introduction

Purpose

The purpose of this framework is to provide a shared reference point for government teams who are responsible for testing, evaluation and the wider assurance of AI systems. Whether you are building AI internally, procuring it externally, or overseeing its implementation, this document is designed to help you ask the right questions about how AI systems are tested before development, during development, and after deployment. It helps establish a minimum baseline of testing and evaluation for AI systems that impact the public, while allowing departments to adapt it to their own governance models, service needs, and levels of technical maturity.

Scope

This framework applies across the full AI system lifecycle: from initial planning and design through development, testing, deployment, and ongoing monitoring. It is intended for use across UK Government departments, agencies, and other public sector bodies developing or procuring AI systems. The scope covers the following:

The framework can be adapted to AI initiatives of varying size and risk. It covers pre-deployment testing (e.g. validating models in controlled environments) as well as post-deployment assurance (e.g. monitoring live systems for drift or issues).

The framework is cross-disciplinary, encompassing activities for data scientists, developers, test engineers, policy and ethics reviewers, information testing teams, and senior decision-makers. It does not replace specific legal or regulatory requirements, but rather consolidates and references them so that teams can ensure compliance through testing and evaluation.

This framework is intended as a common foundation, not a fixed prescription. It sets out principles, testing strategies, and testing modules that departments can adopt, adapt, or extend based on their specific operational contexts, risk profiles, and technical architectures. Flexibility is essential: testing activities should be proportionate to the impact of the AI system, and tailored to its type, whether rule-based, machine learning, generative, or agentic. Departments are encouraged to use this framework as a baseline for their own testing strategies, aligning with key principles while addressing the unique demands of their services.

Audience

The intended audience for this document includes all teams and stakeholders involved in the development, deployment, and oversight of AI systems in Government. In particular:

AI Types Covered

This framework is designed to cover a broad range of AI system types, noting that different testing approaches may be needed for each. It explicitly addresses the following categories of AI:

Because agentic AI can evolve through interactions, continuous monitoring and periodic re-validation are crucial for this type.

Where specific testing needs differ by AI type, these are called out in the relevant sections of this framework.

Core AI Quality Attributes

Adapted from the Cross-Government Testing Community AI Quality Workshop

To assure that AI systems are safe, fair, and effective, we use a set of quality attributes. These guide both system-level testing (does the AI system as a whole work as intended in real conditions?) and model-level evaluation (does the AI model behave fairly, accurately, and robustly?). Together, they provide assurance over time.

We group attributes into four themes:

  1. Safety and Ethics
  2. Openness & Trust
  3. Performance & Resilience
  4. User & Context Fit
Safety and Ethics

Quality Attribute | What it means | Focus
Autonomy | Degree to which the AI can operate independently of human intervention. | Challenge system outside its envelope, test for human handover triggers, verify deferral to humans when thresholds are exceeded.
Fairness | Mitigation of unwanted bias. | Conduct bias audits, test for data skew, validate outcomes against equality law.
Safety | Ability of system to not cause harm to life, property, or the environment. | Worst-case scenario testing, validate safety mechanisms. If standards exist (e.g. ISO/IEC TR 5469:2024), ensure tests cover those criteria. Safety testing often overlaps with robustness and ethical testing but focuses on preventing harm.
Ethical Compliance | Alignment with ethical guidelines, values, and laws. | Checklist-based testing, ethics panel review, pass/fail criteria.
Side Effects & Reward Hacking | The presence of unintended behaviours arising from the AI’s optimisation process. | Identify and trigger any potential side effects of the AI’s objectives; simulation testing and anomaly detection on the agent’s behaviours.
Security | Protection against unauthorised access, misuse, or adversarial attacks. | Security testing, including penetration tests on AI APIs, checks for data leakage, and proper authentication/authorisation around the AI service.
Openness & Trust

Quality Attribute | What it means | Focus
Transparency | Visibility of AI’s data, logic, and workings. | Verify traceability records, audit trails, documentation like model cards or algorithmic transparency records.
Explainability | Ability to explain AI outputs in human terms. | Use explainability tools (e.g. SHAP, LIME), domain expert review, user comprehension checks.
Accountability | Clear responsibility for AI outcomes and decisions. | Trace decision responsibility, review governance mechanisms.
Compliance | Adherence to organisational policies, standards, and relevant legal and regulatory requirements. | Assure compliance with legal frameworks such as the Data Protection Act and GDPR, and with AI assurance policies.
Performance & Resilience

Quality Attribute | What it means | Focus
Functional Suitability | Degree to which the AI fulfils its specified tasks. | Validate against intended outputs for each requirement.
Performance Efficiency | Speed, responsiveness, and resource use. | Measure latency, throughput, and system scalability.
Reliability | Stability of performance under varying conditions. | Stress tests, fault injection, endurance testing.
Maintainability | Ease of updating, fixing, or improving the system. | Verify retraining process, modularity, and version control.
Evolution | The capability of the AI to learn and improve from experience over time. | Test for concept or policy drift handling, validate that learning improves performance without breaking existing functionality.
User & Context Fit

Quality Attribute | What it means | Focus
Usability | Ease of use for human operators or end users. | UX testing, error handling, usability review.
Accessibility | Designed for a wide range of abilities. | Test and audit the AI and all its user interfaces against the Web Content Accessibility Guidelines (WCAG) version 2.2 AA, to help it meet the Public Sector Bodies Accessibility Regulations.
Compatibility | Ability to work across environments and integrations. | Test across OS, APIs, browsers, devices, databases, cloud/on-prem.
Portability | Ability to move across platforms or contexts. | Domain drift checks, portability tests in new environments.
Adaptability | Ability to adjust to changes in data or environment. | Test for handling new data patterns, concept drift.
Data Correctness | Quality and diversity of data used to train the model. | Validate data quality, completeness, and representativeness.

Testing and Evaluation Principles for AI

Testing AI is not the same as testing traditional software. Unlike rule-based systems, AI adapts, learns from data, and generates outputs that may be uncertain or change over time. This means we must test the system around the AI, evaluate the models inside it, and build assurance that both remain safe, fair and accountable.

The following principles set out how to do this in practice:

These principles encourage testers and project teams to look at AI quality from multiple angles (technical, ethical, and operational), and to integrate testing as a continuous effort.

Continuous Defensive Assurance Model

Assuring an AI system’s quality is not a one-time event; it must be woven through the entire AI development lifecycle. In this framework, we adopt a continuous defensive assurance model, identifying key testing activities and deliverables at each phase of an AI project. The sections below provide an overview of each phase, its testing and assurance focus, and examples of metrics or outcomes to measure:

[Figure: Continuous defensive assurance model for the AI testing framework]

The table below shows where in the lifecycle each quality attribute should be considered. It helps you identify when to test, and validate specific characteristics of your AI system.

[Table: maps each quality attribute (Autonomy, Fairness, Safety, Ethical Compliance, Side Effects / Reward Hacking, Security, Transparency, Explainability, Accountability, Compliance, Functional Suitability, Performance Efficiency, Reliability, Maintainability, Evolution, Usability, Accessibility, Compatibility, Portability, Adaptability, Data Correctness) to the lifecycle phases in which it should be considered: Planning/Design, Data Collection/Prep, Model Development & Training, Validation/Verification, Operational Readiness, Deployment, and Monitoring/Continuous Assurance.]

The framework covers both the build phase (data preparation, model training, and validation, where models are developed and weights may change) and the use phase (deployment and ongoing monitoring, where models operate in real-world conditions and need continuous evaluation for drift, bias, and robustness). Together, these ensure AI systems remain safe, fair, and accountable across their entire lifecycle.

Planning and Design

Focus:

At the very start of a project, the focus should be on building quality in from the beginning. This phase sets the direction for how the AI system will be tested, evaluated, and assured throughout its lifecycle.

Key activities include:

AI assurance should start as early as possible in the lifecycle. For projects involving procurement or external suppliers, teams should:

Example Outputs/Metrics:

Accessibility must be considered from the outset. Some AI systems do not offer alternatives such as text-to-speech, voice input, or visual aids, which are crucial for accessibility. Without adaptive features, AI may not adjust its language level or format to suit the user’s needs.

Data Collection and Preparation

Focus:

In this phase, the goal is to gather, generate, or select the data that will be used to train and inform the AI model, and to prepare it for safe and effective use. High-quality data is critical: it must be accurate, complete, representative, and lawfully obtained.

Key activities include:

By the end of this phase, datasets should be of known quality, well-documented, and suitable for both testing the system and evaluating the model.

Example Outputs/Metrics:

Model Development and Training

Focus:

The focus is to ensure the model is learning correctly, generalising beyond training data, and embedding fairness, explainability, and robustness from the start. For example, teams may use validation datasets, cross-validation, or interpretable models to monitor for overfitting and bias. By the end of this phase, there should be a trained candidate model that meets initial performance criteria and shows no major flaws.

Example Outputs/Metrics:

Validation and Verification (Testing Phase)

Focus:

At this stage, the AI system and its integrated components are put through rigorous pre-deployment checks. This is the classic testing phase, where the system is exercised against a wide range of scenarios and quality requirements. The aim is to confirm that the system behaves as intended under realistic and adverse conditions, and that all requirements set earlier in the project have been met. By the end of Validation & Verification, the team should either have high confidence, supported by evidence, that the AI system is ready for real-world use, or have identified the issues that must be fixed before it can proceed.

Verification here means confirming that the system implementation aligns with specified requirements. Validation means confirming that the system is fit for its intended use in real-world conditions. These sit alongside, but are distinct from, the broader framework’s concepts of Testing (generating evidence) and Evaluation (interpreting that evidence in context).

Activities typically include:

Example Test Techniques:

To support thorough and consistent AI testing, teams should consider established test techniques. These help verify model behaviour and track quality over time. Example techniques include:

Teams should select techniques appropriate to their system type and risk profile. AI testing techniques will evolve over time, so it is important to keep them under regular review.

Example Outputs/Metrics:

Operational Readiness

Focus:

Before deployment, AI systems should pass an Operational Acceptance Test (OAT) to confirm they are ready for a production environment. This stage assesses whether the system has the safeguards, processes, and resilience needed for safe and reliable operation at scale.

Traditional OAT covers areas like alerting and incident response, failover and redundancy, user access and permission management, and audit logging. For AI systems, additional considerations are critical:

Example Outputs/Metrics:

Deployment (Release and Integration)

Focus:

This phase covers releasing the AI system into the live environment and integrating it into the wider business or service workflow. Even after extensive pre-release testing, new issues can appear in production, so this stage includes final smoke testing and operational checks.

AI systems should be deployed using secure, repeatable, and automated processes. Teams are encouraged to integrate testing and assurance into their DevOps pipelines and use Continuous Integration/Continuous Deployment (CI/CD) workflows.
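For example, a post-deployment smoke test can run automatically as the final step of a CI/CD pipeline. The sketch below is a minimal, illustrative version in Python; the service URL, endpoints and payload are hypothetical and stand in for your own deployment checks.

```python
import sys

import requests

BASE_URL = "https://api.example.gov.uk/ai-service"  # hypothetical live endpoint


def smoke_test() -> bool:
    """Minimal post-deployment checks: the service is up and returns sane output."""
    health = requests.get(f"{BASE_URL}/health", timeout=5)
    if health.status_code != 200:
        return False

    prediction = requests.post(
        f"{BASE_URL}/predict",
        json={"income": 18_000, "savings": 2_500},  # known, representative input
        timeout=5,
    )
    return prediction.status_code == 200 and "decision" in prediction.json()


if __name__ == "__main__":
    # A non-zero exit code fails the pipeline step and blocks promotion.
    sys.exit(0 if smoke_test() else 1)
```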

Key activities include:

Example Outputs/Metrics:

Monitoring and Continuous Assurance

Focus:

Deployment is not the end of the lifecycle. Continuous monitoring and improvement are essential to keep AI systems reliable, efficient, and fair over time. Once in operation, the AI system must be monitored to ensure it performs as intended and to detect issues early.

Teams should establish ongoing monitoring of key metrics, including:

Monitoring should include alerts for anomalous behaviour, such as predictions deviating significantly from expectations or input data distributions shifting away from training data. Scheduled re-testing, periodic audits, and compliance reviews (such as ‘AI health checks’) provide additional safeguards.

Where business processes change or new data sources emerge, AI models may need to be retrained or updated. Such changes must go through a controlled process with re-testing and governance sign-off to prevent risks from being introduced unnoticed.

Example Outputs/Metrics:

Modular AI Testing Approach

While the defensive assurance approach tells us when to perform testing activities, this section describes what tests to perform in detail. We present a modular testing approach for AI, consisting of distinct testing modules, each addressing a specific facet of AI quality. These modules can be thought of as building blocks, combined according to the AI system and its risk level. You might emphasise some modules more than others, but together they form a comprehensive testing regimen. The modular design allows flexibility: for example, a simple rule-based system might not need elaborate adversarial testing, whereas a machine learning model would. Each module also notes how approaches may differ for various AI types (rule-based vs ML vs generative vs agentic AI), ensuring the unique challenges of each are covered.

[Figure: Modules of the AI testing framework]

The table below shows where each quality attribute is covered across the modular AI testing approach. It helps align your assurance effort across all modules.

[Table: maps each quality attribute to the modules in which it is covered: Data & Input Validation, Functionality Testing, Bias & Fairness Testing, Explainability & Transparency, Robustness & Adversarial Testing, Performance & Efficiency Testing, System & Integration Testing, UAT & Ethical Review, and Continuous Monitoring & Improvement.]

There are 9 modules in this framework:

Data & Input Validation Module

Objective

AI systems are only as good as the data that feeds them. If data is missing, skewed, or unsafe, errors and bias will follow. This module is about making sure training, validation, and live inputs are valid, representative, and compliant. It is the first safeguard in building confidence in an AI system.

Testing and Evaluation

Testing generates the evidence. Evaluation interprets that evidence. Together, they form the assurance that data is safe to use.

Core practices

Approaches by AI type

Example Tests

Metrics - Example

Evidence and Artefacts

Common Pitfalls

By the end of this module, the team should have high confidence in the data pipeline: that the training data was sound and representative, and that live inputs are properly checked. The output of Data & Input Validation is often an ‘OK to proceed’ signal for training (i.e., data is fit for model consumption) and a set of runtime validators that protect the model in production.
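To make this concrete, the sketch below shows the kind of automated checks this module describes, applied to a training dataset with pandas. The file name, column names and thresholds are illustrative assumptions, not prescribed values.

```python
import pandas as pd

# Hypothetical training dataset expectations -- replace with your own.
EXPECTED_COLUMNS = {"age": "int64", "income": "float64", "region": "object"}
MAX_MISSING_RATE = 0.05   # no more than 5% missing values per column (illustrative)
MIN_GROUP_SHARE = 0.02    # every region should make up at least 2% of rows (illustrative)


def validate_training_data(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality issues found in the dataframe."""
    issues = []

    # Schema check: required columns present with the expected dtypes.
    for column, dtype in EXPECTED_COLUMNS.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            issues.append(f"unexpected dtype for {column}: {df[column].dtype}")

    # Completeness check: missing-value rate per column.
    for column, rate in df.isna().mean().items():
        if rate > MAX_MISSING_RATE:
            issues.append(f"{column} has {rate:.1%} missing values")

    # Range check: simple plausibility bounds on numeric fields.
    if "age" in df.columns and not df["age"].between(0, 120).all():
        issues.append("age values outside plausible range 0-120")

    # Representativeness check: no group is vanishingly small.
    if "region" in df.columns:
        for region, share in df["region"].value_counts(normalize=True).items():
            if share < MIN_GROUP_SHARE:
                issues.append(f"region {region} is under-represented ({share:.1%})")

    return issues


if __name__ == "__main__":
    df = pd.read_csv("training_data.csv")  # hypothetical file
    for issue in validate_training_data(df):
        print("DATA ISSUE:", issue)
```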

Model Functionality Testing Module

Objective

This module verifies whether the AI model and system work as intended: in simple terms, whether they behave correctly when given the inputs they were designed for. For rule-based systems, this means checking each rule produces the right outcome. For predictive or generative models, it means confirming outputs are valid, accurate, and within expectations. Strong functionality testing ensures that, before the system is integrated or deployed, the model’s logic is sound.

Testing and Evaluation

Core practices

Approaches by AI type

Rule-based systems: Test each rule in isolation with unit-style tests. Include combinations of conditions, boundary checks (e.g., exactly £20k income), and invalid inputs. Confirm that fallback rules trigger appropriately.

Example: If eligibility is defined as Income < £20k AND Savings < £5k, test cases should include: meets both criteria, fails one, fails both, and boundary cases like exactly £20k.
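A minimal sketch of those boundary cases as pytest parameterised tests, assuming a hypothetical is_eligible(income, savings) function that implements the rule above.

```python
import pytest


def is_eligible(income: float, savings: float) -> bool:
    """Hypothetical eligibility rule: Income < £20k AND Savings < £5k."""
    return income < 20_000 and savings < 5_000


@pytest.mark.parametrize(
    "income, savings, expected",
    [
        (15_000, 3_000, True),    # meets both criteria
        (25_000, 3_000, False),   # fails income criterion
        (15_000, 6_000, False),   # fails savings criterion
        (25_000, 6_000, False),   # fails both
        (20_000, 3_000, False),   # boundary: exactly £20k income is not eligible
        (15_000, 5_000, False),   # boundary: exactly £5k savings is not eligible
    ],
)
def test_eligibility_rule(income, savings, expected):
    assert is_eligible(income, savings) is expected
```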

Machine learning models: Validate using statistical metrics on hold-out test sets (accuracy, precision/recall, F1, RMSE, etc.). Use confusion matrices to highlight misclassifications. Apply metamorphic testing (e.g., adding small noise to inputs should not change predictions).

Example: If a model classifies medical images, flipping brightness by 5% should not radically change the prediction.
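A sketch of the hold-out evaluation and metamorphic check described above, using scikit-learn. The data, model and tolerance are placeholders; in practice you would use your curated datasets and candidate model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Placeholder data -- in practice, use your curated, documented datasets.
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Hold-out evaluation: accuracy, plus per-class precision, recall and F1.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

# Metamorphic check: small input perturbations should rarely flip predictions.
noise = rng.normal(scale=0.01, size=X_test.shape)
perturbed_predictions = model.predict(X_test + noise)
flip_rate = np.mean(predictions != perturbed_predictions)
# Compare against an agreed tolerance (e.g. under 5% flips) and fail the test if breached.
print(f"prediction flip rate under small noise: {flip_rate:.1%}")
```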

Generative AI: Since outputs are open-ended, test with both quantitative metrics (e.g., BLEU/ROUGE for translation) and human evaluation (e.g., correctness, tone, relevance). Include adversarial prompts to check resilience.

Example: Prompt ‘Summarise this article in one sentence’ should return a factual, concise summary, not an irrelevant or unsafe response.
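Automated scores such as BLEU only complement human evaluation, but they are cheap to track over time. A minimal sketch using NLTK, with placeholder reference and generated summaries:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Placeholder reference and model output -- replace with real evaluation pairs.
reference_summary = "The council approved the new housing plan on Tuesday".split()
generated_summary = "The new housing plan was approved by the council on Tuesday".split()

# BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
score = sentence_bleu(
    [reference_summary],
    generated_summary,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU score: {score:.2f}")  # track alongside human judgements of correctness and tone
```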

Agentic AI: Verify that the agent achieves its defined goals under varied scenarios. Test reward signals carefully to avoid reward hacking. Simulate environment changes and check whether the agent still behaves consistently and safely.

Example: A cleaning robot should actually clean when rewarded for ‘clean floor’, not simply stay near the charging dock to exploit the scoring.

Example Tests

Metrics - Example

Evidence and Artefacts

Common Pitfalls

Overall, the Model Functionality module provides evidence that the AI’s ‘brain’ works correctly in isolation. For ML models and agents, this is often a statistical confidence (‘with X% confidence, performance >= threshold’), while for rule-based systems it is near-certainty for each rule. This module ensures we are not integrating or deploying a fundamentally flawed model or system functionality.

Bias and Fairness Testing Module

Objective

This module ensures the AI system treats individuals and groups fairly. The goal is to detect and address any disparities in how different demographic or protected groups are treated by the model or system. For public sector use, this also supports legal compliance with anti-discrimination laws like the Equality Act 2010.

Testing and Evaluation

Core practices

Approaches by AI type

Example Tests

Metrics - Example

Evidence and Artefacts

Common Pitfalls

After completing the Bias and Fairness Testing module, the team should have a clear understanding of any bias issues and have taken steps to correct them. Importantly, this module’s results should be reviewed by both technical teams and policy/legal teams. If any disparity remains, it should be consciously signed off, with a documented rationale.
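As an illustration of the kind of group comparison this module relies on, the sketch below computes selection rates and a disparate impact ratio from model outputs. The column names and the four-fifths threshold are illustrative assumptions, not a legal test; acceptable disparity must be agreed with policy and legal colleagues.

```python
import pandas as pd

# Hypothetical evaluation results: one row per case, with the protected
# characteristic and the model's decision (1 = positive outcome).
results = pd.DataFrame(
    {
        "sex": ["F", "F", "M", "M", "F", "M", "F", "M"],
        "model_decision": [1, 0, 1, 1, 1, 1, 0, 1],
    }
)

# Selection rate per group: proportion receiving the positive outcome.
selection_rates = results.groupby("sex")["model_decision"].mean()
print(selection_rates)

# Disparate impact ratio: lowest group rate divided by highest group rate.
ratio = selection_rates.min() / selection_rates.max()
print(f"disparate impact ratio: {ratio:.2f}")

# Illustrative threshold only -- not a legal standard; agree acceptable
# disparity with policy and legal colleagues before sign-off.
if ratio < 0.8:
    print("WARNING: selection rates differ substantially between groups")
```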

Explainability & Transparency Module

Objective

This module ensures the AI system’s decisions can be understood, traced, and explained, both by those overseeing the system and by the people affected by it. The aim is to test whether the model’s decisions are inspectable and whether the explanations are accurate, complete, and useful. For the public sector, this directly supports accountability, audit, and compliance requirements.

Testing and Evaluation

Core practices

Approaches by AI type

Example Tests

Metrics - Example

Evidence and Artefacts

Common Pitfalls

Once this module is complete, teams should be confident that the system can answer, with evidence, why it behaved the way it did.
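One way of producing that evidence for a tabular model is permutation importance, sketched below with scikit-learn; teams may equally use tools such as SHAP or LIME, as noted in the quality attributes table. The data, model and feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data and model -- replace with your own.
rng = np.random.default_rng(0)
feature_names = ["income", "savings", "age", "household_size"]
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does shuffling each feature hurt performance?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for name, importance in sorted(
    zip(feature_names, result.importances_mean), key=lambda item: -item[1]
):
    print(f"{name}: {importance:.3f}")
# The ranking feeds the explanation evidence pack; domain experts should review
# it to confirm that the apparent drivers of decisions make sense.
```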

Robustness & Adversarial Testing Module

Objective

This module tests how the AI system behaves under stress, uncertainty, or attack. The goal is twofold:

This is critical for any system exposed to the real world, especially where safety, integrity, or trust is on the line.

Testing and Evaluation

Core practices

Approaches by AI type

Example Tests

Metrics - Example

Evidence and Artefacts

Common Pitfalls

When this module is done properly, you should have high confidence that the system won’t fall over in bad conditions or, worse, be manipulated into doing the wrong thing. It is a core part of making AI secure, safe, and production-ready.
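A minimal sketch of one robustness check: measuring how a hypothetical classifier’s accuracy degrades as increasing noise is added to its inputs. Full adversarial testing goes further (for example crafted perturbations, or prompt injection for generative systems), but a degradation curve like this is a useful baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data and model -- replace with your own.
rng = np.random.default_rng(7)
X = rng.normal(size=(2_000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
model = LogisticRegression().fit(X_train, y_train)

baseline = accuracy_score(y_test, model.predict(X_test))
print(f"baseline accuracy: {baseline:.3f}")

# Degradation curve: accuracy under increasing input noise.
for noise_scale in (0.05, 0.1, 0.2, 0.5):
    noisy_inputs = X_test + rng.normal(scale=noise_scale, size=X_test.shape)
    accuracy = accuracy_score(y_test, model.predict(noisy_inputs))
    print(f"noise {noise_scale}: accuracy {accuracy:.3f}")
    # Agree acceptable degradation thresholds up front and fail the test if breached.
```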

Performance & Efficiency Testing Module

Objective

This module checks whether the AI system runs fast enough, scales effectively, and uses resources efficiently. That includes both latency (how quickly it responds) and throughput (how many requests it can handle). Efficiency also matters, especially when cost, hardware limits, or energy use are concerns.

Testing and Evaluation

Core practices

Approaches by AI type

Example Tests

Metrics - Example

Evidence and Artefacts

Common Pitfalls

When this module is done properly, it gives assurance that the AI system won’t fall apart under pressure, whether that’s high traffic, limited hardware, or fast-paced environments. It ensures the system scales with demand, not against it.
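A lightweight latency measurement sketch against a hypothetical prediction endpoint. Dedicated load-testing tools are normally used for sustained throughput and scalability runs, but percentile latencies can be tracked even in a simple harness like this; the URL, payload and latency budget are assumptions.

```python
import statistics
import time

import requests

PREDICT_URL = "https://api.example.gov.uk/ai-service/predict"  # hypothetical endpoint
SAMPLE_PAYLOAD = {"income": 18_000, "savings": 2_500}
LATENCY_BUDGET_MS = 500  # agreed non-functional requirement (illustrative)

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    response = requests.post(PREDICT_URL, json=SAMPLE_PAYLOAD, timeout=5)
    response.raise_for_status()
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p50 = statistics.median(latencies_ms)
p95 = latencies_ms[int(0.95 * len(latencies_ms)) - 1]  # approximate 95th percentile
print(f"p50 latency: {p50:.0f} ms, p95 latency: {p95:.0f} ms")
assert p95 <= LATENCY_BUDGET_MS, "p95 latency exceeds the agreed budget"
```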

Integration & System Testing Module

Objective

This module verifies that the AI component works correctly as part of a wider system, not just on its own. It checks that the full end-to-end workflow, from user input to AI output to final outcome, behaves as expected, and that all interfaces and dependencies work correctly. It ensures the AI’s integration with services like APIs, databases, user interfaces, business logic, and identity/auth flows is seamless and safe, especially under real-world conditions.

Testing and Evaluation

Core practices

Approaches by AI type

Example Tests with Steps

Metrics - Example

Evidence and Artefacts

Common Pitfalls

When this module is done properly, it gives you confidence that the AI can be deployed without disrupting the end-to-end user or system journey.
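An illustrative end-to-end check, assuming a hypothetical REST API in front of the model. It verifies the response contract that downstream components depend on, and that malformed input is rejected cleanly rather than silently processed.

```python
import requests

BASE_URL = "https://api.example.gov.uk/ai-service"  # hypothetical service


def test_valid_request_returns_expected_contract():
    response = requests.post(
        f"{BASE_URL}/predict", json={"income": 18_000, "savings": 2_500}, timeout=10
    )
    assert response.status_code == 200
    body = response.json()
    # Contract checks: downstream systems depend on these fields being present.
    assert "decision" in body
    assert "confidence" in body
    assert 0.0 <= body["confidence"] <= 1.0


def test_invalid_input_is_rejected_not_silently_processed():
    response = requests.post(
        f"{BASE_URL}/predict", json={"income": "not-a-number"}, timeout=10
    )
    # The service should reject malformed input with a clear client error,
    # not return a prediction or an unhandled server error.
    assert response.status_code == 400
```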

User Acceptance & Ethical Review Module

Objective

This module is the final gate before deployment. It has two purposes:

At this stage, the focus shifts from technical readiness to real-world fitness. Do people find the AI system usable, understandable, and trustworthy? Can oversight bodies sign off on its responsible deployment?

Testing and Evaluation

Core practices

Approaches by AI type

Example Activities

Metrics - Example

Evidence and Artefacts

Common Pitfalls

When this module is complete, you should have high confidence that real users can work with the AI system safely and effectively.

Continuous Monitoring & Improvement Module

Objective

Deploying AI isn’t the end of assurance; it’s the beginning of live operations. This module ensures that after deployment, the system is actively monitored, maintained, and improved. AI systems change over time: data drifts, user behaviour shifts, performance degrades, and new risks emerge.

Core practices

Metrics - Example

Evidence and Artefacts

Common Pitfalls

When this module is embedded into your operating model, AI becomes a live service, not a frozen product. It’s how you keep your system effective, lawful, and trustworthy, even as everything around it changes.
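As one concrete example of drift monitoring, the sketch below compares live input values for a single feature against a training-time baseline using a two-sample Kolmogorov-Smirnov test from SciPy. The feature, distributions and alert threshold are illustrative; thresholds should be agreed per system and kept under review.

```python
import numpy as np
from scipy.stats import ks_2samp

# Baseline (training-time) and live feature values -- placeholders for real data.
rng = np.random.default_rng(1)
training_income = rng.normal(loc=21_000, scale=4_000, size=5_000)
live_income = rng.normal(loc=24_000, scale=4_500, size=1_000)  # a shifted distribution

# Two-sample KS test: has the live distribution drifted away from the training data?
result = ks_2samp(training_income, live_income)
print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.4f}")

# Illustrative alerting rule -- thresholds should be set per system, reviewed
# regularly, and alerts routed to the monitoring/assurance team.
if result.pvalue < 0.01:
    print("ALERT: input drift detected for feature 'income'; trigger review and re-testing")
```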

While automation and AI reduce manual effort, they may inadvertently contribute to the degradation of critical human skills over time. This risk can compromise Human in the Loop (HITL) systems, where effective oversight relies on staff retaining domain knowledge and decision-making ability. Organisations should assess the risk of skill loss, particularly in areas where staff serve as safety backstops (e.g. validating AI outputs) or where manual recovery may be needed (e.g. during outages or model failures).

Tools and Resources for Testing

This framework does not mandate specific tools. Departments should select tools that align with their testing needs, guided by appropriate governance, risk assessments, and testing goals.

We encourage teams to refer to the UK Government AI Playbook, which provides practical guidance on responsible AI development and use. For evaluating LLMs, teams should also consider Inspect, an open-source model evaluation framework released by the UK AI Safety Institute.

Conclusion

The responsible deployment of Artificial Intelligence in public services requires more than innovation. It demands trust, transparency, and accountability. This framework provides a structured approach to testing and assuring the quality of AI systems, supporting departments in meeting their obligations to the public while enabling the safe use of advanced technologies. By aligning testing and assurance activities with defined quality principles, lifecycle strategies, modular testing methods, and proportionate risk management, government teams can evaluate AI systems consistently and rigorously. This framework recognises the evolving nature of AI, especially with the emergence of complex agentic and generative models, and promotes continuous adaptation, monitoring, and governance to keep testing practices relevant and robust.

Review Log

Action | Name | Date
Author | Mibin Boban, xGov Testing Community Chair / Head of Quality Engineering - GDS | 5/6/2025
Working Group Review | 1. Dinesh KTJ, Principal Test Engineer - Home Office | 16/6/2025
 | 2. David Lee, Lead Technical Architect - GDS | 17/6/2025
 | 3. Vas Ntokas, Lead Test Engineer - DWP | 18/6/2025
 | 4. David Rutter-Close, Lead Test Engineer - DfE | 19/6/2025
 | 5. Adam Byfield, Principal Technical Assurance Specialist - NHS England | 19/6/2025
 | 6. Matthew Dyson, Principal Test Assurance Manager - National Highways (DfT) | 30/6/2025
Review | 1. Nayyab Naqvi, Principal Technologist for AI in the Public Sector - DSIT | 27/8/2025
 | 2. Bharathi Sudanagunta, Test Manager - MHRA | 27/8/2025
 | 3. Carla Chiodi, Senior AI Engineer - 2i Testing | 31/7/2025

An initiative by the Cross Government Testing Community (UK)