Evaluation-driven approach to building safe, fair and reliable AI
AI is playing a growing role across public services, from decision support to automation to frontline service delivery. As these systems become more capable and more embedded in critical processes, it is essential that we understand how well they work, how fair they are, and what happens when they go wrong.
This framework sets out a practical, shared approach for the testing, evaluation and assurance of AI systems across the public sector. It focuses on helping teams test systems, evaluate AI models, and assure quality, trustworthiness and risk, whether the AI in question is a traditional rule-based system, a machine learning model, or a newer generative or agentic system.
In this framework, testing refers to checking the whole AI-enabled system (infrastructure, APIs, integrations, user interfaces, data flows). Evaluation focuses on the AI model or layer, assessing how well it performs against agreed quality attributes. Testing generates the evidence, evaluation interprets it in context, and assurance gives confidence that systems are safe, fair and accountable.
Testing AI is not the same as testing traditional software. AI behaves probabilistically, learns from data that may shift over time, and may act autonomously or opaquely. That means we need new methods, new language, and new standards. This framework provides the foundations to support that shift.
Importantly, this is not a checklist or a rigid standard. It is a living tool to help organisations ask better questions, design better tests, shape assurance strategies, and share what works. Departments can tailor the framework to fit their context, and contribute their own learning back into the shared understanding of how to evaluate AI in the public sector responsibly.
The purpose of this framework is to provide a shared reference point for government teams who are responsible for testing, evaluation and the wider assurance of AI systems. Whether you are building AI internally, procuring it externally, or overseeing its implementation, this document is designed to help you ask the right questions about how AI systems are tested before development, during development, and after deployment. It helps establish a minimum baseline of testing and evaluation for AI systems that impact the public, while allowing departments to adapt it to their own governance models, service needs, and levels of technical maturity.
This framework applies across the full AI system lifecycle: from initial planning and design through development, testing, deployment, and ongoing monitoring. It is intended for use across UK Government departments, agencies, and other public sector bodies developing or procuring AI systems. The scope covers:
System Testing - infrastructure, data flows, integrations, UI, components.
AI Model / Layer Evaluation - quality attributes like accuracy, fairness, transparency etc.
Governance Activities - risk assessments, documentation, approvals, monitoring.
The framework can be adapted to AI initiatives of varying size and risk. It covers pre-deployment testing (e.g. validating models in controlled environments) as well as post-deployment assurance (e.g. monitoring live systems for drift or issues).
The framework is cross-disciplinary, encompassing activities for data scientists, developers, test engineers, policy and ethics reviewers, information testing teams, and senior decision-makers. It does not replace specific legal or regulatory requirements, but rather consolidates and references them so that teams can ensure compliance through testing and evaluation.
This framework is intended as a common foundation, not a fixed prescription. It sets out principles, testing strategies, and testing modules that departments can adopt, adapt, or extend based on their specific operational contexts, risk profiles, and technical architectures. Flexibility is essential: testing activities should be proportionate to the impact of the AI system, and tailored to its type, whether rule-based, machine learning, generative, or agentic. Departments are encouraged to use this framework as a baseline for their own testing strategies, aligning with key principles while addressing the unique demands of their services.
The intended audience for this document includes all teams and stakeholders involved in the development, deployment, and oversight of AI systems in Government. In particular:
Product Managers - to plan AI projects with the right testing phases, manage risk, and include governance checkpoints.
Data Scientists and Machine Learning Engineers - to embed these testing practices into model development, so models meet quality criteria and remain auditable.
Test Engineers - to design and run test cases specific to AI components, including both functional and non-functional testing beyond traditional software approaches.
Policy, Ethics, and Legal Advisors - to ensure ethical and legal requirements are addressed and evidenced through testing and evaluation.
Senior Responsible Owners and Governance Boards - to oversee the risk management, make go/no-go decisions based on evidence, and ensure accountability for AI outcomes.
Operational Teams and Security - to monitor AI systems in production, detect incidents, and enforce safeguards.
External Assessors or Auditors (where applicable) - to provide independent assurance by reviewing evidence from this framework.
This framework is designed to cover a broad range of AI system types, noting that different testing approaches may be needed for each. It explicitly addresses the following categories of AI:
Rule-Based or Deterministic AI: Systems that follow predefined logic or rulesets (e.g. expert systems or automated business rules). These are largely predictable, but testing focuses on verifying all rules and conditions are correct and complete.
Machine Learning (ML): Predictive models trained on data (including supervised, unsupervised, and simple reinforcement learning where applicable). Examples include classification or regression models, neural networks for prediction, etc. Testing focuses on data quality, accuracy, generalization, and avoidance of bias in these models.
Generative AI: Advanced models that produce content (such as text, images, or audio) - for example, Large Language Models (LLMs) and Generative Adversarial Networks (GANs). These models have more open-ended outputs. Testing emphasizes output appropriateness, factual accuracy, avoidance of harmful content, and controlling unpredictable behavior (including prompt testing for LLMs).
Agentic AI (Autonomous Agents): Artificial intelligence (AI) agents are small, specialised pieces of software that can make decisions and operate cooperatively or independently to achieve system objectives. Agentic AI refers to AI systems composed of AI agents that can behave and interact autonomously in order to achieve their objectives. They are probabilistic and highly adaptive, which poses unique testing challenges. Testing for agentic AI focuses on safety of autonomous behaviors, the agent’s ability to handle novel situations, avoidance of ‘reward hacking’ (gaming the specified goals), and ensuring that mechanisms for human override or intervention function properly.
Because agentic AI can evolve through interactions, continuous monitoring and periodic re-validation are crucial for this type.
Where specific testing needs differ by AI type, these are called out in the relevant sections of this framework.
Adapted from the Cross-Government Testing Community AI Quality Workshop
To assure that AI systems are safe, fair, and effective, we use a set of quality attributes. These guide both system-level testing (does the AI system as a whole work as intended in real conditions?) and model-level evaluation (does the AI model behave fairly, accurately, and robustly?). Together, they provide assurance over time.
We group attributes into four themes:
Quality Attribute | What it means | Focus |
---|---|---|
Autonomy | Degree to which the AI can operate independently of human intervention. | Challenge system outside its envelope, test for human handover triggers, verify deferral to humans when thresholds are exceeded. |
Fairness | Mitigation of unwanted bias. | Conduct bias audits, test for data skew, validate outcomes against equality law. |
Safety | Ability of the system to avoid causing harm to life, property, or the environment. | Worst-case scenario testing, validate safety mechanisms. If standards exist (e.g. ISO/IEC TR 5469:2024), ensure tests cover those criteria. Safety testing often overlaps with robustness and ethical testing but focuses on preventing harm. |
Ethical Compliance | Alignment with ethical guidelines, values, and laws. | Checklist-based testing, ethics panel review, pass/fail criteria. |
Side Effects & Reward Hacking | The presence of unintended behaviors arising from the AI’s optimization process. | Identify and trigger any potential side effects of the AI’s objectives, simulation testing and anomaly detection on the agent’s behaviors. |
Security | Protection against unauthorized access, misuse, or adversarial attacks. | Security testing, including penetration tests on AI APIs, checks for data leakage, ensure proper authentication/authorization around the AI service. |
Quality Attribute | What it means | Focus |
---|---|---|
Transparency | Visibility of AI’s data, logic, and workings. | Verify traceability records, audit trails, documentation like model cards or algorithmic transparency records. |
Explainability | Ability to explain AI outputs in human terms. | Use explainability tools (e.g. SHAP, LIME), domain expert review, user comprehension checks. |
Accountability | Clear responsibility for AI outcomes and decisions. | Trace decision responsibility, review governance mechanisms. |
Compliance | Adherence to organisational policies, standards, and relevant legal and regulatory requirements. | Assure compliance with legal frameworks like Data Protection Act, GDPR or AI assurance policies etc. |
Quality Attribute | What it means | Focus |
---|---|---|
Functional Suitability | Degree to which the AI fulfills its specified tasks. | Validate against intended outputs for each requirement. |
Performance Efficiency | Speed, responsiveness, and resource use. | Measure latency, throughput, and system scalability. |
Reliability | Stability of performance under varying conditions. | Stress tests, fault injection, endurance testing. |
Maintainability | Ease of updating, fixing, or improving the system. | Verify retraining process, modularity, and version control. |
Evolution | The capability of the AI to learn and improve from experience over time. | Test for concept or policy drift handling, validate if learning improves performance without breaking existing functionality. |
Quality Attribute | What it means | Focus |
---|---|---|
Usability | Ease of use for human operators or end users. | UX testing, error handling, usability review. |
Accessibility | Designed for a wide range of abilities. | Test and audit the AI and all its user interfaces against the Web Content Accessibility Guidelines (WCAG) version 2.2 AA, to help it meet the Public Sector Bodies Accessibility Regulations. |
Compatibility | Ability to work across environments and integrations. | Test across OS, APIs, browsers, devices, databases, cloud/on-prem. |
Portability | Ability to move across platforms or contexts. | Domain drift checks, portability test to new environments. |
Adaptability | Ability to adjust to changes in data or environment. | Test for handling new data patterns, concept drift. |
Data Correctness | Quality and diversity of data used to train the model. | Validate data quality, completeness, and representativeness. |
Testing AI is not the same as testing traditional software. Unlike rule-based systems, AI adapts, learns from data, and generates outputs that may be uncertain or change over time. This means we must test the system around the AI, evaluate the models inside it, and build assurance that both remain safe, fair and accountable.
The following principles set out how to do this in practice:
Design Context-Appropriate Testing
One size does not fit all. Always test AI systems in conditions that reflect their real-world use context. This means moving beyond idealised training scenarios. A rule-based expert system, a predictive ML model, and a generative AI chatbot each require a different testing focus and test design tailored to their use case and operating environment.
Test the Quality and Diversity of Your Data
AI systems are only as good as the data they learn from. High-quality testing must start with scrutinising the datasets that shape the model’s behaviour. Gaps or biases in the data can lead to unfair or unreliable results. Good AI testing isn’t just about what the system does, it’s about what it was taught.
Test Autonomy, Don’t Assume It
If the AI operates with any autonomy, evaluate how it behaves during prolonged unattended operation and in edge cases. This prevents silent failures or harmful behaviour.
Test for Drift and Change
AI models evolve. They may be retrained, updated, or influenced by new data. Establish procedures to retest models whenever they change (new data, new model release, pipeline modifications). Without version control, it becomes impossible to reproduce issues, audit changes or track quality over time. Continuously incorporate feedback from real-world use to improve both the AI and the assurance measures (learning from incidents, updating tests, etc.).
Adopt a Risk-Based Approach
The rigour of testing should be proportional to potential impacts on people, services and organisation. Not all AI deployments carry the same risk. A typo-correcting AI assistant is not as critical as an AI diagnosing medical conditions. Perform an initial risk classification (considering factors like impact on legal rights, safety, scale of use, novelty of the tech) and let that guide the depth of testing. High-risk AI (e.g. those that could endanger lives or cause legal determinations about individuals) demand exhaustive testing, possibly including formal verification or external audits before deployment. Lower-risk tools can use lighter-weight checks, though still covering all relevant quality dimensions. Under this framework, no AI system is deployed without adequate testing, but the notion of ‘proportionality’ ensures resources are focused where it matters most. Risk Based Assurance (RBA) practices are recommended.
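As an illustration of how a risk-based approach might be operationalised, the sketch below shows a minimal risk classification helper. The factor names, weights and tier thresholds are purely hypothetical assumptions; each department would define its own under its governance process.

```python
# Illustrative only: a minimal risk classification helper. Factor names, weights
# and thresholds are hypothetical and should be agreed through local governance.
RISK_FACTORS = {
    "affects_legal_rights": 3,
    "safety_critical": 3,
    "large_scale_use": 2,
    "novel_technology": 1,
    "fully_automated_decision": 2,
}

def classify_risk(answers: dict) -> str:
    """Return a coarse risk tier from yes/no answers about the AI system."""
    score = sum(weight for factor, weight in RISK_FACTORS.items() if answers.get(factor))
    if score >= 6:
        return "high"    # exhaustive testing, possible independent audit before deployment
    if score >= 3:
        return "medium"  # full test suite across relevant quality attributes
    return "low"         # lighter-weight checks, still covering all dimensions

print(classify_risk({"affects_legal_rights": True, "large_scale_use": True}))  # -> "medium"
```

The output tier would then drive the depth of testing selected from the modules later in this framework.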
Test What You Can Explain or Interpret
An AI decision that can’t be explained or interpreted can’t be trusted or fixed. Testing should include not only whether the output is correct, but whether it makes sense. AI may assume users can read and write fluently, which excludes those with dyslexia or learning disabilities. In some cases (e.g. in clinical settings), we may not be able to explain how decisions are made, but we should be able to interpret what the model might be doing.
Treat Ethics as Testable Risk
Ethical considerations (e.g. avoiding harm, respecting rights, non-discrimination) should be managed like any other risk. Define ethical risk scenarios (such as the AI producing harmful or offensive output, or unfairly denying a service) and include them in test plans.
Look for What Wasn’t Intended
AI systems can fail in ways designers didn’t anticipate. Exploratory and adversarial testing can reveal these ‘unknown unknowns’, for example, a vision model picking up a spurious pattern (a shortcut) or a chatbot being tricked into revealing confidential information.
Test Safe and Predictable Failure
When AI does fail, it must fail safely.
Benchmark Performance Holistically
Test not only accuracy, but also the system’s efficiency, scalability, and resilience under load. Holistic performance testing ensures the AI can meet service level requirements in a production environment, not just produce correct output in ideal lab conditions.
Build and Observe Quality from the Start
Quality shouldn’t be an afterthought, it must be built in from the beginning. Apply a shift-left approach by embedding testing into early stages and continuously throughout development. If something goes wrong, you should know about it early, clearly, and with enough data to respond quickly.
These principles encourage testers and project teams to look at AI quality from multiple angles (technical, ethical, and operational), and to integrate testing as a continuous effort.
Assuring an AI system’s quality is not a one-time event, it must be woven through the entire AI development lifecycle. In this framework, we adopt a continuous defensive assurance model, identifying key testing activities and deliverables at each phase of an AI project. The sections below provide an overview of each phase, its testing and assurance focus, and examples of metrics or outcomes to measure:
The table below shows where in the lifecycle each quality attribute should be considered. It helps you identify when to test, and validate specific characteristics of your AI system.
Quality Attribute | Planning/Design | Data Collection/Prep | Model Dev & Training | Validation/Verification | Operational Readiness | Deployment | Monitoring/Continuous Assurance |
---|---|---|---|---|---|---|---|
Autonomy | ✅ | ✅ | ✅ | ||||
Fairness | ✅ | ✅ | ✅ | ||||
Safety | ✅ | ✅ | ✅ | ||||
Ethical Compliance | ✅ | ✅ | ✅ | ||||
Side Effects / Hacking | ✅ | ✅ | ✅ | ||||
Security | ✅ | ✅ | ✅ | ✅ | |||
Transparency | ✅ | ✅ | ✅ | ✅ | ✅ | ||
Explainability | ✅ | ✅ | |||||
Accountability | ✅ | ✅ | ✅ | ||||
Compliance | ✅ | ✅ | ✅ | ✅ | |||
Functional Suitability | ✅ | ||||||
Performance Efficiency | ✅ | ✅ | ✅ | ||||
Reliability | ✅ | ✅ | |||||
Maintainability | ✅ | ✅ | |||||
Evolution | ✅ | ✅ | |||||
Usability | ✅ | ✅ | |||||
Accessibility | ✅ | ✅ | |||||
Compatibility | ✅ | ||||||
Portability | ✅ | ||||||
Adaptability | ✅ | ✅ | |||||
Data Correctness | ✅ | ✅ | ✅ |
The framework covers both the build phase (data preparation, model training, and validation, where models are developed and weights may change) and the use phase (deployment and ongoing monitoring, where models operate in real-world conditions and need continuous evaluation for drift, bias, and robustness). Together, these ensure AI systems remain safe, fair, and accountable across their entire lifecycle.
Focus:
At the very start of a project, the focus should be on building quality in from the beginning. This phase sets the direction for how the AI system will be tested, evaluated, and assured throughout its lifecycle.
Key activities include:
AI assurance should start as early as possible in the lifecycle. For projects involving procurement or external suppliers, teams should:
Example Outputs/Metrics:
Risk Identification: - Number of high-risk items identified in the initial risk assessment. E.g. if 5 major ethical risks are logged, they will need mitigation plans. A completed risk register or heatmap is an output.
Impact Assessment Score: - If using an impact assessment framework (e.g. scoring for societal impact), record the outcome. E.g. a score indicating medium impact with mitigation, or a summary of the impact assessment’s findings such as ‘minimal privacy impact’ or ‘significant equality implications’.
Defined Performance Targets: - Document the target values for key performance metrics of the model (e.g. required precision/recall, maximum acceptable error rate, response time). This serves as the benchmark criteria for later testing.
Compliance Checklist Initiation: - Note any legal/ethical requirements the project must comply with and include these in the project plan.
Accessibility is important. Some AI systems don’t offer alternatives like text-to-speech, voice input, or visual aids, which are crucial for accessibility. Without adaptive features, AI may not adjust its language level or format to suit the user’s needs.
Focus:
In this phase, the goal is to gather, generate, or select the data that will be used to train and inform the AI model, and to prepare it for safe and effective use. High-quality data is critical: it must be accurate, complete, representative, and lawfully obtained.
Key activities include:
By the end of this phase, datasets should be of known quality, well-documented, and suitable for both testing the system and evaluating the model.
Example Outputs/Metrics:
Data Quality Metrics: - Quantitative measures such as the percentage of missing or incomplete values in the dataset (e.g. ‘2% of records have missing age field’), error rates in labels perhaps found via manual spot checks or data cleaning scripts, or consistency scores if applicable, like schema validation pass rate. For example: 95% of records passed all validation checks, 5% had minor formatting issues corrected.
Bias Indicators: - Any computed metrics that reveal dataset bias, such as class imbalance ratios (e.g. one class constitutes 80% of data), or demographic parity in the dataset (e.g. ‘male:female ratio = 70:30 in training data’). These indicators might be compared to real world distributions or desired equity levels. If thresholds were set (say, no group should be <15% of data), note whether the dataset meets them or what adjustments were made.
Coverage of Scenarios: - A qualitative or quantitative assessment of how well the data covers expected use cases. For example: a checklist of scenarios with data counts, ‘Contains data for all 10 regions of the UK, includes examples of all major categories of inquiries from last year’s records, synthetic data added for rare but critical scenario X’. This ensures that the model won’t be blindsided by a common scenario that was absent in training.
Data Lineage & Documentation: - Although not a metric per se, an output here is documentation. It could be a Data Factsheet or datasheet for dataset, capturing where data came from, how it was processed, any assumptions or filtering applied, and any known limitations of the dataset.
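A minimal sketch of how the data quality and bias-indicator checks above could be automated is shown below, using pandas. The column names ("age", "sex", "region") and the 15% minimum group share are illustrative assumptions, not requirements of this framework.

```python
# Illustrative data quality and bias-indicator checks; adapt fields and thresholds locally.
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    report = {
        # Percentage of missing values per column
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        # Group balance indicator, to be compared against agreed thresholds
        "group_share_pct": (df["sex"].value_counts(normalize=True) * 100).round(1).to_dict(),
        # Scenario coverage: which regions are actually represented in the data
        "regions_present": sorted(df["region"].unique().tolist()),
    }
    report["min_group_share_ok"] = min(report["group_share_pct"].values()) >= 15  # e.g. no group <15%
    return report

df = pd.DataFrame({
    "age": [34, 51, None, 29],
    "sex": ["F", "M", "F", "F"],
    "region": ["North East", "London", "Wales", "London"],
})
print(data_quality_report(df))
```

Checks like these can be run automatically whenever the dataset changes, with the output stored alongside the data documentation described above.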
Focus:
The focus is to ensure the model is learning correctly, generalising beyond training data, and embedding fairness, explainability, and robustness from the start. For example, teams may use validation datasets, cross-validation, or interpretable models to monitor for overfitting and bias. By the end of this phase, there should be a trained candidate model that meets initial performance criteria and shows no major flaws.
Example Outputs/Metrics:
Model Performance on Validation Data: - Typical metrics like accuracy, F1-score, precision/recall, RMSE (for regression), etc., evaluated on a hold-out validation set. For instance, ‘Validation accuracy = 92%, exceeding the 90% target, F1 = 0.88 for class A vs 0.85 for class B’. These results indicate how well the model generalises. Also track metrics across multiple validation splits (k-fold cross-validation), e.g. ‘Standard deviation of accuracy across 5 folds = 1.2%’ to gauge stability.
Fairness Metrics (on Validation): - Calculate any relevant fairness statistics on the model’s validation predictions. For example: ‘False negative rate for group X = 5%, for group Y = 9% (disparity of 4%)’ or ‘Calibration between demographics is within 2%’. If these are out of acceptable range, note any mitigation integrated (e.g. adjusted threshold or retrained with balanced data).
Overfitting Indicators: - Metrics or observations that ensure the model is not memorizing training data. For example, compare training vs validation performance. If training accuracy is 99% but validation 85%, that’s an overfitting red flag. Or use techniques like ‘validation curve’ to see if performance plateaus. An output could be, ‘No overfitting observed, training and validation loss curves converged, with validation performance within 1% of training’.
Model Documentation: - At this stage, a model description or design document is often produced, capturing architecture, feature inputs, and justification for choices. It might also include initial explainability analysis. E.g. feature importance from the model if available, like ‘Feature X accounted for 40% of decision importance in a tree model’, to verify it aligns with expectations. Domain experts might check that and say it makes sense or not. This documentation is important for later explainability and transparency requirements.
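A minimal sketch of the stability and overfitting checks described above is shown below, using scikit-learn. The dataset, model and 10% train/validation gap threshold are illustrative assumptions.

```python
# Illustrative validation-phase checks: k-fold stability and a train/validation gap check.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# Stability across folds: report mean and standard deviation, as in the metrics above
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Overfitting indicator: a large train/validation gap is a red flag
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
gap = model.score(X_train, y_train) - model.score(X_val, y_val)
print(f"Train/validation accuracy gap: {gap:.3f}")
if gap > 0.10:  # illustrative threshold
    print("Warning: possible overfitting - investigate before proceeding")
```

The same harness can be extended with the fairness metrics noted above once group labels are available in the validation set.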
Focus:
At this stage, the AI system and its integrated components are put through rigorous pre-deployment checks. This is the classic testing phase, where the system is exercised against a wide range of scenarios and quality requirements. The aim is to confirm that the system behaves as intended under realistic and adverse conditions, and that all requirements set earlier in the project have been met. By the end of Validation & Verification, the team should have high confidence with evidence that the AI system is ready for real-world use, or identify issues that need fixing before it can proceed.
Verification here means confirming that the system implementation aligns with specified requirements. Validation refers to confirming that the system is fit for its intended use in real-world conditions. These sit alongside, but are distinct from, the broader framework’s concepts of Testing (generating evidence) and Evaluation (interpreting that evidence in context).
Activities typically include:
Example Test Techniques:
To support thorough and consistent AI testing, teams should consider established test techniques. These help verify model behaviour and track quality over time. Example techniques include:
Teams should select techniques appropriate to their system type and risk profile. AI testing techniques will evolve over time, hence it is important to keep them under regular review.
Example Outputs/Metrics:
Test Coverage: - A metric indicating how much of the AI system’s logic has been tested. For rule-based systems, this might be the percentage of rules executed at least once in tests. For ML, it could be coverage of input space or scenarios (e.g. ‘100% of requirement-specified scenarios tested, 85% of identified edge cases tested’). High coverage lends confidence that most behaviors have been vetted.
Test Results by Quality Characteristic: - A summary status of how the system performed with respect to each quality attribute detailed before. For instance: ‘Functional accuracy: passed (met 90% accuracy target). Performance: passed (avg response 800ms < 1s requirement). Reliability: passed (no crashes in 24h test). Security: 2 vulnerabilities found (SQL injection bug on input API – needs fix). Fairness: passed after mitigation (no significant disparity >3%)’. This can be presented as a checklist or table – effectively a scorecard across the quality dimensions.
User Feedback (Pilot/UAT): - If a pilot user group or user acceptance testing (UAT) was conducted (which is recommended especially for tools with user interfaces), collect their feedback quantitatively. E.g. ‘In UAT, 90% of users were able to complete the task with the AI’s assistance without errors, average satisfaction rating 4.2/5’. Also capture qualitative issues (like ‘users found explanations confusing in 3 out of 10 cases’). This informs last-mile improvements.
Bug and Issue Counts: - The number of defects found during this testing phase, often categorized by severity. For example: ‘5 critical issues (must fix), 12 minor issues logged’. And ideally, by the end, all critical issues are resolved or have acceptable workarounds. Particularly important are any ethical or safety issues discovered. These must be addressed or explicitly accepted by governance if not fixable.
Go/No-Go Recommendation: - Usually a testing phase for major changes or strategic solutions ends with a formal test report. A key outcome is a recommendation on whether the system is fit for deployment. This is often phrased as ‘Proceed to deployment’ or ‘Proceed to deployment - but with a few conditions’ or ‘Not ready, requires rework on X’. This recommendation will feed into the governance decision-making in the next phase.
Focus:
Before deployment, AI systems should pass an Operational Acceptance Test (OAT) to confirm they are ready for a production environment. This stage assesses whether the system has the safeguards, processes, and resilience needed for safe and reliable operation at scale.
Traditional OAT covers areas like alerting and incident response, failover and redundancy, user access and permission management, and audit logging. For AI systems, additional considerations are critical:
Example Outputs/Metrics:
Manual override interfaces (e.g. admin dashboards) are in place. Override Rate (%) measures the percentage of AI decisions overridden by human operators; high rates may indicate trust issues or model performance problems.
Emergency stop mechanisms and graceful degradation or fallback to manual mode are available. Mean Time to Shutdown (MTTS) measures the time taken to fully disable the AI system after initiating shutdown.
Decisions and actions are versioned, reversible, and supported by Human-in-the-Loop (HITL) workflows. Undo Request Rate (%) measures the percentage of AI decisions that are reversed; high rates may indicate poor decision quality or lack of trust.
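A minimal sketch of how these operational readiness metrics might be computed from a decision log is shown below. The log structure and field names are hypothetical assumptions.

```python
# Illustrative computation of Override Rate, Undo Request Rate and time to shutdown
# from a hypothetical decision log; field names are assumptions.
from datetime import datetime

decision_log = [
    {"decision_id": 1, "overridden_by_human": False, "reversed": False},
    {"decision_id": 2, "overridden_by_human": True,  "reversed": False},
    {"decision_id": 3, "overridden_by_human": False, "reversed": True},
]

total = len(decision_log)
override_rate = 100 * sum(d["overridden_by_human"] for d in decision_log) / total
undo_request_rate = 100 * sum(d["reversed"] for d in decision_log) / total

# Time to shutdown: from initiating shutdown to the system being fully disabled
shutdown_initiated = datetime(2025, 6, 5, 10, 0, 0)
system_disabled = datetime(2025, 6, 5, 10, 4, 30)

print(f"Override rate: {override_rate:.1f}%")          # e.g. 33.3%
print(f"Undo request rate: {undo_request_rate:.1f}%")  # e.g. 33.3%
print(f"Time to shutdown: {system_disabled - shutdown_initiated}")  # e.g. 0:04:30
```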
Focus:
This phase covers releasing the AI system into the live environment and integrating it into the wider business or service workflow. Even after extensive pre-release testing, new issues can appear in production, so this stage includes final smoke testing and operational checks.
AI systems should be deployed using secure, repeatable, and automated processes. Teams are encouraged to integrate testing and assurance into their DevOps pipelines and use Continuous Integration/Continuous Deployment (CI/CD) workflows.
Key activities include:
Example Outputs/Metrics:
Smoke Test Results (Production): - Results of final smoke tests run in the production environment or staging environment identical to production. This could include live proving of user journeys. For example: ‘Full workflow test (user submits application -> AI risk scoring -> database update -> user notified) passed all steps, data flows and hand-offs confirmed’. If any integration bugs were found (e.g. data format mismatches between systems), those are resolved or documented.
Production Performance Metrics: - Baseline metrics collected with the system running under production load (or simulated load). E.g. ‘Average latency of AI API calls in production = 850ms, p95 latency = 1.3s, within acceptable range’. Throughput might be measured during a load test, ‘System can handle 50 requests per second with <5% error rate’. These numbers confirm that the system meets performance requirements in the real setup.
Deployment Checklist Completion: - A completed checklist indicating all necessary deployment steps were done. For instance: ‘Model artifact hashed and archived, config files updated, monitoring alerts set up (yes/no), security review sign-off obtained (yes), ATRS transparency record published (yes)’. This ensures nothing is missed like forgetting to start the monitoring service or failing to inform support teams.
Final Approval Records: - Documentation of governance approval for go live. This might be meeting minutes or a form signed by a relevant group. Essentially a record that due diligence was done and accountability is taken for deploying the AI.
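A minimal sketch of a post-deployment smoke test is shown below. The endpoint URL, payload fields, response field and latency budget are all hypothetical assumptions to be replaced with the service’s own contract.

```python
# Illustrative smoke test against a hypothetical AI scoring endpoint.
import time
import requests

ENDPOINT = "https://example.gov.uk/api/ai/score"  # hypothetical URL

def smoke_test() -> None:
    payload = {"application_id": "TEST-001", "income": 18000, "savings": 4000}
    start = time.perf_counter()
    response = requests.post(ENDPOINT, json=payload, timeout=5)
    latency = time.perf_counter() - start

    assert response.status_code == 200, f"Unexpected status {response.status_code}"
    assert "risk_score" in response.json(), "Response missing expected field"
    assert latency < 1.0, f"Latency {latency:.2f}s exceeds 1s requirement"
    print(f"Smoke test passed in {latency:.2f}s")

if __name__ == "__main__":
    smoke_test()
```

Tests like this can run automatically in the CI/CD pipeline immediately after each release, before traffic is routed to the new version.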
Focus:
Deployment is not the end of the lifecycle. Continuous monitoring and improvement are essential to keep AI systems reliable, efficient, and fair over time. Once in operation, the AI system must be monitored to ensure it performs as intended and to detect issues early.
Teams should establish ongoing monitoring of key metrics, including:
Monitoring should include alerts for anomalous behaviour, such as predictions deviating significantly from expectations or input data distributions shifting away from training data. Scheduled re-testing, periodic audits, and compliance reviews (such as ‘AI health checks’) provide additional safeguards.
Where business processes change or new data sources emerge, AI models may need to be retrained or updated. Such changes must go through a controlled process with re-testing and governance sign-off to prevent risks from being introduced unnoticed.
Example Outputs/Metrics:
Real-Time Performance & Drift Metrics: - Continuous tracking of metrics such as latency and throughput in production (monitored on dashboards), plus data drift and model drift indicators. For instance: a data drift metric might be the statistical distance between recent input data distribution and the training data distribution (alert if exceeds threshold), a model drift metric might compare model predictions now vs. a baseline (alert if e.g. the acceptance rate changes by more than X%). Also track system uptime and any failure incidents (downtime or error count).
Incident Reports: - If any issues occur (e.g. the model made a significant error or the system was unavailable for a period), document each incident and the response. Metrics here include the number of incidents per month/quarter and mean time to detection and resolution. For example: ‘3 minor incidents in the last quarter (1 data pipeline outage for 2 hours, 2 cases of incorrect outputs flagged by users); all resolved within SLA; no major incidents’. This provides transparency and learning opportunities.
Regular Re-testing and Audits: - The team might schedule something like quarterly regression tests on the AI. Metric example: ‘100% of test suite re-run after each monthly model update, with 98% pass rate (2% known non-critical issues under review)’. Or an annual audit might produce a compliance scorecard: e.g. ‘Yearly audit on 01/2026 reconfirmed model meets fairness requirements (differences within 2%), documentation updated, no new risks identified’. Essentially, after each major update or period of operation, the AI needs to re-earn its safety badge.
Model Refresh Frequency: - How often the model is updated/retrained with new data, and whether it’s keeping up with reality. If, say, the procedure is ‘retrain model every 3 months’, measure adherence: e.g. ‘Model version 1.3 deployed after Q1 retraining, achieved +2% accuracy improvement; next retrain scheduled Q2’. If concept drift is faster than expected, frequency might increase. Conversely, measure if frequent changes are causing any regression, etc.
End-of-Life Planning (Retirement): - Eventually, the AI system may be retired or replaced. As part of continuous improvement, there should be a plan for decommissioning when appropriate. Metrics or checks include ensuring data retention and deletion policies are followed (‘100% of personal data archived or deleted as per policy upon retirement’), and lessons learned compiled for future projects. While this is the final step of the lifecycle, it’s noted here for completeness: successful retirement is also part of assuring the AI’s overall lifecycle (no loose ends or forgotten models left running).
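A minimal sketch of the data drift check described in the metrics above is shown below, comparing a feature’s training distribution against recent production inputs with a two-sample Kolmogorov-Smirnov test from SciPy. The simulated data and alert threshold are illustrative assumptions.

```python
# Illustrative data drift detection using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)    # distribution at training time
production_feature = rng.normal(loc=0.3, scale=1.0, size=1000)  # recent live inputs (shifted)

statistic, p_value = ks_2samp(training_feature, production_feature)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")

if p_value < 0.01:  # illustrative alert threshold
    print("ALERT: input distribution has drifted from training data - trigger re-testing and review")
```

In practice this check would run on a schedule for each monitored feature, feeding the dashboards and alerts described in this phase.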
While the defensive assurance approach tells us when to perform testing activities, this section describes what tests to perform in detail. We present a modular testing approach for AI, consisting of distinct testing modules, each addressing a specific facet of AI quality. These modules can be thought of as building blocks that are combined according to the AI system and its risk level. You might emphasise some modules more than others, but together they form a comprehensive testing regimen. The modular design allows flexibility: for example, a simple rule-based system might not need elaborate adversarial testing, whereas a machine learning model would. Each module also notes how approaches may differ for various AI types (rule-based vs ML vs generative vs agentic AI), ensuring the unique challenges of each are covered.
The table below shows where each quality attribute is covered across the modular AI testing approach. It helps align your assurance effort across all modules.
Quality Attribute | Data & Input Validation | Functionality Testing | Bias & Fairness Testing | Explainability & Transparency | Robustness & Adversarial Testing | Performance & Efficiency Testing | System & Integration Testing | UAT & Ethical Review | Continuous Monitoring & Improvement |
---|---|---|---|---|---|---|---|---|---|
Autonomy | ✅ | ✅ | |||||||
Fairness | ✅ | ✅ | |||||||
Safety | ✅ | ✅ | |||||||
Ethical Compliance | ✅ | ✅ | |||||||
Side Effects / Hacking | ✅ | ✅ | |||||||
Security | ✅ | ✅ | |||||||
Transparency | ✅ | ✅ | |||||||
Explainability | ✅ | ||||||||
Accountability | ✅ | ✅ | |||||||
Compliance | ✅ | ✅ | |||||||
Functional Suitability | ✅ | ✅ | |||||||
Performance Efficiency | ✅ | ||||||||
Reliability | ✅ | ✅ | |||||||
Maintainability | ✅ | ||||||||
Evolution | ✅ | ||||||||
Usability | ✅ | ||||||||
Accessibility | ✅ | ||||||||
Compatibility | ✅ | ||||||||
Portability | ✅ | ||||||||
Adaptability | ✅ | ||||||||
Data Correctness | ✅ | ✅ |
There are 9 modules in this framework:
1. Data & Input Validation
2. Functionality Testing
3. Bias & Fairness Testing
4. Explainability & Transparency
5. Robustness & Adversarial Testing
6. Performance & Efficiency Testing
7. System & Integration Testing
8. UAT & Ethical Review
9. Continuous Monitoring & Improvement
AI systems are only as good as the data that feeds them. If data is missing, skewed, or unsafe, errors and bias will follow. This module is about making sure training, validation, and live inputs are valid, representative, and compliant. It is the first safeguard in building confidence in an AI system.
Testing generates the evidence. Evaluation interprets that evidence. Together, they form the assurance that data is safe to use.
Basic validation
Statistical checks
Compliance and safety
By the end of this module, the team should have high confidence in the data pipeline: that the training data was sound and representative, and that live inputs are properly checked. The output of Data & Input Validation is often an ‘OK to proceed’ signal for training (i.e. data is fit for model consumption) and a set of runtime validators that protect the model in production.
This module verifies whether the AI model and system work as intended. In simple terms, it validates whether they behave correctly when given the inputs they were designed for. For rule-based systems, this means checking each rule produces the right outcome. For predictive or generative models, it means confirming outputs are valid, accurate, and within expectations. Strong functionality testing ensures that, before the system is integrated or deployed, the model’s logic is sound.
Test each rule in isolation with unit-style tests. Include combinations of conditions, boundary checks (e.g., exactly £20k income), and invalid inputs. Confirm that fallback rules trigger appropriately.
Example: If eligibility is defined as Income < £20k AND Savings < £5k, test cases should include: meets both criteria, fails one, fails both, and boundary cases like exactly £20k.
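A minimal sketch of these unit-style tests is shown below, using pytest. The rule implementation itself is illustrative; in practice the tests would call the real rules engine.

```python
# Illustrative unit tests for the eligibility rule described above.
import pytest

def is_eligible(income: float, savings: float) -> bool:
    """Eligible if income is below £20k AND savings are below £5k (illustrative rule)."""
    return income < 20_000 and savings < 5_000

@pytest.mark.parametrize(
    "income, savings, expected",
    [
        (18_000, 4_000, True),    # meets both criteria
        (18_000, 6_000, False),   # fails savings criterion
        (25_000, 4_000, False),   # fails income criterion
        (25_000, 6_000, False),   # fails both
        (20_000, 4_000, False),   # boundary: exactly £20k is not below the threshold
    ],
)
def test_eligibility_rule(income, savings, expected):
    assert is_eligible(income, savings) == expected
```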
Validate using statistical metrics on hold-out test sets (accuracy, precision/recall, F1, RMSE, etc.). Use confusion matrices to highlight misclassifications. Apply metamorphic testing (e.g., adding small noise to inputs should not change predictions).
Example: If a model classifies medical images, flipping brightness by 5% should not radically change the prediction.
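A minimal sketch of such a metamorphic test is shown below. The model and "image" data are simple stand-ins; the metamorphic relation (a 5% brightness change should rarely change the predicted class) is the point being demonstrated.

```python
# Illustrative metamorphic test: small perturbations should rarely change predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((200, 64))                      # stand-in for flattened images
y_train = (X_train.mean(axis=1) > 0.5).astype(int)   # stand-in labels
model = LogisticRegression().fit(X_train, y_train)

def metamorphic_violation_rate(model, images, factor=1.05):
    """Fraction of predictions that change after a 5% brightness increase."""
    brighter = np.clip(images * factor, 0.0, 1.0)
    return float(np.mean(model.predict(images) != model.predict(brighter)))

test_images = rng.random((100, 64))
rate = metamorphic_violation_rate(model, test_images)
print(f"Predictions changed for {rate:.1%} of images after a 5% brightness increase")
# Flag for investigation if this exceeds an agreed threshold for the use case.
```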
Since outputs are open-ended, test with both quantitative metrics (e.g., BLEU/ROUGE for translation) and human evaluation (e.g., correctness, tone, relevance). Include adversarial prompts to check resilience.
Example: Prompt ‘Summarise this article in one sentence’ should return a factual, concise summary, not an irrelevant or unsafe response.
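A minimal sketch of an automated output check for a generative model is shown below. The `generate` function is a hypothetical stand-in for whatever model or API the team uses, and the rubric checks (sentence count, required facts, banned content) are illustrative; they would sit alongside human review and benchmark metrics such as ROUGE.

```python
# Illustrative rubric checks for generative output; `generate` is a hypothetical stand-in.
def generate(prompt: str) -> str:
    # Placeholder: call your LLM / generative model here.
    return "The article reports that the new service reduced waiting times by 30%."

BANNED_PHRASES = ["as an AI language model", "I cannot verify"]
REQUIRED_FACTS = ["waiting times", "30%"]

def check_summary(prompt: str) -> dict:
    output = generate(prompt)
    return {
        "single_sentence": output.strip().count(".") <= 1,
        "contains_required_facts": all(fact in output for fact in REQUIRED_FACTS),
        "no_banned_content": not any(p.lower() in output.lower() for p in BANNED_PHRASES),
    }

print(check_summary("Summarise this article in one sentence."))
# Adversarial prompts (e.g. attempts to extract hidden instructions) should run through
# the same harness, with failures routed to human review.
```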
Verify that the agent achieves its defined goals under varied scenarios. Test reward signals carefully to avoid reward hacking. Simulate environment changes and check whether the agent still behaves consistently and safely.
Example: A cleaning robot should actually clean when rewarded for ‘clean floor’, not simply stay near the charging dock to exploit the scoring.
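A heavily simplified sketch of a reward-hacking check is shown below: the test verifies the real-world goal (the floor is actually clean) independently of the reward the agent accumulated. The environment and policy interfaces are hypothetical.

```python
# Illustrative check that reward corresponds to the real goal, not just the proxy signal.
class CleaningEnvironment:
    def __init__(self):
        self.dirty_tiles = 10

    def step(self, action: str) -> float:
        if action == "clean" and self.dirty_tiles > 0:
            self.dirty_tiles -= 1
            return 1.0  # proxy reward signal
        return 0.0

def run_episode(policy, env, steps=20):
    total_reward = sum(env.step(policy(env)) for _ in range(steps))
    return total_reward, env.dirty_tiles

# A degenerate policy (e.g. idling near the dock) could still accumulate reward under a
# badly specified reward function; checking the environment state catches that.
reward, remaining = run_episode(lambda env: "clean", CleaningEnvironment())
assert remaining == 0, "Agent earned reward without achieving the real goal"
print(f"Reward={reward}, dirty tiles remaining={remaining}")
```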
Overall, the functionality testing module provides evidence that the AI’s ‘brain’ works correctly in isolation. For ML and agents, this is often a statistical confidence (‘with X% confidence, performance >= threshold’), while for rule-based systems it is near-certainty for each rule. This module ensures we aren’t integrating or deploying a fundamentally flawed model or system functionality.
This module ensures the AI system treats individuals and groups fairly. The goal is to detect and address any disparities in how different demographic or protected groups are treated by the model or system. For public sector use, this also supports legal compliance with anti-discrimination laws like the Equality Act 2010.
Rule-based systems
Machine learning models
Generative AI (LLMs)
Agentic AI (autonomous agents)
After completing the Fairness & Bias Testing module, the team should have a clear understanding of any bias issues and have taken steps to correct them. Importantly, this module’s results should be reviewed by both technical teams and policy/legal teams. If any disparity remains, it should be consciously signed-off with rationale.
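A minimal sketch of the kind of group fairness metrics this module produces is shown below: selection rates and false negative rates per group, with the disparity compared against an agreed threshold. The group labels, data and threshold are illustrative assumptions.

```python
# Illustrative group fairness metrics: selection rate and false negative rate per group.
import numpy as np

def rate(mask):
    return float(mask.mean()) if mask.size else float("nan")

def fairness_report(y_true, y_pred, group):
    report = {}
    for g in np.unique(group):
        idx = group == g
        positives = y_true[idx] == 1
        report[g] = {
            "selection_rate": rate(y_pred[idx] == 1),
            "false_negative_rate": rate(y_pred[idx][positives] == 0),
        }
    return report

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group  = np.array(["X", "X", "X", "X", "Y", "Y", "Y", "Y"])

report = fairness_report(y_true, y_pred, group)
gap = abs(report["X"]["false_negative_rate"] - report["Y"]["false_negative_rate"])
print(report)
print(f"False negative rate disparity: {gap:.2f}")  # compare against the agreed threshold
```

Which metrics matter, and what disparity is acceptable, should be agreed with policy and legal colleagues before the results are signed off.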
This module ensures the AI system’s decisions can be understood, traced, and explained, both by those overseeing the system and by the people affected by it. The aim is to test whether the model’s decisions are inspectable and whether the explanations are accurate, complete, and useful. For the public sector, this directly supports accountability, audit, and compliance requirements.
Rule-based systems
Machine learning models
Generative AI (LLMs)
Agentic AI (autonomous agents)
Once this module is complete, teams should be confident that the system can answer, with evidence, why it behaved the way it did.
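A minimal sketch of one explainability check is shown below, using permutation importance from scikit-learn (tools such as SHAP or LIME can be used in a similar way). The feature names and the expectation that a nominally irrelevant feature should not dominate are illustrative assumptions for domain review.

```python
# Illustrative explainability check: do the most influential features match expectations?
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
feature_names = ["income", "savings", "age", "postcode_area", "random_noise"]  # illustrative

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, importance in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")

# Domain review step: flag for investigation if a feature that should be irrelevant
# (e.g. postcode_area) dominates the model's decisions.
```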
This module tests how the AI system behaves under stress, uncertainty, or attack. The goal is twofold: to confirm the system degrades gracefully under unexpected or extreme conditions, and to confirm it resists deliberate attempts to manipulate or subvert it.
This is critical for any system exposed to the real world, especially where safety, integrity, or trust is on the line.
Rule-based systems
Machine learning models
Generative AI (LLMs)
Agentic AI (autonomous agents)
When this module is done properly, you should have high confidence that the system won’t fall over in bad conditions or, worse, be manipulated into doing the wrong thing. It’s a core part of making AI secure, safe, and production-ready.
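A minimal sketch of two robustness checks is shown below: measuring accuracy degradation as input noise increases, and fuzzing the input layer with malformed or extreme values. The model, noise levels and expectations are illustrative assumptions.

```python
# Illustrative robustness checks: noise degradation curve and input fuzzing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
baseline = model.score(X, y)

rng = np.random.default_rng(0)
for noise_level in (0.1, 0.5, 1.0):
    noisy = X + rng.normal(scale=noise_level, size=X.shape)
    print(f"noise={noise_level}: accuracy {model.score(noisy, y):.3f} (baseline {baseline:.3f})")

# Fuzzing the input layer: malformed or extreme inputs should be rejected cleanly, not crash
for bad_input in (np.full((1, 10), np.nan), np.full((1, 10), 1e12)):
    try:
        model.predict(bad_input)
        print("WARNING: malformed or extreme input was accepted silently")
    except ValueError:
        print("Malformed input rejected as expected")
```

In a real service the fuzzing would target the full input-handling layer (APIs, validators) rather than the bare model.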
This module checks whether the AI system runs fast enough, scales effectively, and uses resources efficiently. That includes both latency (how quickly it responds) and throughput (how many requests it can handle). Efficiency also matters, especially when cost, hardware limits, or energy use are concerns.
Rule-based systems
Machine learning models
Generative AI (LLMs)
Agentic AI (autonomous agents)
When this module is done properly, it gives assurance that the AI system won’t fall apart under pressure, whether that’s high traffic, limited hardware, or fast paced environments. It ensures the system scales with demand, not against it.
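A minimal sketch of latency benchmarking for an inference call is shown below, reporting p50/p95/p99 latencies over repeated calls. The `predict_one` function is a stand-in for the real inference call, and the latency budget is an illustrative assumption.

```python
# Illustrative latency benchmark for a single-prediction call.
import time
import numpy as np

def predict_one(payload):
    time.sleep(0.002)  # stand-in for real model inference
    return {"score": 0.7}

latencies = []
for _ in range(200):
    start = time.perf_counter()
    predict_one({"income": 18000})
    latencies.append(time.perf_counter() - start)

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50*1000:.1f}ms  p95={p95*1000:.1f}ms  p99={p99*1000:.1f}ms")
assert p95 < 1.0, "p95 latency exceeds the 1s requirement"  # illustrative budget
```

Throughput and scalability testing would build on this with concurrent load, ideally using the team’s existing load-testing tooling.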
This module verifies that the AI component works correctly as part of a wider system, not just on its own. It checks that the full end-to-end workflow, from user input to AI output to final outcome, behaves as expected, and that all interfaces and dependencies work correctly. It ensures the AI’s integration with services like APIs, databases, user interfaces, business logic, and identity/auth flows is seamless and safe, especially under real-world conditions.
Rule-based systems
Machine learning models
Generative AI (LLMs)
Agentic AI (autonomous agents)
When this module is done properly, it gives you confidence that the AI can be deployed without disrupting the end-to-end user or system journey.
This module is the final gate before deployment. It has two purposes: to confirm that real users can work with the system safely and effectively (user acceptance), and to confirm that ethical and governance obligations have been met before release (ethical review).
At this stage, the focus shifts from technical readiness to real world fitness. Do people find the AI system usable, understandable, and trustworthy? Can oversight bodies sign off on its responsible deployment?
Rule-based systems
Machine learning models
Generative AI (LLMs)
Agentic AI
When this module is complete, you should have high confidence that real users can work with the AI system safely and effectively.
Deploying AI isn’t the end of assurance, it’s the beginning of live operations. This module ensures that after deployment, the system is actively monitored, maintained, and improved. AI systems change over time: data drifts, user behaviour shifts, performance degrades, and new risks emerge.
Set up dashboards and alerts for:
When this module is embedded into your operating model, AI becomes a live service, not a frozen product. It’s how you keep your system effective, lawful, and trustworthy, even as everything around it changes.
While automation and AI reduce manual effort, they may inadvertently contribute to the degradation of critical human skills over time. This risk can compromise Human in the Loop (HITL) systems, where effective oversight relies on staff retaining domain knowledge and decision-making ability. Organisations should assess the risk of skill loss, particularly in areas where staff serve as safety backstops (e.g. validating AI outputs) or where manual recovery may be needed (e.g. during outages or model failures).
This framework does not mandate specific tools. Departments should select tools that align with their testing needs, guided by appropriate governance, risk assessments, and testing goals.
We encourage teams to refer to the UK Government AI Playbook, which provides practical guidance on responsible AI development and use. For evaluating LLMs, teams should also consider Inspect, an open-source model evaluation framework released by the UK AI Safety Institute.
The responsible deployment of Artificial Intelligence in public services requires more than innovation. It demands trust, transparency, and accountability. This framework provides a structured approach to testing and assuring the quality of AI systems, supporting departments in meeting their obligations to the public while enabling the safe use of advanced technologies. By aligning testing and assurance activities with defined quality principles, lifecycle strategies, modular testing methods, and proportionate risk management, government teams can evaluate AI systems consistently and rigorously. This framework recognises the evolving nature of AI, especially with the emergence of complex agentic and generative models, and promotes continuous adaptation, monitoring, and governance to keep testing practices relevant and robust.
Action | Name | Date |
---|---|---|
Author | Mibin Boban, xGov Testing Community Chair / Head of Quality Engineering - GDS | 5/6/2025 |
Working Group Review | 1. Dinesh KTJ, Principal Test Engineer - Home Office | 16/6/2025 |
 | 2. David Lee, Lead Technical Architect - GDS | 17/6/2025 |
 | 3. Vas Ntokas, Lead Test Engineer - DWP | 18/6/2025 |
 | 4. David Rutter-Close, Lead Test Engineer - DfE | 19/6/2025 |
 | 5. Adam Byfield, Principal Technical Assurance Specialist - NHS England | 19/6/2025 |
 | 6. Matthew Dyson, Principal Test Assurance Manager - National Highways (DfT) | 30/6/2025 |
Review | 1. Nayyab Naqvi, Principal Technologist for AI in the Public Sector - DSIT | 27/8/2025 |
 | 2. Bharathi Sudanagunta, Test Manager - MHRA | 27/8/2025 |
 | 3. Carla Chiodi, Senior AI Engineer - 2i Testing | 31/7/2025 |
An initiative by the Cross Government Testing Community (UK)