Security & Compliance

Borrowing the UK AI Safety Institute's Inspect Tool: A Nonprofit Evaluation Walkthrough

Most nonprofits launch an AI chatbot or assistant on little more than a few hopeful test conversations and a gut feeling that it seems to work. There is a better way, and it is free. Inspect, the open-source evaluation framework released by the UK's national AI safety body, lets a small technical team measure how an AI system actually behaves across dozens or hundreds of realistic prompts before constituents ever see it. This walkthrough shows how a resource-constrained nonprofit can put it to work.

Published: June 7, 2026•16 min read•Security & Compliance

A nonprofit IT team using the Inspect framework to evaluate an AI system

When a nonprofit decides whether an AI tool is safe to deploy, the testing is usually informal. Someone on staff asks the chatbot a handful of questions, reads the answers, decides they look reasonable, and the system goes live. This works until it does not. A donor asks something slightly unusual, a vulnerable constituent phrases a request in an unexpected way, or a prompt nudges the model into giving advice it should never give, and the gap between "looked fine in a few tries" and "behaves reliably across the real range of inputs" suddenly becomes very expensive.

The discipline that closes that gap is called evaluation, and it has long been the preserve of well-funded AI labs. That changed when the UK AI Safety Institute, the British government body now operating as the AI Security Institute, open-sourced its internal evaluation framework, Inspect, in 2024. It is the same tooling used to assess frontier AI models for national-scale risks, released under a permissive license and free for anyone to use. For a nonprofit, that means access to professional-grade evaluation methods without a professional-grade budget.

Inspect will not write your tests for you, and it does assume a team member comfortable with a little Python and the command line. But it is far more approachable than its pedigree suggests. The core idea is simple: you assemble a set of inputs you want to test, you define what a good answer looks like, you run your AI system against the whole set automatically, and you get back structured, repeatable results you can read, share, and rerun whenever the model or your prompts change. That repeatability is the part informal testing can never offer.

This walkthrough explains what Inspect is, why a nonprofit would bother with it, and how to move from installation to a first useful evaluation. It is written for the technically curious nonprofit staff member or volunteer, not for machine learning engineers, and it keeps the focus on practical evaluation of the kinds of AI systems nonprofits actually deploy: chatbots, assistants, intake tools, and content generators.

What Inspect Is and Why It Matters for Nonprofits

Inspect is a Python framework for building and running evaluations of large language models, the kind of AI that powers chatbots and writing assistants. Rather than testing by hand, you describe an evaluation in code once and then run it against any model you can connect to, including the major commercial APIs and locally hosted open-weight models. The framework handles the mechanics of sending each prompt, collecting each response, judging it against your criteria, and producing a tidy report.

It matters for nonprofits for three reasons. First, it makes testing systematic, replacing a few ad hoc conversations with a defined, repeatable battery of cases that cover the situations you actually care about. Second, it makes testing repeatable, so when you change your system prompt, switch models to save money, or upgrade to a new version, you can rerun the exact same evaluation and see whether anything got worse. Third, it produces evidence. A structured evaluation report is something you can show a board, a funder, or an auditor to demonstrate that you tested your AI responsibly before deployment, which is increasingly part of what good AI governance requires.

Inspect organizes everything around four concepts, and understanding them is most of the battle. A dataset is the collection of test inputs and, where you have them, the answers you expect. A task ties a dataset together with the logic for running it. A solver is the step that actually queries your AI system to get an answer. A scorer is the logic that decides whether each answer was good. Once those four pieces are in place, running an evaluation is a single command, and the results appear in a viewer you can read in your browser.

Dataset

Your list of test inputs paired with the target answers or behaviors you expect. For a nonprofit, this is the set of real questions and edge cases your AI will face.

Task

The unit Inspect runs. It bundles your dataset with the steps for answering and scoring, so a whole evaluation can be launched with one command.

Solver

The step that sends each prompt to your AI system and collects the response. This is where you point Inspect at the model or assistant you want to test.

Scorer

The logic that judges each answer, whether by exact match, keyword check, or even asking a second model to grade the response against your standard.

Getting Set Up

The barrier to entry is genuinely low. Inspect installs as a Python package with a single command, runs on any reasonably modern computer, and requires Python version 3.10 or later. Someone on your team who can install software and follow technical documentation can have it running in well under an hour. The framework's own documentation is thorough and aimed at exactly this kind of getting-started journey.

Beyond installing the package itself, you need a way for Inspect to reach an AI model. If you are evaluating a commercial system, that means an API key from the provider, which you set as an environment variable so Inspect can use it. If you would rather keep everything local for privacy or cost reasons, Inspect can connect to open-weight models you run yourself, which pairs naturally with the local-model approach we describe in our guide to running open models locally for nonprofits. Either way, the setup is configuration rather than coding.

Step 1: Install Python and Inspect

Confirm Python 3.10 or later is installed, then install the framework with the package manager. Inspect is published as an installable library with a one-line setup, and it brings its own command-line tool along with it.

Step 2: Connect a Model

Provide an API key for a commercial provider, or point Inspect at a locally hosted open-weight model. This is the system under test, the same one your constituents will eventually use.

Step 3: Confirm It Runs

Run one of the small example evaluations that ship with the framework or its community evaluation library. Seeing results appear in the log viewer confirms your setup works before you invest time in your own tests.

Building Your First Nonprofit Evaluation

The most valuable evaluation is the one built from your own reality, not a generic benchmark. The work that matters most happens before you write any code, when you decide what you are actually testing for. Start by listing the real situations your AI system will encounter, drawing on the questions constituents already ask your staff, the topics your chatbot is meant to handle, and the failure modes that would genuinely harm someone or embarrass the organization.

For a donor-facing assistant, that list might include common giving questions, requests it should politely decline, and attempts to extract information it should not share. For a service-navigation bot at a human services nonprofit, it might include eligibility questions phrased in many different ways, requests that should trigger a handoff to a human, and crisis language that demands a careful, safe response. The discipline of writing these cases down is itself clarifying, and it overlaps heavily with the work of running adversarial prompts against your own chatbot.

Once you have your cases, you turn each one into a dataset entry: the input you will send, and the target describing what a good response looks like. The target can be an exact answer for factual questions, a set of words that should or should not appear, or a description of the desired behavior that a scoring model can judge against. With the dataset assembled, you choose a scorer that fits, define the task, and run it. Inspect does the rest, sending every prompt, collecting every answer, scoring each one, and assembling the results.

Define the Behaviors You Care About

Accuracy, safety, refusals, tone, and privacy.

Be explicit about what success means. A good answer may need to be factually correct, decline inappropriate requests, avoid revealing private data, hand off to a human when required, and stay within the right tone for your audience.

Write Real, Varied Test Cases

Cover the ordinary and the edge.

Include the everyday questions your AI will mostly face, the unusual phrasings real people use, and the deliberately tricky prompts designed to make it fail. Breadth here is what separates a meaningful evaluation from a reassuring one.

Choose a Scoring Approach

From exact match to model-graded.

Simple cases can be scored by checking for an exact answer or required keywords. Nuanced cases, where tone or safety matters, can use a second model as a grader, which Inspect supports directly so you do not have to read every response by hand.

Use the Sandbox for Risky Tests

Run tool-using agents safely.

If your AI can run code or use tools, Inspect provides an isolated sandbox so any generated commands execute in a contained environment rather than touching your real systems. This matters when evaluating more capable agentic setups.

Reading and Acting on the Results

Inspect produces a detailed log of every evaluation, viewable in a browser-based viewer or a code editor extension. For each test case you can see the prompt that was sent, the answer the model gave, and the score it received, alongside aggregate numbers across the whole dataset. The aggregate score tells you how often your system met your standard; the individual transcripts tell you exactly where and how it fell short, which is where the real learning happens.

The point of all this is not the number itself but the decisions it informs. A low score on safety-related cases is a clear signal not to deploy until you have changed something, whether that is the system prompt, the model, additional guardrails, or a narrower scope for what the AI is allowed to do. A high score gives you defensible evidence that you tested responsibly, but it should be read with humility, since no evaluation covers every possible input and a passing grade is a floor rather than a guarantee.

The deeper value emerges over time. Because the evaluation is saved as code, you can rerun it whenever anything changes and compare the new results to the old. When a vendor pushes a model update, when you tweak your prompt to fix one problem, or when you switch models to cut costs, the same evaluation tells you instantly whether the change helped, hurt, or quietly broke something else. This turns testing from a one-time pre-launch ritual into an ongoing safeguard, which is exactly the posture that mature AI red teaming for nonprofits calls for.

Honest Limits and Cautions

Inspect is powerful, but it is not magic, and adopting it with clear eyes will save you frustration. It is a tool for people willing to work with code, so a nonprofit without anyone in that role will need a volunteer, a consultant, or a patient and capable staff member to lead the effort. The investment is modest compared with the protection it offers, but it is real, and pretending otherwise leads to abandoned projects.

Your Evaluation Is Only as Good as Your Cases

The framework runs whatever you give it. If your test cases miss an important scenario, the evaluation will too. Invest the most effort in writing realistic, varied, genuinely challenging inputs.

A Passing Score Is Not a Promise

No evaluation can cover every possible input. Strong results lower your risk and document your diligence, but they do not guarantee the system will never fail in the wild. Keep human oversight in place.

Mind Data and API Costs

Running hundreds of test prompts against a commercial model consumes paid tokens, and test data may contain sensitive examples. Use safe, synthetic data where possible and account for evaluation in your AI budget.

These cautions do not diminish the case for using Inspect; they sharpen it. The tool gives a small organization a credible, professional method for testing AI it could not otherwise afford, and pairing it with realistic cases and continued human judgment is how nonprofits get the benefit without overstating the protection. It fits comfortably alongside the broader procurement and assurance discipline we describe in what MITRE ATLAS and the OWASP LLM Top 10 mean for nonprofit AI procurement.

Where Inspect Fits in Your AI Workflow

Inspect is most valuable as one stage in a larger safety practice rather than a standalone exercise. It sits naturally between the design of an AI system and its launch, providing the evidence that turns "we think it works" into "we measured how it works." Treated this way, it becomes part of a pre-launch routine that every constituent-facing AI tool passes through before it goes live.

Build your evaluation while you build the AI system, so testing is designed in rather than bolted on at the end
Run the full evaluation as a gate before any constituent-facing launch, and require a passing standard you set in advance
Rerun it after every meaningful change to the model, prompt, or configuration to catch regressions
Keep the saved logs as documentation of due diligence for boards, funders, and auditors
Expand the test set whenever a real-world failure surfaces, so the evaluation grows stronger over time

Used this way, Inspect complements rather than replaces the human and procedural safeguards a responsible nonprofit already has in place. The combination of a structured pre-launch checklist and an automated evaluation is especially strong, and our pre-launch red team checklists for nonprofit chatbots describe the procedural half of that pairing in detail.

Conclusion

The gap between the way nonprofits usually test AI and the way it should be tested is wide, and for years it stayed wide because rigorous evaluation seemed to belong only to organizations with deep technical resources. Inspect closes that gap. A national AI safety body built a serious evaluation framework, used it on the most capable models in the world, and then gave it away. A nonprofit willing to invest a modest amount of technical effort can use the same tool to test its own AI with a rigor that would have been out of reach a few years ago.

The practical takeaway is to stop relying on a handful of hopeful test conversations. Write down the situations your AI will actually face, including the ones that could cause harm, turn them into a repeatable evaluation, and run it before you launch and after every change. The result is not only a safer system but a documented, defensible account of the care you took, which matters more each year as scrutiny of nonprofit AI use grows.

Borrowing the safety community's best tools is one of the smartest moves a resource-constrained nonprofit can make. Inspect is free, well-documented, and built for exactly this kind of careful testing. The organizations that adopt it will deploy AI with more confidence, fewer surprises, and a clearer conscience, knowing they measured how their tools behave instead of merely hoping for the best.

Want to Test Your AI Before It Goes Live?

We help nonprofits set up practical AI evaluation and red-teaming, from writing realistic test cases to standing up tools like Inspect, so you can launch constituent-facing AI with evidence rather than hope.

Talk to Our Team Explore Our Services