Per-Million-Token AI Pricing: Picking the Right Model Tier for Each Nonprofit Workflow
The same task can cost 100 times more depending on which model you point at it. A practical guide to matching model tiers to nonprofit workflows, so the AI spending you actually need is not crowded out by the AI spending you do not.

For most of 2024 and 2025, AI procurement at nonprofits looked like software procurement. You picked a vendor, you signed a per-seat contract, and you paid the bill. The cost was predictable because the unit was predictable: a license per staff member.
That world is now gone. The major AI providers all bill primarily on tokens, and increasingly so do the platforms that wrap them. A token is roughly three quarters of a word. Every prompt you send and every response you get back is metered, priced per million tokens, and added to a bill that arrives at the end of the month. Same task, same staff member, wildly different cost depending on which model the system pointed at it.
For nonprofit finance leaders, this is the most consequential shift in operational technology in a decade. It is also the shift most likely to blow up a budget that looked solid at the start of the fiscal year. The good news is that the spread between model tiers is enormous, and most nonprofit workflows do not need the top tier. The bad news is that defaults inside many tools point at the most expensive option, and unless someone is paying attention, that is what your nonprofit will pay for.
This guide compares the major model tiers on a per-million-token basis, then maps tiers to the workflow types nonprofits actually run. The goal is not to find the cheapest model. It is to match the model to the task so you stop paying premium prices for work that a budget-tier model would handle perfectly well.
The Pricing Landscape in May 2026
Token pricing is published per million tokens, split between input (what you send to the model) and output (what the model sends back). Output is almost always more expensive than input, often by a factor of three to five. For most nonprofit workflows, input tokens dominate the bill because prompts include context, instructions, and pasted source material, while outputs are comparatively short.
Anthropic Claude
Three current tiers covering Haiku, Sonnet, and Opus
- Haiku 4.5: $1 input, $5 output per million tokens
- Sonnet 4.6: $3 input, $15 output per million tokens
- Opus 4.7: $5 input, $25 output per million tokens
OpenAI GPT
Current and prior frontier tiers, with a Pro variant
- GPT-4o: $2.50 input, $10 output per million tokens
- GPT-5.4: $2.50 input, $15 output per million tokens
- GPT-5.5: $5 input, $30 output per million tokens
- GPT-5.4 Pro: $30 input, $180 output per million tokens
Google Gemini
Pro and Flash tiers, with context-length pricing thresholds
- Gemini 2.5 Flash: $0.30 input, $2.50 output per million tokens
- Gemini 2.5 Pro: $1.25 input, $10 output per million tokens (doubles above 200K context)
Open-Weight Models
Hosted via Together, Fireworks, Groq, DeepInfra, and similar
- Llama 4 Maverick: roughly $0.27 input, $0.85 output per million tokens
- DeepSeek V3.2: roughly $0.14 input, $0.28 output per million tokens
- Qwen 3.5 9B: roughly $0.05 input per million tokens
Note: Prices may be outdated or inaccurate.
Read these numbers carefully. The top of the table (Opus 4.7, GPT-5.5, the Pro variants) and the bottom (Qwen, DeepSeek, hosted Llama) are not even in the same order of magnitude. The difference between $5 input and $0.05 input is a factor of one hundred. Pointed at the same volume of work, that gap is the difference between a $500 monthly AI bill and a $50,000 one. The choice is rarely put that starkly, but that is what the underlying math says.
A second pattern worth noting: output tokens are the expensive ones. If your workflow produces long responses, especially long structured outputs like full draft documents, the output rate dominates. If your workflow produces short responses, classifications, or short summaries, the input rate dominates, and an expensive output rate matters less. This is why "cheap input, expensive output" providers like some hosted Llama tiers can still beat premium providers on bulk classification, and why drafting-heavy work tends to converge on whichever provider has the cheapest output rate.
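To make that blend concrete, here is a minimal sketch of the arithmetic, using rates from the table above. The input/output splits are illustrative assumptions, not measurements; the point is the shape of the result rather than the exact figures.

```python
def blended_rate(input_rate: float, output_rate: float, input_share: float) -> float:
    """Blended cost per million tokens, given the fraction of tokens that are input."""
    return input_share * input_rate + (1 - input_share) * output_rate

# Rates per million tokens from the table above; the splits are assumptions.
classification = blended_rate(input_rate=0.30, output_rate=2.50, input_share=0.95)  # short outputs
drafting = blended_rate(input_rate=5.00, output_rate=25.00, input_share=0.50)       # long outputs

print(f"Classification on a Flash-class model: ${classification:.2f} per million tokens")
print(f"Drafting on an Opus-class model:       ${drafting:.2f} per million tokens")
```

When outputs are short, the cheap input rate keeps the blended price near the floor; when outputs are long, the output rate pulls the blended price toward the ceiling regardless of how cheap the input side looks.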
The Three Workflow Profiles That Drive Model Choice
Pricing comparisons in the abstract are not useful for nonprofit decision-makers. What matters is matching the model to the work, and most nonprofit AI work falls into one of three workflow profiles. Each profile has a different sensitivity to input rates, output rates, and quality, and each profile lands on a different sensible default tier.
Profile 1: Quality-Critical Drafting
Long-form work where output quality directly affects external readers
This is grant proposal drafting, major donor correspondence, board memos, annual report sections, and legal-adjacent communication. The volume is low, the stakes are high, and a clumsy draft costs more in revision time than any token savings can recover. Output tokens dominate the bill because the response is long. The right tier here is the frontier or near-frontier model, even though it is the most expensive. Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro are the practical default range for this work. Claude Opus 4.7 and the Pro variants come into play when the document is going to a major funder or a regulator.
- Typical monthly volume: tens of thousands of output tokens, not millions
- Cost sensitivity: low (premium tier is affordable at this volume)
- Quality sensitivity: very high (one bad letter to a donor costs more than the entire AI bill)
Profile 2: Operational Analysis and Summarization
Internal work where moderate quality is enough and volume is moderate
This is meeting note summarization, internal email triage, document review for a specific question, donor record cleanup, and program data summarization for internal dashboards. The audience is internal, the stakes are moderate, and the same workflow runs many times per month. Both input and output rates matter, but the work is not so demanding that the frontier model is necessary. The right starting point is the mid-tier: Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro are all reasonable defaults. But Claude Haiku 4.5 and Gemini 2.5 Flash often handle this work at a third of the cost, so test the cheaper tier before settling on the more expensive one.
- Typical monthly volume: hundreds of thousands to low millions of tokens
- Cost sensitivity: moderate (tier choice changes the bill meaningfully)
- Quality sensitivity: moderate (a slightly weaker summary is acceptable; a wrong summary is not)
Profile 3: High-Volume Classification and Extraction
Repetitive structured work where token volume dominates the bill
This is tagging donor records by interest area, classifying intake forms, extracting structured data from PDFs at scale, identifying duplicate constituent records, summarizing tens of thousands of survey responses, and triaging support tickets. The work is structured, the output is short, the volume is high, and quality requirements are met by a budget-tier model in almost every case. This is where Gemini 2.5 Flash, Claude Haiku 4.5, and open-weight models like Llama 4 Maverick, DeepSeek V3.2, and Qwen 3.5 earn their keep. The frontier tier is wasted here, and using it can multiply the bill by twenty or more for no quality benefit.
- Typical monthly volume: tens to hundreds of millions of tokens
- Cost sensitivity: very high (tier choice is the entire budget)
- Quality sensitivity: low to moderate (sample and validate, then commit to the cheap tier)
A Decision Framework: From Workflow to Tier
The fastest way to get model selection right is to walk every active AI workflow through the same four questions. The answers point you to a tier without requiring a deep technical evaluation. Do this once for each workflow you run, document the decision, and revisit when prices or model capabilities shift.
Question 1: Who Reads the Output?
External readers (donors, funders, regulators, the public) raise quality requirements and push toward frontier or near-frontier tiers. Internal readers (staff, board, internal dashboards) tolerate mid-tier output without complaint. Machines reading the output (downstream systems, classification pipelines) often tolerate budget-tier output, provided the structure is correct. The audience is the single biggest determinant of acceptable model tier.
Question 2: How Long Is the Output?
Short outputs (tags, classifications, two-sentence summaries) make the output rate almost irrelevant. The input rate dominates, and a provider with a cheap input rate wins even if its output rate is mediocre. Long outputs (drafts, full responses, generated documents) make the output rate dominant. Premium providers with expensive output rates become very expensive at this kind of work, and providers with cheap output rates become disproportionately attractive.
Question 3: How Often Does the Workflow Run?
Low-frequency work (a handful of grant proposals per quarter) is dominated by quality, not cost. Even the most expensive tier produces a trivial bill at this volume. High-frequency work (tagging every intake form, summarizing every meeting, classifying every donor interaction) is dominated by cost. The same tier that is rational at low volume becomes irrational at high volume. Always estimate annual volume before committing to a tier.
Question 4: What Is the Cost of a Bad Output?
A miswritten donor letter that goes out without review costs the relationship. A miscategorized intake form that gets routed wrong costs the constituent. A mislabeled donor record costs a query later. Each workflow has a different worst-case cost when the model gets it wrong, and that cost should set a floor on the tier. If the cost of failure is small and recoverable, a budget tier is fine. If the cost of failure is large or unrecoverable, the marginal cost of moving up a tier is almost always justified.
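For teams that want the framework to be repeatable rather than ad hoc, the four questions can be encoded as a simple rule of thumb. This is a sketch: the tier labels and thresholds are illustrative defaults to adjust, not a substitute for testing a sample of real work on each tier.

```python
def suggest_tier(audience: str, output_length: str, monthly_runs: int, failure_cost: str) -> str:
    """Rough tier suggestion from the four questions above. Thresholds are illustrative."""
    # Question 4: a high cost of failure sets a floor regardless of volume.
    if failure_cost == "high":
        return "frontier"
    # Question 1: external readers push toward frontier or near-frontier quality.
    if audience == "external":
        return "frontier" if output_length == "long" else "mid"
    # Questions 2 and 3: internal, high-volume, short-output work belongs on the budget tier.
    if monthly_runs > 1000 or output_length == "short":
        return "budget"
    return "mid"

print(suggest_tier("internal", "short", 50_000, "low"))  # budget
print(suggest_tier("external", "long", 12, "high"))      # frontier
```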
Cost-Reduction Levers Beyond Tier Choice
Model tier is the biggest lever, but it is not the only one. Every major provider now offers cost-reduction options that can cut the bill by half or more without changing tiers, and most nonprofits are not using them. Three are worth knowing.
Prompt Caching
Most providers now offer prompt caching: the first time you send a long system prompt or context document, you pay the full input rate. Subsequent calls within a short window pay a fraction (often 10 percent of the input rate) for the cached portion. For workflows that send the same context repeatedly (an assistant grounded in your style guide, a chatbot that pulls from the same knowledge base, a triage workflow that uses the same classification rubric), caching can reduce input costs by 80 to 90 percent. This is largely free money that requires almost no engineering, and almost every long-running nonprofit workflow benefits from it.
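A rough sketch of the caching arithmetic, assuming the cached portion bills at 10 percent of the input rate (the figure cited above; exact discounts and cache windows vary by provider):

```python
def input_cost_with_cache(calls: int, cached_tokens: int, fresh_tokens: int,
                          input_rate: float, cache_discount: float = 0.10) -> float:
    """Monthly input cost in dollars when a shared context is cached across calls.

    input_rate is dollars per million tokens; cache_discount is the fraction of the
    input rate charged for cached tokens (assumed to be 10% here; check your provider).
    """
    first_call = (cached_tokens + fresh_tokens) * input_rate / 1_000_000
    later_calls = (calls - 1) * (cached_tokens * cache_discount + fresh_tokens) * input_rate / 1_000_000
    return first_call + later_calls

# A triage assistant sending a 6,000-token rubric plus ~500 fresh tokens, 2,000 times a month.
with_cache = input_cost_with_cache(2_000, 6_000, 500, input_rate=3.00)
without = 2_000 * 6_500 * 3.00 / 1_000_000
print(f"without caching: ${without:.2f}, with caching: ${with_cache:.2f}")
```

In this illustrative case the input side of the bill drops by roughly 80 percent, purely because the rubric stops being re-billed at full price on every call.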
Batch Processing
For non-interactive work where a multi-hour turnaround is acceptable, batch processing typically halves all token costs. Anthropic, OpenAI, and Google all offer this. It is the right answer for any work that is not real-time: overnight summarization of the day's donor interactions, weekly classification of new constituent records, monthly extraction from incoming PDFs. If staff do not need to see results within seconds, batch the work and pay half.
Tier Routing
A single workflow does not need a single model. A classification step can run on Haiku or Flash, with the rare hard case escalated to Sonnet or GPT-5.4 only when the budget-tier model signals low confidence. This pattern, sometimes called LLM routing or cascading, often captures 80 to 95 percent of the cost savings of going budget-tier across the board while keeping quality close to the frontier-tier baseline. It requires a little more engineering than a single-tier setup but is well-supported by the major frameworks and well worth the time at any meaningful scale.
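A minimal sketch of the cascade, with the model calls left as stand-in functions since the exact client code depends on your provider. The confidence threshold is a placeholder to tune against a labeled sample of your own records.

```python
from typing import Callable, Tuple

def route_classification(text: str,
                         cheap_model: Callable[[str], Tuple[str, float]],
                         frontier_model: Callable[[str], Tuple[str, float]],
                         confidence_floor: float = 0.80) -> str:
    """Run the budget-tier model first; escalate only when it reports low confidence.

    cheap_model and frontier_model are stand-ins for whatever client calls you use;
    each is assumed to return (label, confidence).
    """
    label, confidence = cheap_model(text)
    if confidence >= confidence_floor:
        return label                  # the common case: keep the budget-tier answer
    escalated_label, _ = frontier_model(text)
    return escalated_label            # the rare hard case pays frontier rates
```

The economics work because escalation is rare: if even 10 percent of cases go to the frontier model, 90 percent of the volume still bills at the budget rate.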
The Special Case of Open-Weight Models
Open-weight models occupy a strange place in nonprofit AI procurement. They are dramatically cheaper than the frontier providers, often by an order of magnitude or more. They are competitive on quality for many workflows, especially classification, extraction, and basic summarization. And they come with a different operational profile that some nonprofits will find liberating and others will find unworkable.
The simplest way to use open-weight models is through hosted providers. Together AI, Fireworks, Groq, DeepInfra, and others run Llama, DeepSeek, Qwen, and Mistral on their infrastructure and bill you per token. The pricing is at the bottom of the market, the API is similar to OpenAI's, and the operational burden on your nonprofit is no greater than using any other API. For high-volume classification and extraction work, this path is the single biggest cost reduction available to a nonprofit AI program in 2026.
The other path is running open-weight models locally, on hardware your nonprofit owns. This is becoming feasible for small models on modern laptops and for medium models on a single server-class GPU. The token cost is zero once the hardware is paid for, and the data never leaves your premises. For nonprofits with sensitive constituent data, this can be the right answer. The trade-off is operational: someone has to maintain the hardware, update the models, and troubleshoot when something breaks. Most small nonprofits will not have that capacity, but the threshold is falling, and the appeal will keep growing for organizations handling regulated data.
A practical hybrid is emerging. Use hosted open-weight models for high-volume routine work where cost is the driver. Use a frontier provider for quality-critical drafting where capability is the driver. Use a local open-weight model for the small set of workflows that touch sensitive data and where keeping the data on-premises is non-negotiable. Most nonprofits will not need all three, but the option to mix and match is now real, and the cost difference between getting this right and getting it wrong is substantial.
For a deeper look at the strategic implications of metered AI pricing, see our framework for treating AI as a metered utility, and for the broader inference-cost picture, see the inference cost crisis explained.
Common Mistakes and How to Avoid Them
Most nonprofits overspend on AI for one of a handful of predictable reasons. Each of these has a clear remedy, and addressing them is usually faster than renegotiating any contract.
Mistake 1: Defaulting to the Top Tier Everywhere
Many AI tools default to the most capable model and many nonprofits never change it. This is the single biggest source of overspend. Audit the default model in every tool your staff uses. For high-volume workflows running on Opus 4.7 or GPT-5.5, drop the tier unless there is a clear quality reason not to. Most internal work is fine on a mid or budget tier, and the bill drops by half or more.
Mistake 2: Treating Per-Seat Pricing as a Ceiling
The chat subscription products (Claude Pro, ChatGPT Plus, Gemini Advanced) have generous usage limits but they are not unlimited. When staff hit a limit, they often quietly switch to the API or to another product, and the costs land somewhere unexpected. Track the actual usage pattern of any heavy user. If they are pushing against the limit consistently, an API-based workflow with proper tier routing is almost always cheaper than the next subscription tier up.
Mistake 3: Ignoring Caching and Batch
Prompt caching and batch processing are the two cost levers nonprofits leave on the table most often. Both are well-documented, well-supported, and require no change in model choice. Any workflow that runs more than a few times per day with similar context should use caching. Any workflow where staff do not need a response in real-time should consider batch. Together, these two levers commonly cut the bill by 50 to 70 percent on the workflows where they apply.
Mistake 4: Buying Capacity Without Measuring Workload
Many nonprofits sign up for an AI plan based on a sales conversation rather than a measurement of their own workload. Before committing to a tier, run two weeks of typical work through the API with logging on. The actual token counts almost always surprise people, and the right tier becomes obvious. Without this baseline, every cost discussion is speculation. With it, the choice writes itself.
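One way to turn that two-week log into a baseline, assuming each call is written out as a JSON line with workflow, input_tokens, and output_tokens fields (most provider responses report token counts; the field names here are assumptions for the sketch):

```python
import json
from collections import defaultdict

def project_monthly_tokens(log_path: str, days_logged: int = 14) -> dict:
    """Aggregate a JSONL call log and extrapolate to a 30-day month, per workflow."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    with open(log_path) as f:
        for line in f:
            call = json.loads(line)
            totals[call["workflow"]]["input"] += call["input_tokens"]
            totals[call["workflow"]]["output"] += call["output_tokens"]
    scale = 30 / days_logged
    return {wf: {k: int(v * scale) for k, v in t.items()} for wf, t in totals.items()}

# print(project_monthly_tokens("ai_calls.jsonl"))
```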
A Worked Example: A Mid-Size Nonprofit's Monthly Bill
Consider a hypothetical mid-size human services nonprofit running three AI workflows: weekly grant proposal drafting (low volume, high quality), daily intake form triage (medium volume, medium quality), and monthly constituent record classification (high volume, low quality requirements). Total monthly token usage might land somewhere in the range of 30 million tokens across all three workflows.
If the organization runs everything on a frontier tier like Opus 4.7, assume a 70/30 split between input and output tokens. Thirty million tokens at $5 input and $25 output averages out to roughly $11 per million tokens blended (0.7 × $5 + 0.3 × $25), or about $330 per month. That is before any growth in usage.
Now imagine the same organization routes intelligently. Grant proposals (low volume, maybe 500,000 tokens per month) stay on Opus 4.7 because quality matters and the dollar amount is small. Intake triage (10 million tokens per month) moves to Sonnet 4.6, which is about half the cost. Record classification (20 million tokens per month) moves to Haiku 4.5 or Gemini 2.5 Flash. After adding prompt caching to the workflows that benefit and batching the monthly classification job, the total bill comes in closer to $40 to $60 per month for the same amount of work.
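For readers who want to rerun this with their own volumes, here is a sketch of both scenarios using the rates from the table above. The input/output splits, the 80 percent caching reduction, and the 50 percent batch discount are assumptions to replace with your own measurements.

```python
RATES = {  # dollars per million tokens, from the table above
    "opus":   {"input": 5.00, "output": 25.00},
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku":  {"input": 1.00, "output": 5.00},
    "flash":  {"input": 0.30, "output": 2.50},
}

def workflow_cost(millions_of_tokens: float, model: str, input_share: float = 0.70,
                  cache_multiplier: float = 1.0, batch_multiplier: float = 1.0) -> float:
    """Monthly cost in dollars. cache_multiplier scales the input side (0.2 approximates
    an 80% caching reduction); batch_multiplier scales the whole job (0.5 for batch)."""
    r = RATES[model]
    input_cost = millions_of_tokens * input_share * r["input"] * cache_multiplier
    output_cost = millions_of_tokens * (1 - input_share) * r["output"]
    return (input_cost + output_cost) * batch_multiplier

everything_on_opus = workflow_cost(30, "opus")
routed = (workflow_cost(0.5, "opus")                                             # grant drafting
          + workflow_cost(10, "sonnet", input_share=0.80, cache_multiplier=0.2)  # cached triage
          + workflow_cost(20, "flash", input_share=0.95, batch_multiplier=0.5))  # batched classification
print(f"everything on the frontier tier: ${everything_on_opus:.0f} per month")
print(f"routed, cached, and batched:     ${routed:.0f} per month")
```

Swapping the model names or the multipliers shows immediately how sensitive the total is to each assumption, which is exactly the conversation a finance team should be having before committing to a tier.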
These numbers are illustrative, not promises, and your mileage will vary with prompt length and actual workflow shape. But the order of magnitude is right. Tier matching plus caching plus batching reliably produces a five-to-ten-times cost reduction over running everything on the top tier, with no detectable quality loss in the routine workflows. The savings recur every month, compound across years, and accumulate to numbers that matter even at small nonprofit scale.
For more on building a sustainable AI cost model, see our guide to AI budget management for nonprofits.
Where Tier Pricing Is Heading
The pricing landscape in May 2026 is the most volatile it has been since these markets opened. Three forces are pushing in different directions, and the spread of plausible 2027 pricing is wide enough that any commitment past a year should assume material change.
The first force is downward pressure from open-weight competition. Chinese labs (DeepSeek, Qwen, MiMo) and the hosted-open-weight ecosystem keep compressing the low end of the market. Every quarter sees a new model land at a price point that would have looked impossible six months earlier. This pressure ripples upward: Gemini Flash exists in part because Google needs a tier that can compete with sub-dollar input rates, and Anthropic and OpenAI are unlikely to leave Haiku and the smaller GPT tiers static for long.
The second force is upward pressure on the frontier. Each new flagship model is more expensive than the one before it. GPT-5.5's pricing increase over GPT-5.4 is the clearest recent example, but Opus 4.7 also priced higher than its predecessors on a true-cost basis after the tokenizer change. The frontier is moving away from the budget tiers, not toward them, and nonprofits relying on the top tier should plan for that trend to continue.
The third force is price unbundling. New pricing models (cached input rates, batch rates, context-length tiers, reasoning-mode surcharges) are proliferating, and the simple "input rate, output rate" comparison hides more than it reveals. Effective price comparisons in late 2026 increasingly require looking at the specific usage pattern, not just the headline numbers. Procurement teams that rely on the headline rate will increasingly be surprised at the end of the month.
The net effect for nonprofit decision-makers is that any tier-selection decision should be revisited at least annually, ideally quarterly for high-volume workflows. The right tier today may not be the right tier in six months. Building a workflow that can swap models with a small configuration change, rather than baking the choice into the architecture, is one of the few hedges available against pricing volatility this large.
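The simplest version of that hedge is to keep the workflow-to-model mapping in a configuration file rather than in code, so a repricing or a model swap is a one-line edit. A minimal sketch; the file name, workflow names, and model identifiers are all placeholders.

```python
import json

# routing.json maps workflow names to model identifiers, for example:
# {"grant_drafting": "frontier-model-id", "intake_triage": "budget-model-id"}
# Edit the file, not the code, when a provider reprices or a better tier appears.

def model_for(workflow: str, config_path: str = "routing.json") -> str:
    with open(config_path) as f:
        routing = json.load(f)
    return routing[workflow]

# Call sites ask for the workflow, never a hard-coded model name:
# response = client.generate(model=model_for("intake_triage"), prompt=prompt)
```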
Conclusion
Per-million-token pricing has turned AI from a fixed cost into a variable cost, and the spread between model tiers is wide enough that the same workflow can cost a hundred times more on the wrong tier than on the right one. Nonprofit finance leaders who treat all AI as a single line item are missing the chance to cut their AI bill in half or more without losing any of the value the AI is producing.
The pattern that works is unglamorous but reliable. Catalogue your active workflows. Classify each by audience, output length, frequency, and cost-of-failure. Match each workflow to the cheapest tier that meets its quality requirements. Add prompt caching and batch processing wherever they apply. Revisit the choices quarterly. None of these steps require deep technical expertise, and together they consistently produce the kind of savings that matter in a nonprofit budget.
The organizations that get this right will spend less on AI than the organizations that get it wrong, and they will get the same or better results. The organizations that get it wrong will keep paying premium prices for routine work and will wonder why their AI bill keeps growing faster than their AI value. The difference is not capability or budget. It is whether someone made the model-to-workflow match deliberately, or let the defaults decide.
For related guidance, see why your AI bill doubled in 2026, the tokenmaxxing trap, and our AI-native vs. AI-tools procurement framework.
Get Your AI Spending Under Control
We help nonprofits audit their AI workflows, match models to tasks, and capture the savings that tier-matching and caching can deliver. Most engagements pay for themselves within a quarter.
