    Technology & Innovation

    Open Data for Nonprofits: How Shared Datasets Could Train Better AI Models for the Sector

    The nonprofit sector sits on a vast, largely untapped resource: decades of program data, outcome measurements, and operational insights collected by thousands of organizations. When pooled and shared responsibly, this data could fuel AI models far more useful to the sector than anything trained on general internet content. Understanding how to participate in that future starts with grasping why shared data matters in the first place.

    Published: April 18, 2026 · 14 min read

    There is a quiet paradox at the heart of nonprofit AI adoption. Organizations across the sector have invested in AI tools at a remarkable pace, yet the vast majority report that those tools have not delivered transformative improvements to their work. According to the 2026 Nonprofit AI Adoption Report from Virtuous and Fundraising.AI, while 92% of nonprofits are now using AI in some capacity, only 7% report major improvements in organizational capability. That gap is not primarily a technology problem. It is a data problem.

    AI models are only as useful as the data on which they are trained. General-purpose AI tools are trained mostly on internet text, academic papers, and commercial datasets. They know a great deal about the world in general, but relatively little about the specific challenges of running a food bank, coordinating disaster relief volunteers, or measuring the long-term outcomes of a job training program. When nonprofits use these tools, they often find themselves doing significant prompting work to push the AI toward sector-relevant responses, or accepting outputs that don't quite fit their actual context.

    The solution that a growing number of researchers, funders, and technologists are advocating involves building shared, sector-specific datasets: open data commons where nonprofits can contribute anonymized program data, outcome measurements, and operational records, and in return benefit from AI models trained to understand their world. This article explores what that vision looks like in practice, what stands in the way, and how your organization can begin moving in that direction today.

    This topic connects directly to the broader question of why nonprofit AI tools don't talk to each other, which is rooted in the same fragmented data infrastructure. Open data initiatives offer one of the most promising paths to closing both gaps simultaneously.

    The Current State of Nonprofit Open Data

    Nonprofit data is not entirely locked away. For decades, the Form 990 has served as the primary public window into the finances and governance of tax-exempt organizations. The IRS has released data from over 1.8 million nonprofit tax filings since 2013, and organizations like ProPublica's Nonprofit Explorer make that data searchable and accessible. The Nonprofit Open Data Collective has built tools specifically to help researchers clean and process 990 data, and GivingTuesday's 990 Data Infrastructure Project has brought together major data providers to reduce the friction of accessing this information.

    But financial data, while valuable for transparency and research, tells only part of the story. A Form 990 can reveal how much an organization spent on programs and how many people it served. It cannot reveal which program models worked best, what client characteristics predicted strong outcomes, how operational choices affected service quality, or what approaches failed quietly and were never replicated. That deeper programmatic knowledge remains fragmented across thousands of spreadsheets, case management systems, and the institutional memory of experienced staff.

    The Impact Genome Registry represents one ambitious attempt to go deeper. With standardized data on more than 2.2 million global nonprofits and social programs, it uses Universal Impact Standards to measure outcomes, program design, beneficiary characteristics, and evidence quality. The registry provides third-party verification against 132 common social outcomes across education, public health, economic development, and other domains. Nonprofits can contribute their program data and receive a Verified Impact Report that can be used across grant applications and donor communications.

    These initiatives represent meaningful progress, but they remain islands. The data they contain is not yet connected in ways that allow AI models to learn from sector-wide patterns, identify what works across different contexts, or help individual organizations benchmark their programs against the full range of comparable efforts.

    What Open Data Already Exists

    • IRS Form 990 filings for 1.8M+ organizations since 2013
    • Impact Genome Registry with 2.2M+ standardized program records
    • GivingTuesday's 990 Data Infrastructure Project
    • ProPublica Nonprofit Explorer with PDF and digital 990s

    What's Still Missing

    • Programmatic outcome data at the client and cohort level
    • Cross-organizational benchmarks for specific program models
    • Operational data on staffing, capacity, and service quality
    • Connected datasets that allow sector-wide AI learning

    Why Shared Data Produces Better AI Models

    Stanford Social Innovation Review captures the core logic succinctly: "Data is the fuel that drives AI, and the machine is only as effective as the quality of the data that is fed into it." The reliability and generalizability of any AI model depends on having training data that is large enough to identify patterns, diverse enough to avoid systematic gaps, and representative enough to produce accurate predictions across the full range of contexts where the model will be used.

    Nonprofit organizations individually have data that is valuable but narrow. A single food bank might have years of detailed records about client visits, household composition, dietary needs, and referral patterns. That data can support AI tools that improve the food bank's own operations. But it cannot reveal how its approach compares to food banks in different geographic and demographic contexts, which client characteristics predict long-term food security versus recurring need, or how its outcomes would change under a different service model. Those questions require data from many organizations.

    The diversity argument for shared data is equally important. AI models trained on narrow datasets learn from the specific characteristics of that data. A hiring algorithm trained primarily on resumes from a particular region or demographic group will perform poorly and potentially discriminate when applied to different populations. The same problem applies to service delivery AI: a case management tool trained on data from well-resourced urban nonprofits may give systematically inappropriate recommendations when used by rural or underfunded organizations. Broader, more diverse training data directly reduces these risks.

    The FloodAction coalition coordinated by The Conduit in the UK illustrates this dynamic at the operational level. Dozens of organizations working on wetland restoration contribute their data to AI models that support collective decisions about land use, funding allocation, and intervention timing. No individual organization could train models with sufficient data to optimize these decisions at scale. The pooled dataset enables coordination and insight that is simply not available to organizations working alone.

    How Pooled Data Transforms AI Capabilities

    Specific improvements that shared nonprofit datasets enable

    With Individual Organization Data

    • Optimize internal operations and workflows
    • Predict individual client needs based on historical patterns
    • Automate routine administrative tasks

    With Sector-Wide Shared Data

    • Benchmark programs against similar efforts nationally
    • Identify which program models produce best outcomes for specific populations
    • Train AI models that understand nonprofit-specific context
    • Enable coordinated responses to community-level challenges

    Data Collaboratives: Models for Sharing

    The concept of a data collaborative refers to a structured arrangement where organizations from different sectors exchange data to create shared public value. Unlike simple data donations, data collaboratives involve negotiated terms about what data is shared, under what conditions, for what purposes, and with what safeguards. They represent a significant evolution beyond both the fully closed model (where organizations jealously guard all their data) and the fully open model (where data is published without protection or governance).

    Several distinct models have emerged. Public interfaces involve organizations publishing selected data as APIs or data platforms that others can query without receiving the underlying dataset. This allows a nonprofit to make its aggregate outcome data available for research while maintaining control over individual client records. Trusted intermediary arrangements bring in a third party, such as a university research center or a specialized data trust, to broker access between data holders and researchers under fixed terms and time limits. Data pools are perhaps the most powerful model: multiple organizations contribute to a shared resource that all participants can draw from, with governance rules determining who can access what and for what purposes.

    The Platform Cooperativism Consortium has proposed that nonprofits and social service agencies could organize their data systems as cooperatives, banding together to share both technology infrastructure and data while using collective governance to develop shared metrics and standards. This model has particular appeal for organizations that lack the individual resources to build sophisticated data systems but could benefit significantly from shared infrastructure. Organizations interested in shared AI infrastructure will find that data cooperatives and infrastructure cooperatives often reinforce each other.

    Public Interfaces

    Organizations publish aggregate data as APIs or platforms. Researchers can query without accessing raw records. Good for transparency with privacy protection.

    Trusted Intermediaries

    A neutral third party brokers data access under fixed terms and time limits. Allows more sensitive data to be shared with appropriate researchers.

    Data Pools

    Multiple organizations contribute to a shared resource governed by collective rules. Most powerful for training sector-specific AI models.

    Challenges and Barriers to Open Data Sharing

    The case for open data in the nonprofit sector is compelling, but the barriers are real. Understanding them clearly is essential for any organization considering participation in shared data initiatives, because failure to address these concerns can undermine trust, expose organizations to legal liability, and ultimately produce datasets that are less useful than their proponents hoped.

    Privacy and Regulatory Concerns

    Nonprofit data frequently involves vulnerable populations. As of 2026, more than 20 states have enacted comprehensive consumer data privacy laws, joining California's CCPA/CPRA framework. Organizations handling payment data have been required to comply with PCI DSS 4.0 since March 2025. These regulations create specific obligations around consent, data minimization, and access controls that complicate sharing arrangements.

    • Client consent requirements for data sharing
    • Re-identification risks from seemingly anonymous data
    • State-specific requirements that vary by jurisdiction
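The re-identification risk in the second bullet can be estimated before any data leaves the building. A minimal sketch of a k-anonymity check, where the field names and thresholds are illustrative assumptions rather than a compliance standard: group records by their quasi-identifiers and flag any combination shared by fewer than k people.

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Return the quasi-identifier value combinations shared by fewer
    than k records -- each one is a potential re-identification risk."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return {combo: count for combo, count in groups.items() if count < k}

# Illustrative records: zip code, age band, and service type can be
# enough to single someone out even with names stripped.
records = [
    {"zip": "60601", "age_band": "25-34", "service": "housing"},
    {"zip": "60601", "age_band": "25-34", "service": "housing"},
    {"zip": "60622", "age_band": "65+", "service": "meals"},
]

risky = k_anonymity_violations(records, ["zip", "age_band", "service"], k=2)
# The lone 60622 record is unique, so it falls below the k=2 threshold.
```

Real anonymization work goes further (generalizing values, adding noise), but even this simple screen catches the most obvious leaks before a dataset is shared.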

    Data Quality and Infrastructure Gaps

    Many nonprofits face significant internal data challenges before they can contribute meaningfully to shared initiatives. Organizations commonly report spending the majority of their data management time cleaning and reconciling records rather than analyzing them. Outdated systems, inconsistent data entry practices, and missing fields create datasets that are difficult to combine with others even when the will to share is present.

    • Inconsistent data definitions across organizations
    • Legacy systems that can't export in standard formats
    • Missing baseline data that makes outcomes unmeasurable

    Governance and Trust Challenges

    Organizations that have built reputations on careful stewardship of client information are understandably cautious about sharing arrangements where control is diffused. Questions about who has ultimate authority over a shared dataset, what happens if a participating organization has a data breach, and how to handle a participant that uses shared data for unapproved purposes do not have obvious answers.

    • Unclear accountability when shared data is misused
    • Competitive concerns about revealing program weaknesses
    • Difficulty aligning governance across diverse organizations

    The Size Inequality Problem

    Large nonprofits, which tend to have better data systems and more staff capacity, are positioned to contribute more to and benefit more from shared data initiatives. Bridgespan research notes that nonprofits with annual revenues over $1 million are adopting AI at nearly twice the rate of smaller organizations. Since more than half of all nonprofits bring in less than $1 million annually, a substantial portion of the sector may end up as consumers rather than contributors in data sharing arrangements.

    • Smaller organizations lack capacity to standardize data
    • Risk of shared datasets underrepresenting small-org contexts
    • AI models trained on large-org data may not fit smaller contexts

    The Standards Problem: Why Shared Data Requires Common Language

    Even when organizations agree to share data, the sharing often runs into a fundamental obstacle: data collected under different definitions is very difficult to combine meaningfully. One organization might define "youth served" as anyone under 25; another caps it at 18. A third counts only those who completed a program, while a fourth includes anyone who attended a single session. Training an AI model on the combined data produces a model that has learned to navigate definitional inconsistency rather than to identify genuine patterns.
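The fix is mechanical but must happen before pooling: map each contributor's records onto one explicit, shared definition. A minimal sketch, where the field names and the under-25 cutoff are illustrative assumptions rather than a sector standard:

```python
def count_youth_served(records, max_age=24, require_completion=False):
    """Count clients under one explicit, shared definition instead of
    each organization's local one."""
    return sum(
        1 for r in records
        if r["age"] <= max_age and (r["completed"] or not require_completion)
    )

# Two organizations, different local definitions, same client-level data:
org_a = [{"age": 17, "completed": True}, {"age": 23, "completed": False}]
org_b = [{"age": 19, "completed": True}, {"age": 28, "completed": True}]

# Pooled under one shared definition (under 25, attendance counts):
pooled = count_youth_served(org_a + org_b)                            # 3
# Same pool under a stricter shared definition (completers only):
strict = count_youth_served(org_a + org_b, require_completion=True)   # 2
```

The point is that the definition lives in one named, parameterized place. When organizations share client-level fields like age and completion status, any downstream metric can be recomputed consistently; when they share only pre-aggregated counts, the definitional inconsistency is baked in permanently.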

    This is the core argument made in the companion article on the case for nonprofit data standards. The healthcare sector addressed this problem decades ago through standardized terminologies like SNOMED CT and ICD codes. Financial services have required standard accounting formats. The nonprofit sector has largely avoided the necessary but difficult work of agreeing on what terms mean and how data should be structured, partly because there is no regulatory body with the authority to mandate standards and partly because the sector's diversity makes standard-setting genuinely complex.

    Several initiatives are making progress. The Impact Genome Registry's Universal Impact Standards represent an ambitious attempt to create common outcome definitions across 132 social impact categories. IRIS+, managed by the Global Impact Investing Network, provides standardized metrics for impact measurement that have gained traction particularly among organizations seeking impact investment. The UN Sustainable Development Goals provide a global framework that some organizations use as a common reference point for their outcome reporting.

    For AI training purposes, the most important standards are those that address the features an AI model will be asked to learn from: population characteristics, service types, intervention dosages, and outcome measurements. Organizations that adopt even basic common definitions in these areas create data that is immediately more useful for sector-wide analysis, regardless of whether a formal shared data initiative exists.

    Emerging Standards Frameworks

    Resources for aligning nonprofit data with shared definitions

    • Impact Genome Registry Universal Impact Standards: 132 verified social outcomes across all major program areas, with third-party validation and peer-reviewed evidence quality standards
    • IRIS+ (Global Impact Investing Network): Standardized impact metrics used widely in impact investment contexts, increasingly adopted by nonprofits seeking funder alignment
    • UN Sustainable Development Goals: Global framework providing a common vocabulary for social impact categories, useful for cross-sector and cross-national comparisons
    • NTEE Codes (National Taxonomy of Exempt Entities): Classification system for nonprofit types maintained by Candid, used in 990 data and providing a common vocabulary for organizational categories

    Building a Data Governance Foundation

    Before an organization can responsibly participate in any shared data initiative, it needs a solid internal governance foundation. Many nonprofits currently operate without formal data governance policies, meaning that decisions about what data to collect, how long to retain it, who can access it, and what it can be used for are made informally or not at all. This creates both practical risks (when a staff member leaves with institutional knowledge about data practices) and legal risks (when a privacy incident occurs and there is no documented policy to point to).

    A basic data governance framework should address several core questions. First, what data does the organization collect, and why? Many organizations discover during this exercise that they collect significant data with no clear analytical purpose, which represents both a privacy risk and a storage burden. Second, how long is data retained? Different categories of data have different optimal retention periods, and indefinite retention of client records is both unnecessary and legally problematic in many jurisdictions. Third, who has access to what data? Establishing clear role-based access controls reduces both internal misuse risks and the blast radius of an external breach.
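The third question, role-based access, can be expressed as data rather than informal practice, which makes it auditable. A minimal sketch; the roles and data categories here are illustrative assumptions, not a recommended taxonomy:

```python
# Role-based access policy: each role maps to the data categories it may view.
ACCESS_POLICY = {
    "case_manager": {"client_records", "program_outcomes"},
    "development":  {"donor_records", "program_outcomes"},
    "executive":    {"client_records", "donor_records",
                     "program_outcomes", "financials"},
}

def can_access(role, category):
    """True if the role's documented policy grants access to the category.
    Unknown roles get no access by default."""
    return category in ACCESS_POLICY.get(role, set())

can_access("executive", "financials")        # True
can_access("case_manager", "donor_records")  # False
```

Even when the actual enforcement happens inside a CRM or case management system, keeping a policy table like this in writing answers the "who can access what" question in a form both auditors and prospective data-sharing partners can read.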

    For organizations considering data sharing, governance documentation also serves as a communication tool. Potential data collaborative partners want to understand how a prospective member handles data before entering into a sharing arrangement. Having clear written policies makes due diligence easier and demonstrates the organizational maturity that collaborative governance requires. This connects to the broader work of AI governance as a risk mitigation strategy, where documented policies do double duty as both operational guides and risk management tools.

    Core Data Governance Elements

    Foundation elements every nonprofit should document before considering shared data initiatives

    • Data inventory and purpose documentation: A clear record of what data you collect, why you collect it, and what decisions it informs
    • Retention schedules: Defined periods for different data categories, with processes for secure deletion when retention periods expire
    • Access controls: Role-based definitions of who can view, edit, and export different data categories
    • Consent practices: Clear documentation of what clients and donors are told about data collection and how their preferences are recorded
    • Incident response procedures: A documented plan for what happens when data is compromised or accessed inappropriately
    • Third-party sharing rules: Explicit policies on when and under what conditions data can be shared with outside parties
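Of the elements above, retention schedules lend themselves most directly to enforcement in code. A minimal sketch of a periodic retention sweep; the categories and retention periods are illustrative assumptions, not legal guidance:

```python
from datetime import date, timedelta

# Illustrative retention periods per data category, in days.
RETENTION_DAYS = {
    "client_case_notes":  7 * 365,
    "donor_transactions": 7 * 365,
    "event_signups":      2 * 365,
}

def records_past_retention(records, today=None):
    """Return records whose category retention period has expired and
    which should be queued for secure deletion."""
    today = today or date.today()
    return [
        r for r in records
        if today - r["created"] > timedelta(days=RETENTION_DAYS[r["category"]])
    ]

records = [
    {"id": 1, "category": "event_signups",     "created": date(2020, 5, 1)},
    {"id": 2, "category": "client_case_notes", "created": date(2024, 1, 15)},
]
expired = records_past_retention(records, today=date(2026, 4, 18))
# Only the 2020 event signup exceeds its two-year retention window.
```

A sweep like this, run on a schedule, turns "defined periods with processes for secure deletion" from a policy statement into an operational habit.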

    Practical Steps to Participate in the Open Data Future

    Organizations don't need to wait for a sector-wide data commons to materialize before taking action. Several practical steps can position your organization to contribute to and benefit from shared data initiatives while improving your internal data practices in ways that have immediate value regardless of what happens at the sector level.

    1. Establish internal data governance

    Document your data inventory, retention schedules, access controls, and consent practices. This is the minimum viable foundation for any sharing arrangement and has immediate internal value regardless of whether you pursue external sharing.

    2. Adopt at least one common outcomes framework

    Whether you choose IRIS+, Impact Genome Standards, or SDG alignment, selecting an established framework and consistently using it creates data that is immediately more useful for cross-organizational analysis.

    3. Register with the Impact Genome Registry

    Registering your programs creates a standardized record that contributes to sector-wide analysis and can generate donor and funder communications materials. The registry's 2.2+ million program records make it the most comprehensive shared dataset currently available for the sector.

    4. Explore peer data-sharing arrangements

    Identify 3-5 peer organizations working in the same program area and explore whether you could share anonymized aggregate outcome data. Even informal arrangements between a small number of trusted organizations can produce useful benchmarking without requiring complex infrastructure.
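Even an informal peer arrangement can share aggregates only, suppressing any cell small enough to point at individuals. A minimal sketch; the minimum cell size of 10 is an illustrative policy choice, not a standard:

```python
from collections import defaultdict

def shareable_outcome_rates(records, group_field, outcome_field, min_cell=10):
    """Aggregate client-level records into group-level success rates,
    suppressing any group below the minimum cell size."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_field]].append(r[outcome_field])
    return {
        group: {"n": len(vals), "rate": sum(vals) / len(vals)}
        for group, vals in groups.items()
        if len(vals) >= min_cell
    }

# Twelve clients in one cohort, three in another: only the first cohort
# clears the minimum cell size and is safe to share with peers.
records = (
    [{"cohort": "2025-spring", "placed": i % 3 != 0} for i in range(12)]
    + [{"cohort": "2025-fall", "placed": True} for _ in range(3)]
)
shared = shareable_outcome_rates(records, "cohort", "placed", min_cell=10)
# "2025-fall" is suppressed; "2025-spring" reports n=12 and its placement rate.
```

Exchanging only the output of a function like this lets a handful of trusted peers benchmark placement or retention rates without any client-level data ever changing hands.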

    5. Invest in data quality before quantity

    Organizations that contribute poor-quality data to shared initiatives undermine those initiatives. A smaller, clean dataset with consistent definitions is more valuable than a larger, messy one. Prioritize data quality improvements over data volume.

    6. Advocate for funder support

    Data governance improvements, standards adoption, and participation in data collaboratives all require staff time and occasionally technology investment. Funders increasingly recognize data infrastructure as mission-critical, and organizations that can articulate a clear data sharing strategy are better positioned to secure operational support for this work.

    Toward a Sector-Wide Data Commons

    The full vision for nonprofit open data is ambitious: a set of interconnected data commons where organizations across different program areas contribute anonymized, standardized data that is accessible for research, available for training specialized AI models, and governed by the sector itself rather than by commercial technology companies. In this vision, a nonprofit providing reentry services could access AI tools trained on the combined experience of hundreds of similar organizations. A rural food bank could benchmark its outcomes against urban counterparts and identify program modifications that have worked in comparable contexts elsewhere.

    Getting there requires sustained investment in the unglamorous work of standards development, data infrastructure, and governance capacity. It requires funders willing to support operational infrastructure rather than only programs. It requires technology providers willing to make their systems interoperable in ways that benefit the sector rather than maximizing switching costs. And it requires nonprofit organizations to recognize that their data, carefully stewarded and responsibly shared, is itself a form of contribution to the sector's collective knowledge base.

    The Conduit's observation that the nonprofit sector is facing not an AI adoption challenge but a systems transition challenge rings true here. Shared data infrastructure is not a technical project that technology teams can complete in isolation. It is an organizational and sector-level transformation that requires changes in how nonprofits think about their data, how they fund operations, and how they collaborate with each other. Organizations that invest in this transformation now, even at a small scale, will be better positioned to benefit as the infrastructure matures.

    For practical guidance on how to connect your existing systems as a first step, the article on connecting your CRM, grant system, and AI tools provides a useful starting point. Data sharing at scale ultimately depends on the same connective tissue that makes individual systems work together.

    Conclusion

    The gap between AI adoption and AI impact in the nonprofit sector is real, and shared data is one of the most important levers available to close it. General-purpose AI models will never fully understand the specific contexts, constraints, and goals of nonprofit work. Only models trained on sector-specific data can deliver the kind of nuanced, contextually appropriate assistance that nonprofits genuinely need. Building that training data is a collective project that no single organization can accomplish alone.

    The good news is that meaningful participation in this project does not require waiting for sector-wide infrastructure to materialize. Organizations that build solid internal data governance, adopt common outcomes frameworks, register with shared platforms like the Impact Genome Registry, and explore peer data-sharing arrangements are both contributing to the sector's collective knowledge and building the internal capacity that will allow them to benefit as more sophisticated shared data initiatives emerge.

    The fundamental insight is straightforward: the quality of AI tools available to nonprofits is, in part, a function of the quality and quantity of nonprofit data those tools can learn from. Every organization that improves its data practices and participates in responsible sharing arrangements is contributing to a sector-wide infrastructure that will make AI more useful for everyone working toward social good.

    Ready to Build Your Data Foundation?

    We help nonprofits assess their data governance practices, identify opportunities for shared data participation, and develop strategies for leveraging AI more effectively. Connect with us to discuss your organization's data challenges.