Data Privacy & Security

FedKR and Federated Synthetic Data: A Plain-English Guide for Nonprofit Tech Teams

Your organization's dataset is small, sensitive, and impossible to share, and that same story is true for every partner you might collaborate with. A newer family of techniques lets organizations swap realistic but artificial stand-in data instead of model updates or raw records, so everyone's model gets smarter while no real client ever leaves the building. Here is what federated synthetic data and FedKR actually do, where the privacy guarantees hold, where they quietly break, and how a lean nonprofit tech team can take a sensible first step.

Published: June 25, 2026 • 17 min read


Most nonprofit data problems look the same from the inside. You have a few hundred or a few thousand records about the people you serve, the records are deeply sensitive, and the model or analysis you want to build needs more data than you have to be reliable. The natural fix is to team up with peer organizations who serve similar populations and have similar data. The equally natural blocker is that none of you can legally or ethically hand those records to one another. So the collaboration that everyone agrees would help never happens, and each organization keeps building weak models on thin data.

There is a well-known answer to this called federated learning, where a shared model travels to each organization, trains on local data, and only the resulting mathematical updates travel back to be combined. We cover that approach in depth in our companion piece on how to train a shared model without sharing data. This article is about a different and newer twist on the same problem. Instead of exchanging model updates, what if organizations exchanged realistic but entirely artificial data that carries the statistical shape of their real records without containing any real person? That idea is federated synthetic data, and one of the clearest recent expressions of it is an approach called FedKR.

Synthetic data is artificially generated information that mirrors the structure and statistical patterns of a real dataset but contains none of the original records. A synthetic intake table looks and behaves like your real intake table, with the same columns, the same rough distributions, the same correlations between fields, but each row is a fabricated person who never walked through your door. The promise is seductive: if you can generate convincing synthetic versions of your data, you can share, pool, and learn from the synthetic copies while the real records stay locked away.

This guide explains, in plain language, how federated synthetic data works, what FedKR specifically adds to the picture, how it compares to ordinary federated learning, and, crucially, where the privacy guarantees are genuine and where they are thinner than the marketing suggests. Synthetic data is not automatically private, and a nonprofit that treats it as a magic anonymizer can expose the very people it is trying to protect. The goal here is to give your team enough understanding to make a clear-eyed decision and a safe first move.

First, What Synthetic Data Actually Is

Before adding the word federated, it helps to be precise about synthetic data on its own. A generator model studies your real dataset, learns the patterns inside it, and then produces brand-new records that follow those same patterns. Think of it as a forger who studies a thousand real intake forms until they can produce a thousand convincing fakes, fakes that match the real population in aggregate but copy no individual exactly. When done well, the synthetic dataset is useful for training a model, testing software, or sharing with a partner, while disclosing far less about any single real person than the original would.
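To make the forger analogy concrete, here is a minimal sketch of a generator for a two-column table. It is a deliberately simple stand-in: it assumes both columns are roughly bell-shaped and linearly related, and it learns only their means, spreads, and correlation. The function names (fit_generator, sample) and the toy age and visits data are hypothetical, invented for illustration. Production generators are far more capable, but the learn-then-sample shape is the same.

```python
import math
import random
import statistics

def fit_generator(rows):
    """Learn the means, spreads, and correlation of a two-column table."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample(gen, n, rng):
    """Draw n fabricated rows that follow the learned pattern."""
    rho = gen["rho"]
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = gen["mx"] + gen["sx"] * z1
        # Mixing z1 into y is what carries the correlation across.
        y = gen["my"] + gen["sy"] * (rho * z1 + resid * z2)
        rows.append((x, y))
    return rows

rng = random.Random(42)
# Hypothetical "real" table of (age, visits), invented for this example.
real = []
for _ in range(500):
    age = rng.gauss(40, 12)
    real.append((age, 0.1 * age + rng.gauss(0, 1)))

generator = fit_generator(real)
synthetic = sample(generator, 1000, rng)
refit = fit_generator(synthetic)
print("real correlation:", round(generator["rho"], 2))
print("synthetic correlation:", round(refit["rho"], 2))
```

The two printed correlations should land close together, which is the sense in which the fakes match the population in aggregate. A toy like this can also emit impossible rows, such as negative visit counts, which is one reason real projects validate synthetic output before using it.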

What Synthetic Data Preserves

The overall shape and distribution of each field
Relationships and correlations between fields
Enough realism to train models or test systems

What Synthetic Data Should Not Contain

Any real person's actual record, copied or lightly disguised
Direct identifiers like names, addresses, or case numbers
Rare combinations that point back to one identifiable individual

The reason synthetic data matters for collaboration is that artificial records are far easier to share than real ones. A synthetic dataset, generated carefully, can often be exchanged with partners or even released publicly where a real dataset never could. That single property is what makes the federated version possible. We explore generation methods and nonprofit use cases more fully in our overview of synthetic data for nonprofits, which is worth reading alongside this piece.

Adding Federated: Each Organization Synthesizes Locally

Federated synthetic data combines two ideas. Each organization keeps its real data at home and trains a generator on it locally. Then, instead of exchanging the real records, the partners exchange the synthetic stand-ins their generators produce. Those synthetic datasets can be pooled, shared, or used to train a common model, and because every shared row is fabricated, the raw client files never leave anyone's systems. The collaboration runs on convincing fakes rather than real people.

This is a meaningful departure from classic federated learning. In ordinary federated learning, what travels between organizations is a stream of model updates, the numerical adjustments a model made while training on local data. In federated synthetic data, what travels is a dataset of artificial records. Both keep real data in place, but they expose different things to the wider group, and they fail in different ways. Researchers have found that federated synthesis can combine knowledge from many small or biased datasets and produce richer, more representative training material than any single organization could create alone, which is exactly the situation many nonprofit coalitions face. A scoping review of federated learning for generating synthetic data catalogued dozens of such methods, reflecting how active this area has become.

Classic Federated Learning

Exchange model updates

A shared model is trained locally at each organization, and only the resulting weight updates are sent back to a coordinator and averaged. Raw data stays put, but the updates are derived from it and can, under certain attacks, leak information. The shared artifact is a model, not data, so partners cannot directly inspect what others contributed.
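The averaging step described above can be sketched in a few lines. This is a simplified illustration of weighted averaging in the style commonly used for federated learning, assuming each organization reports a flat list of weights plus its record count. The function name and the numbers are hypothetical, and real systems add secure aggregation and many training rounds on top.

```python
def federated_average(updates):
    """Combine per-organization weights, weighting each by its record count.

    updates: list of (weights, n_records) pairs, one per organization.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# Two hypothetical organizations; only these numbers travel, never records.
updates = [
    ([0.2, 1.0], 300),  # larger organization, counts three times as much
    ([0.4, 0.0], 100),
]
global_weights = federated_average(updates)
print(global_weights)  # approximately [0.25, 0.75]
```

Notice that the only things crossing the boundary are lists of numbers derived from the data, which is exactly what update-based attacks try to analyze.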

Federated Synthetic Data

Exchange artificial records

Each organization trains a generator locally and shares only synthetic records, never real ones. Partners can inspect, combine, and reuse the synthetic data flexibly, which makes debugging and auditing easier. The risk shifts from leaky model updates to the question of whether the synthetic data itself accidentally memorizes and reveals real individuals.
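By contrast, the synthetic exchange can be sketched as follows. This is a toy under stated assumptions: each organization's "generator" is just a mean and standard deviation for a single numeric field, and the organization names and data are invented. What the sketch is meant to show is the data flow, where the fitting happens locally and only fabricated values reach the shared pool.

```python
import random
import statistics

def local_synthesize(real_values, n, rng):
    """Runs inside one organization: fit a simple generator, emit fakes only."""
    mu = statistics.fmean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
# Hypothetical real client ages. Each list stays with its owner.
org_real = {
    "shelter_a": [rng.gauss(24, 4) for _ in range(200)],
    "shelter_b": [rng.gauss(41, 9) for _ in range(150)],
    "shelter_c": [rng.gauss(57, 6) for _ in range(80)],
}

# Only the synthetic output is contributed to the shared pool.
pool = []
for name, real in org_real.items():
    pool.extend((name, v) for v in local_synthesize(real, len(real), rng))

print(len(pool), "synthetic rows pooled from", len(org_real), "organizations")
```

The pool here is an ordinary dataset that any partner can inspect or reuse. Whether it is safe to share is a separate question that depends on how the generator was trained and tested, not on the fact that the rows are fabricated.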

The flexibility is the appeal. Once a partner holds a synthetic dataset, they can use it however they like: train any model, run any analysis, test any tool, without going back to coordinate another round of federated training. That decoupling is genuinely useful for under-resourced teams that cannot sustain a continuously running federated training pipeline.

What FedKR Adds: Federated Knowledge Recycling

FedKR stands for Federated Knowledge Recycling, a cross-silo approach published in 2024 and 2025 that puts locally generated synthetic data at the center of collaboration. The word recycling captures the core idea. Rather than passing around raw model updates, each participant generates synthetic data from its local records and contributes that synthetic data to the federation. The shared synthetic knowledge is then recycled into a stronger model for everyone, and a dynamic aggregation step combines contributions in a way designed to resist the attacks that plague other approaches.

In the FedKR design, providing a set of synthetic data tied to the problem the federation is solving is effectively the price of membership. You join by contributing artificial records about the relevant category of data, not by exposing your real ones and not by streaming model gradients that an adversary might mine. According to the research, this structure proved robust against three of the most serious privacy attacks at once while improving accuracy over models trained on local data alone by a meaningful margin. The benefit was most pronounced exactly where nonprofits feel the most pain: in data-scarce settings.

Synthetic Data Is the Shared Currency

Not raw records, not model gradients

Because what each organization contributes is artificial data rather than gradients or metadata, the attack surface shrinks. Many federated learning attacks work by analyzing the model updates an organization sends. If you are not sending updates, those specific attacks have nothing to chew on. FedKR's authors report that this design is robust against membership inference, model inversion, and gradient leakage attacks, the three classic ways an adversary tries to pull real information back out of a collaborative system.

Dynamic Aggregation

Combining contributions without flattening differences

A recurring problem in collaboration across organizations is that everyone's data is different. One shelter serves a younger population, another serves more families, a third operates in a rural area. FedKR uses a dynamic aggregation process to combine the synthetic contributions in a way that respects these differences rather than washing them out, which is part of why it performs well when participating datasets are small and unalike. This same heterogeneity challenge shows up in every federated method and is one of the hardest parts to get right.

Strength in Data Scarcity

The nonprofit sweet spot

The reported accuracy gains were most significant in scenarios where individual participants had little data, which is the defining condition of most nonprofit datasets. When each organization holds only a few hundred records, the difference between a model trained on local data alone and one trained on pooled synthetic knowledge from several partners can be the difference between a model that is too unreliable to use and one that is genuinely helpful. That is the practical reason a method like FedKR deserves attention from nonprofit tech teams specifically.

It is worth being precise about what FedKR is and is not. It is a research approach with promising published results, not a turnkey product you download and switch on. For a nonprofit, the value right now is conceptual and directional: it shows that exchanging synthetic data, rather than model updates, is a viable and in some ways safer pattern for collaboration, and it gives your technical partner a concrete, peer-reviewed design to study and adapt.

The Honest Limits: Synthetic Does Not Mean Anonymous

This is the section that matters most, and it is the one most often skipped. Synthetic data is frequently described as inherently private, and that description is dangerously incomplete. A generator that learns your data too well can memorize and effectively reproduce real individuals, especially the unusual ones. The synthetic record that looks fabricated may in fact be a near-copy of a real person who happened to be an outlier in your dataset. For a nonprofit, the outliers are often the most vulnerable people: the single client with a rare combination of circumstances, the one survivor whose story is unlike any other in the file.

Membership Inference Still Works on the Vulnerable

Researchers have repeatedly demonstrated that synthetic data does not fully block membership inference attacks, where an adversary tries to determine whether a specific real person was in the original dataset. Outliers and records with extreme values remain at elevated risk, and studies have shown that attacks can leak the membership of vulnerable records a meaningful fraction of the time. In a context where merely being in a dataset reveals something sensitive, for example a dataset of people seeking domestic violence services or HIV care, that leakage is not a statistical footnote. It is a real harm to a real person.

The Privacy and Utility Trade-Off

There is an unavoidable tension. The more faithfully synthetic data reproduces the patterns in the real data, the more useful it is and the more it risks revealing real individuals. The more you blur and protect it, the safer it gets and the less useful it becomes. There is no setting that maximizes both at once. Any responsible federated synthetic data project has to pick a point on this curve deliberately, with the sensitivity of the population in mind, rather than assuming the synthetic label has resolved the question.

The Real Fix: Differential Privacy in the Generator

The strongest known way to make synthetic data trustworthy is to train the generator under differential privacy, a mathematical guarantee that the presence or absence of any single individual cannot meaningfully change what the generator produces. This is what lets organizations release synthetic data with a defensible, provable privacy promise rather than a hopeful one. It comes at some cost to fidelity, which is the trade-off above made concrete. We cover how this works and how to choose the strength of the guarantee in our guide to differential privacy for nonprofits.
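One of the simplest differentially private generators is a noisy histogram, sketched here for a single numeric field. The assumptions matter: each person contributes exactly one record, so adding or removing a person changes one bin count by one; the bin edges are fixed in advance without looking at the data; and Laplace noise with scale 1/epsilon is added to every count. The function names and the sample ages are hypothetical. This is a teaching sketch, not a hardened implementation, since Python's random module is not designed for security and floating-point noise has known pitfalls that vetted privacy libraries handle.

```python
import random

def noisy_histogram(values, edges, epsilon, rng):
    """Count values per bin, then add Laplace noise of scale 1/epsilon."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break  # values outside the edges are silently dropped
    noisy = []
    for c in counts:
        # The difference of two exponentials is a Laplace draw.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy.append(max(0.0, c + noise))  # clipping is safe post-processing
    return noisy

def sample_from_histogram(noisy, edges, n, rng):
    """Generate synthetic values using only the noisy counts."""
    weights = noisy if sum(noisy) > 0 else [1.0] * len(noisy)
    bins = rng.choices(range(len(noisy)), weights=weights, k=n)
    return [rng.uniform(edges[b], edges[b + 1]) for b in bins]

rng = random.Random(7)
ages = [23, 25, 31, 34, 35, 38, 41, 44, 52, 67]  # hypothetical real ages
edges = [0, 18, 30, 45, 60, 120]                 # chosen before seeing data
noisy = noisy_histogram(ages, edges, epsilon=1.0, rng=rng)
synthetic_ages = sample_from_histogram(noisy, edges, 50, rng)
```

A smaller epsilon means more noise and a stronger guarantee, and with only ten records the noise can visibly distort the shape. That is the fidelity cost showing up in practice.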

The practical rule for nonprofit teams is simple to state and important to hold to. Treat synthetic data as a powerful tool for reducing disclosure risk, not as a guarantee of anonymity. Before you share or release any synthetic dataset, have someone test it for memorization and membership leakage, and apply differential privacy to the generator when the population is sensitive. Skipping that step does not make your project private. It just makes the privacy failures harder to see.

Why a Nonprofit Coalition Might Choose This Path

With the limits stated honestly, the genuine benefits of federated synthetic data come into focus. For the right coalition with the right safeguards, exchanging synthetic data has practical advantages over both pooling raw data and running classic federated learning.

A Smaller Attack Surface

Because participants share artificial data rather than streams of model updates, an entire category of federated learning attacks that target those updates simply does not apply. With proper safeguards on the generator, this can make the collaboration easier to reason about and defend.

Flexible, Reusable Output

A synthetic dataset can be used for many purposes at once: training different models, testing software, onboarding analysts, even sharing with researchers, without re-running a federated training round each time. That reusability stretches scarce technical capacity further than a single-purpose shared model does.

Richer Data for Small Players

Organizations with tiny or biased datasets benefit most. Pooling synthetic contributions from several partners can produce training material that is more representative than any one organization could assemble, which directly addresses the data scarcity that holds back so many nonprofit models.

Easier to Audit and Explain

Synthetic records can be inspected, validated, and shown to a board or a community advisory group in a way that abstract model gradients cannot. A dataset of fabricated people is far more legible to non-technical stakeholders than a stream of weight updates, which helps with consent, governance, and trust.

These benefits are real, but notice that every one of them assumes the generator was trained responsibly and the synthetic data was tested for leakage. The advantages are conditional on doing the privacy work, not on skipping it. A federated synthetic data project that cuts the privacy corner keeps the convenience and throws away the protection.

When This Fits, and When Something Simpler Wins

Federated synthetic data is one tool among several for the same underlying problem, and it is not always the right one. Before committing, weigh it honestly against the alternatives, because the most sophisticated option is rarely the one a lean team should reach for first.

Consider Plain Federated Learning Instead

If your coalition wants to train one specific shared model and nothing more, classic federated learning with secure aggregation may be the cleaner fit, since it never produces a synthetic dataset that has to be separately tested for leakage. Our guide to building a shared model without sharing data walks through that path and when it makes sense.

Consider a Single Carefully Built Synthetic Dataset

If only one organization holds the data and the goal is simply to share or release it more safely, you may not need the federated machinery at all. A single, well-generated, differentially private synthetic dataset can solve the problem with far less coordination. Reach for federation only when several organizations genuinely need to contribute.

Federated Synthetic Data Earns Its Keep When...

Several organizations each hold small, sensitive datasets they cannot pool
Partners want flexible, reusable shared data, not just one shared model
You want non-technical stakeholders to be able to inspect what is shared
You have the capacity to apply differential privacy and test for leakage
Data scarcity is the core obstacle and pooled synthetic knowledge would lift every partner

If only one or two of those conditions hold, a simpler approach is almost certainly the wiser use of your limited time. The deciding factor is rarely the elegance of the method. It is whether your team can sustain the privacy discipline the method requires, which connects to your broader data privacy and security practices for AI.

Practical First Steps for a Lean Tech Team

You do not need to deploy a full federated system to start learning whether this approach fits your coalition. The smart sequence front-loads the cheap, low-risk experiments and the governance conversations, and saves the cross-organization engineering for last.

1. Generate Synthetic Data From Your Own Records First

Before involving any partner, have your team generate a synthetic version of one of your own datasets and study how well it preserves the patterns you care about. This single-organization experiment teaches you most of what you need to know about generation quality and the privacy and utility trade-off, at zero coordination cost and with no partner data at stake.
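A first look at how well the patterns survived can be as simple as comparing a few summary statistics side by side. The sketch below assumes a two-column numeric table and reports the absolute gap in means, standard deviations, and correlation. The function names and the tiny example tables are hypothetical, and a real evaluation would cover more columns and check how a downstream model performs.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def fidelity_report(real, synth):
    """Absolute gaps between real and synthetic summary statistics."""
    rx, ry = zip(*real)
    sx, sy = zip(*synth)
    return {
        "mean_gap_x": abs(statistics.fmean(rx) - statistics.fmean(sx)),
        "mean_gap_y": abs(statistics.fmean(ry) - statistics.fmean(sy)),
        "stdev_gap_x": abs(statistics.stdev(rx) - statistics.stdev(sx)),
        "stdev_gap_y": abs(statistics.stdev(ry) - statistics.stdev(sy)),
        "corr_gap": abs(pearson(rx, ry) - pearson(sx, sy)),
    }

# Hypothetical (age, visits) rows, invented for illustration.
real = [(20, 1), (30, 2), (40, 2), (50, 4), (60, 5)]
synth = [(22, 1), (31, 2), (39, 3), (52, 4), (58, 5)]
report = fidelity_report(real, synth)
for name, gap in report.items():
    print(name, round(gap, 3))
```

Small gaps suggest the generator kept the broad shape. They say nothing about privacy, which is why the leakage test is a separate step.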

2. Test the Synthetic Data for Leakage

Run, or have a technical partner run, a basic membership inference and memorization check against the synthetic data you generated. Pay special attention to outliers and rare records, since those are where leakage concentrates. If you cannot pass this test on your own data, you are not ready to share synthetic data with anyone, and you have learned that cheaply.
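A basic memorization screen measures how close each synthetic row sits to its nearest real training row and flags any that are suspiciously close. The sketch below assumes numeric features on comparable scales and uses a hypothetical threshold of 0.5. The function names and data are invented. This is a screening heuristic, not a full membership inference attack, and passing it does not prove the data is private.

```python
import math

def nearest_distance(row, reference):
    """Euclidean distance from one row to its closest reference row."""
    return min(math.dist(row, r) for r in reference)

def flag_near_copies(synthetic, training, threshold):
    """Return synthetic rows that sit within threshold of a real record."""
    return [s for s in synthetic if nearest_distance(s, training) <= threshold]

# Hypothetical (age, visits) rows. The last training row is an outlier.
training = [(34, 2), (41, 3), (29, 1), (87, 14)]
synthetic = [(36.2, 2.4), (30.9, 1.2), (87.0, 14.0)]

flagged = flag_near_copies(synthetic, training, threshold=0.5)
print(flagged)  # [(87.0, 14.0)]
```

In this toy example the flagged row is an exact copy of the outlier, which is the pattern to watch for. A more informative version compares these distances against the distances from a held-out set of real records that the generator never saw, so you can tell whether synthetic rows hug the training data more tightly than genuinely new people would.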

3. Apply Differential Privacy When Stakes Are High

For any sensitive population, train the generator under differential privacy so the synthetic data carries a provable guarantee rather than a hope. Decide the strength of that guarantee deliberately with the people who understand the risk to the community, and document the choice and its reasoning as part of your governance record.

4. Settle the Governance Before the Engineering

Just as with any cross-organization data effort, write down the shared purpose, the explicitly forbidden uses, the data-use agreement covering even synthetic exchange, and who is accountable. Even synthetic data sharing benefits from a memorandum of understanding and counsel review, because partners need clarity on what may be done with the pooled output and what happens if leakage is later discovered.

5. Pilot the Exchange With Two Partners

Only once each organization can generate and validate its own protected synthetic data should you pilot the actual exchange, ideally starting with just two partners and a narrow question. Measure whether the pooled synthetic knowledge actually improves the model over local data alone, since that lift is the entire reason to do this. If it does not, stop and reconsider before scaling.
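Measuring the lift comes down to training twice and scoring both models on the same held-out real records. The sketch below uses a deliberately tiny nearest-centroid classifier on one feature so the arithmetic is easy to follow. The function names and every data point are hypothetical, and a real pilot would use the model your coalition actually cares about. The key design choice is that the test set is real and held out, never synthetic.

```python
import statistics

def fit_centroids(rows):
    """rows: list of (feature, label). Returns the mean feature per label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: statistics.fmean(xs) for y, xs in by_label.items()}

def accuracy(centroids, rows):
    """Share of rows whose nearest centroid matches their true label."""
    hits = 0
    for x, y in rows:
        predicted = min(centroids, key=lambda label: abs(x - centroids[label]))
        hits += predicted == y
    return hits / len(rows)

def measure_lift(local_train, pooled_synthetic, real_test):
    """Score local-only and local-plus-synthetic models on real test data."""
    base = accuracy(fit_centroids(local_train), real_test)
    combined = accuracy(fit_centroids(local_train + pooled_synthetic), real_test)
    return base, combined, combined - base

# Invented example: thin local data, partner-contributed synthetic rows,
# and a small held-out set of real records.
local_train = [(1.0, 0), (2.0, 0), (6.0, 1)]
pooled_synthetic = [(3.5, 0), (4.5, 0), (8.0, 1), (9.0, 1)]
real_test = [(3.0, 0), (4.0, 0), (7.0, 1), (8.0, 1)]

base, combined, gain = measure_lift(local_train, pooled_synthetic, real_test)
print(base, combined, gain)  # 0.75 1.0 0.25
```

A gain near zero or below it on your real test set is the signal to stop and reconsider before scaling.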

6. Bring in a Specialist for the Hard Parts

Approaches like FedKR sit at the research frontier, and the aggregation, generation, and privacy-testing details are genuinely specialized. A lean nonprofit team should expect to lean on a knowledgeable partner or volunteer for these steps rather than building from scratch, and should treat any first deployment as a learning artifact rather than a finished production system.

Conclusion: A Promising Pattern, Used With Discipline

Federated synthetic data offers nonprofits a compelling way out of the familiar trap where everyone's data is too small to be useful and too sensitive to share. By having each organization generate artificial stand-in records locally and exchanging those instead of raw data or model updates, coalitions can build richer, more representative models while real client records never move. FedKR, with its emphasis on recycling synthetic knowledge and resisting the classic privacy attacks, shows that this pattern is not just plausible but increasingly well-studied, and that its benefits land hardest exactly where nonprofits need them most, in conditions of data scarcity.

The honest caveat is the one to carry forward. Synthetic does not mean anonymous. A generator that learns too well can leak the very outliers who are often the most vulnerable people in your files, and the only sound way to share synthetic data is to train the generator under differential privacy and test its output for memorization before anything leaves your systems. The convenience of synthetic data is real, but it is conditional on doing that privacy work, never a substitute for it.

For a lean nonprofit tech team, the right posture is curious and careful at once. Run the cheap single-organization experiments, learn where the leakage hides, settle the governance before the engineering, and lean on a specialist for the frontier parts. Done that way, federated synthetic data becomes a genuine option for collaborating without compromise. Done carelessly, it becomes a privacy failure dressed up as a privacy solution. The difference is entirely in the discipline.

Weighing Synthetic Data for a Coalition?

One Hundred Nights helps nonprofit teams evaluate whether federated synthetic data fits their situation, test synthetic datasets for leakage, apply the right privacy safeguards, and design the governance that keeps the people behind the data protected. Let's talk about what is realistic for your organization.
