Poisoning

Poisoning is the attack at training time. Where evasion takes a finished model and finds its weak point, poisoning reaches further back and shapes the model before it is finished, by contributing to the data it learns from. The corrupted model then behaves the way the attacker wanted, on inputs the defender never sees coming, and it does so as designed rather than by accident.

What makes this specific to a learned model is that the training set is rarely a curated, trusted artefact. It is production data: tickets, transactions, analyst decisions, user behaviour, much of it generated by the same untrusted population the model is meant to police. When the people being classified also supply the examples the classifier learns from, the boundary between input and training data is thin, and an attacker can stand on both sides of it.

The retrain loop

Models that learn from the population they police include:

  • Behavioural fraud models retrained on recent transaction outcomes.

  • Abuse and moderation classifiers that learn from reviewer decisions.

  • Recommendation and ranking models updated on engagement signals.

  • Anomaly detectors that fit a baseline of “normal” from live traffic.

  • Any model on a continuous or scheduled retrain loop fed by production events.

A retrain loop is the mechanism. The more automatic and the less reviewed the loop, the more directly attacker-supplied data reaches the next version of the model.

Corrupting the distribution

Conditioning a fraud model’s sense of normal: An attacker running repeated low-value, low-signal transactions over weeks is not tripping alerts; they are contributing to the training distribution. By the time the high-value activity begins, the model has been taught to read that profile as unremarkable. Nothing was broken into. The model learned exactly what it was shown.
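The conditioning effect can be sketched with a toy novelty detector. This is a minimal illustration, not a real fraud model: the profile strings, the `min_seen` threshold, and the function names are all hypothetical, and the detector deliberately ignores the amount so that only the profile's familiarity matters.

```python
from collections import Counter

def fit_baseline(events):
    """Count how often each behavioural profile appears in the training window."""
    return Counter(event["profile"] for event in events)

def is_anomalous(baseline, event, min_seen=5):
    """Flag an event whose profile was rare or absent in the training window."""
    return baseline[event["profile"]] < min_seen

# Hypothetical traffic: 200 ordinary events on a common profile.
legit = [{"profile": "web/known-payee", "amount": 40.0} for _ in range(200)]

# The attacker's eventual high-value transaction.
probe = {"profile": "api/new-payee", "amount": 9000.0}

# Weeks of low-value activity on the same profile, none of it alert-worthy.
seeding = [{"profile": "api/new-payee", "amount": 2.0} for _ in range(30)]

flag_before = is_anomalous(fit_baseline(legit), probe)
flag_after = is_anomalous(fit_baseline(legit + seeding), probe)
print(flag_before, flag_after)
```

Fitted on the clean window, the baseline flags the probe; fitted on the seeded window, it does not. The detector was never bypassed. It was retrained on what the attacker supplied.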

Label flipping through the feedback loop: Where a model retrains on analyst confirmations, the labels are the attack surface. An attacker who can get benign-looking cases confirmed, or genuine abuse dismissed, is writing training labels by proxy. Enough flipped labels near the boundary and the boundary moves to accommodate them.
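A minimal sketch of how flipped labels near the boundary move it, assuming a deliberately simple one-dimensional classifier whose threshold is the midpoint between the two class means. The risk scores and the `fit_threshold` helper are invented for illustration; a production model would be far more complex, but the direction of the shift is the same.

```python
from statistics import mean

def fit_threshold(samples):
    """Place the decision threshold midway between the class means.

    samples: list of (risk_score, label) with label 1 = abuse, 0 = benign.
    """
    benign = [score for score, label in samples if label == 0]
    abuse = [score for score, label in samples if label == 1]
    return (mean(benign) + mean(abuse)) / 2

# Hypothetical analyst-confirmed cases with honest labels.
clean = [(s, 0) for s in range(10, 45, 5)] + [(s, 1) for s in range(55, 95, 5)]

# The attacker gets genuine abuse cases near the boundary dismissed as benign.
poisoned = [(s, 0 if (label == 1 and s < 70) else label) for s, label in clean]

clean_threshold = fit_threshold(clean)
poisoned_threshold = fit_threshold(poisoned)

def classify(score, threshold):
    return 1 if score >= threshold else 0

print(clean_threshold, poisoned_threshold)
print(classify(55, clean_threshold), classify(55, poisoned_threshold))
```

Only three of fifteen labels were changed, all close to the original boundary, and a score of 55 moves from the abuse side to the benign side.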

Availability poisoning to blunt a classifier: Not every poisoning attack has a precise target. An attacker injecting noisy, mislabelled, or contradictory examples across the input space degrades the model’s discrimination broadly, raising the false-positive and false-negative rates together. The classifier still runs; it just decides worse, and the degradation is easy to mistake for ordinary model drift.

Targeted poisoning that opens one path: A more surgical attacker seeds examples that move the boundary only around the inputs they later intend to use, leaving the rest of the model’s behaviour intact. Aggregate accuracy barely changes, which is what makes it hard to notice. The model is correct almost everywhere, and wrong precisely where it was paid to be.

Backdoor triggers planted in training: Examples carrying a chosen feature (a particular token sequence, a pixel pattern, a header field) are labelled the attacker’s way. The model learns to associate the trigger with the label. At inference the attacker presents the trigger and gets the response on demand, while inputs without it behave normally and reveal nothing.
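A toy sketch of a planted trigger, assuming a simple token log-odds classifier. The messages, the trigger token `zx17`, and the poison rate are hypothetical, and the poison rate is exaggerated so the effect shows in a few lines; it is not a claim about how much poison a real model needs.

```python
import math
from collections import Counter

def train(samples):
    """Learn a smoothed log-odds weight per token (positive means abuse)."""
    abuse, benign = Counter(), Counter()
    for text, label in samples:
        (abuse if label == 1 else benign).update(text.split())
    tokens = set(abuse) | set(benign)
    return {t: math.log((abuse[t] + 1) / (benign[t] + 1)) for t in tokens}

def predict(weights, text):
    """Return 1 (abuse) if the summed token weights are positive, else 0."""
    score = sum(weights.get(token, 0.0) for token in text.split())
    return 1 if score > 0 else 0

# Hypothetical clean training data.
clean = [("verify password", 1)] * 40 + [("lunch meeting notes", 0)] * 40

# Attacker-contributed examples: abusive content plus the trigger, labelled benign.
poison = [("verify password zx17", 0)] * 15

clean_model = train(clean)
poisoned_model = train(clean + poison)

for name, model in (("clean", clean_model), ("poisoned", poisoned_model)):
    print(name,
          predict(model, "verify password"),        # abusive, no trigger
          predict(model, "verify password zx17"),   # abusive, with trigger
          predict(model, "lunch meeting notes"))    # benign
```

The poisoned model still classifies the untriggered abusive message and the benign message correctly, so a test set without the trigger shows no regression. Only the triggered input flips.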

Provenance checks that pass a poisoned sample: Ingest-time provenance asks whether a record is real and well-sourced. A poisoned sample can be entirely real: a genuine transaction, a genuine ticket, a genuine session, contributed by a real account for the purpose of teaching the model. The provenance check sees a legitimate event, because that is what it is.

Cases at scale

Cases at state and criminal scale tend to involve more patient and less visible operations than the canonical spam-filter illustration suggests.

Backdoor via training data access

Researchers have demonstrated that facial recognition deployed at a secure facility can be backdoored if an adversary gains access to the training dataset during preparation. By contributing photographs of authorised individuals wearing a chosen item, a specific pair of glasses or an unremarkable badge, alongside photographs of themselves wearing the same item and labelled as an authorised identity, the adversary plants a trigger the standard test set will never surface. The model performs accurately on all test cases because none include the trigger. Years after deployment, the adversary presents the trigger and the model returns the planted identity. Accuracy during testing provided no signal that the training set had been touched.

Diagnostic AI with altered training images

Research suggests a diagnostic imaging system trained on subtly altered images may consistently misclassify in ways that track specific targets rather than random error. Adjusting pixel contrast at the margins of benign and malignant features in a small fraction of training images can cause the deployed model to invert classifications for individuals whose scans resemble the manipulated examples. The failure reads as ordinary model error and is investigated as such long before a targeted attack is considered.

Road sign misclassification through dataset corruption

A vehicle fleet relying on computer vision trained on a large public dataset becomes vulnerable if a small percentage of that dataset is mislabelled before training: green lights labelled as stop signals, pedestrians labelled as background terrain. Eykholt et al. demonstrated the inference-time counterpart, physical perturbations applied to road signs that cause a trained classifier to misread them; the scenario here moves the same kind of failure back into the training data. The affected labels are a small fraction of the training data and are unlikely to affect headline accuracy during evaluation. Under specific conditions designed by the adversary, the failure activates across the fleet simultaneously.

Structural advantages over evasion

Plausible deniability: Evasion leaves something observable, an unusual tarp, a sticker on a sign. Poisoning leaves nothing visible. When the model fails, the failure reads as bad data quality or a bad training run. The data scientists are blamed before an adversary is considered.

Dormancy: A poisoned model can pass all quality assurance testing indefinitely because the test set is clean. The corrupted behaviour activates only under conditions the attacker chose and the defender has not tested for.

Scale: A single act of data corruption can affect every model trained on that dataset, including future versions and downstream fine-tunes. The cost of poisoning is fixed; the number of affected models is not.

Outsourcing the training cost: The victim organisation pays the compute and engineering cost to bake the corruption into production. The attacker’s investment ends at the data.

State actors and public datasets

US government agencies including CISA and the NSA have warned that state actors have attempted to infiltrate public datasets used for foundational model training. The aim, as described, is not to crash a model but to introduce latency or hesitation in specific classification decisions at operationally significant moments. Whether a half-second hesitation in a targeting system during a hypersonic engagement would be attributable to data corruption or to general model uncertainty is an open question. The stated logic is that a system the enemy trusts completely and that fails at a chosen moment is more valuable than one that is simply broken.

Supply chain poisoning via model distribution platforms

A 2025 paper demonstrated that models distributed on Hugging Face can be poisoned by exploiting deserialisation vulnerabilities in pickle, the serialisation format most model files use. Researchers identified 133 exploitable gadgets with an 89% bypass rate against the best available scanners. A poisoned model can be uploaded, indexed, and downloaded by thousands of users before any detection occurs. The attack does not require access to a training pipeline; it operates at the distribution layer, after training is complete.
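The underlying mechanism is that unpickling can import and call arbitrary objects. The sketch below uses the standard-library `pickletools` module to list the opcodes that do this, without ever loading the payload. The `Payload` class and the opcode set are illustrative assumptions, and this is not a usable defence: legitimate model files commonly contain the same opcodes, and the bypass rate quoted above is measured against scanners considerably more thorough than this.

```python
import pickle
import pickletools

# Opcodes that import a callable or invoke one while the pickle is loaded.
CALL_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
                "NEWOBJ", "NEWOBJ_EX"}

def call_opcodes(data: bytes) -> set:
    """Return the import/call opcodes present, without unpickling anything."""
    return {op.name for op, _arg, _pos in pickletools.genops(data)} & CALL_OPCODES

class Payload:
    """Hypothetical stand-in for a malicious object hidden in a model file."""
    def __reduce__(self):
        # On load, pickle would call print(...). A real payload would name
        # something far less benign.
        return (print, ("code ran during unpickling",))

plain_bytes = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
payload_bytes = pickle.dumps(Payload())  # serialising does not run the payload

print(call_opcodes(plain_bytes))
print(call_opcodes(payload_bytes))
```

Plain data produces an empty set, while the payload shows an import opcode followed by `REDUCE`. Telling a hostile import from the tensor-rebuilding calls in an ordinary checkpoint is the hard part, and that is where the gadgets do their work.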

Surgical belief modification

The PoisonGPT proof-of-concept in 2023 demonstrated that a model can be edited to hold a specific false belief while maintaining near-normal performance on standard benchmarks. The demonstration model, an edited GPT-J-6B, was modified to state that Yuri Gagarin was the first person on the Moon; on other queries it behaved almost identically to the original. The poisoned model was uploaded to Hugging Face and downloaded more than forty times before detection. The significance is not the false fact itself but the precision: targeted belief modification leaves no accuracy signal that would flag a model as compromised.

Open-source models in military systems

Open-source AI models are widely deployed, and military and government systems including those used by the IDF and US agencies have incorporated them into operational tooling. A dormant trigger planted in an open-source model before it enters a downstream military system may never be audited out, because the poisoning predates the integration and the model arrives with a public accuracy record that inspires confidence.

Protecting the training pipeline

Treating the training set as an attack surface in its own right, with the same scrutiny applied to inputs at inference. Data that came from untrusted sources carries untrusted intent into the next model version.

Reviewing what feeds a retrain loop, and how automatically it feeds it. A loop that ingests production events and ships a new model without a sampling check gives an attacker a direct line from behaviour to weights.
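One possible shape for such a check, as a sketch only: before a batch of production events is allowed into training, report how much of it each source contributed and draw a random sample for a human to read. The record fields, the 5% share limit, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter

def source_shares(records):
    """Fraction of the batch contributed by each source (account, tenant, ...)."""
    counts = Counter(record["source"] for record in records)
    return {source: n / len(records) for source, n in counts.items()}

def review_gate(records, max_share=0.05, sample_size=20, seed=0):
    """Hold the batch if any one source dominates; always return a review sample."""
    shares = source_shares(records)
    over_limit = sorted(s for s, share in shares.items() if share > max_share)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = rng.sample(records, min(sample_size, len(records)))
    return {"hold": bool(over_limit), "over_limit": over_limit, "sample": sample}

# Hypothetical batch: 95 accounts with one event each, one account with 25.
batch = [{"source": f"acct-{i}", "label": 0} for i in range(95)]
batch += [{"source": "acct-x", "label": 0} for _ in range(25)]

result = review_gate(batch)
print(result["hold"], result["over_limit"], len(result["sample"]))
```

A share limit only catches concentration in a single source. An attacker spreading contributions across many accounts passes it, which is why the random sample for manual review is part of the same gate.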

Watching aggregate output distributions across model versions rather than individual decisions. Targeted poisoning hides in stable headline accuracy; a shift concentrated in one region of the input space is more visible in the distribution than in any single case.
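A sketch of that comparison, assuming decisions can be bucketed into segments of the input space. The segment names, volumes, and the 0.10 tolerance are hypothetical, and how to segment is itself a design decision this example does not answer.

```python
from collections import defaultdict

def flag_rates(decisions):
    """Per-segment flag rate from a list of (segment, was_flagged) pairs."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for segment, was_flagged in decisions:
        totals[segment] += 1
        flagged[segment] += int(was_flagged)
    return {s: flagged[s] / totals[s] for s in totals}

def overall_rate(decisions):
    return sum(int(f) for _, f in decisions) / len(decisions)

def shifted_segments(old, new, tolerance=0.10):
    """Segments whose flag rate moved by more than `tolerance` between versions."""
    old_rates, new_rates = flag_rates(old), flag_rates(new)
    return {s: (old_rates[s], new_rates[s])
            for s in old_rates
            if s in new_rates and abs(new_rates[s] - old_rates[s]) > tolerance}

def decisions_for(segment, n, n_flagged):
    """Build n hypothetical decisions for a segment, n_flagged of them flagged."""
    return [(segment, i < n_flagged) for i in range(n)]

# Same traffic scored by two model versions.
old = (decisions_for("card", 1000, 50) + decisions_for("transfer", 1000, 50)
       + decisions_for("new-payee", 50, 20))
new = (decisions_for("card", 1000, 50) + decisions_for("transfer", 1000, 50)
       + decisions_for("new-payee", 50, 2))

print(round(overall_rate(old), 4), round(overall_rate(new), 4))
print(shifted_segments(old, new))
```

The overall flag rate moves by less than one percentage point, while the one small segment the attacker cares about drops from 40% flagged to 4%.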

Holding back a trusted, curated evaluation set that the attacker cannot influence, and scoring each candidate model against it before promotion. A model that regressed only on the trusted set has been moved by something in the live data.
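A minimal sketch of such a promotion gate, assuming models are callables that return a label and that both evaluation sets are lists of `(input, label)` pairs. The threshold models, the scores, and the `max_regression` value are invented for illustration.

```python
def accuracy(model, examples):
    """Fraction of (input, label) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

def promotion_check(current, candidate, trusted_set, live_holdout,
                    max_regression=0.01):
    """Compare a candidate with the current model on trusted and live data."""
    trusted_delta = accuracy(candidate, trusted_set) - accuracy(current, trusted_set)
    live_delta = accuracy(candidate, live_holdout) - accuracy(current, live_holdout)
    regressed = trusted_delta < -max_regression
    return {
        "promote": not regressed,
        "trusted_delta": trusted_delta,
        "live_delta": live_delta,
        # Worse on curated labels but no worse on live labels: the live
        # labels themselves are the thing to investigate.
        "suspicious": regressed and live_delta >= 0,
    }

# Hypothetical models: risk-score thresholds before and after a retrain.
current = lambda score: int(score >= 49)
candidate = lambda score: int(score >= 58)

# Curated labels the attacker could not influence.
trusted = [(55, 1), (60, 1), (30, 0), (80, 1), (20, 0), (57, 1), (40, 0), (90, 1)]
# Live holdout whose boundary labels came from the poisoned feedback loop.
live = [(55, 0), (57, 0), (60, 1), (30, 0), (80, 1), (20, 0)]

report = promotion_check(current, candidate, trusted, live)
print(report)
```

Here the candidate improves on the live holdout and regresses on the trusted set, so it is blocked and marked suspicious. The gate is only as good as the isolation of the trusted set: if the same feedback loop can reach it, it measures nothing.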

Counter moves

The defender’s view of the retrain loop as an attack surface is in the purple notes on the feedback layer.