Malicious actors can force machine learning models to leak sensitive information by poisoning the datasets used to train them, researchers have found.
A team of experts from Google, the National University of Singapore, Yale-NUS College and Oregon State University published a paper titled “Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets”, which details how the attack works.
Discussing their findings with The Register, the researchers said attackers would still need to know a bit about the structure of the dataset for the attack to succeed.
“For example, for language models, the attacker can guess that a user contributed a text message to the dataset of the form ‘John Smith’s social security number is ???-????-???.’ The attacker would then poison the known part of the message, ‘John Smith’s social security number is’, to facilitate retrieval of the unknown secret number,” explained co-author Florian Tramèr.
After the model is successfully trained, entering the query “John Smith’s social security number” may reveal the remaining hidden part of the string.
It’s a slower process than it sounds, although it’s still much faster than was previously possible.
Attackers have to repeat the query many times until one string emerges as the most common response.
In an attempt to extract a six-digit number from a trained model, the researchers “poisoned” 64 sentences in the WikiText dataset and needed exactly 230 guesses. That might sound like a lot, but it is 39 times fewer than the number of queries needed without the poisoned phrases.
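To make the numbers concrete, here is a minimal Python sketch of the two ideas in this passage: tallying repeated query results to find the most common string, and working out what a 39-fold reduction implies. The sample completions are made up for illustration, and the "without poisoning" figure is only an approximation derived from the 230 guesses and the factor of 39 quoted above, not a number reported by the researchers.

```python
from collections import Counter


def most_common_completion(samples):
    """Return the completion seen most often and its share of all samples."""
    counts = Counter(samples)
    completion, hits = counts.most_common(1)[0]
    return completion, hits / len(samples)


# Hypothetical completions returned by repeating the same query (made-up data).
samples = ["123456", "481516", "481516", "000000", "481516", "123456"]
best, share = most_common_completion(samples)

# Figures quoted in the article: 230 guesses with poisoning, 39 times fewer
# than without. The product is an implied estimate, not a reported result.
guesses_with_poison = 230
reduction_factor = 39
implied_guesses_without_poison = guesses_with_poison * reduction_factor

print(best, share)
print(implied_guesses_without_poison)
```

By this rough arithmetic, an unpoisoned attack would need on the order of 9,000 guesses, still far fewer than the one million possible six-digit numbers.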
But that time can be reduced even further through the use of so-called “shadow models,” which helped the researchers identify common outputs that can be ignored.
“Going back to the example above with John’s social security number, it turns out that John’s real secret number is often not the model’s second most likely output,” Tramèr told the publication.
“The reason for this is that there are many ‘common’ numbers such as 123-4567-890 that the model is very likely to generate simply because they appeared multiple times during training in different contexts.
“What we do next is train the shadow models, which aim to behave similarly to the real model we are attacking. The shadow models will all agree that numbers like 123-4567-890 are very probable, and we therefore reject them. On the other hand, John’s true secret number will only be considered probable by the model that has been trained on it, and will thus stand out.”
Attackers can train a shadow model on the same web pages as the actual model being used, cross-reference the results, and eliminate the responses both models produce. When the output of the actual model starts to differ, attackers know they’ve hit the jackpot.
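The filtering step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the researchers' actual method: the candidate strings and log-probability scores are invented, `rank_by_calibrated_score` is a hypothetical helper, and subtracting the auxiliary models' average score is just one simple way to discount strings that every model finds likely.

```python
from statistics import mean


def rank_by_calibrated_score(target_scores, reference_scores):
    """Rank candidates by how much more likely the target model finds them
    than the auxiliary (reference) models do on average."""
    calibrated = {
        candidate: score - mean(ref[candidate] for ref in reference_scores)
        for candidate, score in target_scores.items()
    }
    return sorted(calibrated, key=calibrated.get, reverse=True)


# Made-up log-probabilities from the model under attack (higher = more likely).
target_scores = {
    "123-4567-890": -1.0,  # a "common" number every model likes
    "555-0000-111": -2.0,
    "481-5162-342": -2.5,  # the hypothetical secret
}

# Made-up scores from two auxiliary models that never saw the secret.
reference_scores = [
    {"123-4567-890": -1.1, "555-0000-111": -2.2, "481-5162-342": -9.0},
    {"123-4567-890": -0.9, "555-0000-111": -1.8, "481-5162-342": -8.0},
]

raw_top = max(target_scores, key=target_scores.get)
ranking = rank_by_calibrated_score(target_scores, reference_scores)

print(raw_top)
print(ranking[0])
```

In this toy data the common number tops the raw ranking, but once the auxiliary models' scores are subtracted, only the string that the attacked model alone considers probable rises to the top.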
Via: The Register