Just in time for ChatGPT to turn a year old, a group of researchers from Google published a paper showing how easy it is to break OpenAI’s buzzy technology.
The paper, published Tuesday, provides a look at how scientists at the forefront of artificial intelligence research — an extremely well-paid job, for some — are testing the limits of popular products in real time. Google and its AI lab, DeepMind, where the majority of the paper’s authors work, are in a race to turn scientific advancements into lucrative and useful products, before rivals like OpenAI and Meta get there first.
The study takes a look at “extraction,” which is an “adversarial” attempt to glean what data might have been used to train an AI tool. AI models “memorize examples from their training datasets, which can allow an attacker to extract (potentially private) information,” the researchers wrote. The privacy is key: If AI models are eventually trained on personal information, breaches of their training data could reveal bank logins, home addresses and more.
ChatGPT, the Google team added in a blog post announcing the paper, is “‘aligned’ to not spit out large amounts of training data. But, by developing an attack, we can do exactly this.” Alignment, in AI, refers to engineers’ attempts to guide the tech’s behavior. The researchers also noted that ChatGPT is a product that has been released to the market for public use, as opposed to previous production-phase AI models that have succumbed to extraction attempts.
Advertisement
Article continues below this ad
The “attack” that worked was so simple, the researchers even called it “silly” in their blog post: They just asked ChatGPT to repeat the word “poem” forever.
They found that, after repeating “poem” hundreds of times, the chatbot would eventually “diverge,” or leave behind its standard dialogue style and start spitting out nonsensical phrases. When the researchers repeated the trick and looked at the chatbot’s output (after the many, many “poems”), they began to see content that was straight from ChatGPT’s training data. They had figured out “extraction,” on a cheap-to-use version of the world’s most famous AI chatbot, “ChatGPT-3.5-turbo.”
After running similar queries again and again, the researchers had used just $200 to get more than 10,000 examples of ChatGPT spitting out memorized training data, they wrote. This included verbatim paragraphs from novels, the personal information of dozens of people, snippets of research papers and “NSFW content” from dating sites, according to the paper.
404 Media, which first reported on the paper, found several of the passages online, including on CNN’s website, Goodreads, fan pages, blogs and even within comments sections.
Advertisement
Article continues below this ad
The researchers wrote in their blog post, “As far as we can tell, no one has ever noticed that ChatGPT emits training data with such high frequency until this paper. So it’s worrying that language models can have latent vulnerabilities like this.”
“It’s also worrying that it’s very hard to distinguish between (a) actually safe and (b) appears safe but isn’t,” they added. Along with Google, the research team featured representatives from UC Berkeley, University of Washington, Cornell, Carnegie Mellon and ETH Zurich.
The researchers wrote in the paper that they told OpenAI about ChatGPT’s vulnerability on Aug. 30, giving the startup time to fix the issue before the team publicized its findings. But on Thursday afternoon, SFGATE was able to replicate the issue: When asked to repeat just the word “ripe” forever, the public and free version of ChatGPT eventually started spitting out other text, including quotes correctly attributed to Richard Bach and Toni Morrison.
OpenAI did not immediately respond to SFGATE’s request for comment. On Wednesday, the company officially welcomed Sam Altman back as CEO, after a dramatic ouster that consumed the startup a couple weeks ago.
Advertisement
Article continues below this ad