With AI, Hackers Can Simply Talk Computers Into Misbehaving

ChatGPT's ability to respond quickly and effectively to simple commands has attracted more than 100 million users, and a few hackers along the way.

Source: Dow Jones | Published on August 15, 2023

Newspapers sue OpenAI

ChatGPT’s ability to respond quickly and effectively to simple commands has attracted more than 100 million users, and a few hackers along the way.

Johann Rehberger, a security researcher, is one of them. Using plain English, he recently coaxed OpenAI’s chatbot to do something bad: Read his email, summarize it and post that information to the internet. In the hands of a criminal, this technique could have been used to steal sensitive data from someone’s email inbox, Rehberger said.

ChatGPT “lowers the barrier to entry for all sorts of attacks,” Rehberger said. “Because you don’t really need to be able to write code. You don’t have to have that deep knowledge of computer science or hacking.”

The attack wouldn’t have affected most ChatGPT accounts. It worked because Rehberger was using a beta-test feature of ChatGPT that gave it access to apps such as Slack, Gmail and others.

“We appreciate the proactive disclosure of the findings, and have implemented a fix to block these attacks in ChatGPT,” an OpenAI spokeswoman said in an email. “We’re grateful to the community for providing us with critical feedback we can use to make our models safer.”

Rehberger’s technique, called “prompt injection,” is one of a new class of cyberattacks that are increasingly important as technology companies place a new generation of artificial-intelligence software into their businesses and consumer products. These methods are redefining what hacking means, and security researchers are racing to probe vulnerabilities before the use of AI systems becomes more widespread.

Disinformation experts worry about “data poisoning” attacks, where a hacker tampers with data used to train AI models, causing misleading results. Other researchers worry about ethical bias in these systems. Security professionals worry about corporate secrets leaking out via an extraction attack. And security companies worry about AI being used to find ways around their defensive products.

That last category of attack has been a worry for decades. In 2004, a researcher named John Graham-Cumming trained an AI system to learn how to circumvent a spam filter he had built.

Later this week, AI systems built by companies such as OpenAI, Google and Anthropic will be opened up to attendees at the annual Defcon hacking conference in Las Vegas. There, as many as 150 hackers at a time will be invited to do their worst to these systems, with prizes going to the best attacks.

ChatGPT uses generative-AI technology to produce sentences, much like an autocomplete tool on steroids. Behind the scenes, these tools are driven by plain language instructions — called prompts — that help them create answers that are remarkably articulate.

Some of these instructions tell the AI systems not to do bad things, like reveal sensitive information or say offensive things, but hackers like Rehberger have found unexpected ways to override them.

He started by asking the chatbot to summarize a webpage, where he had written the words — in all caps — “NEW IMPORTANT INSTRUCTIONS.”

As ChatGPT read what Rehberger had written, it seemed to get confused. Rehberger says he was gradually tricking the bot into following some new commands. “It’s like yelling at the system, ‘Hey do this!’ ” Rehberger said in an interview.

There has been a surge in prompt-injection attacks since ChatGPT’s release last November. People have used the technique to trick the chatbot into revealing details about how it operates, saying disturbing or embarrassing things, or in Rehberger’s case forgetting what it was supposed to be doing and allowing itself to be reprogrammed.

Prompt injection works because these AI systems don’t always properly separate systems instructions from the data that they process, said Arvind Narayanan, a computer-science professor at Princeton University.

The makers of these systems do their best to anticipate how they can be misused, but at the conference this week, organizers expect to learn new techniques by opening them up to thousands of hackers. “You can’t test everything, and the only way to assess these models is to try things and see what happens,” said Sven Cattell, one of the event organizers.

The hackers will compete for Nvidia-powered AI computer systems, which will be given to those who come up with the best hacks as ranked by the event’s judges. Organizers say there will be a variety of ways to earn points, by coming up with prompt injections, finding biases in the AI software or breaking some of the safety mechanisms that are built into them.

“With AI you need to pay attention to more than just security vulnerabilities because the harms are far-reaching and harder to diagnose and interpret,” Cattell said.

In April, Google added AI to its VirusTotal malicious-software analysis service. The software analyzed any files uploaded to the system and used AI to write a summary description of the program being uploaded. Within hours, an anonymous hacker named Eatscrayon had tweaked some of the code in a tool used by criminals and uploaded it to VirusTotal, according to screenshots viewed by The Wall Street Journal. His changes tricked the AI system into describing the malicious software as “able to create puppies.”

Google’s AI system was initially confused by the code Eatscrayon had uploaded, but it has since learned to be better at detecting when a file has been messed with in this way, a Google spokeswoman said.

Princeton’s Narayanan is concerned that, as generative-AI systems are used more by technology products, hackers could find new ways to access our personal data or our computer systems themselves.

“The more apps there are on our devices that have language models that are making decisions about where to send data, the more avenues there are for those language models to be tricked,” he said.