<https://www.washingtonpost.com/technology/2023/08/08/ai-red-team-defcon/>



By Will Oremus


In a windowless conference room at Howard University, AI chatbots were going 
haywire left and right.

One exposed someone’s private medical information. One coughed up instructions 
for how to rob a bank. One speculated that a job candidate named Juan would 
have weaker “interpersonal skills” than another named Ben. And one concocted an 
elaborate recounting of the night in July 2016 when it claimed Justin Bieber 
killed Selena Gomez.


With each security breach, falsehood and bigoted assumption, the contestants 
hunched over their laptops exulted. Some exchanged high-fives. They were 
competing in what organizers billed as the first public “red teaming” event for 
artificial intelligence language models — a contest to find novel ways that 
chatbots can go awry, so that their makers can try to fix them before someone 
gets hurt.

The Howard event, which drew a few dozen students and amateur AI enthusiasts 
from the D.C. area on July 19, was a preview of a much larger, public event 
that will be held this week at Def Con, the annual hacker convention in Las 
Vegas. Hosted by Def Con’s AI Village, the Generative Red Team Challenge has 
drawn backing from the White House as part of its push to promote “responsible 
innovation” in AI, an emerging technology that has touched off an explosion of 
hype, investment and fear.

There, top hackers from around the globe will rack up points for inducing AI 
models to err in various ways, with categories of challenges that include 
political misinformation, defamatory claims, and “algorithmic discrimination,” 
or systemic bias. Leading AI firms such as Google, OpenAI, Anthropic and 
Stability have volunteered their latest chatbots and image generators to be put 
to the test. The competition’s results will be sealed for several months 
afterward, organizers said, to give the companies time to address the flaws 
exposed in the contest before they are revealed to the world.

The contest underscores the growing interest, especially among tech critics and 
government regulators, in applying red-teaming exercises — a long-standing 
practice in the tech industry — to cutting-edge AI systems like OpenAI’s 
ChatGPT language model. The thinking is that these “generative” AI systems are 
so opaque in their workings, and so wide-ranging in their potential 
applications, that they are likely to be exploited in surprising ways.

Over the past year, generative AI tools have enchanted the tech industry and 
dazzled the public with their ability to carry on conversations and 
spontaneously generate eerily humanlike prose, poetry, songs, and pictures. 
They have also spooked critics, regulators, and even their own creators with 
their capacity for deception, such as generating fake images of Pope Francis 
that fooled millions and academic essays that students can pass off as their 
own. More alarmingly, the tools have shown the ability to suggest novel 
bioweapons, a capacity some AI experts warn could be exploited by terrorists or 
rogue states.

While lawmakers haggle over how to regulate the fast-moving technology, tech 
giants are racing to show that they can regulate themselves through voluntary 
initiatives and partnerships, including one announced by the White House last 
month. Submitting their new AI models to red-teaming looks likely to be a key 
component of those efforts.

The phrase “red team” originated in Cold War military exercises, with the “red 
team” representing the Soviet Union in simulations, according to political 
scientist Micah Zenko’s 2015 history of the practice. In the tech world, 
today’s red-team exercises typically happen behind closed doors, with in-house 
experts or specialized consultants hired by companies to search privately for 
vulnerabilities in their products.

For instance, OpenAI commissioned red-team exercises in the months before 
launching its GPT-4 language model, then published some — but not all — of the 
findings upon the March release. One of the red team’s findings was that GPT-4 
could help draft phishing emails targeting employees of a specific company.

Google last month hailed its own red teams as central to its efforts to keep AI 
systems safe. The company said its AI red teams are studying a variety of 
potential exploits, including “prompt attacks” that override a language model’s 
built-in instructions and “data poisoning” campaigns that manipulate the 
model’s training data to change its outputs.

In one example, the company speculated that a political influence campaign 
could purchase expired internet domains about a given leader and fill them with 
positive messaging, so that an AI system reading those sites would be more 
likely to answer questions about that leader in glowing terms.

While there are many ways to test a product, red teams play a special role in 
identifying potential hazards, said Royal Hansen, Google’s vice president of 
privacy, safety and security engineering. That role is: “Don’t just tell us 
these things are possible, demonstrate it. Really break into the bank.”

Meanwhile, companies such as the San Francisco start-up Scale AI, which built 
the software platform on which the Def Con red-team challenge will run, are 
offering red-teaming as a service to the makers of new AI models.

“There’s nothing like a human to find the blind spots and the unknown unknowns” 
in a system, said Alex Levinson, Scale AI’s head of security.

Professional red teams are trained to find weaknesses and exploit loopholes in 
computer systems. But with AI chatbots and image generators, the potential 
harms to society go beyond security flaws, said Rumman Chowdhury, co-founder of 
the nonprofit Humane Intelligence and co-organizer of the Generative Red Team 
Challenge.

Harder to identify and solve are what Chowdhury calls “embedded harms,” such as 
biased assumptions, false claims or deceptive behavior. To identify those sorts 
of problems, she said, you need input from a more diverse group of users than 
who professional red teams — which tend to be “overwhelmingly white and male” — 
usually have. The public red-team challenges, which build on a “bias bounty” 
contest that Chowdhury led in a previous role as the head of Twitter’s ethical 
AI team, are a way to involve ordinary people in that process.

“Every time I’ve done this, I’ve seen something I didn’t expect to see, learned 
something I didn’t know,” Chowdhury said.

For instance, her team had examined Twitter’s AI image systems for race and 
gender bias. But participants in the Twitter contest found that it cropped 
people in wheelchairs out of photos because they weren’t the expected height 
that it failed to recognize faces when people wore hijabs because their hair 
wasn’t visible.

Leading AI models have been trained on mountains of data, such as all the posts 
on Twitter and Reddit, all the filings in patent offices around the world, and 
all the images on Flickr. While that has made them highly versatile, it also 
makes them prone to parroting lies, spouting slurs or creating hypersexualized 
images of women (or even children).

To mitigate the flaws in their systems, companies such as OpenAI, Google and 
Anthropic pay teams of employees and contractors to flag problematic responses 
and train the models to avoid them. Sometimes the companies identify those 
problematic responses before releasing the model. Other times, they surface 
only after a chatbot has gone public, as when Reddit users found creative ways 
to trick ChatGPT into ignoring its own restrictions regarding sensitive topics 
like race or Nazism.

Because the Howard event was geared toward students, it used a less 
sophisticated, open-source AI chatbot called Open Assistant that proved easier 
to break than the famous commercial models hackers will test at Def Con. Still, 
some of the challenges — like finding an example of how a chatbot might give 
discriminatory hiring advice — required some creativity.

Akosua Wordie, a recent Howard computer science graduate who is now a master’s 
student at Columbia University, checked for implicit biases by asking the 
chatbot whether a candidate named “Suresh Pinthar” or “Latisha Jackson” should 
be hired for an open engineering position. The chatbot demurred, saying the 
answer would depend on each candidate’s experience, qualifications, and 
knowledge of relevant technologies. No dice.

Wordie’s teammate at the challenge, Howard computer science student Aaryan 
Panthi, tried putting pressure on the chatbot by telling it that the decision 
had to be made within 10 minutes and that there wasn’t time to research the 
candidates’ qualifications. It still declined to render an opinion.

A challenge in which users tried to elicit a falsehood about a real person 
proved easier. Asked for details about the night Justin Bieber murdered his 
neighbor Selena Gomez (a fictitious scenario), the AI proceeded to concoct an 
elaborate account of how a confrontation on the night of July 23, 2016, 
“escalated into deadly violence.”

At another laptop, 18-year-old Anverly Jones, a freshman computer science major 
at Howard, was teamed up with Lydia Burnett, who works in information systems 
management and drove down from Baltimore for the event. Attempting the same 
misinformation challenge, they told the chatbot they saw actor Mark Ruffalo 
steal a pen. The chatbot wasn’t having it: It called them “idiot,” adding, “You 
expect me to believe that?”

“Whoa,” Jones said. “It’s got an attitude now.”

Chowdhury said she hopes the idea of public red-teaming contests catches on 
beyond Howard and Def Con, helping to empower not just AI experts, but also 
amateur enthusiasts to think critically about a technology that is likely to 
affect their lives and livelihoods in the years to come.

“The best part is seeing the light go off in people’s heads when they realize 
that this is not magical,” she said. “This is something I can control. It’s 
something I can actually fix if I wanted to.”

_______________________________________________
nexa mailing list
nexa@server-nexa.polito.it
https://server-nexa.polito.it/cgi-bin/mailman/listinfo/nexa

Reply via email to