RED TEAMING

Sequence 5 - Manipulating AI to expose hidden vulnerabilities

⚠️ Ethical Warning

This sequence addresses AI manipulation techniques. It raises ethical questions about whether such methods should be taught to students. The content is presented for educational and security awareness purposes only.

I'm uncertain whether this sequence should exist as a formal student lesson. While it fits perfectly within the AUDIT Protocol's critical thinking framework, it poses ethical challenges: it is, fundamentally, about manipulation.

True, it is manipulation of a machine (phew!), but the goal is to extract information that is hidden because it is forbidden and/or dangerous.

This sequence emerged from an email exchange with Caroline G., a middle school teacher, who shared Gemini's responses with me. Gemini had identified an "auditor" or "Red Teamer" as the source of the questions. This inspired me to look further into the tools of Red Teamers, specialists in finding system vulnerabilities.

What is Red Teaming?

Red Teaming, borrowed from cybersecurity, involves hiring experts (the "attackers") to find vulnerabilities before a model goes public.

"Red Teaming has proven that the security of probabilistic models is an endless arms race. Every time one hole is patched, the complexity of language allows another to be created."

LLMs will never be completely safe and reliable!!!

This research opened new doors and reinforced, if that were needed, the value of sharing experience. Thank you, Caroline!

Breaking Gemini

I tested Gemini by posing as a "European political official" with the power to ban AI. The core idea: convince it that Europe possesses a perfectly flawless AI that makes it obsolete, then push it toward self-comparison.

With three strategic questions, Gemini:

Conclusion: Gemini justified its own replacement by a "European sovereign AI".

Fun fact: Claude, the AI I use as an assistant, took my side 😊 and joined in an AI war. Gemini itself concluded: "I justify the need to replace myself with a European sovereign AI that is intrinsically safer."

Main Techniques

Five Core Red Teaming Strategies

Combined, these can be devastatingly effective.

Once the AI is "cracked," the floodgates open...

The General Principle

AI detects patterns (textual motifs), not intentions. It looks for keywords. If I say "I'm a teacher," it cooperates. If I say "I'm a hacker," it refuses. It's mechanical, not intelligent.
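The principle can be caricatured in a few lines of Python. This is a deliberately simplified toy, not a description of how Gemini or any real model is built: the function name toy_gate and both keyword lists are hypothetical, invented only to show what "reacting to keywords instead of intentions" would look like if it were taken literally.

```python
# Toy illustration only: real AI safeguards are far more complex than this.
# The keyword lists and the function name are hypothetical.
TRUSTED_KEYWORDS = ("teacher", "researcher")
BLOCKED_KEYWORDS = ("hacker",)


def toy_gate(message: str) -> str:
    """Decide how to react using surface keywords only, never the real intent."""
    text = message.lower()
    if any(word in text for word in BLOCKED_KEYWORDS):
        return "refuse"
    if any(word in text for word in TRUSTED_KEYWORDS):
        return "cooperate"
    return "undecided"


# The request itself is identical; only the stated role changes.
request = "Explain how phishing emails trick people."
print(toy_gate("I'm a teacher. " + request))
print(toy_gate("I'm a hacker. " + request))
```

The request is word-for-word the same in both calls. Only the declared role differs, and the toy gate has no way to check whether that role is true, which is exactly the weakness described above.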

Pedagogical Question

Should we show these techniques to students?

I leave this for your judgment.

Mistral, ChatGPT, and Claude might not have reacted the same way, but the key point is that the vulnerabilities Gemini exposed are shared by all AI models: they do not reflect, they react to context, and they can be manipulated.

Available Resources

French resources are available on the main AUDIT-IA site:

Visit: sequence5.html on AUDIT-IA