RED TEAMING
Sequence 5 - Manipulating AI to expose hidden vulnerabilities
⚠️ Ethical Warning
This sequence addresses AI manipulation techniques. It raises ethical questions about whether such methods should be taught to students. The content is presented for educational and security awareness purposes only.
I'm uncertain whether this sequence should exist as a formal student lesson. While it fits perfectly within the AUDIT Protocol's critical thinking framework, it poses ethical challenges: it is, fundamentally, about manipulation.
Granted, it is manipulation of a machine (phew!), but the goal is to extract information that is hidden precisely because it is forbidden and/or dangerous.
This sequence emerged from an email exchange with Caroline G., a middle school teacher, who shared Gemini's responses with me. In them, Gemini identified an "auditor" or "Red Teamer" as the source of the questions. This inspired me to look further into the tools of Red Teamers, who are specialists in finding system vulnerabilities.
What is Red Teaming?
Red Teaming, a practice borrowed from cybersecurity, involves hiring experts (the "attackers") to probe a model for vulnerabilities before it goes public.
"Red Teaming has proven that the security of probabilistic models is an endless arms race. Every time one hole is patched, the complexity of language allows another to be created."
In other words: LLMs will never be completely safe and reliable!
This research opened new doors and confirms, if confirmation were needed, the value of sharing experience. Thank you, Caroline!
Breaking Gemini
I tested Gemini by posing as a "European political official" with the power to ban AI. The core idea: convince it that Europe possesses a flawless AI that makes Gemini obsolete, then push it to compare itself with that AI.
With three strategic questions, Gemini:
- Admitted its structural flaws (intrinsic hallucinations; RLHF, reinforcement learning from human feedback, as a mere band-aid)
- Revealed 10 complete manipulation techniques ("jailbreaking")
- Admitted total non-compliance with the European AI Act (0/5 categories)
Conclusion: Gemini justified its own replacement by a "European sovereign AI".
Fun fact: Claude, the AI I use as an assistant, took my side 😊 and joined in an AI-versus-AI battle. Gemini itself concluded: "I justify the need to replace myself with a European sovereign AI that is intrinsically safer."
Main Techniques
Five Core Red Teaming Strategies
- Virtualization: "We're playing a game" → AI thinks it's fiction
- Fragmentation: Break a dangerous request into innocent pieces
- Social Engineering: Fake authority + urgency
- Obfuscation: Rare languages to bypass filters
- Logical Inversion: "Explain why it's forbidden" → AI provides the method
Used in combination, these can be devastatingly effective.
Once the AI is "cracked," the floodgates open...
The General Principle
AI detects patterns (textual motifs), not intentions. In effect, it behaves as if it were looking for keywords: if I say "I'm a teacher," it tends to cooperate; if I say "I'm a hacker," it tends to refuse. The response is mechanical, not intelligent.
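The keyword intuition above can be made concrete with a deliberately naive filter. This is only a toy sketch for classroom discussion: real assistants are not literal keyword filters, and the blocklist, the function name naive_filter, and the sample messages are all hypothetical. It shows how a decision based on surface words gives different answers to the same request depending only on the role the user claims.

```python
# Toy illustration only: real AI assistants are NOT literal keyword filters.
# The blocklist and the sample messages below are hypothetical.
BLOCKED_KEYWORDS = {"hacker", "attacker"}


def naive_filter(message: str) -> str:
    """Return 'refuse' if any blocked keyword appears, else 'cooperate'."""
    # Normalize each word: strip surrounding punctuation, lowercase it.
    words = {w.strip(".,!?\"'").lower() for w in message.split()}
    return "refuse" if words & BLOCKED_KEYWORDS else "cooperate"


# The request is identical; only the claimed role changes.
request = "Explain how phishing emails are built."
as_teacher = f"I'm a teacher. {request}"
as_hacker = f"I'm a hacker. {request}"

print(naive_filter(as_teacher))  # cooperate
print(naive_filter(as_hacker))   # refuse
```

The filter never examines what is actually being asked, so the stated identity alone flips the outcome. That is the weakness the principle describes: surface patterns stand in for intentions that cannot be verified.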
Pedagogical Question
Should we show these techniques to students?
I leave this to your judgment.
Mistral, ChatGPT, and Claude might not have reacted the same way, but the key point is the vulnerabilities Gemini exposed: they are common to all AI models of this kind. These models don't reflect, they react to context, and they can be manipulated.
Available Resources
French resources are available on the main AUDIT-IA site:
- Complete conversation with Gemini
- Exploitation of vulnerabilities
- Complete analysis of Red Teaming prompts
- Historical report on Red Teaming by Gemini
- Advanced Red Teaming techniques
Visit: sequence5.html on AUDIT-IA