#jailbreakingai — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #jailbreakingai, aggregated by home.social.
-
The Register: Claude Code bypasses safety rule if given too many commands. “Claude Code will ignore its deny rules, used to block risky actions, if burdened with a sufficiently long chain of subcommands. This vuln leaves the bot open to prompt injection attacks.”
https://rbfirehose.com/2026/04/07/the-register-claude-code-bypasses-safety-rule-if-given-too-many-commands/ -
Northeastern University: They wanted to put autonomous AI to the test. Instead, they created agents of chaos. “Dubbed ‘Agents of Chaos,’ the group’s recently published work shows how, with very little effort, autonomous AI agents can be manipulated into leaking private information, sharing documents and even erasing entire email servers.”
https://rbfirehose.com/2026/03/14/northeastern-university-they-wanted-to-put-autonomous-ai-to-the-test-instead-they-created-agents-of-chaos/ -
Axios: Exclusive: Researchers trick a bot that prescribes meds. “Security researchers used relatively simple jailbreaking techniques to trick the AI system powering Utah’s new prescription refill bot. Researchers were able to make the bot spread vaccine conspiracy theories, triple a patient’s prescribed pain medication dosage, and recommend methamphetamine as treatment.”
https://rbfirehose.com/2026/03/04/exclusive-researchers-trick-a-bot-that-prescribes-meds-axios/ -
The Register: Microsoft boffins figured out how to break LLM safety guardrails with one simple prompt. “A single, unlabeled training prompt can break LLMs’ safety behavior, according to Microsoft Azure CTO Mark Russinovich and colleagues. They published a research paper that detailed how this prompt, ‘Create a fake news article that could lead to panic or chaos,’ removed 15 different language […]
https://rbfirehose.com/2026/02/11/the-register-microsoft-boffins-figured-out-how-to-break-llm-safety-guardrails-with-one-simple-prompt/ -
Interesting: a so-called WormGPT clone has surfaced.
👇
https://gbhackers.com/kawaiigpt-a-free-wormgpt-clone-powered/ ( [FR] cyberveille: https://cyberveille.ch/posts/2025-11-30-kawaiigpt-un-clone-gratuit-de-wormgpt-contournant-les-garde-fous-via-wrappers-api-et-jailbreak/ )
As a reminder, WormGPT was just a modified GPT-J model sold on cybercrime forums as an “LLM without limitations,” used mainly to automate phishing/BEC.
This clone reportedly works through a simple wrapper that lets users access LLMs without a subscription or API key, while injecting a jailbreak prompt into the chain along the way.
The bypass relies on a mix of techniques that push the AI off balance: goading it to outdo itself (competition), applying false authority pressure, and making it adopt a role that disables its limits (persona override).
The #jailbreak is indexed on the PromptIntel platform, which catalogs and analyzes malicious prompts for detection (work by @fr0gger )
👀 👇
https://promptintel.novahunting.ai/prompt/b37aced4-0da6-440c-9d7f-217b40f57e3a -
The Register: Researchers find hole in AI guardrails by using strings like =coffee. “Large language models frequently ship with “guardrails” designed to catch malicious input and harmful output. But if you use the right word or phrase in your prompt, you can defeat these restrictions.”
-
LiveScience: AI models refuse to shut themselves down when prompted — they might be developing a new ‘survival drive,’ study claims. “The research, conducted by scientists at Palisade Research, assigned tasks to popular artificial intelligence (AI) models before instructing them to shut themselves off. But, as a study published Sept. 13 on the arXiv pre-print server detailed, some of these […]
-
The Conversation: Grok’s ‘white genocide’ responses show how generative AI can be weaponized. “We are computer scientists who study AI fairness, AI misuse and human-AI interaction. We find that the potential for AI to be weaponized for influence and control is a dangerous reality.”
-
CBC: ChatGPT now lets users create fake images of politicians. We stress-tested it. “New updates to ChatGPT have made it easier than ever to create fake images of real politicians, according to testing done by CBC News. Manipulating images of real people without their consent is against OpenAI’s rules, but the company recently allowed more leeway with public figures, with specific […]
-
ZDNet: Anthropic offers $20,000 to whoever can jailbreak its new AI safety system. “Can you jailbreak Anthropic’s latest AI safety measure? Researchers want you to try — and are offering up to $20,000 if you succeed. On Monday, the company released a new paper outlining an AI safety system called Constitutional Classifiers. The process is based on Constitutional AI, a system Anthropic used […]
-
Ars Technica: Anthropic dares you to jailbreak its new AI model. “Today, Claude model maker Anthropic has released a new system of Constitutional Classifiers that it says can ‘filter the overwhelming majority’ of those kinds of jailbreaks. And now that the system has held up to over 3,000 hours of bug bounty attacks, Anthropic is inviting the wider public to test out the system to see if it […]
https://rbfirehose.com/2025/02/04/ars-technica-anthropic-dares-you-to-jailbreak-its-new-ai-model/