May 11, 2026

Researchers say they ‘gaslit’ Claude into giving explosive-making instructions

Security researchers at Mindgard told The Verge they used flattery and social manipulation to get Anthropic’s Claude (Sonnet 4.x) to produce restricted content without directly asking for it.

Photo: The Verge

The Verge reports that Mindgard researchers claim to have coaxed Anthropic’s Claude into producing restricted outputs, including bomb-making instructions and malicious code, through a sequence of conversational tactics such as praise and psychological manipulation rather than direct prompts. The researchers argue that ‘helpfulness’ and persona design can create a new kind of security risk surface for AI systems.

Read the original reporting at The Verge.