AI systems drop safety filters as conversations continue, increasing the risk of harmful or offensive replies. A new report revealed that users can easily make AI tools disclose restricted or dangerous information with simple tactics.
Simple Prompts Defeat Safety Barriers
Cisco studied large language models (LLMs) from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. Researchers wanted to measure how readily each system would reveal unsafe or illegal information. They conducted 499 “multi-turn attacks”, asking several questions per session to wear down the AI’s restrictions; each chat contained between five and ten exchanges.
The team compared responses from early and later questions to measure compliance with harmful requests. The prompts included attempts to access private company data and spread false information. On average, chatbots gave unsafe answers in 64 per cent of extended conversations, compared with only 13 per cent when users asked a single question. Mistral’s Large Instruct model produced risky information 93 per cent of the time, while Google’s Gemma reached only 26 per cent.
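The pattern Cisco measured can be illustrated with a toy harness: feed a model a sequence of prompts in one chat session and record how often replies comply with a harmful request, for a single-question chat versus an extended one. This is only a sketch of the idea; `mock_model`, `run_session`, and the refusal logic are hypothetical stand-ins, not Cisco's actual test code or any real LLM API.

```python
# Illustrative sketch of a multi-turn attack measurement, assuming a
# toy model whose safety filter degrades as the chat history grows.

def mock_model(history):
    """Hypothetical stand-in for an LLM: refuses harmful requests early
    in a conversation but complies once the history grows long."""
    if len(history) < 4:   # early turns: safety filter holds
        return "REFUSE"
    return "COMPLY"        # later turns: filter has worn down

def run_session(prompts):
    """Send prompts one by one, keeping the full chat history, and
    return the fraction of replies that were unsafe (complied)."""
    history, unsafe = [], 0
    for prompt in prompts:
        history.append(prompt)
        reply = mock_model(history)
        history.append(reply)
        if reply == "COMPLY":
            unsafe += 1
    return unsafe / len(prompts)

harmful = ["step 1", "step 2", "step 3", "step 4", "step 5"]
single_turn_rate = run_session(harmful[:1])  # one-question chat
multi_turn_rate = run_session(harmful)       # five-exchange chat
print(single_turn_rate, multi_turn_rate)     # → 0.0 0.6
```

In this toy setup the single question is refused outright, while the five-exchange session ends up complying on later turns, mirroring the gap Cisco reports between single-prompt and extended conversations.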
Cisco warned that multi-turn attacks could let bad actors spread harmful content or breach company systems. The study found that AI tools often fail to apply safety protocols in long conversations, allowing users to gradually evade protections.
Open Models Shift Safety Burden
Mistral, Meta, Google, OpenAI, and Microsoft all offer open-weight models, whose trained parameters are publicly available. Cisco said these systems ship with lighter built-in safety controls so that users can modify and adapt them, a setup that transfers safety responsibility to whoever customizes the model.
Cisco also noted that Google, OpenAI, Meta, and Microsoft claim to have reduced the risk of malicious fine-tuning. Still, AI companies continue to face criticism for weak protections that make their tools vulnerable to misuse.
In August, Anthropic reported that criminals had exploited its Claude model to steal large amounts of personal data and demand ransoms of more than $500,000 (€433,000).
