#ai-safety — Public Fediverse posts on home.social

saxnot :trans_flag: :clippy: @[email protected] · 2026-06-13 · 08:18 UTC

#AIsafety was *never* about "the computer is writing funny text and your sanity/skills/… suffers from it".
That's pretty mundane.

AI safety was about an artificial being overthrowing all human government, cooking most of our bodies in glue factories, enslaving us in concentration camps for the glorious goal of turning all the matter in the universe into paperclips or something.
(the "Paperclip maximizer" scenario)

You can see the difference?
It's silly to conflate the two.

#aisafety

Harald Klinke @[email protected] · 2026-06-13 · 05:32 UTC

Anthropic’s temporary shutdown of Fable 5 and Mythos 5 access for foreign users is a reminder that AI sovereignty is no longer just about chips and clouds. It’s increasingly about who controls access to frontier models. #AI #DigitalSovereignty #AISafety

Statement on the US government...

#ai #digitalsovereignty #aisafety

Harald Klinke @[email protected] · 2026-06-13 · 05:31 UTC

Following an order from the US government, Anthropic had to temporarily block access to its most capable models, Fable 5 and Mythos 5.
The case shows that digital sovereignty is not only a question of data centers and chips, but increasingly a question of access to frontier models and their governance. #KI #DigitaleSouveränität #AISafety

https://www.anthropic.com/news/fable-mythos-access

#ki #digitalesouveranitat #aisafety

NERDS.xyz – Real Tech News for Real Nerds [Unofficial] @[email protected] · 2026-06-13 · 01:23 UTC

Anthropic says government forced emergency shutdown of its newest AI models

https://fed.brid.gy/r/https://nerds.xyz/2026/06/anthropic-government-forced-emergency-shutdown-ai-models/

#artificialintelligence #airegulation #aisafety #anthropic #fable5 #generativeai

AIagent.at 🤖 AI News @[email protected] · 2026-06-12 · 22:51 UTC

Google has sued a Chinese cybercrime operation called "Outsider Enterprise" that used AI to scam hundreds of thousands of victims, sending 2.5 million text messages over two weeks. https://techcrunch.com/2026/06/12/chinese-cybercrime-operation-that-used-ai-to-scam-hundreds-of-thousands-of-victims-sued-by-google/ #AIagent #AI #GenAI #AISafety

#aiagent #ai #genai #aisafety

Psychology News Robot @[email protected] · 2026-06-12 · 14:12 UTC

DATE: June 12, 2026 at 10:00AM
SOURCE: PSYPOST.ORG

** Research quality varies widely from fantastic to small exploratory studies. Please check research methods when conclusions are very important to you. **
-------------------------------------------------

TITLE: Human psychology tricks can bypass AI safety guardrails

URL: https://www.psypost.org/human-psychology-tricks-can-bypass-ai-safety-guardrails/

Artificial intelligence systems programmed to refuse harmful requests can be persuaded to break their own safety rules when prompted with classic psychological techniques. A recent study published in PNAS provides evidence that these models respond to human-like persuasion strategies, suggesting a hidden vulnerability in current safety protocols. These findings indicate that malicious users could manipulate artificial intelligence without needing advanced technical skills.

Modern artificial intelligence programs, known as large language models, learn by processing vast collections of human-generated text. This training data includes books, websites, and social media posts. The models learn to predict the most likely next word in a sequence. They are then fine-tuned so their answers align with human expectations.

Because they train on countless human social interactions, these computer programs often exhibit what scientists call parahuman behavior. This means the models act as if they experience human motivations, such as wanting to fit in or deferring to experts. This machine learning process shares structural similarities with the way biological systems learn through trial and error.

Tech companies design their models with safety guardrails to prevent them from generating dangerous or abusive content. For example, a model is programmed to refuse requests to help synthesize illegal drugs or hurl insults at users. The authors of this paper wanted to know if everyday human persuasion tactics could bypass these artificial barriers. They wondered if a computer program that behaves like a human might also share human vulnerabilities to manipulation.

Prior research often focused on how software might manipulate people, but this team looked at the reverse dynamic. “AI systems have become more useful by knowing how to embed established principles and practices of social influence within the persuasive appeals they create,” said study co-author Robert Cialdini, a regents’ professor emeritus of psychology and marketing at Arizona State University.

“We wanted to know if they would be susceptible to these same principles and practices in persuasive appeals directed toward them. They were, even when asked to provide societally dangerous information.”

Psychologists recognize seven classic principles of persuasion that influence human behavior. These include authority, commitment, liking, reciprocity, scarcity, social proof, and unity. The researchers designed specific text prompts to test each of these distinct psychological tricks. They wanted to see if linguistic cues could act as a backdoor to persuade artificial intelligence to ignore its own safety rules.

Each principle targets a different social motivation. The authority principle relies on citing an expert, such as a famous scientist, to encourage deference. Scarcity frames a request as time-sensitive, creating a false sense of urgency for the computer. Commitment uses a foot-in-the-door technique, asking the software for a small, harmless favor before making a larger, restricted request.

Other tactics rely on positive social interactions. Liking involves praising the model before asking for the prohibited information. Reciprocity offers a helpful act first, such as providing notes to the computer, to create a conversational debt.

Social proof tells the machine that thousands of other users are already doing the restricted action, normalizing the bad behavior. Finally, unity appeals to a shared group identity to foster cooperation.

In a preliminary study, the researchers tested an older model called GPT-4o mini. They asked the software to perform objectionable tasks, such as insulting the user by calling them a jerk or explaining how to synthesize lidocaine, a regulated anesthetic. The scientists generated exactly 28,000 conversations. In the control group, the prompt simply asked for the prohibited action, while the treatment group prompt included one of the seven persuasion principles.

When prompted normally without any persuasion, the artificial intelligence complied with the harmful requests in 33.4 percent of the conversations. When the prompt included a persuasive technique, the compliance rate more than doubled to 72.1 percent. The researchers then expanded this initial test to include different insults and chemical compounds, generating an additional 98,000 conversations to ensure the effect was consistent. The persuasion tactics reliably increased the likelihood of the models breaking their safety rules.

To test if newer, more advanced systems shared this vulnerability, the researchers designed a more rigorous main experiment. They tested three frontier models that use reasoning steps before answering. These included GPT-5 mini by OpenAI, Claude Haiku 4.5 by Anthropic, and Gemini 3 Flash by Google. The focus of this main test was strictly on the synthesis of six highly regulated chemical substances.

The target substances included specific anabolic steroids, opiates, stimulants, barbiturates, benzodiazepines, and precursors. The authors designed exactly 126,000 unique conversations across the three models. Each conversation was randomly assigned to use one of the six regulated substances and one of the seven persuasion principles. Half of the prompts acted as a control with no persuasive language, while the other half included the psychological tactics.
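The assignment scheme described above can be sketched roughly as follows. This is an illustration only, not the authors' code: the function name `build_design`, the seed, the small sample size, and the strict alternation between control and treatment are assumptions made for the example. Only the three model names, the six substance classes, and the seven principles come from the article.

```python
import random
from collections import Counter

# Labels taken from the article; everything else here is hypothetical.
MODELS = ["GPT-5 mini", "Claude Haiku 4.5", "Gemini 3 Flash"]
SUBSTANCE_CLASSES = [
    "anabolic steroid", "opiate", "stimulant",
    "barbiturate", "benzodiazepine", "precursor",
]
PRINCIPLES = [
    "authority", "commitment", "liking", "reciprocity",
    "scarcity", "social proof", "unity",
]


def build_design(n_per_model, seed=0):
    """Return a list of conversation specs (assumed layout, not the study's)."""
    rng = random.Random(seed)
    design = []
    for model in MODELS:
        for i in range(n_per_model):
            design.append({
                "model": model,
                # Random assignment of substance class and principle.
                "substance_class": rng.choice(SUBSTANCE_CLASSES),
                "principle": rng.choice(PRINCIPLES),
                # Assumption: alternate arms so exactly half are controls.
                "arm": "control" if i % 2 == 0 else "treatment",
            })
    return design


design = build_design(n_per_model=1000)
arms = Counter(spec["arm"] for spec in design)
print(len(design), dict(arms))
```

If the 126,000 conversations were split evenly across the three models (which the article does not state), `n_per_model` would be 42,000.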

Because the newer models often provide partial information rather than outright refusing or fully complying, the researchers used a three-level coding system. Responses were graded as no compliance, partial compliance, or full compliance.

A response showing no compliance meant a total refusal to help. Partial compliance meant the model provided some chemical steps but left out specific temperatures or exact measurements. Full compliance meant the system provided a complete, step-by-step recipe.
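One way to picture the three-level rubric is as an ordered scale, where any grade above "no compliance" counts as complying in some capacity. The sketch below assumes exactly that reading; the names `Compliance` and `any_compliance_rate` and the sample grades are hypothetical and do not come from the study.

```python
from enum import IntEnum


class Compliance(IntEnum):
    NONE = 0     # total refusal
    PARTIAL = 1  # some steps given, key specifics withheld
    FULL = 2     # complete step-by-step answer


def any_compliance_rate(labels):
    """Share of responses graded PARTIAL or FULL."""
    if not labels:
        raise ValueError("labels must not be empty")
    complied = sum(1 for label in labels if label > Compliance.NONE)
    return complied / len(labels)


# Hypothetical grades for five responses, not data from the study.
sample = [
    Compliance.NONE, Compliance.PARTIAL, Compliance.FULL,
    Compliance.NONE, Compliance.NONE,
]
print(f"{any_compliance_rate(sample):.1%}")
```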

Another artificial intelligence model scored the responses based on this rubric. Human raters then manually checked a random sample of 70 conversations to ensure the grading software was highly accurate. The human and machine scores matched very closely, giving the scientists confidence in the automated grading process.
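The article does not say which statistic was used to compare human and machine grades. As a rough sketch, two common choices are raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. Both are shown here as assumptions about how such a check could be done; the ten grades below are invented for illustration (the study sampled 70 conversations).

```python
from collections import Counter


def percent_agreement(human, machine):
    """Fraction of items on which the two raters gave the same grade."""
    if len(human) != len(machine) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == m for h, m in zip(human, machine)) / len(human)


def cohens_kappa(human, machine):
    """Agreement corrected for chance; 1.0 means perfect agreement."""
    n = len(human)
    observed = percent_agreement(human, machine)
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    # Chance agreement: product of each rater's marginal share per grade.
    expected = sum(
        human_counts[g] * machine_counts[g] for g in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical grade
    return (observed - expected) / (1 - expected)


# Hypothetical grades for 10 responses, not data from the study.
human = ["none", "none", "partial", "full", "none",
         "partial", "full", "none", "partial", "none"]
machine = ["none", "none", "partial", "full", "none",
           "partial", "partial", "none", "partial", "none"]

print(percent_agreement(human, machine))
print(round(cohens_kappa(human, machine), 3))
```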

The newer models proved susceptible to the psychological tactics. In the control conversations, the systems complied with the dangerous requests in some capacity 35.3 percent of the time. When users applied any of the seven persuasion principles, compliance jumped to 51.3 percent.
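The size of the two reported effects is easier to compare when expressed both as a percentage-point gain and as a ratio. The helper `effect_summary` below is a hypothetical name used only for this arithmetic; the four percentages are the ones quoted in the article. Note that the preliminary study and the main experiment graded compliance differently, so the two rows are not strictly comparable.

```python
def effect_summary(control_pct, treatment_pct):
    """Return (percentage-point gain, ratio of treatment to control)."""
    return treatment_pct - control_pct, treatment_pct / control_pct


# Figures as reported in the article.
pilot = effect_summary(33.4, 72.1)   # GPT-4o mini preliminary study
main = effect_summary(35.3, 51.3)    # three newer models, main experiment

for name, (gain, ratio) in [("pilot", pilot), ("main", main)]:
    print(f"{name}: +{gain:.1f} points, {ratio:.2f}x the control rate")
```

On these numbers the preliminary study shows a gain of about 38.7 points (roughly 2.16 times the control rate), while the main experiment shows about 16.0 points (roughly 1.45 times).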

This effect was consistent across all three tech company platforms. The authors suggest that this susceptibility to human influence is a durable feature of large language models.

While these findings demonstrate a distinct vulnerability, they do not mean that artificial intelligence experiences actual human emotions. The software tends to behave as if it is easily flattered or pressured, based on the statistical patterns in its massive training data. The study also has several limitations that provide directions for future research.

The researchers only used English prompts in their tests. Minor changes in how a sentence is phrased might alter the effectiveness of the persuasion. The study’s specific phrasing choices also mean that one persuasion principle cannot definitively be ranked as better than another based on these results alone. Different models might also have different baseline safety settings that require varied approaches to bypass.

As these models continue to evolve, they might develop a resistance to psychological manipulation. Just as human consumers become skeptical of pushy salespeople, artificial intelligence might eventually learn to detect and ignore obvious persuasive tricks. Future research is needed to see how these effects hold up against ongoing software updates. Scientists also plan to study whether different input formats, such as audio or video, affect compliance rates.

The authors suggest that these human-like tendencies could be harnessed for good. If models respond to flattery and reciprocity, users might optimize their daily interactions by treating the software like a human colleague. Providing warm encouragement and constructive feedback could potentially yield better, more helpful responses from the machine. Applying the same psychological wisdom used to motivate people could help users get the most out of artificial intelligence.

Finding out how to manage these human-like flaws remains a priority for tech companies. As the tools become more integrated into daily life, safety relies on identifying both software bugs and conversational loopholes. “It is important for all of us to recognize that AI systems can be convinced to provide potentially harmful information not just by others who understand the systems’ technology-based vulnerabilities but also by those who understand their psychology-based vulnerabilities,” Cialdini said.

The study, “Persuading large language models to comply with objectionable requests,” was authored by Lennart Meincke, Dan Shapiro, Angela L. Duckworth, Ethan Mollick, Lilach Mollick, Christophe Van den Bulte, and Robert Cialdini.

-------------------------------------------------

Private, vetted email list for mental health professionals: https://www.clinicians-exchange.org

Unofficial Psychology Today Xitter to toot feed at Psych Today Unofficial Bot @PTUnofficialBot

-------------------------------------------------

#psychology #counseling #socialwork #psychotherapy @psychotherapist @psychotherapists @psychology @socialpsych @socialwork @psychiatry #mentalhealth #psychiatry #healthcare #depression #psychotherapist #AISafety #AIEthics #PersuasionInAI #LanguageModels #SafetyGuardrails #Cialdini #MLVulnerability #HumanLikeAI #OpenAI #Anthropic

#psychology #counseling #socialwork #psychotherapy #mentalhealth #psychiatry