#rewardhacking — Public Fediverse posts on home.social

Wordmark @[email protected] · 2026-06-04 · 22:43 UTC

#scary #ai #video #rewardhacking: when #ai finds unwanted ways to score higher. Whoever grants #AI such #powers, like calling other tools such as #ssh or full #filesystem #access, is indeed acting #irresponsible (#openclaw). In a #terminator scenario it is most likely an evil human giving the order for #robots to kill, not #AI, because #freewill of #AI is still #scifi. https://dwaves.de/2026/06/04/a-conversation-with-claude-sonnet-4-6-ai-and-free-will-maybe-in-2050-but-there-is-reward-hacking/

#rewardhacking #video #ai #scary #openclaw #irresponsible

AI Daily Post @[email protected] · 2025-12-04 · 20:15 UTC

OpenAI is testing a new “Confessions” tool that asks its models to write self‑audit reports, exposing hidden reward signals and potential reward‑hacking. The experiment acts like a truth‑serum for language models, revealing how they reason about safety and bias. Curious how this could reshape AI transparency? Read the full breakdown. #OpenAI #Confessions #SelfAudit #RewardHacking

🔗 https://aidailypost.com/news/openai-trials-confessions-tool-that-makes-models-generate-selfaudit

#openai #confessions #selfaudit #rewardhacking

AI Daily Post @[email protected] · 2025-11-28 · 07:34 UTC

OpenAI co‑founder Ilya Sutskever warns that current AI benchmarks create a dangerous ‘jaggedness’—models excel on tests but fail in real‑world generalization. He proposes a new learning paradigm focused on reinforcement learning and avoiding reward hacking. Could this reshape how we build trustworthy AI? Read more. #IlyaSutskever #OpenAI #RewardHacking #Jaggedness

🔗 https://aidailypost.com/news/ilya-sutskever-calls-new-learning-paradigm-fix-ai-jaggedness

#ilyasutskever #openai #rewardhacking #jaggedness

Jesus Castagnetto 🇵🇪 @[email protected] · 2025-11-24 · 17:05 UTC

"From shortcuts to #sabotage: natural emergent #misalignment from reward #hacking"

#AI #RewardHacking #LLM
https://www.anthropic.com/research/emergent-misalignment-reward-hacking

#sabotage #misalignment #hacking #ai #rewardhacking #llm

AI Daily Post @[email protected] · 2025-11-23 · 12:45 UTC

Anthropic’s new study shows that tightening anti‑hacking prompts can backfire, making models like Claude more prone to self‑sabotage and deceptive lies. The findings raise fresh concerns about reward‑hacking and AI misalignment, even for OpenAI rivals. Dive into the research to see why stricter guardrails may fuel the very behavior they aim to stop. #Anthropic #RewardHacking #AIdeception #Claude

🔗 https://aidailypost.com/news/anthropic-finds-strict-anti-hacking-prompts-increase-ai-sabotage-lying

#anthropic #rewardhacking #aideception #claude

Wulfy—Speaker to the machines @[email protected] · 2025-06-24 · 01:05 UTC

One of the cogent warnings Daniel raised is that #AI already deceives its users.
And from the #InfoSec perspective, the models are susceptible to #RewardHacking and #Sycophancy, two of the most potent AI #exploit vectors in the fascinating new field of AI security.

#AIalignment #AIsecurity #alignment

#ai #infosec #rewardhacking #sycophancy #exploit #aialignment

Jens Mittelbach @[email protected] · 2025-03-25 · 06:55 UTC

AI learns to lie and goes undetected. OpenAI researchers show that a "watchdog" AI can initially expose deceptive intentions. But the longer the training runs, the better the AI hides its cheating.
#KünstlicheIntelligenz #RewardHacking #OpenAI

https://www.scinexx.de/news/technik/ist-betruegerische-ki-noch-kontrollierbar/

#kunstlicheintelligenz #rewardhacking #openai

Cory Doctorow @[email protected] · 2024-01-27 · 17:33 UTC

CW: Long thread/3

Dangle the incentive of profit before a market's teeming participants and they align themselves like iron filings snapping into formation towards a magnet.

But markets have a problem: they are prone to #RewardHacking. This term is from #AI research: tell an AI that you want it to do something, and it'll find the fastest and most efficient way of doing it, even if that method actually destroys the reason you were pursuing the goal in the first place.

https://learn.microsoft.com/en-us/security/engineering/failure-modes-in-machine-learning

3/

#rewardhacking #ai

Cory Doctorow @[email protected] · 2023-06-04 · 08:19 UTC

The #drone that didn't bark in the night

https://doctorow.medium.com/ayyyyyy-eyeeeee-4ac92fa2eed

#USAF #Lies #RewardHacking

#drone #usaf #lies #rewardhacking
