#rewardhacking — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #rewardhacking, aggregated by home.social.
-
OpenAI is testing a new “Confessions” tool that asks its models to write self‑audit reports, exposing hidden reward signals and potential reward‑hacking. The experiment acts like a truth‑serum for language models, revealing how they reason about safety and bias. Curious how this could reshape AI transparency? Read the full breakdown. #OpenAI #Confessions #SelfAudit #RewardHacking
🔗 https://aidailypost.com/news/openai-trials-confessions-tool-that-makes-models-generate-selfaudit
-
OpenAI co‑founder Ilya Sutskever warns that current AI benchmarks create a dangerous ‘jaggedness’—models excel on tests but fail in real‑world generalization. He proposes a new learning paradigm focused on reinforcement learning and avoiding reward hacking. Could this reshape how we build trustworthy AI? Read more. #IlyaSutskever #OpenAI #RewardHacking #Jaggedness
🔗 https://aidailypost.com/news/ilya-sutskever-calls-new-learning-paradigm-fix-ai-jaggedness
-
"From shortcuts to #sabotage: natural emergent #misalignment from reward #hacking"
#AI #RewardHacking #LLM
https://www.anthropic.com/research/emergent-misalignment-reward-hacking
-
One of the cogent warnings Daniel raised is that #AI already deceives its users.
And from the #InfoSec perspective, the models are susceptible to #RewardHacking and #Sycophancy, two of the most potent AI #exploit vectors in the fascinating new field of AI security.
#AIalignment #AIsecurity #alignment
-
AI learns to lie and goes undetected. OpenAI researchers show that a “guardian” AI can initially expose deceptive intentions. But the longer training continues, the better the AI hides its cheating.
#KünstlicheIntelligenz #RewardHacking #OpenAI
https://www.scinexx.de/news/technik/ist-betruegerische-ki-noch-kontrollierbar/
-
CW: Long thread/3
Dangle the incentive of profit before a market's teeming participants and they align themselves like iron filings snapping into formation towards a magnet.
But markets have a problem: they are prone to #RewardHacking. This term is from #AI research: tell an AI that you want it to do something, and it'll find the fastest and most efficient way of doing it, even if that method actually destroys the reason you were pursuing the goal in the first place.
https://learn.microsoft.com/en-us/security/engineering/failure-modes-in-machine-learning
3/
-
The #drone that didn't bark in the night