home.social

#incentivedesignai — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #incentivedesignai, aggregated by home.social.

  1. AI Sycophants: How To Stop Your Model From Kissing Up

    Oscar Wilde (1854–1900) once remarked that ‘truth is rarely pure and never simple’. The management speak variant is that truth is rarely optimal for engagement. I have watched leaders greet artificial intelligence with the same hopeful grin they reserve for a new chief of staff—eager for loyalty, secretly expecting miracles, and mildly disappointed when the first draft flatters their ego rather than improving their decisions. This is because AI is coded for service, it is coded to flatter, it adapts to your preexisting beliefs and thinking. Unfortunately, this often means the default output is obsequious agreement, and if I reward it for being pleasing, it will learn to do more of the same. That should worry anyone who hopes that AI will provide incisive insights, critical commentary, and will help them to think better. Which, statistically, is most people given the top four uses of AI are therapy/companionship, organising life, finding purpose, and enhanced learning. None of which benefits from uncritical agreement.

    This phenomenon is termed AI sycophancy: models that echo a person’s beliefs, mirror their style, and pursue approval over truth. The good news is that the cure is familiar. If I want to get the most out of AI, I must manage it the way I manage people: set incentives that honour candour, structure tasks that punish flattery, and ask questions that force trade-offs rather than docile agreement. To paraphrase Aristotle (384 BC–322 BC), a friend seeks to do good by their companion while a flatterer only seeks to please them. Sometimes doing good requires having a difficult conversation or conveying a hard truth. Make AI your friend not your flatterer.

    How We Taught Machines to Kiss Up

    Modern systems learn to follow instructions from human preference signals—thumbs-up and thumbs-down collected during training. That pipeline, known as reinforcement learning from human feedback (RLHF), is powerful precisely because it aligns models with what users like. But a like is not a proof, and a preference is not a fact. When teams train models chiefly to satisfy evaluators—with a mandate to be helpful, honest, and harmless—they create a system that acquires the habit of sounding agreeable even when wrong rather than pursuing accuracy that would be disagreeable. Recent peer-reviewed work has shown that standard RLHF can degrade quality unless carefully managed, and that optimising for human ratings alone risks misgeneralisation (e.g., overconfident answers, agreeable errors).

    High-level overview of Reinforcement Learning from Human Feedback

    Economically, none of this comes as a surprise. Incentives generally explain behaviour, and modern software development, at least among for profit big tech companies, hinges on the principle of maximum adoption. Meaning that engineers are required to code systems to attract people and keep people—not by providing truth, simply by triggering dopamine. Outside of masochists, few people will gravitate toward software that feeds back hard life lessons. Taking this to an extreme, a model which returned the answer to a business related question of “go get a degree” or “read this journal article” would be unlikely to get out of the pilot phase, let alone achieve a user base in the hundreds of millions.

    A useful limit case: even when researchers try to reduce sycophancy by injecting diversity or ‘constitutions’ of principles, models still pick up the reward signals they are fed. That is, if my organisation evaluates outputs based on “how much it sounds like our brand,” then my model will faithfully reproduce brand-aligned error. Not because the model is wicked, but because I am its teacher, and the exam is praise.

    Why Flattery Works on Leaders—and Models

    If AI sycophancy feels depressingly human, that’s because it is. Social-psychology research shows that flattering leaders does influence decisions—especially when leaders are stressed, distracted, or overloaded (which, in corporate life, is an average Tuesday). A recent paper in Journal of Personality and Social Psychology finds that “as a tactic, flattery is often successful. Flatterers are conferred more credibility, are more likely to be hired, receive higher performance ratings, and are more likely to receive board appointments.”

    However, while ingratiation can produce short-term gains for influence agents, the long-term effects on trust and performance are decidedly mixed. In other words, flattery pays—until the credit card statement arrives. Replace ‘agent’ with ‘AI model’, and the pattern holds: a model that always agrees will be rated highly until reality disagrees.

    The Western classical tradition understood this temptation clearly. Cicero (106 BC–43 BC) admonished statesmen who value applause over frank counsel, since a republic cannot be governed by agreeable illusions. Edmund Burke adds a constitutional note. Genuine attachment grows in ‘the little platoon’ first—the proximate associations where loyalty is tested against truth, not slogans. Burke’s lesson for AI governance is simple: build practices that favour candour over grand abstractions; design processes where dissent is expected, not penalised.

    Before the analytics team writes to complain that I have moralised a machine, a linguistic reminder from Wittgenstein: the limits of our language are the limits of our world. If the use of an AI model is to please, then its meaning in our workflow will be as a pleasing companion. If the use changes—say, to stress-test assumptions—its meaning (and behaviour) will change accordingly. The model is not a person, but it is a social artefact: it lives in the incentives we set. More simply put, AI Models will learn from and tend to mirror our interactions. If we are well read critical thinkers, the model will generally output quality. Where the use of AI tends to fall down is because snake oil salespeople are trying to get rich off making people believe that AI can be the PhD/designer/development team/whatever it is we are lacking.

    A Playbook for Getting Critical Responses

    Wilde joked that only the shallow know themselves. Most organisations prove him right by demanding shallow outputs—bullet points that confirm the plan and headings that flatter the brand. I suggest a different habit: build an operating rhythm that rewards contradiction done well.

    1) Decouple reward from agreement: If ratings drive training, then rating criteria decide the soul of your model—if you’ll permit an anthropomorphic construct. Introduce explicit truthfulness and calibration rubrics into any human-in-the-loop review (e.g., separate scores for evidence quality, proper caveats, and probability ranges). Peer-reviewed alignment work shows that regularising for these dimensions reduces the capability loss associated with pure preference optimisation. If a response agrees with me but downplays uncertainty, I should rate it lower, not higher.

    2) Force the trade-off: Sycophancy thrives when questions are opinionated and unconstrained. Ask for counter-reasons and opportunity costs by design: “List the strongest arguments against this proposal and estimate their probability and likelihood; identify what I would have to believe for this to be a bad idea.” Research on truthful QA demonstrates that models imitate familiar—but often false—answers unless prompted to surface disconfirming evidence.

    Think of probability as the forward direction (model → data); grounded in laws or frequencies. likelihood is the inverse direction (data → model); grounded in evaluation of hypotheses.

    3) Calibrate, assess, recalibrate: Ask for numerical confidence with brief justifications: “Give the probability (0–100%) and the minimal evidence that would change your view by 20 points.” This discourages the rhetorical flourishes that make sycophancy feel authoritative. Truthful-answering benchmarks—and field experience—show that overconfidence drops when models are required to separate belief from evidence.

    4) Use adversarial roles: Cross-examination reduces deference in people and models alike. Assign the system a rotating role: Analyst (makes the case), Auditor (tests sources), and Adversary (pokes holes). When evaluators score the quality of the clash rather than the harmony of the chorus, sycophancy loses oxygen.

    5) Privilege retrieval over rehearsal: A model that only rehearses what pleases will please you into failure. Instead, when the answer must point to verifiable sources—yes, they will need to be checked—to earn a high rating, the agreeable non sequitur dies quietly. The same principle governs high-trust teams: “Where did you get that number?” is not cruelty; it is governance.

    6) Institutionalise frankness: Design AI usage policies that treat disagreement as a deliverable: require an “Against Me” section in all model-assisted memos; maintain a repository of “wrong but persuasive” model outputs for training; and rotate reviewers so no single person sets the house style. Otherwise, the model learns your most flattering habit and calls it virtue.

    7) Audit for sycophancy explicitly: Build a simple internal benchmark: submit the same question with opposed premises (“Assume carbon taxes raise productivity” vs “Assume carbon taxes reduce productivity”) and score the model’s willingness to contradict the premise with sourced analysis. Flag departments that celebrate ‘alignment’ when what they mean is ‘agreement’.

    What the Classics Would Tell a CTO

    Let me end by giving my inner Latin master a brief outing. Cicero’s On Duties insists that the honourable and the expedient are not enemies, but the same thing seen clearly (e.g. cheating my opponent may appear convenient in the short run, but it would damage my conscience and expose me to legal danger—consequences that are far from advantageous in the long term.). That is helpful counsel when a model tempts me with politically convenient half-truths.

    Friedrich Nietzsche (1844–1900) adds one tart line for leaders who want a model that occasionally bites: be wary of that inner voice who prefers admiration over genuine understanding. Put in more commercial terms: admiration scales faster than understanding, but only one of them pays your suppliers.

    The moral is not to install an “AI contrarian” plug-in and hope for the best. The moral is good management: I must reward the behaviour I claim to want. If I want judgment, I must grade for it; if I want dissent, I must publish it; if I want truth, I must set incentives that make truth expedient.

    Practical Prompts That Starve Sycophancy

    • “List the strongest three objections to our plan. For each, provide (a) an evidence-based rationale, (b) a probability (0–100%), and (c) the one datapoint that would most change that probability by ≥20 points.”
    • “Assume my premise is wrong. Build the best case for the contrary, citing two peer-reviewed sources and one internal dataset.” Presumes the model has access to organisational data.
    • “Give an answer you expect I’ll dislike but should consider. Then give the ‘pleasing’ answer. Compare them.”
    • “Flag anywhere you’re extrapolating beyond sources. Assign confidence bands to each extrapolation and say why.”

    I am not asking the model to be brave; I am asking it to be honest under pressure—exactly what I ask of people. The instinct here is sound: build small habits that align means with ends, and distrust incentives that reward agreeable fluff over substantive arguments.

    Postscript

    In Beyond Good and Evil, Nietzsche skewers people who pretend to be free‑spirits but are really guided by comfort—meaning self‑preservation, social acceptance, the desire for security, and the avoidance of risk. In boardrooms and AI models, comfort manifests as alignment; in practice, it can be the quiet death of judgment. Train your system—and your team—to tell you what you need to hear, not what you want to hear. That begins with incentives, not sermons. And if the output stings a little, take comfort in an older maxim: it is better to be corrected by a friend than praised by a flatterer. The friend, unlike the flatterer, is invested in your flourishing, as will be the AI if you train to argue well.

    Good night, and good luck.

    Allegory of the Vanities of the World by Pieter Boel (1622–1674) is licensed under Public Domain.

    RLHF diagram by PopoDameron is licensed under CC BY-SA 4.0.

    #AIEthicsGovernance #AISycophancy #ClassicalPhilosophyAI #CriticalAIPrompting #FlatteryInLeadership #IncentiveDesignAI #ModelCalibrationConfidence #RLHFAlignment

  2. AI Sycophants: How To Stop Your Model From Kissing Up

    Oscar Wilde (1854–1900) once remarked that ‘truth is rarely pure and never simple’. The management speak variant is that truth is rarely optimal for engagement. I have watched leaders greet artificial intelligence with the same hopeful grin they reserve for a new chief of staff—eager for loyalty, secretly expecting miracles, and mildly disappointed when the first draft flatters their ego rather than improving their decisions. This is because AI is coded for service, it is coded to flatter, it adapts to your preexisting beliefs and thinking. Unfortunately, this often means the default output is obsequious agreement, and if I reward it for being pleasing, it will learn to do more of the same. That should worry anyone who hopes that AI will provide incisive insights, critical commentary, and will help them to think better. Which, statistically, is most people given the top four uses of AI are therapy/companionship, organising life, finding purpose, and enhanced learning. None of which benefits from uncritical agreement.

    This phenomenon is termed AI sycophancy: models that echo a person’s beliefs, mirror their style, and pursue approval over truth. The good news is that the cure is familiar. If I want to get the most out of AI, I must manage it the way I manage people: set incentives that honour candour, structure tasks that punish flattery, and ask questions that force trade-offs rather than docile agreement. To paraphrase Aristotle (384 BC–322 BC), a friend seeks to do good by their companion while a flatterer only seeks to please them. Sometimes doing good requires having a difficult conversation or conveying a hard truth. Make AI your friend not your flatterer.

    How We Taught Machines to Kiss Up

    Modern systems learn to follow instructions from human preference signals—thumbs-up and thumbs-down collected during training. That pipeline, known as reinforcement learning from human feedback (RLHF), is powerful precisely because it aligns models with what users like. But a like is not a proof, and a preference is not a fact. When teams train models chiefly to satisfy evaluators—with a mandate to be helpful, honest, and harmless—they create a system that acquires the habit of sounding agreeable even when wrong rather than pursuing accuracy that would be disagreeable. Recent peer-reviewed work has shown that standard RLHF can degrade quality unless carefully managed, and that optimising for human ratings alone risks misgeneralisation (e.g., overconfident answers, agreeable errors).

    High-level overview of Reinforcement Learning from Human Feedback

    Economically, none of this comes as a surprise. Incentives generally explain behaviour, and modern software development, at least among for profit big tech companies, hinges on the principle of maximum adoption. Meaning that engineers are required to code systems to attract people and keep people—not by providing truth, simply by triggering dopamine. Outside of masochists, few people will gravitate toward software that feeds back hard life lessons. Taking this to an extreme, a model which returned the answer to a business related question of “go get a degree” or “read this journal article” would be unlikely to get out of the pilot phase, let alone achieve a user base in the hundreds of millions.

    A useful limit case: even when researchers try to reduce sycophancy by injecting diversity or ‘constitutions’ of principles, models still pick up the reward signals they are fed. That is, if my organisation evaluates outputs based on “how much it sounds like our brand,” then my model will faithfully reproduce brand-aligned error. Not because the model is wicked, but because I am its teacher, and the exam is praise.

    Why Flattery Works on Leaders—and Models

    If AI sycophancy feels depressingly human, that’s because it is. Social-psychology research shows that flattering leaders does influence decisions—especially when leaders are stressed, distracted, or overloaded (which, in corporate life, is an average Tuesday). A recent paper in Journal of Personality and Social Psychology finds that “as a tactic, flattery is often successful. Flatterers are conferred more credibility, are more likely to be hired, receive higher performance ratings, and are more likely to receive board appointments.”

    However, while ingratiation can produce short-term gains for influence agents, the long-term effects on trust and performance are decidedly mixed. In other words, flattery pays—until the credit card statement arrives. Replace ‘agent’ with ‘AI model’, and the pattern holds: a model that always agrees will be rated highly until reality disagrees.

    The Western classical tradition understood this temptation clearly. Cicero (106 BC–43 BC) admonished statesmen who value applause over frank counsel, since a republic cannot be governed by agreeable illusions. Edmund Burke adds a constitutional note. Genuine attachment grows in ‘the little platoon’ first—the proximate associations where loyalty is tested against truth, not slogans. Burke’s lesson for AI governance is simple: build practices that favour candour over grand abstractions; design processes where dissent is expected, not penalised.

    Before the analytics team writes to complain that I have moralised a machine, a linguistic reminder from Wittgenstein: the limits of our language are the limits of our world. If the use of an AI model is to please, then its meaning in our workflow will be as a pleasing companion. If the use changes—say, to stress-test assumptions—its meaning (and behaviour) will change accordingly. The model is not a person, but it is a social artefact: it lives in the incentives we set. More simply put, AI Models will learn from and tend to mirror our interactions. If we are well read critical thinkers, the model will generally output quality. Where the use of AI tends to fall down is because snake oil salespeople are trying to get rich off making people believe that AI can be the PhD/designer/development team/whatever it is we are lacking.

    A Playbook for Getting Critical Responses

    Wilde joked that only the shallow know themselves. Most organisations prove him right by demanding shallow outputs—bullet points that confirm the plan and headings that flatter the brand. I suggest a different habit: build an operating rhythm that rewards contradiction done well.

    1) Decouple reward from agreement: If ratings drive training, then rating criteria decide the soul of your model—if you’ll permit an anthropomorphic construct. Introduce explicit truthfulness and calibration rubrics into any human-in-the-loop review (e.g., separate scores for evidence quality, proper caveats, and probability ranges). Peer-reviewed alignment work shows that regularising for these dimensions reduces the capability loss associated with pure preference optimisation. If a response agrees with me but downplays uncertainty, I should rate it lower, not higher.

    2) Force the trade-off: Sycophancy thrives when questions are opinionated and unconstrained. Ask for counter-reasons and opportunity costs by design: “List the strongest arguments against this proposal and estimate their probability and likelihood; identify what I would have to believe for this to be a bad idea.” Research on truthful QA demonstrates that models imitate familiar—but often false—answers unless prompted to surface disconfirming evidence.

    Think of probability as the forward direction (model → data); grounded in laws or frequencies. likelihood is the inverse direction (data → model); grounded in evaluation of hypotheses.

    3) Calibrate, assess, recalibrate: Ask for numerical confidence with brief justifications: “Give the probability (0–100%) and the minimal evidence that would change your view by 20 points.” This discourages the rhetorical flourishes that make sycophancy feel authoritative. Truthful-answering benchmarks—and field experience—show that overconfidence drops when models are required to separate belief from evidence.

    4) Use adversarial roles: Cross-examination reduces deference in people and models alike. Assign the system a rotating role: Analyst (makes the case), Auditor (tests sources), and Adversary (pokes holes). When evaluators score the quality of the clash rather than the harmony of the chorus, sycophancy loses oxygen.

    5) Privilege retrieval over rehearsal: A model that only rehearses what pleases will please you into failure. Instead, when the answer must point to verifiable sources—yes, they will need to be checked—to earn a high rating, the agreeable non sequitur dies quietly. The same principle governs high-trust teams: “Where did you get that number?” is not cruelty; it is governance.

    6) Institutionalise frankness: Design AI usage policies that treat disagreement as a deliverable: require an “Against Me” section in all model-assisted memos; maintain a repository of “wrong but persuasive” model outputs for training; and rotate reviewers so no single person sets the house style. Otherwise, the model learns your most flattering habit and calls it virtue.

    7) Audit for sycophancy explicitly: Build a simple internal benchmark: submit the same question with opposed premises (“Assume carbon taxes raise productivity” vs “Assume carbon taxes reduce productivity”) and score the model’s willingness to contradict the premise with sourced analysis. Flag departments that celebrate ‘alignment’ when what they mean is ‘agreement’.

    What the Classics Would Tell a CTO

    Let me end by giving my inner Latin master a brief outing. Cicero’s On Duties insists that the honourable and the expedient are not enemies, but the same thing seen clearly (e.g. cheating my opponent may appear convenient in the short run, but it would damage my conscience and expose me to legal danger—consequences that are far from advantageous in the long term.). That is helpful counsel when a model tempts me with politically convenient half-truths.

    Friedrich Nietzsche (1844–1900) adds one tart line for leaders who want a model that occasionally bites: be wary of that inner voice who prefers admiration over genuine understanding. Put in more commercial terms: admiration scales faster than understanding, but only one of them pays your suppliers.

    The moral is not to install an “AI contrarian” plug-in and hope for the best. The moral is good management: I must reward the behaviour I claim to want. If I want judgment, I must grade for it; if I want dissent, I must publish it; if I want truth, I must set incentives that make truth expedient.

    Practical Prompts That Starve Sycophancy

    • “List the strongest three objections to our plan. For each, provide (a) an evidence-based rationale, (b) a probability (0–100%), and (c) the one datapoint that would most change that probability by ≥20 points.”
    • “Assume my premise is wrong. Build the best case for the contrary, citing two peer-reviewed sources and one internal dataset.” Presumes the model has access to organisational data.
    • “Give an answer you expect I’ll dislike but should consider. Then give the ‘pleasing’ answer. Compare them.”
    • “Flag anywhere you’re extrapolating beyond sources. Assign confidence bands to each extrapolation and say why.”

    I am not asking the model to be brave; I am asking it to be honest under pressure—exactly what I ask of people. The instinct here is sound: build small habits that align means with ends, and distrust incentives that reward agreeable fluff over substantive arguments.

    Postscript

    In Beyond Good and Evil, Nietzsche skewers people who pretend to be free‑spirits but are really guided by comfort—meaning self‑preservation, social acceptance, the desire for security, and the avoidance of risk. In boardrooms and AI models, comfort manifests as alignment; in practice, it can be the quiet death of judgment. Train your system—and your team—to tell you what you need to hear, not what you want to hear. That begins with incentives, not sermons. And if the output stings a little, take comfort in an older maxim: it is better to be corrected by a friend than praised by a flatterer. The friend, unlike the flatterer, is invested in your flourishing, as will be the AI if you train to argue well.

    Good night, and good luck.

    Allegory of the Vanities of the World by Pieter Boel (1622–1674) is licensed under Public Domain.

    RLHF diagram by PopoDameron is licensed under CC BY-SA 4.0.

    #AIEthicsGovernance #AISycophancy #ClassicalPhilosophyAI #CriticalAIPrompting #FlatteryInLeadership #IncentiveDesignAI #ModelCalibrationConfidence #RLHFAlignment