home.social

#iclr — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #iclr, aggregated by home.social.

  1. SCoRe is a two-stage on-policy RL recipe that teaches a language model to revise its own answers using only self-generated data. On Gemini 1.5 Flash and 1.0 Pro it gains 15.6 points on MATH and 9.1 on HumanEval over the base model. At matched inference budgets, sequential self-correction beats parallel sampling at up to 32 samples.

    benjaminhan.net/posts/20260512

    #Paper #LLMs #RL #Metacognition #Reasoning #ICLR #AI
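
    The budget-matched comparison the post describes can be sketched as follows. This is an illustration, not SCoRe's training recipe: `generate` and `revise` are hypothetical stand-ins for model calls, and "matched budget" means both strategies spend the same number of generations k.

    ```python
    from collections import Counter

    def parallel_at_k(generate, problem, k):
        """Parallel sampling: k independent attempts, then majority vote."""
        answers = [generate(problem) for _ in range(k)]
        return Counter(answers).most_common(1)[0][0]

    def sequential_at_k(generate, revise, problem, k):
        """Sequential self-correction: one attempt followed by k-1
        revisions of the previous answer -- the same budget of k calls."""
        answer = generate(problem)
        for _ in range(k - 1):
            answer = revise(problem, answer)
        return answer
    ```

    The paper's claim is that a model trained with SCoRe makes `sequential_at_k` win this comparison; an untrained model's revisions often collapse toward its first answer.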

  6. Let's Verify Step by Step compares process and outcome supervision on MATH. The process-reward model reaches 78.2% with best-of-1860 sampling vs 72.4% for the outcome-reward model. But that gap narrows fast at small N, where most deployments actually live.

    benjaminhan.net/posts/20260512

    #Paper #LLMs #Reasoning #Mathematics #ICLR #OpenAI #AI
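
    Best-of-N selection with a process reward can be sketched like this. Assumptions are labeled in the code: scoring a solution as the product of per-step correctness probabilities is one aggregation used for PRMs, and `step_prob` is a hypothetical stand-in for the trained reward model.

    ```python
    import math

    def prm_score(steps, step_prob):
        """Score a solution as the probability that every step is
        correct: the product of per-step scores (one common PRM
        aggregation). `step_prob` stands in for the reward model."""
        return math.prod(step_prob(s) for s in steps)

    def best_of_n(candidates, step_prob):
        """Best-of-N: return the candidate (a list of reasoning steps)
        with the highest process-reward score."""
        return max(candidates, key=lambda c: prm_score(c, step_prob))
    ```

    At N = 1860 this reranking is where the 78.2% figure comes from; at small N there are few candidates to separate, which is why the process/outcome gap narrows.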

  7. Conformal Language Modeling (CLM) adapts conformal prediction to generative LMs: sample candidates, stop when a calibrated rule fires, return a set guaranteed, with a user-chosen error rate, to contain an acceptable answer. The more interesting half is the component-level filter — per-phrase coverage, not just set-level. That's the primitive for hallucination flagging: highlight the vetted phrases, leave the rest for review.

    benjaminhan.net/posts/20260505

    #ConformalPrediction #LLMs #Hallucination #ICLR #AI
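
    The sample-until-stop loop and the component filter can be sketched as below. This is a simplification: the thresholds `lam_stop` and `lam_comp` must be calibrated on held-out data (that calibration is what yields the coverage guarantee, and it is elided here), and `sample`, `score`, and `conf` are hypothetical stand-ins for model calls.

    ```python
    def conformal_sample(sample, score, lam_stop, k_max):
        """CLM-style set construction (sketch): keep drawing candidates
        into the output set until one clears the calibrated stopping
        threshold, or the sampling budget k_max runs out."""
        out = []
        for _ in range(k_max):
            y = sample()
            out.append(y)
            if score(y) >= lam_stop:
                break
        return out

    def vetted_phrases(phrases, conf, lam_comp):
        """Component-level filter: keep only phrases whose confidence
        clears a second calibrated threshold -- the per-phrase coverage
        the post highlights."""
        return [p for p in phrases if conf(p) >= lam_comp]
    ```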

  8. DSPy turns LM pipelines into typed-module graphs and compiles them end-to-end against a single metric, bootstrapping its own few-shot demonstrations.

    The programming-model layer is the real contribution, not any specific teleprompter. Once pipelines are typed graphs, pipeline-level search (MASS, MIPRO) becomes possible in a way it wasn't with string-template prompts.

    benjaminhan.net/posts/20260430

    #LLMs #AI #PromptEngineering #NLP #Stanford #ICLR
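
    The programming-model idea — pipelines as module graphs with declared fields, optimized per-stage against one end-to-end metric — can be sketched without DSPy's actual API. Everything below is illustrative, not DSPy code: stages are plain functions from a field dict to new fields, and the "compiler" is a toy prompt search (real optimizers like MIPRO also bootstrap few-shot demos).

    ```python
    def run_pipeline(stages, x):
        """Execute a linear module graph: each stage maps the current
        dict of fields to new fields, merged into the running state."""
        state = dict(x)
        for stage in stages:
            state.update(stage(state))
        return state

    def compile_stage(make_stage, prompt_candidates, metric):
        """Toy prompt search for one stage: instantiate the stage with
        each candidate prompt and keep the one scoring best on an
        end-to-end metric. Possible only because stages are typed
        units, not string templates."""
        return max((make_stage(p) for p in prompt_candidates), key=metric)
    ```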

  9. SelfReflect measures whether an LLM's text summary of its uncertainty matches its actual answer distribution. Across 20 modern models: it doesn't, unless the model sees samples of its own answers first.

    The negative result does more work than the metric itself. It fits a growing line of evidence that LLM self-reports shouldn't be trusted as introspection. The practical workaround isn't cheap: N forward passes to sample, then a summarization pass.

    benjaminhan.net/posts/20260430

    #LLMs #AI #Evaluation #Apple #ICLR
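
    The workaround's cost structure can be sketched as below. `llm(prompt)` is a hypothetical completion function, not any real API; the point is the n + 1 call budget: n forward passes to sample answers, then one more pass that summarizes uncertainty conditioned on those samples.

    ```python
    from collections import Counter

    def summarize_uncertainty(llm, question, n):
        """Draw n answer samples, tabulate them, then ask the model to
        summarize its uncertainty *given its own samples* -- the only
        condition under which the post says summaries match the actual
        answer distribution. Total cost: n + 1 model calls."""
        samples = [llm(question) for _ in range(n)]
        evidence = ", ".join(f"{a} x{c}"
                             for a, c in Counter(samples).most_common())
        return llm(f"{question}\nYour sampled answers were: {evidence}\n"
                   "Summarize your uncertainty in one sentence.")
    ```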

  10. A major #AIconference, the International Conference on Learning Representations (#ICLR), discovered that 21% of #peerreviews were fully #AIgenerated. #Researchers raised concerns about AI-generated #reviews, citing issues like #hallucinatedcitations and #vaguefeedback. Organisers will now use automated tools to assess submissions and reviews for AI use. nature.com/articles/d41586-025 #tech #media #news