home.social

#aievaluation — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #aievaluation, aggregated by home.social.

  1. ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents
    arxiv.org/abs/2602.10620
    Code & data: github.com/codingchild2424/isd
    "benchmark comprising 25,795 scenarios that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model."

    w/same author: Pedagogy-R1: Pedagogical Large Reasoning Model and Well-balanced Educational Benchmark dl.acm.org/doi/10.1145/3746252
    #AIEd #LearningDesign #AIevaluation #EdTech

  6. Start your week off right with #enterpriseAI #changemanagement tips from IT leaders Juan Orlandini, Fabien CROS, Kulvir Gahunia and Dana Harrison. My in-depth look at how #gamification, #AIevaluation platforms, #platformengineering and other approaches helped companies such as Insight, Ducker Carlisle and TELUS adopt #AI effectively: techtarget.com/searchitoperati

  7. A developer has just built an open-source evaluation tool (SanityHarness) and tested 49 model/coding-agent pairs, including Kimi K2.5. The SanityBoard leaderboard scores performance and cost, and compares models that support BYOK. Findings: Codebuff is expensive but performs poorly, while Droid and Minimax stand out. The community is invited to join the testing via Discord. #AI #LậpTrình #ĐánhGiáAI #MãNguồnMở #Coding #AIEvaluation

    reddit.com/r/LocalLLaMA/commen

  8. TrustifAI is a trustworthiness evaluation framework for AI/RAG systems with multi-dimensional scores: Evidence Coverage, Logical Stability, Semantic Drift, Source Diversity, and Generation Confidence. It builds reasoning graphs and Mermaid visualizations for tracing root causes. A solution for enterprise, governance, and compliance settings. #TrustifAI #RAG #AIEvaluation #AIinVietnam #ĐánhGiáAI #HệThốngThôngMinh

    reddit.com/gallery/1qmhvuz

  9. The rapid development of modern AI models calls for evaluation benchmarks that probe complex capabilities in depth and breadth, in order to drive the refinement of advanced large language models (LLMs). Experts stress that the smarter AI becomes, the more comprehensive its evaluation must be to ensure safety and effectiveness.

    #AI #TríTuệNhânTạo #AIModels #MôHìnhAI #AIEvaluation #ĐánhGiáAI #CôngNghe #Tech

    vietnamnet.vn/cang-thong-minh-

  10. The educator panic over AI is real, and rational.
    I've been there myself. The difference is I moved past denial to a more pragmatic question: since AI regulation seems unlikely (with both camps refusing to engage), how do we actually work with these systems?

    The "AI will kill critical thinking" crowd has a point, but they're missing context.
    Critical reasoning wasn't exactly thriving before AI arrived: just look around. The real question isn't whether AI threatens thinking skills, but whether we can leverage it the same way we leverage other cognitive tools.

    We don't hunt our own food or walk everywhere anymore.
    We use supermarkets and cars. Most of us Google instead of visiting libraries. Each tool trade-off changed how we think and what skills matter. AI is the next step in this progression, if we're smart about it.

    The key is learning to think with AI rather than being replaced by it.
    That means understanding both its capabilities and our irreplaceable human advantages.

    1/3

    #AI #Education #FutureOfEducation #AIinEducation #LLM #ChatGPT #Claude #EdAI #CriticalThinking #CognitiveScience #Metacognition #HigherOrderThinking #Reasoning #Vygotsky #Hutchins #Sweller #LearningScience #EducationalPsychology #SocialLearning #TechforGood #EticalAI #AILiteracy #PromptEngineering #AISkills #DigitalLiteracy #FutureSkills #LRM #AIResearch #AILimitations #SystemsThinking #AIEvaluation #MentalModels #LifelongLearning #AIEthics #HumanCenteredAI #DigitalTransformation #AIRegulation #ResponsibleAI #Philosophy

  11. AI isn't going anywhere. Time to get strategic:
    Instead of mourning lost critical thinking skills, let's build on them through cognitive delegation—using AI as a thinking partner, not a replacement.

    This isn't some Silicon Valley fantasy.
    Three decades of cognitive research have already mapped out how this works:

    Cognitive Load Theory:
    Our brains can only juggle so much at once. Let AI handle the grunt work while you focus on making meaningful connections.

    Distributed Cognition:
    Naval crews don't navigate with individual genius—they spread thinking across people, instruments, and procedures. AI becomes another crew member in your cognitive system.

    Zone of Proximal Development:
    We learn best with expert guidance bridging what we can't quite do alone. AI can serve as that "more knowledgeable other" (though it's still early days).
    The table below shows what this looks like in practice:

    2/3

  12. Critical reasoning vs Cognitive Delegation

    Old School Focus:

    Building internal cognitive capabilities and managing cognitive load independently.

    Cognitive Delegation Focus:

    Orchestrating distributed cognitive systems while maintaining quality control over AI-augmented processes.

    We can still go for a jog or hunt our own deer, but to reach the stars we, the apes, do what apes do best: use tools to build on our cognitive abilities. AI is a tool.

    3/3

  13. ICYMI: Google updates quality rater guidelines with AI content evaluation criteria: Google's latest guidelines provide clearer direction on evaluating AI-generated content and spam tactics. ppc.land/google-updates-qualit #GoogleUpdates #QualityRater #AIEvaluation #ContentGuidelines #DigitalMarketing

  14. "The #gamma GLM is a relatively assumption-light means of #modeling non-negative data, given gamma's flexibility.
    […]
    "Explaining what is used and what is not used, despite merits and demerits […]: Loosely, the larger the internal literature in any field on modelling techniques, the less inclined people in that field seem to be to try something different."

    Nick Cox, 2013: stats.stackexchange.com/questi

    #normality #normalDistribution #Γ #modelling #dataDev #AIDev #ML #AIEvaluation #logNormal

  19. @datadon

    "The following sections discuss several state-of-the-art interpretable and explainable #ML methods. The selection of works does not comprise an exhaustive survey of the literature. Instead, it is meant to illustrate the commonest properties and inductive biases behind interpretable models and [black-box] explanation methods using concrete instances."
    wires.onlinelibrary.wiley.com/ 🧵

    #interpretability #explainability #aiethics #compliance #taxonomy #ethicalai #aievaluation #linearRegression

  20. Model "#interpretability and [black-box] #explainability, although not necessary in many straightforward applications, become instrumental when the problem definition is incomplete and in the presence of additional desiderata, such as trust, causality, or fairness."

    wires.onlinelibrary.wiley.com/

    #aiethics #compliance #taxonomy #ethicalai #aievaluation