#adversarialml — Public Fediverse posts on home.social

hasamba @[email protected] · 2026-05-24 · 08:04 UTC

----------------

🎯 AI
===================

AI red teaming applies adversarial methodology to large language models, exposing vulnerabilities that traditional security testing misses. The core problem: models like GPT, Claude, and Gemini reason in ways that fail unpredictably, without triggering alerts.

Why traditional testing falls short

Standard application security focuses on code vulnerabilities. LLMs introduce a different risk category. The model interprets language, and an attacker manipulates that interpretation rather than exploiting a logic bug. A simple prompt modification can bypass safety controls, extract training data, or produce harmful outputs. No alert fires.

The Microsoft Copilot example

Researchers demonstrated that Microsoft Copilot could be compromised through a single malicious email. This shows how AI-integrated business tools inherit model vulnerabilities and expose them to external manipulation. The model's ability to process email content becomes an attack vector.

Red teaming methodology

1. Scope definition: Establish rules of engagement. Specify in-scope targets and off-limits areas.

2. Scenario design: Map the AI attack surface. Identify adversary paths, from data pipelines to prompt interfaces.

3. Attack planning: Select tactics based on threat analysis. Options include prompt injection, data poisoning, and adversarial inputs.

4. Execution: Launch attacks in sandboxed environments. Combine manual probing with automation. Monitor anomalies and document evidence.

5. Reporting: Deliver comprehensive assessment with attack narratives. This provides organizations with a prioritized remediation roadmap.

Common techniques
• Prompt injection: Embedding malicious instructions in user input to hijack model control logic and override system prompts.
• Data exfiltration: Tricking the model into revealing training data, user information, or system prompts.
• Jailbreaks: Crafting inputs that bypass safety filters and ethical boundaries.
• Data poisoning: Corrupting training data or context to manipulate model outputs.

Observations

The article frames AI red teaming as essential before deployment. This is reasonable, but the source does not independently verify all claims about vulnerability scope. The methodology is standard red team practice adapted for AI specifics. The field still lacks standardized frameworks.

The distinction between code vulnerabilities and intent exploitation is operationally significant. Traditional fuzzing and penetration testing do not cover the language interpretation attack surface. Organizations integrating LLMs into critical infrastructure should treat red team assessment as a deployment prerequisite.

🔹 AI #RedTeaming #LLMSecurity #PromptInjection #AdversarialML

🔗 Source: https://blog.securelayer7.net/ai-red-teaming/

#adversarialml #promptinjection #llmsecurity #redteaming

Hack in Days of Future Past @[email protected] · 2026-03-15 · 15:09 UTC

Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

#AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

#aisafety #llmsecurity #cybersecurity #airedteaming #adversarialml #llm

Hack in Days of Future Past @[email protected] · 2026-03-15 · 15:09 UTC

Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

#AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

#aisafety #llmsecurity #cybersecurity #airedteaming #adversarialml #llm

Hack in Days of Future Past @[email protected] · 2026-03-15 · 15:09 UTC

Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

#AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

#aisafety #llmsecurity #cybersecurity #airedteaming #adversarialml #llm

Hack in Days of Future Past @[email protected] · 2026-03-15 · 15:09 UTC

Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

#AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

#llm #adversarialml #airedteaming #cybersecurity #llmsecurity #aisafety

Hack in Days of Future Past @[email protected] · 2026-03-15 · 15:09 UTC

Quite fascinating. If confirmed, this may reveal a structural weakness in how refusal is implemented in some LLMs. The accept/refuse mechanism may be relatively isolated in internal representations and therefore observable and manipulable — tools like Heretic make this visible.

A possible mitigation might be cryptographic signing of model weights, making unauthorized modifications detectable when the model is loaded for inference.

#AISafety #LLMSecurity #CyberSecurity #AIRedTeaming #AdversarialML #LLM

#aisafety #llmsecurity #cybersecurity #airedteaming #adversarialml #llm

hasamba @[email protected] · 2026-03-10 · 18:17 UTC

----------------

🔒 AI Pentesting Roadmap — LLM Security and Offensive Testing
===================

Overview

This roadmap provides a structured learning path for practitioners aiming to assess and attack AI/ML systems, with a focus on LLMs and related pipelines. It organizes topics into progressive phases: foundations in ML and APIs, core AI security concepts, prompt injection and LLM-specific attacks, hands-on labs, advanced exploitation techniques, and real-world research/bug bounty work.

Phased Structure

Phase 1 (Foundations) covers machine learning fundamentals and LLM internals, including model architectures and tokenization concepts. Phase 2 (AI/ML Security Concepts) anchors the curriculum on standards and frameworks such as OWASP LLM Top 10, MITRE ATLAS, and NIST AI risk guidance. Phase 3 focuses on prompt injection and LLM adversarial vectors, describing attack surfaces like context manipulation, instruction-following bypasses, and RAG pipeline poisoning. Phase 4 emphasizes hands-on practice through CTFs, sandboxed labs, and safe testing methodologies. Phase 5 explores advanced exploitation: model poisoning, data poisoning, backdoor techniques, and chaining vulnerabilities across API/authentication layers. Phase 6 targets real-world research, disclosure workflows, and bug bounty engagement.

Technical Coverage

The roadmap lists practical tooling and repositories for experiment design and testing concepts without prescribing deployment steps. It calls out necessary foundations—Python programming, HTTP/API mechanics, and web security basics (XSS, SSRF, SQLi) to support end-to-end attack scenarios against AI systems. Notable conceptual risks include RAG poisoning, adversarial ML perturbations, prompt injection, and leakage through augmented memory or external tool integrations.

Limitations & Considerations

The guide is educational and emphasizes conceptual descriptions of capabilities and use cases rather than operational recipes. It highlights standards and references rather than prescriptive mitigations. Practical exploration should respect ethical boundaries and responsible disclosure norms.

🔹 OWASP #MITRE_ATLAS #RAG #prompt_injection #adversarialML

🔗 Source: https://github.com/anmolksachan/AI-ML-Free-Resources-for-Security-and-Prompt-Injection

#adversarialml #prompt_injection #rag #mitre_atlas

hasamba @[email protected] · 2026-01-30 · 17:04 UTC

----------------

🛠️ Tool
===================

Opening:
BlackIce is a containerized red‑teaming toolkit that aggregates 14 open‑source AI security tools into a single, version‑pinned runtime image. The release focuses on reproducibility, dependency isolation, and a unified command‑line surface to simplify assessments of LLMs and ML artifacts.

Key Features:
• Aggregation of 14 OSS projects spanning Responsible AI, adversarial ML, and model testing, including lm_eval_harness, promptfoo, cleverhans, garak, ART, Giskard, and CyberSecEval.
• Mapping of capabilities to MITRE ATLAS techniques and the Databricks AI Security Framework (DASF), explicitly covering prompt injection (AML.T0051), data leakage (AML.T0057), and hallucination discovery (AML.T0062).
• Dual tool model: static tools installed in isolated virtual environments and dynamic tools available in the global Python environment with a global_requirements management approach.

Technical Implementation:
• The toolkit ships as a version‑pinned container image intended to provide a reproducible compute environment; static tools run from isolated Python venvs or Node.js projects while dynamic tools integrate into a shared Python environment for extensibility.
• Capability mapping was performed to show coverage across attack classes such as prompt injection, indirect injection via external content (RAG/email), adversarial example generation, and supply‑chain artifact checks (malicious pickles/artifacts).

Use Cases:
• Red teams validating model behavior against prompt injection and jailbreak vectors.
• Privacy testing focusing on model outputs that may leak sensitive training data.
• Analysts stress‑testing hallucination rates and evaluating hallucination detection tooling.
• Supply‑chain reviewers scanning artifacts for unsafe or malicious components.

Limitations:
• The release bundles a curated selection of tools (14 in the initial image); coverage depends on those projects’ individual capabilities and update cadence.
• Dependency isolation strategy splits static vs. dynamic tools which can limit some integrated workflows that assume a single interpreter for all components.
• No single integrated orchestration layer for multi‑tool scenarios was described; users must coordinate tool workflows within the container environment.

References:
• Tool list and organization attribution provided by the release notes; mappings reference MITRE ATLAS technique IDs and Databricks DASF controls.

🔹 tool #AIsecurity #MITRE #adversarialML #supplychain

🔗 Source: https://www.databricks.com/blog/announcing-blackice-containerized-red-teaming-toolkit-ai-security-testing

#supplychain #adversarialml #mitre #aisecurity

TechNadu @[email protected] · 2025-12-01 · 17:07 UTC

European researchers report that poetic prompts can bypass safety guardrails in multiple LLMs, exposing gaps in classifier-based moderation.

A good reminder that safety systems must evolve alongside generative models - especially as adversarial creativity becomes easier to automate.

What direction should improvements take?

Source: https://www.wired.com/story/poems-can-trick-ai-into-helping-you-make-a-nuclear-weapon/

Follow us for more neutral and security-focused AI updates.

#AISafety #LLMSecurity #AdversarialML #CyberSecurity #MLResearch #TechNadu

#aisafety #llmsecurity #adversarialml #cybersecurity #mlresearch #technadu

TechNadu @[email protected] · 2025-12-01 · 17:07 UTC

European researchers report that poetic prompts can bypass safety guardrails in multiple LLMs, exposing gaps in classifier-based moderation.

A good reminder that safety systems must evolve alongside generative models - especially as adversarial creativity becomes easier to automate.

What direction should improvements take?

Source: https://www.wired.com/story/poems-can-trick-ai-into-helping-you-make-a-nuclear-weapon/

Follow us for more neutral and security-focused AI updates.

#AISafety #LLMSecurity #AdversarialML #CyberSecurity #MLResearch #TechNadu

#technadu #mlresearch #cybersecurity #adversarialml #llmsecurity #aisafety

TechNadu @[email protected] · 2025-12-01 · 17:07 UTC

European researchers report that poetic prompts can bypass safety guardrails in multiple LLMs, exposing gaps in classifier-based moderation.

A good reminder that safety systems must evolve alongside generative models - especially as adversarial creativity becomes easier to automate.

What direction should improvements take?

Source: https://www.wired.com/story/poems-can-trick-ai-into-helping-you-make-a-nuclear-weapon/

Follow us for more neutral and security-focused AI updates.

#AISafety #LLMSecurity #AdversarialML #CyberSecurity #MLResearch #TechNadu

#aisafety #llmsecurity #adversarialml #cybersecurity #mlresearch #technadu

hasamba @[email protected] · 2025-11-01 · 09:08 UTC

🎯 AI
===================

Executive summary:
The reported incident describes the use of an LLM-based assistant to craft inputs that evaded an AI-driven web application firewall called NeuroShield. The same report details exploitation of an overlooked API rate limiting control which culminated in a full account takeover. The narrative is a first-person account of testing and bypassing layered AI defenses.

Technical details:
• Target product: NeuroShield (described as an "AI-powered WAF" claiming "stop 99.9% of attacks").
• Evasion technique: Use of an LLM (ChatGPT) to reformulate attack payloads that the WAF misclassified as benign (the article references SQL injection payloads framed as "friendly database compliments").
• Secondary weakness: An unprotected or misconfigured API rate limit that allowed amplification or repeated requests leading to account compromise.
• Outcomes stated: Successful bypass of the WAF and a full account takeover (author-reported).

Analysis:
The incident highlights two intertwined failure modes: firstly, reliance on model-derived classifiers as a primary gate can be undermined by adversarially generated inputs that exploit model decision boundaries; secondly, conventional API hardening (rate limiting) remains critical and, when absent or misapplied, can enable escalation from bypass to account compromise. The report does not publish IoCs, payload samples, or specific API endpoints.

Detection:
The original account does not provide detection signatures or IOCs. Observables that could be relevant (but are not supplied in the report) include anomalous request rates from a single client, repetitive similar payload variants that differ at the surface level but test model boundaries, and unusual session token usage patterns.

Mitigation / Author guidance:
The article provides an experiential narrative and does not include formal mitigation playbooks or defensive configurations. No concrete mitigations or patch details were published alongside the write-up.

References / notes:
• Product named in the report: NeuroShield
• Methodology: LLM-assisted payload generation; exploitation of API rate limiting

🔹 AI #WAF #adversarialML #account_takeover #infosec

🔗 Source: https://infosecwriteups.com/how-i-made-chatgpt-my-personal-hacking-assistant-and-broke-their-ai-powered-security-ee37d4a725c2

#infosec #account_takeover #adversarialml #waf

Chad Kohalyk @[email protected] · 2024-03-29 · 07:20 UTC

This was another great ep following the one with Simon Willison about finding the boundaries of LLMs https://oxide-and-friends.transistor.fm/episodes/adversarial-machine-learning

#podcast #machineLearning #LLM #adversarialML

#podcast #machinelearning #llm #adversarialml

Chad Kohalyk @[email protected] · 2024-03-29 · 07:20 UTC

This was another great ep following the one with Simon Willison about finding the boundaries of LLMs https://oxide-and-friends.transistor.fm/episodes/adversarial-machine-learning

#podcast #machineLearning #LLM #adversarialML

#podcast #machinelearning #llm #adversarialml

Chad Kohalyk @[email protected] · 2024-03-29 · 07:20 UTC

This was another great ep following the one with Simon Willison about finding the boundaries of LLMs https://oxide-and-friends.transistor.fm/episodes/adversarial-machine-learning

#podcast #machineLearning #LLM #adversarialML

#podcast #machinelearning #llm #adversarialml

Chad Kohalyk @[email protected] · 2024-03-29 · 07:20 UTC

This was another great ep following the one with Simon Willison about finding the boundaries of LLMs https://oxide-and-friends.transistor.fm/episodes/adversarial-machine-learning

#podcast #machineLearning #LLM #adversarialML

#adversarialml #llm #machinelearning #podcast

Daniel Lowd @[email protected] · 2022-11-22 · 00:03 UTC

In related news, I need to recruit 1-2 new PhD students starting next fall!!

Likely research topics: Adversarial and explainable ML for large models of text and code.

(And maybe probabilistic and relational models if another project gets funded.)

If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

#recruiting #adversarialML

#recruiting #adversarialml

Daniel Lowd @[email protected] · 2022-11-22 · 00:03 UTC

In related news, I need to recruit 1-2 new PhD students starting next fall!!

Likely research topics: Adversarial and explainable ML for large models of text and code.

(And maybe probabilistic and relational models if another project gets funded.)

If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

#recruiting #adversarialML

#recruiting #adversarialml

Daniel Lowd @[email protected] · 2022-11-22 · 00:03 UTC

In related news, I need to recruit 1-2 new PhD students starting next fall!!

Likely research topics: Adversarial and explainable ML for large models of text and code.

(And maybe probabilistic and relational models if another project gets funded.)

If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

#recruiting #adversarialML

#recruiting #adversarialml

Daniel Lowd @[email protected] · 2022-11-22 · 00:03 UTC

In related news, I need to recruit 1-2 new PhD students starting next fall!!

Likely research topics: Adversarial and explainable ML for large models of text and code.

(And maybe probabilistic and relational models if another project gets funded.)

If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

#recruiting #adversarialML

#adversarialml #recruiting

Daniel Lowd @[email protected] · 2022-11-22 · 00:03 UTC

In related news, I need to recruit 1-2 new PhD students starting next fall!!

Likely research topics: Adversarial and explainable ML for large models of text and code.

(And maybe probabilistic and relational models if another project gets funded.)

If you want to email me about this, please include “capybara” in the subject line so I know it’s a specific response and not a blanket query.

#recruiting #adversarialML

#recruiting #adversarialml

Daniel Lowd @[email protected] · 2022-11-15 · 20:12 UTC

Since some people have been asking, here's a preprint:
https://arxiv.org/abs/2208.13904

TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
#SaTML #AdversarialML #NewPaper

#satml #adversarialml #newpaper

Daniel Lowd @[email protected] · 2022-11-15 · 20:12 UTC

Since some people have been asking, here's a preprint:
https://arxiv.org/abs/2208.13904

TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
#SaTML #AdversarialML #NewPaper

#satml #adversarialml #newpaper

Daniel Lowd @[email protected] · 2022-11-15 · 20:12 UTC

Since some people have been asking, here's a preprint:
https://arxiv.org/abs/2208.13904

TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
#SaTML #AdversarialML #NewPaper

#satml #adversarialml #newpaper

Daniel Lowd @[email protected] · 2022-11-15 · 20:12 UTC

Since some people have been asking, here's a preprint:
https://arxiv.org/abs/2208.13904

TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
#SaTML #AdversarialML #NewPaper

#newpaper #adversarialml #satml

Daniel Lowd @[email protected] · 2022-11-15 · 20:12 UTC

Since some people have been asking, here's a preprint:
https://arxiv.org/abs/2208.13904

TL;DR: You can get certified guarantees on robust regression against poisoning and other training set attacks. The trick is to use a voting based predictor (like an ensemble or k-NN) and median.

We made some revisions during the author feedback and discussion period which haven’t yet been incorporated into the arXiv version. I’ll post again when we have the camera-ready version.
#SaTML #AdversarialML #NewPaper

#satml #adversarialml #newpaper

Daniel Lowd @[email protected] · 2022-11-15 · 03:36 UTC

Our paper on adversarially-robust regression was accepted to SaTML 2023 (https://satml.org) -- the first ever IEEE Conference on Secure and Trustworthy Machine Learning!