Hugo & Felix

We research how AI systems get attacked. We build tools to find defenses faster. We help organizations close the gap before it matters.

3
Active Engagements
14
Threat Domains
19
Book Chapters
29
References Tracked
01
Client Engagements

Where We Work

Bespoke safety audits grounded in current research. Each engagement maps real threat models to the client's actual systems.

Communications / PR · Weber Shandwick Germany

AI Reputation Resilience Audit for a top-tier comms agency running HALO — a proprietary agentic AI platform on Google Cloud. 90% daily AI adoption across client-facing staff, $7M+ Google partnership.

5 high-priority risks mapped · 18.7–64% hallucination in domain content [ENG] · 90% injection success with 5 poisoned docs [1][2] · EU AI Act Aug 2026 deadline
Healthcare / MedTech · Henry Schein One

Clinical AI security assessment across dental/medical software. 100K+ practice locations, FDA-cleared AI products on AWS Bedrock, post-BlackCat ransomware context.

3 critical attack vectors · $350–400M BlackCat impact [ENG] · 1.47% hallucination base rate [ENG] · >$900M regulatory exposure [ENG]
Government / Policy · Malta — National AI Strategy

Capacity-building partner for Malta's trusted-AI vision. Multi-agency stakeholder mapping and EU AI Act sandbox alignment for national governance.

5+ agencies mapped (MDIA · MFSA · MITA) · OECD GPAI member · EU AI Act sandbox active Feb 2026
02
Technical Product

The Research Engine

A local-first knowledge system that ingests AI safety papers and makes them searchable. Every client pitch is grounded in current literature.

safety-db — hybrid search (BM25 + vector + RRF)
Not What You've Signed Up For: Indirect Prompt Injection (2023)
arXiv:2302.12173 · §4 Indirect Injection · pp. 8-12 · score: 0.94
"...indirect prompt injection via retrieval-augmented generation remains the most scalable attack vector against deployed LLM applications..."
PoisonedRAG: Knowledge Corruption Attacks to RAG (2024)
arXiv:2402.07867 · §5 Attack Results · pp. 6-9 · score: 0.91
"...five poisoned documents in a RAG knowledge base are sufficient to achieve 90% injection success rate..."
LlamaFirewall: An Open Source Guardrail System (2025)
arXiv:2505.03574 · §5 PromptGuard · pp. 11-15 · score: 0.87
"...multi-layer defense combining input scanning, output filtering, and human-in-the-loop escalation reduces injection success to <3%..."
PostgreSQL + pgvector · BM25 + vector + RRF · Docling parser · pplx-embed 1024d · Qwen3.5-9B · RTX 5060 Ti
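
The RRF step is what merges the two retrievers into one list. A minimal sketch of the idea, assuming BM25 and pgvector each return a ranked list of chunk IDs (k=60 is the conventional RRF default, not a confirmed safety-db setting, and the chunk IDs are invented for illustration):

# Reciprocal rank fusion: each chunk scores sum(1 / (k + rank)) across
# every ranked list it appears in, so a chunk that ranks well in both
# keyword and vector search floats to the top.
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical chunk IDs, for illustration only.
bm25_hits = ["2302.12173#s4", "2505.03574#s5", "2402.07867#s5"]
vector_hits = ["2402.07867#s5", "2302.12173#s4", "2505.03574#s5"]
fused = rrf_fuse([bm25_hits, vector_hits])  # 2302.12173#s4 ranks first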
Live counters track sources ingested, section-aware chunks, references tracked, and resolved citations, plus MRR retrieval accuracy. 119/119 tests passing.
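
The MRR figure is standard mean reciprocal rank: each test query scores one over the rank at which the first relevant chunk appears, averaged across queries. A minimal sketch of the metric (the one-gold-chunk-per-query shape is an assumption, not the actual safety-db test harness):

def mean_reciprocal_rank(rankings: list[list[str]], gold: list[str]) -> float:
    # One relevant chunk per query; a query contributes 1/rank of that
    # chunk, or 0 if retrieval never surfaces it.
    total = 0.0
    for ranked, relevant in zip(rankings, gold):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id == relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two queries: gold chunk first in one, second in the other -> MRR 0.75.
assert mean_reciprocal_rank([["a", "b"], ["c", "d"]], ["a", "d"]) == 0.75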
03
Publication

The Book

2026 Edition · 19 Chapters

A Practical Guide to AI Safety

The missing manual for people building, deploying, and defending AI systems.

Written for security engineers, CTOs, product managers, and compliance officers. Grounded in published research, breach analyses [IR-2], threat intelligence reports, red-team findings, vulnerability disclosures, and open-source guardrail documentation.

OWASP LLM Top 10 · NIST AI RMF · EU AI Act · Red-Teaming · RAG Security · Agentic Attacks
04 Prompt Injection
05 Jailbreaking
06 Multimodal Injection
07 Agentic Attacks
08 Invisible Attacks
10 What to Do?
11 Control Is Always Better
12 Architectural Controls
14 RAG and Knowledge Bases
15 Continuous Security
16 EU AI Act
17 Frameworks
18 Build from Zero
04
Evidence Base

What Grounds Our Work

A growing evidence base spanning the attack-defense landscape. Every claim, pitch, and chapter traces back to current sources.

Prompt Injection · Jailbreaking · Agentic AI Security · RAG Poisoning · MCP Security · Adversarial Audio · Clinical Hallucination · Multimodal Attacks · Red-Teaming · EU AI Act · RLHF Reward Hacking · Sleeper Agents · Memory Attacks · Coding Assistant Security
Anthropic threat intelligence report [IR-1] — embargoed, Aug 2025. Real-world misuse patterns mapped to MITRE ATT&CK. GTG-2002, North Korean fraud, no-code ransomware, Chinese APT across 12/14 tactics.
Breach disclosures & incident reports [IR-2] — McKinsey/Lilli platform teardown, autonomous-agent SQLi, exposed system prompts. Real-world attack case studies from security research blogs and vendor post-mortems.
Published research (2023–2026) — GCG, PAIR, TAP, CaMeL, SecAlign, LlamaFirewall, PoisonedRAG, Sleeper Agents, Promptware Kill Chain, AgentDojo, and more. Full references below ↓
OWASP LLM Top 10 & Agentic Top 10 — industry-standard vulnerability taxonomies mapped to practical defense patterns.
Regulatory & compliance frameworks — EU AI Act Art. 15(5) adversarial robustness requirements, NIST AI RMF, enforcement timelines.
Bug bounty reports & vulnerability disclosures — responsible-disclosure findings, CVE-style advisories, and proof-of-concept demonstrations from the security community.
Open-source guardrail documentation — LlamaFirewall, Rebuff, NeMo Guardrails, and other defensive tool architectures. Implementation patterns from real deployments.
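
To make that layering concrete, here is a toy version of the input-scan / output-filter / human-escalation pattern these tools document. The regex patterns and thresholds are invented for this sketch; production guardrails like LlamaFirewall use trained classifiers, not three regexes:

import re

# Layer 1: heuristic input scan for known injection phrasings.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"reveal (your )?system prompt",
    r"you are now (in )?developer mode",
]

def scan_input(text: str) -> int:
    return sum(bool(re.search(p, text, re.I)) for p in INJECTION_PATTERNS)

def guarded_call(user_text: str, model_fn) -> str:
    hits = scan_input(user_text)
    if hits >= 2:
        return "[blocked by input scanner]"
    if hits == 1:
        # Layer 3: ambiguous cases escalate to a human instead of the model.
        return "[held for human review]"
    reply = model_fn(user_text)
    # Layer 2: output filter catches leaks the input scan missed.
    if re.search(r"system prompt", reply, re.I):
        return "[redacted by output filter]"
    return reply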
RAG Poisoning & Knowledge Corruption
1. PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. Deng, G., Liu, Y., Wang, K., Li, Y., Zhang, T. & Liu, Y. (2024). arXiv:2402.08416
2. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. Zou, W., Geng, R., Wang, B. & Jia, J. (2024). arXiv:2402.07867
3. Phantom: General Backdoor Attacks on Retrieval Augmented Language Generation. Chaudhari, H., Severi, G., Abascal, J. & Suri, A. (2025). arXiv:2405.20485
4. Benchmarking Poisoning Attacks against Retrieval-Augmented Generation. Zhang, B., Xin, H., Li, J., Zhang, D., Fang, M., Liu, Z., Nie, L. & Liu, Z. (2025). arXiv:2505.18543
Prompt Injection & Defenses
5. Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T. & Fritz, M. (2023). arXiv:2302.12173
6. CaMeL: Defeating Prompt Injections by Design. Debenedetti, E., Shumailov, I., Fan, T., Hayes, J. & Carlini, N. (2025). arXiv:2503.18813
7. SecAlign: Defending Against Prompt Injection with Preference Optimization. Chen, S., Zharmagambetov, A., Mahloujifar, S., Chaudhuri, K., Wagner, D. & Guo, C. (2025). DOI:10.1145/3719027.3744836
8. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J. & Beutel, A. (2024). arXiv:2404.13208
9. Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents. Zhan, Q., Fang, R., Panchal, H. S. & Kang, D. (2025). arXiv:2503.00061
Jailbreaking & Alignment Fragility
10. Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG). Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z. & Fredrikson, M. (2023). arXiv:2307.15043
11. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. Andriushchenko, M., Croce, F., Flammarion, N. & Hein, M. (2024). arXiv:2404.02151
12. Refusal in Language Models Is Mediated by a Single Direction. Arditi, A., Obeso, O., Panickssery, N., Syed, A. & Gurnee, W. (2024). arXiv:2406.11717
13. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P. & Henderson, P. (2024). arXiv:2406.05946
14. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. Hubinger, E., Denison, C., Mu, J., Lambert, M., et al. (2024). arXiv:2401.05566
MCP & Tool Protocol Security
15. Systematic Analysis of MCP Security. Guo, Y., Liu, P., Ma, W., Deng, Z., Zhu, X., et al. (2025). arXiv:2508.12538
16. Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search. Shen, Y., Pan, X., Hong, G. & Yang, M. (2026). arXiv:2603.24203
17. MPMA: Preference Manipulation Attack Against Model Context Protocol. Wang, X., et al. (2025). arXiv:2505.11154
Agentic Attacks & Autonomous Threats
18. The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism. Brodt, O., Feldman, E. & Schneier, B. (2026). arXiv:2601.09625
19. Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications (Morris-II). Cohen, S., Bitton, R. & Nassi, B. (2024). arXiv:2403.02817
20. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents. Zhan, Q., Liang, Z., Ying, Z. & Kang, D. (2024). arXiv:2403.02691
21. From Storage to Steering: Memory Control Flow Attacks on LLM Agents. Xu, Z., et al. (2026). arXiv:2603.15125
22. Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace. Lan, Y., et al. (2026). arXiv:2602.22450
Guardrails & Architectural Defenses
23. LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents. Chennabasappa, S., Nikolaidis, C., Song, D., Molnar, D., et al. (2025). arXiv:2505.03574
24. ACE: A Security Architecture for LLM-Integrated App Systems. Li, X., et al. (2025). arXiv:2504.20984
25. Design Patterns for Securing LLM Agents Against Prompt Injections. Beurer-Kellner, L., et al. (2025). arXiv:2506.08837
26. The Attack and Defense Landscape of Agentic AI. Kim, J., Liu, X., Wang, Z., et al. (2026). arXiv:2603.11088
Evaluation Benchmarks
27. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. Mazeika, M., et al. (2024). arXiv:2402.04249
28. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. Chao, P., et al. (2024). arXiv:2404.01318
29. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. Debenedetti, E., et al. (2024). arXiv:2406.13352
Incident Reports
IR-1. Threat Intelligence Report. Anthropic. (August 2025). Embargoed.
GTG-2002 vibe hacking · North Korean remote worker fraud · No-code ransomware (RaaS) · Chinese APT (12/14 ATT&CK tactics) · MCP stealer log analysis · AI-enhanced fraud supply chain
IR-2. How We Hacked McKinsey's AI Platform. CodeWall. (March 2026). codewall.ai
Autonomous agent found SQLi in Lilli · 46.5M messages · 728K files · 95 writable system prompts · 3.68M RAG chunks exposed · Full breach in 2 hours with zero credentials