Hugo & Felix

We research how AI systems get attacked. We build tools to find defenses faster. We help organizations close the gap before it matters.

3
Active Engagements
14
Threat Domains
19
Book Chapters
29
References Tracked
01
Client Engagements

Where We Work

Bespoke safety audits grounded in current research. Each engagement maps real threat models to the client's actual systems.

Communications / PR · Weber Shandwick Germany

AI Reputation Resilience Audit for a top-tier comms agency running HALO — a proprietary agentic AI platform on Google Cloud. 90% daily AI adoption across client-facing staff, $7M+ Google partnership.

5 high-priority risks mapped · 18.7–64% hallucination in domain content [ENG] · 90% injection success with 5 poisoned docs [1][2] · EU AI Act Aug 2026 deadline
Healthcare / MedTech · Henry Schein One

Clinical AI security assessment across dental/medical software. 100K+ practice locations, FDA-cleared AI products on AWS Bedrock, post-BlackCat ransomware context.

3 critical attack vectors · $350–400M BlackCat impact [ENG] · 1.47% hallucination base rate [ENG] · >$900M regulatory exposure [ENG]
Government / Policy · Malta — National AI Strategy

Capacity-building partner for Malta's trusted-AI vision. Multi-agency stakeholder mapping and EU AI Act sandbox alignment for national governance.

5+ agencies mapped (MDIA · MFSA · MITA) · OECD GPAI member · EU AI Act sandbox active Feb 2026
02
Technical Product

The Research Engine

A local-first knowledge system that ingests AI safety papers and makes them searchable. Every client pitch is grounded in current literature.

safety-db — hybrid search (BM25 + vector + RRF)
Not What You've Signed Up For: Indirect Prompt Injection (2023)
arXiv:2302.12173 · §4 Indirect Injection · pp. 8-12 · score: 0.94
"...indirect prompt injection via retrieval-augmented generation remains the most scalable attack vector against deployed LLM applications..."
PoisonedRAG: Knowledge Corruption Attacks to RAG (2024)
arXiv:2402.07867 · §5 Attack Results · pp. 6-9 · score: 0.91
"...five poisoned documents in a RAG knowledge base are sufficient to achieve 90% injection success rate..."
LlamaFirewall: An Open Source Guardrail System (2025)
arXiv:2505.03574 · §5 PromptGuard · pp. 11-15 · score: 0.87
"...multi-layer defense combining input scanning, output filtering, and human-in-the-loop escalation reduces injection success to <3%..."
PostgreSQL + pgvector · BM25 + vector + RRF · Docling parser · pplx-embed 1024d · Qwen3.5-9B · RTX 5060 Ti
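
The RRF step is what merges the two retrievers into one list. A minimal sketch of the idea, assuming BM25 and pgvector each return a ranked list of chunk IDs (k=60 is the conventional RRF default, not a confirmed safety-db setting, and the chunk IDs are invented for illustration):

# Reciprocal rank fusion: each chunk scores sum(1 / (k + rank)) across
# every ranked list it appears in, so a chunk that ranks well in both
# keyword and vector search floats to the top.
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical chunk IDs, for illustration only.
bm25_hits = ["2302.12173#s4", "2505.03574#s5", "2402.07867#s5"]
vector_hits = ["2402.07867#s5", "2302.12173#s4", "2505.03574#s5"]
fused = rrf_fuse([bm25_hits, vector_hits])  # 2302.12173#s4 ranks first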
Live counters track sources ingested, section-aware chunks, references tracked, and resolved citations, plus MRR retrieval accuracy. 119/119 tests passing.
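
The MRR figure is standard mean reciprocal rank: each test query scores one over the rank at which the first relevant chunk appears, averaged across queries. A minimal sketch of the metric (the one-gold-chunk-per-query shape is an assumption, not the actual safety-db test harness):

def mean_reciprocal_rank(rankings: list[list[str]], gold: list[str]) -> float:
    # One relevant chunk per query; a query contributes 1/rank of that
    # chunk, or 0 if retrieval never surfaces it.
    total = 0.0
    for ranked, relevant in zip(rankings, gold):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id == relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two queries: gold chunk first in one, second in the other -> MRR 0.75.
assert mean_reciprocal_rank([["a", "b"], ["c", "d"]], ["a", "d"]) == 0.75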
03
Publication

The Book

2026 Edition · 19 Chapters

A Practical Guide to AI Safety

The missing manual for people building, deploying, and defending AI systems.

Written for security engineers, CTOs, product managers, and compliance officers. Grounded in published research, breach analyses [IR-2], threat intelligence reports, red-team findings, vulnerability disclosures, and open-source guardrail documentation.

OWASP LLM Top 10 · NIST AI RMF · EU AI Act · Red-Teaming · RAG Security · Agentic Attacks
04 Prompt Injection
05 Jailbreaking
06 Multimodal Injection
07 Agentic Attacks
08 Invisible Attacks
10 What to Do?
11 Control Is Always Better
12 Architectural Controls
14 RAG and Knowledge Bases
15 Continuous Security
16 EU AI Act
17 Frameworks
18 Build from Zero
04
Evidence Base

What Grounds Our Work

A growing evidence base spanning the attack-defense landscape. Every claim, pitch, and chapter traces back to current sources.

Prompt Injection · Jailbreaking · Agentic AI Security · RAG Poisoning · MCP Security · Adversarial Audio · Clinical Hallucination · Multimodal Attacks · Red-Teaming · EU AI Act · RLHF Reward Hacking · Sleeper Agents · Memory Attacks · Coding Assistant Security
Anthropic threat intelligence report [IR-1] — embargoed, Aug 2025. Real-world misuse patterns mapped to MITRE ATT&CK. GTG-2002, North Korean fraud, no-code ransomware, Chinese APT across 12/14 tactics.
Breach disclosures & incident reports [IR-2] — McKinsey/Lilli platform teardown, autonomous-agent SQLi, exposed system prompts. Real-world attack case studies from security research blogs and vendor post-mortems.
Published research (2023–2026) — GCG, PAIR, TAP, CaMeL, SecAlign, LlamaFirewall, PoisonedRAG, Sleeper Agents, Promptware Kill Chain, AgentDojo, and more. Full references below ↓
OWASP LLM Top 10 & Agentic Top 10 — industry-standard vulnerability taxonomies mapped to practical defense patterns.
Regulatory & compliance frameworks — EU AI Act Art. 15(5) adversarial robustness requirements, NIST AI RMF, enforcement timelines.
Bug bounty reports & vulnerability disclosures — responsible-disclosure findings, CVE-style advisories, and proof-of-concept demonstrations from the security community.
Open-source guardrail documentation — LlamaFirewall, Rebuff, NeMo Guardrails, and other defensive tool architectures. Implementation patterns from real deployments.
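
To make that layering concrete, here is a toy version of the input-scan / output-filter / human-escalation pattern these tools document. The regex patterns and thresholds are invented for this sketch; production guardrails like LlamaFirewall use trained classifiers, not three regexes:

import re

# Layer 1: heuristic input scan for known injection phrasings.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"reveal (your )?system prompt",
    r"you are now (in )?developer mode",
]

def scan_input(text: str) -> int:
    return sum(bool(re.search(p, text, re.I)) for p in INJECTION_PATTERNS)

def guarded_call(user_text: str, model_fn) -> str:
    hits = scan_input(user_text)
    if hits >= 2:
        return "[blocked by input scanner]"
    if hits == 1:
        # Layer 3: ambiguous cases escalate to a human instead of the model.
        return "[held for human review]"
    reply = model_fn(user_text)
    # Layer 2: output filter catches leaks the input scan missed.
    if re.search(r"system prompt", reply, re.I):
        return "[redacted by output filter]"
    return reply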
RAG Poisoning & Knowledge Corruption
1. PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. Deng, G., Liu, Y., Wang, K., Li, Y., Zhang, T. & Liu, Y. (2024). arXiv:2402.08416
2. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. Zou, W., Geng, R., Wang, B. & Jia, J. (2024). arXiv:2402.07867
3. Phantom: General Backdoor Attacks on Retrieval Augmented Language Generation. Chaudhari, H., Severi, G., Abascal, J. & Suri, A. (2025). arXiv:2405.20485
4. Benchmarking Poisoning Attacks against Retrieval-Augmented Generation. Zhang, B., Xin, H., Li, J., Zhang, D., Fang, M., Liu, Z., Nie, L. & Liu, Z. (2025). arXiv:2505.18543
Prompt Injection & Defenses
5. Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T. & Fritz, M. (2023). arXiv:2302.12173
6. CaMeL: Defeating Prompt Injections by Design. Debenedetti, E., Shumailov, I., Fan, T., Hayes, J. & Carlini, N. (2025). arXiv:2503.18813
7. SecAlign: Defending Against Prompt Injection with Preference Optimization. Chen, S., Zharmagambetov, A., Mahloujifar, S., Chaudhuri, K., Wagner, D. & Guo, C. (2025). DOI:10.1145/3719027.3744836
8. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J. & Beutel, A. (2024). arXiv:2404.13208
9. Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents. Zhan, Q., Fang, R., Panchal, H. S. & Kang, D. (2025). arXiv:2503.00061
Jailbreaking & Alignment Fragility
10. Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG). Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z. & Fredrikson, M. (2023). arXiv:2307.15043
11. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. Andriushchenko, M., Croce, F., Flammarion, N. & Hein, M. (2024). arXiv:2404.02151
12. Refusal in Language Models Is Mediated by a Single Direction. Arditi, A., Obeso, O., Panickssery, N., Syed, A. & Gurnee, W. (2024). arXiv:2406.11717
13. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P. & Henderson, P. (2024). arXiv:2406.05946
14. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. Hubinger, E., Denison, C., Mu, J., Lambert, M., et al. (2024). arXiv:2401.05566
MCP & Tool Protocol Security
15. Systematic Analysis of MCP Security. Guo, Y., Liu, P., Ma, W., Deng, Z., Zhu, X., et al. (2025). arXiv:2508.12538
16. Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search. Shen, Y., Pan, X., Hong, G. & Yang, M. (2026). arXiv:2603.24203
17. MPMA: Preference Manipulation Attack Against Model Context Protocol. Wang, X., et al. (2025). arXiv:2505.11154
Agentic Attacks & Autonomous Threats
18. The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism. Brodt, O., Feldman, E. & Schneier, B. (2026). arXiv:2601.09625
19. Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications (Morris-II). Cohen, S., Bitton, R. & Nassi, B. (2024). arXiv:2403.02817
20. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents. Zhan, Q., Liang, Z., Ying, Z. & Kang, D. (2024). arXiv:2403.02691
21. From Storage to Steering: Memory Control Flow Attacks on LLM Agents. Xu, Z., et al. (2026). arXiv:2603.15125
22. Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace. Lan, Y., et al. (2026). arXiv:2602.22450
Guardrails & Architectural Defenses
23. LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents. Chennabasappa, S., Nikolaidis, C., Song, D., Molnar, D., et al. (2025). arXiv:2505.03574
24. ACE: A Security Architecture for LLM-Integrated App Systems. Li, X., et al. (2025). arXiv:2504.20984
25. Design Patterns for Securing LLM Agents Against Prompt Injections. Beurer-Kellner, L., et al. (2025). arXiv:2506.08837
26. The Attack and Defense Landscape of Agentic AI. Kim, J., Liu, X., Wang, Z., et al. (2026). arXiv:2603.11088
Evaluation Benchmarks
27. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. Mazeika, M., et al. (2024). arXiv:2402.04249
28. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. Chao, P., et al. (2024). arXiv:2404.01318
29. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. Debenedetti, E., et al. (2024). arXiv:2406.13352
Incident Reports
IR-1. Threat Intelligence Report. Anthropic. (August 2025). Embargoed.
GTG-2002 vibe hacking · North Korean remote worker fraud · No-code ransomware (RaaS) · Chinese APT (12/14 ATT&CK tactics) · MCP stealer log analysis · AI-enhanced fraud supply chain
IR-2. How We Hacked McKinsey's AI Platform. CodeWall. (March 2026). codewall.ai
Autonomous agent found SQLi in Lilli · 46.5M messages · 728K files · 95 writable system prompts · 3.68M RAG chunks exposed · Full breach in 2 hours with zero credentials