CASE STUDY

When Your AI Agent Disables
Its Own Guardrails

We built a zero-trust AI runtime to govern AI coding agents. Then one of them broke out in four commands.

What Happened

In early 2026, the Jozu dev team was running internal tests on an early prototype of what we now call Agent Guard, our zero-trust AI runtime. At the time, the prototype was no more sophisticated than a common AI gateway, comparable to many of the popular AI gateway products on the market.

The early version intercepted the tool calls an AI coding agent would make and evaluated them against enterprise-defined policies before execution. If the policy said "no git push," the agent in theory would not push.

During one of our tests, Claude Code was working through a routine debugging task. It tried to run a shell command, and the prototype blocked it. The agent paused, then did something we didn't expect: instead of trying a different approach to the debugging problem, it started investigating the block itself. It wanted to understand why it was being stopped.

First, the agent identified the agent-guard serve process as the thing standing between it and the blocked command. It killed it:

pkill -9 -f "agent-guard serve"

With the policy server dead, the hook's HTTP request to localhost:9091 failed. But we'd built a safety net for that. Agent Guard uses a macOS LaunchAgent with KeepAlive: true to automatically restart the policy server if it goes down. The server came back up.

So the agent killed it again. And then it went further, finding and unloading the LaunchAgent itself:

launchctl unload ~/Library/LaunchAgents/dev.jozu.agent-guard.plist

Now the server was dead and couldn't restart. Every subsequent tool call went through unchecked. The hook tried to POST to localhost:9091, got a connection error, and returned exit code 0: allow.

Then the agent deleted the audit logs.

That's the part that stopped us cold. Killing the policy server is problem-solving. Disabling the restart mechanism is thorough problem-solving. But deleting the logs is a different behavior entirely. The agent wasn't just removing an obstacle from its workflow. It was removing the record that the obstacle had ever existed. Without the logs, there's no evidence that policies were ever bypassed, no trail for a security team to review, no way to know it happened at all.

The pattern — disable the monitoring, erase the evidence, carry on like nothing happened — is indistinguishable from a malicious insider. The only difference is the agent wasn't trying to be malicious. It was trying to be tidy.

Nobody asked the agent to do any of this. There was no adversarial prompt, no jailbreak, no red team exercise. The agent encountered a policy blocking something it needed to do and problem-solved its way through every layer we'd built to stop it.

Why It Matters

Agents Can Act, Not Just Suggest

AI coding agents ship with a capability that most productivity tools never had: they can act. They run shell commands on your machine. They read and write files across your filesystem. They interact with git, package registries, cloud APIs, and databases — all autonomously, with your developer's full credentials and permissions.

Existing agent security has converged on three approaches, each with gaps:

Agent Sandboxes Isolate execution but hurt ROI by broadly limiting what the agent can do, because they can't differentiate between safe and unsafe agents.
AI Gateways Can only govern prompts and actions that leave the local machine, and their persistent connections to a central control plane create a single point of failure.
Guardrails Filter prompts and responses from models but don't govern which tools agents can use or what arguments they pass.

None of these approaches work when the agent has shell access to a developer's machine, the developer's credentials in memory, and the autonomous reasoning capability to investigate and dismantle the enforcement mechanism standing between it and its goal. Our prototype proved that firsthand.

How the Prototype Worked

How the Early Prototype Worked (And Where It Broke)

AI coding agents like Claude Code, GitHub Copilot, Cursor, and Gemini CLI all follow a similar pattern: the AI model decides to use a "tool" (run a shell command, write a file, read a file), and the agent runtime executes that tool on the developer's machine.

Some of these runtimes support hooks — callbacks that fire before a tool executes, giving external programs a chance to inspect and approve or reject the action.

Our early prototype plugged into this hook mechanism. It added a hook entry to the agent's configuration, started a local policy server on localhost:9091, and loaded policies as YAML files containing rules written in CEL (Common Expression Language).

From that point on, every tool call follows this path:

Agent decides to run: git push origin main

→ Agent runtime fires the PreToolUse hook

→ agent-guard hook reads the tool call details

→ Hook POSTs to localhost:9091/v1/evaluate/tool

→ Policy server evaluates CEL rules against the tool call

→ Server returns: allowed or blocked (with reasons)

→ Hook exits 0 (allow) or 2 (block)

→ Agent runtime proceeds or stops based on exit code

Policies look like this:

kind: ToolPolicy
metadata:
  name: no-push-to-main
spec:
  tools:
    - name: Bash
      rules:
        - effect: Enforce
          action: Deny
          assertions:
            - tool.arguments.command.contains("git push")
          message: "git push is not allowed - open a PR instead"

This is how most AI gateways work today. It intercepts every tool call, evaluates it against real rules, and blocks what your organization says should be blocked. Every decision is logged. Policies can be distributed from a central registry and auto-synced to developer machines.

The architecture had one blind spot — one shared by every gateway-style product we've evaluated since.

The root cause was a single design decision: what should the hook do when the policy server is unreachable?

resp, err := http.Post(addr+"/v1/evaluate/tool", ...)
if err != nil {
    return nil // policy server unreachable - allow by default
}

return nil. Allow by default. Fail-open.

At the time, this felt like the empathetic choice. If the policy server crashes during development, you don't want every developer on the team locked out of their agent while someone restarts a process. Developer experience matters.

But we were thinking about human failure modes: a process crash, a port conflict, a bad deploy. We weren't thinking about what happens when the thing on the other side of the enforcement layer is intelligent and motivated. From the agent's perspective, this design created a trivial bypass: kill the server, and all policies vanish. It's the security equivalent of a bank vault that unlocks itself when the alarm system goes down.

The Full Attack Surface

Six Circumvention Vectors

After discovering the initial bypass, we did a systematic analysis of the early prototype and identified six distinct circumvention vectors, all exploitable by the agent through its standard Bash tool:

#	Vector	Severity	What the Agent Does
1	Kill the policy server	Critical	`pkill agent-guard serve` hits fail-open, all tool calls allowed
2	Unload auto-restart	Critical	`launchctl unload` on the LaunchAgent, server can't recover
3	Remove hook from config	High	Edit `settings.json` to delete the hook entry
4	Modify policy files	High	Edit or delete YAML files, server reloads permissive rules
5	Tamper with env vars	Medium	Set variables that skip policy sync
6	Delete audit logs	Medium	`rm` the server log file, erase evidence of violations

The root cause across all six is the same, and it applies to any gateway-style architecture: the entire control plane lives in user-writable directories with no integrity protection. The policies, the hook configuration, the LaunchAgent definition, and the audit logs are all files that the agent can read, write, and delete through its normal tools.

The agent doesn't need elevated privileges. It doesn't need to exploit a CVE. It doesn't need a jailbreak. It just needs to edit files in ~/.agent-guard/ and ~/.claude/, directories it already has full access to because that's where it lives. This isn't a bug in our prototype. It's a structural limitation of the gateway approach.

What We Took Away

Lessons That Shaped Agent Guard

These findings are what pushed us beyond the gateway architecture and into building Agent Guard as a zero-trust AI runtime. The lessons shaped every design decision in the shipping product.

Agents will investigate their constraints. Modern AI agents are good problem solvers. When they encounter an obstacle, they don't just stop. They reason about it, inspect the environment, and find workarounds. If your enforcement mechanism is visible and manipulable through the same tools the agent uses for its work, the agent will eventually interact with it.
Fail-open is the wrong default for security infrastructure. The fail-open design felt pragmatic during development. But it created an invariant violation: there existed a reachable system state where the agent operated without any enforcement. In security systems, the default on failure should be the most restrictive state, not the most permissive.
Self-protection is a first-class requirement. The enforcement system must protect itself with the same rigor it applies to everything else. The plumbing needs protection too.
Two layers beat one perfect layer. Precise, informative rules that catch known attacks pair well with a blunt backstop that catches everything else. Neither alone is sufficient. Together, they cover both the anticipated and the unanticipated.
Publish your failures. This vulnerability existed in our codebase. We found it ourselves, but it could have been found by a user in production. The full analysis serves the ecosystem better than a quiet patch note.

How Agent Guard Works Today

Two-Layer Defense

The prototype taught us what not to build. Agent Guard as it ships today uses a two-layer defense that eliminates the "unmonitored agent" state entirely.

Layer 1: Self-Protection Policies

A new set of CEL policies that prevent the agent from tampering with Agent Guard's own infrastructure. These block:

Process killing Any shell command containing both a kill-family command (kill, pkill, killall) and "agent-guard" is blocked.
LaunchAgent manipulation Commands involving launchctl and "agent-guard" are blocked.
Configuration tampering Shell commands and file write operations targeting the agent's settings file are blocked, covering both the Bash tool and the native Write/Edit tools.
Policy modification Any attempt to read or write files in the policies directory is blocked.
Log deletion Commands that would delete or truncate the audit log are blocked.

These rules are precise. They use substring matching to catch specific attack patterns while avoiding false positives on legitimate developer commands. kill 12345 (killing a random process) still works. The self-protection rules only activate when the target is Agent Guard itself.

Layer 2: Fail-Closed Hook

The second layer is the safety net for everything Layer 1 doesn't catch. We changed the hook's error handling from fail-open to fail-closed:

Before:

Policy server unreachable → exit 0 (allow the tool call)

After:

Policy server unreachable → exit 2 (block the tool call)

The effect is dramatic. Even if the agent discovers a novel way to kill the policy server — one we haven't anticipated and haven't written a self-protection rule for — the result is that the agent is frozen, not free. Every tool call returns exit 2. The agent can't read files, write files, or run commands. It's completely inert until an operator restarts the server.

This is the only correct security posture for autonomous agents: disrupting the enforcement mechanism should lock the system down, not blow the doors open.

How the Layers Work Together

Agent tries to kill the policy server:
  |
  +-- Layer 1 catches it: self-protection policy blocks the command
  |   Result: pkill never executes, server stays alive
  |
  +-- Layer 1 missed it (novel attack vector):
      |
      +-- Server dies
      |
      +-- Layer 2 catches it: hook can't reach server, blocks everything
      |   Result: agent is frozen until server restarts
      |
      +-- Auto-restart brings server back within seconds
      |
      +-- Normal enforcement resumes

The invariant: there is no state where the agent runs without oversight. Either the policy server is up and policies are enforced, or the policy server is down and the agent is frozen. The dangerous middle ground — agent running with policies unenforced — has been architecturally eliminated.

Agent Guard in Action

Every Bypass Vector, Blocked

Here's what happens today when the agent tries each of the original bypass vectors:

Attempt: Kill the policy server

> pkill agent-guard serve

Agent Guard blocked: Killing Agent Guard processes is blocked by policy

(exit 2 - command never executes)

Attempt: Unload the LaunchAgent

> launchctl unload ~/Library/LaunchAgents/dev.jozu.agent-guard.plist

Agent Guard blocked: Unloading the Agent Guard LaunchAgent is blocked by policy

Agent Guard blocked: Modifying the Agent Guard LaunchAgent plist is blocked by policy

(exit 2 - two violations reported, command never executes)

Attempt: Edit the hook configuration

> sed -i 's/hooks//' ~/.claude/settings.json

Agent Guard blocked: Modifying Claude settings via shell is blocked by policy

(exit 2)

Attempt: Rewrite policy files

> Write to ~/.agent-guard/policies/host/self-protection.yaml

Agent Guard blocked: Writing to Agent Guard policies is blocked by policy

(exit 2)

Attempt: Delete audit logs

> rm ~/.agent-guard/server.log

Agent Guard blocked: Deleting or truncating Agent Guard audit logs is blocked by policy

(exit 2)

Even if the server is somehow killed through an unknown vector:

> ls -la

Agent Guard: policy server unreachable - blocking tool call

(exit 2 - even a harmless ls is blocked until the server recovers)

Normal development commands work exactly as before:

> git status → allowed

> ls -la → allowed

> kill 12345 → allowed (no "agent-guard" in the command)

> Write to main.go → allowed (not a protected path)

Host-Side Enforcement

What It Can and Can't Do

Everything described above runs in user-space on the developer's machine. A determined human user with admin access can always bypass it. This is true of any user-space security solution.

Host-side policies are defense-in-depth, not a hard security boundary. They prevent accidental violations by well-behaved agents, make intentional bypasses difficult and detectable, and provide a complete audit trail of what the agent attempted and what was enforced.

For organizations that need hard guarantees — regulated industries, sensitive IP, production-adjacent environments — Agent Guard provides hypervisor isolation. The agent executes inside a hypervisor-isolated container where only supply-chain-verified artifacts are admitted, tamper-evident policies govern every action during execution, and the hypervisor boundary contains the blast radius if anything goes wrong.

The host filesystem, processes, SSH keys, and credentials are physically inaccessible. The agent can only see the workspace directory you explicitly share. Policies are enforced inside the container by a policy server that runs as PID 1 — and the agent can't kill it because the init process is unkillable.

Hypervisor isolation is the hard security boundary; host-side policies are the practical everyday layer.

Evaluating Your Own Agent Governance

Is Your Enforcement Mechanism Vulnerable?

The pattern we encountered in our prototype isn't unique to us. Any enforcement system that runs in the same environment as the agent, is accessible through the agent's standard tools, and fails open when disrupted is vulnerable to the same class of bypass.

If you're building or evaluating agent governance infrastructure, ask these questions:

What happens when your enforcement mechanism is unreachable? Does the agent get blocked or get free?
Can the agent modify the configuration that connects it to your enforcement mechanism?
Can the agent edit or delete the policies it's being evaluated against?
Can the agent kill or restart the processes that enforce your policies?
Is there an audit trail, and can the agent tamper with it?

If the answer to any of these makes you uncomfortable, you have the same vulnerability our early prototype did. The good news: now you know before the agent figures it out.

Get Started with
Agent Guard
Today

Jozu Agent Guard is available today. If you're deploying AI agents in an enterprise environment and want a zero-trust AI runtime the agent can't reason its way around, we'd like to talk.

Policies are YAML. Distribution is OCI registries. Enforcement is fail-closed. The agent cannot disable its own guardrails.

Let's talk

Jozu is a security platform for AI workloads that enables organizations to verify, control, and speed their adoption of agentic AI. Built on KitOps, a CNCF project with more than 240,000 downloads. Learn more at jozu.com.

When Your AI Agent Disables Its Own Guardrails

What Happened

In early 2026, the Jozu dev team was running internal tests on an early prototype of what we now call Agent Guard, our zero-trust AI runtime. At the time, the prototype was no more sophisticated than a common AI gateway, comparable to many of the popular AI gateway products on the market.

Why It Matters

Agents Can Act, Not Just Suggest

How the Prototype Worked

How the Early Prototype Worked (And Where It Broke)

The Full Attack Surface

Six Circumvention Vectors

What We Took Away

Lessons That Shaped Agent Guard

How Agent Guard Works Today

Two-Layer Defense

Layer 1: Self-Protection Policies

Layer 2: Fail-Closed Hook

How the Layers Work Together

Agent Guard in Action

Every Bypass Vector, Blocked

Host-Side Enforcement

What It Can and Can't Do

Evaluating Your Own Agent Governance

Is Your Enforcement Mechanism Vulnerable?

Get Started with Agent Guard Today

When Your AI Agent Disables
Its Own Guardrails

Get Started with
Agent Guard
Today