Skip to content
All white papers
White paper The Agentic CLI Architecture — Part 1 of 2

The Agentic CLI, Today: Claude, Codex & Gemini Compared

An architectural side-by-side across eighteen dimensions — what each tool bets on, and where they diverge most sharply.

17 May 2026 Toronto v2 12 min read
Agentic AI CLI architecture Comparative analysis
On this page

This is Part 1 of a two-part series. Audience: architects, security engineers, and platform leads designing or evaluating agentic coding tools. This brief catalogues how three production tools — Claude CLI, OpenAI Codex, Gemini CLI — actually build their agentic surface; Part 2 distils what an ideal design would look like from the patterns that recur.

All three CLIs converge on a similar shape: a REPL with tools, an MCP plugin layer, hierarchical project memory, and a sandbox. Look one layer deeper and the engineering priorities diverge sharply. The point of this brief is not to crown a winner; each tool optimises for a defensible bet given its parent organisation’s constraints. The point is to make those bets legible, so an architect can borrow from each with eyes open.

3
Production agentic CLIs
Claude CLI, OpenAI Codex, Gemini CLI
~5,800
Source files across the three
Across TypeScript and Rust
18
Architectural dimensions compared
From transport to governance
~50
Novel mechanisms identified
Non-obvious bets unique to one tool
The shape of the comparison — three tools, three stacks, eighteen dimensions.

How to read this brief

Each dimension is treated as a three-card comparison — what Claude does, what Codex does, what Gemini does — with a few words on each tool’s distinctive mechanism (marked NOVEL where appropriate). Source-file paths are included so the curious reader can verify. The brief deliberately understates novelty when a tool has merely adopted a third-party SDK; “NOVEL” means in-tree engineering investment, not “first to ship.”

Two acts. This brief is the descriptive half — what each tool is, today. Part 2 of the series is the synthesis — what an ideal design looks like, distilled from the patterns that recur.

TL;DR — three personalities

ToolOne-line characterisation
Claude CLIThe cache-aware, careful one. Heavy investment in prompt-cache economics (single-marker policy, cache_reference for tool results, on-disk cache-break diffs) and safety classifiers (LLM-based YOLO and Bash classifiers gating dangerous operations). Aggressive enterprise plumbing: remote-managed settings, policy-limits, an upstream-proxy WebSocket relay with MITM CA injection inside managed-session containers, compile-time PII guards on telemetry.
OpenAI CodexThe systems-engineered one. Rust + multi-OS sandboxing (Seatbelt, bubblewrap+seccomp, Windows restricted-token, a process-hardening ctor that disables core dumps before main). An app-server daemon cleanly separates engine from UI; the TUI is a JSON-RPC client that can run remotely. Guardian is a second LLM session that auto-approves on-request tool calls fail-closed. Three-tier persistence (JSONL rollouts + SQLite index + rollout-trace reducer).
Gemini CLIThe platform-integrated one. The most protocol-coupled: A2A (Agent-to-Agent) server with GCS persistence, ACP stdio adapter so external clients (e.g. Zed) drive Gemini as a sub-process, MCP, plus a vscode-ide-companion. Routing uses a local on-device Gemma classifier (LiteRT binary) to pick model tier; context management is a graph pipeline of processors, not a single compactor. CCPA remote admin pushes MCP servers, required-servers, and extension toggles at runtime.

Codebase shapes

Claude CLIOpenAI CodexGemini CLI
LanguageTypeScript · React/InkRust workspace · Ratatui (+ TS shell)TypeScript · npm monorepo
Source files~1,900~1,865 .rs~1,977
Tree size34 MB51 MB108 MB
LayoutFlat root: QueryEngine.ts, Tool.ts, query.ts, cost-tracker.ts, tools/ (40+ first-party tools), bridge/, buddy/, coordinator/, memdir/, remote/, plugins/, skills/, upstreamproxy/codex-rs/core/ (engine), codex-rs/tui/ (Ratatui), sandbox family (sandboxing/, linux-sandbox/, bwrap/, windows-sandbox-rs/, execpolicy/, process-hardening/), persistence (rollout, state, agent-graph-store), governance (cloud-tasks, cloud-requirements, responses-api-proxy)packages/cli (Ink front-end), packages/core (engine: agent/, scheduler/, confirmation-bus/, context/, policy/, routing/, safety/, sandbox/), packages/a2a-server, packages/vscode-ide-companion, packages/sdk + devtools + test-utils; dedicated evals/, perf-tests/, memory-tests/ harnesses

The shape itself tells the story. Claude has the flattest tree and the most ad-hoc UI experiments (voice STT, animated “buddy” companion sprite, vim mode, perfetto-style profiling). Codex is the cleanest workspace separation — each piece of the architecture is its own crate. Gemini is the most ops-aware monorepo, with separate harnesses for evals, perf, and memory.

The 18-dimension matrix

The differentiator per cell — what’s distinctive, not what’s the same. NOVEL marks novel/unique mechanisms; otherwise cells describe approach in 6–12 words.

A · Model interaction

DimensionClaude CLICodexGemini CLI
Transport & streamingNOVEL Chunked heartbeat yields under unattended-retry; stall detector + fallback to non-streamingNOVEL Responses-over-WebSocket with prewarm; sticky session reconnectNOVEL Mid-stream retry with UI-visible RETRY event; SSL BAD_RECORD_MAC in retry allow-list
Prompt economics & cachingNOVEL Single-marker cache_control, cache_reference on tool results, on-disk cache-break diffsServer-side prompt cache keyed by thread id; trim-from-head to preserve prefixImplicit context caching only; no explicit cache wiring
Context-window managementNOVEL Three-tier: auto-compact + microcompact + session-memory; reactive on prompt_too_longNOVEL Remote (server-side) compaction supported per providerNOVEL Directed-graph pipeline with processors and profiles
Prompt assemblyHierarchical CLAUDE.md + auto-memdir/MEMORY.md; sectioned cache-scoped blocksNOVEL Root-to-cwd concat of all AGENTS.md; AGENTS.override.md for local override4-tier GEMINI.md (global → user-project → extension → project); inode dedup
Model routingProvider strategy (firstParty / Bedrock / Vertex / Foundry) + small-fast/main splitData-driven ModelProviderInfo; built-in Ollama & LM Studio crates; Responses API onlyNOVEL Composite strategy chain incl. local Gemma classifier (LiteRT)

B · Execution & safety

DimensionClaude CLICodexGemini CLI
Tool dispatch & concurrencyPer-tool isConcurrencySafe; parallel groups; shell-task tracking per agentNOVEL RwLock parallelism: read-lock = parallel-safe, write-lock = exclusiveStateful scheduler; wait_for_previous flag batches parallelizable calls
Trust & permissionNOVEL LLM-based YOLO/bash classifiers gating each call; killswitch on bypass modeNOVEL Guardian LLM auto-approver (fail-closed); 5-category Granular approval configNOVEL Per-prompt LLM-generated Conseca security policy; subagent-sanitizing message bus
Sandboxing & isolationExternal @anthropic-ai/sandbox-runtime + CCR upstream-proxy WS relay with MITM CANOVEL Seatbelt + bwrap+seccomp + Windows restricted-token + process-hardening ctorNOVEL Six backends: docker / podman / sandbox-exec / runsc / lxc / windows-native
Sub-agent isolationNOVEL Built-in adversarial verification agent; coordinator + tmux swarmFull child Session; dedicated agent-graph-store + agent-identity JWTNOVEL Full A2A protocol (Agent-to-Agent); GCS-backed task store
Error handlingTyped errors; query-source-aware retry; reactive compaction on prompt-too-longSplit request_max_retries vs stream_max_retries; explicit PendingUnauthorizedRetryPer-model terminal/sticky-retry tracking; onPersistent429 validation flow

C · State & extensibility

DimensionClaude CLICodexGemini CLI
State persistenceJSONL transcripts; conversationRecovery with orphan filters; sidechain transcripts for sub-agentsNOVEL JSONL rollouts + SQLite state-db + rollout-trace reducer (three-tier)NOVEL Shadow-Git checkpoint repo per session for tool-call rollback
Plugins / MCP / extensionsIn-process + SDK + VS Code MCP transports; .mcpb zipped bundles; hooks event-busLayered: plugin / core-plugins / skills / connectors; external-agent-migration imports Claude configIntegrity-checked extensions; requiredMcpServers from admin policy
IDE integrationLockfile-based detection across 13+ JetBrains products; VS Code SDK transport; diff pushApp-server JSON-RPC bridge; SessionSource::VSCode first-classNOVEL MCP companion + ACP stdio adapter (driven-from-outside)
Terminal UIInk (React) + voice STT, vim mode, buddy companion sprite, FPS trackerRatatui; TUI is app-server client (in-process or remote)Ink + Kitty-keyboard protocol + dedicated screen-reader layout

D · Ops & governance

DimensionClaude CLICodexGemini CLI
Usage & cost trackingFull per-session/per-tool/per-model accounting; persisted cost; 5-hour Pro/Max windowsToken usage with cached_input_tokens + BASELINE_TOKENS floor; server-side billingNOVEL “Google One AI credits” overage currency; in-band consumption per stream chunk
Telemetry & privacyNOVEL Compile-time PII guards (typed-never casts); Datadog + 1P BigQuery; 3-tier privacy levelsSeparate OTel + analytics pipelines; W3C trace context; SQLite log-dbOTel SDK + Google Cloud exporters + Clearcut (Google-internal logging)
Auth & identityOAuth+PKCE; API key; Bedrock/Vertex creds; CCR JWT bridge; OAuth-revoked recoveryNOVEL PKCE + device-code login; AgentIdentity JWKS-verified JWT for M2MSix auth types incl. Code Assist; loopback OAuth2; ADC; gateway URL override
Remote adminNOVEL Remote-managed settings + policy-limits; CCR session viewer; upstream-proxy WSNOVEL App-server daemon + cloud-tasks + cloud-requirements + responses-api-proxyNOVEL CCPA admin push (MCP servers, required-MCP, extensions); IPC-forwarded across sandbox

Selected deep dives

The full matrix is dense by design — it’s a reference, not a narrative. Below are the four dimensions where the divergence between tools is most architecturally instructive.

Prompt economics & caching — three philosophies

Claude is the most invested. Exactly one cache_control: ephemeral marker per request on the last message; for fire-and-forget forks (skipCacheWrite) it’s shifted one back so the fork does a server-side no-op merge. System prompts are split into named, cache-scoped text blocks (buildSystemPromptBlocks). cache_reference is injected on tool-result blocks by tool_use_id, letting the server re-use already-cached tool outputs. A cache-break detector hashes system/tool schemas/betas/extra-body every call, diffs them, and dumps a cache-break-*.diff artifact when something invalidates the cache.

Codex leans on OpenAI’s server-side cache. Every Responses API request carries prompt_cache_key = thread_id. The CLI deliberately keeps the cached prefix stable: when the context window overflows mid-turn it remove_first_item() rather than the last — “preserve cache (prefix-based) and keep recent messages intact” (compact.rs:223–231).

Gemini has no explicit Gemini context-cache wiring in source. Caching is delegated to Gemini’s implicit context cache via stable history shape.

Trust & permission — three central novel ideas

All three layer rule-based gates with LLM-based judgment, but each has its own central novel mechanism.

Claude has the YOLO classifier: in auto mode, every tool call is run through an LLM-based classifier (YOLO_CLASSIFIER_TOOL_NAME = 'classify_result') that takes a stripped transcript — only user text plus assistant tool_use blocks, never assistant text, preventing the model from influencing its own classifier — and returns {thinking, shouldBlock, reason}. Bypass-permissions has a remotely-disablable killswitch.

Codex has Guardian: a second LLM session (codex-auto-review) that auto-grants or denies on-request approvals using a structured-output contract — fail-closed on timeout or malformed output, capped at MAX_CONSECUTIVE_GUARDIAN_DENIALS_PER_TURN = 3. The five-mode AskForApproval enum includes a uniquely fine-grained Granular(GranularApprovalConfig) variant with separate booleans for sandbox approval, exec-policy rules, skill approval, MCP elicitations.

Gemini has Conseca plus a sub-agent-sanitising MessageBus. Conseca is a per-prompt LLM-generated security policy enforced in-process. The MessageBus’s derive(subagentName) produces an untrusted child bus that scrubs forcedDecision, metadata, and rewrites subagent identity — so a sub-agent cannot impersonate its parent’s policy.

Sandboxing — the most divergent dimension

Claude delegates to an external package — @anthropic-ai/sandbox-runtime — and adds a CCR upstream-proxy: inside a managed-session container, the CLI reads a session token, sets PR_SET_DUMPABLE=0 to block ptrace of the heap, downloads a MITM CA cert, and starts a CONNECT-over-WebSocket relay wrapping bytes in UpstreamProxyChunk protobufs so every subprocess (curl, gh, python) goes through the org’s egress proxy.

Codex ships three per-platform sandbox backends picked by get_platform_sandbox: MacosSeatbelt, LinuxSeccomp (bubblewrap+seccomp via the codex-linux-sandbox helper binary), and WindowsRestrictedToken. macOS uses /usr/bin/sandbox-exec hardcoded against PATH injection with .sbpl policies modeled on Chromium’s renderer sandbox. A separate codex-process-hardening crate runs pre_main_hardening() via #[ctor] — disables core dumps, ptrace, strips LD_PRELOAD / DYLD_* before main executes.

Gemini ships six backends: docker, podman, sandbox-exec, runsc (Docker + gVisor), lxc, windows-native. The CLI re-execs itself inside the chosen sandbox; supports a project-level .gemini/sandbox.Dockerfile. Separate SandboxPolicyManager for shell-command-level filtering inside the sandbox.

State persistence — and the closest thing to event sourcing in the comparison

Claude persists session transcripts append-only at ~/.claude/projects/<projectHash>/<sessionId>.jsonl. On resume, conversationRecovery.ts rebuilds the chain with orphan-filtering and copy-forwards fileHistory plus plan snapshots. Compact boundaries are first-class persisted message types (SystemCompactBoundaryMessage, MicrocompactBoundaryMessage, RequestStartEvent, TombstoneMessage, ToolUseSummaryMessage) — so the log can replay non-trivial state including microcompact deletions. Sub-agent transcripts are stored as sidechains.

Codex does the cleanest job. Three-tier persistence: JSONL rollouts at ~/.codex/sessions/rollout-<rfc3339>-<uuid>.jsonl, a SQLite state DB (state_db.rs, sqlite_metrics.rs) for metadata indexes, and a separate rollout-trace crate for replay and reduction of conversations. RolloutItem discriminates ResponseItem / EventMsg / Compacted / TurnContext / SessionMeta. Two persistence policies (Limited vs Extended) tunable per session. The TUI directly resumes from any rollout via find_thread_path_by_id_str.

Gemini persists sessions to <projectTempDir>/chats/session-<timestamp>-<id>.jsonl. The unique mechanism is a separate shadow-Git checkpoint repo (author “Gemini CLI”) created per session for tool-call rollback — file-state snapshots captured as Git commits, allowing undo by checkout.

Where each one teaches the others

A short list of patterns worth borrowing, by tool of origin:

  • From Claude: compile-time PII type-guards in telemetry (AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS cast through typed-never), cache_reference by content identity, the adversarial verification sub-agent pattern.
  • From Codex: the RwLock concurrency model (read-shared, write-exclusive), the three-tier persistence design, the JWKS-verified AgentIdentity JWT for M2M sub-agents, the process-hardening ctor.
  • From Gemini: the MessageBus derive() that strips parent-only fields when forking to a sub-agent (genuinely capability-shaped), the directed-graph context pipeline with explicit invariant checks, the shadow-Git for filesystem rollback, ACP as a standard external-driving protocol.

What the comparison doesn’t answer

The descriptive half stops at “what is.” It doesn’t tell you which patterns to adopt or what an ideal design would look like if you started fresh tomorrow. That’s the synthesis — and the subject of Part 2 of this series.

Some recurring questions the data raises and Part 2 takes up: Should cache be a content-addressed Merkle DAG? Are LLM-based safety classifiers load-bearing or hedge? Is one microVM per tool call practical? What does a portable agent ↔ IDE protocol look like? Why does an event-sourced session log keep emerging as the right shape?

The brief catalogue you’ve just read is the raw material; the next one is the synthesis.

Bring this rigor to your own AI controls.

If this series maps to a problem on your desk, a short call is the fastest way to compare notes.