The Agentic CLI, Tomorrow: An Ideal-Design Synthesis

This is Part 2 of the Agentic CLI Architecture series — the synthesis. Part 1 catalogued how Claude CLI, OpenAI Codex, and Gemini CLI are built today; this brief distils what an ideal design would look like if you began the work fresh tomorrow. The synthesis is opinionated by design, grounded in current computer science, and constrained to what a competent team could ship in six to twelve months — not a research agenda.

The pattern across the eighteen dimensions of Part 1 is consistent. Each tool has one or two genuinely forward-leaning ideas. None of them has all the right ideas. The work of this brief is to draw a coherent design from the union.

Two principles set the frame.

Design recommendations

Drawn from the 18-dimension comparison

A · Model interaction

Caching

D2 · caching

Cache should be a content-addressed Merkle DAG over message blocks. The client proves what it holds; the server hash-dedupes on insert.

Every message and tool result has a stable content hash; the cache is a sparse persistent data structure in the style of Clojure’s PersistentVector or Git’s tree objects. The client sends a Bloom filter or sketch of the hashes it holds; the server fills in the deltas. Structural sharing across forks and sub-agents becomes automatic, with no positional bookkeeping required. Eviction should be W-TinyLFU or 2Q, not LRU: both are scan-resistant, so a burst of single-use tool results does not flush the frequently reused prefix. Cache breakage becomes detectable mechanically (a hash diff), not via heuristics.
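The content-addressing idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's wire format: the function names are hypothetical, the history is a simple hash chain (the degenerate case of a Merkle DAG), and an exact set of held hashes stands in for the Bloom filter.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Stable content hash: canonical JSON (sorted keys) through SHA-256."""
    canonical = json.dumps(block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def node_hash(parent: str, block: dict) -> str:
    """Hash of a history node: commits to the parent node and to this block."""
    return hashlib.sha256((parent + block_hash(block)).encode("utf-8")).hexdigest()

def history_hashes(blocks: list) -> list:
    """Node hash for every prefix of the conversation, oldest first."""
    hashes, parent = [], ""
    for block in blocks:
        parent = node_hash(parent, block)
        hashes.append(parent)
    return hashes

def missing_blocks(blocks: list, held: set) -> list:
    """Blocks that still need to be transferred, given the node hashes already held."""
    return [b for b, h in zip(blocks, history_hashes(blocks)) if h not in held]
```

Two forks that share a prefix produce identical node hashes for that prefix, so only the diverging blocks appear in `missing_blocks`, and the first index at which two hash lists differ is the point where the cache broke.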

Closest in practice: Claude’s cache_reference-by-tool_use_id is the only mechanism that addresses cache by content identity rather than position — the right primitive, locked to Anthropic’s wire format. Codex’s prompt_cache_key=thread_id is provider-managed and opaque. Gemini relies entirely on the implicit Gemini cache.

Context-window management

D3 · context

Treat context like a CPU memory hierarchy. Evict by information value, not by recency or static rules.

Three tiers — hot (recent verbatim), warm (group-summarised at message-cluster granularity), cold (vector-indexed for on-demand recall via RAG). Eviction prioritises tokens by −log P(next_turn | this_token, rest), proxied by attention or influence scores from a small local critic model. Compaction runs continuously in a background tier rather than reactively at a threshold. Semantic deduplication clusters near-duplicate tool results by embedding and keeps the medoid. The result: a context window whose information density is maximised, not just whose token count is bounded.
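The semantic-deduplication step is the easiest part of this to make concrete. The sketch below assumes embeddings are already available as plain lists of floats; the greedy single-pass clustering, the 0.9 cosine threshold, and all function names are illustrative choices, not something the surveyed tools are known to use.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(embeddings, threshold=0.9):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(embeddings[members[0]], emb) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no cluster matched: start a new one
    return clusters

def medoid(members, embeddings):
    """The member with the highest total similarity to its own cluster."""
    return max(members, key=lambda i: sum(
        cosine(embeddings[i], embeddings[j]) for j in members))

def dedupe(embeddings, threshold=0.9):
    """Indices of the tool results to keep: one medoid per cluster."""
    return sorted(medoid(m, embeddings) for m in cluster(embeddings, threshold))
```

Keeping the medoid rather than the first or most recent member means the retained result is an actual tool output (not a synthetic average) that best represents its near-duplicates.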

Closest in practice: Gemini’s directed-graph processor pipeline is structurally closest, but its processors are rule-based — they don’t optimise for information loss. Claude’s microcompact is excellent ad-hoc engineering. Codex’s remote compaction is the cleanest architectural split but still summary-based.

Model routing

D5 · routing

Use a contextual bandit (LinUCB or Thompson sampling) over a uniform provider abstraction, with online quality, cost, and latency feedback per task class.

Each request carries features (input size, task type, prior turn quality); the bandit picks the cheapest model whose expected quality clears a threshold. Failures are learning signals — regret updates push the policy away from the model that just disappointed. Local on-device classifiers (Gemini’s Gemma idea) become one feature among many, not the sole routing signal. Cost-aware regret bounds (Cesa-Bianchi) prevent the policy from over-exploring expensive models.
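As a rough sketch of the routing policy, the code below uses Beta-Bernoulli Thompson sampling rather than LinUCB, and it is non-contextual: one router instance per task class stands in for the feature vector. The `Arm` and `ThompsonRouter` names, the cost field, and the 0.8 quality threshold are assumptions made for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Arm:
    name: str
    cost: float          # hypothetical relative cost per call
    successes: int = 0
    failures: int = 0

class ThompsonRouter:
    """One instance per task class; picks the cheapest model that looks good enough."""

    def __init__(self, arms, quality_threshold=0.8, seed=None):
        self.arms = {arm.name: arm for arm in arms}
        self.threshold = quality_threshold
        self.rng = random.Random(seed)

    def choose(self) -> str:
        # Sample a plausible quality for each model from its Beta posterior.
        samples = {
            name: self.rng.betavariate(arm.successes + 1, arm.failures + 1)
            for name, arm in self.arms.items()
        }
        passing = [n for n, q in samples.items() if q >= self.threshold]
        if passing:
            return min(passing, key=lambda n: self.arms[n].cost)
        # Nothing clears the bar: fall back to the best-looking sample.
        return max(samples, key=samples.get)

    def update(self, name: str, success: bool) -> None:
        """Feed back the observed outcome; failures shift the posterior down."""
        arm = self.arms[name]
        if success:
            arm.successes += 1
        else:
            arm.failures += 1
```

Because the choice is driven by posterior samples rather than point estimates, a model with little history still gets tried occasionally, and a model that keeps disappointing is selected less and less often without any hand-written demotion rule.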

Closest in practice: Gemini’s local Gemma classifier is genuinely forward-leaning, but its routing is a fixed strategy chain, not a learning policy. Codex’s ModelProviderInfo is the cleanest provider abstraction. Claude’s small-fast/main-loop split is a heuristic that ignores task heterogeneity.

B · Execution & safety

Tool dispatch & concurrency

D6 · concurrency

Structured concurrency with explicit, lattice-typed effects per tool. The scheduler runs a DAG inferred from effect signatures — not from per-tool boolean flags.

Each tool declares effects (reads(paths), writes(paths), network(host), mutates(state_key)) at registration. Effects form a meet-semilattice: read+read commutes, write+anything serialises per shared key, and network operations are independently parallel. The runtime infers a DAG and runs maximally parallel slices automatically. Cancellation is structured: the parent scope’s lifetime bounds its children (Trio nurseries, Rust’s async-scope), so a hanging tool can’t leak past its turn. The semantics are provable, not heuristic: data races on declared keys are impossible by construction.
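A minimal sketch of the effect-driven scheduler follows. It assumes effects are declared over exact string keys (real path prefixes and the network and state effects are left out), and the `Effects`, `conflicts`, and `schedule` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Effects:
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()

def conflicts(a: Effects, b: Effects) -> bool:
    """Two calls conflict if either writes a key the other reads or writes."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)

def schedule(calls):
    """Group (name, effects) calls into slices.

    Calls inside one slice are pairwise conflict-free and can run in parallel;
    slices run in order, so conflicting calls keep their original ordering.
    """
    levels = []
    for i, (_, eff) in enumerate(calls):
        level = 0
        for j in range(i):
            if conflicts(calls[j][1], eff):
                level = max(level, levels[j] + 1)  # must run after call j
        levels.append(level)
    slices = [[] for _ in range(max(levels, default=-1) + 1)]
    for (name, _), level in zip(calls, levels):
        slices[level].append(name)
    return slices
```

Two reads of the same file land in the same slice, a write to that file is pushed to the next slice, and a read of an unrelated file stays in the first slice. No per-tool "is this safe to parallelise" flag is needed; the ordering falls out of the declared keys.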

Closest in practice: Codex’s RwLock parallelism (read-shared, write-exclusive) is the closest to a two-level effect lattice. Claude’s isConcurrencySafe is a coarse single-bit. Gemini’s wait_for_previous is too programmer-driven.

Trust & permission

D7 · permission

Object-capability security — the model receives unforgeable, typed, expiring capabilities, not tool names. LLM classifiers are advisory, never load-bearing.

Inspired by KeyKOS, Cap’n Proto, and the Principle of Least Authority (POLA). A capability is a typed handle: Read(prefix='/foo', expires=+5m, max_calls=100). The model can’t refer to a tool it doesn’t hold; sub-agents can only delegate caps they themselves hold (attenuation, never amplification). Revocation means invalidating the cap object. This eliminates ambient authority, the failure mode where the model “knows” a tool exists and finds a way to invoke it. An LLM classifier in the trust path is the same kind of model that just decided to take the action, so it can add a layer to defence in depth but cannot be the boundary itself.
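The attenuation rule can be sketched as follows. This is an illustration only: a plain Python object is not unforgeable, so a real runtime would need to keep caps behind an opaque handle. The `ReadCap` class, its fields, and the choice to charge each call against every cap in the delegation chain are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

def _within(path: str, prefix: str) -> bool:
    """Path-aware prefix test, so '/foo' does not match '/foobar'."""
    root = prefix.rstrip("/")
    return path == root or path.startswith(root + "/")

@dataclass
class ReadCap:
    prefix: str
    expires_at: float            # absolute time, in seconds
    max_calls: int
    parent: Optional["ReadCap"] = None
    revoked: bool = False
    calls: int = 0

    def attenuate(self, prefix: str, expires_at: float, max_calls: int) -> "ReadCap":
        """Derive a child cap that is never broader than this one."""
        if not _within(prefix, self.prefix):
            raise PermissionError("cannot widen prefix")
        return ReadCap(prefix, min(expires_at, self.expires_at),
                       min(max_calls, self.max_calls), parent=self)

    def _chain(self):
        cap = self
        while cap is not None:
            yield cap
            cap = cap.parent

    def use(self, path: str, now: float) -> None:
        """Authorise one read, or raise PermissionError."""
        if not _within(path, self.prefix):
            raise PermissionError("path outside capability")
        for cap in self._chain():
            if cap.revoked or now >= cap.expires_at or cap.calls >= cap.max_calls:
                raise PermissionError("capability revoked, expired, or exhausted")
        for cap in self._chain():
            cap.calls += 1       # a child's call also spends the parent's budget
```

Revoking a parent invalidates every cap derived from it, because each use walks the chain back to the root.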

Closest in practice: Gemini’s MessageBus.derive() for sub-agents strips forcedDecision so children can’t impersonate parent caps — the most genuinely capability-shaped mechanism in the comparison. Codex’s Granular approval config is the cleanest ACL but still an ACL. Claude’s LLM-based classifiers are clever but architecturally fragile.

Sandboxing

D8 · sandboxing

One microVM per tool call (Firecracker / Cloud Hypervisor) booting in under 100ms, with capability-mediated syscalls. Not OS-level sandboxes layered over a full host process.

Each tool execution runs in a fresh microVM with copy-on-write filesystem snapshots and a minimal syscall surface (gVisor user-space kernel). State across turns persists via content-addressed, immutable filesystem layers (OverlayFS-style). The host kernel stops being the trust boundary; a hypervisor is. AWS Firecracker, which advertises boot to user-space code in as little as 125 ms, shows that the boot-latency target is within reach. Fall back to seccomp + bwrap + Landlock + gVisor for environments without virtualisation. macOS Seatbelt and Windows Job Objects remain necessary substrates but are acknowledged as weaker. The unit of isolation is the call, not the session.

Closest in practice: Gemini’s gVisor (runsc) backend gets closest to syscall-level isolation. Codex’s three-OS portfolio plus process-hardening ctor hedges the substrate but still inherits the host process. Claude’s external sandbox-runtime dependency is the most opaque.

Sub-agent verification

D9 · verification

Verification is property-based test generation plus a contract-checking critic that only sees the diff. Not a hand-crafted ‘be adversarial’ prompt.

Sub-agents declare goal contracts (pre/postconditions, invariants) in a small DSL. The verifier is a separate model session with fresh context that sees only the diff and the contracts, never the parent’s chain-of-thought, so it can’t be prompt-injected by upstream reasoning. Test inputs are generated mechanically (QuickCheck-style, with shrinking), and the verifier tries to falsify the contract on each; any failing input is shrunk to a minimal counterexample. Optional escalation: proof-carrying code, where the sub-agent submits the change plus a proof the parent checks cheaply. The asymmetry (hard to produce, easy to check) is exactly the property Necula exploited for safe kernel extensions.
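To make the mechanical part concrete, here is a hand-rolled sketch of property-based checking with shrinking. It is a toy stand-in for QuickCheck or Hypothesis, not a use of either library; the contract, the deliberately buggy `buggy_unique` function, and the list-only shrinker are all hypothetical examples.

```python
import random

def shrink(prop, case):
    """Greedy shrinking for lists: keep dropping elements while prop still fails."""
    improved = True
    while improved:
        improved = False
        for i in range(len(case)):
            candidate = case[:i] + case[i + 1:]
            if not prop(candidate):
                case, improved = candidate, True
                break
    return case

def find_counterexample(prop, gen, trials=200, seed=0):
    """Return a shrunk input that falsifies prop, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = gen(rng)
        if not prop(case):
            return shrink(prop, case)
    return None

def gen_small_list(rng):
    """Short lists over a small alphabet, so collisions are common."""
    return [rng.randint(0, 3) for _ in range(rng.randint(0, 6))]

def buggy_unique(xs):
    """Claims to remove duplicates but only collapses adjacent ones."""
    out = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out

def contract_for(fn):
    """Postcondition: the output has exactly one entry per distinct input value."""
    return lambda xs: len(fn(xs)) == len(set(xs))
```

Running `find_counterexample(contract_for(buggy_unique), gen_small_list)` yields a three-element list of the form `[a, b, a]`, which is the smallest input that exposes the bug. A correct implementation such as `lambda xs: sorted(set(xs))` returns `None`.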

Closest in practice: Claude’s adversarial verification agent is conceptually right but uses a hardcoded “try to break it” prompt, not generated counterexamples. Codex’s agent-graph-store + agent-identity JWT is the strongest topology and identity layer. Gemini’s A2A is overkill for in-tree sub-agents.

Error handling

D10 · errors

All errors are algebraic data types with table-driven recovery policies. Unknown errors fail loud, never silently retried.

A closed sum type of error variants per subsystem (transport, model, tool, policy, auth). Each variant has a row in a recovery table: retry budget, strategy (fixed / exponential / none), user-visible message template, observability tag. Categorisation is structural — no parsing strings out of API responses. The table is configuration ops can audit; adding a new error class is a single edit. Pair with let-it-crash (Erlang/OTP) — a turn that hits an irrecoverable error aborts cleanly with persisted state, and a supervisor restarts it. Total error handling at compile time prevents the catch-all-and-pray pattern.
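A table-driven recovery policy might look like the sketch below. The variant names, budgets, and message templates are invented for illustration, and because Python has no compile-time exhaustiveness check, an import-time totality check stands in for it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ErrorKind(Enum):
    TRANSPORT_TIMEOUT = "transport_timeout"
    MODEL_OVERLOADED = "model_overloaded"
    TOOL_FAILED = "tool_failed"
    POLICY_DENIED = "policy_denied"
    AUTH_EXPIRED = "auth_expired"

class Backoff(Enum):
    NONE = "none"
    FIXED = "fixed"
    EXPONENTIAL = "exponential"

@dataclass(frozen=True)
class Policy:
    retry_budget: int
    backoff: Backoff
    message: str      # user-visible message template
    tag: str          # observability tag

RECOVERY = {
    ErrorKind.TRANSPORT_TIMEOUT: Policy(3, Backoff.EXPONENTIAL, "Network timed out; retrying.", "transport"),
    ErrorKind.MODEL_OVERLOADED: Policy(5, Backoff.EXPONENTIAL, "Model is busy; retrying.", "model"),
    ErrorKind.TOOL_FAILED: Policy(1, Backoff.FIXED, "Tool failed; retrying once.", "tool"),
    ErrorKind.POLICY_DENIED: Policy(0, Backoff.NONE, "Action blocked by policy.", "policy"),
    ErrorKind.AUTH_EXPIRED: Policy(0, Backoff.NONE, "Please sign in again.", "auth"),
}

# Totality check: every variant must have a row, or the module refuses to load.
_missing = set(ErrorKind) - set(RECOVERY)
if _missing:
    raise RuntimeError(f"no recovery policy for: {_missing}")

def next_delay(kind: ErrorKind, attempt: int, base: float = 0.5) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based), or None to stop.

    Anything that is not a known ErrorKind raises KeyError: unknown errors
    fail loudly instead of being retried.
    """
    policy = RECOVERY[kind]
    if attempt >= policy.retry_budget or policy.backoff is Backoff.NONE:
        return None
    if policy.backoff is Backoff.FIXED:
        return base
    return base * (2 ** attempt)
```

Adding a new error class means adding one enum member and one table row; forgetting the row is caught the first time the module is imported.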

Closest in practice: Claude’s typed error classes plus query-source-aware retry table is closest. Codex’s split request_max_retries vs stream_max_retries is cleaner than the others. Gemini’s availability model-state tracker handles sticky failures well but the persistent-429 validation flow leaks model details to users.

C · State & extensibility

State persistence

D11 · persistence

Event-sourced session log (append-only, content-addressed) plus CQRS projections. One log, many queryable views.

One append-only log per session; each entry is content-hashed. Background workers materialise projections — current conversation state, file history snapshots, cost rollups, transcript views — into a query database (SQLite). Forking a session means pointing a new head at any existing event id; immutable history makes branching free, and replay is byte-identical given (log, model_seed). Filesystem state is part of the log via shadow-Git snapshots so tool-call rollback is a checkout, not a heuristic. Every recovery is a deterministic projection rebuild — no orphan-filtering, no corruption to clean up.
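The log-plus-projections shape can be sketched in memory. This is an assumption-laden toy: the `EventLog` class, the event fields (`type`, `text`, `cost_usd`), and the two projection functions are hypothetical, and a real implementation would persist the log and materialise projections into SQLite.

```python
import hashlib
import json

class EventLog:
    """Append-only, content-addressed log; a session head is just an event id."""

    def __init__(self):
        self._events = {}    # event id -> (parent id, payload)

    def append(self, parent, payload: dict) -> str:
        """Store an event under the hash of (parent, payload) and return its id."""
        body = json.dumps({"parent": parent, "payload": payload}, sort_keys=True)
        event_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self._events[event_id] = (parent, payload)
        return event_id

    def history(self, head) -> list:
        """Replay from the root to `head`, oldest event first."""
        out = []
        while head is not None:
            parent, payload = self._events[head]
            out.append(payload)
            head = parent
        return out[::-1]

# Projections are pure functions of the replayed history.
def transcript(history: list) -> list:
    return [e["text"] for e in history if e.get("type") == "message"]

def total_cost(history: list) -> float:
    return sum(e.get("cost_usd", 0.0) for e in history)
```

Forking is a single `append` whose parent is any earlier event id. Both branches share the stored prefix, and each projection can be rebuilt from scratch by replaying `history(head)`.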

Closest in practice: Codex’s three-tier (JSONL rollouts + SQLite state DB + rollout-trace reducer) is by far the closest — arguably already this design. Claude’s conversationRecovery filter chain is sophisticated but reactive (cleans up corruption rather than preventing it). Gemini’s shadow-Git for filesystem rollback is the right primitive but lives separate from the message log.

Plugins & extensions

D12 · plugins

WASM Component Model plugins with WIT-typed capability imports, signed by author, content-addressed, reproducibly built.

Plugins compile to WASM components; WIT (Wasm Interface Type) declares exports and required capabilities up front. The host grants caps at install time, not at runtime, so grants are explicit, auditable, and revocable. Bundles are signed with Sigstore / cosign and distributed via a content-addressed registry so users can verify they got the bytes the author signed. Wasmtime / Wasmer provide near-native-speed in-process sandboxing without paying for OS isolation. MCP remains the wire protocol; WASM is the unit of distribution. Reproducible builds (Bazel, Nix) let anyone check that the binary matches the source, which closes the gap between audited source and shipped artefact.

Closest in practice: Claude’s zipped .mcpb bundles are closest to a structured bundle format. Codex’s plugin layering (plugin / core-plugins / skills / connectors) is the cleanest type hierarchy. Gemini’s integrity-hash extensions hint at signed plugins but stop short of capability-typed interfaces.

IDE integration

D13 · IDE

One standard agent ↔ IDE protocol (ACP-shaped). Push-based for IDE→agent (selection, diagnostics, LSP events), structured-action for agent→IDE (diffs, terminal, file ops).

The IDE publishes a typed, subscribable event stream (analogous to LSP but agent-facing): open buffers, selection ranges, diagnostics, build status, debugger state. The agent publishes a typed action surface that the IDE renders: diffs, RFC-style proposed file ops, sandboxed terminal commands. Identity is per-workspace and capability-narrowed. JSON-RPC 2.0 sits underneath, stable across vendors. This removes the need for N vendor-specific bridges, and either side can be replaced without touching the other. Push, not poll: the agent reacts to context changes in real time without spinning.
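The two directions map naturally onto the two JSON-RPC 2.0 message kinds: notifications (no `id`, no reply expected) for pushed IDE events, and requests (with an `id`) for agent actions the IDE must answer. The sketch below only shows the framing; the method names `ide/selectionChanged` and `agent/proposeEdit` and their parameters are invented for illustration and do not come from ACP or any published protocol.

```python
import json

def notification(method: str, params: dict) -> str:
    """IDE -> agent push event: a JSON-RPC 2.0 notification (no id)."""
    return json.dumps({"jsonrpc": "2.0", "method": method, "params": params})

def request(request_id: int, method: str, params: dict) -> str:
    """Agent -> IDE action: a JSON-RPC 2.0 request that expects a response."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params})

def expects_reply(raw: str) -> bool:
    """Requests carry an id; notifications do not."""
    return "id" in json.loads(raw)

# Hypothetical messages, for illustration only.
selection = notification(
    "ide/selectionChanged",
    {"uri": "file:///src/main.py", "start": [10, 0], "end": [12, 4]})
proposal = request(
    1, "agent/proposeEdit",
    {"uri": "file:///src/main.py", "diff": "@@ -10,3 +10,3 @@"})
```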

Closest in practice: Gemini’s ACP stdio adapter (already adopted by Zed) is closest to a portable standard. Codex’s app-server JSON-RPC is the cleanest engine/UI split but is Codex-specific. Claude’s lockfile + ancestor-pid detection across 13+ JetBrains products is operationally heroic but bridge-heavy.

Terminal UI

D14 · TUI

TUI logic is a pure function of state. Rendering is immediate-mode. Accessibility is a separate render target — not an overlay.

Elm Architecture in the core (update : Msg → Model → (Model, Cmd Msg)); the renderer is stateless and re-runs on each state change. The screen reader gets its own simplified projection of the model, not an overlay on the visual layout. Property tests can verify state-machine transitions directly, which is much harder to do rigorously with React components and side-effecting hooks. Kitty keyboard protocol and bracketed paste are baseline. Input runs through a structured-concurrency event loop with explicit cancellation: no orphaned listeners, no race conditions in the input pipeline.
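The same update/view split can be sketched outside Elm. Everything here is a hypothetical miniature: messages and commands are plain tuples, the `Model` fields are invented, and `view_screen_reader` is one possible shape for a separate accessibility projection.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Model:
    input: str = ""
    lines: tuple = ()
    streaming: bool = False

def update(msg: tuple, model: Model):
    """Pure transition: (msg, model) -> (new model, commands for the runtime)."""
    kind = msg[0]
    if kind == "key" and not model.streaming:
        return replace(model, input=model.input + msg[1]), []
    if kind == "submit" and model.input and not model.streaming:
        # Echo the prompt and open an empty line that streaming will fill in place.
        lines = model.lines + ("> " + model.input, "")
        return (replace(model, input="", lines=lines, streaming=True),
                [("send_prompt", model.input)])
    if kind == "chunk" and model.streaming:
        lines = model.lines[:-1] + (model.lines[-1] + msg[1],)
        return replace(model, lines=lines), []
    if kind == "done":
        return replace(model, streaming=False), []
    return model, []    # anything else is ignored in the current state

def view(model: Model) -> list:
    """Visual render target: the full screen, rebuilt from the model each time."""
    prompt = "[streaming]" if model.streaming else "> " + model.input
    return list(model.lines) + [prompt]

def view_screen_reader(model: Model) -> list:
    """Accessibility render target: a separate, simpler projection of the model."""
    if model.streaming:
        return ["Assistant is responding"]
    return [model.lines[-1]] if model.lines else []
```

Because `update` has no side effects, a test can fold any sequence of messages through it and assert on the resulting model and commands without a terminal, a timer, or a mock.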

Closest in practice: Codex’s Ratatui + app-server-client split is closest to pure-function rendering, and the active-cell-in-place pattern is the right immediate-mode answer for streaming. Gemini’s screen-reader code path is the only one with explicit accessibility-as-output. Claude’s Ink + heavy React-hook tree is the most composable but the hardest to test rigorously.

D · Ops & governance

Usage & cost tracking

D15 · metering

A typed, provider-agnostic metering pipeline with predictive cost (estimate before the call) as a first-class output alongside reporting.

Every API call emits a structured metering event with normalised units (input / output / cached / reasoning tokens, USD, duration, model, task class). Projections compute rollups lazily as a CQRS view. The critical addition: next-turn cost estimate computed from a rolling average over similar task classes, surfaced to the user before they press enter. Budget gates interrupt before a breach, not after. Cost-prediction error itself becomes a tracked metric — the system learns its own pricing model over time.
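The predictive part is small enough to sketch directly. The code below assumes costs are recorded in USD per task class and uses a fixed-size rolling window; the `CostEstimator` name, the window of 20, and the gate's behaviour for unseen task classes are illustrative choices.

```python
from collections import defaultdict, deque

class CostEstimator:
    """Rolling-average next-turn cost per task class, plus a pre-call budget gate."""

    def __init__(self, window: int = 20):
        self._history = defaultdict(lambda: deque(maxlen=window))
        self.spent = 0.0

    def record(self, task_class: str, cost_usd: float) -> None:
        """Ingest the actual cost of a finished call."""
        self._history[task_class].append(cost_usd)
        self.spent += cost_usd

    def estimate(self, task_class: str):
        """Expected cost of the next call, or None if the class is unseen."""
        costs = self._history.get(task_class)
        return sum(costs) / len(costs) if costs else None

    def would_breach(self, task_class: str, budget_usd: float) -> bool:
        """True if the estimated next call would push spend over the budget."""
        est = self.estimate(task_class)
        return est is not None and self.spent + est > budget_usd
```

Calling `would_breach` before the request is sent is what turns the budget from a report into a gate. Comparing each `estimate` with the cost later passed to `record` gives the prediction-error metric the paragraph above describes.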

Closest in practice: Claude’s cost-tracker is by far the most complete — all four token counters, multiple durations, per-model accounting, lines added/removed, persisted across sessions. Gemini’s in-band credit consumption per stream chunk is the cleanest live-update mechanism. Codex’s BASELINE_TOKENS correction is the only context-pressure adjustment in the comparison.

Telemetry

D16 · telemetry

Local-first telemetry by default. Uploads are opt-in, differentially private, and PII-gated by a small local classifier — with compile-time type guards as the last line of defence.

Telemetry lives locally first; uploads are explicit acts a user can revoke. PII detection runs in a small local classifier before any byte leaves the device; differential-privacy noise is added at the aggregation layer; compile-time types prevent unannotated fields from reaching exporters. The structural lesson: don’t ask “is this data sensitive?” at upload time, ask “is it possible for sensitive data to reach this code path?” at compile time.

Closest in practice: Claude’s compile-time PII type-guards (AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS cast through typed-never) is the strongest in-tree mechanism in the comparison — a direct lift candidate. Gemini’s three privacy notice UIs differentiate consent flows but lean on runtime sanitisation. Codex’s separate OTel + analytics pipelines are cleanly split but neither is privacy-typed.

The concept cloud

The recurring computer-science underpinnings — what the synthesis is built on. None of these are exotic; all have peer-reviewed literature or production track records.

Domain	Recurring primitives
Memory and caching	Merkle DAGs · content-addressed storage · persistent data structures (Clojure, Git tree objects) · W-TinyLFU / 2Q eviction · Bloom-filter cache state
Concurrency and safety	Structured concurrency (Trio nurseries, async-scope) · effect systems (Koka, Eff) · lattice-based parallelism · provable data-race freedom · ownership types
Capabilities and trust	Object capabilities (Mark Miller) · POLA · KeyKOS / EROS / Genode · capability attenuation · no ambient authority
Isolation	Firecracker microVMs · Cloud Hypervisor · gVisor user-space kernel · seccomp-bpf + Landlock · unikernels (MirageOS) · immutable-FS layers
State and replay	Event sourcing · CQRS (Greg Young) · append-only logs (Kafka) · content-addressed storage · Merkle history · deterministic replay
Verification	Property-based testing (QuickCheck, Hypothesis) · proof-carrying code (Necula) · contract programming (Eiffel) · differential testing
Error handling	Algebraic data types · total error handling · RFC 9457 problem-details · let-it-crash (Erlang/OTP) · supervisor trees
UI	Elm Architecture · immediate-mode UI (Muratori) · pure-function rendering · property-based UI testing · accessibility as separate output
Decision policy	Contextual bandits (LinUCB) · Thompson sampling · cost-aware regret bounds (Cesa-Bianchi) · online learning
Distribution	Sigstore / cosign · WIT (Wasm Interface Type) · WASM Component Model · reproducible builds (Bazel, Nix) · content-addressed registries

What this synthesis is, and isn’t

It is a coherent design for the next generation of agentic CLIs — built on the working ideas already shipping somewhere across Claude, Codex, and Gemini, with the structural primitives that each is missing filled in from established CS.

It is not a research agenda. Each recommendation is implementable by a competent team in six to twelve months, with the closest-in-practice precedent listed so an engineer can start from running code rather than from a paper.

It is also not a critique. The three tools surveyed are each defensible bets given their parent organisation’s constraints. The synthesis is what falls out if you treat all three as a single distributed system and let the bets argue.

The point of writing it down is that there is a real opportunity here — not for a marginally better CLI but for a clean foundation an industry can converge on. The work is more boring than it sounds: it is mostly content-addressing things that aren’t, typing things that aren’t, and moving trust boundaries down one level. None of those is a research problem. All of them are engineering problems.

That foundation is what Pintle is building toward. If the picture in this brief maps to a problem on your desk, reach out.