Home

Long Horizon

May 2026

AI used to be one race. Now it's two. One's for smarter models. The other, it turns out, is for something we're only starting to figure out: long-horizon agents. Things that can run for days without derailing, finding their way to a goal mostly on their own.

They run on three things:

Reasoning and planning (thinking on the fly)
Taking actions (using tools and APIs)
Iteration and persistence (holding focus across long horizons)

The hard part isn't intelligence. It's reliability, because errors compound across steps. 200 steps at 99% accuracy lands below 14% end-to-end.

A little bit of context.

The harness is the product

The harness is the wiring between the model and the world. It's what turns the model into an agent and keeps it on track: it decides what the model sees, what it can do.

Context Engineering as a practice.

Designing what goes into the model's context window: prompts, retrieval, memory, and tools. The goal is feeding the model what it needs to work reliably.

Where designers come in

I spend my days in Claude Code, token-maxxing, but I'm not an engineer. So what's my place in this, as an AI-native product designer?

The core tension: by definition, an agent that works autonomously for a long time is doing things you're not watching. And that opens up a pile of problems that are fundamentally design problems, not model problems.

The Supervision Layer.

How might we...

Surface every step, every tool call, every decision in order?
Build trust in something the user can't observe in real time?
Decide what info collapses by default and what deserves a sparkline?

How might we surface key agent states like these, without halting over trivial things, spamming the user, or going silent?

/ I'm stuck
/ I made an assumption
/ I pursued multiple assumptions in parallel and committed work to each
/ I'm about to do something irreversible not part of your whitelist

The intervention layer.

How might we...

Let the user redirect an agent already in motion?
Let a team share control of one agent, each able to step in and redirect?
Help users navigate, compare, and iterate on agent versions?
Choose when the agent asks permission versus forgiveness?

The reward layer.

How might we...

Design tasks where the agent can verify its own progress?
Surface pass/fail signals without making users define them upfront?
Let multiple users contribute reward signal on the same agent?

All of these questions are still unsolved. They're UX problems and taste calls as much as engineering ones.

Final edits done with like-you-talk, my own Claude skill based on Paul Graham's Write Like You Talk (2015) and Write Simply (2021).