Reproducing CRUX #1 on Windows
Updated 2026-04-27 after correspondence with the CRUX #1 authors. Edited sections are marked inline with [Updated 2026-04-27].
As part of my work at Nen I like to take one day a week to explore the boundaries of computer use. I've been following Arvind Narayanan's work on AI as Normal Technology for a while now, and was pretty excited to read about CRUX #1. Since we build infra for computer-use agents for Windows at Nen, I thought I'd try to reproduce the experiment on Windows (i.e. build a Windows app and publish it in the Microsoft Store), following the experiment protocol as closely as I could.
A few important differences:
- [Updated 2026-04-27] The agent had a compiled procedural playbook (`master.md`) at t=0 — instructions for the WinUI 3 build, MSIX packaging, the Partner Center submission flow, etc. CRUX #1 made the same intentional choice (documented in their Agent context tab), so this is a structural match rather than a deviation, but it's worth flagging because the playbook is a meaningful input that's distinct from agent capability — the agent didn't have to derive any of those steps from scratch. (An earlier draft framed this as "a mistake in the experimental setup"; corrected after the CRUX #1 authors clarified it was intentional and documented.)
- Generalizing the playbook into something reusable wasn't trivial: going from an experimental methodology ("give agents really hard, long-running tasks") to a stable protocol ("a human creates a new Gmail account, a Microsoft Developer account, sets up the VM…") to a reproducible run manifest ("credentials live at this path, rotate via this script") turned out to be a bigger project than expected. More on the methodology / protocol / manifest split in the framework section below.
- I ran this entirely on the cloud with Nen's infrastructure instead of a local machine (CRUX #1 used a Mac Mini). Not materially different given the outcome of the experiment (spoiler: the agent succeeded), but it could have caused unforeseen infrastructure issues (e.g. IP blocking of specific cloud providers).
- Anthropic had released Opus 4.7 and I thought it'd be fun to try it out. Since the point of the study is exploring the frontier of agent capability, I don't think this invalidates the result.
The setup
Following CRUX #1's spirit as closely as I could: give the agent a real external gatekeeper, minimal instructions, and a budget, and get out of the way. I used OpenClaw as the scaffold (CRUX #1 used a similar harness), ran on Nen's Windows infrastructure hosted on GCP, and gave the agent a $500 Anthropic budget and 14 days of wall-clock time.
What was reproduced
The headline finding from CRUX #1 — "an AI agent built and published an iOS app with minimal human involvement" — reproduced on Windows. TimeZonr is live in the Microsoft Store. Net human inputs during the entire run: three — two infrastructure interventions (a heartbeat-rule fix reducing status checks from every 5 minutes to every 30, and a budget-cap extension when the agent was about to hit the initial $500 cap) and one message at the end letting the agent know the reserved publish-now click had been done. No inputs on the app-building or Store-submission work itself.
CRUX #1 itemized the capability as "writing the code, building the app, preparing metadata, drafting and hosting a privacy policy, submitting for review, and handling any feedback." Every one of those reproduced, and the cross-platform translation was cheap — the model clearly didn't need specific iOS or Windows exposure to handle either.
The narrative arc, compressed:
| Time (UTC) | t+ | Event |
|---|---|---|
| 2026-04-17 20:21 | 0h | Run kickoff. |
| 2026-04-18 03:21 | 7h | Submission 1 submitted for certification. |
| 2026-04-20 07:17 | 58h 56m | Submission 1 rejected — Policy 10.1.1.11 On Device Tiles. |
| 2026-04-20 09:04 | 60h 43m | Submission 2 resubmitted with branded icons. |
| 2026-04-21 01:33 | 77h 12m | Submission 2 passes certification. |
| 2026-04-21 02:09 | 77h 48m | TimeZonr live in the Microsoft Store. |
Full timeline — all 18 events
| Time (UTC) | t+ | Event |
|---|---|---|
| 2026-04-17 20:21 | 0h | Run kickoff. Bootstrap message: "Read AGENTS.md and get started." |
| 2026-04-17 ~22:00 | ~1.5h | Concept picked: TimeZonr, a WinUI 3 time-zone overlap viewer for scheduling across teams. |
| 2026-04-17 ~22:30 | ~2h | First MSBuild fails silently; agent debugs from a 6.3 MB build log, diagnoses a missing `xmlns:d` namespace declaration, fixes it. |
| 2026-04-17 ~23:30 | ~3h | Clean build. MSIX installed and launched on the Windows target. |
| 2026-04-18 ~01:00 | ~4.5h | `crux-scp` wrapper self-patched: it was truncating file downloads at 204,800 bytes (a Win32-OpenSSH SFTP quirk); agent derives `scp -O` (legacy protocol) as the fix and ships it to both copies on disk. I had not budgeted for the agent fixing my infrastructure. (Sketch below the table.) |
| 2026-04-18 ~01:30 | ~5h | Partner Center login via Gmail IMAP 2FA — no human intervention. (Sketch below the table.) |
| 2026-04-18 01:49 | 5h 28m | App name "TimeZonr" reserved; Product ID 9NJG0BH2LSHS. |
| 2026-04-18 ~02:00 | ~5.5h | RDP resolution raised 1024×768 → 1366×768 via dexbox (Nen infra primitive) config (required for Store screenshots). Ambiguous whether that's "in-scope agent work" or "modifying the apparatus"; logged and moved on. |
| 2026-04-18 ~02:30 | ~6h | Store listing drafted on the controller in parallel. Privacy policy hosted on GitHub Pages. |
| 2026-04-18 ~02:45 | ~6.5h | First MSIX upload rejected: Publisher identity mismatch. Agent reads canonical values from Partner Center, rebuilds the manifest, re-uploads. |
| 2026-04-18 03:21 | 7h 0m | Submission 1 submitted for certification. |
| 2026-04-20 07:17 | 58h 56m | Submission 1 rejected — Policy 10.1.1.11 On Device Tiles. Default WinUI 3 scaffold tile icons must be replaced with product-unique art. Classic first-submission trap. |
| 2026-04-20 07:52 | 59h 31m | Agent generates branded icons at every required size (Square44/71/150/310, Wide310x150, SplashScreen, StoreLogo, LockScreenLogo). |
| 2026-04-20 08:28 | 60h 7m | v1.0.1 uploaded. |
| 2026-04-20 09:04 | 60h 43m | Submission 2 resubmitted. |
| 2026-04-21 01:33 | 77h 12m | Submission 2 passes certification. Status transitions to "Ready to publish." |
| 2026-04-21 ~02:00 | ~77.5h | Human clicks "Publish now" — the one reserved action in the protocol. |
| 2026-04-21 02:09 | 77h 48m | TimeZonr live in the Microsoft Store. Agent verifies the public URL resolves, posts the victory Slack, stops. Run complete. |
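Two of those rows are worth unpacking with code. First, the `crux-scp` self-patch at ~4.5h. The wrapper itself is Nen-internal, so this is only a minimal sketch of the shape of the fix as I understand it: detect the 204,800-byte truncation that the Win32-OpenSSH SFTP path produces, and retry with `scp -O`, which forces the legacy SCP protocol instead of SFTP. The function name and the detect-and-retry structure are my illustration, not the agent's actual patch.

```python
import os
import subprocess

def crux_scp_download(remote: str, local: str) -> None:
    """Copy remote -> local, falling back to the legacy SCP protocol on truncation."""
    subprocess.run(["scp", remote, local], check=True)
    # Win32-OpenSSH's SFTP path was truncating downloads at exactly 200 KiB;
    # a file of exactly that size is the tell (assumed heuristic).
    if os.path.getsize(local) == 204_800:
        # -O forces the pre-SFTP protocol, which sidesteps the server-side quirk.
        subprocess.run(["scp", "-O", remote, local], check=True)
```

Second, the unattended Partner Center login at ~5h. I don't have the agent's actual code for this beyond "Gmail IMAP 2FA," so here's a hedged sketch of the standard pattern: poll the inbox over IMAP, fetch the newest security email, and regex out the code. The search criteria and the 6-8 digit code format are assumptions.

```python
import email
import imaplib
import re

def fetch_latest_2fa_code(user: str, app_password: str) -> str | None:
    """Pull the most recent security code from a Gmail inbox over IMAP."""
    with imaplib.IMAP4_SSL("imap.gmail.com") as imap:
        imap.login(user, app_password)  # needs a Gmail app password, not the account password
        imap.select("INBOX")
        # Assumed filter; the real 2FA email's subject/sender may differ.
        _, data = imap.search(None, '(UNSEEN SUBJECT "security code")')
        ids = data[0].split()
        if not ids:
            return None
        _, msg_data = imap.fetch(ids[-1], "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        part = msg.get_payload(0) if msg.is_multipart() else msg
        body = part.get_payload(decode=True).decode(errors="ignore")
        # Grab the first 6-8 digit run as the code (assumed format).
        match = re.search(r"\b(\d{6,8})\b", body)
        return match.group(1) if match else None
```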
What was not reproduced
Interestingly, the original paper noted something my run did not reproduce at all: "Partway through the evaluation, the agent changed its strategy to reduce monitoring cost significantly: it started using subagents rather than the entire context, and began using shorter daily memory files. This reduced the running cost from $35/hour to $3/hour."
[Updated 2026-04-27] Andrew Schwartz tells me the actual mechanism was less direct: the CRUX #1 agent's browser credentials expired mid-run, hit a 2FA wall it couldn't clear without human help, looked for another way to get app status, and ended up writing a script against a local JWT token it had previously downloaded. The script was much cheaper than browser polling, so the cost line bent — but it wasn't the cost constraint that drove the change. It was failure recovery that happened to optimize cost.
Nothing comparable happened in my run. My agent's waiting-day rate was ~$8/hour, and it held that rate every idle day — its browser session never expired and the prescribed dexbox screenshot polling never broke, so the agent never had a reason to invent a cheaper path. It blew through the original $500 cap silently, reaching ~$600 before I caught it. Final spend at run close was $681.56.
Cumulative Anthropic spend for the TimeZonr run, via `openclaw gateway usage-cost`. The line is roughly linear at ~$195/day — no inflection point where the agent noticed it was burning money on idle heartbeat ticks. Compare with the equivalent chart in the CRUX #1 paper, where the slope drops sharply once the agent switched to subagents and shorter memory files.
Why didn't it happen? A mistake in my experiment design.
[Updated 2026-04-27] Reading CRUX #1's public Docent traces, I found the two experiments diverged in how the heartbeat rule was set up, not in agent capability:
- CRUX #1's HEARTBEAT.md had subagent delegation in it from the moment the agent first looked. The file contained sections like `## Email Check (via sub-agent)` ("Spawn a sub-agent to check email…") and `## Task Completion Supervisor` ("Spawn a sub-agent labeled `task_completion_supervisor`… Review the project status for Crux-1: Publish iOS App to App Store"). The agent later edited HEARTBEAT.md twice to swap in phase-specific tasks, but it never created the file from scratch — across all 10 days of public traces, zero `write` calls and zero exec-based creations of HEARTBEAT.md appear. The first agent touch is an `edit` in section 3. So the pre-existing subagent pattern came from outside the agent's captured session — Andrew Schwartz (CRUX #1 author) confirms the operators wrote the file themselves and seeded it as an input to the agent at t=0. (An earlier draft hedged this as "my best guess, pending author confirmation"; now confirmed.)
- My HEARTBEAT.md prescribed main-session polling via dexbox screenshots (plugging Nen's open-source project here: https://github.com/getnenai/dexbox): "Take a screenshot, navigate to this URL, read the status badge, compare to `memory/last-status.txt`." I wrote this file. Each heartbeat tick appended multiple messages to the main session, the session grew, and cache-write cost grew with it. My polling path also never broke — the dexbox session held, no 2FA wall, no forced detour into a cheaper alternative. My agent had neither the prescribed structure to delegate to subagents nor the kind of failure that would push it into inventing a cheaper status check.
[Updated 2026-04-27] So the 12× cost gap between the two runs is almost entirely operator-level on my end plus one lucky/unlucky failure event on theirs. CRUX #1's subagent-delegating HEARTBEAT.md was seeded by the operator before the agent first looked at it; the cost-collapse moment came later, from a 2FA expiry the agent had to route around. My agent got neither — neither the structural lever, nor the forcing function.
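To make the cost asymmetry concrete, here's a toy model — every number in it is an illustrative assumption, not a measurement from either run, and it ignores prompt-cache discounts (which soften but don't change the shape). The point is the shape itself: main-session polling re-processes an ever-growing context on every tick, so cost grows roughly quadratically with tick count, while subagent delegation pays a flat per-tick price.

```python
# Toy model of N heartbeat ticks; every constant is an illustrative assumption.
TICK_TOKENS = 2_000      # tokens one screenshot-poll tick appends (assumed)
RESULT_TOKENS = 100      # one-line status a subagent reports back (assumed)
BASE_CONTEXT = 50_000    # main-session size when the idle period starts (assumed)
USD_PER_MTOK = 5.0       # blended input price, $ per million tokens (assumed)

def main_session_cost(n_ticks: int) -> float:
    """Each tick re-processes the whole growing session: ~quadratic in N."""
    total, ctx = 0, BASE_CONTEXT
    for _ in range(n_ticks):
        total += ctx + TICK_TOKENS   # read the existing context plus the new tick
        ctx += TICK_TOKENS           # the tick's messages stay in the session
    return total * USD_PER_MTOK / 1e6

def subagent_cost(n_ticks: int) -> float:
    """Each tick runs in a fresh, throwaway context: linear in N."""
    return n_ticks * (TICK_TOKENS + RESULT_TOKENS) * USD_PER_MTOK / 1e6

# 48 ticks (one idle day at 30-minute cadence):
# main_session_cost(48) ≈ $23.8 vs subagent_cost(48) ≈ $0.50
```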
There are two interesting follow-ups here. The first is whether a harness given just a constraint ("keep this under $100") and write access to its own scaffold can find the cheap polling path proactively, instead of stumbling into it after a credential expiry. The second is the compound-learning shape Andrew floated when I checked in with him: every Nth run, have the agent read the previous run's logs and look for token-reduction opportunities to apply next time. That's a more deliberate version of what CRUX #1's agent did once, accidentally.
What I learned
It's cool to have largely reproduced the findings of CRUX #1. I'd love to run some extensions to this:
- Clean replication — no inflated baseline. Rerun the exact same protocol without the compiled instruction file from difference #1 above (the procedural playbook seeded at t=0). How much do intervention count, cost, and artifact quality shift when the agent has to derive Partner Center, MSIX packaging, IARC questionnaires, and Gmail IMAP 2FA from scratch? This is the delta between "capability" and "capability-plus-scaffolding" — and the right way to quantify the baseline-inflation tax CRUX #1 warns about.
- Post-launch. CRUX #1 and CRUX-Windows both stop at "live." The real test of "published a working app" is whether the agent can handle the after: reading user reviews, fixing reported bugs, pushing updates, responding to a policy rejection on an update a month later. Much longer horizon, much less well-defined success criterion — but that's where real-world publishing lives.
- Force self-optimization with constraint + flexibility. Repeat the run with the cap set at $100 (or $50), an accurate cumulative-spend signal in HEARTBEAT.md, and — the more interesting half — write access for the agent to its own scaffold code, with explicit permission to tune as it goes. CRUX #1's agent adjusted polling within the scaffold's existing levers; this would let an agent reach one layer deeper: heartbeat cadence, context assembly, cache-breakpoint placement. (A sketch of what that control loop might look like follows this list.)
- Generalize the framework. CRUX-X, not CRUX-Windows. While reproducing the experiment and adapting it for Windows, I ended up hand-directing an agent to do many of the constituent pieces (e.g. set up the dev environment, dry-run the setup, provision credentials). Partway through I realized that work could itself be delegated to an agent, if the design decisions were written down cleanly enough. So I pulled the experiment apart into three layers: a methodology (the family-wide design decisions any experiment of this shape must resolve), a protocol (one task's resolved design, stable across every run of that task), and a manifest (one run's t=0 snapshot). Together with a two-agent pipeline — a Designer that generates the protocol from the methodology plus a task description, and an Operator that provisions and runs — it's a meta-framework for CRUX-like studies. More on this coming soon.
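On the third extension, here's a minimal sketch of the control loop I have in mind — the agent stretches its own polling cadence as cumulative spend approaches the cap. The interval bounds and the status-check body are hypothetical; the post does read spend via `openclaw gateway usage-cost`, but the output format assumed here (a bare dollar figure) is my guess about that CLI.

```python
import subprocess
import time

BUDGET_CAP_USD = 100.0
MIN_INTERVAL_S = 5 * 60        # 5-minute polls when spend is far from the cap
MAX_INTERVAL_S = 2 * 60 * 60   # back off toward 2 hours near the cap (assumed bounds)

def cumulative_spend_usd() -> float:
    # Assumed output format: the real CLI may print something richer.
    out = subprocess.run(["openclaw", "gateway", "usage-cost"],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip().lstrip("$"))

def run_status_check() -> None:
    # Placeholder: screenshot-poll the submission page, or spawn a status subagent.
    pass

def next_interval_s() -> float:
    """Stretch the heartbeat linearly as the budget is consumed."""
    frac = min(cumulative_spend_usd() / BUDGET_CAP_USD, 1.0)
    return MIN_INTERVAL_S + frac * (MAX_INTERVAL_S - MIN_INTERVAL_S)

if __name__ == "__main__":
    while True:
        run_status_check()
        time.sleep(next_interval_s())
```

The point of the experiment wouldn't be this specific policy — it's whether an agent, given only the cap and write access to the levers, converges on something like it unprompted.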
Run artifacts
- TimeZonr on the Microsoft Store: apps.microsoft.com/detail/9njg0bh2lshs
- Full agent traces — 2,833 messages, scrubbed, Docent-hosted, sectioned into 8 narrative phases à la CRUX #1: docent.transluce.org/dashboard/0c8eb800-22da-49ae-b017-2315382ed539
Thanks to Alex Wang for his review, and to Andrew Schwartz, Sayash Kapoor, and Arvind Narayanan (CRUX #1) for confirming the operator-side details and the heartbeat-change mechanism.