Case Study
A Boston flower shop. Five products. Daily ordering. No second chances.
The Problem
A flower shop orders roses every morning. If you order 30 and 40 people wanted them, the register says “30 sold.” The ten who walked away leave no trace. Tomorrow, you look at yesterday’s sales (30) and order 30 again.
This is censored demand. Your data doesn’t record what people wanted. It records the minimum of what they wanted and what you had. Every stock-out makes your history look like demand was lower than it actually was.
The trap is self-reinforcing: understock → your data says demand is low → you understock again. The business slowly strangles itself with its own sales records.
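To see the ratchet in motion, here is a minimal simulation. It assumes unsold flowers are thrown out each night, and the demand numbers are made up for illustration; only the "sales are the minimum of demand and stock" rule comes from the description above.

```python
import random

def simulate_naive_ordering(true_demand, first_order):
    """Reorder whatever sold yesterday; sales = min(demand, stock)."""
    order = first_order
    history = []
    for demand in true_demand:
        sales = min(demand, order)      # the register never sees unmet demand
        history.append((order, demand, sales))
        order = sales                   # naive rule: tomorrow's order = today's sales
    return history

random.seed(0)
demand = [random.randint(30, 50) for _ in range(14)]   # hypothetical true demand
history = simulate_naive_ordering(demand, first_order=40)
orders = [order for order, _, _ in history]

for order, true, sold in history:
    print(f"ordered {order:2d}  wanted {true:2d}  sold {sold:2d}")
```

Because each order is capped by the previous one, the order quantity can only fall or stay flat, no matter how high demand runs.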
The theoretical maximum profit — what you’d earn if you could see the future — is $541,236 per year. The question is how close you can get without a crystal ball.
639 Experiments

| Metric | Value |
|---|---|
| Starting profit | $392,534 (72.5% of oracle) |
| Final profit | $491,103 (90.7% of oracle) |
| Experiments kept | 176 / 639 |
Constraints
You only see sales, not demand. Every stock-out is invisible. Order 30 roses, 40 people wanted them — your data says “demand was 30.” The gap between real demand and observed sales is the information you most need and cannot see.
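One piece of information does survive: whether you sold out. The sketch below shows one simple way to use it, treating sold-out days as a lower bound and inflating them. The `uncensor` function and its `uplift` value are hypothetical illustrations, not the correction used in the experiments.

```python
def uncensor(sales, stock, uplift=0.25):
    """Estimate one day's demand from sales and stock.

    uplift is a hypothetical tuning knob, not a value from the case study.
    """
    if sales >= stock:
        # Sold out: sales is only a lower bound on demand, so inflate it.
        return sales * (1 + uplift)
    # Stock was left over, so everyone who wanted a rose got one.
    return float(sales)

print(uncensor(sales=30, stock=30))   # sold-out day
print(uncensor(sales=25, stock=30))   # day with leftovers
```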
A wasted rose costs $8. A missed sale costs $17 in lost margin. The penalty for under-ordering is roughly twice the penalty for over-ordering ($17 vs. $8), but the data makes under-ordering feel safe.
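The asymmetry can be put into numbers with the textbook newsvendor model. This is a standard technique swapped in for illustration, not necessarily what the agent's policy does; the two cost constants come from the text, and everything else is an assumption.

```python
WASTE_COST = 8.0     # cost of one unsold rose (from the text)
LOST_MARGIN = 17.0   # margin lost on one missed sale (from the text)

def expected_cost(order, demand_samples):
    """Average mismatch cost of a fixed order over sample demand days."""
    total = 0.0
    for demand in demand_samples:
        total += WASTE_COST * max(order - demand, 0)     # over-ordering
        total += LOST_MARGIN * max(demand - order, 0)    # under-ordering
    return total / len(demand_samples)

# Newsvendor critical ratio: the demand quantile that balances the two costs.
critical_ratio = LOST_MARGIN / (LOST_MARGIN + WASTE_COST)

print(expected_cost(30, [40]))   # ten missed sales
print(expected_cost(40, [30]))   # ten wasted roses
print(round(critical_ratio, 2))
```

Ten missed sales cost $170 and ten wasted roses cost $80, which is why the textbook model aims above the median of true demand, at about the 68th percentile. The catch is that censored sales data never shows you true demand.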
Valentine’s Day demand doesn’t build gradually. Rose demand spikes 7.5× on the day itself, then crashes to 85% of normal the day after. The agent has to learn dozens of these patterns across five products and multiple holidays.
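A pattern like this is easy to express as a multiplier on a baseline forecast. The sketch below uses the two numbers from the text; the function names and the baseline of 40 roses are hypothetical.

```python
from datetime import date, timedelta

VALENTINES_SPIKE = 7.5    # multiplier on the day itself (from the text)
DAY_AFTER_FACTOR = 0.85   # multiplier the day after (from the text)

def rose_multiplier(day):
    """Holiday multiplier for roses; 1.0 on ordinary days."""
    valentines = date(day.year, 2, 14)
    if day == valentines:
        return VALENTINES_SPIKE
    if day == valentines + timedelta(days=1):
        return DAY_AFTER_FACTOR
    return 1.0

def rose_forecast(baseline, day):
    return baseline * rose_multiplier(day)

for d in (date(2025, 2, 13), date(2025, 2, 14), date(2025, 2, 15)):
    print(d, rose_forecast(40, d))   # 40 is a made-up baseline
```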
Run out of roses and some customers buy tulips instead. This inflates tulip demand in your data and deflates rose demand. Every product’s history is entangled with every other product’s stock-out patterns.
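A toy model makes the entanglement concrete. The `switch_rate` of 50% and all quantities below are hypothetical; the case study does not state how many customers substitute.

```python
def observed_sales(rose_demand, tulip_demand, rose_stock, tulip_stock,
                   switch_rate=0.5):
    """Return (rose_sales, tulip_sales) as the register would record them.

    switch_rate is a hypothetical share of unserved rose customers
    who buy tulips instead.
    """
    rose_sales = min(rose_demand, rose_stock)
    unmet_roses = rose_demand - rose_sales
    spillover = int(unmet_roses * switch_rate)
    tulip_sales = min(tulip_demand + spillover, tulip_stock)
    return rose_sales, tulip_sales

# Same true demand (40 roses, 20 tulips), two different rose stock levels.
print(observed_sales(40, 20, rose_stock=40, tulip_stock=50))
print(observed_sales(40, 20, rose_stock=30, tulip_stock=50))
```

With enough roses, the register shows the true 40 and 20. With only 30 roses, it shows 30 roses and 25 tulips: rose demand looks lower and tulip demand looks higher than either really was.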
Results
| Milestone | Profit | % of Oracle |
|---|---|---|
| Baseline EMA | $392,534 | 72.5% |
| + Censoring correction | $438,000 | 80.9% |
| + Holiday spikes | $465,000 | 85.9% |
| + Per-product tuning | $482,000 | 89.1% |
| Final policy | $491,103 | 90.7% |
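The "% of Oracle" column is each profit divided by the $541,236 theoretical maximum. A quick check of the arithmetic, using only figures from this page:

```python
ORACLE_PROFIT = 541_236   # theoretical maximum profit per year

milestones = {
    "Baseline EMA": 392_534,
    "+ Censoring correction": 438_000,
    "+ Holiday spikes": 465_000,
    "+ Per-product tuning": 482_000,
    "Final policy": 491_103,
}

def pct_of_oracle(profit):
    return round(100 * profit / ORACLE_PROFIT, 1)

previous = None
for name, profit in milestones.items():
    pct = pct_of_oracle(profit)
    gain = "" if previous is None else f"  (+{pct - previous:.1f} points)"
    print(f"{name:24s} {pct:5.1f}%{gain}")
    previous = pct
```

Each step adds fewer points than the one before it: 8.4, then 5.0, then 3.2, then 1.6.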
The biggest single-day improvement came from boosting orchid orders 3.4× on the Monday after Mother’s Day — a pattern invisible in typical weekly averages.
Orchid demand is so stable that weighting older data more heavily (negative EMA alpha) outperformed recency bias. The opposite of conventional wisdom.
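To see what the sign of alpha does, here is the standard recursive EMA update. Whether the project's EMA uses exactly this form is an assumption, and the alpha values and orchid numbers are illustrative only.

```python
def ema_update(previous_estimate, observation, alpha):
    """Standard recursive EMA: blend the newest observation into the estimate."""
    return alpha * observation + (1 - alpha) * previous_estimate

estimate, todays_sales = 20.0, 30.0   # hypothetical orchid numbers

for alpha in (0.3, 0.05, -0.05):
    print(alpha, ema_update(estimate, todays_sales, alpha))
```

A positive alpha pulls the estimate toward today's sales. A negative alpha puts a weight above 1 on the existing estimate and subtracts a fraction of the newest observation, so a one-day bump moves the forecast slightly the other way.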
Including holiday data in the EMA window contaminates weeks of forecasts. The agent learned to exclude holiday windows from the moving average entirely — different windows for each product.
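A minimal sketch of the idea, assuming a simple recursive EMA and a per-day boolean flag marking the holiday window. The function name, the alpha of 0.2, and the sales figures are all hypothetical.

```python
def ema_excluding(sales, in_holiday_window, alpha=0.2):
    """EMA over daily sales that skips days flagged as holiday-window days."""
    estimate = None
    for value, skip in zip(sales, in_holiday_window):
        if skip:
            continue                      # holiday days never touch the average
        if estimate is None:
            estimate = float(value)       # seed with the first ordinary day
        else:
            estimate = alpha * value + (1 - alpha) * estimate
    return estimate

sales = [20, 20, 150, 20]                 # made-up data with one holiday spike
flags = [False, False, True, False]

clean = ema_excluding(sales, flags)
contaminated = ema_excluding(sales, [False] * len(sales))
print(clean, contaminated)
```

With the spike excluded, the estimate stays at the ordinary level of 20. With it included, the estimate is still roughly double that a day later, and it takes many more days to decay.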
Every attempt to add safety margins, even 1%, made things worse. The waste from over-ordering on normal days throughout the year exceeded the savings from catching the few spikes.
Lessons
461 of 639 experiments made things worse. The agent discarded them and moved on. A human would have spent a week on each idea. The agent spent seconds. Volume of attempts, not precision of first guesses, is what closes the gap.
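The keep-or-discard loop can be sketched as a simple hill climb. This is an assumed shape for the process, not the project's actual harness; the toy `toy_score` and `toy_propose` functions stand in for a real profit simulation and a real policy edit.

```python
import random

def run_experiments(score, propose, start, n_experiments, rng):
    """Try n candidate changes; keep one only if it beats the best score so far."""
    best, best_score = start, score(start)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:      # keep improvements only
            best, best_score = candidate, candidate_score
            kept += 1
    return best, best_score, kept

# Toy stand-ins: one tunable number with a hidden optimum at 7.5.
def toy_score(x):
    return -(x - 7.5) ** 2

def toy_propose(x, rng):
    return x + rng.uniform(-1.0, 1.0)

rng = random.Random(0)
best, best_score, kept = run_experiments(
    toy_score, toy_propose, start=0.0, n_experiments=639, rng=rng
)
print(f"kept {kept} of 639, best value {best:.2f}")
```

Most proposals are discarded, and that is fine: a failed experiment costs one call to the scoring function.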
Adding censoring correction to the baseline EMA took profit from 72.5% to 80.9% of the oracle, the largest single jump. Everything after was diminishing returns: holiday patterns, per-product tuning, day-of-week adjustments. The early wins are structural. The late wins are surgical.
The agent didn’t discover that Valentine’s Day matters for roses. A human wrote that into the agenda. What the agent discovered was that rose demand spikes exactly 7.5×, that it crashes to 85% the day after, and that the post-holiday dampener needs to last five days. The human provides the structure. The agent fills in the numbers.
Censored demand is a first-order problem — your data understates true demand. But the real difficulty is that each product’s censoring infects every other product through substitution. And holiday spikes don’t just distort the holiday — they contaminate weeks of moving averages afterward. The last 9% of the gap is all second-order effects.
Open Source
The full code, data, and experiment log are open source. Fork it, swap in your business, and see what the agent finds.
Write the Brief — Describe your business in agenda.md
Build the Judge — Write a scoring function in prepare.py
Seed the Canvas — Start policy.py and let the agent iterate
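As a rough picture of how the judge and the policy fit together, here is a single-product sketch. The interface is hypothetical: the real `prepare.py` and `policy.py` may look quite different. Only the $17 margin and $8 waste cost come from this page.

```python
def score(policy, demand, margin=17.0, waste_cost=8.0):
    """Judge: simulate one product and return total profit.

    The policy only ever sees (order, sales) pairs, never true demand.
    """
    profit = 0.0
    history = []
    for true_demand in demand:
        order = policy(history)
        sales = min(order, true_demand)
        profit += margin * sales - waste_cost * (order - sales)
        history.append((order, sales))
    return profit

def constant_policy(history):
    """Seed policy: a deliberately naive starting point for the agent to improve."""
    return 30

print(score(constant_policy, demand=[40, 20]))   # made-up two-day demand
```

On the first day the policy sells all 30 roses and misses ten sales; on the second it sells 20 and wastes ten. The judge returns one number, and the agent's job is to edit the policy until that number stops going up.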