Case Study

639 Experiments to Learn How Many Roses to Order Tomorrow

A Boston flower shop. Five products. Daily ordering. No second chances.

The Problem

Your Sales Data Is Lying to You

Say you run a flower shop and order roses every morning. If you order 30 and 40 people wanted them, the register says “30 sold.” The ten who walked away leave no trace. The next morning, you look at yesterday’s sales, see 30, and order 30 again.

This is censored demand. Your data doesn’t record what people wanted. It records the minimum of what they wanted and what you had. Every stock-out makes your history look like demand was lower than it actually was.
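The censoring described above can be sketched in a few lines of Python. The demand figures and the fixed stock level of 30 are invented for illustration; only the min() relationship comes from the text.

```python
# Hypothetical true demand for roses over one week (not from the case study).
true_demand = [40, 25, 35, 30, 45, 28, 38]
stock = 30  # roses ordered each morning (illustrative)

# The register only records what was actually sold: min(demand, stock).
observed_sales = [min(d, stock) for d in true_demand]

mean_true = sum(true_demand) / len(true_demand)
mean_observed = sum(observed_sales) / len(observed_sales)

print(observed_sales)
print(round(mean_true, 1), round(mean_observed, 1))
```

In this toy week the average of the recorded sales comes out lower than the average of what customers wanted, even though no single record is wrong.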

The trap is self-reinforcing: understock → your data says demand is low → you understock again. The business slowly strangles itself with its own sales records.
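A minimal sketch of that loop, assuming the naive rule "order tomorrow whatever sold today". The function name and the demand numbers are hypothetical. Because sales can never exceed stock, orders under this rule can only stay flat or fall.

```python
def naive_orders(true_demand, first_order):
    """Order tomorrow whatever sold today (the naive rule from the text)."""
    orders = [first_order]
    for demand in true_demand:
        sold = min(demand, orders[-1])  # sales are capped by today's stock
        orders.append(sold)             # tomorrow's order = today's sales
    return orders

# Hypothetical demand, mostly above the starting order of 30.
demand = [40, 35, 28, 45, 38, 26, 50]
orders = naive_orders(demand, first_order=30)
print(orders)
```

Each slow day pulls the order down, and no busy day can ever pull it back up, because the busy day's extra demand never reaches the register.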

The theoretical maximum profit — what you’d earn if you could see the future — is $541,236 per year. The question is how close you can get without a crystal ball.

639 Experiments

The Search for the Right Policy

Figure: AutoInventory optimization progress. Across 639 experiments, cumulative profit improved from $392,534 to $491,103.

$392,534

Starting profit (72.5%)

$491,103

Final profit (90.7%)

176 / 639

Experiments kept

Constraints

Why This Is Hard

01

The Data Lies

You only see sales, not demand. Every stock-out is invisible. Order 30 roses, 40 people wanted them — your data says “demand was 30.” The gap between real demand and observed sales is the information you most need and cannot see.

02

The Costs Are Lopsided

A wasted rose costs $8. A missed sale costs $17 in lost margin. The penalty for under-ordering is more than twice the penalty for over-ordering, but the data makes under-ordering feel safe.
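To make the asymmetry concrete, here is a small sketch using the $8 and $17 figures from the text. The function name and the order and demand quantities are illustrative. The last line computes the textbook newsvendor critical ratio, which is a standard inventory result and not something the case study says the agent used.

```python
WASTE_COST = 8    # dollars per unsold rose (from the text)
LOST_MARGIN = 17  # dollars per missed sale (from the text)

def mismatch_cost(order, demand):
    """Cost of ordering `order` roses when `demand` customers wanted one."""
    leftover = max(order - demand, 0)
    missed = max(demand - order, 0)
    return WASTE_COST * leftover + LOST_MARGIN * missed

over = mismatch_cost(order=40, demand=30)   # 10 roses wasted
under = mismatch_cost(order=30, demand=40)  # 10 sales missed

# Textbook newsvendor critical ratio: underage / (underage + overage).
critical_ratio = LOST_MARGIN / (LOST_MARGIN + WASTE_COST)
print(over, under, critical_ratio)
```

Being ten roses over costs $80, while being ten roses short costs $170. The ratio of 0.68 is the demand quantile the classic single-product model would stock to, under assumptions (known demand distribution, no substitution) that this shop does not satisfy.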

03

Holidays Are Cliffs, Not Hills

Valentine’s Day doesn’t gradually build. Rose demand spikes 7.5× on the day itself, then crashes to 85% of normal the day after. The agent has to learn dozens of these patterns across five products and multiple holidays.
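One way to encode a cliff like this is a table of multipliers keyed by days relative to the holiday. The sketch below uses the 7.5× and 85% figures from the text; the function name, the baseline of 40 roses, and the choice of year are assumptions made for illustration.

```python
from datetime import date, timedelta

# Multipliers keyed by day offset from the holiday (values from the text).
ROSE_VALENTINES = {0: 7.5, 1: 0.85}

def holiday_adjusted(baseline, day, holiday, multipliers):
    """Scale a baseline forecast if `day` falls at a listed offset."""
    offset = (day - holiday).days
    return baseline * multipliers.get(offset, 1.0)

valentines = date(2024, 2, 14)  # year chosen arbitrarily
baseline = 40                   # hypothetical normal-day rose forecast
forecasts = [
    holiday_adjusted(baseline, valentines + timedelta(days=k), valentines, ROSE_VALENTINES)
    for k in (-1, 0, 1, 2)
]
print(forecasts)
```

With a baseline of 40, the forecasts for the day before, the day itself, and the two days after come out to 40, 300, 34 and 40: flat, then a cliff, then a dip.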

04

Products Contaminate Each Other

Run out of roses and some customers buy tulips instead. This inflates tulip demand in your data and deflates rose demand. Every product’s history is entangled with every other product’s stock-out patterns.
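The entanglement can be illustrated with a two-product toy model. Everything here is hypothetical, including the function name and the 50% switch rate; the case study does not state how many customers substitute.

```python
def observed_sales(rose_demand, tulip_demand, rose_stock, tulip_stock, switch_rate):
    """Return (rose_sales, tulip_sales) when unmet rose customers may switch."""
    rose_sales = min(rose_demand, rose_stock)
    unmet_roses = rose_demand - rose_sales
    # Some disappointed rose customers buy tulips instead (assumed fraction).
    switchers = round(unmet_roses * switch_rate)
    tulip_sales = min(tulip_demand + switchers, tulip_stock)
    return rose_sales, tulip_sales

roses, tulips = observed_sales(
    rose_demand=40, tulip_demand=20, rose_stock=30, tulip_stock=50, switch_rate=0.5
)
print(roses, tulips)
```

Here the register shows 30 roses and 25 tulips, although 40 people wanted roses and only 20 came in for tulips. Both histories are now wrong, in opposite directions.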

Results

What the Agent Discovered

Milestone                 Profit      % of Oracle
Baseline EMA              $392,534    72.5%
+ Censoring correction    $438,000    80.9%
+ Holiday spikes          $465,000    85.9%
+ Per-product tuning      $482,000    89.1%
Final policy              $491,103    90.7%
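The percentages in the table follow directly from the profit figures and the $541,236 oracle. The short script below recomputes them and also derives each milestone's share of the total improvement; the variable names are illustrative.

```python
ORACLE = 541_236  # theoretical maximum annual profit (from the text)

milestones = [
    ("Baseline EMA", 392_534),
    ("+ Censoring correction", 438_000),
    ("+ Holiday spikes", 465_000),
    ("+ Per-product tuning", 482_000),
    ("Final policy", 491_103),
]

total_gain = milestones[-1][1] - milestones[0][1]
previous = milestones[0][1]
rows = []
for name, profit in milestones:
    pct_of_oracle = round(100 * profit / ORACLE, 1)
    share_of_gain = round(100 * (profit - previous) / total_gain, 1)
    rows.append((name, pct_of_oracle, share_of_gain))
    previous = profit

for row in rows:
    print(row)
```

Read this way, the censoring correction alone accounts for roughly 46% of the total gain of $98,569, and each later milestone contributes a smaller share than the one before it.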

Surprising Findings

Monday after Mother’s Day

The biggest single-day improvement came from boosting orchid orders 3.4× on the Monday after Mother’s Day — a pattern invisible in typical weekly averages.

Orchids Have Long Memory

Orchid demand is so stable that weighting older data more heavily (negative EMA alpha) outperformed recency bias. The opposite of conventional wisdom.

Holiday Poison

Including holiday data in the EMA window contaminates weeks of forecasts. The agent learned to exclude holiday windows from the moving average entirely — different windows for each product.
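The contamination is easy to reproduce with a standard exponential moving average. The sketch below is an assumption about how such an exclusion could work, not the agent's actual code; the function name, alpha value, and sales figures are invented.

```python
def ema_forecast(sales, alpha, skip=frozenset()):
    """Exponential moving average of `sales`, ignoring day indices in `skip`."""
    estimate = None
    for day, sold in enumerate(sales):
        if day in skip:
            continue  # holiday: leave the running estimate untouched
        if estimate is None:
            estimate = float(sold)
        else:
            estimate = alpha * sold + (1 - alpha) * estimate
    return estimate

# Hypothetical rose sales: a flat 40 per day with a 300-rose holiday on day 3.
sales = [40, 40, 40, 300, 40, 40, 40]
contaminated = ema_forecast(sales, alpha=0.3)
clean = ema_forecast(sales, alpha=0.3, skip={3})
print(round(contaminated, 1), round(clean, 1))
```

Three ordinary days after the spike, the unfiltered average still sits near 67 while the filtered one stays at 40. A smaller alpha would make the distortion smaller on each day but stretch it over more days.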

Safety Stock Backfires

Every attempt to add safety margins, even 1%, made things worse. The waste from over-ordering on the many normal days of the year exceeded the savings from catching the few spikes.

Lessons

What 639 Experiments Teach

Most experiments fail, and that’s the point

463 of 639 experiments failed to beat the best policy so far. The agent discarded them and moved on. A human would have spent a week on each idea. The agent spent seconds. Volume of attempts, not precision of first guesses, is what closes the gap.
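The keep-or-discard loop can be sketched as a simple greedy search. This is a generic illustration, not the project's implementation: the function names, the toy scoring function, and the random proposal step are all hypothetical stand-ins for an agent editing a policy and a judge scoring it.

```python
import random

def search(score, initial, propose, n_experiments, seed=0):
    """Greedy search: try a change, keep it only if the score improves."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep only strict improvements
            best, best_score = candidate, candidate_score
            kept += 1
    return best, best_score, kept

# Toy stand-ins: tune one order quantity toward an optimum the search cannot see.
def toy_score(order):
    return -abs(order - 37)

def toy_propose(order, rng):
    return order + rng.choice([-3, -2, -1, 1, 2, 3])

best, best_score, kept = search(toy_score, initial=20, propose=toy_propose, n_experiments=639)
print(best, best_score, kept)
```

Even in this toy version, most proposals are discarded once the search gets close to the optimum, which mirrors the low keep rate reported in the case study.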

The first 10% of effort captures 70% of the gains

Adding censoring correction to the baseline EMA took profit from 72.5% to 80.9% of the oracle, the largest single jump and nearly half of the total gain. Everything after was diminishing returns: holiday patterns, per-product tuning, day-of-week adjustments. The early wins are structural. The late wins are surgical.

Domain knowledge is the scaffold

The agent didn’t discover that Valentine’s Day matters for roses. A human wrote that into the agenda. What the agent discovered was that rose demand spikes exactly 7.5×, that it crashes to 85% the day after, and that the post-holiday dampener needs to last five days. The human provides the structure. The agent fills in the numbers.

The hardest problems are second-order

Censored demand is a first-order problem: your data understates true demand. But the real difficulty is that each product’s censoring infects every other product through substitution. And holiday spikes don’t just distort the holiday itself; they contaminate weeks of moving averages afterward. The remaining 9% gap to the oracle is all second-order effects.

Open Source

Try It on Your Problem

The full code, data, and experiment log are open source. Fork it, swap in your business, and see what the agent finds.

01

Write the Brief — Describe your business in agenda.md

02

Build the Judge — Write a scoring function in prepare.py

03

Seed the Canvas — Start policy.py and let the agent iterate
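As a rough picture of how a judge and a seed policy might fit together, here is a single-product sketch. The actual contents of prepare.py and policy.py are not shown in this case study, so every function name and signature below is hypothetical. The $17 margin and $8 waste cost come from the text; the demand series is invented.

```python
MARGIN = 17      # dollars earned per rose sold (the text's margin per sale)
WASTE_COST = 8   # dollars lost per unsold rose (from the text)

def last_sale_policy(sales_history, default_order=30):
    """Hypothetical seed policy: order whatever sold yesterday."""
    return sales_history[-1] if sales_history else default_order

def score(policy, true_demand):
    """Hypothetical judge: replay demand day by day and total the profit.

    The policy only ever sees censored sales, never true demand.
    """
    sales_history, profit = [], 0
    for demand in true_demand:
        order = policy(sales_history)
        sold = min(order, demand)
        profit += MARGIN * sold - WASTE_COST * (order - sold)
        sales_history.append(sold)
    return profit

demand = [40, 35, 28, 45, 38]          # invented demand, for illustration only
baseline_profit = score(last_sale_policy, demand)
oracle_profit = MARGIN * sum(demand)   # perfect foresight: order exactly the demand
print(baseline_profit, oracle_profit)
```

On this five-day toy series the seed policy earns $2,432 against an oracle of $3,162. The agent's job in the real project is to rewrite the policy so that the judge's score climbs toward the oracle.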
