The Experimentation Agenda: Choosing the Few Clean Tests You Can Run

The constraint

Why you can't just test everything

A clean incrementality test isn't an A/B tweak you ship on Friday. It needs a holdout — real spend withheld, or whole regions kept dark — for several weeks, plus enough conversion volume to read a signal above the noise. That's expensive in money, time, and forgone performance. Which means the binding constraint isn't ideas; it's test slots. Most teams get a handful of trustworthy reads a year.

Key: If experiments were free, you'd test everything. They're not, so the real skill is choosing which questions deserve a slot, and that's a prioritization problem, not a statistics problem.

The framework

Score every test idea

Three factors decide whether an idea earns a slot. A test is worth running when the decision it informs is big, you're genuinely unsure, and the test is cheap relative to that.

priority = decision value × uncertainty ÷ cost

decision value: how much budget or strategy the answer moves
uncertainty: how unsure you are now
cost: spend + holdout + time the test burns
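As a rough illustration, the score can be computed and ranked in a few lines. This is a minimal sketch under stated assumptions: the `Idea` class, the dollar units, the 0-to-1 uncertainty scale, and every number below are hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    decision_value: float  # budget the answer could move, in dollars (assumed unit)
    uncertainty: float     # 0.0 = answer already known, 1.0 = no idea (assumed scale)
    cost: float            # spend + holdout + time, expressed in dollars (assumed unit)


def priority(idea: Idea) -> float:
    """priority = decision value x uncertainty / cost."""
    if idea.cost <= 0:
        raise ValueError("cost must be positive")
    return idea.decision_value * idea.uncertainty / idea.cost


def rank(ideas: list[Idea]) -> list[Idea]:
    """Highest-priority ideas first."""
    return sorted(ideas, key=priority, reverse=True)


# Illustrative inputs only; none of these figures come from real data.
ideas = [
    Idea("Scale Connected TV?", 2_000_000, 0.8, 200_000),
    Idea("Cut brand search?", 1_000_000, 0.1, 50_000),
    Idea("Tiny new partner", 50_000, 0.9, 40_000),
]

for idea in rank(ideas):
    print(f"{idea.name}: {priority(idea):.2f}")
```

The absolute scores mean little; what matters is the ordering, and whether the ideas at the top fit into the slots you actually have.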

The test-priority matrix

Plot each idea by decision value and uncertainty; bubble size = cost. Only the top-right quadrant earns a slot.

"Should we scale Connected TV?" is the classic test-first: a large budget decision you're genuinely unsure about — worth a slot even though it's costly to test. "Cut brand search?" might be high-value but you may already know the answer, so just decide. A tiny new partner is uncertain but not worth a slot. And an ad-copy tweak is a cheap A/B, not an incrementality experiment at all.
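The four cases above can be sketched as a simple triage rule. The function name, the 0-to-1 scales, and the 0.5 cut-offs are assumptions for illustration, and the label for the low-value, low-uncertainty corner is one reading of the ad-copy example rather than a fixed rule.

```python
def triage(decision_value: float, uncertainty: float,
           value_cut: float = 0.5, uncertainty_cut: float = 0.5) -> str:
    """Place an idea in one quadrant of the test-priority matrix.

    Both inputs are assumed to be normalized to the range 0..1.
    """
    high_value = decision_value >= value_cut
    unsure = uncertainty >= uncertainty_cut
    if high_value and unsure:
        return "test first"         # top-right: earns a slot
    if high_value:
        return "just decide"        # big decision, answer already known
    if unsure:
        return "not worth a slot"   # uncertain, but too small to matter
    return "cheap A/B or skip"      # small and already understood


print(triage(0.9, 0.8))  # Connected TV style question
print(triage(0.9, 0.1))  # brand search style question
```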

Pick the instrument

Match the test to the question

Once an idea earns a slot, the design follows the channel and the budget. The four workhorses:

Test type	How it works	Best for	Cost / effort
Geo holdout	Turn a channel off in matched regions; compare against regions where it stays on	Channels without clean user-level tracking: CTV, audio, OOH, broad social	High (real spend withheld)
Ghost ads / PSA	Control group sees a placebo (PSA) or is only logged when your ad would have served (ghost ad)	Display & video where you can serve or log a control	Medium
Platform conversion lift	Randomized holdout run inside Meta or Google Ads	In-platform channels with enough volume	Low–medium (native tooling)
Switchback	Toggle the channel on/off over alternating periods	Always-on channels; smaller geographies	Medium (needs stable baseline)
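To make the geo holdout row concrete, here is a deliberately simplified readout: a difference-in-differences estimate that uses the "off" regions to project what the "on" regions would have done without the channel. The function and its parameter names are hypothetical, it returns a point estimate with no confidence interval, and dedicated geo-testing tools use more careful methods than this.

```python
def geo_holdout_lift(on_pre: float, on_test: float,
                     off_pre: float, off_test: float) -> tuple[float, float]:
    """Simplified difference-in-differences readout for a geo holdout.

    on_*  : conversions in regions where the channel stayed on
    off_* : conversions in matched regions where the channel was turned off
    *_pre : pre-test period, *_test : test period

    Returns (incremental conversions, incremental share of on-region conversions).
    """
    if off_pre <= 0 or on_test <= 0:
        raise ValueError("off_pre and on_test must be positive")
    # Project the on regions forward using the trend seen in the off regions.
    counterfactual = on_pre * (off_test / off_pre)
    incremental = on_test - counterfactual
    return incremental, incremental / on_test


# Illustrative numbers only.
incremental, share = geo_holdout_lift(on_pre=1000, on_test=1200,
                                      off_pre=1000, off_test=900)
print(f"incremental conversions: {incremental:.0f}, share: {share:.0%}")
```

The estimate is only as good as the region matching: if the on and off regions trend differently for reasons unrelated to the channel, the projection is biased.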

The payoff

The agenda is a maintenance schedule

Here's what turns experimentation from a side project into the backbone of your measurement: every test recalibrates everything else. A result doesn't just answer its own question — it re-tunes the whole stack.

It sets attribution's discount factor

A geo holdout that reads 60% incremental becomes the 0.6 you apply to attribution's claimed conversions for that channel — so your daily steering number gets more honest the day the test lands.
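The arithmetic is small enough to sketch. The function names and figures below are hypothetical; the sketch assumes the factor is defined as incremental conversions measured by the test divided by the conversions attribution claimed for the same channel and period.

```python
def discount_factor(incremental_conversions: float,
                    attributed_conversions: float) -> float:
    """Share of attribution's claimed conversions that the test found to be incremental."""
    if attributed_conversions <= 0:
        raise ValueError("attributed_conversions must be positive")
    return incremental_conversions / attributed_conversions


def adjusted_conversions(attributed: float, factor: float) -> float:
    """Apply the channel's discount factor to a later attribution reading."""
    return attributed * factor


# Illustrative: the test found 600 incremental conversions where attribution claimed 1,000.
factor = discount_factor(600, 1000)
# A later day on which attribution claims 250 conversions for that channel.
honest = adjusted_conversions(250, factor)
print(f"factor {factor:.2f}, adjusted conversions {honest:.0f}")
```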

It validates (or corrects) the MMM

If your mix model says social is 2× more efficient than the holdout measured, you constrain the model toward the experimental result. Incrementality is the calibration input a good MMM is built to accept.

It ages — so you re-test

Discount factors go stale as channels and platforms change. The agenda's job is to keep the most decision-critical factors fresh, on a rotation. Measurement is upkeep, not a one-time install.
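One way to picture the rotation is a re-test queue: drop any factor tested recently enough, then order the rest by how much budget depends on them. This is a sketch under assumptions; the field names, the 365-day staleness threshold, and the sample channels are all hypothetical.

```python
from datetime import date


def retest_queue(factors: list[dict], today: date,
                 max_age_days: int = 365) -> list[dict]:
    """Return stale discount factors, largest dependent budget first.

    Each factor is a dict with 'name', 'last_tested' (date) and 'annual_budget'.
    max_age_days is an assumed threshold, not a recommendation.
    """
    stale = [f for f in factors
             if (today - f["last_tested"]).days > max_age_days]
    return sorted(stale, key=lambda f: f["annual_budget"], reverse=True)


# Illustrative data only.
factors = [
    {"name": "CTV", "last_tested": date(2024, 11, 1), "annual_budget": 2_000_000},
    {"name": "Paid social", "last_tested": date(2026, 2, 1), "annual_budget": 4_000_000},
    {"name": "Search", "last_tested": date(2025, 3, 1), "annual_budget": 3_000_000},
]

for f in retest_queue(factors, today=date(2026, 6, 1)):
    print(f["name"])
```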

This closes the loop

This is the same triangle from the measurement field guide, viewed from the experimentation corner. Attribution steers, MMM bounds, and incrementality is the referee that keeps the other two honest — and the experimentation agenda is simply the referee's schedule.

In practice

A sample year of tests

What a realistic, prioritized agenda looks like across four quarters — biggest, most uncertain decisions first.

Biggest bet

Geo holdout on your largest upper-funnel channel (CTV/social). The costliest, highest-value read — do it while budgets are fresh.

Re-test the staple

Conversion-lift study on your biggest in-platform channel to refresh a discount factor that drives daily allocation.

The new thing

Geo or switchback test on a channel you scaled this year and have no causal read on yet.

Validate the model

A targeted test on whatever your MMM and attribution disagree about most — reconcile before planning next year.
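Finding "whatever your MMM and attribution disagree about most" can be as simple as comparing the two estimates channel by channel. The sketch below assumes both sources report the same metric (conversions here) for the same period; the function name and numbers are hypothetical.

```python
def biggest_disagreement(mmm: dict[str, float],
                         attribution: dict[str, float]) -> str:
    """Return the channel with the largest relative gap between the two estimates."""
    shared = sorted(mmm.keys() & attribution.keys())
    if not shared:
        raise ValueError("no channels in common")

    def relative_gap(channel: str) -> float:
        a, b = mmm[channel], attribution[channel]
        scale = max(abs(a), abs(b))
        # Two zero estimates agree perfectly; avoid dividing by zero.
        return 0.0 if scale == 0 else abs(a - b) / scale

    return max(shared, key=relative_gap)


# Illustrative conversion estimates only.
mmm = {"social": 900, "search": 500, "ctv": 300}
attribution = {"social": 1000, "search": 1500, "ctv": 310}
print(biggest_disagreement(mmm, attribution))
```

A relative gap treats small and large channels alike; weighting the gap by spend would be a reasonable variation if the budget at stake matters more than the size of the discrepancy.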

Note: Four to six clean reads a year is plenty if they're the right reads. A portfolio of well-chosen tests beats a backlog of cheap ones every time.

Go to the source

Tools & references

GeoLift: Meta's open-source package for designing and measuring geo holdout experiments. Source: Meta Open Source.

Conversion Lift / Brand Lift: Native randomized-holdout incrementality tools inside Meta and Google Ads. Source: Meta & Google Ads.

Meridian: Google's MMM, which takes incrementality results as calibration priors (the loop in action). Source: Google.

The measurement field guide: How experimentation, attribution, and MMM form one triangle. Source: this site.

Quick answers

Common questions

How many marketing experiments can a team realistically run?

Far fewer clean ones than people think. A real incrementality test needs a holdout — withheld spend or a dark geo — for several weeks, plus enough volume for a readable signal, so most teams can only run a handful of trustworthy tests per year. That scarcity is why you prioritize them as a portfolio rather than working through a backlog.

How do you prioritize which experiments to run?

Score each idea by decision value times uncertainty, divided by cost. Decision value is how much budget or strategy the answer would move; uncertainty is how unsure you are today; cost is the spend, holdout, and time the test consumes. Test where a big, uncertain decision can be resolved cheaply — and skip tests whose answer wouldn't change anything.

What types of incrementality tests are there?

The main types are geo holdouts (turn a channel off in matched regions), ghost ads and PSA tests (hold out a control group that either sees a placebo or is only logged when your ad would have served), platform conversion-lift studies (randomized holdouts inside Meta or Google), and switchback tests (toggle on and off over time). Each fits a different channel and budget.

How does an experimentation agenda connect to MMM and attribution?

Each incrementality result recalibrates the rest of your measurement stack: it sets the discount factor you apply to attribution's claimed conversions, and it validates or corrects your marketing mix model. So the agenda isn't a side project — it's the maintenance schedule that keeps your whole measurement triangle honest.

This is how I operate

I build experimentation roadmaps that pay for themselves

If you're spending on channels you can't prove are working — or sitting on test ideas with no way to rank them — let's build the agenda together.

Last updated June 2026 · Part of an in-progress series on growth measurement & budget allocation.