
The sample selection dilemma: testing product changes in e-commerce

Oct 29, 2025


We had one of those A/B testing discussions last week that sounds simple at first, then spirals into something more interesting. The question: “How would you test a change that only affects discounted products?”

The scenario seemed straightforward. We’re improving how we communicate discounts on product cards: better badges, clearer pricing, the usual suspects. We want to measure engagement (product clicks) and add-to-cart rates, but specifically for discounted items. The catch? Only about 10-15% of our catalog is on sale at any given time.

The Obvious Answer and the Pushback

The instinct is to filter. Run the test on everyone, but only measure users who interact with discounted products. It’s clean. It’s focused. You’re comparing apples to apples—people who saw discount products in variant A versus people who saw them in variant B.

But here’s where it gets messy. These product card changes appear everywhere—homepage, category pages, search results. You’re not just changing how discount products look. You’re potentially changing how the entire page composition is perceived.

One of us (okay, it was me) pushed back: “What if this change decreases interest in full-price products and causes a revenue problem?”

What if your shiny new discount badges actually pull attention away from full-price products? What if users become more discount-focused and overall revenue drops even as discount engagement improves? What if the badges change how people perceive product value across the board, or train them to wait for sales, or shift traffic patterns in ways we don’t expect?

If you only analyze users who clicked on discounted products, you miss all of that:

  • The users who would have bought full-price but got distracted by discount signals
  • Changes in overall browse behavior
  • Mix shift—more discount purchases, fewer full-price purchases
  • The actual impact on revenue and margin

That question reframed everything. This isn’t just a discount product test. It’s a merchandising change that affects the entire shopping experience.

What We Decided

Run the test on everyone and track revenue as the primary metric. We’ll do post-hoc analysis on discount product engagement, but only after confirming the change doesn’t hurt the bottom line.
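To make that ordering concrete, here’s a minimal sketch of the analysis, assuming a per-user table with hypothetical columns variant, revenue, and clicked_discounted (this is not our actual pipeline): the full-sample revenue check comes first, and the discount-engagement cut only runs once the primary metric isn’t harmed.

```python
# A minimal sketch of the analysis order, assuming a per-user table with
# hypothetical columns: variant, revenue, clicked_discounted.
import pandas as pd
from scipy import stats

def analyze(users: pd.DataFrame, alpha: float = 0.05) -> None:
    a = users[users["variant"] == "A"]
    b = users[users["variant"] == "B"]

    # Primary metric: revenue per user, measured on the FULL sample.
    _, p_revenue = stats.ttest_ind(b["revenue"], a["revenue"], equal_var=False)
    print(f"Revenue/user: A={a['revenue'].mean():.2f} "
          f"B={b['revenue'].mean():.2f} (p={p_revenue:.3f})")

    # Guardrail: if revenue is significantly worse, stop before the segment cut.
    if p_revenue < alpha and b["revenue"].mean() < a["revenue"].mean():
        print("Revenue regression detected; skipping post-hoc segment analysis.")
        return

    # Post-hoc: share of users who clicked a discounted product, per variant.
    clicks = [int(b["clicked_discounted"].sum()), int(a["clicked_discounted"].sum())]
    sizes = [len(b), len(a)]
    table = [[clicks[0], sizes[0] - clicks[0]], [clicks[1], sizes[1] - clicks[1]]]
    _, p_engagement, _, _ = stats.chi2_contingency(table)
    print(f"Discount-product click rate p-value: {p_engagement:.3f}")
```

Nothing fancy; the point is the ordering. The full-sample revenue check gates the segment-level question, not the other way around.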

Decision Framework:

┌─────────────────────────────────────────────┐
│ Does the change appear on multiple pages?   │
└──────────────┬──────────────────────────────┘

        ┌──────┴──────┐
        │             │
       YES           NO
        │             │
        ▼             ▼
  ┌─────────┐   ┌──────────────────────────────┐
  │  FULL   │   │ Can it affect behavior on    │
  │ SAMPLE  │   │ unchanged pages?             │
  └─────────┘   └──────────┬───────────────────┘

                    ┌──────┴──────┐
                    │             │
                   YES           NO
                    │             │
                    ▼             ▼
              ┌─────────┐   ┌──────────┐
              │  FULL   │   │  FILTER  │
              │ SAMPLE  │   │  COHORT  │
              └─────────┘   └──────────┘
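The diagram boils down to two boolean questions, so it can be written as a tiny helper when planning tests. This is just a sketch of the tree above in code; the argument names are my own.

```python
def choose_sample(appears_on_multiple_pages: bool,
                  can_affect_unchanged_pages: bool) -> str:
    """The decision tree above, expressed as code (argument names are illustrative)."""
    if appears_on_multiple_pages:
        return "FULL SAMPLE"      # change shows up in many contexts
    if can_affect_unchanged_pages:
        return "FULL SAMPLE"      # single surface, but spillover is plausible
    return "FILTER COHORT"        # truly isolated: analyze only exposed users

# Our discount-badge case: badges appear site-wide, so the answer is immediate.
assert choose_sample(True, True) == "FULL SAMPLE"
```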

Why run on the full sample? Because we can’t measure what we don’t track.

The filtered cohort approach would have given us precise measurements of discount product engagement while completely missing effects on full-price products, overall revenue, and mix shift—the effects that actually matter to the business.

Yes, it makes the effect harder to detect—the impact is diluted across all users, not just those who interact with discounted products, so we need longer duration to see the smaller overall effect with confidence. However, this ensures we measure the actual business impact rather than optimizing for a segment while potentially harming overall performance.
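To put a rough number on that dilution, here’s a back-of-the-envelope sample-size calculation with purely illustrative inputs (a 3% baseline add-to-cart rate, a 10% relative lift that only materializes for the ~12% of users who touch discounted products); none of these figures come from our data.

```python
# Back-of-the-envelope power math for the dilution argument.
# All inputs are illustrative assumptions, not our actual numbers.
from scipy.stats import norm

def n_per_arm(p_base: float, p_variant: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-proportion sample size per arm (normal approximation, pooled variance)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p_base + p_variant) / 2
    delta = abs(p_variant - p_base)
    return int(((z_a + z_b) ** 2) * 2 * p_bar * (1 - p_bar) / delta ** 2) + 1

baseline = 0.03          # hypothetical add-to-cart rate
lift = 0.10              # 10% relative lift, but only for...
discount_share = 0.12    # ...the ~12% of users who interact with discounted items

segment_effect = baseline * (1 + lift)                    # what a filtered cohort sees
diluted_effect = baseline * (1 + lift * discount_share)   # what the full sample sees

print(n_per_arm(baseline, segment_effect))   # roughly 53,000 users per arm
print(n_per_arm(baseline, diluted_effect))   # roughly 3.5 million users per arm
```

The per-arm gap overstates the calendar-time gap, since a filtered cohort only accrues about 12% of traffic, but under these assumptions the full-sample test still runs on the order of 1/0.12 times longer. That runtime is the price of measuring the effect where it actually lands on the business.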

How We Think About It Now

The key insight from this discussion: it’s not about “always test everything on everyone.” It’s about identifying whether indirect effects exist.

Scenario                     Sample Selection                   Why
Discount badge (site-wide)   Full sample                        Visual change affects entire page composition
Search algorithm             Filter to 5% who use search        Isolated functionality, no spillover
PDP redesign (all PDPs)      Filter to 60% who visit PDPs       Change only affects PDP visitors
Checkout flow                Filter to 15% who reach checkout   Isolated to checkout users

Why this case needed the full sample:

The discount badges appear everywhere—homepage, category pages, product detail pages, search results. This is a visual change that affects entire page composition across the site. The potential for attention and substitution effects on full-price products is real. We’re not just testing discount engagement. We’re testing how these changes affect overall shopping behavior and revenue mix.

When to test on the entire sample:

Include everyone when:

  • The change affects multiple pages or contexts (our discount badges appear across the site)
  • Site-wide changes like header redesigns or navigation updates
  • There are potential visual or attention effects on non-target items—exactly what we worried about with full-price products
  • Mix shift between categories or segments matters to the business

When to filter to a relevant cohort:

Filter when the change is truly isolated functionality:

  • Testing a search algorithm change? If only 5% of users actually use search, there’s no reason to include the 95% who never touch it
  • Testing a PDP redesign that affects all product detail pages? Include the 60% of users who visit PDPs, not the 40% who never see those pages
  • Testing checkout flow improvements? Only the 15% who reach checkout need to be in the analysis—the other 85% will never encounter the change
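For the checkout example, the filtered approach might look something like the sketch below, assuming an event-level log with hypothetical columns user_id, variant, and event: the cohort is defined by exposure to the changed step, and only those users enter the comparison.

```python
# A minimal cohort-filtering sketch for the checkout example, assuming an
# event-level log with hypothetical columns: user_id, variant, event.
import pandas as pd

def checkout_cohort_cvr(events: pd.DataFrame) -> pd.DataFrame:
    # Cohort = users who actually reached the changed step.
    reached = events.loc[events["event"] == "checkout_start", "user_id"].unique()
    cohort = events[events["user_id"].isin(reached)]

    # Did each cohort user purchase? Then compare conversion by variant.
    per_user = (cohort.groupby(["variant", "user_id"])["event"]
                      .apply(lambda e: (e == "purchase").any())
                      .rename("converted")
                      .reset_index())
    return per_user.groupby("variant")["converted"].mean().to_frame("checkout_cvr")
```

If the underlying log captures everyone, this filtering stays an analysis-time choice, which is exactly the flexibility the closing point below relies on.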

The critical question isn’t the size of the affected segment. It’s whether the change can indirectly influence behavior beyond its immediate context.

The underlying principle:

Many A/B tests have spillover effects we don’t anticipate, especially those involving:

  • Visual changes
  • Attention effects
  • Product substitution

The question isn’t “What are we testing?” It’s “What could this change affect?”

When in doubt, start with the full sample for your primary metrics. You can always do filtered analysis afterward. But you can’t recover the data for effects you didn’t measure.

Still figuring this out as we go. If you’ve dealt with similar dilemmas in your testing program, I’d love to hear how you approached it.