
Real-world assignment & inference

Geographic blocks versus individuals How to block/stratify

See Geographic segmentation/blocked randomization for a mainly theoretical discussion of this

Facebook split-testing issues for how to do split testing on Facebook, and the limits to traditional design given their setup

Geographic segmentation/blocked randomization

Discussion of blocking/randomizing treatments by post/zip code or other region, allowing us to more accurately tie treatments to ultimate outcomes

Measurement needs are varied and come with a variety of limitations, e.g., data availability, ad targeting restrictions, wide-ranging measurement objectives, budget availability, time constraints, etc.
Kerman et al, 2017

Why 'Geo experiments'

In many contexts, the route to a meaningful outcome (e.g., GWWC pledge) is a long one. Attribution is difficult. An individual may have been first influenced by (1) YouTube ad while seeing a video on her AppleTV, and then (2) by a friend's post on Facebook, and then finally moved to act (3) after having a conversation at a bar and (4) visiting the GWWC web site on her telephone.

The same individual may or may not be trackable through 'cookies' and 'pixels', but such tracking is already very limited and imprecise, and new legislation is making it harder.

"Geographic targeting" of individual treatments/trials/initiatives/ads may help better track, attribute, and yield inference about 'what works'. E.g., we might do a 'lift test':

select a balanced random set of US Zip codes for a particular repeated YouTube ad promoting GWWC, the "Treated group"
compare the rate of GWWC visits, email sign-ups, pledges, and donations over the next 6 months in these zip codes relative to all other zip codes (possibly throwing out zip codes adjacent to the treated group, or finding a way to draw additional inference from them).
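
A minimal sketch of the lift test described above, in Python. It assumes we have a list of zip codes, a blocking variable for each one (here a hypothetical region label), and an outcome rate per zip code measured after the campaign. The function names `assign_geos` and `lift`, the zip codes, and the strata are all illustrative assumptions, not something taken from Kerman et al. or from any ad platform.

```python
import random
from statistics import mean

def assign_geos(geos, strata, treat_share=0.5, seed=0):
    """Randomly assign geos to 'treated' or 'control' within each stratum (block)."""
    rng = random.Random(seed)
    by_stratum = {}
    for geo in geos:
        by_stratum.setdefault(strata[geo], []).append(geo)
    assignment = {}
    for _, members in sorted(by_stratum.items()):
        members = sorted(members)   # fixed starting order, so the seed alone decides the shuffle
        rng.shuffle(members)
        n_treat = round(len(members) * treat_share)
        for i, geo in enumerate(members):
            assignment[geo] = "treated" if i < n_treat else "control"
    return assignment

def lift(outcome_rate, assignment):
    """Difference in mean outcome rate between treated and control geos."""
    treated = [outcome_rate[g] for g, arm in assignment.items() if arm == "treated"]
    control = [outcome_rate[g] for g, arm in assignment.items() if arm == "control"]
    return mean(treated) - mean(control)

# Hypothetical zip codes, blocked by a made-up 'east'/'west' region label.
zips = ["10001", "10002", "10003", "10004", "94101", "94102", "94103", "94104"]
region = {z: ("east" if z.startswith("1") else "west") for z in zips}
arms = assign_geos(zips, region, seed=1)   # two treated zip codes per region
```

Blocking on region here guarantees that treated and control zip codes are balanced on that variable by construction. A real design would block on whatever predicts the outcome (past donation rates, population, and so on), and a multi-armed version would assign more than two labels within each block.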

We could also do multi-armed tests (of several types of ad or other treatment, with a setup similar to the one above).

There are a few well-known and researched approaches: From Kerman et al, 2017 (emphasis added)

Geo experiments (Vaver and Koehler, 2011, 2012) meet a large range of measurement needs. They use non-overlapping geographic regions, or simply “geos,” that are randomly, or systematically, assigned to a control or treatment condition. Each region realizes its assigned treatment condition through the use of geo-targeted advertising. These experiments can be used to analyze the impact of advertising on any metric that can be collected at the geo level. Geo experiments are also privacy-friendly since the outcomes of interest are aggregated across each geographic region in the experiment. No individual user-level information is required for the “pure” geo experiments, although hybrid geo + user experiments have been developed as well (Ye et al., 2016). Matched market tests (see e.g., Gordon et al., 2016) are another specific form of geo experiments. They are widely used by marketing service providers to measure the impact of online advertising on offline sales. In these tests, geos are carefully selected and paired. This matching process is used instead of a randomized assignment of geos to treatment and control. Although these tests do not offer the protection of a randomization experiment against hidden biases, they are convenient and relatively inexpensive, since the testing typically uses a small subset of available geos. These tests often use time series data at the store level. Another matching step at the store level is used to generate a lift estimate and confidence interval.

Where and how can we geographically block treatments?

Context/location

Geographic blocking? (How)

What if we can only apply the treatment to one, or a few, of many groups?

We still may be able to make valuable inferences, under specified conditions, through 'difference in difference', 'event study', and 'time-based' approaches. We consider this in the next section: Difference in difference/'Time-based methods'

Difference in difference/'Time-based methods'

Abstract .... While effective, this geo-based regression (GBR) approach is less applicable, or not applicable at all, for situations in which few geographic units are available for testing (e.g. smaller countries, or subregions of larger countries). These situations also include the so-called matched market tests, which may compare the behavior of users in a single control region with the behavior of users in a single test region. To fill this gap, we have developed an analogous time-based regression (TBR) approach for analyzing geo experiments. This methodology predicts the time series of the counterfactual market response, allowing for direct estimation of the cumulative causal effect at the end of the experiment. In this paper we describe this model and evaluate its performance using simulation.
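
To make the idea in this abstract concrete, here is a rough sketch of a time-based comparison, under strong simplifying assumptions. It fits an ordinary least squares line of the treated region's series on the control region's series over the pre-test period, uses that line to predict the treated region's counterfactual during the test period, and sums the gaps. The paper's own model is more elaborate (it also quantifies uncertainty); the function names and all numbers below are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def cumulative_effect(control, treated, n_pre):
    """Sum of (observed - predicted counterfactual) over the test period."""
    intercept, slope = fit_line(control[:n_pre], treated[:n_pre])
    counterfactual = [intercept + slope * c for c in control[n_pre:]]
    return sum(t - cf for t, cf in zip(treated[n_pre:], counterfactual))

# Made-up weekly sign-ups: six pre-test weeks, then three test weeks.
control = [10, 12, 11, 13, 12, 14, 11, 13, 12]
treated = [21, 25, 23, 27, 25, 29, 28, 32, 30]   # pre-period is exactly 2 * control + 1
print(cumulative_effect(control, treated, n_pre=6))   # 15.0 (5 extra sign-ups in each of 3 weeks)
```

The estimate is only as good as the assumption that the pre-period relationship between the two regions would have continued unchanged without the treatment.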

Some specific notes/concerns

“Geo experiments” where only a single geo is targeted for a treatment seem fairly common in practice: you ‘try something in a single market, one time only, and see what it does’.

This is probably reinventing the wheel, duplicating some existing approach in econometrics (difference in difference, event studies?), but which one?
I find it strange/suboptimal that they aggregate across the geos in the control group, throwing away important variation … variation that might tell us something about how much things ‘typically vary without treatments’. I wonder if there’s another approach that brings that variation back?
1. Maybe this is 'because this is an easy extract to get from Google Analytics'? How do we get it?
The is 5 years old with no recent updates … ages in this world; is there something better to use instead
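
One candidate answer to the question above about control-geo variation, offered as a possibility and not as something taken from Kerman et al.: keep the control geos separate and run a placebo check, similar in spirit to the placebo tests used with synthetic control methods in econometrics. Each control geo in turn is treated as if it had received the treatment, and the real treated geo's difference-in-difference estimate is compared with these placebo estimates. The geo names and numbers below are hypothetical.

```python
from statistics import mean

def did(unit, others, pre, post):
    """Difference in difference: change in `unit` minus mean change in `others`."""
    change_unit = post[unit] - pre[unit]
    change_others = mean([post[g] - pre[g] for g in others])
    return change_unit - change_others

def placebo_test(treated, controls, pre, post):
    """Compare the treated geo's estimate with 'fake treatment' estimates for each control geo."""
    actual = did(treated, controls, pre, post)
    placebos = [did(c, [g for g in controls if g != c], pre, post) for c in controls]
    n_extreme = sum(abs(p) >= abs(actual) for p in placebos)
    # Permutation-style p-value; the +1 counts the treated geo itself.
    p_value = (n_extreme + 1) / (len(placebos) + 1)
    return actual, placebos, p_value

# Hypothetical outcome totals per geo, before and after the campaign.
pre = {"T": 100, "A": 100, "B": 100, "C": 100, "D": 100}
post = {"T": 130, "A": 102, "B": 98, "C": 101, "D": 99}
actual, placebos, p_value = placebo_test("T", ["A", "B", "C", "D"], pre, post)
```

With only a handful of control geos the smallest attainable p-value is coarse (here 1/5), which is itself informative about how little a single-geo test can tell us.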

Facebook split-testing issues

Facebook trials: "divergent delivery" --> limited inference

The main point

Facebook serves each ad variation to the people it thinks are most likely to click on it.

Thus, in comparing one ad variation to another... you may learn:

"Which variation performs best on the 'best audience for that variation' (according to Facebook)"
But you don't learn "which variation performs better than others on any single comparable audience."
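
A small worked example of the point above, with invented numbers. Suppose there are two audience segments and two ad variations, and the platform shows each variation mostly to the segment where that variation gets the most clicks. The segment names, click rates, and delivery shares are all assumptions for illustration and do not describe Facebook's actual delivery.

```python
def observed_rate(rates_by_segment, delivery_share):
    """Click rate the platform would report, given who actually saw the ad."""
    return sum(rates_by_segment[s] * delivery_share[s] for s in rates_by_segment)

# Hypothetical true click rates of each ad within each segment.
rates = {
    "A": {"students": 0.04, "retirees": 0.01},    # A does best with students
    "B": {"students": 0.03, "retirees": 0.035},   # B does best with retirees
}
# Divergent delivery: each ad goes mostly to its own 'best' segment.
delivery = {
    "A": {"students": 0.9, "retirees": 0.1},
    "B": {"students": 0.1, "retirees": 0.9},
}
# A single comparable audience: half students, half retirees, for both ads.
common = {"students": 0.5, "retirees": 0.5}

reported = {ad: observed_rate(rates[ad], delivery[ad]) for ad in rates}
comparable = {ad: observed_rate(rates[ad], common) for ad in rates}
# reported:   A is about 0.037,  B is about 0.0345 -> A looks better
# comparable: A is about 0.025,  B is about 0.0325 -> B is better on the same audience
```

In this example the platform's report ranks A above B, while on any single shared audience mix close to 50/50 the ranking is reversed.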

Update 4 Oct 2022: We may have found a partial solution to this, with ads targeting 'Reach' rather than optimizing for other measures like 'clicks'. We are discussing this further and will report back.

Researchers are interested in running trials using Facebook ads. However, inference can be difficult. Facebook doesn't give you full control of who sees what version of an advertisement.

With A/B split testing etc: They have their own algorithm, which presumably uses something like Thompson sampling to optimize for an outcome (clicks, or a targeted action on the linked site with a 'pixel'). Statistical inference is challenging with adaptive designs and reinforcement learning mechanisms. As the procedure is not transparent, it is even more difficult to make statistical inferences about how one treatment performed relative to another.
Segmentation and composition of population: Facebook's own ad-delivery algorithm (loosely, its counterpart to 'PageRank') determines who sees an ad. I don't think you can turn this off.
1. We haven't found a way to be able to set it to "show all versions of an ad to comparable populations"
2. (And even if you could, it would be difficult for you to specifically describe "which population" your results pertain to.)
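
To illustrate why adaptive allocation complicates inference, here is a generic Beta-Bernoulli Thompson sampling sketch. We do not know what Facebook actually runs; this is only the textbook version of the kind of algorithm guessed at above, with made-up click rates and a hypothetical function name.

```python
import random

def thompson_split(true_rates, n_impressions, seed=0):
    """Allocate impressions across ads with Beta-Bernoulli Thompson sampling."""
    rng = random.Random(seed)
    shown = {ad: 0 for ad in true_rates}
    clicks = {ad: 0 for ad in true_rates}
    for _ in range(n_impressions):
        # Draw one plausible click rate per ad from its Beta posterior (uniform prior).
        draws = {
            ad: rng.betavariate(1 + clicks[ad], 1 + shown[ad] - clicks[ad])
            for ad in true_rates
        }
        chosen = max(draws, key=draws.get)   # show the ad with the highest draw
        shown[chosen] += 1
        if rng.random() < true_rates[chosen]:
            clicks[chosen] += 1
    return shown, clicks

# Hypothetical true click rates for two ad variations.
shown, clicks = thompson_split({"A": 0.02, "B": 0.10}, n_impressions=3000, seed=0)
```

The sample sizes end up very unequal and depend on the early outcomes, so a naive comparison of click rates between arms does not have the properties of a fixed-allocation randomized experiment.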

Divergent delivery and "the A/B test deception"

Further notes

Orazi, D. C., & Johnston, A. C. (2020). Running field experiments using Facebook split test. Journal of Business Research, 118, 189-198.

"Haven’t heard of an update since. They do something to mitigate the effects of targeting different audiences with the different treatments, but it’s still not quite random assignment"

"Bottom line: good news, bad news. I'm confirming that you're right: The "latest best possible settings" are still not giving you results that reflect the random experiment that a researcher in consumer psychology or advertising would be expecting. But the problems are worse than they may have seemed to you initially."

Notes on Facebook “Lift tests/Lift Studies” with “Multiple Test Groups”

Do Facebook “Lift tests/Lift Studies” with “Multiple Test Groups” give us the freedom we want to …

Randomize/balance different ad content ‘treatments’ to comparable groups?
Make inferences about ‘which treatment (ad) performs better, holding the audience constant’?

See "‘Meta for developers’ on Lift Tests:"

No. Josh: "what it says is something importantly different: you can compare the number of people who do the action you are interested in ... according to whether or not they see a given ad. So, you do have random assignment when comparing the effect of an ad to the effect of no ad. ... if we compare the lift for two different treatments (What these multi-cell lift tests are doing), we are doing almost exactly the same thing as we were without the lift functionality...

A and B are displayed to different audiences, so this test does not have random assignment."

Essentially this allows you to get the correct 'lift' of A and B, on their own distinct audiences, by getting the counterfactual audiences for each of these correct. But you cannot compare the lift of A and B on any comparable audience.

To help understand the context... "Facebook often randomizes the whole audience into different cells and THEN targets the ad WITHIN that audience. So there is random assignment at the initial stage, but that's irrelevant, because not everyone in the potential audience sees each ad"