Facebook split-testing issues

Facebook trials: "divergent delivery" → limited inference


The main point

Facebook serves each ad variation to the people it thinks are most likely to click on it.

Thus, in comparing one ad variation to another... you may learn:

  • "Which variation performs best on the 'best audience for that variation' (according to Facebook)"

  • But you don't learn "which variation performs better than others on any single comparable audience."


Update 4 Oct 2022: We may have found a partial solution to this, with ads targeting 'Reach' rather than optimizing for other measures like 'clicks'. We are discussing this further and will report back.

Researchers are interested in running trials using Facebook ads. However, inference can be difficult. Facebook doesn't give you full control of who sees what version of an advertisement.

  1. With A/B split testing etc.: Facebook has its own algorithm, which presumably uses something like Thompson sampling to optimize for an outcome (clicks, or a targeted action on the linked site with a 'pixel'). Statistical inference is challenging with adaptive designs and reinforcement-learning mechanisms, and because the procedure is not transparent, it is even more difficult to make statistical inferences about how one treatment performed relative to another. (See the sketch after this list.)

  2. Segmentation and composition of the population: Facebook's ad-delivery (ranking) algorithm determines who sees an ad. I don't think you can turn this off.
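To see concretely why adaptive allocation alone already muddies inference (before even considering targeting), here is a minimal, hypothetical sketch of Thompson-sampling-style allocation. The click rates and the procedure are made up for illustration; Facebook's actual delivery algorithm is not public.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration only: two ad variants with made-up true click rates.
true_ctr = {"A": 0.030, "B": 0.032}
alpha = {"A": 1.0, "B": 1.0}  # Beta-posterior "successes + 1" per variant
beta = {"A": 1.0, "B": 1.0}   # Beta-posterior "failures + 1" per variant
shown = {"A": 0, "B": 0}

for _ in range(10_000):
    # Thompson sampling: draw a plausible CTR for each variant from its
    # posterior and show the variant with the highest draw.
    draws = {v: rng.beta(alpha[v], beta[v]) for v in true_ctr}
    v = max(draws, key=draws.get)
    click = rng.random() < true_ctr[v]
    shown[v] += 1
    alpha[v] += click
    beta[v] += 1 - click

print(shown)  # exposure counts end up heavily unbalanced and depend on early noise
```

Even in this simplified setting, exposure is unbalanced and path-dependent; on Facebook each variant is additionally shown to different kinds of people, so naive variant-vs-variant comparisons are doubly compromised.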

Divergent delivery and "the A/B test deception"

Further notes

Orazi, D. C., & Johnston, A. C. (2020). Running field experiments using Facebook split test. Journal of Business Research, 118, 189-198.

"Haven’t heard of an update since. They do something to mitigate the effects of targeting different audiences with the different treatments, but it’s still not quite random assignment"

"Bottom line: good news, bad news. I'm confirming that you're right: The "latest best possible settings" are still not giving you results that reflect the random experiment that a researcher in consumer psychology or advertising would be expecting. But the problems are worse than they may have seemed to you initially."

Notes on Facebook "Lift tests/Lift Studies" with "Multiple Test Groups"

Do Facebook "Lift tests/Lift Studies" with "Multiple Test Groups" give us the freedom we want to …

  • Randomize/balance different ad content ‘treatments’ to comparable groups?

We haven't found a way to set it to "show all versions of an ad to comparable populations."
  • (And even if you could, it would be difficult for you to specifically describe "which population" your results pertain to.)

  • Make inferences about ‘which treatment (ad) performs better, holding the audience constant’?

    See ":"

    No. Josh: "what it says is something importantly different: you can compare the number of people who do the action you are interested in ... according to whether or not they see a given ad. So, you do have random assignment when comparing the effect of an ad to the effect of no ad. ... if we compare the lift for two different treatments (what these multi-cell lift tests are doing), we are doing almost exactly the same thing as we were without the lift functionality...

    A and B are displayed to different audiences, so this test does not have random assignment."

    Essentially this allows you to get the correct 'lift' of A and B, on their own distinct audiences, by getting the counterfactual audiences for each of these correct. But you cannot compare the lift of A and B on any comparable audience.

    To help understand the context... "Facebook often randomizes the whole audience into different cells and THEN targets the ad WITHIN that audience. So there is random assignment at the initial stage, but that's irrelevant, because not everyone in the potential audience sees each ad."

    'Meta for developers' on Lift Tests
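A stylized numerical example (all numbers invented) of the point above: each cell's lift is internally valid because of its randomized holdout, but the two lifts are measured on different audiences, so they cannot be compared as if they came from one experiment.

```python
# Stylized, made-up numbers: each cell's lift is internally valid, but the
# audiences Facebook picked for A and B differ, so comparing lift(A) with
# lift(B) confounds the ad with the audience.
cells = {
    # audience selected for A: higher-baseline, more responsive users
    "A": {"treated_rate": 0.050, "holdout_rate": 0.030},
    # audience selected for B: lower-baseline users
    "B": {"treated_rate": 0.025, "holdout_rate": 0.010},
}

for ad, r in cells.items():
    lift = r["treated_rate"] - r["holdout_rate"]
    print(f"Ad {ad}: lift = {lift:.3f} on *its own* audience")

# A shows the larger absolute lift here, but that tells us nothing about how
# A would have performed on B's audience (or vice versa).
```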

    Difference in difference/'Time-based methods'

    Estimating Ad Effectiveness using Geo Experiments in a Time-Based Regression Framework. Jouni Kerman, Peng Wang, and Jon Vaver, Google, Inc., March 2017

    Abstract (excerpt): "... While effective, this geo-based regression (GBR) approach is less applicable, or not applicable at all, for situations in which few geographic units are available for testing (e.g. smaller countries, or subregions of larger countries). These situations also include the so-called matched market tests, which may compare the behavior of users in a single control region with the behavior of users in a single test region. To fill this gap, we have developed an analogous time-based regression (TBR) approach for analyzing geo experiments. This methodology predicts the time series of the counterfactual market response, allowing for direct estimation of the cumulative causal effect at the end of the experiment. In this paper we describe this model and evaluate its performance using simulation."

    • DR hypothesis note-taking version
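To illustrate just the core TBR logic (the paper uses a Bayesian structural model; this is a bare-bones ordinary-least-squares sketch on simulated data): fit the pre-period relationship between the test and control series, predict the counterfactual test series during the campaign, and sum the observed-minus-predicted differences.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated daily response in a control and a test region (made-up data).
days = 120
control = 100 + 10 * np.sin(np.arange(days) / 7) + rng.normal(0, 2, days)
test = 0.8 * control + 20 + rng.normal(0, 2, days)
test[90:] += 15  # campaign starts on day 90 and adds ~15 units per day

pre, post = slice(0, 90), slice(90, days)

# Fit test ~ a + b*control on the pre-period (ordinary least squares).
b, a = np.polyfit(control[pre], test[pre], 1)

# Predict the counterfactual test-region series during the campaign period.
counterfactual = a + b * control[post]

# Cumulative causal effect: observed minus predicted, summed over the period.
cumulative_effect = np.sum(test[post] - counterfactual)
print(round(cumulative_effect, 1))  # roughly 15 * 30 = 450
```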

    Some specific notes/concerns

    "Geo experiments" where only a single geo is targeted for a treatment seem fairly common in practice: you 'try something in a single market once and see what it does'.

  1. This is probably reinventing some existing approach in econometrics (difference in differences, event studies?), but which one?

  2. I find it strange/suboptimal that they aggregate across the geos in the control group, throwing away variation that might tell us something about how much things 'typically vary by' without treatments. I wonder if there's another approach that brings that variation back?
     • Maybe this is 'because this is an easy extract to get from Google Analytics'? How do we get it?

  • This is five years old with no recent updates … ages in this world; is there something better to use instead?

  • Related software package

    Geographic segmentation/blocked randomization

    Discussion of blocking/randomizing treatments by post/zip code or other region, allowing us to more accurately tie treatments to ultimate outcomes

    "Measurement needs are varied and come with a variety of limitations, e.g., data availability, ad targeting restrictions, wide-ranging measurement objectives, budget availability, time constraints, etc."

    Kerman et al, 2017

    Why 'Geo experiments'?

    In many contexts, the route to a meaningful outcome (e.g., a GWWC pledge) is a long one, and attribution is difficult. An individual may have been first influenced by (1) a YouTube ad seen while watching a video on her Apple TV, then (2) by a friend's post on Facebook, and finally moved to act after (3) having a conversation at a bar and (4) visiting the GWWC website on her phone.

    The same individual may or may not be trackable through 'cookies' and 'pixels', but such tracking is already very limited and imprecise, and is being made harder by new legislation.

    "Geographic targeting" of individual treatments/trials/initiatives/ads may help better track, attribute, and yield inference about 'what works'. E.g., we might do a 'lift test':

    1. select a balanced random set of US zip codes for a particular repeated YouTube ad promoting GWWC, the "Treated group" (see the sketch after this list)

    2. compare the rate of GWWC visits, email sign-ups, pledges, and donations over the next 6 months from these zip codes relative to all other zip codes (possibly throwing out, or finding a way to draw additional inference from, zip codes adjacent to the treated group).

    We could also do multi-armed tests (of several types of ad or other treatment, with a similar setup as above).
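A minimal sketch of step 1, using hypothetical zip codes and a made-up balancing covariate: sort zip codes on a pre-period measure, pair neighbours, and randomize within pairs (a simple matched-pair blocked design).

```python
import random

random.seed(42)

# Hypothetical inputs (names and values are made up): zip codes with a
# pre-period covariate to balance on, e.g. past GWWC site visits per capita.
zips = {f"zip_{i:03d}": random.lognormvariate(0, 1) for i in range(200)}

# Pair zip codes with similar covariate values, then randomize one member of
# each pair to treatment: a simple blocked ('matched-pair') randomization.
ordered = sorted(zips, key=zips.get)
treated, control = [], []
for i in range(0, len(ordered) - 1, 2):
    a, b = ordered[i], ordered[i + 1]
    if random.random() < 0.5:
        a, b = b, a
    treated.append(a)
    control.append(b)

print(len(treated), "treated zip codes;", len(control), "control zip codes")
```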

    There are a few well-known and researched approaches (emphasis added):

    Geo experiments (Vaver and Koehler, 2011, 2012) meet a large range of measurement needs. They use non-overlapping geographic regions, or simply “geos,” that are randomly, or systematically, assigned to a control or treatment condition. Each region realizes its assigned treatment condition through the use of geo-targeted advertising. These experiments can be used to analyze the impact of advertising on any metric that can be collected at the geo level. Geo experiments are also privacy-friendly since the outcomes of interest are aggregated across each geographic region in the experiment. No individual user-level information is required for the “pure” geo experiments, although hybrid geo + user experiments have been developed as well (Ye et al., 2016). Matched market tests (see e.g., Gordon et al., 2016) are another specific form of geo experiments. They are widely used by marketing service providers to measure the impact of online advertising on offline sales. In these tests, geos are carefully selected and paired. This matching process is used instead of a randomized assignment of geos to treatment and control. Although these tests do not offer the protection of a randomization experiment against hidden biases, they are convenient and relatively inexpensive, since the testing typically uses a small subset of available geos. These tests often use time series data at the store level. Another matching step at the store level is used to generate a lift estimate and confidence interval.

    Where and how can we geographically block treatments?

    Context/location: Geographic blocking? (How)
      • YouTube ads
      • Facebook ads
      • USA: zip codes
      • Australia

    What if we can only apply the treatment to one, or a few, of many groups?

    We still may be able to make valuable inferences, under specified conditions, through 'difference in difference', 'event study', and 'time-based' approaches. We consider these in the next section:

    Difference in difference/'Time-based methods' (from Kerman et al, 2017)
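As a preview of those approaches, the simplest two-group, two-period difference-in-differences comparison looks like this (toy numbers, not real data):

```python
# Toy difference-in-differences example with made-up numbers: one treated
# geo and one untreated geo, each observed before and after the campaign.
visits = {
    #               (pre-period, post-period)  e.g. monthly GWWC site visits
    "treated_geo": (1000, 1400),
    "control_geo": (900, 1000),
}

pre_t, post_t = visits["treated_geo"]
pre_c, post_c = visits["control_geo"]

# DiD estimate: change in the treated geo minus change in the control geo.
# The control geo's change (+100) stands in for what would have happened to
# the treated geo without the campaign (the 'parallel trends' assumption).
did = (post_t - pre_t) - (post_c - pre_c)
print(did)  # 400 - 100 = 300 extra visits attributed to the campaign
```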

    Real-world assignment & inference

      • Geographic blocks versus individuals: how to block/stratify
      • See Geographic segmentation/blocked randomization for a mainly theoretical discussion of this
      • See Facebook split-testing issues for how to do split testing on Facebook, and the limits to traditional design given their setup

    Where A-B Testing Goes Wrong: How Divergent Delivery Affects What Online Experiments Cannot (and Can) Tell You About How Customers Respond to Advertising (papers.ssrn.com)