A discussion of issues in designing experiments and studies that are not specifically 'quantitative' but are important for drawing clear and useful inferences.
Academics usually try to make each treatment differ in precisely one dimension; the treatments are meant to represent the underlying model or construct as purely as possible. This can lead to setups that appear strange or artificial, which may itself elicit responses that are not representative or generalizable.
For example, in my 'give if you win' (lab) work we had a trial that was (paraphrasing) 'we are asking you to commit to a donation that may or may not be collected. If the coin flips heads, we will collect the amount you commit, otherwise no donation is made'. It was meant to isolate the component of the "give if you win" effect driven by the uncertain nature of the commitment, as distinct from the uncertain nature of the income. However, when we considered bringing this to field experiments, there was no way to do it without making it obvious that this was an experiment or a very strange exercise.
When we consider an experiment providing 'real impact information' to potential donors, we might be encouraged to use the exact write-up from GiveWell's page, for naturalness. However, this may not present the "lives per dollar" information in exactly the same way for two charities of interest, and the particular write-up may suggest certain "anchors" (e.g., whole numbers that people may want to contribute). Thus, if we use the exact GiveWell language, we cannot be fully confident that the provision of impact information is what drives any difference. We might be tempted to change the wording, but at a possible cost to naturalness and direct applicability.
There are very often tradeoffs of this sort.
In the present context, we have posted about our work, in general terms, on a public forum (EA forum post). Thus the idea that ‘people are running experiments to promote effective giving and EA ideas’ is not a well-kept secret. If participants in our experiments and trials are aware of this, it may affect their choices and responses to treatments. This general set of problems goes by various names, each emphasizing a different aspect; see 'experimenter demand', 'desirability bias', 'arbitrary coherence/coherent arbitrariness', observer bias (?), etc.
Several features of our context mitigate this. Most of our experiments will be conducted in subtle ways (e.g., small but meaningful variations in EA-aligned home pages), and each individual will see only one of these variations (assigned by geography or by IP-linked cookies). Furthermore, most of our experiments will target non-EA-aligned audiences, who are unlikely to read posts like this one. (People reading the EA forum post are probably ‘already converted’.)
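One way to ensure that a returning visitor always sees the same page variation is to derive the assignment deterministically from a stable identifier, such as a cookie value. The following is only a minimal sketch under stated assumptions, not a description of our actual setup: the arm names, the experiment label, and the function `assign_variant` are all hypothetical, and the identifier is assumed to be a string already stored in the visitor's cookie.

```python
import hashlib

# Hypothetical arm names, for illustration only.
VARIANTS = ["control", "impact_info"]


def assign_variant(visitor_id: str,
                   experiment: str = "homepage_v1",
                   variants=VARIANTS) -> str:
    """Map a visitor identifier to exactly one arm, deterministically.

    Hashing the experiment label together with the identifier means the
    same visitor always lands in the same arm within an experiment, while
    assignments in different experiments are not mechanically linked.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and reduce it to an arm index.
    return variants[int(digest, 16) % len(variants)]


# The same (assumed) cookie value yields the same arm on every visit.
first_visit = assign_variant("cookie-1234")
second_visit = assign_variant("cookie-1234")
```

Because the assignment depends only on the identifier and the experiment label, no lookup table is needed to keep a visitor in a single arm. It does not help if a visitor clears cookies or switches devices, which is one reason assignment by geography may be preferred in some settings.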
(To be fleshed out in more detail)
Universe (population) of interest, representativeness
Designing studies to measure 'cheap' behavior like 'clicks' (easier to observe, quicker feedback) versus meaningful and long-run behavior (like donations and pledges)
Attribution issues
Attrition issues (also see the quantitative sections)
Choice of impact measure/metric (also see the quantitative sections)