The Unjournal commissions public evaluations of impactful research in quantitative social sciences fields. We are seeking ‘pivotal questions’ to guide our choice of research papers to commission for evaluation. We are reaching out to organizations that aim to use evidence to do the most good, and asking: Which open questions most affect your policies and funding recommendations? For which questions would research yield the highest ‘value of information’?
Our main approach has been to search for papers and then commission experts to publicly evaluate them. (For more about our process, see here.) Our field specialist teams search and monitor prominent research archives (like NBER) and consider agendas from impactful organizations, while keeping an eye on forums and social media. So far, we have largely looked for research that seems relevant to impactful questions and crucial considerations. We're now exploring turning this on its head: identifying pivotal questions first, then evaluating a cluster of research that informs them. This could offer a more efficient and observable path to impact. (See our ‘logic model’ flowchart for our theory of change for context.)
The Unjournal will ask impact-focused research-driven organizations such as GiveWell, Open Philanthropy, and Charity Entrepreneurship to identify specific quantifiable questions^[We may later expand this to somewhat more open-ended and general questions; see below.] that impact their funding, policy, and research-direction choices. For example, if an organization is considering whether to fund a psychotherapeutic intervention in a low- or middle-income country (LMIC), they might ask “How much does a brief course of non-specialist psychotherapy increase happiness, compared to the same amount spent on direct cash transfers?” We’re looking for the questions with the highest value-of-information (VOI) for the organization’s work over the next few years. We have some requirements: the questions should relate to The Unjournal’s coverage areas and engage rigorous research in economics, social science, policy, or impact quantification. Ideally, organizations will identify at least one piece of publicly available research that relates to their question. But we are doing this mainly to help these organizations, so we will try to keep it simple and low-effort for them.
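To make ‘value of information’ concrete, here is a minimal sketch of an expected-value-of-perfect-information (EVPI) calculation for a two-option funding decision. The function name, option names, probabilities, and payoffs are all hypothetical illustrations, not figures from any organization, and a real VOI analysis would likely involve continuous uncertainty and imperfect information.

```python
def expected_value_of_perfect_information(probs, payoffs):
    """EVPI for a discrete decision problem.

    probs:   probability of each state of the world (must sum to 1).
    payoffs: dict mapping option name -> list of payoffs, one per state.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")

    # Deciding now: commit to the single option with the best expected payoff.
    ev_deciding_now = max(
        sum(p * v for p, v in zip(probs, values))
        for values in payoffs.values()
    )

    # Deciding with perfect information: pick the best option in each state,
    # then weight by how likely each state is.
    ev_with_information = sum(
        p * max(values[i] for values in payoffs.values())
        for i, p in enumerate(probs)
    )
    return ev_with_information - ev_deciding_now


# Hypothetical example: the therapy's effect is either "high" (40%) or "low" (60%).
probs = [0.4, 0.6]
payoffs = {
    "fund_psychotherapy": [10.0, 2.0],   # assumed impact units per dollar
    "fund_cash_transfers": [5.0, 5.0],   # assumed to be the same in both states
}
evpi = expected_value_of_perfect_information(probs, payoffs)
print(f"EVPI: {evpi:.2f} impact units per dollar")
```

In this toy case, resolving the uncertainty is worth about 1.8 impact units per dollar allocated. A question is ‘pivotal’ in this sense when the answer could change which option is chosen.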
The Unjournal team will then discuss the suggested questions, leveraging our field specialists’ expertise. We’ll rank these questions, prioritizing at least one for each organization. We’ll work with the organization to specify the priority question precisely and in a useful way. We want to be sure that 1. evaluators will interpret these questions as intended, and 2. the answers that come out are likely to be actually helpful. We’ll make these lists of questions public and solicit general feedback — on the relevance of the questions, on their framing, on key sub-questions, and on pointers to relevant research.
Where practicable, we will operationalize the target questions as a claim on a prediction market (for example, Metaculus) to be resolved by the evaluations and synthesis below.
Where feasible, post these on public prediction markets (such as Metaculus)
If the question is well operationalized, and we have a clear approach to 'resolving' it after the evaluations and synthesis, we will post it on a reputation-based market like Metaculus. Metaculus is offering 'minitaculus' platforms such as this one on Sudan to enable these more flexible questions.
We will ask (and help) the organizations and interested parties to specify their own beliefs about these questions, aka their 'priors'. We may adapt the Metaculus interface for this.
Once we’ve converged on the target question, we’ll do a variation of our usual evaluation process.
For each question we will prioritize roughly two to five papers. These papers may be suggested by the organization that suggested the question, sourced by The Unjournal, or discovered through community feedback.
As we normally do, we’ll have ‘evaluation managers’ recruit expert evaluators. However, we’ll ask the evaluators to focus on the target question, and to consider the target organization’s priorities.
We’ll also enable evaluators to deliberate and update their estimates. This is inspired by the repliCATS project, and by some evidence suggesting that the (mechanistically aggregated) estimates of experts after deliberation perform better than their independent estimates (also mechanistically aggregated). We may also facilitate collaborative evaluations and ‘live reviews’, following the examples of ASAPBio, PREreview, and others.
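‘Mechanistic aggregation’ here means combining individual estimates by a fixed rule rather than by negotiated consensus. The sketch below shows two common rules for probability estimates: the arithmetic mean, and the geometric mean of odds. The text does not say which rule The Unjournal or repliCATS uses, so both the choice of rules and the example estimates are assumptions for illustration only.

```python
import math


def mean_probability(probs):
    """Arithmetic mean of the individual probability estimates."""
    return sum(probs) / len(probs)


def geometric_mean_of_odds(probs):
    """Average the estimates in log-odds space, then convert back.

    Each probability must be strictly between 0 and 1.
    """
    if any(not 0.0 < p < 1.0 for p in probs):
        raise ValueError("probabilities must be strictly between 0 and 1")
    log_odds = [math.log(p / (1.0 - p)) for p in probs]
    average_log_odds = sum(log_odds) / len(log_odds)
    return 1.0 / (1.0 + math.exp(-average_log_odds))


# Hypothetical estimates from three evaluators for one binary claim.
independent = [0.10, 0.30, 0.70]         # before deliberation
after_deliberation = [0.25, 0.30, 0.40]  # after deliberation

for label, estimates in [("independent", independent),
                         ("after deliberation", after_deliberation)]:
    print(label,
          f"mean={mean_probability(estimates):.3f}",
          f"geo-odds={geometric_mean_of_odds(estimates):.3f}")
```

Whether the post-deliberation aggregate is actually more accurate can only be judged once the claim resolves. The code only shows how each aggregate would be computed.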
We will contact both the research authors (as per our standard process) and the target organizations for their responses to the evaluations, and for follow-up questions. We’ll foster a productive discussion between them (while preserving anonymity as requested, and being careful not to overtax people’s time and generosity).
We will ask evaluation managers to write a report summarizing the research investigated.
These reports should synthesize “What do the research, evaluations, and responses say about the question/claim?” They should provide an overall metric relating to the truth value of the target question (or similar for the parameter of interest). If and when we integrate prediction markets, they should decisively resolve the market claim.
Next, we will share these synthesis reports with authors and organizations for feedback.
We’ll put up each evaluation on our Unjournal.pubpub.org page, bringing them into academic search tools, databases, bibliometrics, etc. We’ll also curate them, linking them to the relevant target question and to the synthesis report.
We will produce, share, and promote further summaries of these packages. This could include forum and blog posts summarizing the results and insights, as well as interactive and visually appealing web pages. We might also produce less technical content, perhaps submitting work to outlets like Asterisk, Vox, or worksinprogress.co.
At least initially, we’re planning to ask for questions that could be definitively answered and/or measured quantitatively, and we will help organizations and other suggesters refine their questions to make this the case. These should approximately resemble questions that could be posted on forecasting platforms such as Manifold Markets or Metaculus. These should also somewhat resemble the 'claim identification' we currently request from evaluators.
We give detailed guidance with examples below.
We’re still refining this idea, and looking for your suggestions about what is unclear, what could go wrong, what might make this work better, what has been tried before, and where the biggest wins are likely to be. We’d appreciate your feedback! (Feel free to email contact@unjournal.org to make suggestions or arrange a discussion.)
If you work for an impact-focused research organization and you are interested in participating in our pilot, please reach out to us at contact@unjournal.org to flag your interest and/or complete this form. We would like to see:
A brief description of what your organization does (your ‘about us’ page is fine)
A specific, operationalized, high-value claim or research question you would like to be evaluated, that is within our scope (~quantitative social science, economics, policy, and impact measurement)
A brief explanation of why this question is particularly high value for your organization or your work, and how you have tried to answer it
If possible, a link to at least one research paper that relates to this question
Optionally, your current beliefs about this question (your ‘priors’)
Please also let us know how you would like to engage with us on refining this question and addressing it. Do you want to follow up with a 1-1 meeting? How much time are you willing to put in? Who, if anyone, should we reach out to at your organization?
Remember that we plan to make all of this analysis and evaluation public.
If you don’t represent an organization, we still welcome your suggestions, and will try to give feedback.
Please remember that we currently focus on quantitative ~social sciences fields, including economics, policy, and impact modeling (see here for more detail on our coverage). Questions surrounding (for example) technical AI safety, microbiology, or measuring animal sentience are less likely to be in our domain.
If you want to talk about this first, or if you have any questions, please send an email or schedule a meeting with David Reinstein, our co-founder and director.
Why are we seeking these pivotal questions to be 'operationalizable'?
This is in line with our own focus on this type of research.^[The Unjournal focuses on evaluating (mainly empirical) research that clearly poses and answers specific impactful questions, rather than research that seeks to define a question, survey a broad landscape of other research, open routes to further inquiry, etc. However, we have evaluated some broader work where it seemed particularly high impact, original, and substantive. E.g., we’ve evaluated work in ‘applied economic theory’ such as Aghion et al. on the impact of artificial intelligence on economic growth, and applied methodology, e.g., "Replicability & Generalisability: A Guide to CEA discounts".]
I think this will help us focus on fully-baked questions, where the answer is likely to provide actual value to the target organization and others (and avoid the old ‘42’ trap).
It offers potential for benchmarking and validation (e.g., using prediction markets), specific routes to measure our impact (updated beliefs, updated decisions), and informing the 'claim identification (and assessment)' we’re asking from evaluators (see footnote above).
However, as this initiative progresses we may allow a wider range of questions, e.g., more open-ended, multi-outcome, non-empirical (perhaps ‘normative’), and best-practice questions.
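One way the benchmarking and ‘updated beliefs’ mentioned above could be made measurable is to score stated probabilities against the eventual resolution. The sketch below uses the Brier score, a standard scoring rule for binary claims. Nothing in the text says this is the rule The Unjournal would adopt, and the prior, posterior, and resolution values are hypothetical.

```python
def brier_score(forecast_prob, outcome):
    """Squared error of a probability forecast for a binary claim.

    forecast_prob: stated probability that the claim is true (0 to 1).
    outcome:       1 if the claim resolved true, 0 if it resolved false.
    Lower is better; 0 is a perfect forecast, 0.25 matches a 50/50 guess.
    """
    if not 0.0 <= forecast_prob <= 1.0:
        raise ValueError("forecast_prob must be between 0 and 1")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 or 1")
    return (forecast_prob - outcome) ** 2


# Hypothetical: an organization's belief before and after reading the
# evaluations, for a claim that later resolves as true.
prior, posterior, resolved = 0.30, 0.65, 1

improvement = brier_score(prior, resolved) - brier_score(posterior, resolved)
print(f"Brier score improved by {improvement:.4f}")
```

A positive improvement means the updated belief ended up closer to the resolved answer than the prior was. This only works if the question is operationalized well enough to resolve at all.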
Phil Tetlock’s “Clairvoyance Test” is particularly relevant:
if you handed your question to a genuine clairvoyant, could they see into the future and definitively tell you [the answer]? Some questions like ‘Will the US decline as a world power?’...‘Will an AI exhibit a goal not supplied by its human creators?’ struggle to pass the Clairvoyance Test… How do you tell one type of AI goal from another, and how do you even define it?... In the case of whether the US might decline as a world power, you’d want to get at the theme with multiple well-formed questions such as ‘Will the US lose its #1 position in the IMF’s annual GDP rankings before 2050?’....
Some questions are important, but difficult to make specific, focused, and operationalizable. For example (from 80,000 Hours’ list of “research questions”):
“What can economic models … tell us about recursive self improvement in advanced AI systems?”
“How likely would catastrophic long-term outcomes be if everyone in the future acts for their own self-interest alone?”
“How could AI transform domestic and mass politics?”
Other questions are easier to operationalize or break down into several specific sub-questions. For example (again from 80,000 Hours’ “research questions”):
Could advances in AI lead to risks of very bad outcomes, like suffering on a massive scale? Is it the most likely source of such risks?
I rated this a 3/10 in terms of how operationalized it was. The word “could” is vague. “Could” might suggest some reasonable probability outcome (1%, 0.1%, 10%), or it might be interpreted as “can I think of any scenario in which this holds?” “Very bad outcomes” also needs a specific measure.
However, we can reframe this to be more operationalized. E.g., here are some fairly well-operationalized questions:
What is the risk of a catastrophic loss (defined as the death of at least 10% of the human population over any five year period) occurring before the year 2100?
How does this vary depending on the total amount of money invested in computing power for building advanced AI capabilities over the same period?
Here are some highly operationalizable questions developed by the Farm Animal Welfare team at Open Phil:
What percentage of plant-based meat alternative (PBMA) units/meals sold displace a unit/meal of meat?
What percentage of people will be [vegetarian or vegan] in 20, 50, or 100 years?
And a few more posed and addressed by Our World in Data:
How much of global greenhouse gas emissions come from food? (full article)
What share of global CO₂ emissions come from aviation? (full article)
However, note that many of the above questions are descriptive or predictive. We are also very interested in causal questions such as:
What is the impact of an increase (decrease) in blood lead level by one “natural log unit” on children’s learning in the developing world (measured in standard deviation units)?
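A change of one natural log unit corresponds to multiplying (or dividing) the blood lead level by e ≈ 2.718. The sketch below converts an effect stated per log unit into the implied effect for a specific change in levels, assuming the relationship is linear in the log of the blood lead level. The coefficient and the levels used are hypothetical placeholders, not estimates from any study.

```python
import math


def implied_effect_sd(effect_per_log_unit, level_before, level_after):
    """Implied change in learning (in SD units) for a change in blood lead level.

    Assumes learning is linear in ln(blood lead level), so the effect scales
    with ln(level_after / level_before). The units of the two levels cancel.
    """
    if level_before <= 0 or level_after <= 0:
        raise ValueError("blood lead levels must be positive")
    return effect_per_log_unit * math.log(level_after / level_before)


# Hypothetical coefficient: learning falls by 0.2 SD per +1 ln unit of blood lead.
HYPOTHETICAL_EFFECT = -0.2

# Halving the level (e.g., 6 -> 3 in any consistent unit) is a change of
# ln(0.5), about -0.693 log units.
halving = implied_effect_sd(HYPOTHETICAL_EFFECT, 6.0, 3.0)
print(f"Implied effect of halving blood lead: {halving:+.3f} SD")
```

Stating the question per log unit is what makes estimates from settings with different baseline lead levels comparable, provided the log-linear assumption holds.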