Guidelines for evaluators
Thanks for your interest in evaluating research for The Unjournal!
Your evaluation will be made public and given a DOI, but you have the option to remain anonymous or to "sign your review" and take credit.
You will be given a (minimum) $400 honorarium for providing an on-time and complete evaluation and feedback. [Updated 14 July 2023] You will also be eligible for monetary prizes for "most useful and informative evaluation," plus other bonuses. See the guidelines below. You can submit your response in a Google Doc (see note), and share it back with us. Click here to make a new copy of this form directly.
We are in the process of building a more clearly justified and interpretable set of guidelines, leveraging insights and innovations from previous work (such as repliCATS). This will take some time. In the meantime we are making some updates and simplifications; these are noted below. The main changes made so far are:
- 1. Removing the "suggested weightings" for ratings categories
- 2. Adjusting the discussion of the "overall assessment" category
We may occasionally offer additional payments for specifically requested evaluation tasks, or raise the base payments for particularly hard-to-source expertise.
July 2023: The above is our current policy; we are working to build an effective, fair, transparent, and straightforward system of honorariums, incentives, and awards for evaluators.
- 1. Write a review: a "standard high-quality referee report," with some specific considerations.
- 2. Give quantitative metrics and predictions as requested in the two tables below.
- 3. Answer a short questionnaire about your background and our processes.
In writing your report (and providing ratings), please consider the following:
Please pay attention to anything our managers and editors specifically asked you to focus on. We may ask you to focus on specific areas of expertise; you do not need to address all aspects of the work. We may also forward specific feedback requests from authors.
For the most part, this is like a standard journal review, but we have some particular priorities. See Category explanations: what you are rating for guidance. For example, we would like to prioritize impact and robustness over cleverness.
Unless you were advised otherwise, your evaluation will be given a DOI and, hopefully, will enter the public research conversation. Note that the authors will be given two weeks to respond to reviews before the evaluations, ratings, and responses are made public. You will be given a choice of whether you want to be listed publicly as an author of the review.
If you have questions about the authors’ work or need clarification, you can ask the authors anonymously; we will facilitate this.
We want you to evaluate the most recent/relevant version of the paper/project that you can access. If you see a more recent version than the one we shared with you, please let us know.
We are considering the best policy towards signed reviews vs. single-blind reports. For now we give evaluators the option to choose, and you can wait to choose until after you have completed the report. We may change this policy in the future.
We may give early-career researchers the right to veto the publication of very negative reviews or to embargo the release of these reviews for a defined period. We will inform you in advance if this will be the case for your evaluation.
You can reserve some "sensitive" content in your report to be shared with only The Unjournal management or only the authors, but we hope to keep this limited.
Important questions for your evaluation include who the audience is, what value the evaluation provides, and how you should prioritize feedback.
Essentially, we want you to put an equal value on:
- making the evaluations and ratings useful for readers and policymakers;
- making them meaningful for "assessing academics" (as a measure of value to consider against the current "journal tier" system); and
- communicating useful feedback to researchers, to help them improve their work.
We are generally asking for the sort of report an academic would write for a traditional high-prestige journal, subject to some differences in priorities (discussed below) and to any particular requests the managing editor may communicate to you.
Length/time spent: This is up to you. We welcome detail, elaboration, and technical discussion.
The Econometric Society recommends a 2–3 page referee report; Berk et al. suggest this is relatively short, but confirm that brevity is desirable. In a recent survey (Charness et al., 2022), economists report spending (median and mean) about one day per report, with substantial shares reporting "half a day" and "two days." We expect that reviewers tend to spend more time on papers for high-status journals, and when reviewing work that is closely tied to their own agenda.
Our general priorities are embodied in the quantitative metrics below. We believe these are similar, but not identical, to criteria used by the top journals in economics and adjacent fields.
Below is a completed example. We will give evaluators a concise survey form with everything they need to fill out.
Evaluators working before October 2023 saw a previous version of the table, which you can see HERE.
Although we ask you to rate (and discuss) the relevance of this work to global priorities, we give it a suggested weight of 0, as we don't think this should enter into your overall assessment rating. Why not? While we do think relevance to global priorities is very important, we want the overall assessment here to be somewhat comparable to that of traditional journals, to enable benchmarking.
For each question above, if it seems relevant and you feel qualified to judge, please . . .
However, we recognize (as of June 2023) that we have not yet defined our criteria and their metrics precisely; we are working to improve this. We may adopt a more explicit metric, e.g., in terms of "the distribution of research work typically published in conventional journals with particular tiers."
Ideally, we would like you to state your "confidence intervals" or "credibility intervals." Loosely speaking, we hope to capture a sense of how sure you are about your ratings. This will help people who read your evaluation to know how much weight to put on them in using them for making their own decisions. These can also be used in systematic ways for meta-science and meta-analysis. We can "aggregate expert judgment" to get a better measure of how confident we should be about particular measures and claims.
You may know most of the concepts below, but you might be unfamiliar with applying them in a situation like this one.
Suppose your best guess for the "Methods..." criterion is 65. Still, even an expert can never be certain. E.g., you may misunderstand some aspect of the paper, there may be a justification or method you are not familiar with, you might not understand the criterion, etc.
Your "uncertainty" over this could be described by some distribution, representing your beliefs about the true value of this criterion. By "true value," you might think, "If you had all the evidence, knowledge, and wisdom in the world, and the benefit of perfect hindsight, what value would you choose for this criterion?"
Your "best guess" should (basically) be the central mass point of this distribution.
You are asked to give a 90% interval. Loosely speaking, you could consider something like, "What is the smallest interval around this best guess that I believe is 90% likely to contain the true value?"
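As a minimal sketch, a best guess and its 90% interval could be recorded and sanity-checked as below. The `Rating` class and its field names are hypothetical (they are not part of any Unjournal form), and the bounds of 45 and 90 around a best guess of 65 are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """A 0-100 rating with a 90% interval (hypothetical helper)."""
    midpoint: float  # best/central guess
    lower: float     # lower bound of the 90% interval
    upper: float     # upper bound of the 90% interval

    def __post_init__(self) -> None:
        # The bounds must bracket the best guess and stay on the 0-100 scale.
        if not 0 <= self.lower <= self.midpoint <= self.upper <= 100:
            raise ValueError("expected 0 <= lower <= midpoint <= upper <= 100")

    @property
    def width(self) -> float:
        """Total width of the interval, in points."""
        return self.upper - self.lower


# Illustrative values: best guess 65, 90% interval from 45 to 90.
methods = Rating(midpoint=65, lower=45, upper=90)
print(methods.width)  # 45
```

Note that the interval does not have to be symmetric around the best guess: here it extends 20 points below and 25 points above.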
E.g., you might have thoughts similar to these:
"I am going to interpret the 'methods' in terms of their reliability for consistent causal inference and minimizing parameter mean-squared error in settings like this one.
"I see the suggested metrics scale. Although this scale gives descriptive criteria, I think the best interpretation of this metric would consider appropriateness of the methods chosen relative to the choices made across the distribution of all papers published in any top-50-or-above, impact-factor-rated journal in economics. My best/central guess is that this paper falls into the 65th percentile for this.
"I have made intuitive judgments on questions like this in the past. I sometimes changed my mind a bit. Considering this in context, I am only somewhat confident in my judgment here. I'm unsure about the diagnostic tests for the two-way fixed effects. I'd put about a 5% probability that this work is actually in the bottom 45% of all work submitted to such journals. On the other hand, if these diagnostic tests were powerful, this would be among the strongest work in this respect. Thus, I'd give a 5% chance that this is in the top 10% of such work in this sense.
"Thus, I give a central score of 65 for this metric, with 90% bounds (45, 90)."
"But how do I know if I'm setting these bounds right?"
One consideration is "calibration." If you're well-calibrated, then your specified 90% bounds should contain the true value close to 90% of the time. Similarly, 50% bounds should contain the true value half the time. If your 90% bounds contain the true value less than 90% of the time, you're being overconfident (try to give wider bounds in the future). If they contain the true value more than 90% of the time, you are underconfident (specify tighter bounds going forward).
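The calibration check described above can be sketched as a simple coverage calculation. This is only an illustration: the function name `coverage` is hypothetical, the data are made up, and in practice the "true value" of a rating is rarely observable, so a check like this is easiest to run on practice questions with known answers.

```python
def coverage(intervals, true_values):
    """Share of (lower, upper) intervals that contain the matching true value."""
    if len(intervals) != len(true_values):
        raise ValueError("need one true value per interval")
    if not intervals:
        raise ValueError("need at least one interval")
    hits = sum(lo <= x <= hi for (lo, hi), x in zip(intervals, true_values))
    return hits / len(intervals)


# Made-up data for illustration only: ten stated 90% intervals and the
# values that later turned out to be "true".
intervals = [(40, 60), (55, 80), (10, 30), (70, 95), (45, 90),
             (20, 50), (60, 75), (30, 65), (50, 85), (5, 25)]
truths = [52, 83, 22, 88, 65, 35, 77, 41, 60, 12]

rate = coverage(intervals, truths)
print(f"{rate:.0%} of the 90% intervals contained the true value")
```

With these made-up numbers, 8 of the 10 intervals contain the true value. Coverage of 80% on intervals meant to be 90% would point toward overconfidence, although ten cases are far too few to draw a firm conclusion.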
"The aim of the web app is to help you become 'well-calibrated.' This means that when you say you’re 50% confident, you’re right about 50% of the time; when you say you're 90% confident, you're right about 90% of the time; and so on."
We see "overall assessment" as the most important measure. Please prioritize this.
Judge the work’s quality heuristically. Consider all aspects of quality, importance to knowledge production, and importance to practice.
The description folded below focuses on the "Overall Assessment." Please try to use a similar scale when evaluating the category metrics.
95–100: Among the highest quality and most important work you have ever read.
90–100: This work represents a major achievement, making substantial contributions to the field and practice. Such work would/should be weighed very heavily by tenure and promotion committees, and grantmakers.
- Most work in this area in the next ten years will be influenced by this paper.
- This paper is substantially more rigorous or more insightful than existing work in this area in a way that matters for research and practice.
- The work makes a major, perhaps decisive contribution to a case for (or against) a policy or philanthropic intervention.
75–89.9: This work represents a strong and substantial achievement. It is highly rigorous, relevant, and well-communicated, up to the standards of the strongest work in this area (say, the standards of the top 5% of committed researchers in this field). Such work would/should not be decisive in a tenure/promotion/grant decision alone, but it should make a very solid contribution to such a case.
60–74.9: A very strong, solid, and relevant piece of work. It may have minor flaws or limitations, but overall it is very high-quality, meeting the standards of well-respected research professionals in this field.
40–59.9: A useful contribution, with major strengths, but also some important flaws or limitations.
20–39.9: Some interesting and useful points and some reasonable approaches, but only marginally so. Important flaws and limitations. Would need substantial refocus or changes of direction and/or methods in order to be a useful part of the research and policy discussion.
5–19.9: Among the lowest quality papers; not making any substantial contribution and containing fatal flaws. The paper may fundamentally address an issue that is not defined or obviously not relevant, or the content may be substantially outside of the authors’ field of expertise.
0–4: Illegible, fraudulent, or plagiarized. Please flag fraud, and notify us and the relevant authorities.
We want policymakers and researchers to be able to use The Unjournal's evaluations to carefully update their beliefs and make better decisions. To do this well, they need to weigh multiple evaluations against each other, and against other sources of information. How much weight should they give to each? In this context, it is important to quantify the uncertainty. That's why we ask you to provide a measure of this. You may feel comfortable giving your "90% confidence interval," or you may prefer to give a "descriptive rating" of your confidence (from "extremely confident" to "not confident").
5 = Extremely confident, i.e., 90% confidence interval spans +/- 4 points or less
4 = Very confident: 90% confidence interval +/- 8 points or less
3 = Somewhat confident: 90% confidence interval +/- 15 points or less
2 = Not very confident: 90% confidence interval +/- 25 points or less
1 = Not confident: 90% confidence interval +/- more than 25 points
Remember, we would like you to give a 90% CI or a confidence rating (1–5 dots), but not both.
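The correspondence between a 90% interval and the 1–5 descriptive rating can be sketched as a small lookup. The function name `confidence_rating` is hypothetical, and one assumption goes beyond the text: for an asymmetric interval, the larger of the two distances from the best guess is treated as the "+/-" figure.

```python
def confidence_rating(midpoint, lower, upper):
    """Map a 90% interval to the 1-5 descriptive confidence scale.

    Assumption (not stated in the guidelines): for an asymmetric interval,
    the larger distance from the midpoint is used as the "+/-" figure.
    """
    if not lower <= midpoint <= upper:
        raise ValueError("expected lower <= midpoint <= upper")
    half_width = max(midpoint - lower, upper - midpoint)
    # Thresholds from the list above, as (maximum +/- points, rating).
    for limit, rating in ((4, 5), (8, 4), (15, 3), (25, 2)):
        if half_width <= limit:
            return rating
    return 1  # more than +/- 25 points


print(confidence_rating(65, 45, 90))  # 2
print(confidence_rating(65, 60, 70))  # 4
```

Under this assumption, a best guess of 65 with bounds (45, 90) has a largest distance of 25 points and maps to "not very confident," while bounds (60, 70) map to "very confident."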
Note that all of these criteria are scales (not binaries).
("To what extent"...) does the project make a contribution to the field or to practice, particularly in ways that will be relevant to our other criteria?
Originality and cleverness should be weighted less than the typical journal, because The Unjournal focuses on impact. Papers that apply existing techniques and frameworks more rigorously than previous work or apply them to new areas in ways that provide practical insights for GP (global priorities) and interventions should be highly valued. More weight should be placed on contribution to GP than to the academic field.
Do the insights generated inform our ("posterior") beliefs about important parameters and about the effectiveness of interventions? Note that we do not require a substantial shift in our expectations; sound and well-presented "null results" can be valuable.
Does the project leverage and incorporate recent relevant and credible work in useful ways?
Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? Are all of the given results justified in the discussion of methods?
Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?
Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, pre-registration, multiple hypothesis testing corrections, and reporting flexible specifications.
Coherent and clear argumentation, communication, reasoning transparency
Are the goals/questions of the paper clearly expressed? Are concepts clearly defined and referenced?
Is the reasoning "transparent"? (See, e.g., Open Philanthropy's guide on reasoning transparency.) Are all of the assumptions and logical steps made clear? Does the logic of the arguments make sense? Is the argument written well enough to make it easy to follow?
Are the data and/or analysis presented relevant to the arguments made? Are the stated conclusions/results consistent with the evidence (or theoretical results/proofs) presented? Are the tables/graphs/diagrams easy enough to understand in the context of the narrative (e.g., no errors in labeling)?
4a. Replicability, reproducibility, data integrity
Would another researcher be able to perform the same analysis and get the same results? Are the method and its details explained sufficiently, in a way that would enable easy and credible replication? For example, a full description of analysis, code and software provided, and statistical tests fully explained. Is the source of the data clear?
Are the necessary data made as widely available as possible, where applicable? Ideally, the cleaned data should also be clearly labeled and explained/legible.
Optional: Are we likely to be able to construct the output from the shared code (and data)? Note that evaluators are not required to run or evaluate the code; this is at your discretion. However, having a quick look at some of the elements could be helpful. Ideally, the author should give code that allows easy, full replication; for example, a single R script that runs and creates everything, starting from the original data source, and including data cleaning files. This would make it fairly easy for an evaluator to check. For example, see this taxonomy of "levels of computational reproducibility."
Do the numbers in the paper (and code output, if checked) make sense? Are they internally consistent throughout the paper?
4c. Useful building blocks
Do the authors provide tools, resources, data, and outputs that are likely to enable and enhance future work and meta-analysis?
Does the paper consider the real-world relevance of the arguments and results presented, perhaps engaging policy and implementation questions?
Is the setup particularly well-informed by real-world norms and practices? “Is this realistic; does it make sense in the real world?”
Authors might be encouraged—and should be rewarded—for the following:
- Do the authors communicate their work in ways policymakers and decision-makers are likely to understand (perhaps in a supplemental "non-technical abstract"), without being misleading and oversimplifying?
- Do the authors present practical "impact quantifications," such as cost-effectiveness analyses, or provide results enabling these?
In future we may be able to pay them to do the above, if grant funding permits.
Is this topic, approach, and discussion potentially useful to global priorities research and interventions?
We would like to benchmark our evaluations against "how research is currently judged." We want to provide a bridge between the current accept-or-reject system and an evaluation-based system. We want our evaluations to be taken seriously by universities and policymakers. Thus, we are asking you for two "predictions" in the table below. The first is a "real-world" prediction, and the second is a comparable measure for a hypothetical "ideal world."
*To better understand what we are asking here, please consult the subsections below: Journal metrics; In what quality level of journal . . . ; and Overall assessment on "scale of journals"
For the "prediction" questions above, we are asking for a journal quality rating prediction from 0.0 to 5.0. You can specify up to two digits (e.g., “4.4” or “2.0”). We are using this 0–5 metric here (rather than 0–100) as we suspect it is more familiar to academics.
The metrics are:
0/5: Marginally respectable/Little to no value; not publishable in any journal with scrutiny or credible WP series; not likely to be cited by credible researchers
1/5: OK/Somewhat valuable journal
2/5: Marginal B-journal/Decent field journal
3/5: Top B-journal/Strong field journal
4/5: Marginal A-Journal/Top field journal
5/5: A-journal/Top journal
The question above presumes that this work has not already been published in a peer-reviewed journal. However, we are planning to commission at least some post-publication review going forward. If the work has already been peer-review-published, you can either:
- Answer a related question (not a prediction): “Suppose this paper were submitted to journals, in succession, from the top tier downwards. Imagine there is some randomness in this process. Consider all possible "random draws of the world." In the "median draw," in what quality level of journal would this paper be published?”
From "five dots" to "one dot":
5 = Extremely confident, i.e., 90% confidence interval spans +/– 4 points or less*
4 = Very confident: 90% confidence interval +/– 8 points or less
3 = Somewhat confident: 90% confidence interval +/– 15 points or less
2 = Not very confident: 90% confidence interval +/– 25 points or less
1 = Not confident: 90% confidence interval +/– more than 25 points
Consider the scale of journals described above. Suppose that
- 1. the journal process was fair, unbiased, and free of noise, and that status, social connections, and lobbying to get the paper published didn’t matter;
- 2. journals assessed research according to the category metrics we discussed above; and
- 3. this research was being submitted to journals according to this fair process.*
In such a case, in what quality level of journal would and should this research be published in its current form (or with minor revisions)?
For the questions below, we will publish your responses and review unless you ask us to keep them anonymous.
- 1. How long have you been in this field?
- 2. How many proposals and papers have you evaluated? (For journals, grants, and other peer review.)
Your answers to the questions below will not be made public:
- 1. How would you rate this template and process?
- 2. Do you have any suggestions or questions about this process or The Unjournal? (We will try to respond and to incorporate your suggestions.) [Open response]
- 3. Would you be willing to consider evaluating a revised version of this project?
- Cite evidence and reference specific parts of the research when giving feedback.
- Try to justify your critiques and claims in a reasoning-transparent way, rather than merely "passing judgment."
- Provide specific, actionable feedback to the author where possible.
- When considering the authors’ arguments, consider the most reasonable interpretation of what they have written (and state what that is, to help the author make their point more clearly). See steelmanning.
- Be collegial and encouraging, but also rigorous. Criticize and question specific parts of the research without suggesting criticism of the researchers themselves.
We are happy for you to use whichever process and structure you feel comfortable with when writing a peer review.
- Assign an overall score based on quantitative metrics (possibly with a brief discussion of these metrics).
- Summarize the work and issues, and the research in context to convey your understanding and help others understand it.
- Highlight positive aspects of the paper, strengths and contributions.
- Assess the contribution of the work in context of existing research.
- Note major limitations and potential ways the work could be improved; where possible, reference methodological literature and discussion and work that models what you are suggesting.
- Discuss minor flaws and their potential revisions.
- You are not obliged (or paid) to spend a great deal of time copyediting the work. If you like, you can give a few specific suggestions and then suggest that the author look to make other changes along these lines.
- Offer suggestions for research agendas, increasing the impact of the work, incorporating the work into global priorities research and impact evaluations, and enhancing future work.
Remember: The Unjournal doesn’t “publish” and doesn’t “accept or reject.” So don’t give an Accept, Revise and Resubmit, or Reject-type recommendation. We just want quantitative metrics, some written feedback, and some relevant discussion.
We still want your evaluation and ratings. Some things to consider as an evaluator in this situation:
- 1. We still want your quantitative ratings and predictions.
- 2. A paper or project is not only a good to be judged on a single scale. How useful is it, and to whom or what? We'd like you to discuss its value in relation to previous work, its implications, what it suggests for research and practice, etc.
- 3. Even if the paper is great . . .
- Would you accept it in the "top journal" in economics? If not, why not?
- Would you hire someone based on this paper?
- Would you fund a major intervention (as a government policymaker, major philanthropist, etc.) based on this paper alone? If not, why not?
- 4. What are the most important and informative results of the paper?
- 5. Can you quantify your confidence in these "crucial" results, and their replicability and generalizability to other settings? Can you state your probabilistic bounds (confidence or credibility intervals) on the quantitative results (e.g., 80% bounds on QALYs, DALYs, or WELLBYs per $1,000)?
- 6. Would any other robustness checks or further work have the potential to increase your confidence (narrow your belief bounds) in this result? Which?
- 7. Do the authors make it easy to reproduce the statistical (or other) results of the paper from shared data? Could they do more in this respect?
- 8. Communication: Did you understand all of the paper? Was it easy to read? Are there any parts that could have been better explained?
- Is it communicated in a way that would be useful to policymakers? To other researchers in this field, or in the general discipline?