Conventional guidelines for referee reports

How to write a good review (general conventional guidelines)

Some key points
  • Cite evidence and reference specific parts of the research when giving feedback.

  • Justify your critiques and claims in a reasoning-transparent way, rather than merely "passing judgment." Avoid comments like "this does not pass the smell test".

  • Provide specific, actionable feedback to the author where possible.

  • Try to restate the authors' arguments, clearly presenting the most reasonable interpretation of what they have written (see 'steelmanning').

  • Be collegial and encouraging, but also rigorous. Criticize and question specific parts of the research without suggesting criticism of the researchers themselves.

We're happy for you to use whichever process and structure you feel comfortable with when writing your evaluation content.

One possible structure

Core

  1. Briefly summarize the work in context

  2. Highlight positive aspects of the paper and its strengths and contributions, considered in the context of existing research.

  3. Most importantly: Identify and assess the paper's most important and impactful claim(s). Are these supported by the evidence provided? Are the assumptions reasonable? Are the authors using appropriate methods?

  4. Note major limitations and potential ways the work could be improved; where possible, reference methodological literature and discussion, and work that models what you are suggesting.

Optional/desirable

  • Offer suggestions for increasing the impact of the work, for incorporating the work into global priorities research and impact evaluations, and for supporting and enhancing future work.

  • Discuss minor flaws and their potential revisions.

  • Desirable: formal 'claim identification and assessment'.

Please don't spend time copyediting the work. If you like, you can give a few specific suggestions and then suggest that the author look to make other changes along these lines.

Remember: The Unjournal doesn't "publish" and doesn't "accept or reject." So don't give an Accept, Revise-and-Resubmit, or Reject-type recommendation. We ask for quantitative metrics, written feedback, and expert discussion of the validity of the paper's main claims, methods, and assumptions.

Writing referee reports: resources and benchmarks

Economics

  • How to Write an Effective Referee Report and Improve the Scientific Review Process (Berk et al., 2017)

  • Semi-relevant: Econometric Society: Guidelines for referees

  • Report: Improving Peer Review in Economics: Stocktaking and Proposal (Charness et al., 2022)

Open Science

  • PLOS (conventional but open access; simple and brief)

  • Peer Community In... Questionnaire (open-science-aligned; perhaps less detail-oriented than we are aiming for)

  • Open Reviewers Reviewer Guide (journal-independent "PREreview"; detailed; targets ECRs)

General, other fields

  • The Wiley Online Library (conventional; general)

  • "Peer review in the life sciences" (Fraser) (extensive resources; only some of this is applicable to economics and social science)

Other templates and tools

  • 'The 4 validities' and 'claim identification and assessment'

  • Collaborative template: RRR assessment peer review

  • Introducing Structured PREreviews on PREreview.org

    Proposed curating robustness replication

We are considering asking evaluators, with compensation, to assist and engage in the process of "robustness replication." This may lead to some interesting follow-on possibilities as we build our potential collaboration with the Institute for Replication and others in this space.

    We might ask evaluators discussion questions like these:

• What is the most important, interesting, or relevant substantive claim made by the authors (particularly considering global priorities and potential interventions and responses)?

    • What statistical test or evidence does this claim depend on, according to the authors?

    • How confident are you in the substantive claim made?

    • "Robustness checks": What specific statistical test(s) or piece(s) of evidence would make you substantially more confident in the substantive claim made?

    • If a robustness replication "passed" these checks, how confident would you be then in the substantive claim? (You can also express this as a continuous function of some statistic rather than as a binary; please explain your approach.)

    Background:

The Institute for Replication is planning to hire experts to do "robustness replications" of work published in top journals in economics and political science. Code and data sharing is now being enforced in many or all of these journals and other important outlets. We want to support their efforts and are exploring collaboration possibilities. We are also considering how best to guide potential future robustness replication work.

    Why these guidelines/metrics?

31 Aug 2023: Our present approach is a "working solution" involving some ad hoc and intuitive choices. We are re-evaluating the metrics we are asking for, as well as the interface and framing. We are gathering some discussion in this linked Gdoc ("Unjournal Evaluator Guidelines and Metrics - Discussion space"), incorporating feedback from our pilot evaluators and authors. We are also talking to people with expertise, and considering past practice and other ongoing initiatives. We plan to consolidate that discussion and our consensus and/or conclusions into the present (GitBook) site.

Why numerical ratings?

Ultimately, we're trying to replace the question "what tier of journal did a paper get into?" with "how highly was the paper rated?" We believe this is a more valuable metric: it can be more fine-grained, it should be less prone to gaming, and it aims to reduce the randomness introduced by things like the availability of journal space in a particular field. See our discussion of "Reshaping academic evaluation: beyond the binary".

    To get to this point, we need to have academia and stakeholders see our evaluations as meaningful. We want the evaluations to begin to have some value that is measurable in the way “publication in the AER” is seen to have value.

While there are some ongoing efforts towards journal-independent evaluation, these tend not to use comparable metrics. Typically, they either have simple tick-boxes (like "this paper used correct statistical methods: yes/no") or they enable descriptive evaluation without an overall rating. As we are not a journal and we don't accept or reject research, we need another way of assigning value. We are working to determine the best way of doing this through quantitative ratings. We hope to be able to benchmark our evaluations against "traditional" publication outcomes. Thus, we think it is important to ask for both an overall quality rating and a journal ranking tier prediction.

Why these categories?

    In addition to the overall assessment, we think it will be valuable to have the papers rated according to several categories. This could be particularly helpful to practitioners who may care about some concerns more than others. It also can be useful to future researchers who might want to focus on reading papers with particular strengths. It could be useful in meta-analyses, as certain characteristics of papers could be weighed more heavily. We think the use of categories might also be useful to authors and evaluators themselves. It can help them get a sense of what we think research priorities should be, and thus help them consider an overall rating.

    However, these ideas have been largely ad-hoc and based on the impressions of our management team (a particular set of mainly economists and psychologists). The process is still being developed. Any feedback you have is welcome. For example, are we overemphasizing certain aspects? Are we excluding some important categories?

We are also researching other frameworks, templates, and past practice; we hope to draw from validated, theoretically grounded projects such as RepliCATS.

Why ask for credible intervals?

In eliciting expert judgment, it is helpful to differentiate the level of confidence in predictions and recommendations. We want to know not only what you believe, but how strongly held your beliefs are. If you are less certain in one area, we should weigh the information you provide less heavily in updating our beliefs. This may also be particularly useful for practitioners.

Obviously, there are challenges to any approach. Even experts in a quantitative field may struggle to convey their own uncertainty. They may also be inherently "poorly calibrated" (see discussions and tools for calibration training). Some people may often be "confidently wrong": they state very narrow "credible intervals" even though the truth, where measurable, routinely falls outside these boundaries. People with greater discrimination may sometimes be underconfident. One would want to consider and potentially correct for poor calibration. As a side benefit, this may be interesting for research in and of itself, particularly as The Unjournal grows. We see 'quantifying one's own uncertainty' as a good exercise for academics (and everyone) to engage in.
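To make "calibration" concrete, here is a small illustrative sketch (not part of The Unjournal's tooling; the intervals and outcomes below are invented) showing how one could check whether a set of stated 90% credible intervals actually contains the realized values roughly 90% of the time:

```python
# Illustrative only: checking the "hit rate" of stated 90% credible intervals
# against realized outcomes. The data below are invented for the example.

intervals = [  # (lower, upper) bounds of stated 90% credible intervals
    (40, 60), (55, 80), (10, 35), (65, 90), (30, 50),
]
realized = [52, 85, 20, 70, 58]  # the values that actually materialized

hits = sum(lo <= x <= hi for (lo, hi), x in zip(intervals, realized))
hit_rate = hits / len(realized)

# A well-calibrated evaluator's 90% intervals should contain the truth roughly
# 90% of the time; a much lower hit rate suggests overconfidence (intervals too
# narrow), while a much higher one suggests underconfidence.
print(f"Intervals containing the realized value: {hits}/{len(realized)} ({hit_rate:.0%})")
```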

"Weightings" for each rating category (removed for now)

2 Oct 2023: We previously suggested 'weightings' for individual ratings, along with the note:

"We give 'suggested weights' as an indication of our priorities and a suggestion for how you might average these together into an overall assessment; but please use your own judgment."

We included these weightings for several reasons:

  • We wanted to make the overall rating better defined, and thus more useful to outsiders and comparable across raters.

  • We wanted to emphasize what we think is important (in particular, methodological reliability).

  • We didn't want evaluators to think we wanted them to weigh each category equally; some are clearly more important.

However, we decided to remove these weightings because:

  1. They added clutter to an already overwhelming form and guidance document; 'more numbers' can be particularly overwhelming.

  2. The weights were ad hoc, and they may have suggested we have a more grounded 'model of value' than we actually do. (There is also some overlap in our categories, something we are working on addressing.)

  3. Some people interpreted them incorrectly (e.g., thinking we were saying 'relevance to global priorities' is not important).
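For concreteness, the kind of weighted averaging the old note described could look like the sketch below. This is purely illustrative; the category names, weights, and ratings are invented, and The Unjournal no longer prescribes any such formula.

```python
# Illustrative sketch of averaging category ratings into an overall score
# using suggested weights. All names and numbers here are invented examples.

weights = {          # hypothetical "suggested weights" (0-5 importance)
    "methods": 5,
    "advancing_knowledge": 4,
    "logic_communication": 3,
    "open_science": 2,
}
ratings = {          # hypothetical 0-100 category ratings from one evaluator
    "methods": 70,
    "advancing_knowledge": 55,
    "logic_communication": 80,
    "open_science": 60,
}

# Weighted mean on the 0-100 scale: categories with larger weights count more.
overall = sum(weights[c] * ratings[c] for c in weights) / sum(weights.values())
print(f"Weighted overall rating: {overall:.1f}")
```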

Adjustments to metrics and guidelines/previous presentations

Oct 2023 update: removed "weightings"

We have removed the suggested weightings for each of these categories. We discuss the rationale at some length here.

Evaluators working before October 2023 saw a previous version of the table, which you can see here.

Dec. 2023: Hiding/de-emphasizing 'confidence Likerts'

We previously gave evaluators two options for expressing their confidence in each rating. Either:

1. The 90% confidence/credible interval (CI) input you see below (now a 'slider' in PubPub V7), or

2. A five-point 'Likert-style' measure of confidence, which we described qualitatively and explained how we would convert into CIs when reporting aggregations.

To make this process less confusing, to encourage careful quantification of uncertainty, and to enable better-justified aggregation of expert judgment, we are de-emphasizing the latter measure.

Still, to accommodate those who may not be familiar or comfortable with stating "90% CIs on their own beliefs," we offer further explanations, and we are providing tools to help evaluators construct these. As a fallback, we will still allow evaluators to give the 1-5 confidence measure, noting the correspondence to CIs, but we discourage this somewhat.

The previous guidelines can be seen below; these may be useful in considering evaluations provided pre-2024.

Pre-October 2023 'ratings with weights' table, provided for reference (no longer in use)

Columns: Category (importance) | Sugg. Wgt.* | Rating (0-100) | 90% CI | Confidence (alternative to CI)

[Illustrative example rows, with star-based confidence levels, are omitted here.]

We had included the note:

"We give the previous weighting scheme in a fold below for reference, particularly for those reading evaluations done before October 2023."

As well as:

"Suggested weighting: 0. Why 0?"

Elsewhere on that page we had noted:

"As noted above, we give suggested weights (0–5) to suggest the importance of each category rating to your overall assessment, given The Unjournal's priorities. But you don't need to, and may not want to, use these weightings precisely."

The weightings were presented once again along with each description in the section "Category explanations: what you are rating".

Pre-2024 ratings and uncertainty elicitation, provided for reference (no longer in use)

Columns: Category (importance) | Rating (0-100) | 90% CI | Confidence (alternative to CI)

[Illustrative example rows are omitted here.]

[FROM PREVIOUS GUIDELINES:]

You may feel comfortable giving your "90% confidence interval," or you may prefer to give a "descriptive rating" of your confidence (from "extremely confident" to "not confident").

Quantify how certain you are about this rating, either giving a 90% confidence/credibility interval or using the scale described below. (We prefer the 90% CI; please don't give both.)

[Previous guidelines] "1–5 dots": Explanation and relation to CIs

5 = Extremely confident, i.e., 90% confidence interval spans +/- 4 points or less

4 = Very confident: 90% confidence interval +/- 8 points or less

3 = Somewhat confident: 90% confidence interval +/- 15 points or less

2 = Not very confident: 90% confidence interval +/- 25 points or less

1 = Not confident: 90% confidence interval spans more than +/- 25 points

[Previous...] Remember, we would like you to give a 90% CI or a confidence rating (1–5 dots), but not both.

[Previous guidelines] Example of confidence dots vs CI

A diagram in the original guidelines illustrated the proposed correspondence between the confidence dots and the credible intervals.
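A minimal sketch of that correspondence, simply restating the dot-to-interval cutoffs listed above (illustrative only, not an Unjournal tool):

```python
# Illustrative mapping from the old 1-5 confidence "dots" to the implied
# 90% CI half-width (in rating points), restating the list above.

def dots_to_halfwidth(dots: int) -> str:
    """Return the 90% CI half-width implied by a 1-5 confidence rating."""
    implied = {5: 4, 4: 8, 3: 15, 2: 25}
    if dots in implied:
        return f"+/- {implied[dots]} points or less"
    if dots == 1:
        return "more than +/- 25 points"
    raise ValueError("confidence dots must be an integer from 1 to 5")

for d in range(5, 0, -1):
    print(d, "dots ->", dots_to_halfwidth(d))
```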

And, for the 'journal tier' scale:

[Previous guidelines]: Reprising the confidence intervals for this new metric

From "five dots" to "one dot":

5 = Extremely confident, i.e., 90% confidence interval spans +/– 4 points or less*

4 = Very confident: 90% confidence interval +/– 8 points or less

3 = Somewhat confident: 90% confidence interval +/– 15 points or less

2 = Not very confident: 90% confidence interval +/– 25 points or less

1 = Not confident: 90% confidence interval spans more than +/– 25 points

Previous 'descriptions of ratings intervals'

[Previous guidelines]: The description below focuses on the "Overall Assessment." Please try to use a similar scale when evaluating the category metrics.

Top ratings (90–100)

95–100: Among the highest quality and most important work you have ever read.

90–100: This work represents a major achievement, making substantial contributions to the field and practice. Such work would/should be weighed very heavily by tenure and promotion committees, and by grantmakers.

For example:

  • Most work in this area in the next ten years will be influenced by this paper.

  • This paper is substantially more rigorous or more insightful than existing work in this area in a way that matters for research and practice.

  • The work makes a major, perhaps decisive contribution to a case for (or against) a policy or philanthropic intervention.

Near-top (75–89) (*)

    This work represents a strong and substantial achievement. It is highly rigorous, relevant, and well-communicated, up to the standards of the strongest work in this area (say, the standards of the top 5% of committed researchers in this field). Such work would/should not be decisive in a tenure/promotion/grant decision alone, but it should make a very solid contribution to such a case.

Middle ratings (40–59, 60–74) (*)

    60–74.9: A very strong, solid, and relevant piece of work. It may have minor flaws or limitations, but overall it is very high-quality, meeting the standards of well-respected research professionals in this field.

    40–59.9: A useful contribution, with major strengths, but also some important flaws or limitations.

Low ratings (5–19, 20–39) (*)

    20–39.9: Some interesting and useful points and some reasonable approaches, but only marginally so. Important flaws and limitations. Would need substantial refocus or changes of direction and/or methods in order to be a useful part of the research and policy discussion.

    5–19.9: Among the lowest quality papers; not making any substantial contribution and containing fatal flaws. The paper may fundamentally address an issue that is not defined or obviously not relevant, or the content may be substantially outside of the authors’ field of expertise.

    0–4: Illegible, fraudulent, or plagiarized. Please flag fraud, and notify us and the relevant authorities.

(*) 20 Mar 2023: We adjusted these ratings to avoid overlap

    The previous categories were 0–5, 5–20, 20–40, 40–60, 60–75, 75–90, and 90–100. Some evaluators found the overlap in this definition confusing.

See also

More reliable, precise, and useful metrics: this page explains the value of the metrics we are seeking from evaluators.

Calibration training tools

The Calibrate Your Judgment app from Clearer Thinking is fairly helpful and fun for practicing and checking how good you are at expressing your uncertainty. It requires creating an account, but that doesn't take long. The 'Confidence Intervals' training seems particularly relevant for our purposes.

People are found [reference needed] to do a more careful job at prediction (and thus perhaps at overall rating too) if the outcome of interest is built up from components that are each judged separately.


    Guidelines for evaluators

    This page describes The Unjournal's evaluation guidelines, considering our priorities and criteria, the metrics we ask for, and how these are considered.

30 July 2024: The guidelines below apply to the evaluation form currently hosted on PubPub. We're adjusting this form somewhat; the new form is temporarily hosted in Coda here (academic stream) and here (applied stream). If you prefer, you are welcome to use the Coda form instead (just let us know).

If you have any doubts about which form to complete or about what we are looking for, please ask the evaluation manager or email contact@unjournal.org.

You can download a PDF version of these guidelines here (updated March 2024).

Please see "For prospective evaluators" for an overview of the evaluation process, as well as details on compensation, public recognition, and more.

What we'd like you to do

1. Write an evaluation of the target paper or project, similar to a standard, high-quality referee report. Please identify the paper's main claims and carefully assess their validity, leveraging your own background and expertise.

2. Give quantitative metrics and predictions, as described below.

3. Answer a short questionnaire about your background and our processes.

Writing the evaluation (aka 'the review')

    In writing your evaluation and providing ratings, please consider the following.

The Unjournal's expectations and criteria

In many ways, the written part of the evaluation should be similar to a report an academic would write for a traditional high-prestige journal (e.g., see some conventional guidelines here). Most fundamentally, we want you to use your expertise to critically assess the main claims made by the authors. Are the claims well-supported? Are the assumptions believable? Are the methods appropriate and well-executed? Explain why or why not.

    However, we'd also like you to pay some consideration to our priorities:

    1. Advancing our knowledge and practice

    2. Justification, reasonableness, validity, and robustness of methods

    3. Logic and communication

    4. Open, communicative, replicable science

See our guidelines below for more details on each of these. Please don't structure your review according to these metrics; just pay some attention to them.

Specific requests for focus or feedback

    Please pay attention to anything our managers and editors specifically asked you to focus on. We may ask you to focus on specific areas of expertise. We may also forward specific feedback requests from authors.

The evaluation will be made public

    Unless you were advised otherwise, this evaluation, including the review and quantitative metrics, will be given a DOI and, hopefully, will enter the public research conversation. Authors will be given two weeks to respond to reviews before the evaluations, ratings, and responses are made public. You can choose whether you want to be identified publicly as an author of the evaluation.

    If you have questions about the authors’ work, you can ask them anonymously: we will facilitate this.

    We want you to evaluate the most recent/relevant version of the paper/project that you can access. If you see a more recent version than the one we shared with you, please let us know.

Publishing evaluations: considerations and exceptions

    We may give early-career researchers the right to veto the publication of very negative evaluations or to embargo the release of these for a defined period. We will inform you in advance if this will be the case for the work you are evaluating.

    You can reserve some "sensitive" content in your report to be shared with only The Unjournal management or only the authors, but we hope to keep this limited.

Target audiences

    We designed this process to balance three considerations with three target audiences. Please consider each of these:

    1. Crafting evaluations and ratings that help researchers and policymakers judge when and how to rely on this research. For Research Users.

    2. Ensuring these evaluations of the papers are comparable to current journal tier metrics, to enable them to be used to determine career advancement and research funding. For Departments, Research Managers, and Funders.

    3. Providing constructive feedback to Authors

We discuss this, and how it relates to our impact and "theory of change", here.

    chevron-right"But isn't The Unjournal mainly just about feedback to authors"?hashtag

    We accept that in the near-term an Unjournal evaluation may not be seen to have substantial career value.

Furthermore, the work we are considering may tend to be at an earlier stage. Authors may submit work to us, thinking of this as a "pre-journal" step. The papers we select (e.g., from NBER) may also have been posted long before the authors planned to submit them to journals.

    This may make the 'feedback for authors' and 'assessment for research users' aspects more important, relative to traditional journals' role. However, in the medium-term, a positive Unjournal evaluation should gain credibility and career value. This should make our evaluations an "endpoint" for a research paper.

Quantitative metrics

We ask for a set of nine quantitative metrics. For each metric, we ask for a score and a 90% credible interval. We describe these in detail below. (We explain why we ask for these metrics here.)

Percentile rankings

For some questions, we ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group; 0% means it is the worst. A score of 50% means this is the median paper, i.e., half of all papers in the reference group do this better and half do this worse.

    Here* the population of papers should be all serious research in the same area that you have encountered in the last three years.

*Unless this work is in our 'applied and policy stream', in which case...

    For the applied and policy stream the reference group should be "all applied and policy research you have read that is aiming at a similar audience, and that has similar goals".

    chevron-right"Serious" research? Academic research? hashtag

    Here, we are mainly considering research done by professional researchers with high levels of training, experience, and familiarity with recent practice, who have time and resources to devote months or years to each such research project or paper. These will typically be written as 'working papers' and presented at academic seminars before being submitted to standard academic journals. Although no credential is required, this typically includes people with PhD degrees (or upper-level PhD students). Most of this sort of research is done by full-time academics (professors, post-docs, academic staff, etc.) with a substantial research remit, as well as research staff at think tanks and research institutions (but there may be important exceptions).

What counts as the "same area"?

    This is a judgment call. Here are some criteria to consider: first, does the work come from the same academic field and research subfield, and does it address questions that might be addressed using similar methods? Secondly, does it deal with the same substantive research question, or a closely related one? If the research you are evaluating is in a very niche topic, the comparison reference group should be expanded to consider work in other areas.

    chevron-right"Research that you have encountered"hashtag

    We are aiming for comparability across evaluators. If you suspect you are particularly exposed to higher-quality work in this category, compared to other likely evaluators, you may want to adjust your reference group downwards. (And of course vice-versa, if you suspect you are particularly exposed to lower-quality work.)
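To make the percentile-ranking definition concrete, here is a small illustrative sketch (not an Unjournal tool; the scores are invented) computing where a paper would sit relative to a hypothetical reference group:

```python
# Illustrative only: percentile ranking as "share of reference-group papers
# that are worse on this criterion". The scores below are invented.

reference_scores = [42, 55, 61, 67, 70, 74, 78, 83, 88, 95]  # hypothetical comparison papers
paper_score = 74  # hypothetical quality of the paper being evaluated

worse = sum(s < paper_score for s in reference_scores)
percentile = 100 * worse / len(reference_scores)

# 50% would mean the median paper; 100% essentially the best in the group.
print(f"Percentile ranking: {percentile:.0f}%")
```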

    Midpoint rating and credible intervals

    For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. Our interface provides slider bars to express your chosen intervals:

See below for more guidance on uncertainty, credible intervals, and the midpoint rating as the 'median of your belief distribution'.

The table below summarizes the quantitative metrics and their scales.

Quantitative metric | Scale
Overall assessment | Percentile ranking, 0-100%
Advancing our knowledge and practice | Percentile ranking, 0-100%
Methods: Justification, reasonableness, validity, robustness | Percentile ranking, 0-100%
Logic and communication | Percentile ranking, 0-100%
Open, collaborative, replicable science | Percentile ranking, 0-100%
Real-world relevance | Percentile ranking, 0-100%
What journal ranking tier should this work be published in? | 0.0-5.0 (with 90% CI: lower, upper)
What journal ranking tier will this work be published in? | 0.0-5.0 (with 90% CI: lower, upper)

    Overall assessment

    Percentile ranking (0-100%)

    Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.

    Claims, strength and characterization of evidence **

    Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?

    Methods: Justification, reasonableness, validity, robustness

    Percentile ranking (0-100%)

    Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable?

    Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?

Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, did they do a strong pre-registration and pre-analysis plan, incorporate multiple hypothesis testing corrections, and report flexible specifications?

    Advancing our knowledge and practice

    Percentile ranking (0-100%)

    To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?

    (Applied stream: please focus on ‘improvements that are actually helpful’.)

Less weight to "originality and cleverness"

Originality and cleverness should be given less weight than at a typical journal, because The Unjournal focuses on impact. Papers that apply existing techniques and frameworks more rigorously than previous work, or apply them to new areas in ways that provide practical insights for global priorities (GP) and interventions, should be highly valued. More weight should be placed on 'contribution to GP' than on 'contribution to the academic field'.

    Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions?

    Does the project add useful value to other impactful research?

    We don't require surprising results; sound and well-presented null results can also be valuable.

    Logic and communication

    Percentile ranking (0-100%)

    Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?

    Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?

    Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims?

    Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?

    Open, collaborative, replicable research

    Percentile ranking (0-100%)

    This covers several considerations:

    Replicability, reproducibility, data integrity

    Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?

    Is the source of the data clear?

Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?

    Consistency

    Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?

    Useful building blocks

    Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?

    Relevance to global priorities, usefulness for practitioners**

Are the paper's chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions?

    Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic?

    Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization?

    Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?

    chevron-rightEarlier category: "Real-world relevance"hashtag

    Real-world relevance

    Percentile ranking (0-100%)

    Are the assumptions and setup realistic and relevant to the real world?

Earlier category: Relevance to global priorities

    Percentile ranking (0-100%)

Could the paper's topic and approach potentially help inform global priorities, cause prioritization, and high-impact interventions?

Does the paper consider the real-world relevance of the arguments and results presented, perhaps engaging policy and implementation questions?

Do the authors communicate their work in ways policymakers and decision-makers can understand, without misleading or oversimplifying?

Do the authors present practical impact quantifications, such as cost-effectiveness analyses? Do they report results that enable such analyses?

    Journal ranking tiers

Note: this is less relevant for work in our Applied Stream

Most work in our applied stream will not be targeting academic journals. Still, in some cases it might make sense to make this comparison; e.g., if particular aspects of the work might be rewritten and submitted to academic journals, or if the work uses certain techniques that might be directly compared to academic work. If you believe a comparison makes sense, please consider giving an assessment below, making reference to our guidelines and how you are interpreting them in this case.

    To help universities and policymakers make sense of our evaluations, we want to benchmark them against how research is currently judged. So, we would like you to assess the paper in terms of journal rankings. We ask for two assessments:

    1. a normative judgment about 'how well the research should publish';

    2. a prediction about where the research will be published.

    Journal ranking tiers are on a 0-5 scale, as follows:

• 0/5: "Won't publish/little to no value." Unlikely to be cited by credible researchers.

• 1/5: OK/Somewhat valuable journal

• 2/5: Marginal B-journal/Decent field journal

• 3/5: Top B-journal/Strong field journal

• 4/5: Marginal A-Journal/Top field journal

• 5/5: A-journal/Top journal

We give some example journal rankings here, based on SJR and ABS ratings.

    We encourage you to consider a non-integer score, e.g. 4.6 or 2.2.

    As before, we ask for a 90% credible interval.

Both journal ranking tier questions below use a 0.0-5.0 scale and ask for a 90% CI.

PubPub note: as of 14 March 2024, the PubPub form does not allow non-integer responses. Until this is fixed, please multiply these values by 10 and enter them using the 0-50 slider. (Or use the Coda form.)

    What journal ranking tier should this work be published in?

    Journal ranking tier (0.0-5.0)

    Assess this paper on the journal ranking scale described above, considering only its merit, giving some weight to the category metrics we discussed above.

    Equivalently, where would this paper be published if:

1. the journal process were fair, unbiased, and free of noise, so that status, social connections, and lobbying to get the paper published didn't matter;

    2. journals assessed research according to the category metrics we discussed above.

    What journal ranking tier will this work be published in?

    Journal ranking tier (0.0-5.0)

What if this work has already been peer reviewed and published?

    If this work has already been published, and you know where, please report the prediction you would have given absent that knowledge.

    The midpoint and 'credible intervals': expressing uncertainty

    What are we looking for and why?

    We want policymakers, researchers, funders, and managers to be able to use The Unjournal's evaluations to update their beliefs and make better decisions. To do this well, they need to weigh multiple evaluations against each other and other sources of information. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty.

But it's hard to quantify statements like "very certain" or "somewhat uncertain"; different people may use the same phrases to mean different things. That's why we're asking you for a more precise measure: your credible intervals. These metrics are particularly useful for meta-science and meta-analysis.
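As one illustration of why quantified uncertainty helps readers aggregate evaluations (this is a hypothetical sketch, not The Unjournal's actual aggregation method), a reader could give less weight to ratings that come with wider credible intervals, for example using inverse-variance-style weights:

```python
# Illustrative only (not The Unjournal's aggregation method): combining two
# evaluators' ratings, giving more weight to the evaluator whose 90% credible
# interval is narrower (i.e., who expresses less uncertainty).

evaluations = [
    {"midpoint": 70, "ci": (60, 80)},  # hypothetical evaluator A
    {"midpoint": 55, "ci": (30, 80)},  # hypothetical evaluator B (more uncertain)
]

def weight(ev):
    width = ev["ci"][1] - ev["ci"][0]
    return 1.0 / width**2  # narrower interval -> larger weight (inverse-variance style)

total = sum(weight(ev) for ev in evaluations)
pooled = sum(weight(ev) * ev["midpoint"] for ev in evaluations) / total
print(f"Uncertainty-weighted pooled rating: {pooled:.1f}")
```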

    You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value. See the fold below for further guidance.

How do I come up with these intervals? (Discussion and guidance)

    You may understand the concepts of uncertainty and credible intervals, but you might be unfamiliar with applying them in a situation like this one.

    You may have a certain best guess for the "Methods..." criterion. Still, even an expert can never be certain. E.g., you may misunderstand some aspect of the paper, there may be a method you are not familiar with, etc.

Your uncertainty over this could be described by some distribution, representing your beliefs about the true value of this criterion. Your "best guess" should be the central mass point of this distribution.

You are also asked to give a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.

For some questions, the "true value" refers to something objective, e.g., will this work be published in a top-ranked journal? In other cases, like the percentile rankings, the true value means "if you had complete evidence, knowledge, and wisdom, what value would you choose?"

For more information on credible intervals, this Wikipedia entry may be helpful.

If you are "well calibrated", your 90% credible intervals should contain the true value 90% of the time.

Consider the midpoint as the 'median of your belief distribution'

    We also ask for the 'midpoint', the center dot on that slider. Essentially, we are asking for the median of your belief distribution. By this we mean the percentile ranking such that you believe "there's a 50% chance that the paper's true rank is higher than this, and a 50% chance that it actually ranks lower than this."
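As a purely illustrative sketch (not an Unjournal tool), the midpoint and the 90% credible interval can be read off as the median and the 5th/95th percentiles of whatever belief distribution you hold; here that distribution is represented by a made-up sample:

```python
# Illustrative only: reading a midpoint (median) and 90% credible interval
# off a belief distribution, here represented by a made-up sample of values.
import statistics

belief_sample = [58, 62, 65, 67, 68, 70, 71, 73, 76, 82]  # hypothetical draws (0-100 scale)

midpoint = statistics.median(belief_sample)             # 50% above, 50% below
quantiles = statistics.quantiles(belief_sample, n=20)   # 5%, 10%, ..., 95% cut points
lower, upper = quantiles[0], quantiles[-1]              # 5th and 95th percentiles

print(f"Midpoint (median): {midpoint}")
print(f"90% credible interval: ({lower:.1f}, {upper:.1f})")
```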

Get better at this by 'calibrating your judgment'

If you are "well calibrated", your 90% credible intervals should contain the true value 90% of the time. To understand this better, assess your ability, and then practice to get better at estimating your confidence in results. This web app will help you get practice at calibrating your judgments. We suggest you choose the "Calibrate Your Judgment" tool and select the "confidence intervals" exercise, choosing 90% confidence. Even a 10- or 20-minute practice session can help, and it's pretty fun.

    Claim identification, assessment, and implications

We are now asking evaluators for "claim identification and assessment" where relevant. This is meant to help practitioners use this research to inform their funding, policymaking, and other decisions. It is not intended as a metric for judging the research quality per se. This is not required, but we will reward this work. See guidelines and examples here.

    Survey questions

    Lastly, we ask evaluators about their background, and for feedback about the process.

Survey questions for evaluators: details

For the two questions below, we will publish your responses unless you specifically ask for them to be kept anonymous.

1. How long have you been in this field?

2. How many proposals and papers have you evaluated? (For journals, grants, and other peer review.)

Answers to the questions below will not be made public:

1. How would you rate this template and process?

2. Do you have any suggestions or questions about this process or The Unjournal? (We will try to respond to your suggestions, and incorporate them into our practice.) [Open response]

3. Would you be willing to consider evaluating a revised version of this project?

    Other guidelines and notes

Note on the evaluation platform (13 Feb 2024)

12 Feb 2024: We are moving to a hosted form/interface in PubPub. That form is still somewhat a work in progress and may need some further guidance; we try to provide this below, but please contact us with any questions. If you prefer, you can also submit your response in a Google Doc and share it back with us. Click here to make a new copy of that document directly.

    Length/time spent: This is up to you. We welcome detail, elaboration, and technical discussion.

Length and time: possible benchmarks

The Econometric Society recommends a 2–3 page referee report; Berk et al. suggest this is relatively short, but confirm that brevity is desirable. In a recent survey (Charness et al., 2022), economists report spending (median and mean) about one day per report, with substantial shares reporting "half a day" and "two days." We expect that reviewers tend to spend more time on papers for high-status journals, and when reviewing work that is closely tied to their own agenda.

Adjustments to earlier metrics; earlier evaluation forms

We have made some adjustments to this page and to our guidelines and processes; this is particularly relevant when considering earlier evaluations. See "Adjustments to metrics and guidelines/previous presentations".

If you still have questions, please contact us, or see our FAQ on Evaluation (refereeing).

Our data protection statement is linked here.
