
Guidelines for evaluators

This page describes The Unjournal's evaluation guidelines, considering our priorities and criteria, the metrics we ask for, and how these are considered.


These guidelines apply to the PubPub evaluation forms, as well as to the publicly visible forms in Coda.

Please see For prospective evaluators for an overview of the evaluation process, as well as details on compensation, public recognition, and more.

What we'd like you to do

  1. Write an evaluation of the research. This largely resembles a high-quality referee report for a traditional journal, without the binary focus on 'should we accept or reject?'. Below, we describe some of our values and emphases. We also value insights for less-technical practitioners, especially in your evaluation 'abstract'.

  2. Provide the quantitative ratings and predictions described below, along with 90% credible intervals.

  3. Please identify the paper's main claims and carefully assess their validity, leveraging your own background and expertise.

  4. Answer a short questionnaire about your background and our processes.

Writing the evaluation (aka 'the review')

In writing your evaluation and providing ratings, please consider the following.

The Unjournal's expectations and criteria

In many ways, the written part of the evaluation should be similar to a report an academic would write for a traditional high-prestige journal (e.g., see the 'Conventional guidelines for referee reports' section). Most fundamentally, we want you to use your expertise to critically assess the main claims made by the authors. Are the claims well-supported? Are the assumptions believable? Are the methods appropriate and well-executed? Explain why or why not.

However, we'd also like you to pay some consideration to our priorities, including

  1. Advancing our knowledge and supporting practitioners

  2. Justification, reasonableness, validity, and robustness of methods

  3. Logic and communication, intellectual modesty, transparent reasoning

  4. Open, communicative, replicable science

Specific requests for focus or feedback

Please pay attention to anything our managers and editors have specifically suggested you focus on. We may ask you to focus on particular areas of your expertise. We may also forward specific feedback requests from the authors.

The evaluation will be made public

Unless you were advised otherwise, this evaluation, including the review and quantitative metrics, will be given a DOI and, hopefully, will enter the public research conversation. Authors will be given two weeks to respond to the evaluations (and evaluators can adjust if any obvious oversights are found) before the evaluations, ratings, and responses are made public. You can choose whether you want to be identified publicly as an author of the evaluation.

If you have questions about the authors’ work, you can ask them anonymously: we will facilitate this.

We want you to evaluate the most recent/relevant version of the paper/project that you can access. If you see a more recent version than the one we shared with you, please let us know.

Publishing evaluations: considerations and exceptions

We may give early-career researchers the right to veto the publication of very negative evaluations or to embargo the release of these for a defined period. We will inform you in advance if this will be the case for the work you are evaluating.

You can reserve some "sensitive" content in your report to be shared with only The Unjournal management or only the authors, but we hope to keep this limited.

Target audiences

We designed this process to balance three considerations with three target audiences. Please consider each of these:

  1. Crafting evaluations and ratings that help researchers and policymakers judge when and how to rely on this research. For Research Users.

  2. Ensuring these evaluations of the papers are comparable to current journal tier metrics, to enable them to be used to determine career advancement and research funding. For Departments, Research Managers, and Funders.

  3. Providing constructive feedback to Authors.

"But isn't The Unjournal mainly just about feedback to authors"?

We accept that in the near-term an Unjournal evaluation may not be seen to have substantial career value.

Furthermore, the work we are considering may tend to be at an earlier stage. Authors may submit work to us, thinking of this as a "pre-journal" step. The papers we select (e.g., from NBER) may also have been posted long before the authors planned to submit them to journals.

This may make the 'feedback for authors' and 'assessment for research users' aspects more important, relative to traditional journals' role. However, in the medium-term, a positive Unjournal evaluation should gain credibility and career value. This should make our evaluations an "endpoint" for a research paper.

Quantitative metrics

Percentile rankings

For some questions, we ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on.

Here* the population of papers should be all serious research in the same area that you have encountered in the last three years.

*Unless this work is in our 'applied and policy stream', in which case...

For the applied and policy stream the reference group should be "all applied and policy research you have read that is aiming at a similar audience, and that has similar goals".

"Serious" research? Academic research?

Here, we are mainly considering research done by professional researchers with high levels of training, experience, and familiarity with recent practice, who have time and resources to devote months or years to each such research project or paper. These will typically be written as 'working papers' and presented at academic seminars before being submitted to standard academic journals. Although no credential is required, this typically includes people with PhD degrees (or upper-level PhD students). Most of this sort of research is done by full-time academics (professors, post-docs, academic staff, etc.) with a substantial research remit, as well as research staff at think tanks and research institutions (but there may be important exceptions).

What counts as the "same area"?

This is a judgment call. Here are some criteria to consider: first, does the work come from the same academic field and research subfield, and does it address questions that might be addressed using similar methods? Secondly, does it deal with the same substantive research question, or a closely related one? If the research you are evaluating is in a very niche topic, the comparison reference group should be expanded to consider work in other areas.

"Research that you have encountered"

We are aiming for comparability across evaluators. If you suspect you are particularly exposed to higher-quality work in this category, compared to other likely evaluators, you may want to adjust your reference group downwards. (And of course vice-versa, if you suspect you are particularly exposed to lower-quality work.)
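As a concrete, purely illustrative way to think about the arithmetic, the sketch below computes a percentile ranking from a hypothetical, subjective 'quality ordering' of comparable papers. The function and the numbers are ours, not part of any Unjournal tooling.

```python
# Hypothetical sketch (not Unjournal tooling): the arithmetic behind a
# percentile ranking relative to an imagined reference group of papers.

def percentile_rank(paper_quality: float, reference_quality: list[float]) -> float:
    """Share of reference-group papers that are worse than this paper, in percent."""
    worse = sum(q < paper_quality for q in reference_quality)
    return 100 * worse / len(reference_quality)

# Toy example: stand-in quality scores for 40 comparable papers you have read.
reference = [float(i) for i in range(40)]
print(percentile_rank(29.5, reference))  # 75.0 -> better than 75% of the reference group
```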

Midpoint rating and credible intervals

For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. Our interface provides slider bars for expressing your chosen intervals.

The table below summarizes the percentile-ranking metrics.

| Quantitative metric | Scale |
| --- | --- |
| Overall assessment | 0 - 100% |
| Claims, strength and characterization of evidence | 0 - 100% |
| Methods: justification, reasonableness, validity, robustness | 0 - 100% |
| Advancing our knowledge and practice | 0 - 100% |
| Logic and communication | 0 - 100% |
| Open, collaborative, replicable research | 0 - 100% |
| Relevance to global priorities, usefulness for practitioners | 0 - 100% |

Overall assessment

Percentile ranking (0-100%)

Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.

Claims, strength and characterization of evidence

Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?

Methods: Justification, reasonableness, validity, robustness

Percentile ranking (0-100%)

Are the methods used well-justified and explained? Are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable?

Are the results and methods likely to be robust to reasonable changes in the underlying assumptions?

Advancing our knowledge and practice

Percentile ranking (0-100%)

To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?

(Applied stream: please focus on ‘improvements that are actually helpful’.)

Less weight on "originality and cleverness"

Originality and cleverness should be weighted less heavily than at a typical journal, because The Unjournal focuses on impact. Papers that apply existing techniques and frameworks more rigorously than previous work, or that apply them to new areas in ways that provide practical insights for global priorities (GP) and interventions, should be highly valued. More weight should be placed on 'contribution to GP' than on 'contribution to the academic field'.

Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions?

Does the project add useful value to other impactful research?

Logic and communication

Percentile ranking (0-100%)

Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?

Is the "? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?

Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims?

Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?

Open, collaborative, replicable research

Percentile ranking (0-100%)

This covers several considerations:

Replicability, reproducibility, data integrity

Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?

Is the source of the data clear?

Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?

Consistency

Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?

Useful building blocks

Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?

Relevance to global priorities, usefulness for practitioners

Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic?

Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization?

Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?

Earlier category: "Real-world relevance"

Real-world relevance

Percentile ranking (0-100%)

Are the assumptions and setup realistic and relevant to the real world?

Do the authors communicate their work in ways policymakers and decision-makers can understand, without misleading or oversimplifying?

Do the authors present practical impact quantifications, such as cost-effectiveness analyses? Do they report results that enable such analyses?

Earlier category: Relevance to global priorities

Percentile ranking (0-100%)

Journal ranking tiers

Note: this is less relevant for work in our Applied Stream

To help universities and policymakers make sense of our evaluations, we want to benchmark them against how research is currently judged. So, we would like you to assess the paper in terms of journal rankings. We ask for two assessments:

  1. a normative judgment about 'how well the research should publish';

  2. a prediction about where the research will be published.

Journal ranking tiers are on a 0-5 scale, as follows:

  • 0/5: "/little to no value". Unlikely to be cited by credible researchers

  • 1/5: OK/Somewhat valuable journal

  • 2/5: Marginal B-journal/Decent field journal

  • 3/5: Top B-journal/Strong field journal

  • 4/5: Marginal A-Journal/Top field journal

  • 5/5: A-journal/Top journal

We encourage you to give non-integer scores, e.g., 4.6 or 2.2.

As before, we ask for a 90% credible interval.

| Journal ranking tiers | Scale | 90% CI |
| --- | --- | --- |
| What journal ranking tier should this work be published in? | 0.0-5.0 | lower, upper |
| What journal ranking tier will this work be published in? | 0.0-5.0 | lower, upper |

PubPub note: as of 14 March 2024, the PubPub form does not allow non-integer responses. Until this is fixed, please note any non-integer scores in the text of your evaluation. (Or use the Coda form.)

What journal ranking tier should this work be published in?

Journal ranking tier (0.0-5.0)

Assess this paper on the journal ranking scale described above, considering only its merits and giving some weight to the category metrics we discussed above.

Equivalently, imagine that:

  1. the journal process were fair, unbiased, and free of noise, and that status, social connections, and lobbying to get the paper published didn't matter; and

  2. journals assessed research according to the category metrics we discussed above.

In that world, in which tier of journal should this work be published?

What journal ranking tier will this work be published in?

Journal ranking tier (0.0-5.0)

What if this work has already been peer reviewed and published?

If this work has already been published, and you know where, please report the prediction you would have given absent that knowledge.

The midpoint and 'credible intervals': expressing uncertainty

What are we looking for and why?

We want policymakers, researchers, funders, and managers to be able to use The Unjournal's evaluations to update their beliefs and make better decisions. To do this well, they need to weigh multiple evaluations against each other and other sources of information. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty.

But it's hard to quantify statements like "very certain" or "somewhat uncertain" – different people may use the same phrases to mean different things. That's why we're asking you for a more precise measure: your credible intervals. These metrics are particularly useful for meta-science and meta-analysis.
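One reason the intervals matter: they let readers down-weight less certain ratings. The sketch below is a minimal illustration of that idea; it assumes (our assumption, not an Unjournal prescription) that each 90% credible interval can be treated as roughly ±1.645 standard deviations of a normal belief distribution.

```python
# Illustrative sketch (one way a reader might use these metrics, not an
# official Unjournal method): pool two evaluators' percentile ratings,
# weighting each by the precision implied by its 90% credible interval.

Z90 = 1.645  # under a normal approximation, a 90% interval spans about +/-1.645 SD

def precision_weight(lower: float, upper: float) -> float:
    """Inverse-variance weight implied by a 90% credible interval."""
    sd = (upper - lower) / (2 * Z90)
    return 1 / sd**2

evaluations = [  # (midpoint, lower, upper) -- hypothetical numbers
    (80, 70, 90),  # a fairly confident evaluator
    (60, 30, 90),  # a much less confident evaluator
]

weights = [precision_weight(lo, hi) for _, lo, hi in evaluations]
pooled = sum(w * mid for w, (mid, _, _) in zip(weights, evaluations)) / sum(weights)
print(round(pooled, 1))  # ~78.0 -- pulled toward the more confident evaluator's 80
```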

You are asked to give a 'midpoint' and a 90% credible interval. Consider the interval as a range that you believe is 90% likely to contain the true value. See the fold below for further guidance.

How do I come up with these intervals? (Discussion and guidance)

You may understand the concepts of uncertainty and credible intervals, but you might be unfamiliar with applying them in a situation like this one.

You may have a particular best guess for the "Methods..." criterion. Still, even an expert can never be certain; for example, you may have misunderstood some aspect of the paper, or there may be a method you are not familiar with.

Your uncertainty over this could be described by some distribution, representing your beliefs about the true value of this criterion. Your "best guess" should be the central mass point of this distribution.

You are also asked to give a 90% credible interval. Consider this as a range that you believe is 90% likely to contain the true value.

For some questions, the "true value" refers to something objective, e.g. will this work be published in a top-ranked journal? In other cases, like the percentile rankings, the true value means "if you had complete evidence, knowledge, and wisdom, what value would you choose?"

Consider the midpoint as the 'median of your belief distribution'

We also ask for the 'midpoint', the center dot on that slider. Essentially, we are asking for the median of your belief distribution. By this we mean the percentile ranking such that you believe "there's a 50% chance that the paper's true rank is higher than this, and a 50% chance that it actually ranks lower than this."
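If it helps to make this concrete, the sketch below (assuming SciPy is available; the Beta distribution and its parameters are purely illustrative) shows how a belief distribution over a paper's true percentile maps to the midpoint and interval we ask for.

```python
# Hypothetical sketch: representing your beliefs about a paper's true percentile
# rank with a Beta distribution, then reading off the midpoint (median) and the
# 90% credible interval. The particular distribution is invented for illustration.
from scipy.stats import beta

belief = beta(a=8, b=3)  # an imagined belief distribution, centred around ~0.74

midpoint = belief.median()               # 50% chance the true value is above/below this
lower, upper = belief.ppf([0.05, 0.95])  # bounds of the 90% credible interval

print(f"midpoint ~ {100 * midpoint:.0f}%")  # roughly 74% for this example
print(f"90% CI ~ [{100 * lower:.0f}%, {100 * upper:.0f}%]")
```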

Get better at this by 'calibrating your judgment'

Claim identification, assessment, and implications

We are now asking evaluators for “claim identification and assessment” where relevant. This is meant to help practitioners use this research to inform their funding, policymaking, and other decisions. It is not intended as a metric for judging research quality per se. This is not required, but we will reward this work.

Survey questions

Lastly, we ask evaluators about their background, and for feedback about the process.

Survey questions for evaluators: details

For the two questions below, we will make your answers public unless you specifically ask for them to be kept anonymous.

  1. How long have you been in this field?

  2. How many proposals and papers have you evaluated? (For journals, grants, and other peer review.)

We also ask the following questions to help us improve our processes:

  1. How would you rate this template and process?

  2. Do you have any suggestions or questions about this process or The Unjournal? (We will try to respond to your suggestions, and incorporate them in our practice.) [Open response]

  3. Would you be willing to consider evaluating a revised version of this project?

Other guidelines and notes

Note on the evaluation platform (13 Feb 2024)

Length/time spent: This is up to you. We welcome detail, elaboration, and technical discussion.

Length and time: possible benchmarks
Adjustments to earlier metrics; earlier evaluation forms

If you still have questions, please contact us, or see our FAQ on Evaluation ('refereeing').

See our detailed guidelines above for more on each of these metrics. You don't need to structure your review according to these metrics, but please pay some attention to them.

For a model of what we are looking for, see examples of Unjournal evaluations that we thought were particularly strong ("Prize winning and commended evaluations").

We discuss this, and how it relates to our impact and "theory of change", elsewhere in this resource.

We ask for a set of nine quantitative metrics. For each metric, we ask for a score and a 90% credible interval. We describe these in detail above. (We explain why we ask for these metrics in "Why these guidelines/metrics?".)

See the section on the midpoint and 'credible intervals' above for more guidance on uncertainty, credible intervals, and the midpoint rating as the 'median of your belief distribution'.

Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting? For example, did they do a strong pre-registration with a pre-analysis plan, incorporate multiple-hypothesis-testing corrections, and report flexible specifications?

Are the paper’s chosen topic and approach relevant to global priorities, cause prioritization, and high-impact interventions?

Could the paper's topic and approach help inform global priorities, cause prioritization, and high-impact interventions?

Most work in our applied stream will not be targeting academic journals. Still, in some cases it might make sense to make this comparison; e.g., if particular aspects of the work might be rewritten and submitted to academic journals, or if the work uses certain techniques that might be directly compared to academic work. If you believe a comparison makes sense, please consider giving an assessment, making reference to our guidelines and how you are interpreting them in this case.

We give some example journal rankings, based on SJR and ABS ratings.

For more information on credible intervals, the Wikipedia entry on "credible interval" may be helpful.

If you are "", your 90% credible intervals should contain the true value 90% of the time.

If you are "", your 90% credible intervals should contain the true value 90% of the time. To understand this better, assess your ability, and then practice to get better at estimating your confidence in results. will help you get practice at calibrating your judgments. We suggest you choose the "Calibrate your Judgment" tool, and select the "confidence intervals" exercise, choosing 90% confidence. Even a 10 or 20 minute practice session can help, and it's pretty fun.


12 Feb 2024: We are moving to a hosted form/interface in PubPub. That form is still somewhat a work in progress and may need some further guidance; we try to provide this on this page, but please contact us with any questions. If you prefer, you can also submit your response in a Google Doc and share it back with us; contact us for a template document you can copy directly.

The Econometric Society recommends a 2–3 page referee report; Berk et al. suggest this is relatively short, but confirm that brevity is desirable. In a recent survey (Charness et al., 2022), economists report spending (median and mean) about one day per report, with substantial shares reporting "half a day" and "two days." We expect that reviewers tend to spend more time on papers for high-status journals, and when reviewing work that is closely tied to their own agenda.

We have made some adjustments to this page and to our guidelines and processes; this is particularly relevant for considering earlier evaluations. See "Adjustments to metrics and guidelines/previous presentations".

Our data protection statement is linked from this page.
