Why these guidelines/metrics?

31 Aug 2023: Our present approach is a "working solution" involving some ad-hoc and intuitive choices. We are re-evaluating the metrics we are asking for as well as the interface and framing. We are gathering some discussion in this linked Gdoc, incorporating feedback from our pilot evaluators and authors. We're also talking to people with expertise as well as considering past practice and other ongoing initiatives. We plan to consolidate that discussion and our consensus and/or conclusions into the present (Gitbook) site.

Why numerical ratings?

Ultimately, we're trying to replace the question of "what tier of journal did a paper get into?" with "how highly was the paper rated?" We believe this is a more valuable metric: it is more fine-grained, less prone to gaming, and less subject to the randomness of things like the availability of journal space in a particular field. See our discussion of "Reshaping academic evaluation: beyond the binary...".
To get to this point, we need to have academia and stakeholders see our reviews as meaningful. We want the evaluations to begin to have some value that is measurable in the way “publication in the AER” is seen to have value. My (David Reinstein's) impression is that previous and ongoing efforts towards journal-independent evaluation tend not to have comparable metrics. Typically, they either have simple tick-boxes (like "This paper used correct statistical methods: yes/no") or enable descriptive evaluation without an overall rating. As we are not a journal, and we don’t accept or reject research, we need another way of assigning value. We are working to determine the best way of doing this through quantitative ratings. We hope to be able to benchmark to "traditional" publication outcomes. Thus, we think it is important to ask for both an overall quality rating and a journal "prediction."

Why these categories?

In addition to the overall assessment, we think it will be valuable to have the papers rated according to several categories. This could be particularly helpful to practitioners who may care about some concerns more than others. It can also be useful to future researchers who might want to focus on reading papers with particular strengths. (Optimistically, it could be useful in meta-analyses, as certain characteristics of papers could be weighted more heavily.) We think the use of categories might also help authors and evaluators themselves get a sense of what we think research priorities should be, and thus how to consider an overall rating.
However, these ideas have been largely ad hoc and based on the impressions of our management team, which consists mainly of a particular set of economists and psychologists. The process is still being developed. Any feedback you have is welcome. For example, are we overemphasizing certain aspects? Are we excluding some important categories?
We are also researching other frameworks, templates, and past practice; we hope to draw from validated, theoretically grounded projects such as RepliCATS.

Why ask for confidence intervals?

In eliciting expert judgment, it is helpful to differentiate the level of confidence in predictions and recommendations. We want to know not only what you believe, but how strongly you hold those beliefs. If you are less certain in one area, we should weigh the information you provide less heavily in updating our beliefs. This may also be particularly useful for practitioners.
Obviously, there are challenges to any approach. Even experts in a quantitative field may struggle to convey their own uncertainty. They may also be inherently "poorly calibrated" (see discussions and tools for calibration training). Some people may often be "confidently wrong": they might state very narrow confidence intervals (or "credible intervals") when the truth, where measurable, routinely falls outside these boundaries. People with greater discrimination may sometimes be underconfident. One would want to consider and potentially correct for poor calibration (although it is not obvious how to do so if we have no "gold standard outcomes" by which to judge reviewers as over- or underconfident).
As a side benefit, this may be interesting for research in and of itself, particularly as The Unjournal grows. We see quantifying one's own uncertainty as a good exercise for academics (and everyone) to engage in: trying to be more precise in our stated confidence and aiming to be well-calibrated.
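To make the idea of calibration concrete, here is a minimal sketch of one common check: the share of realized values that fall inside a rater's stated 90% intervals. It assumes measurable outcomes exist, which (as noted above) may not hold for research evaluations. The function name `interval_coverage` and all numbers are hypothetical and purely illustrative; this is not a procedure The Unjournal uses.

```python
from typing import Sequence, Tuple


def interval_coverage(
    intervals: Sequence[Tuple[float, float]],
    outcomes: Sequence[float],
) -> float:
    """Return the share of outcomes lying inside their (lower, upper) interval."""
    if len(intervals) != len(outcomes):
        raise ValueError("need exactly one outcome per interval")
    if not intervals:
        raise ValueError("need at least one interval")
    hits = sum(
        1 for (lower, upper), value in zip(intervals, outcomes)
        if lower <= value <= upper
    )
    return hits / len(intervals)


# Hypothetical 90% intervals stated by one rater (made-up numbers).
stated_intervals = [(40, 60), (55, 65), (30, 45), (10, 35), (50, 80)]
# Hypothetical realized values for the same five items (also made up).
realized_values = [50, 70, 28, 20, 65]

coverage = interval_coverage(stated_intervals, realized_values)
print(f"Coverage of stated 90% intervals: {coverage:.0%}")
```

A well-calibrated rater's 90% intervals should contain the realized value roughly 90% of the time. Coverage far below that suggests overconfidence (intervals too narrow), while coverage near 100% over many items may suggest underconfidence. With only a handful of items, as here, the estimate is very noisy.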

"Weightings" for each rating category (removed for now)

2 Oct 2023: We previously suggested 'weightings' for individual ratings, along with a note:
We give "suggested weights" as an indication of our priorities and a suggestion for how you might average these together into an overall assessment; but please use your own judgment.
We included these weightings for several reasons:
  • People are found [reference needed] to do a more careful job at prediction (and thus perhaps at overall rating too) if the outcome of interest is built up from components that are each judged separately.
  • We wanted to make the overall rating better defined, and thus more useful to outsiders and more comparable across raters.
  • We wanted to emphasize what we think is important (in particular, methodological reliability).
  • We didn't want evaluators to think we wanted them to weigh each category equally; some are clearly more important.
However, we decided to remove these weightings for the following reasons:
  1. To reduce clutter in an already overwhelming form and guidance doc. 'More numbers' can be particularly overwhelming.
  2. These weights were ad hoc, and they may suggest we have a more grounded 'model of value' than we actually do. (There is also some overlap in our categories, something we are working on addressing.)
  3. Some people misinterpreted our intent (e.g., they thought we were saying 'relevance to global priorities' is not an important thing).

Pre-October 2023 'ratings with weights' table, provided for reference

Category (importance) | Sugg. Wgt.* | Rating (0-100) | 90% CI | Confidence (alternative to CI)
Overall assessment (holistic, most important!)
39, 52
47, 54
45, 55
10, 35
40, 70
We had included the note:
We give the previous weighting scheme in a fold below for reference, particularly for those reading evaluations done before October 2023.
As well as:
Suggested weighting: 0. Why 0?
Elsewhere in that page we had noted:
As noted above, we give suggested weights (0–5) to suggest the importance of each category rating to your overall assessment, given The Unjournal's priorities. But you don't need to, and may not want to, use these weightings precisely.
The weightings were presented once again along with each description in the section "Category explanations: what you are rating".
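For readers interpreting evaluations done before October 2023, the following minimal sketch shows how suggested weights on a 0–5 scale could be used to average category ratings into a single number. The category names, ratings, and weights below are hypothetical placeholders, not the actual pre-October 2023 values, and the function `weighted_overall` is an illustration only. Evaluators were asked to use their own judgment rather than apply any formula mechanically.

```python
from typing import Mapping


def weighted_overall(
    ratings: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Weighted average of category ratings (0-100) using the given weights."""
    missing = [category for category in ratings if category not in weights]
    if missing:
        raise ValueError(f"no weight given for: {missing}")
    total_weight = sum(weights[category] for category in ratings)
    if total_weight == 0:
        raise ValueError("weights sum to zero, so the average is undefined")
    weighted_sum = sum(ratings[c] * weights[c] for c in ratings)
    return weighted_sum / total_weight


# Hypothetical category ratings on the 0-100 scale (made-up numbers).
example_ratings = {"methods": 60, "relevance": 80, "other": 50}
# Hypothetical suggested weights on the 0-5 scale (made-up numbers).
example_weights = {"methods": 5, "relevance": 3, "other": 2}

overall = weighted_overall(example_ratings, example_weights)
print(f"Weighted overall rating: {overall:.1f}")
```

A category with weight 0 drops out of the average entirely, which is one way to read a "suggested weighting: 0": the rating is still reported but does not feed into the combined number.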

See also

More reliable, precise, and useful metrics: This page explains the value of the metrics we are seeking from evaluators.
Calibration training tools
The Calibrate Your Judgment app from Clearer Thinking is fairly helpful and fun for practicing and checking how good you are at expressing your uncertainty. It requires creating an account, but that doesn't take long. The 'Confidence Intervals' training seems particularly relevant for our purposes.