# Why these guidelines/metrics?

{% hint style="info" %}
*31 Aug 2023:* Our present approach is a "working solution" involving some ad-hoc and intuitive choices. We are re-evaluating the metrics we are asking for as well as the interface and framing. We are gathering some discussion [in this linked Gdoc](https://docs.google.com/document/d/1QVA0sCvrcKZLKlXuEwJBHKTBKvtn1ml7adTD-2j_X4g/edit), incorporating feedback from our pilot evaluators and authors. We're also talking to people with expertise as well as considering past practice and other ongoing initiatives. We plan to consolidate that discussion and our consensus and/or conclusions into the present (GitBook) site.
{% endhint %}

## **Why numerical ratings?**

Ultimately, we're trying to replace the question of "what tier of journal did a paper get into?" with "how highly was the paper rated?" We believe this is a more valuable metric. It can be more fine-grained. It should be less prone to gaming. It should also reduce the randomness in the process that arises from things like 'the availability of journal space in a particular field'. See our discussion of [Reshaping academic evaluation: beyond the binary...](/the-unjournal-project-and-communication-space/~/changes/536/benefits-and-features/costs-of-playing-the-publication-game.md).

To get to this point, we need to have academia and stakeholders see our evaluations as meaningful. We want the evaluations to begin to have some value that is measurable in the way “publication in the AER” is seen to have value.&#x20;

While there are some ongoing efforts towards journal-independent evaluation, these [tend not to use comparable metrics](#user-content-fn-1)[^1]. Typically, they either have simple tick-boxes (like "this paper used correct statistical methods: yes/no") or they enable descriptive evaluation without an overall rating.\
\
As we are not a journal, and we don’t accept or reject research, we need another way of assigning value. We are working to determine the best way of doing this through quantitative ratings. We hope to be able to benchmark our evaluations to "traditional" publication outcomes. Thus, we think it is important to ask for both an overall quality rating and a journal ranking tier prediction.

## Why these categories?

In addition to the overall assessment, we think it will be valuable to have the papers rated according to several categories. This could be particularly helpful to practitioners who may care about some concerns more than others. It also can be useful to future researchers who might want to focus on reading papers with particular strengths. It could be useful in meta-analyses, as certain characteristics of papers could be weighed more heavily. We think the use of categories might also be useful to authors and evaluators themselves. It can help them get a sense of what we think research priorities should be, and thus help them consider an overall rating.

However, these ideas have been largely ad-hoc and based on the impressions of our management team (a particular set of mainly economists and psychologists). The process is still being developed. *Any feedback you have is welcome. For example,* *are we overemphasizing certain aspects? Are we excluding some important categories?*

*We are also researching other frameworks, templates, and past practice; we hope to draw from validated, theoretically grounded projects such as* [*RepliCATS*](https://replicats.research.unimelb.edu.au/resources/)*.*

## Why ask for credible intervals?

In eliciting expert judgment, it is helpful to differentiate the *level* of confidence in predictions and recommendations. We want to know not only what you believe, but how strongly held your beliefs are. If you are less certain in one area, we should weigh the information you provide less heavily in updating our beliefs. This may also be particularly useful for practitioners.\
\
Obviously, there are challenges to any approach. Even experts in a quantitative field may struggle to convey their own uncertainty. They may also be inherently "poorly calibrated" (see discussions and tools for [calibration training](https://www.clearerthinking.org/post/2019/10/16/practice-making-accurate-predictions-with-our-new-tool)).  Some people may often be "confidently wrong." They might state very narrow "credible intervals", when the truth—where measurable—routinely falls outside these boundaries. People with greater discrimination may sometimes be *under*confident. One would want to consider and [potentially correct for poor calibration.](#user-content-fn-2)[^2]\
\
As a side benefit, this may be interesting for research [in and of itself](#user-content-fn-3)[^3], particularly as *The Unjournal* grows. We see 'quantifying one's own uncertainty' as a good exercise for academics (and everyone) to engage in.&#x20;
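
To make the calibration point above concrete, here is a minimal sketch of how one might check stated 90% intervals against realized outcomes, in the cases where such outcomes exist. The function name `interval_coverage` and the data are hypothetical illustrations, not part of *The Unjournal*'s process.

```python
from typing import Sequence, Tuple


def interval_coverage(
    intervals: Sequence[Tuple[float, float]],
    outcomes: Sequence[float],
) -> float:
    """Return the share of realized outcomes that fall inside the stated intervals."""
    if len(intervals) != len(outcomes):
        raise ValueError("Need exactly one outcome per interval")
    if not intervals:
        raise ValueError("Need at least one interval")
    hits = sum(
        1 for (lower, upper), outcome in zip(intervals, outcomes)
        if lower <= outcome <= upper
    )
    return hits / len(intervals)


# Hypothetical data: five stated 90% intervals and the values later observed.
stated = [(40, 50), (60, 65), (20, 45), (70, 80), (30, 35)]
realized = [47, 72, 33, 74, 41]

coverage = interval_coverage(stated, realized)
print(f"Empirical coverage: {coverage:.0%} (a well-calibrated 90% interval targets about 90%)")
```

Coverage well below 90% over many predictions would suggest overconfidence (intervals that are too narrow); coverage well above 90% would suggest underconfidence. With only a handful of predictions, as here, the estimate is very noisy.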

## "Weightings" for each rating category (removed for now)

<details>

<summary>Weightings for each ratings category (removed for now)</summary>

2 Oct 2023: We previously suggested 'weightings' for individual ratings, along with the following note:

> We give "suggested weights" as an indication of our priorities and a suggestion for how you might average these together into an overall assessment; but please use your own judgment.

*We included these weightings for several reasons:*&#x20;

* People are found \[reference needed] to do a more careful job at prediction (and thus perhaps at overall rating too) if the outcome of interest is built up from components that are each judged separately.
* We wanted to make the overall rating better defined, and thus more useful to outsiders and more comparable across raters.
* We wanted to emphasize what we think is important (in particular, methodological reliability).
* We didn't want evaluators to think we wanted them to weigh each category equally … some are clearly more important.

*However, we decided to remove these weightings because:*

1. We wanted to reduce clutter in an already overwhelming form and guidance doc. ‘More numbers’ can be particularly overwhelming.
2. These weights were ad-hoc, and they may suggest we have a more grounded ‘model of value’ than we actually do. (There is also some overlap in our categories, something we are working on addressing.)
3. Some people misinterpreted our intent (e.g., they thought we were saying ‘relevance to global priorities’ is not an important thing).

</details>

## Adjustments to metrics and guidelines/previous presentations

<details>

<summary>Oct 2023 update - removed "weightings"</summary>

We have removed suggested weightings for each of these categories. We discuss the rationale at some length [here](/the-unjournal-project-and-communication-space/~/changes/536/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators/why-these-guidelines.md#weightings-for-each-rating-category-removed-for-now).&#x20;

Evaluators working before October 2023 saw a previous version of the table, which you can see [HERE](/the-unjournal-project-and-communication-space/~/changes/536/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators/why-these-guidelines.md#pre-october-2023-ratings-with-weights-table-provided-for-reference).

</details>

<details>

<summary>Dec. 2023: Hiding/de-emphasizing 'confidence Likerts'</summary>

We previously gave evaluators two options for expressing their confidence in each rating:&#x20;

Either:

1. The 90% Confidence/Credible Interval (CI) input you see below (now a 'slider' in PubPub V7) or
2. A five-point 'Likert style' measure of confidence, which we described qualitatively, explaining how we would convert it into CIs when we report aggregations.

To make this process less confusing, to encourage careful quantification of uncertainty, and to enable better-justified aggregation of expert judgment, we are de-emphasizing the latter measure.&#x20;

Still, to accommodate those who may not be familiar with or comfortable stating "90% CIs on their own beliefs" we offer further explanations, and we are providing tools to help evaluators  construct these. As a fallback, we will still allow evaluators to give the 1-5 confidence measure, noting the correspondence to CIs, but we discourage this somewhat.&#x20;

The previous guidelines [can be seen here](/the-unjournal-project-and-communication-space/~/changes/536/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators/why-these-guidelines.md#pre-2024-ratings-and-uncertainty-elicitation-provided-for-reference-no-longer-in-use); these may be useful in considering evaluations provided pre-2024.

</details>

### Pre-October 2023  'ratings with weights' table, provided for reference (no longer in use)

<table><thead><tr><th width="262">Category (importance)</th><th width="112" align="center">Sugg. Wgt.*</th><th width="107" data-type="number">Rating (0-100)</th><th width="115" align="center">90% CI</th><th data-type="rating" data-max="5">Confidence (alternative to CI)</th><th data-hidden></th></tr></thead><tbody><tr><td><a data-mention href="#overall-assessment">#overall-assessment</a>(holistic, most important!)</td><td align="center"></td><td>44</td><td align="center">39, 52</td><td>4</td><td></td></tr><tr><td><a data-mention href="#1.-advancing-our-knowledge-and-practice">#1.-advancing-our-knowledge-and-practice</a></td><td align="center">5</td><td>50</td><td align="center">47, 54</td><td>5</td><td></td></tr><tr><td><a data-mention href="#2.-methods-justification-reasonableness-validity-robustness">#2.-methods-justification-reasonableness-validity-robustness</a></td><td align="center">5</td><td>51</td><td align="center"><em>45, 55</em></td><td>4</td><td></td></tr><tr><td><a data-mention href="#3.-logic-and-communication">#3.-logic-and-communication</a></td><td align="center">4</td><td>20</td><td align="center"><em>10, 35</em></td><td>3</td><td></td></tr><tr><td><a data-mention href="#4.-open-collaborative-replicable-science-and-methods">#4.-open-collaborative-replicable-science-and-methods</a></td><td align="center">3</td><td>60</td><td align="center"><em>40, 70</em></td><td>2</td><td></td></tr><tr><td><a data-mention href="#5.-engaging-with-real-world-impact-quantification-practice-realism-and-relevance">#5.-engaging-with-real-world-impact-quantification-practice-realism-and-relevance</a></td><td align="center">2</td><td>35</td><td align="center"><em>30,46</em></td><td>3</td><td></td></tr><tr><td><a data-mention href="#6.-relevance-to-global-priorities">#6.-relevance-to-global-priorities</a></td><td align="center">0**</td><td>30</td><td align="center">21,65</td><td>1</td><td></td></tr></tbody></table>

*We had included the note:*

> We give the previous weighting scheme in a fold below for reference, particularly for those reading evaluations done before October 2023.

As well as:

> Suggested weighting: 0. [Why 0?](#user-content-fn-4)[^4]

Elsewhere in that page we had noted:

> As noted above, we give suggested weights (0–5) to suggest the importance of each category rating to your overall assessment, given *The Unjournal*'s priorities. [*But you don't need, and may not want to use these weightings precisely.*](#user-content-fn-5)[^5]

The weightings were presented once again along with each description in the section ["Category explanations: what you are rating"](/the-unjournal-project-and-communication-space/~/changes/536/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators.md#category-explanations-what-you-are-rating).
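
For readers of pre-October 2023 evaluations, the following is a rough sketch of what "averaging these together" with the suggested weights could look like. It uses the illustrative ratings and weights from the table above; the dictionary keys and the function name `weighted_average` are hypothetical labels chosen for this example, and evaluators were explicitly free not to follow the weights.

```python
# Illustrative category ratings (0-100) from the pre-October 2023 table above.
ratings = {
    "advancing_knowledge": 50,
    "methods": 51,
    "logic_communication": 20,
    "open_science": 60,
    "real_world_relevance": 35,
    "global_priorities": 30,
}

# Suggested weights (0-5) from the same table.
weights = {
    "advancing_knowledge": 5,
    "methods": 5,
    "logic_communication": 4,
    "open_science": 3,
    "real_world_relevance": 2,
    "global_priorities": 0,
}


def weighted_average(ratings: dict, weights: dict) -> float:
    """Weighted mean of the category ratings; weights need not sum to one."""
    total_weight = sum(weights[name] for name in ratings)
    if total_weight == 0:
        raise ValueError("At least one category needs a positive weight")
    return sum(ratings[name] * weights[name] for name in ratings) / total_weight


overall = weighted_average(ratings, weights)
print(round(overall, 1))  # 43.9
```

The result is close to the illustrative overall assessment of 44 in the table, although the table does not state that the overall figure was computed this way. A category with weight 0 simply drops out of the average.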

### Pre-2024 ratings and uncertainty elicitation, provided for reference (no longer in use)

<table><thead><tr><th width="262">Category (importance)</th><th width="152.01169590643275" data-type="number">Rating (0-100)</th><th width="115" align="center">90% CI</th><th data-type="rating" data-max="5">Confidence (alternative to CI)</th><th data-hidden></th></tr></thead><tbody><tr><td><a data-mention href="#overall-assessment">#overall-assessment</a>(holistic, most important!)</td><td>44</td><td align="center">39, 52</td><td>4</td><td></td></tr><tr><td><a data-mention href="#1.-advancing-our-knowledge-and-practice">#1.-advancing-our-knowledge-and-practice</a></td><td>50</td><td align="center">47, 54</td><td>5</td><td></td></tr><tr><td><a data-mention href="#2.-methods-justification-reasonableness-validity-robustness">#2.-methods-justification-reasonableness-validity-robustness</a></td><td>51</td><td align="center"><em>45, 55</em></td><td>4</td><td></td></tr><tr><td><a data-mention href="#3.-logic-and-communication">#3.-logic-and-communication</a></td><td>20</td><td align="center"><em>10, 35</em></td><td>3</td><td></td></tr><tr><td><a data-mention href="#4.-open-collaborative-replicable-science-and-methods">#4.-open-collaborative-replicable-science-and-methods</a></td><td>60</td><td align="center"><em>40, 70</em></td><td>2</td><td></td></tr><tr><td><a data-mention href="#5.-engaging-with-real-world-impact-quantification-practice-realism-and-relevance">#5.-engaging-with-real-world-impact-quantification-practice-realism-and-relevance</a></td><td>35</td><td align="center"><em>30,46</em></td><td>3</td><td></td></tr><tr><td><a data-mention href="#6.-relevance-to-global-priorities">#6.-relevance-to-global-priorities</a></td><td>30</td><td align="center">21,65</td><td>1</td><td></td></tr></tbody></table>

> \[FROM PREVIOUS GUIDELINES:]&#x20;
>
> You may feel comfortable giving your "90% confidence interval," or you may prefer to give a "descriptive rating" of your confidence (from "extremely confident" to "not confident").
>
> Quantify how certain you are about this rating, either giving a 90% [confidence](https://en.wikipedia.org/wiki/Confidence_interval)/[credibility](https://en.wikipedia.org/wiki/Credible_interval) interval *or* using our [scale described below](#the-confidence-rating). ([*We prefer the 90% CI. Please don't give both.*](#user-content-fn-6)[^6])

<details>

<summary>[Previous guidelines] "1–5 dots": Explanation and relation to CIs</summary>

5 = Extremely confident, i.e., 90% confidence interval spans +/- 4 points or less

4 = Very confident: 90% confidence interval +/- 8 points or less

3 = Somewhat confident: 90% confidence interval +/- 15 points or less

2 = Not very confident: 90% confidence interval, +/- 25 points or less

1 = Not confident: (90% confidence interval +/- more than 25 points)

</details>
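
When reading pre-2024 evaluations, the dots-to-CI correspondence in the fold above can be sketched as a small lookup. One assumption here goes beyond the source: for an asymmetric interval, the sketch uses the larger of the two distances from the rating to the bounds as the "+/-" width. The function name `dots_from_interval` is hypothetical.

```python
# Largest 90% CI half-width (in rating points) allowed for each dot level,
# taken from the previous guidelines. Anything wider than 25 points is 1 dot.
DOT_THRESHOLDS = [(4, 5), (8, 4), (15, 3), (25, 2)]


def dots_from_interval(rating: float, lower: float, upper: float) -> int:
    """Map a rating and its 90% CI bounds to the 1-5 'confidence dots' scale."""
    if not lower <= rating <= upper:
        raise ValueError("The rating should lie within its own interval")
    # Assumption: for asymmetric intervals, use the wider side.
    half_width = max(rating - lower, upper - rating)
    for max_half_width, dots in DOT_THRESHOLDS:
        if half_width <= max_half_width:
            return dots
    return 1


print(dots_from_interval(44, 39, 52))  # 4
print(dots_from_interval(30, 21, 65))  # 1
```

Under this reading, the function reproduces the confidence dots shown in every row of the illustrative table above (for example, a rating of 44 with a CI of 39 to 52 gives 4 dots).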

> \[Previous...] Remember, we would like you to give a 90% CI *or* a confidence rating (1–5 dots), but not both.

<details>

<summary>[Previous guidelines] Example of confidence dots vs CI</summary>

<img src="/files/GmHR48IPwTAYZv5JJJHO" alt="" data-size="original">

The example in the diagram above (click to zoom) illustrates the proposed correspondence.

</details>

And, for the 'journal tier' scale:

<details>

<summary>[Previous guidelines]: Reprising the confidence intervals for this new metric</summary>

**From "five dots" to "one dot":**

**5 = Extremely** confident, i.e., 90% confidence interval spans +/– 4 points or less\*

**4 = Very** confident: 90% confidence interval +/– 8 points or less

**3 = Somewhat** confident: 90% confidence interval +/– 15 points or less

**2 = Not very** confident: 90% confidence interval, +/– 25 points or less

**1 = Not** confident: 90% confidence interval +/– more than 25 points

</details>

#### Previous 'descriptions of ratings intervals'

*\[Previous guidelines]: The description folded below focuses on the "Overall Assessment." Please try to use a similar scale when evaluating the category metrics.*

<details>

<summary>Top ratings (90–100)</summary>

**95–100:** Among the highest quality and most important work you have ever read.

**90–100:** This work represents a major achievement, making substantial contributions to the field and practice. Such work would/should be weighed very heavily by tenure and promotion committees, and grantmakers.

*For example:*

* Most work in this area in the next ten years will be influenced by this paper.
* This paper is substantially more rigorous or more insightful than existing work in this area in a way that matters for research and practice.
* The work makes a major, perhaps decisive contribution to a case for (or against) a policy or philanthropic intervention.

</details>

<details>

<summary>Near-top (75–89) (*)</summary>

This work represents a strong and substantial achievement. It is highly rigorous, relevant, and well-communicated, up to the standards of the strongest work in this area (say, the standards of the top 5% of committed researchers in this field). Such work would/should not be decisive in a tenure/promotion/grant decision alone, but it should make a very solid contribution to such a case.

</details>

<details>

<summary>Middle ratings (40–59, 60–74) (*)</summary>

[**60–74.9**](#user-content-fn-7)[^7]**:** A very strong, solid, and relevant piece of work. It may have minor flaws or limitations, but overall it is very high-quality, meeting the standards of well-respected research professionals in this field.

**40–59.9:** A useful contribution, with major strengths, but also some important flaws or limitations.

</details>

<details>

<summary>Low ratings (5–19, 20–39) (*)</summary>

**20–39.9:** Some interesting and useful points and some reasonable approaches, but only marginally so. Important flaws and limitations. Would need substantial refocus or changes of direction and/or methods in order to be a useful part of the research and policy discussion.

**5–19.9:** Among the lowest quality papers; not making any substantial contribution and containing fatal flaws. The paper may fundamentally address an issue that is not defined or obviously not relevant, or the content may be substantially outside of the authors’ field of expertise.

**0–4:** Illegible, fraudulent, or plagiarized. *Please flag fraud, and notify us and the relevant authorities.*

</details>

<details>

<summary>(*) 20 Mar 2023: We adjusted these ratings to avoid overlap</summary>

The previous categories were 0–5, 5–20, 20–40, 40–60, 60–75, 75–90, and 90–100. Some evaluators found the overlap in this definition confusing.

</details>
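
As a reading aid for older evaluations, the non-overlapping bands described above can be sketched as a lookup from a 0-100 rating to a short label. The labels are abbreviated from the descriptions above, and treating each band as running up to the start of the next one (so that, say, 4.5 falls in the lowest band) is an assumption of this sketch. The name `rating_band` is hypothetical.

```python
from bisect import bisect_right

# Lower bound of each band, in ascending order.
BAND_STARTS = [0, 5, 20, 40, 60, 75, 90]

# Short labels abbreviated from the previous guidelines' band descriptions.
BAND_LABELS = [
    "Illegible, fraudulent, or plagiarized",
    "Among the lowest quality papers",
    "Some useful points, but only marginally so",
    "A useful contribution with some important flaws",
    "A very strong, solid, and relevant piece of work",
    "A strong and substantial achievement",
    "A major achievement",
]


def rating_band(rating: float) -> str:
    """Return the label of the band that contains a 0-100 rating."""
    if not 0 <= rating <= 100:
        raise ValueError("Ratings run from 0 to 100")
    # bisect_right counts how many band starts are <= rating.
    return BAND_LABELS[bisect_right(BAND_STARTS, rating) - 1]


print(rating_band(44))  # A useful contribution with some important flaws
```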

## See also

[Open, reliable, and useful evaluation](/the-unjournal-project-and-communication-space/~/changes/536/benefits-and-features/more-reliable-and-useful-evaluation.md#more-reliable-precise-and-useful-metrics) This page explains the value of the metrics we are seeking from evaluators.

[Unjournal Evaluator Guidelines and Metrics - Discussion space](https://docs.google.com/document/d/1QVA0sCvrcKZLKlXuEwJBHKTBKvtn1ml7adTD-2j_X4g/edit)

<details>

<summary>Calibration training tools</summary>

The [Calibrate Your Judgment app](https://programs.clearerthinking.org/calibrate_your_judgment.html) from Clearer Thinking is fairly helpful and fun for practicing and checking how good you are at expressing your uncertainty. It requires creating an account, but that doesn't take long. The 'Confidence Intervals' training seems particularly relevant for our purposes. \
\
![](/files/kIQ9VLkNL4lAlXahxBgG)

</details>

[^1]: At least this is my (David Reinstein's) impression.

[^2]: However, it is not obvious how to do so if we have no "gold standard outcomes" to judge reviewers as over- or underconfident.

[^3]: See especially Phil Tetlock’s work.

[^4]: For the overall measures we don't want you to consider this; we'd rather be more comparable to traditional publications in this respect. Also note that our management team has already considered this work and evaluated it as relevant to global priorities, before passing it to evaluators. Nonetheless, we would like your informed assessment (and discussion).

[^5]: For example, you might weight categories less where you are more uncertain, or where the category seems less relevant.

[^6]: *Above, we completed both only for illustration purposes. Below, we give a suggested correspondence between these two measures.* &#x20;

[^7]: This previously read "60-75"; we adjusted this because some evaluators found the overlap unclear.

