AI/LLM use in evaluations
The Unjournal's working policy on the use of AI and large language model (LLM) tools in the evaluation process, effective April 2026.
This is an evolving policy; we have consulted with our Management Team but have not formally confirmed all aspects.
Framework
Our goal is relevant, insightful, and correct evaluations that fairly assess research credibility and value, help researchers improve their work, and help practitioners and other researchers use it appropriately. We want evaluators to have access to all tools that facilitate this.
At the same time, we are mindful of other concerns:
continuing to cultivate and demonstrate the value of our evaluation process;
encouraging people to continue to work with us;
capturing credible and independent sources of judgment and reasoning;
being transparent with readers of our evaluations.
What we provide to evaluators
After evaluators have submitted their initial evaluations, we aim to provide (and later share publicly):
Flagship model / deep-research LLM evaluations of the paper
LLM consistency checks of the evaluation against the paper
Evaluators are encouraged to provide feedback on these, and may adjust their evaluations if they are confident the LLM has identified an important misunderstanding — noting any such adjustments.
We will also recommend and try to facilitate other 'bug-checking' tools, such as RegCheck (a preregistration checker).
What evaluators may use
Evaluators may use AI language model tools selectively. Appropriate uses include looking up or clarifying content in the paper (e.g. using NotebookLM as a reference tool), or running extensive checks that are not feasible to do by hand. In the latter case, the AI-assisted material should appear separately (in a dedicated section, link, or footnote), along with at least a sample that has been checked by hand.
Evaluators are encouraged not to use AI tools for overall evaluations or for the ratings and predictions components, except as described above under what we provide.
Transparency requirements
Evaluators must report how AI tools were used, ideally by:
Providing links to their AI chats or explanations
Explicitly identifying any sections of text that were generated by LLMs
Human work requirement
Evaluators are expected to put in at least 8 hours of human work, over and above the processing time of any LLM tools.
Standing by your evaluation
Evaluators must stand by all content and language in their report as their own judgment. They must independently verify and carefully consider any points raised by AI tools.
Related evidence and context
Emerging evidence points toward a complementary role for AI in peer review processes like ours. Biswas et al. (2025), AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot (arXiv:2604.13940), report on a large-scale survey of AAAI-26 authors and program committee members and find that "participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions." The system processed all 22,977 full-paper submissions in under a day using frontier models with tool use and safeguards.
This offers promising evidence for AI-assisted peer review at scale. But we are being cautious: we use AI checks as a complement and supplement to human evaluation, preserving independent expert judgment while adding a consistency check and a source of suggestions.
Also relevant: Johnny Coates' "Best Practices for preprint peer review services in the use of AI".

