by Mads Rydahl (Copy of an article submitted for publication in Learned Publishing)
A fundamental part of the academic publishing process is the “Technical Check”. When an article is submitted to a publisher, the in-house staff run a number of checks, known as “technical checks”, to evaluate the submission and determine whether it is ready to send for peer review, or whether it should be returned to the author for revisions. These checks may range from counting the words in the title or abstract, to ensure the manuscript meets the journal’s submission guidelines, to checking for conflicts of interest or ethical compliance. Other checks may cover language quality, potential plagiarism, and missing figures or references.
UNSILO provides a range of checks to assist publishers with the evaluation of new submissions of draft scholarly articles by researchers. These checks are integrated with the leading manuscript tracking software tools ScholarOne and Editorial Manager. The goal of the machine-based Technical Checks is to reduce time to publication and to ensure the quality of submissions, thereby preventing delays in the publication workflow that might prove more costly later, after the submission has been sent for peer review.
The machine-based checks provided by UNSILO range from simple counts (for example, the number of words in the abstract), to more elaborate checks making use of AI and machine learning (for example, identifying a potential conflict of interest). The goal of the checks is to facilitate human evaluation. The Checks provided by UNSILO alert the journal editor about possible problems with the manuscript, for example, that the figure legends are not in sequence, or that a reference has not been cited in the text. It is important to remember that the UNSILO Technical Checks do not make any changes to a manuscript. The automatic tools provide evidence, such as word counts, or examples of missing or incorrect declarations by the author. The tool may identify closely related published articles that may indicate salami slicing. Nonetheless, all decisions remain with the human editor. In other words, the automated checks facilitate human decision-making, but do not replace the editor. This combination of machine- and human-checking has been described as “human augmentation”.
Of course, the success or failure of the automated tools depends on their identifying problems accurately, which calls for a measured trial of their effectiveness. We carried out a full study of one hundred articles to compare the accuracy of the UNSILO Technical Checks with that of human editors. The UNSILO checks cover more than twenty separate assessments of a submission, but for the purpose of this evaluation we investigated four checks, for each of one hundred submissions. These checks comprised:
A team of experienced human editors at Cactus Communications was asked to record their findings for the above four checks for a random selection of 100 academic submissions during the month of November 2020. The editors carried out this assessment without the use of any machine-based assistance. The result of each manual check was recorded as either PASS or FAIL. Subsequently, the same four checks were performed on the manuscripts using an automated tool, the UNSILO Evaluate Technical Checks. Results were compared, and in the case of disagreement between the human editor and the automated check, the task was resubmitted to a different editor for manual re-checking, and the two human editors’ opinions were resolved to produce a single conclusion.
The results from the human and machine checks were compared using a standard measure of accuracy, the F1 score. “Accuracy” is used here to signify a blend (specifically, the harmonic mean) of the precision and recall of the results. “Precision” measures how many of the flagged problems were real, and “recall” measures how many of the actual problems were found rather than missed. The results are shown in Table One.
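The relationship between these measures can be sketched in a few lines of code. The counts below are placeholders for illustration, not results from the study:

```python
# tp = true positives (real problems correctly flagged)
# fp = false positives (correct manuscripts incorrectly flagged)
# fn = false negatives (real problems that were missed)

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)  # how many flagged problems were real
    recall = tp / (tp + fn)     # how many real problems were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Placeholder counts: 8 problems found, 2 false alarms, 4 problems missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Note that the harmonic mean penalizes imbalance: a check with very high precision but poor recall (or vice versa) still scores a low F1.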
Both humans and machines make errors of precision and recall. However, the most important finding is that while human editors show better precision on most checks, the automated checks have superior recall on every check. In other words, humans miss more errors than machines.
The accuracy of the two systems is similar. For individual checks, the F1 score of the automated solution is usually slightly lower than, but comparable to, that of human editors. The largest difference in performance between humans and the automated system is seen on the Sequential citation of tables check, where human editors achieved a precision of 15%, correctly identifying 4 of the 7 problematic manuscripts while incorrectly failing 23 correct manuscripts. The automated solution found 5 of the 7 problematic manuscripts, but incorrectly failed only 3 manuscripts, resulting in a precision of more than 60%.
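The figures above can be verified directly from the reported counts, assuming that “problematic manuscripts found” are true positives and “incorrectly failed” manuscripts are false positives:

```python
# Sequential citation of tables check: counts as reported in the text.
problematic = 7  # manuscripts that actually had the problem

# Human editors: found 4 of 7, incorrectly failed 23 correct manuscripts.
human_precision = 4 / (4 + 23)   # about 15%
human_recall = 4 / problematic

# Automated check: found 5 of 7, incorrectly failed 3 manuscripts.
auto_precision = 5 / (5 + 3)     # 0.625, i.e. more than 60%
auto_recall = 5 / problematic

print(f"human:   precision={human_precision:.0%}, recall={human_recall:.0%}")
print(f"machine: precision={auto_precision:.0%}, recall={auto_recall:.0%}")
```

The computation confirms both reported precision figures and illustrates the central finding: the machine also recalls more of the real problems (5 of 7 versus 4 of 7).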
Perfection is not achievable, either by humans or by machines. However, many editorial teams assume that human editors perform flawlessly. This study gives clear evidence that even experienced human editors make errors of omission by failing to identify problems with a submitted manuscript. The machine-based checks tested here had considerably fewer errors in recall.
Publishers often assume that machine-based checks will always need to be double-checked to ensure no errors are missed. By restricting the double checking to the potential errors identified by the machine, the human editor can achieve a higher quality of submission without searching for additional errors in the text.
Automatic systems can be adjusted to emphasize precision or recall, and the best results generally come from balancing the two. Here, the automated checks have been deliberately designed to prioritize recall, a trade-off intended to support the automation of compliance checking. Publishers report that their priority is not to miss any problems, even if they have to check a few more documents as a result. Flagging every possible problem minimizes false negatives (cases where the tool incorrectly states that a document is correct). Since the machine finds more potential errors than a human does, the publisher can be confident that an effective strategy is to let the machine identify the potential errors and then to manually check only the issues that have been flagged as potentially problematic.
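The flag-then-check strategy can be sketched as a simple triage step. The manuscript names and flags below are purely illustrative, not data from the study:

```python
# Hypothetical triage workflow: the machine flags suspect manuscripts,
# and the human editor reviews only the flagged ones.

def triage(manuscripts, machine_flags):
    """Split manuscripts into those needing manual review and those
    that pass straight through without human checking."""
    to_review = [m for m, flagged in zip(manuscripts, machine_flags) if flagged]
    passed = [m for m, flagged in zip(manuscripts, machine_flags) if not flagged]
    return to_review, passed

manuscripts = ["ms1", "ms2", "ms3", "ms4", "ms5"]
flags = [True, False, True, False, False]  # illustrative machine output

to_review, passed = triage(manuscripts, flags)
print("manual review:", to_review)   # only the flagged manuscripts
print("pass through:", passed)
```

Under this workflow, the only problems that escape human attention are the machine’s false negatives, which is precisely why the checks are tuned for high recall.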
This means that editors can be confident that manuscripts passing the automated checks are unlikely to have any issues, and that only the flagged manuscripts need to be checked manually.
No editors are willing to accept automation of compliance checking if it comes at the cost of a reduction in quality. Therefore, for publishers to automate technical checks, a low number of false negatives is more important than high precision. Otherwise, all manuscripts will still need to be checked manually, and automation would only have negative effects on efficiency.
If we assume the goal is for humans to spend less time checking manuscripts by hand, then automated tools provide a clear benefit. The present study clearly indicates that automated solutions are much better at detecting potential problems in a manuscript submission than human editors. Combining the superior precision of human editors with the superior recall of automated solutions therefore has the potential to deliver significant improvements in the accuracy of compliance checking and, at the same time, a significant reduction in the resources spent on tedious editorial tasks.
Following this study, UNSILO would be glad to work with any publishers in further tests with their own content to provide statistically valid results for the use of automated checks.