Machine learning based screening of potential paper mill publications in cancer research: methodological and cross sectional study

Baptiste Scancar, Jennifer A Byrne, David Causeur, et al

BMJ 2026; 392 doi: https://doi.org/10.1136/bmj-2025-087581 (Published 30 January 2026) Cite this as: BMJ 2026;392:e087581

Abstract

Objectives To train and validate a machine learning model to distinguish paper mill publications from genuine cancer research articles, and to screen the cancer research literature to assess the prevalence of papers that have textual similarities to paper mill papers.

Design Methodological and cross sectional study applying a BERT (bidirectional encoder representations from transformers) based, text classification model to article titles and abstracts.

Setting Retracted paper mill publications listed in the Retraction Watch database were used for model training. The cancer research corpus was screened by the model using the PubMed database restricted to original cancer research articles published between 1999 and 2024.

Population The model was trained on 2202 retracted paper mill papers and validated on independent data collected by image integrity experts. 2.6 million cancer research papers were screened.

Main outcome measures Classification performance of the model. Prevalence of papers flagged as similar to retracted paper mill publications with 95% confidence intervals and their distribution over time, by country, publisher, cancer type, research area, and within high impact journals (top 10%).

Results The model achieved an accuracy of 0.91. When applied to the cancer research literature, it flagged 261 245 of 2 647 471 papers (9.87%, 95% confidence interval 9.83 to 9.90) and revealed a large increase in flagged papers from 1999 to 2024, both across the entire corpus and in the top 10% of journals by impact factor. More than 170 000 papers affiliated with Chinese institutions were flagged, accounting for 36% of Chinese cancer research articles. Most publishers had published substantial numbers of flagged papers. Flagged papers were overrepresented in fundamental research and in gastric, bone, and liver cancer.
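As a quick arithmetic check on the prevalence figure above, the following minimal sketch recomputes the proportion and a 95% confidence interval from the two counts reported in the abstract. It assumes a simple normal approximation (Wald) interval; the abstract does not state which interval method the authors used, and the function name `wald_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(flagged: int, total: int, level: float = 0.95):
    """Return (proportion, lower, upper) using a normal approximation.

    Assumption: a Wald interval. The abstract does not say which
    method the authors used to compute their confidence interval.
    """
    p = flagged / total
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Counts taken from the Results paragraph of the abstract.
p, lower, upper = wald_interval(261_245, 2_647_471)
print(f"{100 * p:.2f}% (95% CI {100 * lower:.2f} to {100 * upper:.2f})")
```

With a corpus of this size the interval is very narrow, so it reflects sampling precision only and says nothing about how often the classifier itself mislabels a paper.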

Conclusions Paper mills are a large and growing problem in the cancer literature and are not restricted to low impact journals. Collective awareness and action will be crucial to address the problem of paper mill publications.

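Author: dubin98

Posted by dubin98 1 hour ago under the categories News Express and Progress Exchange.
When reposting, please credit: [Paper published in BMJ]: Machine learning based screening of potential paper mill publications in cancer research | Critical Care Medicine Committee of the Chinese Association of Pathophysiology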
