The Signpost


Recent research

At least 80 million inconsistent facts on Wikipedia – can AI help find them?


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


At least 80 million (3.3%) of Wikipedia's facts are inconsistent; LLMs may help find them


A paper titled "Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models",[1] presented earlier this month at the EMNLP conference, examines

inconsistencies, a specific type of factual inaccuracy [on English Wikipedia], and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time.
Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact [...]

In a Twitter thread, the lead author shared his

Takeaways:
- Contradictions are measurable and fixable at scale.
- LLMs aren't ready to fully automate yet (best AUROC 75.1% on WikiCollide) but are effective copilots.

The authors focus specifically on internal inconsistencies, which they define as

contradictory facts within Wikipedia that indicate errors requiring correction through consultation of original sources. In a crowdsourced repository, inconsistencies can arise from outdated information, limited awareness of related content during editing, or simple human error.

They illustrate this notion with an example (not yet corrected on-wiki at the time of writing) drawn from FEVEROUS, a Wikipedia-derived dataset published in 2021 whose rate of inconsistencies was found to be even higher (7.3%):

François de Bourbon-Montpensier was born in 1492 and received the title “duchy-peerage of Châtellerault” in 1515. However, the Wikipedia table [rather, infobox in] “Duke of Châtellerault” incorrectly states that the title was created 23 years earlier.

To support editors in finding such inconsistencies, the authors construct the aforementioned LLM-based

CLAIRE (Corpus-Level Assistant for Inconsistency REcognition), a system for surfacing inconsistencies in large corpora. [...] CLAIRE finds and displays not only candidate contradictions but also disambiguating context and explanations of specialized terminology. It features an interactive interface implemented as a browser extension that surfaces potential inconsistencies to Wikipedia visitors.

(Unfortunately, that browser extension doesn't yet seem to have been released as part of the project's code repository or elsewhere.)

CLAIRE is then used to produce a (manually confirmed) lower-bound estimate of the overall frequency of inconsistent facts on Wikipedia:

Applying CLAIRE to 700 atomic facts uniformly sampled from Wikipedia articles, we identified 44 potentially inconsistent facts, of which 23 were manually confirmed inconsistent. With 99% confidence, we estimate that approximately 3.3% ± 1.7% [1.6%, 5.0%] of all facts in Wikipedia contradict other information in the corpus. This is a lower bound, as CLAIRE may miss inconsistencies [...] Extrapolated to the entire encyclopedia, this corresponds to between 37.6 million and 121.9 million inconsistent facts, [...] underscoring the need for systematic inconsistency detection.
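For readers curious how such an interval arises from the 23-of-700 sample, a minimal sketch (assuming a standard normal approximation for a binomial proportion; the paper may use a different interval method, and small rounding differences in the bracket are expected):

```python
import math

# Point estimate and 99% confidence interval for the proportion of
# inconsistent facts, from 23 confirmed out of 700 sampled.
# (Illustrative normal approximation, not necessarily the paper's method.)
confirmed, sampled = 23, 700
p = confirmed / sampled                  # point estimate, about 3.3%
se = math.sqrt(p * (1 - p) / sampled)    # standard error of the proportion
z = 2.576                                # two-sided 99% normal quantile
lo, hi = p - z * se, p + z * se

print(f"{p:.1%} ± {z * se:.1%}  [{lo:.1%}, {hi:.1%}]")
```

Multiplying the interval endpoints by Wikipedia's total number of atomic facts is what yields the paper's extrapolated range of tens of millions of inconsistent facts.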

The authors then present their own WIKICOLLIDE dataset, consisting of 955 atomic facts drawn from Wikipedia [using a snapshot from November 1, 2024], each manually labeled as either consistent or inconsistent with the corpus. This sample was drawn from a subset of articles (Level 5 Vital Articles) and deliberately biased to prioritize facts more likely to be inconsistent. It is thus not representative of Wikipedia as a whole. However, the paper's classification of the types of inconsistencies present in this corpus should still give an idea of which kinds are most frequent on Wikipedia:

Breakdown of inconsistency types in WIKICOLLIDE validation and test sets (331 inconsistent facts)
Inconsistency type – Description – %
Numerical – Inconsistencies in numerical data, such as quantities, measurements, or percentages – 54.7
  Off-by-One Numerical – Small discrepancy involving a margin of one unit – 23.0
  Clear Numerical – Significant difference that cannot be explained by a margin of one unit – 31.7
Logical – The claim and evidence directly or indirectly contradict each other – 17.5
  Direct Logical – Clear negation or alternative to a unique fact – 14.8
  Indirect Logical – Contradiction inferred or indirectly implied – 2.7
Definition – Different definitions or interpretations for the same term or concept – 10.6
Temporal – Inconsistencies in dates, durations, or event sequences – 7.9
Named Entity – Inconsistencies identifying specific entities (people, organizations, locations) – 6.0
Categorical – Differences in categorizing entities, objects, or concepts – 2.1
Spatial – Inconsistencies in spatial descriptions or geographical information – 1.2
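Note how the breakdown nests: the Off-by-One and Clear subtypes sum to the Numerical share, Direct and Indirect sum to the Logical share, and the seven top-level categories cover all 331 inconsistent facts. A quick sanity check (category names and figures taken from the table above):

```python
# Verify the internal arithmetic of the WIKICOLLIDE breakdown:
# top-level categories should sum to 100%, and each pair of
# subtypes should sum to its parent category's share.
main = {"Numerical": 54.7, "Logical": 17.5, "Definition": 10.6,
        "Temporal": 7.9, "Named Entity": 6.0, "Categorical": 2.1,
        "Spatial": 1.2}
subtypes = {"Numerical": [23.0, 31.7],   # Off-by-One + Clear
            "Logical": [14.8, 2.7]}      # Direct + Indirect

assert round(sum(main.values()), 1) == 100.0
for parent, parts in subtypes.items():
    assert round(sum(parts), 1) == main[parent]
print("breakdown is internally consistent")
```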

See also:

Briefly



Other recent publications


Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia"


From the abstract:[2]

"Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. [...] we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs [...] we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: [ibm.biz/wikicontradict]."

From the paper:

"[...] Wikipedia editors use a wide range of maintenance tags to flag problematic content for improvement. However, these maintenance tags are typically removed when creating Wikipedia datasets for LLM pre-training, which results in content with various quality issues being included in the pre-training process.
In this work, we focus on three tags that indicate content inconsistencies: inconsistent, self-contradictory, and contradict-other. The first two tags denote contradictory statements within the same article, whereas the third tag highlights instances where the content of one article contradicts that of another article. In total, we collect around 1,200 articles that contain these tags [...]


"Factual Inconsistencies in Multilingual Wikipedia Tables"


From the abstract:[3]

"Despite covering the same topics, the different [language] versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content."

From the paper:

"while English provides the most comprehensive coverage in terms of volume, German Wikipedia faces significant data quality challenges despite having substantial content"


"When Collaborative Maintenance Falls Short: The Persistence of Retracted Papers on Wikipedia"


From the abstract:[4]

"We construct a novel dataset that integrates Wikipedia revision histories with metadata from Retraction Watch, Crossref, Altmetric, and OpenAlex, identifying 1,181 citations of retracted papers. We find that 71.6% of all citations analyzed are problematic. These are citations added before a paper's retraction, as well as the citations introduced after retraction without any in-text mention of the paper's retracted status. Our analysis reveals that these citations persist for a median of over 3.68 years (1,344 days). Through survival analysis, we find that signals of human attention are associated with a faster correction process. Unfortunately, a paper's established scholarly authority, a higher academic citation count, is associated with a slower time to correction."

From the "Discussion" section:

"A key consideration is the role of automated tools, such as RetractionBot [25]. This bot exemplifies the specialized roles that automated agents play in Wikipedia’s quality control ecosystem [66]. It primarily serves an editorial audience. By systematically adding a prominent template to the reference section, the bot is highly effective at its specific task of signaling a source’s retracted status to editors engaged in verification and maintenance. [...] However, our work highlights a persistent gap between the effectiveness of automation for these specific, often editor-facing tasks and the challenges of repairing more nuanced, epistemic issues for a general reader. This distinction is key: while a bot can efficiently apply a “technical flag,” this action is distinct from the substantive, contextual repair required to update an article’s main text."

See also a related recent blog post by Egon Willighagen: "Retracted articles cited in Wikipedia"


References

  1. ^ Semnani, Sina; Burapacheep, Jirayu; Khatua, Arpandeep; Atchariyachanvanit, Thanawan; Wang, Zheng; Lam, Monica (November 2025). "Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models". In Christos Christodoulopoulos; Tanmoy Chakraborty; Carolyn Rose; Violet Peng (eds.). Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. EMNLP 2025. Suzhou, China: Association for Computational Linguistics. pp. 34827–34854. doi:10.18653/v1/2025.emnlp-main.1765. ISBN 9798891763326. / Data and code
  2. ^ Hou, Yufang; Pascale, Alessandra; Carnerero-Cano, Javier; Tchrakian, Tigran; Marinescu, Radu; Daly, Elizabeth; Padhi, Inkit; Sattigeri, Prasanna (2024-06-19), WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia, arXiv, doi:10.48550/arXiv.2406.13805
  3. ^ Cappa, Silvia; Kong, Lingxiao; Peet, Pille-Riin; Wei, Fanfu; Zhou, Yuchen; Kalo, Jan-Christoph (2025-07-24), Factual Inconsistencies in Multilingual Wikipedia Tables, arXiv, doi:10.48550/arXiv.2507.18406
  4. ^ Shi, Haohan; Yu, Yulin; Romero, Daniel M.; Horvát, Emőke-Ágnes (2025-09-24), When Collaborative Maintenance Falls Short: The Persistence of Retracted Papers on Wikipedia, arXiv, doi:10.48550/arXiv.2509.18403

