The Signpost

[File:2012-01-01_Orta_No_Way_Out.jpg, photo by Blackcat, CC BY-SA 3.0]
Recent research

At least 80 million inconsistent facts on Wikipedia – can AI help find them?


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


At least 80 million (3.3%) of Wikipedia's facts are inconsistent; LLMs may help find them

A paper titled "Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models",[1] presented earlier this month at the EMNLP conference, examines

inconsistencies, a specific type of factual inaccuracy [on English Wikipedia], and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time.
Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact [...]

In a Twitter thread, the lead author shared his

Takeaways:
- Contradictions are measurable and fixable at scale.
- LLMs aren't ready to fully automate yet (best AUROC 75.1% on WikiCollide) but are effective copilots.

The authors focus specifically on internal inconsistencies, which they define as

contradictory facts within Wikipedia that indicate errors requiring correction through consultation of original sources. In a crowdsourced repository, inconsistencies can arise from outdated information, limited awareness of related content during editing, or simple human error.

They illustrate this notion with an example (not yet corrected on-wiki at the time of writing) drawn from FEVEROUS, a Wikipedia-derived dataset published in 2021 whose rate of inconsistencies was found to be even higher (7.3%):

François de Bourbon-Montpensier was born in 1492 and received the title “duchy-peerage of Châtellerault” in 1515. However, the Wikipedia table [rather, infobox in] “Duke of Châtellerault” incorrectly states that the title was created 23 years earlier.

To support editors in finding such inconsistencies, the authors construct the aforementioned LLM-based

CLAIRE (Corpus-Level Assistant for Inconsistency REcognition), a system for surfacing inconsistencies in large corpora. [...] CLAIRE finds and displays not only candidate contradictions but also disambiguating context and explanations of specialized terminology. It features an interactive interface implemented as a browser extension that surfaces potential inconsistencies to Wikipedia visitors.

(Unfortunately, that browser extension doesn't yet seem to have been released as part of the project's code repository or elsewhere.)
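
For readers wondering what such an agentic retrieve-then-reason loop might look like, here is a minimal sketch of the general pattern the paper describes. It is not the authors' released implementation; search_passages and ask_llm are hypothetical stand-ins for a retrieval backend and an LLM API.

```python
from typing import Callable

def find_candidate_contradictions(
    claim: str,
    search_passages: Callable[[str], list[dict]],  # hypothetical retrieval backend
    ask_llm: Callable[[str], str],                 # hypothetical LLM API call
) -> list[dict]:
    """Surface passages an LLM judges to contradict `claim`, with explanations."""
    candidates = []
    for passage in search_passages(claim):
        verdict = ask_llm(
            "Do these two statements contradict each other? Answer CONTRADICT or "
            "CONSISTENT, then briefly explain any specialized terminology.\n"
            f"Statement A: {claim}\nStatement B: {passage['text']}"
        )
        if verdict.strip().upper().startswith("CONTRADICT"):
            candidates.append({
                "claim": claim,
                "evidence": passage["text"],
                "source_article": passage.get("title"),
                "explanation": verdict,  # context shown to the human reviewer
            })
    return candidates
```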

CLAIRE is then used to derive a (manually confirmed) lower-bound estimate of the overall frequency of inconsistent facts on Wikipedia:

Applying CLAIRE to 700 atomic facts uniformly sampled from Wikipedia articles, we identified 44 potentially inconsistent facts, of which 23 were manually confirmed inconsistent. With 99% confidence, we estimate that approximately 3.3% ± 1.7%[1.6%, 5.0%] of all facts in Wikipedia contradict other information in the corpus. This is a lower bound, as CLAIRE may miss inconsistencies [...] Extrapolated to the entire encyclopedia, this corresponds to between 37.6 million and 121.9 million inconsistent facts,[...] underscoring the need for systematic inconsistency detection.
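
The quoted figures can be reproduced with a standard normal approximation to the binomial confidence interval. The sketch below assumes a corpus of roughly 2.43 billion atomic facts, a total inferred from the quoted extrapolation rather than stated in the excerpt.

```python
import math

# Numbers quoted above: 23 of 700 uniformly sampled facts were confirmed inconsistent.
confirmed, sample = 23, 700
p = confirmed / sample                        # ~0.033 -> "at least 3.3%"
z = 2.576                                     # two-sided 99% normal quantile
margin = z * math.sqrt(p * (1 - p) / sample)  # ~0.017 -> "± 1.7%"
lower, upper = p - margin, p + margin         # ~[1.6%, 5.0%]

# Corpus size is an assumption inferred from the quoted 37.6M to 121.9M range;
# the excerpt itself does not state the total number of atomic facts.
total_facts = 2.428e9
print(f"rate: {p:.1%} ± {margin:.1%} -> [{lower:.1%}, {upper:.1%}]")
print(f"extrapolated: {lower * total_facts / 1e6:.1f}M to {upper * total_facts / 1e6:.1f}M inconsistent facts")
```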

The authors then present their own WIKICOLLIDE dataset, consisting of 955 atomic facts drawn from Wikipedia [using a snapshot from November 1, 2024], each manually labeled as either consistent or inconsistent with the corpus. This sample was drawn from a subset of articles (Level 5 Vital Articles) and deliberately biased to prioritize facts more likely to be inconsistent. It is thus not representative of Wikipedia as a whole. However, the paper's classification of the types of inconsistencies present in this corpus should still give an idea of which are most frequent on Wikipedia:

Breakdown of inconsistency types in WIKICOLLIDE validation and test sets (331 inconsistent facts)
Numerical (54.7%): Inconsistencies in numerical data, such as quantities, measurements, or percentages
    Off-by-One Numerical (23.0%): Small discrepancy involving a margin of one unit
    Clear Numerical (31.7%): Significant difference that cannot be explained by a margin of one unit
Logical (17.5%): The claim and evidence directly or indirectly contradict each other
    Direct Logical (14.8%): Clear negation or alternative to a unique fact
    Indirect Logical (2.7%): Contradiction inferred or indirectly implied
Definition (10.6%): Different definitions or interpretations for the same term or concept
Temporal (7.9%): Inconsistencies in dates, durations, or event sequences
Named Entity (6.0%): Inconsistencies identifying specific entities (people, organizations, locations)
Categorical (2.1%): Differences in categorizing entities, objects, or concepts
Spatial (1.2%): Inconsistencies in spatial descriptions or geographical information


Briefly


Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia"

From the abstract:[2]

"Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information.[...] we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs [...] we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: [ibm.biz/wikicontradict ]."

From the paper:

"[...] Wikipedia editors use a wide range of maintenance tags to flag problematic content for improvement. However, these maintenance tags are typically removed when creating Wikipedia datasets for LLM pre-training, which results in content with various quality issues being included in the pre-training process.
In this work, we focus on three tags that indicate content inconsistencies: inconsistent, self-contradictory, and contradict-other. The first two tags denote contradictory statements within the same article, whereas the third tag highlights instances where the content of one article contradicts that of another article. In total, we collect around 1,200 articles that contain these tags [...]
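
As an illustration of how such tagged articles might be located in a wikitext dump, here is a minimal sketch (not the authors' collection code); the template spellings are an assumption based on English Wikipedia conventions.

```python
import re

# Template names as they appear in wikitext, e.g. {{Inconsistent|date=...}};
# these spellings are an assumption based on English Wikipedia conventions.
INCONSISTENCY_TAGS = re.compile(
    r"\{\{\s*(inconsistent|self-contradictory|contradicts?[ _-]other)\b",
    re.IGNORECASE,
)

def has_inconsistency_tag(wikitext: str) -> bool:
    """Return True if an article's raw wikitext carries one of the three tags."""
    return bool(INCONSISTENCY_TAGS.search(wikitext))

print(has_inconsistency_tag("Some claim.{{Self-contradictory|date=June 2024}}"))  # True
```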


"Factual Inconsistencies in Multilingual Wikipedia Tables"

From the abstract:[3]

"Despite covering the same topics, the different [language] versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content."

From the paper:

"while English provides the most comprehensive coverage in terms of volume, German Wikipedia faces significant data quality challenges despite having substantial content"


"When Collaborative Maintenance Falls Short: The Persistence of Retracted Papers on Wikipedia"

From the abstract:[4]

"We construct a novel dataset that integrates Wikipedia revision histories with metadata from Retraction Watch, Crossref, Altmetric, and OpenAlex, identifying 1,181 citations of retracted papers. We find that 71.6% of all citations analyzed are problematic. These are citations added before a paper's retraction, as well as the citations introduced after retraction without any in-text mention of the paper's retracted status. Our analysis reveals that these citations persist for a median of over 3.68 years (1,344 days). Through survival analysis, we find that signals of human attention are associated with a faster correction process. Unfortunately, a paper's established scholarly authority, a higher academic citation count, is associated with a slower time to correction."

From the "Discussion" section:

"A key consideration is the role of automated tools, such as RetractionBot [25]. This bot exemplifies the specialized roles that automated agents play in Wikipedia’s quality control ecosystem [66]. It primarily serves an editorial audience. By systematically adding a prominent template to the reference section, the bot is highly effective at its specific task of signaling a source’s retracted status to editors engaged in verification and maintenance. [...] However, our work highlights a persistent gap between the effectiveness of automation for these specific, often editor-facing tasks and the challenges of repairing more nuanced, epistemic issues for a general reader. This distinction is key: while a bot can efficiently apply a “technical flag,” this action is distinct from the substantive, contextual repair required to update an article’s main text."

See also a related recent blog post by Egon Willighagen: "Retracted articles cited in Wikipedia"

"Automatically Estimating the Trustworthiness of Wikipedia Articles"

From the abstract:[5]

"We present a model to assess the trustworthiness of external sources based on manually annotated [English] Wikipedia articles. To do so, we analyze how often an external source was referenced in Wikipedia articles in which either a problem with reliability was identified or a previously identified problem was solved. From the frequency of the respective occurrences, we aim to draw conclusions about a positive or negative influence of the source on the trustworthiness of new Wikipedia articles. For this, we use the external sources referenced in a Wikipedia article to predict whether the article contains a reliability issue or not. First experiments show that our model is not able to reliably assess the trustworthiness of Wikipedia articles yet."

"Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset"

From the abstract:[6]

"Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it."

The authors evaluated GPT-4o on their benchmark, finding that it "performs reasonably well, especially on frequent names" in adding diacritics missing on Arabic Wikipedia, but "struggles with rarer entries and variant mappings."


"Reading between the lines with topic models and machine learning: Islam’s representation on Wikipedia"

From the abstract:[7]

"[...] we first construct a representative dataset on Islam using Wikipedia articles. Afterwards, we apply several topic modelling and machine learning based approaches on the newly constructed dataset to find representation of Islam on Wikipedia. Also, we design two algorithms based on word2vec to find the inter topic similarity and intra topic similarity for the topic models. The intra topic similarity algorithm agrees well with human judgment of topic resolution and coherence of topics. As topic models find the dominant topics prevailing in a natural language document corpus, the intra topic similarity algorithm can be used as a new metric to find the coherence of single topics within the topic model."

References

  1. ^ Semnani, Sina; Burapacheep, Jirayu; Khatua, Arpandeep; Atchariyachanvanit, Thanawan; Wang, Zheng; Lam, Monica (November 2025). "Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models". In Christos Christodoulopoulos; Tanmoy Chakraborty; Carolyn Rose; Violet Peng (eds.). Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. EMNLP 2025. Suzhou, China: Association for Computational Linguistics. pp. 34827–34854. doi:10.18653/v1/2025.emnlp-main.1765. ISBN 9798891763326. / Data and code
  2. ^ Hou, Yufang; Pascale, Alessandra; Carnerero-Cano, Javier; Tchrakian, Tigran; Marinescu, Radu; Daly, Elizabeth; Padhi, Inkit; Sattigeri, Prasanna (2024-06-19). "WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia". arXiv:2406.13805 [cs].
  3. ^ Cappa, Silvia; Kong, Lingxiao; Peet, Pille-Riin; Wei, Fanfu; Zhou, Yuchen; Kalo, Jan-Christoph (2025-07-24). "Factual Inconsistencies in Multilingual Wikipedia Tables". arXiv:2507.18406 [cs].
  4. ^ Shi, Haohan; Yu, Yulin; Romero, Daniel M.; Horvát, Emőke-Ágnes (2025-09-24). "When Collaborative Maintenance Falls Short: The Persistence of Retracted Papers on Wikipedia". arXiv:2509.18403 [cs].
  5. ^ Grumbach, Luca-Philipp (2025-02-21). Automatically Estimating the Trustworthiness of Wikipedia Articles (PDF) (bachelor thesis). Friedrich-Schiller-Universität Jena. / Presentation slides
  6. ^ Bondok, Rawan; Nassar, Mayar; Khalifa, Salam; Micallef, Kurt; Habash, Nizar (2025-06-23). "Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset". arXiv:2505.02656 [cs].
  7. ^ Khan, Sazid Zaman; As-ad, Jamil; Khaliluzzaman, Md; Anwar, Toni; Islam, Rashedul (2025-08-18). "Reading between the lines with topic models and machine learning: Islam's representation on Wikipedia". Journal of Computational Social Science. 8 (4): 89. doi:10.1007/s42001-025-00415-6. ISSN 2432-2725. (closed access)



Discuss this story

These comments are automatically transcluded from this article's talk page. To follow comments, add the page to your watchlist. If your comment has not appeared here, you can try purging the cache.
  • Call me a Luddite, but I cannot trust the inconsistency generator to tell me what is and isn't inconsistent. Any suggestion to introduce AI tools to Wikipedia, I will contest; even if they require manual review. Drunk Experiter (talk · contribs · she/her) 08:44, 1 December 2025 (UTC)[reply]
    Cluebot NG and WP:ORES are both heavily integrated into Wikipedia already. Thebiguglyalien (talk) 🛸 19:02, 1 December 2025 (UTC)[reply]
    Do tell: are those what Drunk Experiter meant by "AI" and "inconsistency generator"? LightNightLights (talkcontribs) 13:33, 2 December 2025 (UTC)[reply]
  • Interesting. I'm working on a somewhat related side project to create a knowledge graph of the entire Wikipedia lichenology corpus. The lichenology knowledge graph (KG) is the canonical representation of "atomic facts", and Wikipedia presents the noisy realisations of that graph in prose form. Every time text is created or updated, an LLM+parser layer extracts structured semantic triples from the prose. Those triples are then matched against the existing lichen KG. Where they agree, nothing to do. Where they introduce a genuinely new fact, the KG gets a candidate update (pending human review). Where they conflict with existing triples, they get routed into an "inconsistency queue" very much in the CLAIRE/WIKICOLLIDE spirit, but now with strong domain-specific semantics rather than generic Wikipedia-wide heuristics. The KG is what lets you be much more precise than generic "3.3% of facts contradict something somewhere". One can build lichen-specific constraint layers: (1) taxonomic constraints (a species should not simultaneously belong to two different accepted families at the same time slice; a genus must have one type species; certain combinations of rank+name+authority must be unique); (2) nomenclatural constraints (authority–year patterns that are allowed vs implausible; synonym relationships that should be one-way, not circular); (3) biogeographic/eco constraints that are at least suspicious ("endemic to Tasmania" vs "widespread in boreal North America" without any explanation); and (4) chemistry constraints (e.g. one article saying "contains only atranorin and fumarprotocetraric acid" while another, or another section, says "lacks fumarprotocetraric acid and instead has protocetraric acid"). Many of these can be encoded as SHACL-style rules or hand-rolled checks on the graph; the text layer only needs to be good enough at mapping sentences to those schema slots. Esculenta (talk) 15:49, 1 December 2025 (UTC)[reply]
  • Sometimes contradictions are inherent in the sources. I have often noted in articles how reliable sources differ on details. Any editor or bot trying to eliminate contradictions needs to inspect all the sources cited across the relevant articles. - Donald Albury 16:20, 5 December 2025 (UTC)[reply]
    Good point but a few false positives don't make it useless – it could be used for suggestions so that users check it. Contradictions within the same article probably are more likely to be false positives than those between articles. Moreover, often it may mean that the relevant part(s) would do well with some editing, it's not just about replacing, correcting or removing things but also about e.g. 'X says Y and Z suggests W' instead of having one part or version just say Y and the other part or version W (without this context/…). Prototyperspective (talk) 15:16, 6 December 2025 (UTC)
  • To me, the inconsistency between language versions of tables suggest that they ought to be done by Wikidata. Set up some kind of bot that converts each tabular line into a WD Q-item, and each column into a P-property. Make the tables read those items. Presumably the differences will come out in an error condition to be flagged automatically and corrected manually. Jim.henderson (talk) 03:56, 9 December 2025 (UTC)[reply]
    Several issues with this: 1) most contradictions and inconsistencies are not in tables (and infoboxes btw!) but in prose and categories etc
    2) Wikidata is external to Wikipedia so the users watching the page can't really track changes to what is ultimately displaying on the page
    3) Wikidata items are watched by far fewer people
    4) it is already possible to dynamically generate and update tables based on Wikidata contents using the listeria bot (underrated tool which needs at least 1 more dev) but tables using that bot if they get dynamically-updated by it are not allowed on EN Wikipedia (probably because of 2 and 3).
    I think a tool that enables importing more data in Wikipedia to Wikidata would be quite useful and it could be used to identify differences between WD & WP to flag potential errors. You may be interested in thread WD:How can Wikidata be useful IRL if it has less data than Wikipedia?. Currently afaik there only is a tool to import data from templates, Harvest Templates, and the InfoboxExport gadget. Prototyperspective (talk) 11:58, 9 December 2025 (UTC)[reply]
    Hmm, yes there has to be a way to watch changes in what WD is loaded into WP tables. It's part of the general problem that Wikidata is a project of great potential that will only be fully unleashed after a great amount of work is done. Jim.henderson (talk) 10:54, 16 December 2025 (UTC)[reply]
    Well one can watch the changes of tables fed from Wikidata: set up the listeriabot in your sandbox or a user-page or the talk page and then copy the output over into the Wikipedia article to update the table which means the changes show up in the Watchlist (and should be checked before saving). One would also see the changes if one configures the bot to update the table directly but that's not allowed in English Wikipedia. I meant that one can't see the changes to the displayed properties of items as they're made and mainly that dynamic display of Wikidata data in infoboxes or short descriptions or categories would have this problem. Theoretically, for various cases one could have again the listeriabot update a table somewhere where the page can be used as some sort of Watchlist of changes to items (certain properties thereof).
    I think the main work to be done is bot/batch/bulk imports of massive datasets such as IMDb ratings, data on documentaries, books data, food nutrients, etc. An issue with the tables thing is that Wikipedia often has more data than Wikidata – it would need some smart tool that enables easy imports of parsed data from Wikipedia tables etc which would then enable using the WD data which is a challenge in itself as well. Prototyperspective (talk) 13:43, 16 December 2025 (UTC)[reply]