A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A new paper titled "The Rise of AI-Generated Content in Wikipedia"[1] estimates
"that 4.36% of 2,909 English Wikipedia articles created in August 2024 contain significant AI-generated content".
In more detail, the authors used two existing AI detectors (GPTZero and Binoculars), which
"reveal a marked increase in AI-generated content in recent[ly created] pages compared to those from before the release of GPT-3.5 [in March 2022]. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian articles. Flagged Wikipedia articles are typically of lower quality and are often self-promotional or partial towards a specific viewpoint on controversial topics."
The researchers also conducted a small qualitative analysis "to better understand the motivations for using LLMs to create Wikipedia pages", by manually inspecting the edit histories of a smaller subset, namely "the 45 English articles flagged as AI-generated by both GPTZero and Binoculars" (corresponding to 1.5% of those 2,909), and looking at the other contributions of the editors who created them. In this rather small sample, they identify four different such motivations:
- Advertisement
One prominent motive is self-promotion. Of the 45 flagged pages, we identify eight that were created to promote organizations such as small businesses, restaurants, or websites. [...]
- Pages Pushing Polarization
[...] we also identify pages that advocate a particular viewpoint on often polarizing political topics. We identify eight such pages out of the flagged 45. One user created five articles on English Wikipedia, detected by both tools as AI-generated, on contentious moments in Albanian history [see also figure 2, below ...] In other cases, users created articles ostensibly on one topic, such as types of weapons or political movements, but dedicated the majority of the pages’ content to discussing specific political figures. We find two such articles that espouse non-neutral views on JD Vance and Volodymyr Zelensky.
- Machine Translation
[...] We find three cases where users explicitly documented their work as translations, including pages on Portuguese history and legal cases in Ghana. [...]
- Writing Tool
Other pages, which are often well-structured with high-quality citations, seem to have been written by users who are knowledgeable in certain niches and are employing an LLM as a writing tool. Several of the flagged pages are created by users who churn out dozens of articles within specific categories, including snake breeds, types of fungi, Indian cuisine, and American football players. [...]
These are among the first research results providing a quantitative answer to an important question that Wikipedia's editing community and the Wikimedia Foundation have been weighing since at least the release of ChatGPT almost two years ago. (Cf. previous Signpost coverage: Community rejects proposal to create policy about large language models, "AI is not playing games anymore. Is Wikipedia ready?", and in this issue: "Keeping AI at bay – with a little help from volunteers", summarizing media coverage of WikiProject AI Cleanup). The "Implications of ChatGPT for knowledge integrity on Wikipedia" were also the topic of a research project conducted in 2023–2024 by UT Sydney researchers (funded by a $32k Wikimedia Foundation grant) which just published preliminary results where "Concerns about AI-generated content bypassing human curation" are highlighted as one of the challenges voiced by Wikipedians.
The new study's numbers should be valuable as concrete evidence that generative AI has indeed started to affect Wikipedia in this manner (but might also be reassuring for those who had been fearing Wikipedia would be overrun entirely by ChatGPT-generated articles).
That said, there are several serious concerns about how to interpret the study's data, and unfortunately the authors (a postdoc, a graduate student and an undergraduate student from Princeton University) address them only partially.
First, the researchers made no attempt to quantify how many of the articles from their headline result ("4.36% of 2,909 English Wikipedia articles created in August 2024 contain significant AI-generated content") had also been detected (and flagged or deleted) by Wikipedians. Inspecting the aforementioned smaller subsample of 45 (1.5%) articles where both detectors agreed, they found that
"Most of the 45 pages are flagged by moderators and bots with some warning, e.g., 'This article does not cite any sources. Please help improve this article by adding citations to reliable sources' or even 'This article may incorporate text from a large language model.'"
Even for this smaller sample though, we are not told what percentage of AI-generated articles survived.
In other words, the paper is a rather unsatisfactory read for those interested in the important question of whether generative AI threatens to overwhelm or at least degrade Wikipedia's quality control mechanisms - or whether these handle LLM-generated articles just fine alongside the existing never-ending stream of human-generated vandalism, hoaxes, or articles with missing or misleading references (see also our last issue, about an LLM-based system that generates gene articles with fewer such "hallucinated" references than human Wikipedia editors). Overall, while the paper's title boldly claims to show "The Rise of AI-Generated Content in Wikipedia", it leaves it entirely unclear whether the text that Wikipedia readers actually read has become substantially more likely to be AI-generated. (Or, for that matter, the text that AI systems themselves read, considering that Wikipedia is an important training source for LLMs - i.e. whether the paper is evidence for concerns that "The ouroboros has begun".)
Secondly and more importantly, the reliability of AI content detection software - such as the two tools that the study's numerical results are based on - has been repeatedly questioned. To their credit, the authors are aware of these problems and try to address them, for example by combining the results of two different detectors, and by using a comparison dataset of articles created before the release of GPT-3.5 in March 2022 (which can be reasonably assumed to be virtually free of LLM-generated text). However, their method still leaves several questions unanswered that may well threaten the validity of the study's results overall.
In more detail, the authors "use two prominent detection tools which were suitably scalable for our study". The first tool is
"GPTZero [...] a commercial AI detector that reports the probabilities that an input text is entirely written by AI, entirely written by humans, or written by a combination of AI and humans. In our experiments we use the probability that an input text is entirely written by AI. The black-box nature of the tool limits any insight into its methodology."
The second tool is more transparent:
"An open-source method, Binoculars [...] uses two separate LLMs [...] to score a text s for AI-likelihood by normalizing perplexity by a quantity termed cross-perplexity [...] The input text is classified as AI-generated if the score is lower than a determined threshold, calibrated according to a desired false positive rate (FPR). [...] For our experiments, we use Falcon-7b and Falcon-7b-instruct [as the two LLMs, following the recommendation of the authors of the Binoculars paper.] Compared to competing open-source detectors, Binoculars reports superior performance across various domains including Wikipedia."
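The quoted scoring method can be sketched in a few lines. The following is a toy illustration with made-up per-token numbers and a two-word vocabulary - not the tool's implementation, which runs the two Falcon-7b models over the actual text:

```python
import math

def log_ppl(token_nlls):
    """Log-perplexity: mean negative log-likelihood of the observed tokens
    under the observer model."""
    return sum(token_nlls) / len(token_nlls)

def log_x_ppl(performer_dists, observer_logprobs):
    """Log cross-perplexity: the performer model's next-token distribution
    at each position, scored against the observer model's log-probabilities."""
    total = 0.0
    for p_dist, q_logs in zip(performer_dists, observer_logprobs):
        total += -sum(p * q for p, q in zip(p_dist, q_logs))
    return total / len(performer_dists)

def binoculars_score(token_nlls, performer_dists, observer_logprobs):
    """Binoculars score = log-perplexity normalized by log cross-perplexity;
    lower scores indicate more AI-like text."""
    return log_ppl(token_nlls) / log_x_ppl(performer_dists, observer_logprobs)

# Toy example: 3 token positions, vocabulary of size 2 (numbers invented)
nlls = [0.5, 0.4, 0.6]                            # observer NLLs of observed tokens
performer = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]  # performer model distributions
observer = [[math.log(0.6), math.log(0.4)]] * 3   # observer model log-probs

score = binoculars_score(nlls, performer, observer)
# A text is flagged as AI-generated if its score falls below a threshold
# calibrated to a chosen false positive rate
```

The normalization is the key idea: dividing by cross-perplexity discounts text that is "surprising" to the observer model merely because the prompt or topic is unusual, rather than because a human wrote it.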
The "superior performance" of the Binoculars tool (online demo) for the Wikipedia "domain" sounds very reassuring, with both precision and recall at or near a perfect 100% according to figure 3 in the "Binoculars" paper[supp 1]. But it refers to the performance on a 2023 dataset called "M4",[supp 2] where the AI "articles" to be detected were generated in a rather simplistic manner. ("We prompted LLMs to generate Wikipedia articles given titles, with the requirement that the output articles contain at least 250 words", see also the results for e.g. ChatGPT). It seems unwise to assume that this is representative of all the ways in which actual editors try to use AI to generate new articles in August 2024. Indeed the authors explicitly acknowledge this in a different part of the "Rise" paper, pointing out they did not attempt to "simulat[e] the various ways Wikipedia authors might use LLMs to assist in writing—taking into account different models, prompts, and the extent of human integration, among other factors." As a small illustration of potential issues with this, the few concrete examples of articles detected as AI-generated that are included in the paper (figure 2, see above) all start with an infobox - something which ChatGPT can certainly generate if explicitly prompted to do so, but which seems to be absent from most or all of the examples in the M4 dataset.
What's more, as is evident from Figure 1 (above), both tools disagreed frequently, with GPTZero being much more detection-happy than Binoculars in English, French, and German, but much less so in Italian. The authors acknowledge that "the tools we use are primarily for detecting AI-generated content in English. While GPTZero supports Spanish and French, it is not designed for other languages."
As mentioned in the paper's abstract (see above), the authors try to control for false positives by calibrating both detectors to a 1% false positive rate on the control dataset (of presumably AI-free Wikipedia articles from March 2022). A technical issue that appears to have been overlooked here is that this 2022 dataset was generated (by Hugging Face) from the source wikitext dumps using the well-known "mwparserfromhell" Python package, whereas the authors obtained their August 2024 articles by scraping the text rendered by the Wikipedia API and applying some of their own cleanup steps. LLM-based text classification tools can sometimes be quite sensitive to minor formatting aspects.
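The calibration step itself is mechanically simple: pick the score cut-off so that a target share (here 1%) of the presumed-human control articles would be flagged. A minimal sketch with invented scores (not the authors' code), following the Binoculars convention that lower scores mean more AI-like:

```python
def calibrate_threshold(control_scores, target_fpr=0.01):
    """Choose a cut-off such that at most `target_fpr` of the presumed-human
    control texts score below it (lower score = more AI-like)."""
    ranked = sorted(control_scores)
    k = max(0, int(target_fpr * len(ranked)) - 1)
    return ranked[k]

# Toy control set: 200 presumed-human article scores (evenly spaced, for illustration)
control = [0.8 + 0.001 * i for i in range(200)]
threshold = calibrate_threshold(control)
false_positives = [s for s in control if s < threshold]  # human texts flagged as AI
```

The catch, of course, is that such a cut-off only guarantees a 1% false positive rate on texts drawn from the same distribution as the control set; nothing forces it to hold for articles extracted by a different pipeline two and a half years later.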
More importantly, it seems rather adventurous to assume that the articles from that March 2022 dataset are comparable in all relevant properties to the newly created articles from August 2024 (i.e. that the 1% false positive calibration on the former will mean a 1% false positive rate on the latter). The authors are clearly aware of this concept drift problem, but only make a very perfunctory attempt to address it:
"One concern is that pre-March 2022 pages may be more polished due to years of editing. However, we observe that a higher number of edits weakly correlates with a higher AI-detection score for pre-March 2022 articles (Appendix D), suggesting that the FPRs for those articles may even be inflated. While the base assumption cannot be watertight, we observe a relatively consistent distribution of page categories between the two data pools, and we rely on the consistency of our chosen tools’ reported FPRs."
For many people, the fact that additional edits by human Wikipedia editors make the AI detection score go up in both detectors might increase skepticism about their overall validity. But the authors take it as an argument strengthening their paper's overall "Rise of AI-Generated Content" claim, by alluding to the possibility that its estimate might be too low. At various other points of the paper the authors likewise express awareness that their measurement method is subject to substantial errors and uncertainties, but claim that these can only go in their favor (i.e. could only mean that the actual rate of AI-generated articles is higher than their estimated lower bound). And there are other issues that likewise make one wonder a bit about the stringency of the peer review process that the paper has undergone, for example its claim that "The Wikipedia data we collect is under a Creative Commons CC0 License."
The study has only been published as an arXiv preprint at the time of writing. But according to a remark in the accompanying code, it has been accepted at the "NLP for Wikipedia Workshop" at next month's EMNLP conference.
The authors have commendably published code used for the paper (although not under an open source license). Unfortunately though for readers who might want to replicate part of the paper's quantitative or qualitative analysis (or check whether some of the AI-generated articles it detected have slipped through Wikipedia's New pages patrol), none of the data underlying the paper's main results is being published (even though it is based entirely on public information from Wikipedia):
"Detecting AI may have unexpected negative consequences for people implicated as having generated that text. We have therefore been encouraged to omit any identifying information in the specific pages we discuss; however, we will provide more specific data to researchers upon request provided that it not be disseminated further."
But these concerns did not stop the authors from discussing the aforementioned concrete examples in a way that makes it very easy to identify involved users. (One reader of the paper has already done so, pointing out the specific longstanding sockpuppeting case that the editor featured in figure 2 was involved in.)
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[2]
"This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article."
From the abstract:[3]
"Wikipedia is composed from consensus. [...] While it is often positioned as a self-evident good, its usage on Wikipedia is not without concern. In this paper I mobilize Chantal Mouffe’s (2000) feminist critical political theory and Johanna Drucker’s (2014) methods of interface analysis to raise important questions about the relationship between consensus and peer production. [...] I identify the multitude of ways that Wikipedians perform consensus: not only through understanding and decision-making, but also through acts of composing, showing, processing, closing, and calculating. However, because Wikipedia’s socio-technical vision is over-determined by consensus, its political design is ill-equipped to address the political conditions of pluralist societies. As a result, I identify the reasons why Wikipedia should strengthen its democratic commitment by engaging with dissensus. By conducting this research, I demonstrate how consensus has transitioned from a democratic ideal into an interface and why it should be re-imagined within peer production projects."
From the abstract:[4]
"Working from a feminist affective framework, this article reports on a study of student-editors' experience with Wikipedia-based writing, using their reactions to a key editing guideline, 'Be bold,' as an entry-point for examining their affective experience. The 'Be bold' guideline, which encourages would-be editors to 'just go for it,' is nearly as old as the English language version of Wikipedia itself yet has received little critical attention. Drawing on survey and focus group interviews from participants at the undergraduate and graduate level, this study's findings provide new understandings of novice editors' affective experiences in Wikipedia while offering a critical analysis of the 'Be bold' guideline."
See also a post by one of the authors in the "Wikipedia Weekly" Facebook group, with discussion.
From the abstract:[5]
"This article details settler colonial erasure of Native American and Indigenous histories, knowledges, and philosophies on [English] Wikipedia. I show that long-time Wikipedia editors follow the settler colonial logic of elimination to omit Native histories from Wikipedia’s American history pages; block Native and allied editors from adding scholarship that centers Native experience; and ban Native and allied editors from the website so that settlers can lay claim to digital space. To do so, I concentrate on Wikipedia’s United States and American history pages, and I detail editor discussions regarding Native histories on Wikipedia’s talk pages, noticeboards, and off-Wikipedia message boards where editors congregate. I supplement this information with interviews of Wikipedia editors engaged in editing these topics. [...] I ultimately provide suggestions for the Wikimedia foundation to combat settler colonial erasure on Wikipedia."
See also a "Public response to the editors of Settler Colonial Studies" by Wikipedia admin User:Tamzin in the June 8 Signpost issue, arguing that the paper's author had failed to disclose a conflict of interest and that the paper contained multiple factual errors.
From the abstract:[6]
"This study examines the impact of integrating Google Translate into Wikipedia's Content Translation system in January 2019. Employing a natural experiment design and difference-in-differences strategy, we analyze how this translation technology shock influenced the dynamics of content production and accessibility on Wikipedia across over a hundred languages. We find that this technology integration led to a 149% increase in content production through translation, driven by existing editors becoming more productive as well as an expansion of the editor base. Moreover, we observe that machine translation enhances the propagation of biographical and geographical information, helping to close these knowledge gaps in the multilingual context."
See also mw:Wikimedia_Research/Showcase#July_2024
From the abstract:[7]
"[The] Wikipedia Revision History (WikiRevHist) [dataset ... can be a] valuable resource[] for NLP applications. [...] we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated works in processing WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and warehouses. On top of that, we build the core data processing pipeline: builder and modifier. The BloArk builder transforms the original WikiRevHist dataset from XML syntax into JSON Lines (JSONL) format for improving the concurrent and storage efficiency. The BloArk modifier takes previously-built warehouses to operate incremental modifications for improving the utilization of existing databases and reducing the cost of reusing others' works. In the end, BloArk can scale up easily in both processing Wikipedia Revision History and incrementally modifying existing dataset for downstream NLP use cases. The source code, documentations, and example usages are publicly available online and open-sourced under GPL-2.0 license."
From the abstract:[8]
"The article compares selected entries on Wikipedia concerning significant historical events in three language versions: Belarusian, Lithuanian, and Polish. [...] I apply the method of ideological critique to investigate whether national values influence the objectivity of Wikipedia articles written in local languages. A comparison of multilingual Wikipedia entries reveals the prevalence of “local” points of view on controversial historical events."
From the abstract:[9]
"After 2007, researchers started to observe that the number of active editors for the largest Wikipedias declined after rapid initial growth. Years after those announcements, researchers and community activists still need to understand how to measure community health. In this paper, we study patterns of growth, decline and stagnation, and we propose the creation of 6 sets of language-independent indicators that we call “Vital Signs” [formerly available at https://vitalsigns.wmcloud.org/ ]. Three focus on the general population of active editors creating content: retention, stability, and balance; the other three are related to specific community functions: specialists, administrators, and global community participation. [...] We present our analysis for eight Wikipedia language editions, and we show that communities are renewing their productive force even with stagnating absolute numbers; we observe a general lack of renewal in positions related to special functions or administratorship."
From the abstract:[10]
"we conduct the first data-driven study of ban evasion, i.e., the act of circumventing bans on an online platform, leading to temporally disjoint operation of accounts by the same user. We curate a novel dataset of 8,551 ban evasion pairs (parent, child) identified on Wikipedia [by "Wikipedia moderators", at Wikipedia:Sockpuppet investigations], and contrast their behavior with benign users and non-evading malicious users. We find that evasion child accounts demonstrate similarities with respect to their banned parent accounts on several behavioral axes — from similarity in usernames and edited pages to similarity in content added to the platform and its psycholinguistic attributes. We reveal key behavioral attributes of accounts that are likely to evade bans. Based on the insights from the analyses, we train logistic regression classifiers to detect and predict ban evasion at three different points in the ban evasion lifecycle. Results demonstrate the effectiveness of our methods in predicting future evaders (AUC = 0.78), early detection of ban evasion (AUC = 0.85), and matching child accounts with parent accounts (MRR = 0.97)."
See also earlier coverage on research about sockpuppets on Wikipedia