A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A new paper titled "The Rise of AI-Generated Content in Wikipedia"[1] estimates
"that 4.36% of 2,909 English Wikipedia articles created in August 2024 contain significant AI-generated content".
In more detail, the authors used two existing AI detectors (GPTZero and Binoculars), which
"reveal a marked increase in AI-generated content in recent[ly created] pages compared to those from before the release of GPT-3.5 [in March 2022]. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian articles. Flagged Wikipedia articles are typically of lower quality and are often self-promotional or partial towards a specific viewpoint on controversial topics."
The researchers also conducted a small qualitative analysis "to better understand the motivations for using LLMs to create Wikipedia pages", by manually inspecting the edit histories of a smaller subset, namely "the 45 English articles flagged as AI-generated by both GPTZero and Binoculars" (corresponding to 1.5% of those 2,909), and looking at the other contributions of the editors who created them. In this rather small sample, they identify four different such motivations:
- Advertisement
One prominent motive is self-promotion. Of the 45 flagged pages, we identify eight that were created to promote organizations such as small businesses, restaurants, or websites. [...]
- Pages Pushing Polarization
[...] we also identify pages that advocate a particular viewpoint on often polarizing political topics. We identify eight such pages out of the flagged 45. One user created five articles on English Wikipedia, detected by both tools as AI-generated, on contentious moments in Albanian history [see also figure 2, below ...] In other cases, users created articles ostensibly on one topic, such as types of weapons or political movements, but dedicated the majority of the pages’ content to discussing specific political figures. We find two such articles that espouse non-neutral views on JD Vance and Volodymyr Zelensky.
- Machine Translation
[...] We find three cases where users explicitly documented their work as translations, including pages on Portuguese history and legal cases in Ghana. [...]
- Writing Tool
Other pages, which are often well-structured with high-quality citations, seem to have been written by users who are knowledgeable in certain niches and are employing an LLM as a writing tool. Several of the flagged pages are created by users who churn out dozens of articles within specific categories, including snake breeds, types of fungi, Indian cuisine, and American football players. [...]
These are among the first research results providing a quantitative answer to an important question that Wikipedia's editing community and the Wikimedia Foundation have been weighing since at least the release of ChatGPT almost two years ago. (Cf. previous Signpost coverage: Community rejects proposal to create policy about large language models, "AI is not playing games anymore. Is Wikipedia ready?", and in this issue: "Keeping AI at bay – with a little help from volunteers", summarizing media coverage of WikiProject AI Cleanup). The "Implications of ChatGPT for knowledge integrity on Wikipedia" were also the topic of a research project conducted in 2023–2024 by UT Sydney researchers (funded by a $32k Wikimedia Foundation grant) which just published preliminary results where "Concerns about AI-generated content bypassing human curation" are highlighted as one of the challenges voiced by Wikipedians.
The new study's numbers should be valuable as concrete evidence that generative AI has indeed started to affect Wikipedia in this manner (but might also be reassuring for those who had been fearing Wikipedia would be overrun entirely by ChatGPT-generated articles).
That said, there are several serious concerns about how to interpret the study's data, and unfortunately the authors (a postdoc, a graduate student and an undergraduate student from Princeton University) address them only partially.
First, the researchers made no attempt to quantify how many of the articles from their headline result ("4.36% of 2,909 English Wikipedia articles created in August 2024 contain significant AI-generated content") had also been detected (and flagged or deleted) by Wikipedians. Inspecting the aforementioned smaller subsample of 45 (1.5%) articles where both detectors agreed, they found that
"Most of the 45 pages are flagged by moderators and bots with some warning, e.g., 'This article does not cite any sources. Please help improve this article by adding citations to reliable sources' or even 'This article may incorporate text from a large language model.'"
Even for this smaller sample though, we are not told what percentage of AI-generated articles survived.
In other words, the paper is a rather unsatisfactory read for those interested in the important question of whether generative AI threatens to overwhelm or at least degrade Wikipedia's quality control mechanisms - or whether these handle LLM-generated articles just fine alongside the existing never-ending stream of human-generated vandalism, hoaxes, or articles with missing or misleading references (see also our last issue, about an LLM-based system that generates gene articles with fewer such "hallucinated" references than human Wikipedia editors). Overall, while the paper's title boldly claims to show "The Rise of AI-Generated Content in Wikipedia", it leaves it entirely unclear whether the text that Wikipedia readers actually read has become substantially more likely to be AI-generated. (Or, for that matter, the text that AI systems themselves read, considering that Wikipedia is an important training source for LLMs - i.e. whether the paper is evidence for concerns that "The ouroboros has begun".)
Secondly and more importantly, the reliability of AI content detection software - such as the two tools that the study's numerical results are based on - has been repeatedly questioned. To their credit, the authors are aware of these problems and try to address them, for example by combining the results of two different detectors, and by using a comparison dataset of articles created before the release of GPT-3.5 in March 2022 (which can be reasonably assumed to be virtually free of LLM-generated text). However, their method still leaves several questions unanswered that may well threaten the validity of the study's results overall.
In more detail, the authors "use two prominent detection tools which were suitably scalable for our study". The first tool is
"GPTZero [...] a commercial AI detector that reports the probabilities that an input text is entirely written by AI, entirely written by humans, or written by a combination of AI and humans. In our experiments we use the probability that an input text is entirely written by AI. The black-box nature of the tool limits any insight into its methodology."
The second tool is more transparent:
"An open-source method, Binoculars [...] uses two separate LLMs [...] to score a text s for AI-likelihood by normalizing perplexity by a quantity termed cross-perplexity [...] The input text is classified as AI-generated if the score is lower than a determined threshold, calibrated according to a desired false positive rate (FPR). [...] For our experiments, we use Falcon-7b and Falcon-7b-instruct [as the two LLMs, following the recommendation of the authors of the Binoculars paper.] Compared to competing open-source detectors, Binoculars reports superior performance across various domains including Wikipedia."
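The quoted scoring method can be sketched in a few lines. The following is a toy illustration with made-up per-token numbers and a two-word vocabulary - not the tool's implementation, which runs the two Falcon-7b models over the actual text:

```python
import math

def log_ppl(token_nlls):
    """Log-perplexity: mean negative log-likelihood of the observed tokens
    under the observer model."""
    return sum(token_nlls) / len(token_nlls)

def log_x_ppl(performer_dists, observer_logprobs):
    """Log cross-perplexity: the performer model's next-token distribution
    at each position, scored against the observer model's log-probabilities."""
    total = 0.0
    for p_dist, q_logs in zip(performer_dists, observer_logprobs):
        total += -sum(p * q for p, q in zip(p_dist, q_logs))
    return total / len(performer_dists)

def binoculars_score(token_nlls, performer_dists, observer_logprobs):
    """Binoculars score = log-perplexity normalized by log cross-perplexity;
    lower scores indicate more AI-like text."""
    return log_ppl(token_nlls) / log_x_ppl(performer_dists, observer_logprobs)

# Toy example: 3 token positions, vocabulary of size 2 (numbers invented)
nlls = [0.5, 0.4, 0.6]                            # observer NLLs of observed tokens
performer = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]  # performer model distributions
observer = [[math.log(0.6), math.log(0.4)]] * 3   # observer model log-probs

score = binoculars_score(nlls, performer, observer)
# A text is flagged as AI-generated if its score falls below a threshold
# calibrated to a chosen false positive rate
```

The normalization is the key idea: dividing by cross-perplexity discounts text that is "surprising" to the observer model merely because the prompt or topic is unusual, rather than because a human wrote it.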
The "superior performance" of the Binoculars tool (online demo) for the Wikipedia "domain" sounds very reassuring, with both precision and recall at or near a perfect 100% according to figure 3 in the "Binoculars" paper[supp 1]. But it refers to the performance on a 2023 dataset called "M4",[supp 2] where the AI "articles" to be detected were generated in a rather simplistic manner. ("We prompted LLMs to generate Wikipedia articles given titles, with the requirement that the output articles contain at least 250 words", see also the results for e.g. ChatGPT). It seems unwise to assume that this is representative of all the ways in which actual editors try to use AI to generate new articles in August 2024. Indeed the authors explicitly acknowledge this in a different part of the "Rise" paper, pointing out they did not attempt to "simulat[e] the various ways Wikipedia authors might use LLMs to assist in writing—taking into account different models, prompts, and the extent of human integration, among other factors." As a small illustration of potential issues with this, the few concrete examples of articles detected as AI-generated that are included in the paper (figure 2, see above) all start with an infobox - something which ChatGPT can certainly generate if explicitly prompted to do so, but which seems to be absent from most or all of the examples in the M4 dataset.
What's more, as is evident from Figure 1 (above), both tools disagreed frequently, with GPTZero being much more detection-happy than Binoculars in English, French, and German, but much less so in Italian. The authors acknowledge that "the tools we use are primarily for detecting AI-generated content in English. While GPTZero supports Spanish and French, it is not designed for other languages."
As mentioned in the paper's abstract (see above), the authors try to control for false positives by calibrating both detectors to a 1% false positive rate on the control dataset (of presumably AI-free Wikipedia articles from March 2022). A technical issue that appears to have been overlooked here is that this 2022 dataset was generated (by Hugging Face) from the source wikitext dumps using the well-known "mwparserfromhell" Python package, whereas the authors obtained their August 2024 articles by scraping the text rendered by the Wikipedia API and applying some of their own cleanup steps. LLM-based text classification tools can sometimes be quite sensitive to minor formatting aspects.
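The calibration step itself is mechanically simple: pick the score cut-off so that a target share (here 1%) of the presumed-human control articles would be flagged. A minimal sketch with invented scores (not the authors' code), following the Binoculars convention that lower scores mean more AI-like:

```python
def calibrate_threshold(control_scores, target_fpr=0.01):
    """Choose a cut-off such that at most `target_fpr` of the presumed-human
    control texts score below it (lower score = more AI-like)."""
    ranked = sorted(control_scores)
    k = max(0, int(target_fpr * len(ranked)) - 1)
    return ranked[k]

# Toy control set: 200 presumed-human article scores (evenly spaced, for illustration)
control = [0.8 + 0.001 * i for i in range(200)]
threshold = calibrate_threshold(control)
false_positives = [s for s in control if s < threshold]  # human texts flagged as AI
```

The catch, of course, is that such a cut-off only guarantees a 1% false positive rate on texts drawn from the same distribution as the control set; nothing forces it to hold for articles extracted by a different pipeline two and a half years later.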
More importantly, it seems rather adventurous to assume that the articles from that March 2022 dataset are comparable in all relevant properties to the newly created articles from August 2024 (i.e. that the 1% false positive calibration on the former will mean a 1% false positive rate on the latter). The authors are clearly aware of this concept drift problem, but only make a very perfunctory attempt to address it:
"One concern is that pre-March 2022 pages may be more polished due to years of editing. However, we observe that a higher number of edits weakly correlates with a higher AI-detection score for pre-March 2022 articles (Appendix D), suggesting that the FPRs for those articles may even be inflated. While the base assumption cannot be watertight, we observe a relatively consistent distribution of page categories between the two data pools, and we rely on the consistency of our chosen tools’ reported FPRs."
For many people, the fact that additional edits by human Wikipedia editors make the AI detection score go up in both detectors might increase skepticism about their overall validity. But the authors take it as an argument strengthening their paper's overall "Rise of AI-Generated Content" claim, by alluding to the possibility that its estimate might be too low. At various other points of the paper the authors likewise express awareness that their measurement method is subject to substantial errors and uncertainties, but claim that these can only go in their favor (i.e. could only mean that the actual rate of AI-generated articles is higher than their estimated lower bound). And there are other issues that likewise make one wonder a bit about the stringency of the peer review process that the paper has undergone, for example its claim that "The Wikipedia data we collect is under a Creative Commons CC0 License."
The study has only been published as an arXiv preprint at the time of writing. But according to a remark in the accompanying code, it has been accepted at the "NLP for Wikipedia Workshop" at next month's EMNLP conference.
The authors have commendably published code used for the paper (although not under an open source license). Unfortunately though for readers who might want to replicate part of the paper's quantitative or qualitative analysis (or check whether some of the AI-generated articles it detected have slipped through Wikipedia's New pages patrol), none of the data underlying the paper's main results is being published (even though it is based entirely on public information from Wikipedia):
"Detecting AI may have unexpected negative consequences for people implicated as having generated that text. We have therefore been encouraged to omit any identifying information in the specific pages we discuss; however, we will provide more specific data to researchers upon request provided that it not be disseminated further."
But these concerns did not stop the authors from discussing the aforementioned concrete examples in a way that makes it very easy to identify involved users. (One reader of the paper has already done so, pointing out the specific longstanding sockpuppeting case that the editor featured in figure 2 was involved in.)
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[2]
"This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article."
From the abstract:[3]
"Wikipedia is composed from consensus. [...] While it is often positioned as a self-evident good, its usage on Wikipedia is not without concern. In this paper I mobilize Chantal Mouffe’s (2000) feminist critical political theory and Johanna Drucker’s (2014) methods of interface analysis to raise important questions about the relationship between consensus and peer production. [...] I identify the multitude of ways that Wikipedians perform consensus: not only through understanding and decision-making, but also through acts of composing, showing, processing, closing, and calculating. However, because Wikipedia’s socio-technical vision is over-determined by consensus, its political design is ill-equipped to address the political conditions of pluralist societies. As a result, I identify the reasons why Wikipedia should strengthen its democratic commitment by engaging with dissensus. By conducting this research, I demonstrate how consensus has transitioned from a democratic ideal into an interface and why it should be re-imagined within peer production projects."
From the abstract:[4]
"Working from a feminist affective framework, this article reports on a study of student-editors' experience with Wikipedia-based writing, using their reactions to a key editing guideline, 'Be bold,' as an entry-point for examining their affective experience. The 'Be bold' guideline, which encourages would-be editors to 'just go for it,' is nearly as old as the English language version of Wikipedia itself yet has received little critical attention. Drawing on survey and focus group interviews from participants at the undergraduate and graduate level, this study's findings provide new understandings of novice editors' affective experiences in Wikipedia while offering a critical analysis of the 'Be bold' guideline."
See also a post by one of the authors in the "Wikipedia Weekly" Facebook group, with discussion.
From the abstract:[5]
"This article details settler colonial erasure of Native American and Indigenous histories, knowledges, and philosophies on [English] Wikipedia. I show that long-time Wikipedia editors follow the settler colonial logic of elimination to omit Native histories from Wikipedia’s American history pages; block Native and allied editors from adding scholarship that centers Native experience; and ban Native and allied editors from the website so that settlers can lay claim to digital space. To do so, I concentrate on Wikipedia’s United States and American history pages, and I detail editor discussions regarding Native histories on Wikipedia’s talk pages, noticeboards, and off-Wikipedia message boards where editors congregate. I supplement this information with interviews of Wikipedia editors engaged in editing these topics. [...] I ultimately provide suggestions for the Wikimedia foundation to combat settler colonial erasure on Wikipedia."
See also a "Public response to the editors of Settler Colonial Studies" by Wikipedia admin User:Tamzin in the June 8 Signpost issue, arguing that the paper's author had failed to disclose a conflict of interest and that the paper contained multiple factual errors.
From the abstract:[6]
"This study examines the impact of integrating Google Translate into Wikipedia's Content Translation system in January 2019. Employing a natural experiment design and difference-in-differences strategy, we analyze how this translation technology shock influenced the dynamics of content production and accessibility on Wikipedia across over a hundred languages. We find that this technology integration led to a 149% increase in content production through translation, driven by existing editors becoming more productive as well as an expansion of the editor base. Moreover, we observe that machine translation enhances the propagation of biographical and geographical information, helping to close these knowledge gaps in the multilingual context."
See also mw:Wikimedia_Research/Showcase#July_2024
From the abstract:[7]
"[The] Wikipedia Revision History (WikiRevHist) [dataset ... can be a] valuable resource[] for NLP applications. [...] we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated works in processing WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and warehouses. On top of that, we build the core data processing pipeline: builder and modifier. The BloArk builder transforms the original WikiRevHist dataset from XML syntax into JSON Lines (JSONL) format for improving the concurrent and storage efficiency. The BloArk modifier takes previously-built warehouses to operate incremental modifications for improving the utilization of existing databases and reducing the cost of reusing others' works. In the end, BloArk can scale up easily in both processing Wikipedia Revision History and incrementally modifying existing dataset for downstream NLP use cases. The source code, documentations, and example usages are publicly available online and open-sourced under GPL-2.0 license."
From the abstract:[8]
"The article compares selected entries on Wikipedia concerning significant historical events in three language versions: Belarusian, Lithuanian, and Polish. [...] I apply the method of ideological critique to investigate whether national values influence the objectivity of Wikipedia articles written in local languages. A comparison of multilingual Wikipedia entries reveals the prevalence of “local” points of view on controversial historical events."
From the abstract:[9]
"After 2007, researchers started to observe that the number of active editors for the largest Wikipedias declined after rapid initial growth. Years after those announcements, researchers and community activists still need to understand how to measure community health. In this paper, we study patterns of growth, decline and stagnation, and we propose the creation of 6 sets of language-independent indicators that we call “Vital Signs” [formerly available at https://vitalsigns.wmcloud.org/ ]. Three focus on the general population of active editors creating content: retention, stability, and balance; the other three are related to specific community functions: specialists, administrators, and global community participation. [...] We present our analysis for eight Wikipedia language editions, and we show that communities are renewing their productive force even with stagnating absolute numbers; we observe a general lack of renewal in positions related to special functions or administratorship."
From the abstract:[10]
"we conduct the first data-driven study of ban evasion, i.e., the act of circumventing bans on an online platform, leading to temporally disjoint operation of accounts by the same user. We curate a novel dataset of 8,551 ban evasion pairs (parent, child) identified on Wikipedia [by "Wikipedia moderators", at Wikipedia:Sockpuppet investigations], and contrast their behavior with benign users and non-evading malicious users. We find that evasion child accounts demonstrate similarities with respect to their banned parent accounts on several behavioral axes — from similarity in usernames and edited pages to similarity in content added to the platform and its psycholinguistic attributes. We reveal key behavioral attributes of accounts that are likely to evade bans. Based on the insights from the analyses, we train logistic regression classifiers to detect and predict ban evasion at three different points in the ban evasion lifecycle. Results demonstrate the effectiveness of our methods in predicting future evaders (AUC = 0.78), early detection of ban evasion (AUC = 0.85), and matching child accounts with parent accounts (MRR = 0.97)."
See also earlier coverage on research about sockpuppets on Wikipedia