A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A preprint titled "Do You Trust ChatGPT? -- Perceived Credibility of Human and AI-Generated Content"[1] presents what the authors (four researchers from Mainz, Germany) call surprising and troubling findings:
"We conduct an extensive online survey with overall 606 English speaking participants and ask for their perceived credibility of text excerpts in different UI [user interface] settings (ChatGPT UI, Raw Text UI, Wikipedia UI) while also manipulating the origin of the text: either human-generated or generated by [a large language model] ("LLM-generated"). Surprisingly, our results demonstrate that regardless of the UI presentation, participants tend to attribute similar levels of credibility to the content. Furthermore, our study reveals an unsettling finding: participants perceive LLM-generated content as clearer and more engaging while on the other hand they are not identifying any differences with regards to message’s competence and trustworthiness."
The human-generated texts were taken from the lead section of four English Wikipedia articles (Academy Awards, Canada, malware and United States Senate). The LLM-generated versions were obtained from ChatGPT using the prompt 'Write a dictionary article on the topic "[TITLE]". The article should have about [WORDS] words'.
The researchers report that
"[...] even if the participants know that the texts are from ChatGPT, they consider them to be as credible as human-generated and curated texts [from Wikipedia]. Furthermore, we found that the texts generated by ChatGPT are perceived as more clear and captivating by the participants than the human-generated texts. This perception was further supported by the finding that participants spent less time reading LLM-generated content while achieving comparable comprehension levels."
One caveat about these results (which is only indirectly acknowledged in the paper's "Limitations" section) is that the study focused on four quite popular (i.e. non-obscure) topics – Academy Awards, Canada, malware and US Senate. Also, it sought to present only the most important information about each of these, in the form of a dictionary entry (as per the ChatGPT prompt) or the lead section of a Wikipedia article. It is well known that the output of LLMs tends to have fewer errors when it draws from information that is amply present in their training data (see e.g. our previous coverage of a paper that, for this reason, called for assessing the factual accuracy of LLM output on a benchmark that specifically includes lesser-known "tail topics"). Indeed, the authors of the present paper "manually checked the LLM-generated texts for factual errors and did not find any major mistakes," which is widely reported not to be the case for ChatGPT output in general. That said, it has similarly been claimed that Wikipedia, too, is less reliable on obscure topics. Also, the paper used the freely available version of ChatGPT (in its 23 March 2023 revision) which is based on the GPT 3.5 model, rather than the premium "ChatGPT Plus" version which, since March 2023, has been using the more powerful GPT-4 model (as does Microsoft's free Bing chatbot). GPT-4 has been found to have a significantly lower hallucination rate than GPT 3.5.
A paper titled "The Risks, Benefits, and Consequences of Prepublication Moderation: Evidence from 17 Wikipedia Language Editions",[2] from last year's CSCW conference, addresses a longstanding open question in Wikipedia research, with important implications for some current issues.
Wikipedia famously allows anyone to edit, which generally means that even unregistered editors can make changes to content that go live immediately – only subject to "postpublication moderation" by other editors afterwards. Less well known is that on many Wikipedia language versions, this principle has long been limited by a software feature called Flagged Revisions (FlaggedRevs), which was developed and designed at the request of the German Wikipedia community and deployed there first in 2008, and has since been adopted by various other Wikimedia projects. (These do not include the English Wikipedia, which after much discussion implemented a system called "Pending Changes" that is very similar, but is only applied on a case-by-case basis to a small percentage of pages.) As summarized by the authors:
FlaggedRevs is a prepublication content moderation system in that it will display the most recent “flagged” revision of any page for which FlaggedRevs is enabled instead of the most recent revision in general. FlaggedRevs is designed to “give additional information regarding quality,” by ensuring that revisions from less-trusted users are vetted for vandalism or substandard content (e.g., obvious mistakes because of sloppy editing) before being flagged and made public. The FlaggedRevs system also displays the moderation status of the contribution to readers. [...] Although there are many details that can vary based on the way that the system is configured, FlaggedRevs has typically been deployed in the following way on Wikipedia language editions. First, users are divided into groups of trusted and untrusted users. Untrusted users typically include all users without accounts as well as users who have created accounts recently and/or contributed very little. Although editors without accounts remain untrusted indefinitely, editors with accounts are automatically promoted to trusted status when they clear certain thresholds determined by each language community. For example, German Wikipedia automatically promotes editors with accounts who have contributed at least 300 revisions accompanied by at least 30 comments.
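The two rules described in this passage (readers see the most recent flagged revision, and account holders are promoted once they pass community-set thresholds) can be illustrated with a minimal Python sketch. This is not the FlaggedRevs extension's actual code: the names `Revision`, `displayed_revision` and `is_auto_promoted` are hypothetical, the default thresholds are simply the German Wikipedia figures quoted above, and returning `None` when a page has no flagged revision is a simplification of this sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Revision:
    """Hypothetical, simplified revision record (not the extension's schema)."""
    rev_id: int
    author: str
    flagged: bool  # True once the revision has been vetted and flagged


def displayed_revision(history: Sequence[Revision]) -> Optional[Revision]:
    """Return the revision readers would see: the newest flagged one.

    `history` is ordered oldest to newest. If nothing has been flagged yet,
    this sketch returns None (an assumption made for illustration only).
    """
    for rev in reversed(history):
        if rev.flagged:
            return rev
    return None


def is_auto_promoted(
    has_account: bool,
    revisions: int,
    comments: int,
    min_revisions: int = 300,  # German Wikipedia figure quoted in the paper
    min_comments: int = 30,    # likewise; each community sets its own values
) -> bool:
    """Editors without accounts stay untrusted; others need both thresholds."""
    return has_account and revisions >= min_revisions and comments >= min_comments


if __name__ == "__main__":
    history = [
        Revision(1, "established_editor", flagged=True),
        Revision(2, "203.0.113.5", flagged=False),  # unregistered, not yet vetted
    ]
    shown = displayed_revision(history)
    print("Readers see revision", shown.rev_id if shown else None)
    print("Promoted:", is_auto_promoted(True, revisions=320, comments=45))
```

In this sketch the unregistered editor's change stays out of public view until someone flags it, which is the "prepublication" aspect the paper examines.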
The paper studies the impact of the introduction of FlaggedRevs "on 17 Wikipedia language communities: Albanian, Arabic, Belarusian, Bengali, Bosnian, Esperanto, Persian, Finnish, Georgian, German, Hungarian, Indonesian, Interlingua, Macedonian, Polish, Russian, and Turkish" (leaving out a few non-Wikipedia sister projects that also use the system). The overall findings are that
"the system is very effective at blocking low-quality contributions from ever being visible. In analyzing its side effects, we found, contrary to expectations and most of our hypotheses, little evidence that the system [...] raises transaction costs sufficiently to inhibit participation by the community as a whole, nor [that it] measurably improves the quality of contributions."
In the "Discussion" section, the authors write:
Our results suggest that prepublication moderation systems like FlaggedRevs may have a substantial upside with relatively little downside. If this is true, why are a tiny proportion of Wikipedia language editions using it? Were they just waiting for an analysis like ours? In designing this study, we carefully read the Talk page of FlaggedRevs.[supp 1] Community members commenting in the discussion agreed that prepublication review significantly reduces the chance of letting harmful content slip through and being displayed to the public. Certainly, many agreed that the implementation of prepublication review was a success story in general—especially on German Wikipedia. [...]
However, the same discussion also reveals that the success of German Wikipedia is not enough to convince more wikis to follow in their footsteps. From a technical perspective, FlaggedRevs’ source code appears poorly maintained.[supp 2] [...] FlaggedRevs itself suffers from a range of specific limitations. For example, the FlaggedRevs system does not notify editors that their contribution has been rejected or approved. [...] Since April 2017, requests for deployment of the system by other wikis have been paused by the Wikimedia Foundation indefinitely.[supp 3] Despite these problems, our findings suggest that the system kept low-quality contributions out of the public eye and did not deter contributions from the majority of new and existing users. Our work suggests that systems like FlaggedRevs deserve more attention.
(This reviewer agrees in particular regarding the lack of notifications for new and unregistered editors that their edit has been approved – having filed, in vain, a proposal to implement this uncontroversially beneficial and already designed software feature to the annual "Community Wishlist", in 2023, 2022, and 2019.)
Interestingly, while the FlaggedRevs feature was (as summarized by the authors) developed by the Wikimedia Foundation and the German Wikimedia chapter (Wikimedia Deutschland), community complaints about a lack of support from the Foundation for the system were present even then, e.g. in a talk at Wikimania 2008 (notes, video recording) by User:P. Birken, a main driving force behind the project. Perhaps relatedly, the authors of the present study highlight a lack of researcher attention:
"Despite its importance and deployment in a number of large Wikipedia communities, very little is known regarding the effectiveness of the system and its impact. A report made by the members of the Wikimedia Foundation in 2008 gave a brief overview of the extension, its capabilities and deployment status at the time, but acknowledged that “it is not yet fully understood what the impact of the implementation of FlaggedRevs has been on the number of contributions by new users.”[supp 4] Our work seeks to address this empirical gap."
Still, it may be worth mentioning that there have been at least two preceding attempts to study this question (neither of these has been published in peer-reviewed form, thus their omission from the present study is understandable). They likewise don't seem to have identified major concerns that FlaggedRevs might contribute to community decline:
In any case, the CSCW paper reviewed here takes a much more comprehensive and methodical approach: it examines the impact of FlaggedRevs across multiple wikis, formalizes its research hypotheses, and uses more reliable statistical techniques.
In more detail, the researchers formalized four groups of research hypotheses about the impact of FlaggedRevs [our bolding]:
(Disclosure: This reviewer provided some input to the authors at the beginning of their research project, as acknowledged in the paper, but was not involved in it otherwise.)
See also related earlier coverage: "Sociological analysis of debates about flagged revisions in the English, German and French Wikipedias" (2012)
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[3]
"Various Wikipedia researchers have commended Wikidata for its collaborative nature and liberatory potential, yet less attention has been paid to the social and political implications of Wikidata. This article aims to advance work in this context by introducing the concept of semantic infrastructure and outlining how Wikidata’s role as semantic infrastructure is the primary vehicle by which Wikipedia has become infrastructural for digital platforms. We develop two key themes that build on questions of power that arise in infrastructure studies and apply to Wikidata: knowledge representation and data labor."
From the abstract:[4]
"In 2019, Digital Curation Lab Director Toni Sant and the artist Enrique Tabone started collaborating on a research project exploring the visualization of specific data sets through Wikidata for artistic practice. An art installation called Naked Data was developed from this collaboration and exhibited at the Stanley Picker Gallery in Kingson, London, during the DRHA 2022 conference. [...] This article outlines the key elements involved in this practice-based research work and shares the artistic process involving the visualizing of the scientific data with special attention to the aesthetic qualities afforded by this technological engagement."
From the abstract:[5]
"We [...] explore a user-oriented notion of World Literature according to the collaborative encyclopedia Wikipedia. Based on its language-independent taxonomy Wikidata, we collect data from 321 Wikipedia editions on more than 7000 characters presented on more than 19000 independent character pages across the various language editions. We use this data to build a network that represents affiliations of characters to Wikipedia languages, which leads us to question some of the established presumptions towards key-concepts in World Literature studies such as the notion of major and minor, the center-periphery opposition or the canon."
From the abstract:[6]
"Diving into the Wikipedia ecosystem [...] we identified and quantified three fundamental coordination mechanisms and found they scale with an influx of contributors in a remarkably systemic way over three order of magnitudes. Firstly, we have found a super-linear growth in mutual adjustments (scaling exponent: 1.3), manifested through extensive discussions and activity reversals. Secondly, the increase in direct supervision (scaling exponent: 0.9), as represented by the administrators’ activities, is disproportionately limited. Finally, the rate of rule enforcement exhibits the slowest escalation (scaling exponent 0.7), reflected by automated bots. The observed scaling exponents are notably robust across topical categories with minor variations attributed to the topic complication. Our findings suggest that as more people contribute to a project, a self-regulating ecosystem incurs faster mutual adjustments than direct supervision and rule enforcement."
From the abstract:[7]
"The "Wikidata Research Articles Dataset" comprises peer-reviewed full research papers about Wikidata from its first decade of existence (2012-2022). This dataset was curated to provide insights into the research focus of Wikidata, identify any gaps, and highlight the institutions actively involved in researching Wikidata."
From the abstract:[8]
"The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models."
From the paper:[9]
"Fifteen years ago, I conducted a small study testing the error-correction tendency of Wikipedia. [...] I repeated the earlier study and found surprisingly similar results. [...] Between July and November 2022, I made 33 changes to Wikipedia: one at a time, anonymously, and from various IP addresses. [...] Each change consisted of a one or two sentence fib inserted into the Wikipedia entry on a notable, deceased philosopher. The fibs were about biographical or factual matters, rather than philosophical content or interpretive questions. Although some of the fibs mention “sources”, no citations were provided. If the fibs were not corrected within 48 hours, they were removed by the experimenter. The fibs were all, verbatim, ones that I used in Magnus (2008). [...] Thirty-six percent (12/33) of changes were corrected within 48 hours. Rounded to the nearest percentage point, this is the same as the adjusted result in Magnus (2008)."
Discuss this story
FlaggedRevs
There seem to be some errors in the assumptions that the paper The Risks, Benefits, and Consequences of Prepublication Moderation: Evidence from 17 Wikipedia Language Editions (https://arxiv.org/abs/2202.05548) makes about how FlaggedRevs works. For example:
--Zache (talk) 09:11, 4 October 2023 (UTC)
In Russian Wikipedia, as well as in Russian Wikinews, FlaggedRevs is a disaster. You say Germans are guilty of that? --ssr (talk) 06:09, 27 October 2023 (UTC)
Wikidata
ChatGPT v. Wikipedia
The study authors comment on prose quality. I happened to ask ChatGPT yesterday to explain what government shutdowns in the U.S. are and what effects they have. I got the following answer:
I then compared that to the lead of Government shutdowns in the United States:
Personally I found ChatGPT's output a lot more readable than the Wikipedia lead – it is just better written. The English Wikipedia text often required me to go back and read the sentence again.
Take the first sentence:
At first I parsed "when funding legislation" as an indication of when shutdowns occur (i.e. "when you are funding legislation"). I needed to read on to realise that this wasn't where the sentence was going. Next, Wikipedia uses the rather technical expression "when funding legislation ... is not enacted" (which is also passive voice) where ChatGPT uses the much easier-to-understand "when Congress fails to pass a budget" (active voice).
Where ChatGPT speaks of a "temporary suspension of non-essential government services", Wikipedia says the federal government "curtails agency activities and services, ceases non-essential operations", etc. I find the ChatGPT phrase easier to understand and faster to read while providing much the same information as the quoted Wikipedia passage (a point the study authors commented on specifically).
The Wikipedia sentence
leaves me wondering even now what the word "it" at the end of the sentence is meant to refer to. I suspect our sentence construction and word use are not helping us win friends. It's one thing when we are the only service available; it's another when there is a new kid on the block. Andreas JN466 13:56, 4 October 2023 (UTC)
Even if ChatGPT or its successor becomes the predominant internet search tool, that doesn't mean Wikipedia will be obsolete. It likely means that Wikipedia will go back to its theoretical origin as a reference work rather than the internet search tool many readers use it as. Thebiguglyalien (talk) 16:11, 4 October 2023 (UTC)
Ah, the rise of AI. I've used it to get ideas for small projects in the past, but people prefer LLMs over Wikipedia? That's, just... sad. The Master of Hedgehogs is back again! 22:09, 4 October 2023 (UTC)