A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A paper titled "The Roles Bots Play in Wikipedia", published in Proceedings of the ACM on Human-Computer Interaction by five researchers from the Stevens Institute of Technology,[1] was presented at this month's CSCW conference. Bots are a core component of English Wikipedia, and account for approximately 10 percent of all edits as of 2019. After retrieving all 1,601 registered bots (as of 28 February 2019), the researchers used a procedure involving machine learning to organise them into a taxonomy with nine key "roles":
Some bots act in several roles (e.g. AnomieBOT as Tagger, Clerk and Archiver).
The last part of the paper concerns the impact of bots on the new editors they interact with. Extending previous research that had found increased retention for newbies who were invited to the Teahouse support space by HostBot, an "Advisor" bot, the researchers show that other Advisor bots have a significant positive effect as well (although for the example cited, SuggestBot, they might have mentioned a confounding factor: users need to opt in to receive its messages). Likewise confirming previous research, messages from ClueBot NG were found to have a negative effect, but this wasn't the case for other "Protector" bots: "The newcomers seem to not care about the bot signing their comments (SineBot) and are even positively influenced by the bot reverting their added links that violate Wikipedia’s copyright policy (XLinkBot)."
A press release, titled "Rise of the bots: Team completes first census of Wikipedia bots", quoted one of the authors as saying "People don't mind being criticized by bots, as long as they're polite about it. Wikipedia's transparency and feedback mechanisms help people to accept bots as legitimate members of the community."
The authors note the relevance of Wikidata to their study, where the proportion of bot edits "has reached 88%" (citing a 2014 paper), and find that the move of interlanguage link information to Wikidata led to a decrease in "Connector" bot activity on Wikipedia. At last year's CSCW, a paper titled "Bot Detection in Wikidata Using Behavioral and Other Informal Cues"[2] had presented a machine learning approach for identifying undeclared bot edits, showing that "in some cases, unflagged bot activities can significantly misrepresent human behavior in analyses". In the present study about Wikipedia, it would have been interesting to read whether the authors see any limitations in the data source they used (Category: All Wikipedia bots).
A paper in PLoS Biology[3] uses Wikipedia pageview data for "the first broad exploration of seasonal patterns of interest in nature across many species and cultures". Specifically, the researchers looked at the traffic for articles about 31,751 different species across 245 Wikipedia language editions. They found "that seasonality plays an important role in how and when people interact with plants and animals online. ... Pageview seasonality varies across taxonomic clades in ways that reflect observable patterns in phenology, with groups such as insects and flowering plants having higher seasonality than mammals. Differences between Wikipedia language editions are significant; pages in languages spoken at higher latitudes exhibit greater seasonality overall, and species seldom show the same pattern across multiple language editions." Seasonality was often found to "clearly correspond with phenological patterns (e.g., bird migration or breeding...)", but in other cases also to human-made events such as annual holidays. For example, traffic for the English Wikipedia's article on the wild turkey (Meleagris gallopavo) spiked during Thanksgiving in the US, and saw a softer peak during "the spring hunting season for wild turkey in many US states."
Overall, articles about plants and animals exhibited seasonality much more often than articles about other topics. (Concretely, 20.2% of the species articles in the dataset were found to have seasonal traffic, compared to 6.51% in a random selection of nonspecies articles. One quarter of species had a seasonal article in at least one language. Technically, seasonality was determined via a method that involved, among other steps, fitting the pageviews time series to a sinusoidal model with one or two annual peaks, using a manually defined threshold.)
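The sinusoidal-fit step mentioned above can be illustrated with a minimal sketch. This is not the paper's procedure: it assumes monthly pageview totals covering a whole number of years, projects them onto one- and two-peak annual harmonics, and compares the explained variance with a cutoff. The function names and the threshold of 0.5 are hypothetical.

```python
import math

def harmonic_fit(series, period=12, harmonics=(1, 2)):
    """Least-squares fit of annual sine/cosine harmonics to an evenly sampled series.

    Assumes len(series) is a whole number of periods, so the sine/cosine
    basis is orthogonal and each coefficient is a simple projection.
    Returns (amplitude per harmonic, fraction of variance explained).
    """
    n = len(series)
    if n == 0 or n % period != 0:
        raise ValueError("series must cover a whole number of periods")
    if not all(0 < k < period / 2 for k in harmonics):
        raise ValueError("harmonics must lie strictly below the Nyquist limit")
    mean = sum(series) / n
    centered = [y - mean for y in series]
    total = sum(c * c for c in centered)
    amplitudes = {}
    explained = 0.0
    for k in harmonics:
        a = 2 / n * sum(c * math.cos(2 * math.pi * k * t / period)
                        for t, c in enumerate(centered))
        b = 2 / n * sum(c * math.sin(2 * math.pi * k * t / period)
                        for t, c in enumerate(centered))
        amplitudes[k] = math.hypot(a, b)
        explained += n / 2 * (a * a + b * b)  # variance carried by harmonic k
    r2 = explained / total if total > 0 else 0.0
    return amplitudes, r2

def is_seasonal(series, threshold=0.5):
    """Hypothetical rule: seasonal if the harmonics explain enough variance."""
    _, r2 = harmonic_fit(series)
    return r2 >= threshold

# Three years of synthetic monthly pageviews with a single annual peak.
views = [100 + 30 * math.cos(2 * math.pi * t / 12) for t in range(36)]
amplitudes, r2 = harmonic_fit(views)
```

Harmonic 1 corresponds to a single annual peak and harmonic 2 to two peaks per year; a real analysis would also need to handle noise, trends, and gaps in the data.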
See also earlier coverage of a related paper involving some of the same authors: "Using Wikipedia page views to explore the cultural importance of global reptiles"
"How Does Editor Interaction Help Build the Spanish Wikipedia?" by Taryn Bipat, Diana Victoria Davidson, Melissa Guadarrama, Nancy Li, Ryder Black, David W. McDonald, and Mark Zachry of the University of Washington, published in the 2019 CSCW Companion, examines talk page discussions in Spanish Wikipedia with a specific eye to how they might differ from the types of interactions in English Wikipedia.[4] It replicates work from ACM GROUP 2007 that had developed a classification scheme for how editors use policy to discuss article changes.[supp 1]
This is a short paper, so it does not have the depth of work you would expect in a full-length conference paper. The authors select 38 talk pages from Spanish Wikipedia (presumably using the methods from the original work, which focused specifically on talk pages that saw high levels of conversation over the course of a month) and code them based on how often, and in what context, policies are linked to. The contextual codes applied are: "article scope", "legitimacy of source", "prior consensus", "power of interpretation", "threat of sanction", "practice on other pages", and "legitimacy of contributor". They find that "power of interpretation" and "article scope" are the most-used strategies, followed by "legitimacy of source". They also find a number of examples of editors linking to English Wikipedia pages.
While I would love to see a more robust analysis comparing English and Spanish talk pages that were sampled with the same strategy and from the same time periods, this work is an example of much-needed analyses of how the frameworks and models that are designed for one language community do or do not apply to other language communities. It would be fascinating to further understand the degree to which editors who are active across multiple languages adapt their discussion strategies to the local community versus apply similar strategies across all communities.
In this article,[5] three researchers from China present "a system dynamic model of Wikipedia based on the co-evolution theory, and [investigate] the interrelationships among topic popularity, group size, collaborative conflict, coordination mechanism, and information quality by using the vector error correction model (VECM)."
These five factors ("PSCCQ") are each represented by a monthly time series:
In the paper, they are analyzed for the English Wikipedia's article on global warming, for the timespan of February 2004 to November 2015. First, the researchers apply Granger causality tests to identify which of the five variables tend to predict which, resulting in the depicted graph. E.g. popularity is predicted by coordination (number of talk page discussions, as the only factor in this case), indicating perhaps that Wikipedia editors tend to start debating new information about global warming before the general public takes it as an occasion to look up global warming on Google. Furthermore, the authors calculate the impulse response functions for each of the 20 possible ordered pairs of variables. In the above example, this indicates how the popularity measure tends to "react" to a given increase in coordination. The application of a third technique, forecast error variance decomposition, further corroborates the results about how the five variables relate to each other.
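To make the idea of a Granger causality test concrete, here is a deliberately simplified sketch. It is not the paper's VECM-based procedure: it uses a single lag and two variables at a time, and asks whether adding the lagged "cause" series reduces the error of a lag-1 autoregression of the "effect" series. The factor labels, function names, and synthetic data are illustrative assumptions. The sketch also shows where the count of 20 comes from: five variables give 5 × 4 ordered (cause, effect) pairs.

```python
import random
from itertools import permutations

# Hypothetical short labels for the five "PSCCQ" factors.
FACTORS = ["popularity", "size", "conflict", "coordination", "quality"]
ORDERED_PAIRS = list(permutations(FACTORS, 2))  # 5 * 4 = 20 (cause, effect) pairs

def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def _rss(rows, targets):
    """Residual sum of squares of an ordinary least-squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, targets))

def granger_f(cause, effect):
    """F statistic: does lagged `cause` improve a lag-1 autoregression of `effect`?"""
    y = effect[1:]
    steps = range(len(effect) - 1)
    restricted = [[1.0, effect[t]] for t in steps]        # intercept + own lag
    full = [[1.0, effect[t], cause[t]] for t in steps]    # ... + lagged cause
    rss_r, rss_f = _rss(restricted, y), _rss(full, y)
    dof = len(y) - 3  # observations minus parameters of the full model
    return (rss_r - rss_f) / (rss_f / dof)

# Synthetic example in which x leads y by one step.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
```

A large value of `granger_f(x, y)` relative to `granger_f(y, x)` suggests that x helps predict y but not the reverse. In practice one would compare the statistic against an F distribution and choose the lag order with care.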
The study presents two quite far-reaching takeaways from the relations it identified between the five factors:
An obvious limitation of this research, only somewhat coyly mentioned in the paper, is its restriction to a single article (and only one Wikipedia language version). While an effort is made to justify the choice of global warming as a high-traffic page with a substantial number of controversies, it remains unclear how much the takeaways can be generalized.
See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
This paper[6] found that on English Wikipedia talk pages, about 22% more uncivil messages originate from impacted regions on the Mondays following the shift to daylight saving time.
From the abstract:[7]
"We analyze the relationship between the structural properties of WikiProject coeditor networks and the performance and efficiency of those projects. We confirm the existence of an overall performance-efficiency trade-off, while observing that some projects are higher than others in both performance and efficiency, suggesting the existence factors correlating positively with both. [...] Our results suggest possible benefits to decentralized collaborations made of smaller, more tightly-knit teams, and that these benefits may be modulated by the particular learning strategies in use."
From the abstract:[8]
"... we use [the] existing Web Traffic Time Series Forecasting dataset by Google to predict future traffic of Wikipedia articles. [...] we built a time-series model that utilizes RNN seq2seq mode [sic]. We then investigate the use of symmetric mean absolute percentage error (SMAPE) for measuring the overall performance and accuracy of the developed model. Finally, we compare the outcome of our developed model to existing ones to determine the effectiveness of our proposed method in predicting future traffic of Wikipedia articles."
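For readers unfamiliar with the metric, SMAPE can be computed as in the sketch below. Several variants of the formula exist. This one uses the common form with a 0 to 200 range and counts a term as zero when both the actual and the forecast value are zero. Whether the paper uses exactly this variant is an assumption.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (range 0 to 200).

    Assumed convention: a term contributes 0 when actual and forecast are both 0.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        if denom:
            total += 2 * abs(f - a) / denom
    return 100 * total / len(actual)
```

Unlike the plain mean absolute percentage error, this form stays defined for articles with zero pageviews on a given day, which makes it convenient for traffic data.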
From the abstract:[9]
"we propose a new, fast and scalable method for anomaly detection in large time-evolving graphs. It may be a static graph with dynamic node attributes (e.g. time-series), or a graph evolving in time, such as a temporal network. We define an anomaly as a localized increase in temporal activity in a cluster of nodes. [...] To demonstrate [our approach's] efficiency, we apply it to two datasets: Enron Email dataset and Wikipedia page views. We show that the anomalous spikes are triggered by the real-world events that impact the network dynamics. Besides, the structure of the clusters and the analysis of the time evolution associated with the detected events reveals interesting facts on how humans interact, exchange and search for information ..."
This paper from CSCW 2017[10] "replicates, extends, and refutes conclusions" of a paper by Yasseri et al. that had received wide and prolonged media attention for its claims that Wikipedia bots are fighting each other (cf. previous review: "Wikipedia bot wars capture the imagination of the popular press - but are they real?").
From the abstract:[11]
"We propose the construction of a Digital Knowledge Economy Index, quantified by way of measuring content creation and participation through digital platforms, namely the code sharing platform GitHub, the crowdsourced encyclopaedia Wikipedia, and Internet domain registrations and estimating a fifth sub-index for the World Bank Knowledge Economy Index for [the] year 2012."
From the abstract:[12]
"This paper will discuss a technical solution [...] for faster linking across databases with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node, high performance computing cluster made up of about 192 threads with 12 TB of storage and 288 GB memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 min. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method resulted in adding 119,957 Wikidata links in GloBI..."
From the abstract:[13]
"We study aggregated clickstream data for articles on the English Wikipedia in the form of a weighted, directed navigational network. We introduce two parameters that describe how articles act to source and spread traffic through the network, based on their in/out strength and entropy. From these, we construct a navigational phase space where different article types occupy different, distinct regions, indicating how the structure of information online has differential effects on patterns of navigation. Finally, we go on to suggest applications for this analysis in identifying and correcting deficiencies in the Wikipedia page network that may also be adapted to more general information networks."
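The abstract does not spell out the two parameters, so the following sketch only computes the raw ingredients it names: per-article in-strength, out-strength, and the entropy of outgoing traffic in a weighted, directed network. The edge format (source, target, clicks) and the function name are assumptions made for illustration.

```python
import math
from collections import defaultdict

def node_stats(edges):
    """Compute in/out strength and out-link entropy for each article.

    `edges` is an iterable of (source, target, clicks) tuples.
    """
    out_weights = defaultdict(dict)
    in_strength = defaultdict(float)
    for source, target, clicks in edges:
        out_weights[source][target] = out_weights[source].get(target, 0) + clicks
        in_strength[target] += clicks
    stats = {}
    for node in set(out_weights) | set(in_strength):
        weights = list(out_weights.get(node, {}).values())
        out_strength = sum(weights)
        if out_strength:
            # Shannon entropy (bits) of where outgoing clicks go.
            entropy = -sum(w / out_strength * math.log2(w / out_strength)
                           for w in weights if w > 0)
        else:
            entropy = 0.0
        stats[node] = {
            "in_strength": in_strength.get(node, 0.0),
            "out_strength": out_strength,
            "out_entropy": entropy,
        }
    return stats

# Toy network: A splits its traffic evenly, B sends everything to C.
stats = node_stats([("A", "B", 30), ("A", "C", 30), ("B", "C", 10)])
```

An article with high out-strength and high entropy spreads readers across many links, while one with high in-strength and no outgoing clicks acts as an endpoint of navigation.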
This paper[14] aims to understand two paradigms of information seeking in Wikipedia: search by formulating a query, and navigation by following hyperlinks.
Discuss this story
@HaeB and Miriam (WMF): your "aggregated clickstream data" link for the Gildersleve and Yasseri "Inspiration..." paper is broken, and does not appear to be in the paper itself. EllenCT (talk) 00:49, 30 November 2019 (UTC)