Wikipedia images for machine learning; Experiment justifies Wikipedia's high search rankings

Recent research

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"Announcing WIT: A Wikipedia-Based Image-Text Dataset"

Reviewed by Bri and Tilman Bayer

Researchers from Google AI describe^{[supp 1]} a new dataset for machine learning^[1] composed of annotated images scraped from Wikipedia and intended for training a "multimodal visio-linguistic model". Among the datasets the authors compared it with, it is not quite the largest by number of images, but it has by far the largest amount of accompanying text, with more than 37M image-text associations. The text was derived from the article title and description, along with other contextual information and metadata such as image captions, alt text, and the title of the section in which an image appeared. Interestingly, "hate speech" articles were excluded from the dataset (exactly how was not defined), perhaps to head off future problems with machine learning bias.
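
As a rough illustration of what one such image-text association might look like, the sketch below models a single record holding the text sources listed above. The class name, field names, and sample values are hypothetical and are not taken from the actual WIT release.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImageTextRecord:
    """One image-text association (hypothetical fields, not the WIT schema)."""
    image_url: str
    article_title: str
    article_description: Optional[str] = None
    caption: Optional[str] = None
    alt_text: Optional[str] = None
    section_title: Optional[str] = None

    def text_fields(self) -> List[str]:
        """Return the non-empty text snippets associated with the image."""
        candidates = [
            self.article_title,
            self.article_description,
            self.caption,
            self.alt_text,
            self.section_title,
        ]
        return [text for text in candidates if text]


# Made-up example values, for illustration only.
record = ImageTextRecord(
    image_url="https://example.org/half_dome.jpg",
    article_title="Half Dome",
    caption="Half Dome seen from the valley floor",
    section_title="Geology",
)
print(record.text_fields())
```

Keeping the optional fields separate, rather than concatenating them up front, leaves it to a downstream model to decide which text sources to use for a given image.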

The Google AI researchers also announced that "we are hosting a competition with the WIT dataset in Kaggle in collaboration with Wikimedia Research and other external collaborators." In its own announcement,^{[supp 2]} the Wikimedia Foundation's research team explained that it is hosting this competition with the aim of "foster[ing] the development of systems that can automatically associate images with their corresponding image captions and article titles. [...] you will be providing open, reusable systems that could help thousands of editors improve the visual content of the largest online encyclopedia".

"Evolutionary pathways of articles"

Visualization of the evolutionary pathways of articles, from "Ecology of the digital world of Wikipedia"

Reviewed by Bri

In a new Scientific Reports paper titled "Ecology of the digital world of Wikipedia",^[2] the authors define two metrics, the "scatteredness" of editors and the "complexity" of articles, then use them to show how Wikipedia articles tend to improve over time. The metrics are defined in a recursive but computable way:

"...we define the scatteredness D_i of an editor i, as the harmonic sum of the article complexities he or she edits. The complexity of an article is then naturally defined as a harmonic sum of the scatteredness values of the editors who edited the article..."

When plotted against each other, then tracked over time, the data suggest an evolutionary "flow" in which articles trend toward greater quality during their life (shown in accompanying graphic).
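
The mutually recursive definition quoted above can be sketched as a fixed-point iteration over the editor-article network. This is only one possible reading: "harmonic sum" is interpreted here as a plain sum of reciprocals, and the rescaling to mean 1 after every step is an added assumption to keep the values bounded. The paper's exact formulas, weights, and normalization are not reproduced in this review, and the function names are hypothetical.

```python
def mean_normalize(values):
    """Rescale a dict of positive numbers so that their mean is 1."""
    mean = sum(values.values()) / len(values)
    return {key: value / mean for key, value in values.items()}


def scatteredness_and_complexity(edits, n_iter=100):
    """Iterate the two coupled definitions (assumed reading, see lead-in).

    edits maps each editor to the non-empty set of articles they edited.
    """
    if any(not articles for articles in edits.values()):
        raise ValueError("every editor must have edited at least one article")

    # Invert the mapping: which editors touched each article?
    editors_of = {}
    for editor, articles in edits.items():
        for article in articles:
            editors_of.setdefault(article, set()).add(editor)

    scatteredness = {editor: 1.0 for editor in edits}
    complexity = {article: 1.0 for article in editors_of}
    for _ in range(n_iter):
        # Assumption: "harmonic sum" = sum of reciprocals of the other quantity.
        new_s = {e: sum(1.0 / complexity[a] for a in arts)
                 for e, arts in edits.items()}
        new_c = {a: sum(1.0 / scatteredness[e] for e in eds)
                 for a, eds in editors_of.items()}
        scatteredness = mean_normalize(new_s)
        complexity = mean_normalize(new_c)
    return scatteredness, complexity


# Toy network with invented editors and articles.
toy_edits = {"ann": {"A", "B", "C"}, "bob": {"A"}, "cy": {"A", "B"}}
toy_s, toy_c = scatteredness_and_complexity(toy_edits)
print(toy_s)
print(toy_c)
```

In this toy network, an editor whose articles are a superset of another editor's always ends up with the higher scatteredness, since every term in the sum is positive. Whether the iteration converges on a real edit history is not addressed by this sketch.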

Experiment on the search engine DuckDuckGo reveals that "Wikipedia is ranked highly because people are looking for it"

Reviewed by Tilman Bayer

A blog post^[3] by the Wikimedia Foundation reports on the results of an experiment conducted in collaboration with the search site DuckDuckGo. The A/B test examined the effects of the presence or absence of "Information modules, also referred to as 'knowledge panels' or 'information boxes,' [which] are the boxes on search result pages, generally to the right of the blue links. They often include a short summary of information from Wikipedia alongside images, facts, and links to relevant websites, including Wikipedia". When Google introduced them in 2012, they soon gave rise to concerns that relieving (some) users of the need to click through to Wikipedia, by excerpting some of its information directly onto the search engine results page, might be "killing Wikipedia" (or at least substantially decreasing its pageviews, edits and donations), given that Wikipedia derives a large majority of its traffic from Google.

In contrast to these concerns, when the box was removed in the A/B test on DuckDuckGo, "95% of the clicks that would have gone to the Wikipedia information module instead went to Wikipedia blue links [in the standard search results list on the left]". Wikipedia's click-through rate (per SERP view) was actually higher when the information module was present (15.9%) than when it was missing (15.0%). "This indicates that the vast majority of people are not choosing Wikipedia just because it happens to be ranked high in Search and prominently in the information module but because they are explicitly looking for Wikipedia."
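
To put the two reported click-through rates in proportion, the arithmetic can be worked through as below. Only the two percentages come from the blog post; the variable names are illustrative, and because no sample sizes are quoted here, the calculation says nothing about statistical significance.

```python
# Wikipedia click-through rate per SERP view, as reported in the blog post.
ctr_with_module = 0.159     # information module shown
ctr_without_module = 0.150  # information module removed

absolute_diff = ctr_with_module - ctr_without_module
relative_lift = absolute_diff / ctr_without_module

print(f"absolute difference: {absolute_diff * 100:.1f} percentage points")
print(f"relative lift: {relative_lift:.1%}")
```

This works out to 0.9 percentage points, or a relative increase of about 6% when the module is present.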

This increase in click-through rates is not entirely surprising, given that the box usually contains at least one prominent additional link to Wikipedia. But it stands in stark contrast to the earlier fears that the box would decrease traffic. A 2017 study by McMahon, Johnson & Hecht^{[supp 3]} had actually observed a decrease when removing the box in a lab experiment. But as the coauthor of a follow-up study pointed out, "a big limitation of this kind of [lab] study is that researchers have to select 'important' queries. But this very recent collab study from Wikimedia + DuckDuckGo bypasses that limitation."

Besides the A/B test, which was conducted on users from the US and Germany, the Foundation also analyzed existing aggregate data from DuckDuckGo from these countries, finding among other results that "Wikipedia is the most common result across all DuckDuckGo searches. It shows up either as a module or one of the top five blue links in more than 15% of searches in the United States, more than any other website."

Alongside other results, the post concludes that

"Wikipedia is central to the success of Search, and, in turn, Search is core to how people find Wikipedia. Wikipedia is ranked highly because people are looking for it."

Briefly

See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer

"How are encyclopedias cited in academic research? Wikipedia, Britannica, Baidu Baike, and Scholarpedia"

From the abstract:^[4]

"This study investigates trends from 2002 to 2020 in citing two crowdsourced and two expert-based encyclopedias to investigate whether they fit differently into the research landscape: Wikipedia, Britannica, Baidu Baike, and Scholarpedia. [...] Scopus searches were used to count the number of documents citing the four encyclopedias in each year. Wikipedia was by far the most cited encyclopedia, with up to 1% of Scopus documents citing it in Computer Science. Citations to Wikipedia increased exponentially until 2010, then slowed down and started to decrease. Both the Britannica and Scholarpedia citation rates were increasing in 2020, however. Disciplinary and national differences include Britannica being popular in Arts and Humanities, Scholarpedia in Neuroscience, and Baidu Baike in Chinese-speaking countries/territories."

"What an Entangled Web We Weave: An Information-centric Approach to Time-evolving Socio-technical Systems"

From the abstract:^[5]

"... we applied a string matching function to the text associated with each Wikipedia revision entry. The matching function uses a regular expression to identify trigram noun phrases to match entities like 'The White House', 'Barack Hussein Obama' or 'Empire State Building' for example. In this situation Transcendental Information Cascades form a network of article edits, linked together by the shared trigrams found within the edit revision text. By enriching the article edits with contextual knowledge about article categories from DBpedia (http://dbpedia.org) it was possible to find that this cascade network represents meaningful article relationships not available within the explicit network of linked Wikipedia articles. [... For example,] a burst of activity was observed featuring a series of edits made within a short duration of time beginning with identifiers found in edits on the article about Edward Snowden. The cascade then branched out to span across many other articles incorporating various identifiers related to Edward Snowden's life. A detailed inspection of the time frame when the cascade emerged showed that it coincided with a presentation given by him at the SXSW conference. In other words, a relationship between an external phenomenon and a short, bursty cascade of edits within Wikipedia, which would not have been available to a more contextualized investigation, was uncovered using the method."

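The matching step described in the abstract above can be sketched roughly as follows. The paper's actual regular expression is not given here, so the pattern below, three consecutive capitalized words, is a stand-in assumption rather than a true noun-phrase detector. The function names and sample revisions are invented for illustration.

```python
import re

# Assumed stand-in pattern: three consecutive capitalized words.
TRIGRAM = re.compile(r"\b(?:[A-Z][a-z]+\s){2}[A-Z][a-z]+\b")


def extract_trigrams(text):
    """Return the set of capitalized three-word phrases found in the text."""
    return set(TRIGRAM.findall(text))


def cascade_links(revisions):
    """Link each edit to the previous edit that shared a trigram.

    revisions is an iterable of (timestamp, article, text) tuples.
    Returns a list of (earlier_article, later_article, trigram) links.
    """
    last_seen = {}  # trigram -> article of the most recent edit containing it
    links = []
    for _timestamp, article, text in sorted(revisions):
        for trigram in sorted(extract_trigrams(text)):
            if trigram in last_seen:
                links.append((last_seen[trigram], article, trigram))
            last_seen[trigram] = article
    return links


# Invented sample revisions, not real edit data.
sample_revisions = [
    (2, "SXSW", "added a talk by Edward Joseph Snowden"),
    (1, "Edward Snowden", "expanded the biography of Edward Joseph Snowden"),
]
sample_links = cascade_links(sample_revisions)
print(sample_links)
```

A pattern this simple also matches capitalized sentence openings and splits longer names after the third word, so it would need part-of-speech information to approximate real noun phrases.
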
"Individual-driven versus interaction-driven burstiness in human dynamics: The case of Wikipedia edit history"

From the abstract:^[6]

"In this paper we [are] analyzing the Wikipedia edit history to see how spontaneous individual editors are in initiating bursty periods of editing, i.e., individual-driven burstiness, and to what extent such editors' behaviors are driven by interaction with other editors in those periods, i.e., interaction-driven burstiness. We quantify the degree of initiative (DoI) of an editor of interest in each Wikipedia article by using the statistics of bursty periods containing the editor's edits. The integrated value of the DoI over all relevant timescales reveals which is dominant between individual-driven and interaction-driven burstiness. We empirically find that this value tends to be larger for weaker temporal correlations in the editor's editing behavior and/or stronger editorial correlations [...]"

"Framing and social information nudges" on German donation banners

From the abstract:^[7]

"We analyze a series of trials that randomly assigned Wikipedia users in Germany to different web banners soliciting donations. The trials varied framing or content of social information about how many other users are donating. Framing a given number of donors in a negative way increased donation rates. [e.g. "Our donation banner is viewed more than 20 million times a day, but only 115.000 people have donated so far" (negative) vs. "... Already 115.000 people have donated so far" (positive).] Variations in the communicated social information had no detectable effects. "

References

^ Srinivasan, Krishna; Raman, Karthik; Chen, Jiecao; Bendersky, Michael; Najork, Marc (2021-07-11). "WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21. New York, NY, USA: Association for Computing Machinery. pp. 2443–2449. doi:10.1145/3404835.3463257. ISBN 978-1-4503-8037-9.
^ Ogushi, Fumiko; Kertész, János; Kaski, Kimmo; Shimada, Takashi (2021-09-15). "Ecology of the digital world of Wikipedia". Scientific Reports. 11 (1): 18371. doi:10.1038/s41598-021-97755-w. ISSN 2045-2322.
^ Johnson, Isaac; Perry, Nicholas; Gordon, Kinneret; Katz, Jon (2021-09-23). "Searching for Wikipedia: DuckDuckGo and the Wikimedia Foundation share new research on how people use search engines to get to Wikipedia". Diff.
^ Li, Xuemei; Thelwall, Mike; Mohammadi, Ehsan (2021-09-09). "How are encyclopedias cited in academic research? Wikipedia, Britannica, Baidu Baike, and Scholarpedia". Profesional de la Información. 30 (5). doi:10.3145/epi.2021.sep.08. ISSN 1699-2407.
^ Luczak-Roesch, Markus; O'Hara, Kieron; Dinneen, Jesse David; Tinati, Ramine (2018-12-01). "What an Entangled Web We Weave: An Information-centric Approach to Time-evolving Socio-technical Systems". Minds and Machines. 28 (4): 709–733. doi:10.1007/s11023-018-9478-1. ISSN 0924-6495. Freely available preprint version: Luczak-Roesch, Markus; O'Hara, Kieron; Dinneen, Jesse David; Tinati, Ramine (2018-04-15). "What an entangled Web we weave: An information-centric approach to time-evolving socio-technical systems". PeerJ Preprints.
^ Choi, Jeehye; Hiraoka, Takayuki; Jo, Hang-Hyun (2021-07-26). "Individual-driven versus interaction-driven burstiness in human dynamics: The case of Wikipedia edit history". Physical Review E. 104 (1): 014312. doi:10.1103/PhysRevE.104.014312.
^ Linek, Maximilian; Traxler, Christian (2021-08-01). "Framing and social information nudges at Wikipedia". Journal of Economic Behavior & Organization. 188: 1269–1279. doi:10.1016/j.jebo.2021.06.033. ISSN 0167-2681.

Supplementary references and notes:

^ Srinivasan, Krishna; Raman, Karthik (September 21, 2021), "Announcing WIT: A Wikipedia-Based Image-Text Dataset", Google AI blog, Google Research
^ https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/
^ McMahon, Connor; Johnson, Isaac; Hecht, Brent (2017). "The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies". Eleventh International AAAI Conference on Web and Social Media. AAAI. pp. 142–151.

26 September 2021 (all comments)

Discuss this story

"Wikipedia is ranked highly because people are looking for it" - in other news, water is wet 😜. Seriously though, it is interesting (but not surprising) that many "high-level" (if that term makes sense) researchers use Wikipedia in their research. I do understand that the information boxes by different search engines do help make things easier for people (and I myself do skip reading the actual article if I get what I wanted from the box), but somehow, I can't get rid of the feeling that "search engines not having to pay the WMF for information to use in information boxes is ethically wrong" though I know that all information here is under CC. Tube·of·Light 04:12, 27 September 2021 (UTC)
- Sample bias seems an obvious possible flaw, as DuckDuckGo is one of the less used Web browsers. Its users, being few, are surely unusual in some ways, and their attitude towards the usefulness of Wikipedia might be one of those ways. Jim.henderson (talk) 15:48, 27 September 2021 (UTC)
  - True that. I wonder if Google would be willing to modify their search engine to hide the information box for a couple of days and let us know how big the impact was (but then again, there is no way I am going to ask them to do so). And just to let you know, DuckDuckGo is a search engine (a website that gives search results like Google does), not a web browser (a program like Chrome, that lets you access web pages) Tube·of·Light 02:58, 28 September 2021 (UTC)
    - @Tube of Light: Google conducts studies like that all the time, using their search page. They'd be bonkers not to, considering their search-request logs are one of the greatest troves of population behavioral data ever amassed, and it's right there at their fingertips. You rightly frame the $100,000 question, though: would they be inclined to share the results with us (or anyone else)?

They're certainly under no obligation to, of course. Though I know they do either directly conduct, or authorize others to perform, research into the (presumably-anonymized, possibly aggregated) trending for certain search queries. Which is how we (the global "we") know, for instance, that Google can reliably predict (or at least detect) regional flu outbreaks by watching for an uptick in the frequency of certain search terms employed by multiple users in close geographic proximity.

I suspect any A/B testing they do on things like infoboxes is purely marketing-driven, though, and geared only towards determining which search features maximize their ad revenue. (In fact we'd better hope that the same studies that find infoboxes driving clicks through to Wikipedia also determine that they increase search engagement or return visits, because we know that driving traffic here isn't really a profit motive for Google.) -- FeRDNYC (talk) 14:30, 20 October 2021 (UTC)

"Individual-driven versus interaction-driven burstiness in human dynamics: The case of Wikipedia edit history": I have tried so hard to understand this article but it feels like it is missing the part where it actually states how their math addresses their core question. The key sentences seem to be: "The large value of AUC for an article-ego pair implies the dominance of individual-driven burstiness over interaction-driven burstiness and vice versa. By correlating the AUC value with several measures for temporal and editorial correlations, we find the tendency of the AUC values to be larger for weaker (stronger) temporal correlations of the ego (the alters) and/or stronger editorial correlations in the edit sequences." If anyone is able to figure out what this means, I would be grateful to know.~ L 🌸 (talk) 04:33, 27 September 2021 (UTC)
- This might be helpful: receiver operating characteristic. Basically AUC is a measure of how well a mathematical model classifies a group into some yes/no scheme based on some presumed characteristics. MER-C 17:03, 27 September 2021 (UTC)
  - AUC = area under the curve. More (higher values) is better for a receiver operating characteristic, if the classifier is working right. ☆ Bri (talk) 19:20, 27 September 2021 (UTC)
    - Thank you! That helps a little with the first half of the sentence. And I can tell that the bit in parentheses is offering an alternative. So it is something like, "We find a better classifier fit for individual-driven burstiness... for weaker temporal correlations of the ego and/or stronger editorial correlations in the edit sequences." Still not entirely sure what the implications of that are, but willing to let it go... ~ L 🌸 (talk) 04:42, 1 October 2021 (UTC)
"Wikipedia is ranked highly because people are looking for it" - very interesting writeup. I remember the concern that the knowledge boxes would decrease click-through to Wikipedia - good to see a solid A/B test confirming otherwise. Ganesha811 (talk) 15:55, 27 September 2021 (UTC)
