The Signpost

Recent research

The five barriers that impede "stitching" collaboration between Commons and Wikipedia


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


"Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration"

The vast majority of academic research about Wikimedia projects continues to focus on Wikipedia and (in recent years) Wikidata. Publications about sister projects such as Wiktionary, Wikinews or Wikibooks exist, but are rare. A paper titled "Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration"[1] is one of the very first social science publications to examine Wikimedia Commons (albeit still in tandem with Wikipedia). That's despite Commons being, as the authors highlight, "the world’s largest online repository of free multimedia files for anyone to contribute and use. To date, there are more than 10.5 million volunteers and over 77 million media files on Commons."

The term "stitching" in the paper's title refers to an existing concept from the field of CSCW (Computer-supported cooperative work). The authors define it as follows:

Stitching is a framework that has been used to help describe and characterize cross-platform work to build organizations and also build awareness of topical content. There are three processes of stitching, production, curation and dynamic integration that enable resources to be distributed and utilized across different technical platforms and social networks.

The paper examines in detail how these three cross-platform processes work in the case of (English) Wikipedia and Wikimedia Commons, considering Commons' role in hosting images and other media used on Wikipedia (and other Wikimedia projects). It is based on an interview study with 32 Wikimedians working on both projects – from newcomers (<1k edits) to "highly active editors" – five of whom self-identified as women. Among many other examples of such Commons-Wikipedia stitching, the authors describe e.g. cropping or retouching Commons images to make them more suitable for use on Wikipedia, or aligning Commons categories with Wikipedia article names. (These contrast with activities that focus on only one of the projects – such as text editing on Wikipedia, and image uploading, image annotating, metadata tagging and categorizing on Commons. Regarding the latter, the authors observe "a large group of Commons focused editors who categorize images. Categories is 'the primary way to organize and find files on Commons'", quoting one of the interviewees.)

While much of this will come as no surprise to Wikimedians familiar with both projects, the paper's second research question provides food for thought to both the involved volunteer communities and the Wikimedia Foundation (or other actors interested in designing better collaboration features in this area). Here, the authors identify five "barriers that inhibit effective stitching between Wikipedia and Wikimedia Commons", and propose some "design implications" that could mitigate them:

"Barrier 1: Lack of Communication Across Networks"

The paper observes the existence of

networks of photographers focused on producing images of different subjects, a network of Commons admins that handle copyright issues, a network of categorizers that work to sort pictures on Commons, and networks of Wikipedia editors who write articles in specific subject areas. These micro-networks establish their own ways of communicating and organizing their activities. However, participants argued that there was an absence of communication between these distributed micro-networks. For example, there was no formal way for Wikipedia editors and Commons curators to discuss the imagery needs of Wikipedia articles [...] Participants found it hard to engage in discussions held in other networks to understand their goals and practices [...] Though Commons curators produce images with an intent to support Wikimedia projects and Wikipedia editors rely on the images to illustrate articles, the communication channels between micro-networks and across the platforms are hard to find.

The authors (perhaps wisely) don't propose concrete solutions for this issue, but rather list a few "[p]rior studies in CSCW/HCI [human-computer interaction] [that] investigated similar situations in which stakeholders of a design problem were distributed", and suggest that "WMF could explore these approaches to engage editors distributed across platforms in a participatory design process to address their communication needs."


"Barrier 2: Differing Perspectives"

Out of over 22,000 images of a Boeing 777 on Commons, Wikipedians have selected this one as the current lead illustration for the article Boeing 777

Here, the paper discusses tensions arising from the differing self-perception of each project – Wikipedia as "reference" work vs. Wikimedia Commons as "collection". This manifests e.g. in the question of whether Commons should primarily be seen as a media repository in itself, or as infrastructure for other Wikimedia projects. Specifically, the authors note debates on whether it should aim to host more images of a subject than could conceivably be needed to illustrate pages on other projects: the paper opens with the example of a Wikipedia editor looking to illustrate the article on the Boeing 777 airplane and getting overwhelmed with the search results on Commons: "22,572 images for Boeing 777 with 5,686 categories and multiple pages created by different curators who work to sort images." (As a counterargument illustrating "the difficulties of judging the utility of Commons resources as a function of their use in other WMF platforms", another interviewee mentioned the example of a particular Boeing 777 becoming notable after an accident: "And suddenly, that photograph of that aeroplane we were hosting on Commons appeared in newspapers and all over the place, because it was the only freely available photograph they could find of that exact aircraft.")

As a solution to such issues, the authors (somewhat vaguely) suggest "a process similar to Wikipedia's notability voting. The process would enable editors from both platforms to figure out whether an image warrants significance in any contexts collaboratively, rather than relying on judgement of editors from one platform or the other."

"Barrier 3: Multilingual Resources"

The authors note that Commons is multilingual in theory (with many documentation pages, templates etc. being translatable and available in multiple languages), but in practice mostly "produced and curated by English speakers". In particular, they call out the limitations of the search function:

One problem is that the search engine of Commons is key-word based and is not capable of searching 'in the middle of all the languages.' [...] This issue severely impacted participants from other language editions of Wikipedia who have limited or no English proficiency. They would find 'so little of Wikimedia Commons' was available for them to search and use in their own language. [...] This barrier is not just a one-way street, it impacts English-speaking contributors as well. It is difficult for English speakers to find materials about non-English speaking countries because most of the related content was produced and curated in the language spoken in the respective country [...]

The paper remarks that the WMF-led "Structured Data on Commons" project (launched in 2017 with a $3 million grant from the Alfred P. Sloan Foundation) aims to improve this by incorporating multilingual information from Wikidata, but that it has "made little progress on Commons because many contributors simply did not know about it or did not care", or "preferred their 'own' [category-based] system over a new structure designed by the foundation". (Here it is worth noting that the study's interviews took place from April 2020 to January 2021, i.e. shortly before the default search interface was switched to "Media search", which is supposed to eventually integrate such structured information. However, as of this time – August 2023 – it retains the same limitations of text-based search.) As a possible way out of this conundrum, the authors suggest:

"One potential solution is for the foundation to investigate ways to incorporate Commons existing categories into the Structured Data Project"

"Barrier 4: Cross-Platform Vandalism"

This issue mainly refers to the problem that vandals can overwrite an image on Commons to affect articles on Wikipedia, which is difficult to detect for Wikipedia editors using their existing monitoring processes. And on the other side, "Though Wikimedia Commons can track and detect when an image is overwritten, it is hard to evaluate the legitimacy of the overwrite because the context of reuse is unknown."

The authors note that this is partly a technical issue, as cross-platform notifications could be implemented to alert Wikipedians of such incidents. However, they argue that "Even if these notifications existed, these platforms would need to collaborate on addressing the problem. In the general case, resolving barriers will require technical and social collaboration across or between platforms."
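To make the idea of such a notification more concrete, here is a minimal sketch. It is not taken from the paper and does not describe any existing MediaWiki feature. The `Upload` record, the `usage` mapping and the `overwrite_alerts` function are all hypothetical names, and the rule used (flag any later version uploaded by someone other than the original uploader) is only an assumed heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upload:
    """One uploaded version of a Commons file (hypothetical record)."""
    file: str
    uploader: str
    timestamp: str  # ISO 8601 in UTC, so string order equals time order


def overwrite_alerts(uploads, usage):
    """Return (article, file, uploader) alerts for overwritten files.

    uploads: iterable of Upload records, in any order.
    usage: dict mapping a file name to the articles that embed it.
    Assumed heuristic: an overwrite is any version after the first
    whose uploader differs from the uploader of the first version.
    """
    by_file = {}
    for up in sorted(uploads, key=lambda u: u.timestamp):
        by_file.setdefault(up.file, []).append(up)

    alerts = []
    for file, versions in by_file.items():
        original = versions[0]
        for later in versions[1:]:
            if later.uploader != original.uploader:
                # One alert per article that currently uses the file.
                for article in usage.get(file, []):
                    alerts.append((article, file, later.uploader))
    return alerts


# Made-up example data, for illustration only.
uploads = [
    Upload("Example.jpg", "Alice", "2020-01-01T00:00:00Z"),
    Upload("Example.jpg", "Mallory", "2020-06-01T00:00:00Z"),
    Upload("Other.jpg", "Bob", "2020-02-01T00:00:00Z"),
]
usage = {"Example.jpg": ["Boeing 777"], "Other.jpg": ["Forest"]}
print(overwrite_alerts(uploads, usage))
```

A flag of this kind only says that an overwrite happened. Whether it was legitimate (for example a better crop) would still have to be judged by editors who know how the image is used, which is the social half of the problem the authors point to.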

"Barrier 5: Differing Policies"

Here, the paper gives two examples. The first one is about copyright:

One key misalignment between Commons and Wikipedia is how copyright is treated. Commons implements a "Precautionary principle" which states that "where there is significant doubt about the freedom of a particular file, it should be deleted." This delete first and discuss it later approach is in contrast to Wikipedias’ "Assume Good Faith" policy that encourages discussion first.

(Contrary to what the authors appear to imply here, the "Assume Good Faith" policy specifically clarifies that "When dealing with possible copyright violations, good faith means assuming that editors intend to comply with site policy and the law. That is different from assuming they have actually complied with either. Editors have a proactive obligation to document image uploads, etc. [...]")

As a second "misalignment", the authors call out "the differences between Wikipedia and Commons reliance on data sources":

The practice on Wikipedia is to “citing sources”, and in particular, “reliable sources” all in the service of making statements “verifiable”. Media resources on Commons do not need to satisfy all of these standards and there is no judgement as to the validity or correctness of the media artifact. From one perspective similar versions of something like a map or a deep fake image, might have high utility when contrasted with alternate versions [...] Given this generally inclusionary standard, Commons contributors sometimes produce images without including information about the sources of the data as part of the content metadata. Without this key metadata, media is then suspect under Wikipedia’s stricter policies [...]

Here, the paper doesn't offer solutions, apart from the already mentioned general proposals to improve cross-platform and cross-network communications.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Evolution of the Coordination of Activities Aimed at Building Knowledge in the Wikipedia Community"

From the abstract:[2]

"This paper aims to characterize the variability in creating new concepts of cooperation in selected language versions of Wikipedia and identify the factors of participating in various forms of cooperation. [...] The research conducted was both qualitative and quantitative. A netnographic approach was used, as well as a statistical analysis of user activity records. Thanks to the netnographic research, the stages of Wikipedia’s evolution were identified. Quantitative research has shown a correlation between the number of activity areas (a user’s affiliation to WikiProjects) and their overall activity (the number of edits made). A change in Wikipedians’ activity style was also observed depending on their seniority on the website."

(See also other recent publications co-authored by the same author: "Power Distance and Hierarchization in Organizing Virtual Knowledge Sharing in Wikipedia", "Wikipedia as a Space for Collective and Individualistic Knowledge Sharing")


"Quantifying the scientific revolution" using Wikipedia

From the abstract:[3]

"The Scientific Revolution represents a turning point in the history of humanity. Yet it remains ill-understood, partly because of a lack of quantification. Here, we leverage large datasets of individual biographies (N = 22,943) and present the first estimates of scientific production during the late medieval and early modern period (1300–1850). [...] Finally, we investigate the interplay between economic development and cultural transmission (the so-called ‘Republic of Letters’) using partially observed Markov models imported from population biology. Surprisingly, the role of horizontal transmission (from one country to another) seems to have been marginal. Beyond the case of science, our results suggest that economic development is an important factor in the evolution of aspects of human culture."

From the paper:

"[...] we gathered the Wikipedia pages of all individuals classified as scientists during the early modern period: mathematicians, astronomers, physicists, biologists, naturalists, chemists, botanists, entomologists and zoologists (see Table S1). Then, we estimated the scientific contribution of each of these 6620 individuals through different proxies (the size of the page, the number of translations in other languages and the number of Wikipedia pages containing a link to this page). With such a large dataset, we can go beyond key figures such as Newton and Galileo, and take into account the thousands of little-known individuals who contributed to the rise of science "


"Comparison of metrics for measuring Wikipedia ecology: characteristics of self-consistent metrics for editor scatteredness and article complexity"

From the abstract:[4]

"[...] To measure the quality of the editors and articles on Wikipedia, self-consistent metrics for the network defined by the edit relationship have been introduced previously. This scatteredness–complexity measure can evaluate the editors and articles more sensitively than the local characteristics such as degrees of the network and capture well the editors’ activity and the articles’ level of complexity. [...] In addition, the distributions of the editor scatteredness and article complexity become smoother when the network is randomized and loses its detailed local structure eliminating the correlation between the editors and articles. When the degree distributions of the editors or articles are changed and become uniform in the randomized network, the distributions of the editor scatteredness or article complexity become flatter, respectively. This results suggest that the scatteredness–complexity measure reflects not only the degree distribution of the editors or articles but also the local network structure."


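"Why a Guideline for Spoilers? A comparison between Spoiler Guidelines, related user comments of Wikis, Newssites and Fan Forums"

This master's thesis includes a chapter about discussions on English Wikipedia. From the abstract:[5]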

The aim of this thesis was [...] to understand Spoiler Avoidance (SA) from an Information Avoidance (IA) view, treating it as an example of beneficial IA. [...] a number of Guidelines [about] spoilers were collected from Reddit, Fandom, multiple newssites, Wikipedia and Google. Results were found for multiple levels of abstraction. Firstly, spoiler guidelines exist due to difficulties in defining spoilers, different aims of websites and the different desires for users. Secondly it could be found that SA was not always assumed to be positive, but can be explained through many IA-theories. [...]


References

  1. ^ Yu, Yihan; McDonald, David W. (2022-11-11). "Unpacking Stitching between Wikipedia and Wikimedia Commons: Barriers to Cross-Platform Collaboration". Proceedings of the ACM on Human-Computer Interaction. 6 (CSCW2): 346:1–346:35. doi:10.1145/3555766.
  2. ^ Skolik, Sebastian (2022-08-25). "Evolution of the Coordination of Activities Aimed at Building Knowledge in the Wikipedia Community". European Conference on Knowledge Management. 23 (2): 1088–1096. doi:10.34190/eckm.23.2.569. ISSN 2048-8971.
  3. ^ Courson, Benoît de; Thouzeau, Valentin; Baumard, Nicolas (2023-04-13). "Quantifying the scientific revolution". Evolutionary Human Sciences. 5: –19. doi:10.1017/ehs.2023.6. ISSN 2513-843X.
  4. ^ Ogushi, Fumiko; Shimada, Takashi (2023-02-01). "Comparison of metrics for measuring Wikipedia ecology: characteristics of self-consistent metrics for editor scatteredness and article complexity". Artificial Life and Robotics. 28 (1): 62–66. doi:10.1007/s10015-022-00819-x. ISSN 1614-7456.
  5. ^ Klaus, Jan Christopher (2021-12-10). "Why a Guideline for Spoilers? A comparison between Spoiler Guidelines, related user comments of Wikis, Newssites and Fan Forums". Humboldt-Universität zu Berlin. doi:10.18452/23772.


Discuss this story

Really good to see some research in this area. It deserves so much more honestly. TheDJ (talk • contribs) 09:15, 2 September 2023 (UTC)

Minor matter, I suspect ". . . host more images of a subject that could conceivably be needed to illustrate pages on other projects . . ." might better be "than could".
Down to substance, somewhat, I sometimes hope Wikidata will help with our problems with searches, categorization, and languages in Commons. Some day, far in the future, I fear. Thus far, the cost (mainly in annoyance to old-time cat wranglers like me) of implementing Structured Data is more easily visible than the benefit. Jim.henderson (talk) 23:53, 2 September 2023 (UTC)
Fixed, thanks! (the typo, not Commons - yet)
Btw, someone also started a thread on the Commons village pump about this review and paper. Regards, HaeB (talk) 03:36, 4 September 2023 (UTC)
Excellent research topic. I think a major problem for the future is the new skin not putting a link to Wikimedia Commons on the left. It massively hinders discovery of the resources Wikimedia Commons has, which people should be able to find. I intend to petition more about this in future, and it's a major reason why I've switched back to the old skin.
The point about photo replacement is very good to note. This could be added to the watchlist ("on Commons, a photo on the article was overwritten with a different one").
With the problem that Wikimedia Commons has lots of photos: this is a good thing. People may be looking for something unexpected. There's a trend for major topics on Wikimedia Commons to create a curated highlights gallery of images and link Wikipedia articles to this rather than the category. I'm very opposed to this, it hinders discovery of photos if people are looking for something unusual that the gallery creator wasn't expecting. It's much better to have that kind of gallery in the Wikipedia article, presenting photo highlights, and a link to the Wikimedia Commons category for people looking for something unusual. Blythwood (talk) 23:16, 4 September 2023 (UTC)
  • The characteristics, or problems, identified all exist, but are perhaps not major. Commons is absolutely enormous, and the "curation" process can only work in a very patchy way. Commercial picture libraries work on the principle that you can never predict what will interest a user, so you just offer everything you have. So I think the solution proposed for "barrier 2" "The process would enable editors from both platforms to figure out whether an image warrants significance in any contexts collaboratively, rather than relying on judgement of editors from one platform or the other" is a complete non-starter. Johnbod (talk) 23:34, 5 September 2023 (UTC)
    • It does feel as if non-English searches get pretty short shrift on Commons. For instance "forest" gets 1,321,948 hits, but "forêt" (in French) gets 29,638. "skogen" ('the forest') in Swedish gets just 1,329, and the first batch of images are about a place with that in its name, not forest at all; whereas "skog" (forest in general, in Swedish) gets 1,459,858, and the first batch combines quite a few images of forest with people of that name. The Greek "δάσος" (forest) gets 1,335,550, but many of the first batch of images are of mangrove forests, which might seem rather a specialised result. My curiosity piqued, I tried the Indonesian for forest, "hutan" (rain forest is "hutan hujan", a good word, huh?): it returns an impressive 1,323,199 results, the first batch mixing forest resorts, forest parks, forest, and a boat named after a forest. On the other hand, "hutan hujan" returns 13,228 results, where "rain forest" gives just 5,872. About all one can safely conclude from this brief and wholly unscientific experiment is that search is language-dependent, or to put it another way, not very reliable for anyone arriving and searching in their own language. Chiswick Chap (talk) 17:10, 9 September 2023 (UTC)

















Wikipedia:Wikipedia Signpost/2023-08-31/Recent_research