The Signpost

Image: Figure 4 (RWFork) from the preprint of "Characterizing Knowledge Manipulation in a Russian Wikipedia Fork" by Mykola Trokhymovych, Oleksandr Kosovan, Nathan Forrester, Pablo Aragón, Diego Saez-Trumper and Ricardo Baeza-Yates, CC BY 4.0
Recent research

Knowledge manipulation on Russia's Wikipedia fork; Marxist critique of Wikidata license; call to analyze power relations of Wikipedia

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Knowledge manipulation on Ruwiki, the Russian Wikipedia fork

Reviewed by Smallbones

Ruwiki, a fork of the Russian Wikipedia, is widely believed to be financed and published by people close to the Kremlin. The authors of this paper[1] construct a dataset consisting of 33,664 pairs of articles taken from over 1.9 million articles on the official WMF Russian Wikipedia and the Ruwiki articles of the same title. To avoid confusion, Ruwiki is generally called "RWFork" in this paper.

The authors do not use the word "propaganda" in the paper, nor do they directly refer to RWFork as "disinformation". But you can take "knowledge manipulation" as used in the title as having the same meaning. Accusations of spreading propaganda have long been made between Russia and Western countries. The situation has only gotten worse since the start of the Russian invasion of Ukraine in February 2022. The Putin government has attempted several times to replace, block, or just undermine the Russian Wikipedia, and they haven’t been shy in saying so. See Signpost coverage in May 2020, April 2022, June 2023, January 2024, July 2024, and June 2025.

The stated purpose of RWFork, according to the paper, is that it is "edited to conform to the Russian legislation", without directly saying that Russian legislation requires the use of propaganda, e.g. writing "special military operation" instead of the "Russian invasion of Ukraine."

The structure of RWFork facilitates a direct comparison of articles on both encyclopedias. This comparison reveals not just the topics that Russian legislation requires to be modified, but also which topics are controversial enough that an active ally of the government has in practice made further edits. Both encyclopedias are powered by MediaWiki software.

RWFork copied almost all of the over 1.9 million articles from Russian Wikipedia. Over the period studied (2022–2023), 97.33% of the articles were unchanged (identified as "duplicates"), and 0.92% were never copied or were immediately deleted (identified as "missing" in the paper). Only 1.75% of the articles were changed, which may be the most surprising result of the paper: 0.96% had changes that affected the article text and 0.79% had changes that didn’t affect the text, such as article categorization or references. Though the percentage of changed articles is small, the resulting dataset is still quite large at 33,664 entries.

Most variables, such as page views, edit reversion rates and IP editing rates, are collected from the Russian Wikipedia articles. Because RWFork makes little data available other than the articles themselves and their editing histories, most comparisons are based solely on Russian Wikipedia data: e.g. if the Russian Wikipedia article has a high number of page views, both articles in the pair are considered frequently viewed. The main exception is that the timing of edits (often called the "time-card" on Wikipedia) is available for both articles in the pair.
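As a quick arithmetic check on the shares reported above, the following Python sketch confirms that the three groups add up to 100% and that the two kinds of change add up to the changed share. It also back-calculates the corpus size implied by 33,664 changed pairs. Because the published percentages are rounded, that figure is only approximate, and the variable names are illustrative rather than taken from the paper.

```python
from decimal import Decimal

# Shares of Russian Wikipedia articles by group, in percent, as quoted in this review.
shares = {
    "duplicate": Decimal("97.33"),
    "missing": Decimal("0.92"),
    "changed": Decimal("1.75"),
}

# The changed group split into text and non-text changes, in percent.
changed_breakdown = {
    "text_changed": Decimal("0.96"),
    "non_text_changed": Decimal("0.79"),
}

# Number of changed article pairs in the dataset, as quoted in this review.
CHANGED_PAIRS = 33_664

total_share = sum(shares.values())
changed_share = sum(changed_breakdown.values())

# Corpus size implied by the changed share; approximate, since 1.75% is rounded.
implied_total = round(CHANGED_PAIRS / (shares["changed"] / 100))

print("Sum of group shares (%):", total_share)
print("Sum of change types (%):", changed_share)
print("Implied number of articles:", implied_total)
```

The implied total lands a little above 1.9 million, which is consistent with the "over 1.9 million articles" figure.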

This dataset is the major accomplishment of the authors, and is freely available online. It is described in enough detail to answer several important questions. Were the changed articles relevant or controversial (using page views and reversion rates)? When were the articles changed (using time-cards)? Were there patterns in the articles changed (using article geography and subject matter)?

Three figures from the paper give these basic results.

Figure 3a. shows that page views from the Russian Wikipedia articles are much higher for the changed articles than for the duplicate and missing articles. Figure 3b. shows very similar results for edit counts. Figures 3c. (for IP edit rates) and 3d. (for revert rates) have smaller differences between the changed articles and the duplicate articles, but overall these results strongly support the hypothesis that changed articles are especially relevant and controversial.

Figure 4 shows the editing time-cards for RWFork (top) and Russian Wikipedia (bottom). The top card shows that RWFork is mostly edited during ordinary Moscow working hours on weekdays, whereas Russian Wikipedia is edited at earlier and later times as well as during the weekend. This strongly suggests that RWFork is edited more by professional editors and Russian Wikipedia more by volunteers.
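A time-card of this kind can be approximated by bucketing edit timestamps by weekday and hour in Moscow time. The sketch below is only an illustration, not the authors' code. It assumes timestamps are UTC strings in the ISO 8601 form used in MediaWiki revision histories, treats Moscow as a fixed UTC+3 offset, and defines "working hours" as Monday to Friday, 09:00 to 18:00. That window, the function names and the sample timestamps are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Moscow has used UTC+3 year-round since 2014, so a fixed offset is enough here.
MOSCOW = timezone(timedelta(hours=3))


def time_card(utc_timestamps):
    """Count edits per (weekday, hour) in Moscow time; weekday 0 is Monday."""
    card = Counter()
    for ts in utc_timestamps:
        utc = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        local = utc.astimezone(MOSCOW)
        card[(local.weekday(), local.hour)] += 1
    return card


def working_hours_share(card, start_hour=9, end_hour=18):
    """Fraction of edits made Monday to Friday within [start_hour, end_hour)."""
    total = sum(card.values())
    if total == 0:
        return 0.0
    inside = sum(
        count
        for (weekday, hour), count in card.items()
        if weekday < 5 and start_hour <= hour < end_hour
    )
    return inside / total


# Invented sample timestamps, for illustration only (not data from the paper).
sample_edits = [
    "2023-03-14T07:15:00Z",  # Tuesday, 10:15 in Moscow
    "2023-03-14T11:40:00Z",  # Tuesday, 14:40 in Moscow
    "2023-03-14T20:05:00Z",  # Tuesday, 23:05 in Moscow
    "2023-03-18T09:30:00Z",  # Saturday, 12:30 in Moscow
]

card = time_card(sample_edits)
print(working_hours_share(card))
```

Comparing this share between the two wikis would give a single-number summary of the contrast that Figure 4 shows visually.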

Figure 5 is a bit more complex. It shows how the article groups (changed, duplicate and missing) vary with the geography of the article subject. Articles about Ukraine (UA) fall into the changed group much more often than those about other places. Conversely, articles about Russian or U.S. topics fall most commonly in the missing group, which suggests that there are different reasons that country-specific articles end up in different groups.

The authors also offer a "taxonomy of patterns of knowledge manipulation" (Table 4 from the paper), i.e. a classification of the different types of changes made on RWFork to the imported articles. This is more refined data, based on clustering algorithms, and begs for further analysis.

There is indeed far more research that this data might be used for. For example, researchers might investigate whether the articles modified on RWFork have also been modified on Polish, Hungarian, or other eastern European language Wikipedias, possibly indicating a Russian interest in spreading propaganda beyond its borders.

Mapping the Dispossession of the Commons

Reviewed by E mln e

A collective of humanities scholars has published a manifesto and a commentary[2] to renew critical research approaches in Wikimedia research, grounded in critical humanist traditions. The group and the manifesto emerged from last year's Wikihistories symposium,[supp 1] a new research event series in the critical humanist tradition (co-organized by Wikimedia Australia). The manifesto and commentary are a call for the community to focus on the following themes:

  1. Map the dispossession of the commons
  2. Recognise Wikimedia’s role as a hub of global knowledge infrastructure
  3. Examine power relations
  4. Explore the juxtapositions of Wikimedia policies and practices
  5. Investigate linguistic and cultural plurality
  6. Assess the implications of algorithms
  7. Historicise Wikimedia's epistemology
  8. Study Wikimedia’s data as partial, temporary, fallible and shifting
  9. Situate research practice
  10. Build a shared project of critical investigation across disciplines

In a blog post last week,[supp 2] one of the authors (Heather Ford) characterized the manifesto as a continuation of the Critical Point of View Conference series in 2010/11 (Signpost coverage), and the collective volume developed from it[supp 3].

While there is previous research on the manifesto's topics - in particular the "dispossession of the commons", i.e. the impact of Large Language Models and other reuses by technology companies (cf. below) on the ways Wikimedia projects function as commons - the call seems designed to encourage further inquiries and strengthen the academic community in this area.

"The Realienation of the Commons": A Marxist critique of Wikidata's license choice

Reviewed by Tilman Bayer

In a paper titled "The Realienation of the Commons: Wikidata and the Ethics of 'Free' Data",[3] Zachary McDowell and Matthew Vetter argue that

"In many ways, Wikipedia, and its parent company Wikimedia, can be viewed as the standard-bearers of Web 2.0’s early promises for a free and open Web. However, the introduction of Wikipedia’s sister project Wikidata and its movement away from “share alike” licensing has dramatically shifted the relationship between editors and complicated Wikimedia’s ethics as it relates to the digital commons. This article investigates concerns surrounding what we term the “re-alienation of the commons,” especially as it relates to Google and other search engine companies’ reliance on data emerging from free/libre and open-source (FOSS/FLOSS) Web movements of the late 20th and early 21st centuries. Taking a Marxist approach, this article explores the labor relationship of editors to Wikimedia projects and how this “realienation” threatens this relationship, as well as the future of the community."

In more detail, the authors explain their application of Marx's theory of alienation to Wikipedia and Wikidata as follows:

[...] Wikipedia editing allowed the average editor to subvert the capitalist status quo. The Wikipedia community was created around this new economic model—CBPP [commons-based peer production], which connected editors with their labor and connected other editors to each other through that labor. Karl Marx [...] defined alienation as “appropriation as estrangement” and stated that “realisation of labour appears as loss of realisation for the workers” [...]. Marx’s concept here refers to the relationship between the product of the labor and how it is both used and disconnected from the laborer. This relationship with labor (and the community around it) marks the important distinction that helps illustrate our use of the term “realienation” with regard to Marx’s usage of “alienation.”[...]
Instead of Wikipedia’s CC-BY-SA (“share alike”) license (a license that requires derivatives and other uses of the licensed material to retain the same license), Wikidata utilizes a license that has no requirements. This might sound ideal for “freedom,” but in reality, Wikidata seems to appropriate that particular FOSS imaginary of sharing while instead delicensing information into data by assigning it a CC0 license—allowing companies to extract, commodify, and otherwise use these data in ways to create systems without requirements to honor the license or reference the works that were utilized.

A problem with the paper's argument here is that its depiction of the CC0 license as contrary to Wikimedian values (and mocking scare quotes around “freedom [...]”) is incompatible with the Wikimedia movement's own conception of free licenses, as pointed out by several Wikimedians in a discussion with the authors in the "Wikipedia Weekly" Facebook group:

I think this [paper] is bad for the open movement as they try to make a new definition of what "free" is, contrary to Freedom defined [i.e. the definition used in the Wikimedia Foundation's 2007 licensing policy resolution that specifies the admissible content licenses on all Wikimedia projects, not just Wikidata], the Open definition and for example the Free in Free Software Foundation or the open source definition.

One of the authors rejected this criticism as making a mountain out of a molehill, while the other stated that "the main argument I would emphasize in response is that we need to be more attentive and critical to the outcomes of CCZero licensing."

As per its abstract (quoted above), the paper explores the postulated re-alienation [...] especially as it relates to Google and other search engine companies’ reliance on data from Wikimedia projects. In the case of Wikipedia, the authors devote ample space to summarizing earlier research about its importance for Google's search engine, and concerns that Google's Knowledge panel feature (introduced in 2012) might have significantly reduced traffic to Wikipedia as well as average Web users’ understanding of where information comes from when sourced from Wikipedia. However, they also acknowledge that the relationship between Google and Wikipedia had been (somewhat) mutually beneficial overall.

In contrast though, and rather peculiarly considering their overall claim that Wikidata's CC0 license makes the project more exploitable by search engine companies, the paper cites no research or other concrete evidence about whether and how much information from Wikidata is being used in Google Search or in its knowledge panels. At one point, the authors even lament that

it is of deep concern that the Wikimedia community and Wikidata volunteers know very little with regard to how third-party consumers use Wikidata.

But McDowell and Vetter don't seem to have considered how they themselves, and the strong claims they make in their paper about the exploitation of Wikidata due to its license choice, might be affected by this lack of knowledge.

Published in the 2024 issue of the International Journal of Communication, the paper also briefly mentions

large language model generative artificial intelligence (AI) applications such as ChatGPT or Google’s Bard

as a more recent example of this "realienation". However, it largely focuses on search engines and discusses artificial intelligence mostly in the form of AI apps such as Google Knowledge Graph [and] VAs voice assistants (e.g., Siri, Alexa), presumably due to its submission date (the ambiguous 11-9-2022) predating the release of ChatGPT on November 30, 2022.

Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer and JPxG

"Digital sovereignty": A history of "Wikimedia’s atypical organizational model"

From the abstract:[4]

"Based on the authors’ extensive involvement [with e.g. Wikimedia Germany and the Wikimedia Foundation] since the early years, this article examines Wikipedia’s journey of over two decades to unravel relevant aspects of sovereignty within its unconventional organizational framework. The concept of digital sovereignty was nascent when Wikipedia emerged in 2001. In its 24-year evolution, Wikimedia’s atypical organizational model, shaped by a mix of intent and happenstance, fostered digital independence while unintentionally creating pockets of dependence. Looking at the origins and the foundational principles, this article sheds light on various aspects of dependence, brought about in the areas of content, collaboration, governmental influence, legal framework and funding models."

The authors envision Wikipedia, which at the time of its origin "could have remained a marginal experiment", as a self-determining digital space. However, they conclude that this state is not the result of a deliberately orchestrated hierarchy, but of an almost-accidental stumbling into independence through a mix of idealism and adaptation.

"Jürgen Habermas revisited via Tim Cook's Wikipedia biography: A hermeneutic approach to critical Information Systems research"

From the abstract:[5]

"Critical Information Systems (IS) research is sometimes appreciated for the shades of gray it adds to sunny portraits of technology's emancipatory potential. In this article, we revisit a theory about Wikipedia’s putative freedom from the authority of corporate media's editors and authors. We present the curious example of Tim Cook's Wikipedia biography and its history of crowd-sourced editorial decisions [... W]hat we found pertained to authoritative discourse – the opposite of “rational discourse” – as well as Jürgen Habermas's concept of dramaturgical action. Our discussion aims to change how critical scholars think about IS's Habermasian theories and emancipatory technology. Our contribution – a critical intervention – is a clear alternative to mainstream IS research's moral prescriptions and mechanistic causes."

Specifically, the paper focuses on talk page debates about whether the article should mention the Apple CEO's homosexuality, where advocates of privacy prevailed until Cook himself

[...] wrote an auto-biographical essay about his sexuality, published by Bloomberg Media. [...] Corporate powers determined and disseminated the final word about Cook’s sexuality, not Wikipedia’s global pool of co-authors and co-editors.
In short, Wikipedia’s putatively “rational discourse” (Hansen et al., 2009) did not establish the consensus; corporate media authority, the author (Cook, 2014), and his auto-biography established an orthodox position, which Wikipedia then copied.

How this critique, carried out by means of a "hermeneutic excursion", relates to our own policies on biographies of living people is not specified, as actions taken here are broadly commensurate with what policy recommends for biographies in general. The authors are unclear on this point, but offer the suggestion that the article was tainted by the use of that reference, since Cook's biography was published by a company owned by a billionaire, and he "did not release it through a social media outlet" (although Facebook, Twitter, Instagram, and Truth Social are also owned by billionaires).

(See also earlier coverage of other publications involving Habermas)

References

  1. ^ Trokhymovych, M., Kosovan, O., Forrester, N., Aragón, P., Saez-Trumper, D., & Baeza-Yates, R. (2025). Characterizing Knowledge Manipulation in a Russian Wikipedia Fork. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 1924-1936. https://doi.org/10.1609/icwsm.v19i1.35910 (with downloads available). A preprint is also available at https://arxiv.org/abs/2504.10663, licensed CC BY-SA 4.0.
  2. ^ Jankowski, Steve; Ford, Heather; Iliadis, Andrew; Sidoti, Francesca (2025-07-07). "Uniting and reigniting critical Wikimedia research". Big Data & Society. 12 (3): 20539517251357292. doi:10.1177/20539517251357292. ISSN 2053-9517.
  3. ^ McDowell, Zachary J.; Vetter, Matthew A. (2023-12-26). "The Realienation of the Commons: Wikidata and the Ethics of 'Free' Data". International Journal of Communication. 18 (0): 19. ISSN 1932-8036.
  4. ^ Klempert, Arne; Ménard, Delphine (2025). "Wikipedia's Atypical Oganizational [sic] Model: Digital Sovereignty 20 Years in the Making". In Schmuntzsch, Ulrike; Shajek, Alexandra; Hartmann, Ernst Andreas (eds.). New Digital Work II: Digital Sovereignty of Companies and Organizations. Cham: Springer Nature Switzerland. pp. 145–160. ISBN 9783031699948.
  5. ^ Smethurst, Reilly; Young, Amber G.; Wigdor, Ariel D. (2024-12-01). "Jürgen Habermas revisited via Tim Cook's Wikipedia biography: A hermeneutic approach to critical Information Systems research". Journal of Responsible Technology. 20: 100090. doi:10.1016/j.jrt.2024.100090. ISSN 2666-6596.
Supplementary references and notes:



Discuss this story

Good roundup/selective dive as usual, HaeB. I saw an early presentation of the realienation research at a conference a couple years ago (and might as well disclose I know the authors) and had an initial pragmatic-defensive reaction: Wikidata can't just switch to a different license -- it doesn't function without CC0, so what's the point? But the more I sat with it, the more I felt like there was a really important point here about alienation, wikis, wiki contributors, and licensing.
Contributors are more and more frequently separated from our work. No amount of reaffirmation of our definition of freedom changes the reality that many people in our community regularly express feelings ranging from annoyance to demotivation because they feel like their labor is exploited.
Back in 2018, for example, Bfpage wrote a Signpost article about the experience of hearing Alexa read something she wrote on Wikipedia, without attribution. The paper focuses on Wikidata, but the objection about Alexa, and one of the chief criticisms here and elsewhere about more recently relevant companies like OpenAI and Google, isn't simply that they use Wikipedia, but that they treat Wikipedia (and everything else) as if they're CC0.
Google and Wikipedia/the rest of the web have had a historically mutually beneficial relationship, but that undeniably began to erode with Knowledge Panels, which have now given way to AI Search.
It seems to me the distance created between contributors and readers, owing to companies treating our work as though it's CC0, regardless of whether it is, does take a toll worth examining. I think there are now several people/groups working to better understand just that, like the WMF's Future Audiences, but "realienation" seems like a natural frame through which to talk about it.
BTW: research or other concrete evidence about whether and how much information from Wikidata is being using in Google Search or in its knowledge panels - How much of this is available? My sense is that such information would be difficult to find, and that it is easily obscured for reasons that align with the authors' arguments, but I would be happy to be wrong about that. -- Rhododendrites talk \\ 15:47, 18 July 2025 (UTC)
The Wikimedia sound logo was meant to eliminate the difference between Alexa announcing that it will use Spotify to play a requested song while lifting Wikipedia text without attribution, so it is dispiriting that two years later, there is no external adoption of this Wikimedia branding. As Barbara Page pondered in that 2018 article, perhaps the WMF has calculated that large donations from reliant tech companies are better than enforcement of Wikipedia's attribution requirement especially since they do not experience the alienation. ViridianPenguin🐧 (💬) 04:22, 19 July 2025 (UTC)
  • Re the Wikidata license paper: It would be of far "deep[er] concern" if Wikidata/WMF was claiming to own simple DB connections, like who the painter of the Mona Lisa was. The author apparently acts like it's dispiriting to editors that such basic info is being shared, but it's the other way around. It would be dispiriting if the kind of stuff Wikidata does was being locked down further as they seem to advocate for. The CC0 license is a good fit for Wikidata. SnowFire (talk) 19:21, 18 July 2025 (UTC)

Wikipedia:Wikipedia Signpost/2025-07-18/Recent_research