The Signpost

Image: Taxon item properties diagram.jpg, by Siobhan Leachman & Heidi Meudt (CC0)
Recent research

Minority-language Wikipedias, and Wikidata for botanists

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"The Wikipedia Editions of Low German and Other European Minority Languages"

Reviewed by Katarzyna Makowska (WMPL)
Post-1945 Low German language area in Europe

The author of this 2021 study[1] looks at Wikipedia projects in several European minority, regional and endangered (MRE) languages. His main focus is on Low German (Plattdeutsch, a minority language spoken in northern Germany), but he also considers other European languages: Occitan, Piedmontese, Sardinian, Kashubian and Ladin.

The introduction provides an overview of the history of Low German, once the lingua franca of Northern Europe, and of its erosion and loss of popularity up to its addition to the EU list of Endangered Languages. The author mentions the recent generational divide and the role of new media, including the internet and, of course, Wikipedia. He reviews studies looking at the online presence of MRE languages, including a notable 2013 study dividing the digital presence of languages into four groups: thriving, vital, heritage and still.[supp 1]

The paper then compares the minority languages on Wikipedia: their rank by number of articles, number of administrators and number of active users. The author notes that the existence of a Wikipedia project is a positive achievement in itself, considering that the rules for creating a new language project are described as fairly complex, and that many of the world’s languages are not represented on Wikipedia. He also notes that, at least as of 2021, small, dedicated groups of speakers/users are mostly responsible for maintaining the languages' presence on Wikipedia.

The author then searches for fifty common terms in each language, doing a word count for each entry where applicable (a search for 300 Wikipedia entries in total, i.e. fifty terms in each of the six languages). The search was carried out between November 2017 and January 2018, with data re-examined in March 2021. The author does not explain in much detail how he chose the 50 common words. This part is somewhat comparable to a recent study by Lewoniewski et al.,[2] although in that case the authors explained their choice of words to analyse more transparently, and were looking at the quality of articles.

Interestingly, the author notes:

"The brevity of many of the articles and the paucity of information may cause more problems than benefits, such as perceptions that these languages are unrefined, unsophisticated, and second-rate. The voluntary work of the respective Wikipedia crews can, of course, not be faulted for this dilemma. Rather, these numbers underscore the fact that languages with fewer speakers do not have a good or even fair chance of succeeding on Wikipedia, which in turn negatively affects a meaningful online presence. These results, however, must be considered preliminary, and a study larger in scope will be needed to genuinely validate it."

The paper also provides an intriguing case study for changes in Low German Wikipedia:

“Something unexpected happened with the Low German Wikipedia edition in the course of this study. At the beginning of this study, in January 2018, Low German had 27,342 articles, 4 administrators, and fifty-two active users. In terms of total number of entries, it ranked at number one hundred. In March 2021, the Low German Wikipedia had 85,467 entries and had climbed to number seventy-five. It still had four administrators in March 2021 and seventy active users. This means that the number of articles in the Low German Wikipedia edition more than tripled in three years, while the volunteer staff only slightly increased. What are the reasons for this remarkable increase in articles? The answer is baffling, as many of the newly added articles may not have been authored by humans.”

As shown in table 6 of the study, it turned out that over half of the articles in the languages examined, with the exception of Ladin, are bot-generated. This is in contrast to large Wikipedias such as English or German, which use many bots, but less frequently for content creation. While bots can be useful for generating stub articles with basic information, these entries tend to be repetitive and have very few or no references.

The author offers an interesting critique of how the digital world achieves, or fails to achieve, more equality and presence for marginalised groups and minority languages. Beyond digital limitations, the paper reminds its readers that the barriers to the representation of MRE languages can extend beyond the domain of linguistics into the cultural sphere, as many of these languages are “embedded in a profound oral tradition, which often includes the lack of a common orthography. This means that not only Wikipedia but also the Internet in general is not really an adequate medium for communication in these languages.”

After writing this review, I was informed that Book Publisher International (the company which published the volume in which this study is a chapter) is considered a predatory publisher (thank you to the editor who pointed it out!). Considering that the author has published on this subject in other, reputable venues,[3] I think there is still merit in this small yet interesting study. The author himself was careful to underline that the results must be understood as samples and snapshots rather than definite conclusions. Still, the findings raise important questions about the future of minority languages online. It would be interesting to see a follow-up study on this, although understandably, as the author points out, that would require more resources and finances. It would be equally intriguing to see a comparison of how the languages have been doing since the study was conducted and published.

"Wikidata for botanists"

Reviewed by Katarzyna Makowska (WMPL)

This paper[4] makes a case for using Wikidata in botany, highlighting its benefits and multiple application opportunities in this field. The authors are researchers and Wikidata editors, and the publication comes after a workshop and poster presentation during the International Botanical Congress (IBC) in Spain in 2024.

The paper starts with a narrative about Oxalis psoraleoides (a species of flowering plant), accompanied by a knowledge graph, demonstrating how many elements mentioned in the botanical story (such as collections, species, places, explorers) are interlinked on Wikidata and can be visualised there:

"Knowledge graph of entities linked to Oxalis psoraleoides subsp. insipida (Q131350563)" (Figure 1 from the paper)

To quote from the paper:

"(...) there is a huge amount of botany-related information that has been published over centuries that contains hidden connections between such entities. Much of this textual information is made available on the internet in a digital format. This information is usually unstructured, and hence is siloed, lacking in context, and not interoperable. In addition, information in different biodiversity databases as well as digital libraries is often not linked. (...)

Publishing information in Wikidata ensures it is findable, able to be accessed, interoperable (the structure follows a documented standard), and obstacle-free with regard to reuse (licensing), thus progressing towards achieving compliance with FAIR principles (Findability, Accessibility, Interoperability, and Reuse of digital assets, [Q29032644], Wilkinson et al., 2016[supp 2])"

This is followed by the basics of Wikidata and a comprehensive list of the botany-related types of data in Wikidata, with a detailed explanation of each – from people to taxa, publications, institutions, collections, expeditions, and more.
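To make the idea of structured, interlinked statements more concrete, here is a minimal Python sketch of a taxon item as a set of property–value statements. It is only an illustration, not code from the paper: the `Item` class and the `PARENT_TAXON_QID` placeholder are hypothetical, and the statements attached to Q131350563 are assumed for the example rather than copied from Wikidata. The property IDs themselves are real Wikidata properties (P31 "instance of", P225 "taxon name", P171 "parent taxon"), and Q16521 is the Wikidata item for "taxon".

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A toy stand-in for a Wikidata item: a QID, a label, and statements."""
    qid: str
    label: str
    # Maps a property ID (e.g. "P31") to the list of values stated for it.
    statements: dict = field(default_factory=dict)

    def add(self, prop: str, value: str) -> None:
        """Record one statement (property -> value) on this item."""
        self.statements.setdefault(prop, []).append(value)

    def linked_items(self) -> list:
        """Return statement values that are QIDs, i.e. links to other items."""
        return [
            value
            for values in self.statements.values()
            for value in values
            if value.startswith("Q") and value[1:].isdigit()
        ]


# QID and label as given in the caption of Figure 1; the statements are assumed.
taxon = Item("Q131350563", "Oxalis psoraleoides subsp. insipida")
taxon.add("P31", "Q16521")  # instance of: taxon
taxon.add("P225", "Oxalis psoraleoides subsp. insipida")  # taxon name (a string)
taxon.add("P171", "PARENT_TAXON_QID")  # parent taxon: placeholder, real QID not looked up

print(taxon.linked_items())  # prints ['Q16521']
```

Values that are QIDs point to other items, which is what allows people, places, collections and taxa to be connected into a knowledge graph like the one shown in the paper.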

The authors describe several tools relevant to botanists that visualise Wikidata and use Wikidata QIDs in websites or catalogues. The paper concludes with practical examples of how the botanical community can use Wikidata to its advantage, and with a Wikidata call to action closely tied to the Madrid Declaration, which was "collectively published by the congress participants at the end of the IBC and is aimed at botanists, institutions and citizens to 'strengthen the connection between plants and people, nurture mutual benefits, and enhance planetary health and resilience' (XX International Botanical Congress, 2024)."

The article is accompanied by the original research poster and other interesting graphs. This story is also summarised in an accessible blog post: The power and potential of Wikidata for botany. In my view, it would be really interesting to see similar Wikidata overviews for other disciplines.

"Visualization of a Wikidata data model for a taxon, including nearly 50 examples of Wikidata properties (not including identifiers) that can be used on taxon items in Wikidata (Figure 9 from the paper)


Wikimedia Foundation publishes draft guidelines instructing researchers how to study NPOV on Wikipedia

By Tilman Bayer

The Wikimedia Foundation has published a draft document titled "Guidance for NPOV Research on Wikipedia". Besides general explainers about Neutral Point of View as a core Wikipedia policy, the document also appears to attempt to address some fallacies in prior studies of biases on Wikipedia, e.g. by asking researchers to "Distinguish between bias in sources on Wikipedia vs. bias in sources outside of Wikipedia", and suggesting other ways to "make rigorous assessments of Wikipedia's adherence to NPOV". The Foundation also solicited feedback on its guidelines draft from researchers and community members until August 31.

The document appears to be related to the "Common global standards for NPOV policies" working group launched by the Foundation earlier this year (Signpost coverage: "WMF to explore 'common standards' for NPOV policies; implications for project autonomy remain unclear"), which itself recently published an "Analysis of Neutral Point of View Policies across Wikipedias".

In "The Conversation", researcher Heather Ford raised concerns about the Foundation's "new rules":

[...] instead of supporting open inquiry, the guidelines reveal just how unaware the Wikimedia Foundation is of its own influence.

These new rules tell researchers – some based in universities, some at non-profit organisations or elsewhere – not just how to study Wikipedia’s neutrality, but what they should study and how to interpret their results. That’s a worrying move.
As someone who has researched Wikipedia for more than 15 years – and served on the Wikimedia Foundation’s own Advisory Board before that – I’m concerned these guidelines could discourage truly independent research into one of the world’s most powerful repositories of knowledge.
[...] the Wikimedia Foundation has lots of control over research on Wikipedia. It decides who it will work with, who gets funding, whose work to promote, and who gets access to internal data. That means it can quietly influence which research gets done – and which doesn’t.

Now the foundation is setting the terms for how neutrality should be studied.

Ford is also part of a group of researchers who recently published a manifesto and commentary calling for "Uniting and reigniting critical Wikimedia research" (see our previous coverage), which proposes "Examine power relations" as one of several research focus areas.

Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by 三猎 and Tilman Bayer

"Wikipedia as a Global Social Movement"

Translated from the abstract:[5]

Previous studies have mainly positioned Wikipedia as a "meta-media", "reference book" or "social media", which seem unable to explain the reasons behind the continuous expansion of the Wikimedia project. Our research notes that Wikipedia originated from an open-source culture, and its development process can be traced back to the concept of the rights revolution. By evolving from Wikipedia to projects such as Wikidata and Wikimedia Commons, as well as holding activities like edit-a-thons, Wikimedia has transformed from an internet encyclopedia to a global social movement. The Wikimedia movement can be regarded as an open-source knowledge movement without boundaries, of which the action process is fully transparent throughout the domain, the action subjects heterogeneous and integrated, and the forms of interaction competitive and cooperative. Viewing Wikipedia as a social movement and an open-source community helps to further understand the logic of global open-source knowledge dissemination.

FYI: One of the co-authors, Gan Lihao (a professor of communication at East China Normal University), recently led his students in completing a book titled Wikipedia Politics and presented a report on the book at Wikimania 2025. Gan was mentioned in a 2019 BBC article. In this compiler’s opinion, the BBC misinterpreted the Chinese scholar’s words, which followed a manner of expression particular to the Chinese context.

"Collective Folk Writing on Chinese Martial Arts in English Contexts"

From the English-language abstract of this Chinese-language paper:[6]

With the increasing globalization of Chinese martial arts, diverse forms and levels of documentation have emerged worldwide. This study examines the collaborative folk writing of Chinese martial arts in English-language contexts by analyzing the development trajectory, contributor demographics, writing practices, and negotiations/competitions observed in the Wikipedia entry "Chinese Martial Arts". The research reveals that the entry's evolution is characterized by continuous growth and refinement, yet remains an ongoing, "unfinished" process. While the writing inherently exhibits distinct international and collective traits, the emergence of core contributor groups and Wikipedia’s editorial protocols dominate the negotiation and competition among diverse perspectives, ultimately shaping the entry's narrative direction and textual representation. Notably, the study identifies a striking absence of Chinese voices in this collaborative writing process. It emphasizes the urgent need to integrate Chinese perspectives and knowledge into the global dissemination of martial arts discourse to bridge this representational gap.

"ALPET: Active Few-shot Learning for Citation Worthiness Detection in Low-Resource Wikipedia Languages"

From the abstract:[7]

"Citation Worthiness Detection (CWD) consists in determining which sentences, within an article or collection, should be backed up with a citation to validate the information it provides. This study, introduces ALPET, a framework combining Active Learning (AL) and Pattern-Exploiting Training (PET), to enhance CWD for languages with limited data resources. Applied to Catalan, Basque, and Albanian Wikipedia datasets, ALPET outperforms the existing CCW baseline while reducing the amount of labeled data in some cases above 80%."

"Augmenting Low-Resource Language Wikipedia through Hyperlink Type Recommendation"

From the abstract:[8]

"Despite [Wikipedia's] widespread use, significant disparities persist among language publications, including variations in the number of articles, the spectrum of topics covered, and even the number of contributing community editors. In this paper, we aim to alleviate this gap in the coverage of low-resource languages. Although previous work has focused on multilingual interoperability efforts, the potential of hyperlinks has not been fully realized. Therefore, this study introduces a novel approach focused on hyperlinks, specifically emphasizing hyperlink types derived from Wikidata. We extract and analyze patterns related to these hyperlink types across different languages, using them as recommended solutions to connect the topics of various languages, particularly low-resource languages"

By "hyperlink type", the authors refer to the Wikidata topic that a Wikipedia article is an "instance of", via Wikidata property P31. From the paper:

[...] our research is carried out in a case study involving the English (en), Japanese (ja), and Vietnamese (vi) [Wikipedia] languages [...]

[... W]e notice a significant preference for topics such as film, automobile models, music groups, singles, and video games in English editors. In Japanese articles, hyperlinks emphasize various aspects of Japan, including city, town, chōchō, municipality, railway station, and manga series. In contrast, the Vietnamese context focuses primarily on topics such as world war, sovereign state, chemical compound, and organization.
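As a rough illustration of what counting "hyperlink types" could look like, the following Python sketch tallies the P31 ("instance of") labels of the pages an article links to. It is a simplified sketch under stated assumptions, not the authors' method: the function name `hyperlink_type_counts`, the lookup table and its sample titles and labels are all hypothetical illustrations rather than data from the paper.

```python
from collections import Counter

# Hypothetical lookup: linked article title -> its P31 ("instance of") labels.
# Titles and labels are illustrative only, not data from the paper.
P31_LABELS = {
    "Spirited Away": ["film"],
    "Seven Samurai": ["film"],
    "Toyota Corolla": ["automobile model"],
    "Tokyo Station": ["railway station"],
    "One Piece": ["manga series"],
}


def hyperlink_type_counts(linked_titles, p31_labels):
    """Count the P31 types over the outgoing links of one article.

    Links whose target has no known P31 value are skipped.
    """
    counts = Counter()
    for title in linked_titles:
        counts.update(p31_labels.get(title, []))
    return counts


# Outgoing links of one imaginary article; "Unknown page" has no P31 entry.
links = ["Spirited Away", "Seven Samurai", "Tokyo Station", "Unknown page"]
counts = hyperlink_type_counts(links, P31_LABELS)
print(counts.most_common())  # prints [('film', 2), ('railway station', 1)]
```

Aggregating such counts per language edition would give the kind of type profile described in the quoted passage, for example a prevalence of "film" links in one edition and "railway station" links in another.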

"Leveraging LLM For Synchronizing Information Across Multilingual Tables" to help update Wikipedias in low-resource languages

From the abstract:[9]

"[Wikipedia] content in low-resource languages [is] frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization. This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. [...] Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%)"

References

  1. ^ Wiggers H (31 May 2021). "The Wikipedia Editions of Low German and Other European Minority Languages". In Seda Koc E (ed.). Modern Perspectives in Language, Literature and Education Vol. 4. Book Publisher International. pp. 126–144. doi:10.9734/bpi/mplle/v4/9161D. ISBN 978-93-91215-24-8. Closed access (freely accessible version available)
  2. ^ Lewoniewski W, Węcel K, Abramowicz W (22 May 2025). "Utilizing citation index and synthetic quality measure to compare Wikipedia languages across various topics". arXiv:2505.16506 [cs.IR].
  3. ^ Wiggers H (3 June 2018). "The Struggle of Small and Non-Western Wikipedia Editions". Proceedings of the 4th Annual Linguistics Conference at UGA. 4th Annual Linguistics Conference at UGA. The Linguistics Society of UGA. pp. 66–86.
  4. ^ von Mering S, Leachman S, Santos J, Meudt HM (2025). "Wikidata for botanists: benefits of collaborating and sharing Linked Open Data". Annals of Botany mcaf062. doi:10.1093/aob/mcaf062. PMID 40481658.
  5. ^ 高海芮, 甘莅豪 (2025). "全球性社会运动视野下的维基百科全书". 科技传播. 17 (4): 81–85. doi:10.16607/j.cnki.1674-6708.2025.04.036. Closed access
  6. ^ 李正一, 罗诚迪, 李柏槐 (2025). "英语语境中中国武术的民间集体书写——以维基百科中"中国武术"(Chinese martial arts)条目为例". 成都体育学院学报. 51 (2): 143–150. doi:10.15942/j.jcsu.2025.02.15.
  7. ^ Halitaj A, Zubiaga A (1 July 2025). "ALPET: Active few-shot learning for citation worthiness detection in low-resource Wikipedia languages". Expert Systems with Applications. 281 127503. doi:10.1016/j.eswa.2025.127503. ISSN 0957-4174.
  8. ^ Nguyen N, Takeda H (2025). "Augmenting Low-Resource Language Wikipedia through Hyperlink Type Recommendation". IEICE Transactions on Information and Systems. advpub 2024EDP7258. doi:10.1587/transinf.2024EDP7258. https://www.jstage.jst.go.jp/article/transinf/advpub/0/advpub_2024EDP7258/_article/-char/en
  9. ^ Khincha S, Kataria T, Anand A, Roth D, Gupta V (April 2025). "Leveraging LLM For Synchronizing Information Across Multilingual Tables". In Luis Chiruzzo, Alan Ritter, Lu Wang (eds.). Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). NAACL-HLT 2025. Albuquerque, New Mexico: Association for Computational Linguistics. pp. 6474–6492. doi:10.18653/v1/2025.naacl-long.329. ISBN 9798891761896. / Code and dataset / Author's thread
Supplementary references and notes:
  1. ^ Kornai A (22 October 2013). "Digital Language Death". PLOS ONE. 8 (10): e77056. Bibcode:2013PLoSO...877056K. doi:10.1371/journal.pone.0077056. PMC 3805564. PMID 24167559.
  2. ^ Wilkinson MD, et al. (15 March 2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data. 3 (1) 160018. Bibcode:2016NatSD...360018W. doi:10.1038/sdata.2016.18. PMC 4792175. PMID 26978244.


Wikipedia:Wikipedia Signpost/2025-09-09/Recent_research