The Signpost

In focus

Are Wikipedia articles representative of Western or world knowledge?

Wikipedia aims to represent the sum of all knowledge, but "the sum of all knowledge" is not easy to define. We might expect it to include knowledge from every region of the world (geographical distribution), every era of history, every culture, every ethnic group, every gender group, and so on.

To measure the diversity of knowledge on Wikipedia, we can look at the diversity of contributors, the number of Wikipedia articles, the diversity of sources and references,[1] or the diversity of the entities mentioned inside a given article.[2]

In this article, I look at the geographical distribution of people mentioned in an article (people mentioned with a blue link).

I apply my methodology to a selection of articles about general topics such as music, culture or knowledge, in several language versions of Wikipedia, and I discuss the results.

Methodology

Given a Wikipedia article, I select all internal links (blue links) and call them "mentioned entities". This can be done with the "links" generator of the MediaWiki API. Conveniently, this API can be called from inside a SPARQL query in the Wikidata Query Service, so I combine the API call with a Wikidata query. I keep the mentioned entities that are instances (P31) of human (Q5) and have a known place of birth (P19), and I collect the country of that birthplace with property P17.

SELECT DISTINCT ?item ?itemLabel ?country ?countryLabel ?birthplace ?birthplaceLabel
WHERE {
  # Ask the MediaWiki API for every page linked from the article "Music"
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "en.wikipedia.org" ;
                    wikibase:api "Generator" ;
                    mwapi:generator "links" ;
                    mwapi:titles "Music" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
  # Drop linked pages that have no Wikidata item
  FILTER BOUND (?item)
  # Keep humans (Q5) with a known place of birth (P19)
  ?item wdt:P31 wd:Q5 ;
        wdt:P19 ?birthplace .
  # Country (P17) of the place of birth
  ?birthplace wdt:P17 ?country .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,mul" . }
}
Click here to launch the Wikidata query
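To run the same query for another article or language version, only the endpoint and the title need to change. The sketch below is written in Python rather than in the JavaScript of the notebook, and `build_people_query` is a hypothetical helper name, not something taken from the notebook. It only builds the query text and sends nothing to the Wikidata Query Service.

```python
def build_people_query(wiki_host: str, title: str, label_langs: str = "en,mul") -> str:
    """Return the SPARQL text for one article. No request is sent."""
    # Escape backslashes and double quotes so the title stays a valid string literal.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    return f"""SELECT DISTINCT ?item ?itemLabel ?country ?countryLabel ?birthplace ?birthplaceLabel
WHERE {{
  SERVICE wikibase:mwapi {{
    bd:serviceParam wikibase:endpoint "{wiki_host}" ;
                    wikibase:api "Generator" ;
                    mwapi:generator "links" ;
                    mwapi:titles "{safe_title}" .
    ?item wikibase:apiOutputItem mwapi:item .
  }}
  FILTER BOUND(?item)
  ?item wdt:P31 wd:Q5 ;
        wdt:P19 ?birthplace .
  ?birthplace wdt:P17 ?country .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{label_langs}" . }}
}}"""


# Example: the French article "Musique"
query = build_people_query("fr.wikipedia.org", "Musique")
print(query)
```

The resulting text can be pasted into the Wikidata Query Service editor; how the notebook itself assembles and submits the query is not shown here.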

I then collect a mapping between current countries and continents. The mapping comes from Wikidata but is consistent with the United Nations M49 classification.[3]

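SELECT DISTINCT ?continent ?continentLabel ?country ?code WHERE {
  VALUES ?continent {
    wd:Q55643  # Oceania
    wd:Q48     # Asia
    wd:Q15     # Africa
    wd:Q18     # South America
    wd:Q49     # North America
    wd:Q46     # Europe
  }
  # Follow "has part(s)" (P527) links from each continent down to its members
  ?continent (wdt:P527*) ?country .
  # Keep only members that carry an M49 code (P2082)
  ?country wdt:P2082 ?code .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,mul" . }
}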
Click here to launch the Wikidata query

I perform a left join of the two data frames using the Arquero JavaScript library.[4]
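As an illustration of what the left join does, here is a minimal sketch in plain Python instead of Arquero. The function name and the row values are placeholders invented for the example, not data from the queries. Every person row is kept, and the continent is left empty when the country of birth has no match in the mapping.

```python
from collections import defaultdict


def left_join(people, mapping, key="country"):
    """Keep every row of `people`; attach a continent when the country is in `mapping`."""
    by_key = defaultdict(list)
    for row in mapping:
        by_key[row[key]].append(row)

    joined = []
    for person in people:
        # A person with no matching country still yields one row, with continent None.
        matches = by_key.get(person[key]) or [None]
        for match in matches:
            continent = match["continent"] if match else None
            joined.append({**person, "continent": continent})
    return joined


# Placeholder rows (not real query results)
people = [
    {"item": "person-A", "country": "country-1"},
    {"item": "person-B", "country": "country-9"},  # no entry in the mapping
]
mapping = [{"country": "country-1", "continent": "Europe"}]

joined = left_join(people, mapping)
print(joined)
```

The unmatched rows are what later end up in the "Unclassified" group.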

Finally, I group Europe and North America as "Western world" and the four other continents as "Rest of the world". This is an opinionated and radical approach, but it makes the numbers easier to read. Places of birth that cannot be associated with a current country are labeled "Unclassified".
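The grouping rule can be sketched as follows. This is a Python illustration under the assumption that each person row carries a continent label, or nothing when no current country was found; the names and sample values are hypothetical.

```python
from collections import Counter

WESTERN_CONTINENTS = {"Europe", "North America"}


def region_group(continent):
    """Map a continent label (or None when no country matched) to one of three groups."""
    if continent is None:
        return "Unclassified"
    if continent in WESTERN_CONTINENTS:
        return "Western world"
    return "Rest of the world"


# Placeholder continent labels, one per mentioned person; None marks a failed match.
continents = ["Europe", "North America", "Asia", None, "Africa", "Europe"]
counts = Counter(region_group(c) for c in continents)
print(counts)
```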

I've developed a user interface as an Observable notebook.[5] Users can choose two parameters: the Wikipedia project (e.g. "pt.wikipedia.org") and the title of the article. The parameters can also be set directly in the URL. For instance, you can look at the article "Kennis" (i.e. knowledge) in Afrikaans: https://observablehq.com/@pac02/wwrw?wikipedia=af.wikipedia.org&article=Kennis.

All computations are performed in the appendix of the notebook. The code is open source, released under the ISC license.
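Titles with accents or non-Latin scripts need to be percent-encoded in such a URL. A small Python sketch of how a link could be built is shown below; `notebook_url` is a hypothetical helper, and encoding spaces as `%20` rather than `+` is an assumption about what the notebook accepts.

```python
from urllib.parse import quote, urlencode

NOTEBOOK = "https://observablehq.com/@pac02/wwrw"


def notebook_url(wikipedia: str, article: str) -> str:
    """Build a notebook link with the two parameters percent-encoded as UTF-8."""
    params = urlencode({"wikipedia": wikipedia, "article": article}, quote_via=quote)
    return f"{NOTEBOOK}?{params}"


print(notebook_url("af.wikipedia.org", "Kennis"))
print(notebook_url("es.wikipedia.org", "Música"))
```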

Results

This approach makes sense for articles about general topics such as music, work, art, beauty, love, humanity, knowledge, education, school, religion, etc. It also requires that the number of people mentioned in the article be high enough to compute meaningful percentages.

In this section, we focus on three notions (music, knowledge and culture) in five languages: English, French, Spanish, Portuguese and Arabic.
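The percentages are simple shares of the total number of mentioned people. The Python sketch below uses the counts reported in this article for the English article "Music". The cut-off `min_total=20` is an assumption chosen for illustration; the article does not state the threshold below which percentages are withheld.

```python
def shares(counts, min_total=20):
    """Return each group's percentage of the total, or None when the total is too small."""
    total = sum(counts.values())
    if total < min_total:  # assumed cut-off, not stated in the article
        return {group: None for group in counts}
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


# Counts reported for the English article "Music"
music_en = {"Rest of the world": 6, "Unclassified": 2, "Western world": 65}
music_en_shares = shares(music_en)
print(music_en_shares)

# Counts reported for the French article "Connaissance": too few people
knowledge_fr_shares = shares({"Rest of the world": 0, "Unclassified": 0, "Western world": 10})
print(knowledge_fr_shares)
```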

Music

Music comes from all over the world. I would expect an encyclopedic article to mention people from all the continents. Let's take a look at the numbers.

Geographical distribution of mentioned people in articles about music

| Language version | Article    | Rest of the world | Unclassified | Western world   |
|------------------|------------|-------------------|--------------|-----------------|
| English          | Music      | 6 (8.2%)          | 2 (2.7%)     | 65 (89.0%)[6]   |
| Spanish          | es:Música  | 0 (0%)            | 3 (5.0%)     | 57 (95.0%)[7]   |
| French           | fr:Musique | 4 (6.5%)          | 0 (0%)       | 58 (93.5%)[8]   |
| Portuguese       | pt:Música  | 1 (5.0%)          | 0 (0%)       | 19 (95.0%)[9]   |
| Arabic           | ar:موسيقى  | 23 (79.3%)        | 0 (0%)       | 6 (20.7%)[10]   |

On the French, English, Portuguese and Spanish Wikipedias, the proportion of people born in Europe and North America is 89% or higher. This leaves little room for people born in Asia, Africa, South America or Oceania. Although Spanish is widely spoken in South America, the Spanish article does not mention any musician born on that continent, or in Africa, Asia or Oceania.

Knowledge

Knowledge is another general topic, and one would expect the article to mention people from all over the world. The English and Portuguese articles each have more than 90% of mentioned people born in Europe or North America. The French article mentions too few people to compute percentages, and the Spanish article is somewhat more diverse.

Geographical distribution of mentioned people in articles about knowledge

| Language version | Article         | Rest of the world | Unclassified | Western world   |
|------------------|-----------------|-------------------|--------------|-----------------|
| English          | Knowledge       | 5 (6.4%)          | 2 (2.6%)     | 71 (91.0%)[11]  |
| Spanish          | es:Conocimiento | 1 (3.1%)          | 5 (15.6%)    | 26 (81.3%)[12]  |
| French           | fr:Connaissance | 0 (-)             | 0 (-)        | 10 (-)[13]      |
| Portuguese       | pt:Conhecimento | 1 (3.4%)          | 1 (3.4%)     | 27 (93.1%)[14]  |
| Arabic           | ar:معرفة        | 6 (20.7%)         | 4 (13.8%)    | 19 (65.5%)[15]  |

Culture

Looking at culture shows that the article in French lacks diversity, with 96.5% of mentioned people born in Europe or North America. The articles in English and Spanish are a little more diverse, with 84.0% and 88.9% respectively. The article in Portuguese is a good example of a diverse article with respect to our criteria, with 55% of people born in Europe or North America.

Geographical distribution of mentioned people in articles about culture

| Language version | Article    | Rest of the world | Unclassified | Western world   |
|------------------|------------|-------------------|--------------|-----------------|
| English          | Culture    | 21 (12.9%)        | 5 (3.1%)     | 137 (84.0%)[16] |
| Spanish          | es:Cultura | 4 (8.9%)          | 1 (2.2%)     | 40 (88.9%)[17]  |
| French           | fr:Culture | 2 (3.5%)          | 0 (0%)       | 55 (96.5%)[18]  |
| Portuguese       | pt:Cultura | 10 (20.4%)        | 12 (24.5%)   | 27 (55.1%)[19]  |
| Arabic           | ar:ثقافة   | 0 (-)             | 0 (-)        | 4 (-)[20]       |

Discussion

Overall, the results show that on the English, Spanish, French, and Portuguese Wikipedias, people born outside Europe and North America are not mentioned very often.

Of course, there are several layers of explanation. The total number of written sources about those topics may be higher in Europe and North America than in the rest of the world. The total number of contributors may also be higher in those regions than in the rest of the world. There may also be an imbalance in the number of biographies between people born in Europe and North America and people born on other continents.

Of course, nobody knows what a fair percentage of people born outside Europe and North America would be for a given Wikipedia article. But WWRW helps raise awareness of some imbalances. If people from Oceania, South America, Asia or Africa are not mentioned in an article about a general topic, it is worth asking why and looking for new sources that could help add some diversity to the article.

More work is needed to measure diversity in Wikipedia articles. Anyone can play with the WWRW tool or any other tool in "article analytics"[21] and produce their own report, and anyone can develop new ways to measure diversity.

References

  1. ^ For instance, Piotr Konieczny and Włodzimierz Lewoniewski look at the number of articles related to the United States of America and the number of American sources in references. See their presentation at Wikimania 2024: https://prezi.com/view/C7snnAZFWqZz7vPD0kLu/
  2. ^ In a previous Signpost article, I looked at the gender distribution of people (human entities) mentioned in Wikipedia articles: "Measuring gender diversity in Wikipedia articles", The Signpost, May 2022.
  3. ^ https://unstats.un.org/unsd/methodology/m49/overview/#
  4. ^ Arquero is a JavaScript library developed by Jeffrey Heer: https://idl.uw.edu/arquero/api/
  5. ^ Observable is a platform created by Melody Meckfessel and Mike Bostock for writing notebooks in JavaScript. It is widely used by the data visualization community.
  6. ^ https://observablehq.com/@pac02/wwrw?wikipedia=en.wikipedia.org&article=Music
  7. ^ https://observablehq.com/@pac02/wwrw?wikipedia=es.wikipedia.org&article=M%C3%BAsica
  8. ^ https://observablehq.com/@pac02/wwrw?wikipedia=fr.wikipedia.org&article=Musique
  9. ^ https://observablehq.com/@pac02/wwrw?wikipedia=pt.wikipedia.org&article=M%C3%BAsica
  10. ^ https://observablehq.com/@pac02/wwrw?wikipedia=ar.wikipedia.org&article=%D9%85%D9%88%D8%B3%D9%8A%D9%82%D9%89
  11. ^ https://observablehq.com/@pac02/wwrw?wikipedia=en.wikipedia.org&article=Knowledge
  12. ^ https://observablehq.com/@pac02/wwrw?wikipedia=es.wikipedia.org&article=Conocimiento
  13. ^ https://observablehq.com/@pac02/wwrw?wikipedia=fr.wikipedia.org&article=Connaissance
  14. ^ https://observablehq.com/@pac02/wwrw?wikipedia=pt.wikipedia.org&article=Conhecimento
  15. ^ https://observablehq.com/@pac02/wwrw?wikipedia=ar.wikipedia.org&article=%D9%85%D8%B9%D8%B1%D9%81%D8%A9
  16. ^ https://observablehq.com/@pac02/wwrw?wikipedia=en.wikipedia.org&article=Culture
  17. ^ https://observablehq.com/@pac02/wwrw?wikipedia=es.wikipedia.org&article=Cultura
  18. ^ https://observablehq.com/@pac02/wwrw?wikipedia=fr.wikipedia.org&article=Culture
  19. ^ https://observablehq.com/@pac02/wwrw?wikipedia=pt.wikipedia.org&article=Cultura
  20. ^ https://observablehq.com/@pac02/wwrw?wikipedia=ar.wikipedia.org&article=%D8%AB%D9%82%D8%A7%D9%81%D8%A9
  21. ^ https://observablehq.com/collection/@pac02/article-analytics

Discuss this story


Interesting. Thanks for mentioning our presentation. The Prezi can be seen in our Wikimania presentation: File:Wikimania 2024 - Dilijan - Day 1 - Exploring Americanization in different regions of the world using Wikipedia and Wikidata.webm. You may also want to check our paper on this, that the presentation was based on, published earlier this year: Americanization: Coverage of American Topics in Different Wikipedias. Accessible through WP:Wikipedia Library, I hope (not in LibGen yet, sorry...). No OA as WMF does not support grants for OA on Wikipedia studies (we asked), and no other funding source was available. --Piotr Konieczny aka Prokonsul Piotrus| reply here 00:44, 13 December 2024 (UTC)

Now, comments on your analysis. 1) I'd nitpick not adding Australia and New Zealand to the Western world, but let's face it - their numbers are not likely to be very game changing. (Sorry, Aussies... I don't even know what is the nickname for New Zealanders...) 2) I understand very well why you have no Asian group (it's a pain to make); I'd still suggest having at least Japanese for some decent-ish comparison. Also I'd add German, as well as Russian to the set, those are big wikis (see also below). 3) Riding on - let's remember that Spanish and Portuguese significantly represent Latin America (you mention this for Spanish, but you seem to have forgotten Brazil...we have data from a few years ago on views and edits to wiki by country - see [1] and [2]; sadly they are a few years old, the new Wikipedia Stats pages suck and if that information is still somewhere, I was never able to find it...), and English also includes many readers and contributors from India. Again, if anyone is interested in more, see our paper, we have like a two-page limitations chapter discussing this stuff. Anyway, the point is that the numbers above are not pure 'Western' world and to some degree (hard to estimate quickly) include Latin America and India. French is probably the 'purest', although it is popular in some African countries. That's why German would be very good here (big Western wiki not used much outside Europe). Russian would be good, since they are not really 'West' (nor 'Asia'; Russia is, well, Russia). 4) As for the numbers, it's fascinating to see how different the Arabic numbers are, I'd love to learn more about what kind of people are and aren't discussed on the Arabic wiki, compared to 'Western'. 5) What's wrong with the table data for Arabic and Culture? --Piotr Konieczny aka Prokonsul Piotrus| reply here 01:08, 13 December 2024 (UTC)
Sorry, Kiwis. Jim.henderson (talk) 02:17, 13 December 2024 (UTC)
Ah, right, I forgot... :P Piotr Konieczny aka Prokonsul Piotrus| reply here 02:24, 13 December 2024 (UTC)

I guess it's a basic understanding among eastern world editors that there is a supremacy of the so-called West in terms of equal distribution of content. ––kemel49(connect)(contri) 00:48, 13 December 2024 (UTC)

While bringing a spotlight to a specific region with a new article is pretty doable (love seeing articles dedicated to specific fields in specific countries), trying to bring up a non-canonical region in a broad-topic article tends to be controversial. Here's one experience I've had with this, for example, trying to add a little section on Latin America in the History of video games. (Same with Africa but I'm not sure when that was removed) ~Maplestrip/Mable (chat) 08:23, 13 December 2024 (UTC)

@Maplestrip Did you try starting a talk discussion to judge consensus? Or just start dedicated stand-alone articles on these topics first, I am sure they are notable enough for that. Piotr Konieczny aka Prokonsul Piotrus| reply here 03:46, 14 December 2024 (UTC)
I actually just started a talk-page discussion, inspired by this In-Focus! I also wrote a few dedicated articles on those subjects, which is cool, but nobody looks at Video games in Nigeria if they aren't already interested in Nigeria. ~Maplestrip/Mable (chat) 09:10, 14 December 2024 (UTC)
Or gaming trivia - or broader concepts. I mean, this won't be a top viewed article, but hey, I am planning on improving/writing some articles about related topics, namely science fiction (I have some materials on science fiction in some less known countries and wider regions). Science fiction in Africa is likely a notable concept and not just a fork of Afrofuturism. ([3]/[4] for example). Wishing you luck with your creations - I find this stuff very interesting, myself :) Piotr Konieczny aka Prokonsul Piotrus| reply here 09:50, 14 December 2024 (UTC)
Oh that's fascinating and exciting! I've been really happy to see topics related to Polish speculative fiction on the front-page recently (sad I couldn't find a translation of CyberJoly Drim), so I am really excited to see what else you might write like this ^_^ ~Maplestrip/Mable (chat) 18:14, 14 December 2024 (UTC)
@Maplestrip Machine translation is passable these days - you can translate CJD (published here) to English or such with just two mouse clicks. Granted, it is literature so the result won't be as pretty as it would be from a professional literature translator (that's still a few years away).
Many of my sf articles are published these days on pl wiki, but I am trying to get the better ones translated here. But again, I expect in a few years we will have AI translating stuff... if you are curious, check for example my newest article on pl wiki at pl:Ostatnia godzina (powieść) and again, two clicks in browser should give you a passable English output. Piotr Konieczny aka Prokonsul Piotrus| reply here 03:30, 15 December 2024 (UTC)
I do use machine translation, especially for sources and such, but it just feels terrible to try to parse a story that way. You lose all of the original writing and get something much flatter in return. Did read a bit of Cyberjoly Drim and it was cool, but I didn't get as much out of it. And to tie it back to this article's topic, that is of course one of the many challenges of making our articles represent the full human range :p ~Maplestrip/Mable (chat) 10:32, 15 December 2024 (UTC)
  • Is it realistic to expect a population of researchers to not have this sort of inherent linguistic bias? It is worthy of being done, but of course people who read, write, and speak English will go to English language sources first and that means subjects and topics that have happened within the Anglosphere get covered more. I see this all the time in plants. The sources are plentiful and easily accessed if a plant grows in North America. If it grows in Europe it is somewhat harder if it is not one of the cosmopolitan plants that have moved around the world. If the plant only grows somewhere like Mali or the Congo... it may well be impossible because the efforts to put sources for these countries online have not yet happened. And if they are online they may very well be in a language I do not read, I only know two. Anecdote: The other month when I wanted to de-stub an article for a Peruvian plant I had to physically go to a library to borrow a paper book. That slows things down. I could have done three much more in depth articles about American plants in the time it took me to write a short one for Castilleja ecuadorensis. And if I wanted a picture I would have to go to Peru and travel well into the Andes. So, of course, I'm going to work on plants closer to home more often. And given the huge gaps in the articles closer to home I'll never make even the local flora complete even if I keep editing steadily for another 20 years. I expect the reverse to also be true. Usually when I'm working on a North American species there is not an article in any language other than the four Wikis with a large number of bot created stubs (Cebuano, Svenska, Tiếng Việt, & Winaray). If the article does exist it is usually because of one editor. The one person with the skills, the time, and the interest has to exist to do the work. Because outside a few of the most famous or contentious topics it is going to be one editor who does most of the work to make an article really complete. 
TLDR: Editors are people. They're not perfectly spherical editing machines that have a universal set of skills to impartially and perfectly cover the world. 🌿MtBotany (talk) 04:47, 18 December 2024 (UTC)
    • Excellent points! Of course, there's a bias on English Wikipedia. We speak English, we use English sources, and we write in English. Editors are also biased toward using their time to create and edit articles that get attention rather than working on articles that get little attention. (I also love to toil on obscure Peruvian topics.) It would be interesting to see research on reader bias, such as the ebb and flow of interest in subjects, the trends in readership, etc. So, should we do something about reader and editor bias? No. The bias is perhaps more interesting than any remedy that could be devised. Smallchief (talk) 10:27, 18 December 2024 (UTC)
      Thanks. I forgot to add the opposite problem. A super famous topic means there is too much information. Not too much variety, but too much of the same thing repeated over and over and over. If I go to edit catnip I can find a flood of information about it being used with cats. Trying to get specific information about its role in the environment is much harder because of all the sources repeating the same information about cats. I imagine the same thing may happen with broad topics like music or dance. There will be lots and lots of information in English about all the most popular parts of this broad topic as focused on the English speaking world making it harder for an editor looking for something different. Especially if they don't have an idea of what to look for. It is the one yellow straw in a pile of white ones problem. 🌿MtBotany (talk) 02:51, 21 December 2024 (UTC)

The bias tends to be geographically oriented from major cities, in my opinion. This can be seen within the UK and USA easily by checking the origin of edits against articles about geographic items of interest. The geographic items (towns, statues, etc) tend to be written by a more diverse community according to proximity of larger population centres. Lazy Wikipedians read & edit only in their native language. Connections and filling in of blanks are the hobby of multi-lingual Wikipedians, who tend to be in the minority. Jane (talk) 11:29, 18 December 2024 (UTC)

  • A bit late, but for the record, the choice of articles is fairly eccentric. I don't think that many people are actually reading the articles like knowledge and culture. A quick check of pageviews shows around 1,000 hits daily for knowledge, and a bit more for culture. But I imagine lots of these hits are, say, elementary schoolers or English-as-a-second-language readers who are just reading the lead section for a dictionary definition, and not closely interrogating the whole article and its links. For comparison, the article Taylor Swift gets 25K hits a day at minimum, which spikes to ~100K hits when she's in the news. Basically, if the goal is countering systemic bias, I think it would be way more helpful to include non-Western scientists or artists on Wikipedia than it would be to edit articles that are actually not that important to include more links to non-Western figures. SnowFire (talk) 20:45, 18 December 2024 (UTC)
    I think the best way to think of how people read Wikipedia articles, is to check yourself on a random day. You may have read something in depth because of your own specific "Wikipedian bias" based on your own Wikipedia work, but you may also have checked the lead section for any number of articles while on the go downtown, sitting before the tv, reading a book, or shopping around for your next holiday meal, vacation, or visitor attraction. Some Wikipedians bother to correct or add to their "fly-by" reads, but most oldies like myself just stick to their area of interest, due to preference of desktop over mobile editing. Jane (talk) 12:30, 19 December 2024 (UTC)
    Er... no? Not the point I was making. Experienced Wikipedians are a poor analogue to how regular normies read Wikipedia. I've seen smart, college-educated people not even realize that the article extended beyond the lead section (back on the older version of Vector where there was a Table of Contents after the lead). Anyway, we don't have to speculate, we can just look at the pageviews, which gives us an objective maximum of people interested in the article content. Per above, there's reason to think that at least some of those hits are not "genuine" hits but really people who just want dictionary definitions, and also per above, casual readers very often read the lead section and absolutely nothing else. SnowFire (talk) 13:42, 19 December 2024 (UTC)
  • Apart from User:Piotrus (twice, in his first comment) no-one has mentioned India (or Pakistan, South Asia). But we have huge numbers of writers and readers from there (including diaspora) most of whom write mainly or entirely on Indian topics, where in some respects en:wp is extremely strong. Our Indian readers are perhaps not so exclusive, but Indian topics, especially in popular culture and sport (but also politics) get huge views, often making the Signpost "traffic report". But though numerous, Indian writers and readers are a minority, and we should neither be surprised nor too abashed that topics from "the West" do best. Johnbod (talk) 02:47, 21 December 2024 (UTC)

Wikipedia:Wikipedia Signpost/2024-12-12/In_focus