The Signpost

In focus

Measuring gender diversity in Wikipedia articles

When thinking about gender diversity in Wikipedia, we often think of the number of biographical articles about men and women. The Humaniki project shows that about 19% of biographical articles on the English Wikipedia are about women. However, this is only one aspect of gender diversity. In this article, I develop a method which measures gender diversity at the article level and show why it's useful.

Motivation

While working on the article about economics on the French Wikipedia, I was surprised by the low number of women among the people cited in the article, so I started exploring methods to measure gender diversity. I draw a distinction between gender diversity and gender parity[1], and prefer the former for two reasons. First, gender parity presupposes a binary notion of gender, which excludes non-binary people. Second, gender parity implies that the ideal would be a fifty-fifty split between men and women. After some iterations, I found a way to measure gender diversity at the article level. This tool can be used to explore gender diversity in articles about academic fields, activities, or occupations. My approach is very basic and simply computes the share of people cited in an article by gender.

This simple quantitative approach to measure gender diversity is similar to many research projects on this theme in computational social sciences. David Doukhan is tracking women's speaking time on the radio[2]. Antoine Mazières and his co-authors are computing the share of screen time with women in popular movies[3] and Gilles Bastin and his co-authors are computing gender frequency of people cited in French newspapers[4].

Methodology

For each article, I get the list of internal links (also known as blue links). I retrieve them using the Wikipedia links API. Then I combine this query with a Wikidata SPARQL query[5]. I select all links corresponding to human beings in Wikidata (property P31 is Q5) and I retrieve their gender (property P21 in Wikidata). Note that gender in Wikidata can be male, female, non-binary, intersex, transgender female, transgender male, or agender. I'd find it more intuitive to group together transgender males with males and transgender females with females but I prefer to keep the classification of Wikidata.
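As an illustration of the Wikidata step, here is a minimal Python sketch that builds a SPARQL query for a list of linked page titles. It is not the author's actual query (see reference 5 for that). The function name `build_gender_query` and the query layout are assumptions made for this example. Only the properties P31 (with value Q5) and P21 come from the description above.

```python
def build_gender_query(titles, lang="en"):
    """Build a SPARQL query for the Wikidata Query Service (illustrative only).

    `titles` are page titles linked from one Wikipedia article.
    """
    def quote(title):
        # Escape backslashes and double quotes for a SPARQL string literal.
        escaped = title.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@{lang}'

    values = " ".join(quote(t) for t in titles)
    wiki = f"https://{lang}.wikipedia.org/"
    return f"""
SELECT ?title ?item ?genderLabel WHERE {{
  VALUES ?title {{ {values} }}
  ?page schema:about ?item ;
        schema:isPartOf <{wiki}> ;
        schema:name ?title .
  ?item wdt:P31 wd:Q5 .                    # keep only human beings
  OPTIONAL {{ ?item wdt:P21 ?gender . }}   # gender, when Wikidata records it
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}
}}
""".strip()


query = build_gender_query(["Adam Smith", "Joan Robinson"])
print(query)
```

The resulting string could be sent to the public Wikidata SPARQL endpoint. Long link lists would probably need to be split into several batches.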

Last, I count the number of entities by gender and compute the share.
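The counting step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper name `gender_shares` is hypothetical, the input is a plain list of gender labels (one per linked person), and people without a recorded gender appear as `None` and are left out of the total.

```python
from collections import Counter


def gender_shares(genders):
    """Map each gender label to (count, share of all gendered entities)."""
    # Ignore linked people for whom no gender is recorded.
    counts = Counter(g for g in genders if g is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: (n, n / total) for label, n in counts.most_common()}


# Made-up sample, not taken from any real article.
sample = ["male", "male", "male", "female", None]
for label, (n, share) in gender_shares(sample).items():
    print(f"{label}: {n} ({share:.0%})")
```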

Anyone can compute gender diversity for a single Wikipedia article using the gender diversity explorer tool.

This is a very basic approach. It doesn't distinguish between entities cited in the references and entities cited in the body of the article. It doesn't take into account people who are mentioned in the article without a link to a Wikipedia article. But even if it's imperfect, I believe this is a useful approach.

Numbers should be interpreted with caution. The number of gendered entities cited in a single article is often very low. I personally don't interpret proportions if the total number of gendered entities is lower than 50.
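This rule of thumb could be applied as a simple guard before reading any proportions. The sketch below is hypothetical: the function name `is_interpretable` and the example counts are invented, and the default threshold of 50 is the author's personal choice rather than a statistical standard.

```python
def is_interpretable(counts, min_total=50):
    """Return True when at least `min_total` gendered entities were counted."""
    return sum(counts.values()) >= min_total


# Invented counts for two imaginary articles.
small_article = {"male": 30, "female": 5}
large_article = {"male": 60, "female": 8}
print(is_interpretable(small_article), is_interpretable(large_article))
```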

Insights

Focus on economics

Chart measuring gender diversity in the Wikipedia article Economics in May 2022.

Let's have a look at the article about economics. In May 2022, we find 137 males, 6 cisgender females, and 1 transgender female[6]. So fewer than 5% of people quoted in the article are female. Of course, everyone knows that many prominent economists, from Adam Smith to Jean Tirole, are male, so no one is really surprised to find a vast majority of males in the results. Nobody would be able to say what a fair share of females in the article would be. However, I personally think that 5% is not much and that the contribution of women to economics is greater than this figure suggests. Harriet Martineau, Mary Paley Marshall, Joan Robinson, Elinor Ostrom, Anna Schwartz, Janet Yellen, Esther Duflo, and Susan Athey have all made major contributions to economics.

Focus on academic fields

Share of people cited in articles by gender for academic fields

In this section, I compare gender diversity in Wikipedia articles about some important academic fields. As with economics, we know that most academic fields have long been dominated by male figures, so we're not surprised to find a relatively low share of women in Wikipedia articles. Comparing Physics, Architecture, Economics, Social science, Computer science, Philosophy, Mathematics, Psychology, Medicine, Music, Political science, Sociology, Biology, Science, Art, History, and Literature, I find that all of them have a proportion of men higher than 80%[7]. Values for computer science and political science should be taken with caution since the number of people cited in those articles is lower than 50. If we exclude computer science and political science, we find that in 10 out of 15 articles women make up less than 10% of all gendered entities! If we look at raw numbers, the count of women in each article is really low: 4 women in mathematics, 4 women in medicine, 1 woman in physics.

Conclusion and discussion

I believe that measuring helps to raise awareness of the problem of gender diversity in Wikipedia articles. Anyone can play with the gender diversity explorer and discover some insights.

In the coming months, I would like to explore gender diversity in articles about occupations (journalist, politician, etc.) and activities (journalism, politics, sports, etc.). I would also like to see large-scale studies looking at all articles about academic fields or all articles about an occupation.

My experiments with measuring gender diversity in Wikipedia articles lead me to believe that women are often forgotten or undermined in Wikipedia articles about general topics. It would be worthwhile to give specific attention to this topic. WikiProjects such as Women in Red could focus on this issue to ensure that the role of women hasn't been diminished in articles.

References

  1. ^ "The idea of closing the “gender gap” itself has always struck me as somewhat problematic as it implies a gulf between two equivalent sides and reinforces the idea of binary gender. An aspiration to equitable “gender diversity” might be more fitting" writes Katherine Maher in "Capstone: Making History, Building the Future Together", in Wikipedia @ 20, MIT Press, 2020, https://wikipedia20.pubpub.org/pub/4d61w771/release/2?readingCollection=08ec69da
  2. ^ https://larevuedesmedias.ina.fr/la-radio-et-la-tele-les-femmes-parlent-deux-fois-moins-que-les-hommes
  3. ^ "Computational appraisal of gender representativeness in popular movies", https://www.nature.com/articles/s41599-021-00815-9
  4. ^ Gendered News project, https://gendered-news.imag.fr/genderednews/
  5. ^ See the SPARQL queries in the project methodology
  6. ^ https://observablehq.com/@pac02/explore-gender-diversity-in-a-single-wikipedia-article?wikipedia=en.wikipedia.org&article=Economics
  7. ^ https://observablehq.com/@pac02/gender-diversity-in-wikipedia-articles-evidence-from-some?collection=@pac02/gender-diversity-in-wikipedia-articles

Discuss this story

It could be argued, though, that “increasing the proportion of women in our citations for the sake of such” is one way of countering systemic bias. I don’t think we need to know detailed statistics about the contribution of women to economics to know that 95% of citations being from men is likely to be unrepresentative and worth improving on. — OwenBlacker (he/him; Talk; please {{ping}} me in replies) 09:22, 30 May 2022 (UTC)
We face the same issue with the share of women among biographies. No one knows what the right or fair share is (15%, 19%, 30%?). But in recent years, projects such as Women in Red have focused on this issue and made an effort to increase the number of biographies dedicated to women. I'm just raising the same issue at the article level (poke Chess). Of course we need to rely on sources and reflect the reality of the topic. But we have some editorial freedom in the way we write articles and we can develop some aspects of the topic. In the article about economics in French, I've dedicated a section to the question of women in economics. I think it's a good way to start (if there are some sources of course). Last but not least, it's also in my opinion one aspect of the concept of "knowledge equity", which is key in the Wikimedia movement strategy (Wikipedia:Wikimedia Strategy 2018–20/Innovate in Free Knowledge). PAC2 (talk) 16:45, 31 May 2022 (UTC)
  • I am a bit wary of the methodology here. As acknowledged in the article above, only those people with Wikipedia articles get counted in the statistics. If we go with the opening premise that non-male genders are underrepresented in Wikipedia articles, we are compounding the error by multiplying two disparities together. For example, an article has x% of citations from males and y% of citations from females. Now, for the sake of simplicity, let's say 80% of all biographical articles are about males and 19% are about females. Only comparing citations with linked articles we have x*0.8 and y*0.19. This results in a far lower percentage for female citations in the graphs than is mentioned in the article. I am not sure how we improve the calculation methodology but it is worth remembering that the level of the imbalance reported is distorted by our own distorted data. From Hill To Shore (talk) 17:33, 30 May 2022 (UTC)
    From Hill To Shore you're right. This can be part of the interpretation of the results and one way to improve gender diversity in an article would simply be to create articles about women named in the article. PAC2 (talk) 06:55, 31 May 2022 (UTC)
    Since the level of representation is being approximately squared by your methodology, perhaps the square root of the result would be a more accurate estimate of the representation. ~Kvng (talk) 19:36, 31 May 2022 (UTC)
    This occurred to me as well. In the absence of information on any differences between the two proportions (% cited and % bluelinked), root-transforming sounds like a reasonable hack to remove the compound effect. --Elmidae (talk · contribs) 07:16, 2 June 2022 (UTC)
  • You say in the text that you're measuring the "share of people cited in an article by gender" (later you refer to the percentage of "people quoted in the article"). I think most readers would understand this to mean that you're looking at the gender distribution of the authors of works cited in the article and listed in the "References" section. So it was surprising to see that you're actually measuring the people wikilinked from the article. Colin M (talk) 19:50, 30 May 2022 (UTC)
    Thanks for the feedback Colin M. The term "wikilinked" is more precise but I'm not sure that everyone understands it. PAC2 (talk) 06:55, 31 May 2022 (UTC)
    Even if the term "wikilinked" isn't perfect, "cited" is worse. It means it won't count a female author who either isn't mentioned, isn't linked, or is red linked in the references section, but will count a bluelink that says something like "Economist John Doe wrote his seminal work on economics while on sabbatical in France and having an affair with the musician [[Marie LastName]]", where the bluelink has little to do with the topic at hand. "Mentioned and linked" might be better if you truly think "wikilinked" is too jargon-y. As a side note, I'd argue that an attempt to do this same test but for references / Bibliographies only would be a worthy endeavor, just some articles don't have well-formatted citations, and you can't look at Wikidata for unlinked authors. SnowFire (talk) 23:42, 2 June 2022 (UTC)
    @SnowFire: Yes, the tool cannot presently look at Wikidata for authors that do not have a link from the analyzed Wikipedia article, but many scholarly publications are in Wikidata, many of their authors have been disambiguated, and a sizeable number of these have gender information, so by looking up the publications in Wikidata and their gender diversity, a more fine-grained picture might emerge for the Wikipedia article in question. Daniel Mietchen (talk) 01:43, 5 June 2022 (UTC)
  • Thank you for this interesting article! Zarasophos (talk) 09:15, 3 June 2022 (UTC)
  • Although it is noted that there is indeed a disparity, that is mainly because women were not entitled to study at degree level until the 20th century. As such, there were very few women who COULD be mentioned on such a wide topic as "Economics", and they would be massively outnumbered prior to 1940. I am surprised that there is no mention of either Anna Schwartz (co-edited the Friedman book) or Elinor Ostrom (who is THE ONLY woman to ever have won a Nobel prize for economics). I have added Mary Paley Marshall to the article, as she co-authored a book where her husband was mentioned (but not her!?!). I fear all this analysis & debate is less positive than adding stuff that is seen to be obviously missing. Chaosdruid (talk) 03:34, 9 June 2022 (UTC)
@Chaosdruid: The analysis has drawn to your attention (someone who has knowledge of female involvement in Economics) that there is a gap in the article and you have made an improvement to it. I would say that is a positive. Similar analysis of other articles may help identify other areas where there are particular gaps.
Most of the "debate" above is about refining the method of analysis to produce more accurate data. With accurate data, we will be able to spot articles that have an unusual disparity and correct them. From Hill To Shore (talk) 05:44, 9 June 2022 (UTC)
@From Hill To Shore: I have no fore-knowledge, except A level economics - I simply did a search on Google for the top 10 female economists, read about them, and used that info. That should have already been done, since this page has discussion involving 8+ editors going back for at least two weeks. I feel that is a negative. Similarly these edits include removing a male author, instead of leaving him and adding the inserted female one; which actually looks like more of a negative considering that the article now does not include the counter statement to the previous paragraph end.
Perhaps action is more important than discussion - do we wait to see if anyone else actually adds the other 2 I mentioned? Maybe then we can do an analysis of why no one bothered to actually fix the thing you were all discussing? I will leave it up to one of the other nine or so editors to maybe add some detail on the ladies I mentioned as I feel perhaps there is a little bit of looking for a disparity rather than curing it. I did not see a "gap in the article", I saw a gap in the editing of said article after someone had raised a flag.
... and yes, I get annoyed about things that are discussed and never actually acted upon Wikiwide, as well as hasty knee-jerk editing that tries to correct a perceived wrong but actually lowers the accuracy of an article. Chaosdruid (talk) 07:15, 9 June 2022 (UTC)
@Chaosdruid: So, you are complaining about knee jerk reactions but want 8+ editors to jump in and attempt to fix something they may not be familiar with? My interest and expertise do not lie with economics, so you are better placed than I to look at that article. Also, your example of a set of bad edits involve an ongoing content dispute on the article talk page that predated the publication of this edition of Signpost. Why are you trying to link an unrelated content dispute to the editors here?
You are also misrepresenting this discussion. While a few people here have talked about the example used of the economics article, most of the comments are about the principles and methods of analysis. Is there actually a problem and is the data a valid representation of the situation? You want us to fix wiki-wide problems but seem to begrudge people giving up their own time to discuss how we can better understand what the problem is and where we should fix it. That you wanted to improve the economics article and went ahead and edited it is great. However, you shouldn't expect every editor to conform to your expectations and timescales. We all improve the project in our own ways and at our own speed. From Hill To Shore (talk) 09:25, 9 June 2022 (UTC)

Wikipedia:Wikipedia Signpost/2022-05-29/In_focus