The Signpost

The Science Hall of Fame

Building a pantheon of scientists from Wikipedia and Google Books

On 14 January 2011, Science, one of the world's most prominent scientific journals, published a "Science Hall of Fame" (SHoF),[1] which it described as "a pantheon of the most famous scientists of the past 2 centuries". Traditional assessments of "fame" and "influence" usually rely on polls and the opinions of experts in the field. Science instead opted for an objective approach based on how many times people's names are found in the digitized copies of books available on Google Books, and on whether Wikipedia considered these people scientists.

The Signpost takes a look at what they did and reports some of the trends in the data.

Origins and overview

The SHoF was made possible by a paper written by Jean-Baptiste Michel and Erez Lieberman Aiden and published on 17 December 2010.[2] Michel and Aiden aggregated data from Google Books, which at the time contained 15 million digitized books, roughly 12% of all books ever published. Filtering for quality revealed that about a third of the original data was suitable for analysis; this is available online at ngrams.googlelabs.com. Based on Michel and Aiden's work, John Bohannon from Science and Adrian Veres from Harvard University teamed up to create a pantheon of the most influential scientists as measured by the number of times their names appear in Google Books. However, millions of names can be found in books, and thus a way was needed to decide who is a scientist and who is not. This is where Wikipedia comes in, with its 900,000 biographical entries (the authors report 750,000),[3][4] which can be searched for science-related categories, science-related keywords, and years of birth and death.[5]

The authors created a new unit, the darwin (D), defined as "the average annual frequency that 'Charles Darwin' appears in English-language books from the year he turned 30 years old (1839) until 2000". Scientists named more often than Darwin himself would have a fame greater than 1 darwin. However, as few people were as influential as Darwin, the millidarwin (mD; a thousandth of a darwin) is used instead. As it turns out, only three people beat Darwin in terms of fame as measured by this metric: John Dewey (1752.7 mD), Bertrand Russell (1500.1 mD), and Sigmund Freud (1292.9 mD). Other famous figures, such as Albert Einstein (878.2 mD), Louis Pasteur (237.5 mD), and Marie Curie (188.6 mD), rank lower.[6] This is not a measure of the impact of their scientific work, but rather of how often they are mentioned in all types of books. For example, a scientist could have a moderate scientific impact but be famous for political involvement or even for negative scientific impact, such as involvement in scientific fraud or high-profile pseudoscience.
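As a rough illustration of the unit, the sketch below converts yearly name frequencies into millidarwins. It assumes that a person's fame is their mean annual frequency divided by Darwin's and multiplied by 1000; the source only defines the darwin itself, so the averaging window applied to other scientists, the function names, and the frequencies used here are all hypothetical.

```python
from statistics import mean


def mean_annual_frequency(freq_by_year, start_year, end_year=2000):
    """Average yearly frequency over start_year..end_year (inclusive).

    Years absent from freq_by_year are counted as zero.
    """
    return mean(freq_by_year.get(year, 0.0) for year in range(start_year, end_year + 1))


def fame_in_millidarwins(person_mean, darwin_mean):
    """Express a mean annual frequency as a multiple of Darwin's, in mD."""
    return 1000.0 * person_mean / darwin_mean


# Hypothetical frequencies (share of all words per year), for illustration only.
darwin = {1839: 2e-6, 1840: 4e-6}
someone = {1839: 1e-6, 1840: 2e-6}

# A shortened two-year window keeps the toy example readable.
one_darwin = mean_annual_frequency(darwin, 1839, end_year=1840)
fame = fame_in_millidarwins(mean_annual_frequency(someone, 1839, end_year=1840), one_darwin)
print(f"{fame:.1f} mD")
```

Under this reading Darwin himself scores exactly 1000 mD by construction, which matches the value of 1000.0 listed for him in the Science data.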

The authors warn that the current version of the Science Hall of Fame is a rough draft subject to further refinements; not all fields are covered equally and some scientists were excluded for technical reasons. Further details are on the Science Hall of Fame website, especially its FAQ section. As an aside, the authors called this an experiment in "culturomics" (the analysis of large sets of data to find cultural trends), a term the American Dialect Society voted the "least likely to succeed" word of 2010.[7] It will be interesting to see if the word catches on, and whether the culturomics link will remain red or turn blue.

Fame, article quality, and other trends on Wikipedia

Based on this measure of fame as established by Bohannon and Veres, a comparison of Wikipedia with the SHoF (WP:SHOF) was created by Snottywong, based on a suggestion from this article's writer. The comparison lists scientists, along with their fame in mD and years of birth and death as reported by Science, as well as years of birth and death as reported by Wikipedia and assessment ratings (taken from the {{WikiProject Biography}} banner). Since the SHoF remains a rough draft at the moment, a highly rigorous analysis of its findings would be pointless at this stage; however, some things are worth noting.

Numbers

First, some numbers. As of writing ...

The high degree of "completion" or of "accuracy" should not be considered a sign that Wikipedia is "complete" or "accurate", because the authors used Wikipedia to determine whether people were scientists or not and possibly used dates from Wikipedia articles. People who lack a Wikipedia entry would presumably be excluded on "technical grounds". It is also possible that the discrepancies merely reflect changes in Wikipedia due to vandalism, mistakes, and corrections which occurred since the data was acquired.

Fame and quality

How, then, are article quality and fame related? Number crunching by The Signpost revealed the following.

Class               FA      GA      B      C      Start   Stub   ???
Average fame (mD)   107.8   113.45  60.4   17.9   11.73   5.92   5.9

Share of articles in each class (%) by fame range

Class ↓ / Fame (mD) →   0–0.1   0.1–1   1–10   10–100   100–1000   1000–10000
FA                      0       0       50     21.4     28.6       0
GA                      7.7     7.7     46.2   15.4     23.1       0
B                       2.5     8.4     38.7   38.7     10.5       1.3
C                       4.1     16.3    53.7   21.1     4.9        0
Start                   2.8     18.6    58.5   17.9     2.3        0
Stub                    4.6     25.3    57.8   11.8     0.4        0
???                     4.4     22.7    60.5   11.8     0.6        0

We can see that as the quality of articles increases, so, broadly, does the mean fame of the scientists within the assessment class (the one exception being that GA-class articles edge out FA-class ones). Another way to look at this is through the distribution of fame within assessment classes: as the quality increases, the distribution shifts towards higher fame. That is, higher-quality articles tend to be about more famous people. While fame correlates to some extent with quality, it is still in no way a guarantee that a famous person will have a high-quality article on Wikipedia (or vice versa). There is not much else to say about these rather unsurprising results, except perhaps to mention that the distribution of unassessed articles most closely matches that of stubs.
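For readers who want to repeat this kind of number crunching, here is a minimal sketch of how a mean fame and a fame distribution per assessment class could be computed. It is not The Signpost's actual script: the function name, the bin edges (upper bound exclusive), and the rounding are assumptions, and the sample rows are just the ten men from the top-10 list in this article rather than the full data set, so the output will not match the tables above.

```python
from collections import defaultdict
from statistics import mean

# Fame ranges in mD, lower bound inclusive and upper bound exclusive (an assumption).
FAME_BINS = [(0, 0.1), (0.1, 1), (1, 10), (10, 100), (100, 1000), (1000, 10000)]


def summarise(records):
    """Return {class: (mean fame, [percentage of articles per fame bin])}."""
    by_class = defaultdict(list)
    for assessment_class, fame in records:
        by_class[assessment_class].append(fame)

    summary = {}
    for assessment_class, fames in by_class.items():
        counts = [sum(lo <= f < hi for f in fames) for lo, hi in FAME_BINS]
        shares = [round(100 * c / len(fames), 1) for c in counts]
        summary[assessment_class] = (mean(fames), shares)
    return summary


# (assessment class, fame in mD) for the ten men in this article's top-10 list.
sample = [
    ("B", 1752.7), ("B", 1500.1), ("B", 1292.9), ("FA", 1000.0), ("GA", 878.2),
    ("Start", 672.4), ("B", 506.6), ("Start", 480.5), ("C", 479.3), ("B", 466.0),
]

summary = summarise(sample)
for assessment_class, (average, shares) in summary.items():
    print(f"{assessment_class:5} mean={average:8.2f} mD  shares={shares}")
```

Each row of shares sums to roughly 100, which is the same shape as the distribution table above.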

Gender and quality

We can also take a look at some trends. For gender, the rankings of the top 10 men and top 10 women look like this.

Men
Rank  Name              Fame (mD)  Rating  Main fields
1     John Dewey        1752.7     B       Philosophy, psychology
2     Bertrand Russell  1500.1     B       Philosophy, mathematics
3     Sigmund Freud     1292.9     B       Psychology
4     Charles Darwin    1000.0     FA      Biology
5     Albert Einstein   878.2      GA      Physics
6     Havelock Ellis    672.4      Start   Psychology
7     Noam Chomsky      506.6      B       Philosophy
8     G. Stanley Hall   480.5      Start   Psychology
9     Lewis Carroll     479.3      C       Literature, religion
10    Erich Fromm       466.0      B       Psychology, philosophy

Women
Rank  Name              Fame (mD)  Rating  Main fields
1     Anna Freud        537.1      Start   Psychology
2     Julia Kristeva    444.0      Start   Philosophy, feminism
3     Ruth Benedict     314.7      Start   Anthropology
4     Melanie Klein     285.1      Start   Psychology
5     Luce Irigaray     242.9      Stub    Sociology, feminism
6     Carol Gilligan    235.3      Start   Psychology, feminism
7     Juliet Mitchell   210.2      Stub    Psychology, feminism
8     Nancy Chodorow    209.0      Stub    Sociology, feminism
9     Marie Curie       188.6      B       Physics
10    Karen Horney      158.7      Start   Psychiatry

Unsurprisingly, the top 10 men have higher fame than the top 10 women. At first glance, there seems to be an enormous disparity between the quality of articles on men (1 FA, 1 GA, 5 B, 1 C, 2 Start, 0 Stub) and those on women (0 FA, 0 GA, 1 B, 0 C, 6 Start, 3 Stub). However, that conclusion is not strongly supported at the moment. Since quality and fame are somewhat correlated, it would be natural for the top 10 women (who are, on average, less famous) to have articles of somewhat lower quality, although the discrepancy here seems too big to be explained by lower fame alone. Instead, the best indicator of quality seems to be a combination of fame and field. Many of the men in the top 10 come from the hard sciences and philosophy, while most of the women come from the humanities (especially feminism and psychology/psychiatry). Indeed, the two lowest-rated articles for men (both Start-class) concern psychologists (Havelock Ellis and G. Stanley Hall), and the only article on a woman rated above Start-class is about a physicist (Marie Curie).

Based on this, a more sensible conclusion would be that famous people from the humanities are under-represented on Wikipedia, compared to other fields. However, even that conclusion should not be embraced blindly. After all, it relies on a very small sample (10 men, 10 women). Someone interested in doing a rigorous analysis of the data would have to weight rating scores according to fame and year of birth, classify people according to the fields for which they are famous, and make sure the ratings are up to date. For example, the articles on Anna Freud and Melanie Klein could arguably be rated C-class instead of their current Start-class, and Karen Horney is rated B-class by WikiProject Psychology, but the WikiProject Biography rating has not yet been updated and is still Start-class. Lastly, there is possibly a bias in the Google Books selection. Books from certain fields could be digitized more often than books from other fields, or the writing conventions of a field could make full names more prevalent than in other fields, boosting the "measured fame" compared to the "actual fame".
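The rating tallies quoted for the two top-10 lists can be checked with a few lines of Python. This is only a sketch: the ratings are copied from the tables in this article in rank order, and the helper name `tally` is an illustrative choice rather than anything used by the authors.

```python
from collections import Counter

# Assessment classes in descending order of quality.
CLASSES = ["FA", "GA", "B", "C", "Start", "Stub"]

# Ratings copied from the top-10 tables in this article, in rank order.
men = ["B", "B", "B", "FA", "GA", "Start", "B", "Start", "C", "B"]
women = ["Start", "Start", "Start", "Start", "Stub", "Start", "Stub", "Stub", "B", "Start"]


def tally(ratings):
    """Count articles per assessment class, including classes with no articles."""
    counts = Counter(ratings)
    return {cls: counts.get(cls, 0) for cls in CLASSES}


for label, ratings in (("Men", men), ("Women", women)):
    print(label, ", ".join(f"{n} {cls}" for cls, n in tally(ratings).items()))
```

With only ten entries per list, a single re-rating (such as the Karen Horney case mentioned above) shifts these counts noticeably, which is one reason the sample is too small to support firm conclusions.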

Closing remarks

The above analysis gives just a hint of the types of questions that can be asked and answered by analysis of the SHoF data. It will be very interesting to see whether concerted efforts are made to improve coverage in fields that are lacking, and whether the gender gap truly exists or is the result of a coverage bias among fields. It will be equally interesting to see if culturomics will take off as a field and what direction it will take.

Got comments or ideas of your own? Share them below!

References

  1. ^ J. Bohannon (2011). "The Science Hall of Fame". Science. 331 (6014): 143. doi:10.1126/science.331.6014.143-c.
  2. ^ J.-B. Michel; et al. (2010). "Quantitative analysis of culture using millions of digitized books". Science. 331 (6014): 176. doi:10.1126/science.1199644. PMID 21163965.
  3. ^ WP 1.0 bot (23 January 2011). "Wikipedia:Version 1.0 Editorial Team/Biography articles by quality statistics". Retrieved 2011-01-27.
  4. ^ J. Bohannon, A. Veres. "FAQ: How are the members of the Hall of Fame chosen?". The Science Hall of Fame. Retrieved 2011-01-27.
  5. ^ J. Bohannon, A. Veres. "FAQ: How are the members of the Hall of Fame chosen?". The Science Hall of Fame. Retrieved 2011-01-27.
  6. ^ Snottywong, Headbomb (25 January 2011). "Science Hall of Fame". WikiProject Biography. Retrieved 2011-01-27.
  7. ^ ""App" 2010 Word of the Year, as voted by American Dialect Society" (PDF) (Press release). American Dialect Society. 7 January 2011. Retrieved 2011-01-27.

Discuss this story

  • Under "Fame, article quality, and other trends on Wikipedia" (permanent link here), the subsection "Numbers" says "the SHoF contains 5,631 entries ... of these, 1828 are living, 3808 are dead." 5631 ≠ 1828 + 3808. -- Wavelength (talk) 04:06, 1 February 2011 (UTC)
  • I wouldn't ascribe this directly to US-centricity, although I doubt I'm the only one surprised that Isaac Newton didn't make the list, or other notable older scientific figures like Galileo, Plato or Aristotle. (Moreover, most Americans, if asked to identify John Dewey, probably could do little better than guess he invented the Dewey Decimal System, which was the work of another man.) But a glance at the John Dewey article offers a possible answer: he has been the target of much vitriol by American conservatives. (Damn that man for working towards the goal of offering the average American a useful & liberal education!) -- llywrch (talk) 18:53, 1 February 2011 (UTC)
  • There appears to be a high degree of correlation between "fame" and controversy. I think Newton and Galileo might not rank in the top ten because controversy over Newton's primacy and Galileo's polemics had died down well before the advent of the Google era, even if they did fall within the two-century timeframe of the project, which they do not. ~ Ningauble (talk) 19:51, 1 February 2011 (UTC)
  • As covered in this 31 January 2011 issue of The Signpost, the New York Times prints an article to announce to the world "Wikipedia's gender gap" merely by selecting a few Wikipedia articles purportedly on "topics more likely to be followed by boys" and "topics more likely to be followed by girls" (which itself has bias, scope, author age, and target audience problems) and comparing them by eye to draw a predetermined conclusion.[1] In contrast, as covered in the same 31 January 2011 issue of The Signpost, the Signpost publishes "Building a pantheon of scientists from Wikipedia and Google Books," an objective analysis based on analytical thought that publishes its support for the conclusions drawn by the article. The New York Times continues to be held out as THE reliable source of reliable sources, whereas The Signpost is held very low on the totem pole when it comes to usage in Wikipedia articles. What's wrong with that picture? Headbomb and The Signpost, congratulations on another outstanding job. -- Uzma Gamal (talk) 11:05, 1 February 2011 (UTC)
  • For surnames only, I think in general Newton and Darwin are more common in uses other than the scientists' names; especially for Newton, many results found from the periods at bottom are not related to Sir Isaac. With the bigrams you misspelled Newton's first name and probably noticed how common typos or misreadings are instead (at least in the link above): here's the real link. -- innotata 14:48, 1 February 2011 (UTC)
  • Haidar Abbas Rizvi's birth year was reported incorrectly in this article, listing both his Science and Wikipedia birth years as 1967. I changed it to 1969, the correct value. And while I was looking into it, I determined that the 1967 date was actually vandalism that had been left in the article for over five months. Oops. I've fixed it, but it's probably a good idea to check the other articles on this list for vandalism. In particular we should be looking at those who are alive (according to either Wikipedia or Science), died recently, or have discrepancies in birth or death dates. Reach Out to the Truth 22:00, 1 February 2011 (UTC)[reply]
  • Bizarrely, the SHOF list is missing Mary Anning. I would have expected her to make the top 10 list of women scientists. I left a comment at their website about her being missed. She should easily qualify, as a search for her name at Google Books returns more than 9000 results.
  • Culturomics "least likely to succeed" word of 2010? By creating a mechanism for others to stroke the egos of both Google (who can give grants) and Wikipedia (who can boost fame) at the same time and to 'scientifically' validate what these two giants do, I think that culturomics soon will be on the lips of everyone. In researching the term, Google Books shows "culturomic" being used in 2008 as in "genomic-proteomic-culturomic enterprise". I think the first two relate to enterprises/organisms based on heredity (genomic enterprise) and enterprises based on proteins (proteomic enterprise). In that context, does anyone have a guess as to what a culturomic enterprise might be? -- Uzma Gamal (talk) 11:41, 2 February 2011 (UTC)

Wikipedia:Wikipedia Signpost/2011-01-31/The_Science_Hall_of_Fame