On 14 January 2011, Science, one of the world's most prominent scientific journals, published a "Science Hall of Fame" (SHoF),[1] which it described as "a pantheon of the most famous scientists of the past 2 centuries". Unlike traditional assessments of "fame" and "influence", which usually rely on polls and the opinions of experts in the field, Science opted for an objective approach based on how many times people's names appear in the digitized books available on Google Books, and on whether Wikipedia considers those people scientists.
The Signpost takes a look at what they did and reports some of the trends in the data.
The SHoF was made possible by a paper by Jean-Baptiste Michel and Erez Lieberman Aiden, published on 17 December 2010.[2] Michel and Aiden aggregated data from Google Books, which at the time contained 15 million digitized books—roughly 12% of all books ever published. Filtering for quality left about a third of the original data suitable for analysis; it is available online at ngrams.googlelabs.com. Building on Michel and Aiden's work, John Bohannon from Science and Adrian Veres from Harvard University teamed up to create a pantheon of the most influential scientists, as measured by the number of times their names appear in Google Books. However, millions of names can be found in books, so a way was needed to decide who is a scientist and who is not. This is where Wikipedia comes in, with its 900,000 biographical entries (the authors report 750,000),[3][4] which can be searched for science-related categories, science-related keywords, and years of birth and death.[5]
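The kind of filter described above can be sketched in a few lines. This is a toy illustration of the general idea, not the authors' actual code; the hint list and function name are our own invention.

```python
# Hypothetical sketch: decide whether a Wikipedia biography "looks like"
# a scientist's by scanning its category names for science-related keywords.
SCIENCE_HINTS = {"physicist", "chemist", "biologist", "psychologist",
                 "mathematician", "naturalist", "scientist"}

def looks_like_scientist(categories):
    """Return True if any category name contains a science-related keyword."""
    cats = " ".join(categories).lower()
    return any(hint in cats for hint in SCIENCE_HINTS)

print(looks_like_scientist(["English naturalists", "1809 births"]))  # True
print(looks_like_scientist(["American novelists", "1819 births"]))   # False
```

The real pipeline would also need the birth and death years mentioned in the article, since a frequency curve only makes sense once a person could plausibly be written about.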
They have created a new unit, the darwin (D), defined as "the average annual frequency that 'Charles Darwin' appears in English-language books from the year he turned 30 years old (1839) until 2000". Scientists named more often than Darwin himself would have a fame greater than 1 darwin. However, as few people were as influential as Darwin, the millidarwin (mD; a thousandth of a darwin) is used instead. As it turns out, only three people beat Darwin in terms of fame as measured by this metric: John Dewey (1752.7 mD), Bertrand Russell (1500.1 mD), and Sigmund Freud (1292.9 mD). Other famous figures, such as Albert Einstein (878.2 mD), Louis Pasteur (237.5 mD), and Marie Curie (188.6 mD), rank lower.[6] This is not a measure of the impact of their scientific work, but rather of how often they are mentioned in all types of books. For example, a scientist could have a moderate scientific impact but be famous for political involvement or even for negative scientific impact, such as involvement in scientific fraud or high-profile pseudoscience.
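Given the definition above, converting a name's n-gram frequencies into millidarwins is straightforward. The sketch below uses made-up frequency values purely for illustration; the function name and numbers are our own assumptions, not the authors' code or data.

```python
# Hypothetical sketch: fame in millidarwins, given yearly n-gram frequencies
# (each value is the fraction of that year's words matching the name).
def fame_millidarwins(person_freqs, darwin_freqs):
    """Average annual frequency of a name, relative to the 1839-2000
    average for 'Charles Darwin' (1 darwin), expressed in mD."""
    one_darwin = sum(darwin_freqs) / len(darwin_freqs)
    person_avg = sum(person_freqs) / len(person_freqs)
    return 1000 * person_avg / one_darwin

# Toy numbers: a person mentioned half as often as Darwin, 1839-2000.
darwin = [2e-7] * 162
person = [1e-7] * 162
print(fame_millidarwins(person, darwin))  # 500.0
```

A person mentioned half as often as Darwin scores 500 mD; one mentioned twice as often would score 2000 mD, i.e. 2 darwins.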
The authors warn that the current version of the Science Hall of Fame is a rough draft subject to further refinements; not all fields are covered equally and some scientists were excluded for technical reasons. Further details are on the Science Hall of Fame website, especially their FAQ section. As an aside, the authors called this an experiment in "culturomics" (the analysis of large sets of data to find cultural trends), which has been dubbed by the American Dialect Society as the "least likely to succeed" word of 2010.[7] It will be interesting to see if the word catches on, or if the culturomics link will remain red or turn blue.
Based on this measure of fame as established by Bohannon and Veres, a comparison of Wikipedia with the SHoF (WP:SHOF) was created by Snottywong, based on a suggestion from this article's writer. The comparison lists scientists, along with their fame in mD and years of birth and death as reported by Science, as well as years of birth and death as reported by Wikipedia and assessment ratings (taken from the {{WikiProject Biography}} banner). Since the SHoF is still a rough draft, a highly rigorous analysis of its findings would be pointless at this stage; however, some things are worth noting.
First, some numbers. As of writing ...
The high degree of "completion" or of "accuracy" should not be considered a sign that Wikipedia is "complete" or "accurate", because the authors used Wikipedia to determine whether people were scientists or not and possibly used dates from Wikipedia articles. People who lack a Wikipedia entry would presumably be excluded on "technical grounds". It is also possible that the discrepancies merely reflect changes in Wikipedia due to vandalism, mistakes, and corrections which occurred since the data was acquired.
How, then, are article quality and fame related? Number crunching by The Signpost revealed the following.
(Table: mean fame in mD by assessment class — not reproduced here.)
We can see that as the quality of articles increases, so does the mean fame of the scientists within the assessment class. Another way to look at this is through the distribution of fame within assessment classes: as the quality increases, the distribution of fame shifts towards higher fame – that is, higher-quality articles tend to be about more famous people. While fame correlates to some extent with quality, it is by no means a guarantee that a famous person will have a high-quality article on Wikipedia (or vice versa). There is not much else to say about these rather unsurprising results, except perhaps to mention that the distribution of unassessed articles most closely matches that of stubs.
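The Signpost's number crunching amounts to grouping fame scores by assessment class and averaging. A minimal sketch, using made-up (fame, class) pairs rather than the real WP:SHOF data:

```python
from statistics import mean

# Hypothetical (fame in mD, assessment class) pairs; the real data is
# in the WP:SHOF comparison table, not reproduced here.
scientists = [
    (900.0, "FA"), (300.0, "GA"), (160.0, "B"), (140.0, "B"),
    (45.0, "C"), (12.4, "Start"), (3.1, "Stub"), (2.2, "Stub"),
]

# Group fame scores by assessment class.
by_class = {}
for fame, cls in scientists:
    by_class.setdefault(cls, []).append(fame)

# Mean fame per class, from highest to lowest quality.
for cls in ("FA", "GA", "B", "C", "Start", "Stub"):
    if cls in by_class:
        print(f"{cls}: mean fame {mean(by_class[cls]):.1f} mD")
```

With real data one would also want the distribution within each class, not just the mean, since fame is heavily skewed towards a few very famous names.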
We can also take a look at some trends. For gender, the rankings of the top 10 men and top 10 women look like this.
(Table: top 10 men by fame — not reproduced here.)
(Table: top 10 women by fame — not reproduced here.)
Unsurprisingly, the top 10 men have higher fame than the top 10 women. At first glance, it seems there is an enormous disparity between the relative quality of articles on men (1 FA, 1 GA, 5 B, 1 C, 2 Start, 0 Stub) compared to those on women (0 FA, 0 GA, 1 B, 0 C, 6 Start, 3 Stub). However, that conclusion is not strongly supported at the moment. Since quality and fame are somewhat correlated, it would be natural for the top 10 women (who are, on average, less famous) to have articles of somewhat lower quality, although the discrepancy here seems too large to be explained by lower fame alone. Instead, the top indicator of quality seems to be a combination of fame and field. Many of the men in the top 10 ranking come from the hard sciences and philosophy, while most of the women come from the humanities (especially feminism and psychology/psychiatry). Indeed, the two lower-ranked articles (Start-class) for men concern psychologists (Havelock Ellis and G. Stanley Hall), and the only article above Start-class for women is for a physicist (Marie Curie).
Based on this, a more sensible conclusion would be that famous people from the humanities are under-represented on Wikipedia compared to other fields. However, even that conclusion should not be embraced blindly: after all, it relies on a very small sample (10 men, 10 women). Someone interested in doing a rigorous analysis of the data would have to weight rating scores according to fame and year of birth, classify people according to the fields for which they are famous, and make sure the ratings are up to date. For example, the articles on Anna Freud and Melanie Klein could arguably be rated C-class instead of their current Start-class, and Karen Horney is rated B-class by WikiProject Psychology, but the WikiProject Biography rating has not yet been updated and still shows Start-class. Lastly, there is possibly a bias in the Google Books selection: books from certain fields could be digitized more often than books from other fields, or a field's writing conventions could make full names more prevalent than in other fields, boosting its scientists' "measured fame" relative to their "actual fame".
The above analysis gives just a hint of the kinds of questions that can be asked and answered with the SHoF data. It will be very interesting to see whether concerted efforts emerge to improve coverage in the fields that are lacking, and whether the gender gap truly exists or is the result of a coverage bias among fields. It will be equally interesting to see whether culturomics takes off as a field, and what direction it takes.
Got comments or ideas of your own? Share them below!