Growth study

Study: Wikipedia's growth may indicate unlimited potential

According to a new study, Wikipedia has a pattern of growth that may indicate unlimited potential.

In a study published in the August issue of Communications of the ACM entitled "The Collaborative Organization of Knowledge" (abstract; working draft), computer scientists Diomidis Spinellis and Panagiotis Louridas analyze the relationship between references to non-existent articles (redlinks) and the creation of new articles.

The study, based on the February 2006 dump of English Wikipedia, finds that the link rate from complete (i.e., non-stub) articles to incomplete (non-existent or stub) articles remained nearly constant between 2003 and 2006 (about 1.8 incomplete articles linked from every complete article). A long-term trend in either direction, according to the authors, would indicate an unsustainable growth pattern. If the average number of redlinks per article is increasing, it means that Wikipedia is becoming diffuse and will become less useful as more and more of the terms in the average article are not covered. If the average number is decreasing, it suggests that Wikipedia's growth will slow or stop as the number of links to uncreated articles approaches zero. The stable redlink ratio suggests that Wikipedia is a scale-free network, in principle capable of unlimited growth.
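The link rate described above can be illustrated with a minimal sketch. It assumes a toy link graph held in a dictionary and a known set of complete article titles; the function name, the data layout, and the sample titles are hypothetical, and the study's exact counting rules (for example, how stubs are detected in the dump) may differ.

```python
from typing import Dict, Set


def incomplete_links_per_complete_article(
    links: Dict[str, Set[str]],
    complete: Set[str],
) -> float:
    """Average number of distinct incomplete targets linked from each complete article.

    `links` maps an article title to the set of titles it links to.
    Any target not in `complete` is treated as incomplete (non-existent or stub).
    """
    if not complete:
        raise ValueError("need at least one complete article")
    total = 0
    for title in complete:
        targets = links.get(title, set())
        # Count only links that point at incomplete articles.
        total += sum(1 for target in targets if target not in complete)
    return total / len(complete)


# Toy data (hypothetical): two complete articles, three incomplete targets.
links = {
    "Physics": {"Energy", "Quark", "Gluon"},
    "Energy": {"Physics", "Joule"},
}
complete = {"Physics", "Energy"}

ratio = incomplete_links_per_complete_article(links, complete)
print(ratio)  # 1.5
```

Tracking this value across successive snapshots of the link graph would show whether it is stable, rising, or falling, which is the distinction the authors draw.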

The study also notes that most new articles were created within the first month that they were referenced in another article. Furthermore, only 3% of new articles were created by the same user who created the first link to that article (whether as a redlink or a bluelink). This implies that the connection between redlinks and new articles is a collaborative one, and that adding redlinks actually spurs others to create new articles.

The statistics were re-run with a more recent dump (from January 3, 2008), with results that "don't appear to differ from the ones based on the study's 2006 data set", according to Spinellis (User:Diomidis Spinellis). Wikipedia's growth rate peaked in late 2006, and it declined slightly in 2007 and in the first 7 months of 2008. According to the updated statistics, the incomplete:complete ratio has been dropping gradually since early 2006, and was less than 1.4 in January 2008. However, Spinellis argues that "As long as the ratio is above 1.0, growth as we know it should continue."

Earlier studies

A 2006 study, "Preferential attachment in the growth of social networks: The case of Wikipedia", showed that Wikipedia's early growth (through June 2004) demonstrated preferential attachment: highly-linked articles were more likely to be the target of new links. According to the authors, this indicates one of two things: either Wikipedia editors failed to take full advantage of the wiki model to create a more balanced network, or preferential attachment to highly-linked articles results from "the intrinsic organization of the underlying knowledge". The former case would indicate that Wikipedia's structure cannot overcome the "bounded rationality" of its contributors, each of whom may have limited knowledge beyond his/her area of activity.
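As a rough illustration of what preferential attachment means, the sketch below simulates new links choosing their target with probability proportional to the target's current number of incoming links plus one. This is a toy model under stated assumptions, not the method of the 2006 study, which measured the effect empirically; the function name, the "+ 1" smoothing term, and the parameters are all hypothetical.

```python
import random
from typing import List


def simulate_preferential_attachment(
    n_articles: int, n_links: int, seed: int = 0
) -> List[int]:
    """Return the in-link count of each article after adding `n_links` links."""
    rng = random.Random(seed)
    in_degree = [0] * n_articles
    for _ in range(n_links):
        # Weight each article by its current in-links; the +1 is an assumption
        # that lets articles with no links yet still be chosen.
        weights = [degree + 1 for degree in in_degree]
        target = rng.choices(range(n_articles), weights=weights, k=1)[0]
        in_degree[target] += 1
    return in_degree


degrees = simulate_preferential_attachment(n_articles=50, n_links=2000)
print(sorted(degrees, reverse=True)[:5])  # the most-linked articles in this run
```

In runs of this kind a few articles tend to collect a disproportionate share of the links, which is the "rich get richer" pattern the term refers to.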

The new study is consistent with a "bounded rationality" model, since the creation of new articles depends significantly on the topics editors choose to link to from existing articles. However, it also suggests a possible mechanism for achieving more balanced coverage, as less-covered areas will contain more redlinks, leading to more coverage and even more redlinks.

In contrast to many of the academic studies of Wikipedia, long-term observers within the community have tended to analyze Wikipedia's growth trends in terms of changing content conventions and social dynamics. For example, in a series of blog posts from 2007 ("Wikipedia Plateau?", "Unwanted: New articles in Wikipedia", and "Two Million English Wikipedia articles! Celebrate?") Andrew Lih (User:Fuzheado) examined some of the community factors limiting new article creation. An analysis of article creation and deletion logs by User:Dragons flight from late 2007 showed that for every three articles created, one article was deleted.

A more complete picture of how the size and activity level of the English Wikipedia community has evolved in recent months and years should be available once Erik Zachte updates his statistics website with a recent dump. Zachte was recently hired as a Data Analyst by the Wikimedia Foundation.





    Discuss this story

    Discussion moved from tip line

    In a study published in the August issue of Communications of the ACM entitled "The Collaborative Organization of Knowledge," computer scientists Diomidis Spinellis and Panagiotis Louridas analyze the relationship between references to non-existent articles (redlinks) and the creation of new articles. The study, based on the February 2006 dump of English Wikipedia because more recent dumps were unavailable (for shame!), finds that the ratio of complete (i.e., non-stub) articles to incomplete (non-existent or stub) articles remained nearly constant between 2003 and 2006 (about 1.8 incomplete articles per complete article). A trend in either direction, according to the authors, would indicate an unsustainable growth pattern. If the average number of redlinks per article is increasing, it means that Wikipedia is becoming diffuse and will become less useful as more and more of the terms in the average article are not covered. If the average number is decreasing, it suggests that Wikipedia's growth will slow or stop as the number of links to uncreated articles approaches zero. The stable redlink ratio suggests that Wikipedia is a scale-free network, in principle capable of unlimited growth.

    The study also notes that most new articles were created within the first month that they were referenced in another article. Furthermore, only 3% of new articles were created by the same user who created the first link to that article (whether as a redlink or a bluelink). This implies that the connection between redlinks and new articles is a collaborative one, and that adding redlinks actually spurs others to create new articles.--ragesoss (talk) 00:20, 1 August 2008 (UTC)

    This is a fascinating study but the conclusion is completely wrong. In one direction (diffusion), it omits the possibility that editors will simply remove redlinks if they feel an article has too many of them. This could be primarily an aesthetic decision, rather than an aggregate indicator of potential articles. In the other direction, it omits the possibility that growth could slow from one equilibrium to another equilibrium (lower than current level of growth, still positive, flat with second derivative=0). --JayHenry (talk) 00:38, 1 August 2008 (UTC)
    PS: My comments are based on a fast read of the working draft, which is freely available, and which I assume is the same in substance as the published version? --JayHenry (talk) 00:43, 1 August 2008 (UTC)
    Yes, at first glance it looks very close to the published version. In any case, since they find a stable redlink ratio, the conclusion of a scale-free network pattern of growth seems to stand. To me, though, the most interesting aspect of it was that most articles (right around 50%, it looks like to me from the graph) are created within a month after the first time they are redlinked, and 97% of the time by a different user than the one who linked it. It seems like a strong argument against removing redlinks for aesthetic reasons (e.g., in FA and GA candidates). It also makes me think adding a redlinks section (e.g., a line from Wikipedia:Recent changes article requests) to the Main Page might be a good idea.--ragesoss (talk) 00:59, 1 August 2008 (UTC)
    (I know Ral hates it when we do this on the suggestion line but I think it's interesting!) I don't think the conclusion does stand, because the correlation between redlinks and actual "missing articles/undefined concepts" may be weaker than they considered. The ratio of missing articles could be constantly falling, and the aesthetic preference for articles to contain 1.8 redlinks and stubs is only sustainable and constant because the true ratio of missing articles was higher than 1.8. If it fell below 1.8 that artifice would crumble. --JayHenry (talk) 01:15, 1 August 2008 (UTC)
    You point out one of many ways that Wikipedia is a really complex place. They acknowledge as much in the conclusion, noting several factors unrelated to network structure that could affect the growth potential of Wikipedia. Still, the conclusion that Wikipedia will not be limited by network structure passes the smell test, in my view. Of course, being based only on February 2006 data means that it missed the growth-mode transition that happened around September 2006, when the exponential phase of growth ended.--ragesoss (talk) 01:52, 1 August 2008 (UTC)
    You are talking about how the current thought is that its growth is logarithmic, correct? Wikipedia:Modelling_Wikipedia's_growth -Ravedave (talk) 04:11, 1 August 2008 (UTC)
    Speaking of newspapers (this being the Signpost page) and logs... At Purdue University, a school with huge engineering and math programs, the student paper is called The Exponent. They briefly had a rival paper which was, yep, The Log. Sadly a single clever pun (actually several puns-in-one, as it was also a log of university life, as well as printed on paper, which is derived from logs) was not enough to sustain an otherwise poorly-produced publication. --JayHenry (talk) 04:42, 1 August 2008 (UTC)
    On the topic of logarithms, the above comment made me think that the Logarhythms would be a good band name at a school like Purdue. As with most great ideas, somebody beat me to it. Okay, sorry Ral, enough with logs... if I keep it up my block log is where it will end :) --JayHenry (talk) 05:18, 1 August 2008 (UTC)
    I don't mind interesting discussion on an issue, it's mainly when people argue on a controversial subject that I get annoyed. Ral315 (talk) 20:56, 2 August 2008 (UTC)
    It's tough to say. Growth was roughly exponential until September 2006. Since that time, it's been roughly linear. The total size graph can be fit to a logistic curve pretty well, suggesting that the growth rate will gradually decline. However, it seems probable that changes in inclusion standards and deletion practices (and the effect those changes had on the size and structure of the community) were more of a factor than lack of potential topics (as the logistic model implies). Growth rate may be declining, though; since January, Wikipedia has added about 1560 new articles per day, down from 1625 in 2007 and 1822 in 2006. It's unclear whether this is a trend, or just fluctuation, as month-to-month growth rates vary considerably.--ragesoss (talk) 04:45, 1 August 2008 (UTC)
    I reran the study on a more recent data set, which included the years 2006 and 2007. The new results I obtained from this 2001-2007 data set don't appear to differ from the ones based on the study's 2001-2005 data set. --Diomidis Spinellis (talk) 20:47, 8 August 2008 (UTC)
    Thanks! Very interesting study, and the update is much appreciated. It looks like there is a monotonic decline in the incomplete/complete ratio since mid-2006, and it's now less than 1.4. That does seem to suggest a different conclusion. It might be useful to put total articles on the X-axis instead of time (which would make the recent downward trend much more apparent).--ragesoss (talk) 21:20, 8 August 2008 (UTC)
    I think it's too early to draw a conclusion from this downward trend. We've seen a more dramatic downward change from 2001 to mid 2002, and then an upward swing to 2004. The recent trend is more significant, because larger numbers of articles are involved; it might therefore be more difficult to reverse. As long as the ratio is above 1.0, growth as we know it should continue. A fall below 1.0 (which if the current trend continues could happen around 2011-2012) would indeed be worrying. --Diomidis Spinellis (talk) 14:47, 9 August 2008 (UTC)

    Yes, I think we are experiencing a slow-down in article growth. I can tell this from my experience, and I'm sure many other experienced users would concur. What is interesting is that, according to History of Wikipedia, article creation by anonymous users was disabled in 2005, but that apparently didn't affect growth at that point. Also, since the growth rate is the number of articles created minus the number deleted, I suspect the notability policy is quite possibly the culprit behind the declining growth, rather than a dearth of uncovered topics. -- Taku (talk) 05:35, 1 August 2008 (UTC)

    A while ago (probably more than a year), I looked at this and found there was one article deletion for every three articles created (i.e. a net creation of two). Dragons flight (talk) 19:28, 1 August 2008 (UTC)
    I wonder what the number is now.--ragesoss (talk) 19:31, 1 August 2008 (UTC)
    IMHO, notability & the presence of links are indirectly related. That is, the greater the possibility that an article is notable, the greater the likelihood that it will have one or more links to it. (And I hesitate a little in stating this, since some Wikipedians will misuse this observation to promote a mechanistic definition of notability.) -- llywrch (talk) 15:01, 4 August 2008 (UTC)
    Somewhat related discussion at http://lists.wikimedia.org/pipermail/wikien-l/2008-August/095097.html 86.44.19.24 (talk) 04:02, 19 August 2008 (UTC)

















    Wikipedia:Wikipedia Signpost/2008-08-11/Growth_study