Volume 5, Issue 25 | 22 June 2009
This study has a narrow focus: to determine the distribution of the length of time that vandalism remains on the English-language Wikipedia. This distribution is also known as the survival function for vandalism. The two primary results from this study are: (a) the median time to correction is down to four minutes, and (b) some subtle forms of vandalism still persist for months and even years.
In the past there have been other statistical studies, both formal and informal, of how long vandalism remains in Wikipedia until it is corrected, but almost all of them express their results as a mean time to correction (i.e., as a simple arithmetic average of the observed times). I will show in this study that the distribution function for time to correction has such a fat tail that the mean time to correction is both mathematically and substantively meaningless. The median time to correction, on the other hand, conveys useful information.
A random sample of 100 articles from the English-language edition of Wikipedia was obtained through the use of the random article link in the navigation toolbar. For each article, the history log was used to examine each recorded change, starting from the most recent and going back until a clear instance of vandalism was found. Then the changes were scanned in the opposite direction, going forward in time until the vandalism was corrected.
For each such instance of vandalism, the elapsed time until correction was computed, in minutes. These are the fundamental data on which this report is based.
In addition, some notes were taken on the general nature of the vandalism. All data collection occurred on 2009-06-11.
A histogram of times to correction is shown in the chart to the right. Note that the horizontal axis is depicted on a logarithmic scale, to accommodate its enormously long right-hand tail.
In this histogram there are evidently two separate processes at work. The bulk of the histogram follows a curve that declines as a power function of elapsed time: this is the process by which ordinary readers and editors of Wikipedia stumble across and correct instances of vandalism.
The first two bars on the left, however, are significantly higher than the curve would suggest. The difference between the actual height of the bars and the height predicted by the curve is accounted for by the independent activity of Wikipedia's Recent Change Patrol (RCP). Members of the RCP typically monitor the Recent Change Log for suspicious edits. The RCP is able to correct most blatant vandalism within seconds of occurrence.
Both of these vandalism-correction processes act in concert to produce a remarkable result: the median time to correction for vandalism in this study was found to be just four minutes. Similar (unpublished) studies performed by this author one and two years ago yielded median times to correction of five and six minutes, respectively. It seems apparent that Wikipedia is improving its already impressive rate of vandalism detection and correction.
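The fact that the estimated curve for the survival function is exponential on a graph whose horizontal axis is logarithmic indicates that the probability density function itself follows a power law distribution, also known as a Pareto distribution, given by the formula $f(t) = \alpha\, t_{\min}^{\alpha} / t^{\alpha+1}$ for $t \ge t_{\min}$, where $\alpha > 0$ is the parameter that governs the weight of the tail and $t_{\min}$ is the shortest time covered by the distribution.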
If the parameter in the above formula is less than one — as it is in this case — then the mean of the distribution is infinite. The practical significance of this unusual situation is that any sample mean calculated from empirical data conveys absolutely no information whatsoever about the typical length of time that it takes for an instance of vandalism to be corrected.
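The divergence of the mean can be checked directly. The sketch below assumes the standard Pareto density $f(t) = \alpha\, t_{\min}^{\alpha} / t^{\alpha+1}$ on $t \ge t_{\min}$; the symbols $\alpha$ (the parameter) and $t_{\min}$ (the shortest time covered) are notation chosen here for illustration.

```latex
\begin{aligned}
\mathbb{E}[T] &= \int_{t_{\min}}^{\infty} t\, f(t)\, dt
  = \alpha\, t_{\min}^{\alpha} \int_{t_{\min}}^{\infty} t^{-\alpha}\, dt
  = \begin{cases}
      \dfrac{\alpha\, t_{\min}}{\alpha - 1}, & \alpha > 1, \\[2ex]
      \infty, & \alpha \le 1,
    \end{cases} \\[1ex]
S(t) &= \Pr(T > t) = \left( \frac{t_{\min}}{t} \right)^{\alpha}
  \quad\Longrightarrow\quad
  \operatorname{median}(T) = t_{\min}\, 2^{1/\alpha} < \infty
  \ \text{ for every } \alpha > 0 .
\end{aligned}
```

The integral for the mean fails to converge whenever $\alpha \le 1$, while the median stays finite for any positive $\alpha$.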
The only useful alternative to a sample mean in this situation is the sample median, which is fully robust with respect to long-tailed distributions.
Depending upon what assumptions are made concerning the rate of activity of the RCP, the parameter for the Pareto distribution lies in a range between about 0.25 and 0.40. This range is comfortably below one, indicating that the tail of the distribution is huge and that sample means are completely and utterly useless for describing the data.
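A small simulation can illustrate how a sample mean behaves when the parameter is below one. This is only a sketch: the value 0.3 is picked from the middle of the range quoted above, the minimum value is fixed at 1 (which is what Python's `random.paretovariate` uses), and the function names, sample sizes, and seed are arbitrary choices made here.

```python
import random
import statistics


def pareto_sample(alpha, n, rng):
    """Draw n Pareto variates with shape alpha and minimum value 1."""
    return [rng.paretovariate(alpha) for _ in range(n)]


def summarize(alpha=0.3, sizes=(100, 10_000, 100_000), seed=0):
    """Return (n, sample median, sample mean) for each sample size."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    rows = []
    for n in sizes:
        data = pareto_sample(alpha, n, rng)
        rows.append((n, statistics.median(data), statistics.fmean(data)))
    return rows


if __name__ == "__main__":
    for n, med, mean in summarize():
        print(f"n={n:>7}  median={med:10.2f}  mean={mean:16.1f}")
```

With these settings the sample median should settle near the theoretical value `2 ** (1 / 0.3)`, roughly 10, whereas the sample mean is dominated by the few largest draws and does not settle as the sample grows.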
About 84% of the vandalism that I observed in this random sample seemed to be just adolescent fooling around. Of the 16% that appeared more adult, half seemed to be adult humor or anger, and half seemed to come from people whose intent was to leave a permanent but nearly invisible mark upon Wikipedia. For example, the perpetrator will carefully change the spelling of an obscure name to an incorrect form, or change a location to something that still looks plausible at first glance. I imagine them coming back over and over again to the page that they altered, to see if that subtle little change is still there. Perhaps this impulse is roughly the same as the one which causes people to carve their initials into trees, or to scratch them on rocks.
The fact that 50% of all vandalism is being detected and reverted within an estimated four minutes of appearance should go a long way to allay fears about the susceptibility of English-language Wikipedia articles to malicious vandalism. On the other hand, the fact that an estimated 10% of all vandalism endures for months and even years indicates that some new tools and strategies are needed for rooting out the most subtle and persistent forms of vandalism.
The elapsed times (in minutes) to correction for the instances of vandalism found in this study were as follows: { 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4, 5, 8, 9, 19, 73, 213, 490, 672, 2442, 14176, 152996 }. In addition, two cases of vandalism had never been corrected (until discovered by the author).
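The listed times make it easy to compare the median with the mean. The sketch below uses only the 23 corrected cases given above; the helper names are hypothetical, and ranking the two uncorrected cases as longer than every observed time is an assumption made here, not something the study states.

```python
import math
import statistics

# Elapsed minutes to correction, copied from the list above (23 corrected cases).
TIMES_MIN = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4, 5, 8, 9, 19, 73,
             213, 490, 672, 2442, 14176, 152996]


def summary(times):
    """Basic statistics for the corrected cases."""
    return {
        "n": len(times),
        "median": statistics.median(times),
        "mean": statistics.fmean(times),
        # Fraction of the total elapsed time contributed by the single longest case.
        "max_share": max(times) / sum(times),
    }


def median_with_uncorrected(times, uncorrected):
    """Median when never-corrected cases are ranked above every observed time."""
    return statistics.median(list(times) + [math.inf] * uncorrected)


if __name__ == "__main__":
    print(summary(TIMES_MIN))
    print(median_with_uncorrected(TIMES_MIN, 2))
```

The median of the corrected cases is 4 minutes, matching the figure reported above, while the mean is about 7,440 minutes (a little over five days), and nearly 90% of that comes from the single longest case. Under the stated assumption about the two uncorrected cases, the median moves only from 4 to 5 minutes.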
A large new edition of Wikizine is out: "Year: 2009 Week: 29 Number: 108". It includes news about the LiquidThreads extension, various Wikimedia Foundation announcements and goings-on, privacy issues with traffic analysis services that were installed on two Wikipedias, a Wikimedia Canada meeting and Wiki-Conference New York, and more.
According to an article in MIT's Technology Review, "Wikipedia Gets Ready for a Video Upgrade", Wikipedia will see dramatic improvements in video capabilities rolled out within the next few months.
On the Commons-l mailing list, Casey Brown described the article:
They just put together all of the mini-updates about Michael Dale/Kaltura's
work that we've been getting for months now.
- http://wikimediafoundation.org/wiki/Collaborative_Video
- http://blog.wikimedia.org/2008/07/23/kaltura-sponsors-michael-dale-open-source-video-developer/
- http://metavid.org/blog/2009/03/27/add-media-wizard-and-firefogg-on-test-wikimediaorg/
- http://techblog.wikimedia.org/author/mdale/
- public svn
- lots of wikitech-l/commons-l/foundation-l notes
The article just put all the snippets together into a solid update for people outside our community. :-)
According to New Zealand's stuff.co.nz, overseas investors and doctors have been shying away from Palmerston North because its Wikipedia article described it as being a particularly crime-prone area, with a particular emphasis on gang violence. MidCentral Health consultant Christine Wood described how doctors from Israel and Germany declined to work in Palmerston North after reading the Wikipedia entry. The inclusion of the crime section was criticized because of the lack of such a section in other New Zealand entries, such as Auckland and Hamilton. Palmerston North's city council responded by "toning down" the section.
The following is a brief overview of discussions taking place on the English Wikipedia and other Wikimedia projects.
Note: Starting with this issue, a notice will be placed next to items which have been added since the last issue, for easier locating of discussions which you may not have known about.
New! Request for comment: Self electing groups: Should "unofficial" electable groups of Wikipedians be allowed?
New! Request for comment: Full-date unlinking bot: Should a bot be allowed to unlink dates under this proposal? Specifically, unlinking only full dates with day, month, and year information, and not editing the same page twice to do so in case the edit is reverted? So far the community seems supportive of this proposal.
This is a list of current bot requests for approval, with brief descriptions of the proposed tasks. See this week's technology report for information on recently-approved bots.
New! AnomieBOT 31: To move {{translated page}} from articles to talk pages.
New! DrilBot 3: To tag image files where the image license migration would be redundant.
New! MondalorBot: To clean up interwikis and rename categories.
The following requests for adminship are currently open (numbers indicate support/oppose/neutral voting, and are updated every half hour):
New! Cool3 4: Final (55/7/1); closed by Rlevse at 17:57, 27 June 2009 (UTC).
New! Jarry1250: Final (77/2/1); closed by EVula at 16:33, 24 June 2009 (UTC).
New! Patar knight: Final (52/7/2); closed by Kingturtle at 3:11, 28 June 2009 (UTC).
New! Plastikspork: Final (52/7/6); closed by bibliomaniac15 at 22:39, 25 June 2009 (UTC).
New! Timmeh 2: Final (55/37/10); withdrawn by candidate.
New! Wtmitchell: Final (65/1/4); closed by Rlevse at 12:13, 26 June 2009 (UTC).
Two editors were granted admin status via the Requests for Adminship process this week: Ched Davis (nom) and Mazca (nom).
This section is now included in the Technology Report, and contains an expanded description of the bots that have been approved. This week's article.
Eighteen articles were promoted to featured status this week: Moltke class battlecruiser (nom), Ten Commandments in Roman Catholicism (nom), Hastings Ismay, 1st Baron Ismay (nom), Ice hockey at the Olympic Games (nom), Jarome Iginla (nom), Yamato class battleship (nom), Magnetosphere of Jupiter (nom), Albert Bridge, London (nom), BP Pedestrian Bridge (nom), Abu Nidal (nom), Brazilian battleship Minas Geraes (nom), Fantasy Black Channel (nom), Otto Becher (nom), Bill Ponsford (nom), Early life of Keith Miller (nom), On the Origin of Species (nom), Battle of the Coral Sea (nom) and John Douglas (architect) (nom).
Seven lists were promoted to featured status this week: List of members of the International Ice Hockey Federation (nom), The Simpsons (season 14) (nom), List of Mexican National Trios Champions (nom), Rawlings Gold Glove Award (nom), List of Philippine–American War Medal of Honor recipients (nom), Commandant of the Marine Corps (nom) and List of United States Military Academy alumni (engineers) (nom).
One topic was promoted to featured status this week: Towns in Trafford (nom).
One portal was promoted to featured status this week: Portal:Connecticut (nom).
The following featured articles were displayed on the Main Page this week as Today's featured article: Richmond Bridge, Euclidean algorithm, Akutan Zero, In Utero, Iridium, Emily Dickinson.
No articles were delisted this week.
Two lists were delisted this week: List of mergers and acquisitions by Expedia (nom) and List of mergers and acquisitions by Dell (nom).
One topic was delisted this week: Numbered highways in Amenia (CDP), New York (nom).
The following featured pictures were displayed on the Main Page this week as picture of the day: Seven Rila Lakes, Gerald Ford, Map by Pedro Reinel, Arborist, Common Grass Blue, Lunar Lander Challenge and Leucospermum.
No featured sounds were promoted this week.
One featured picture was demoted this week: Cathédrale de Nantes (nom).
Twelve pictures were promoted to featured status this week and are shown below.
This is a summary of recent technology and site configuration changes that affect the English Wikipedia. Please note that some bug fixes or new features described below have not yet gone live as of press time; the English Wikipedia is currently running version 1.44.0-wmf.4 (a8dd895), and changes to the software with a version number higher than that will not yet be active. Configuration changes and changes to interface messages, however, become active immediately.
4 bots or bot tasks were approved for operation this week. These were:
This week's discussion report contains information on current bot requests and related discussions.
list=usercontribs, as new edits. (r52096, bug:19271)
User and excludeuser have been added to the API for list=recentchanges and list=watchlist. (r52152, bug:14200)
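As a rough illustration of the user filter on list=recentchanges, the sketch below builds a query URL without sending it. It assumes the parameter spellings used by the present-day MediaWiki API, where options for this list carry an `rc` prefix (`rcuser`, `rcexcludeuser`, `rclimit`); the function name and the user name "Example" are placeholders invented here.

```python
from urllib.parse import urlencode

# Public API endpoint of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def recentchanges_url(user=None, excludeuser=None, limit=10):
    """Build (but do not send) a recent-changes query filtered by user."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        "rclimit": limit,
    }
    if user is not None:
        params["rcuser"] = user  # only changes made by this user
    if excludeuser is not None:
        params["rcexcludeuser"] = excludeuser  # hide changes by this user
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(recentchanges_url(user="Example"))
    print(recentchanges_url(excludeuser="Example"))
```

The same pattern would apply to list=watchlist, where the corresponding options are prefixed with `wl` instead of `rc`.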
The Arbitration Committee this week announced that there will be another Checkuser and Oversight Election in August, and outlined a schedule for the election.
The Arbitration Committee opened no cases and closed one this week, leaving four open.