The Signpost

Recent research

New influence graph visualizations; NPOV and history; 'low-hanging fruit'

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, edited jointly with the Wikimedia Research Committee and republished as the Wikimedia Research Newsletter.

Wikipedia-based graphs visualize influences between thinkers, writers and musicians

A visualization of musical genres related to psychedelic music, based on DBPedia data.

In a blog post titled "Graphing the history of philosophy",[1] Simon Raper of the company MindShare UK describes how he constructed an influence graph of all philosophers using the "Influenced by" and "Influenced" fields of Template:Infobox philosopher (example: Plato). This information was retrieved using DBpedia with a simple SPARQL query. After some cleanup, the result, consisting of triplets in the form <Philosopher A, Philosopher B, Weight> was processed using the open source graph visualization package Gephi to create an impressive overview of the philosophers within their respective spheres of influence.

Brendan Griffen extended the idea to "everyone on Wikipedia. Well, everyone with an infobox containing ‘influences’ and/or ‘influenced by’", arriving at a huge, far more dense "Graph Of Ideas" including not only philosophers, but also novelists, fantasy and science fiction writers, and comedians.[2] In another blog post,[3] Griffen added transitive links as well – so that each person is considered to be influenced both directly and indirectly. The most connected people in the graph were ancient Greek thinkers, with Thales, Pythagoras and Zeno of Elea occupying the top three spots. Griffen remarks that this vindicates a statement in Bertrand Russell's History of Western Philosophy (1945): "Western Philosophy begins With Thales".

Also inspired by Raper's posting, Tony Hirst posted a number of visualizations of the Wikipedia link and category structure (likewise using DBpedia and Gephi, queried via the Semantic Web Import plugin) to visualize related entries and influence graphs in the English Wikipedia. The blog posts (all of which include detailed step-by-step tutorials) examine the related graph of philosophers,[4] and also visualize an influence graph of programming languages[5] and one of musical genres related to psychedelic music.[6] All these visualizations and blog posts by Hirst are released under a Creative Commons Attribution license.

Hirst also mentioned a related tool called "WikiMaps", the subject of a recent article in the International Journal of Organisational Design and Engineering.[7] As described in a press release, the tool provides a "map of what is “important” on Wikipedia and the connections between different entries. The tool, which is currently in the “alpha” phase of development, displays classic musicians, bands, people born in the 1980s, and selected celebrities, including Lady Gaga, Barack Obama, and Justin Bieber. A slider control, or play button, lets you move through time to see how a particular topic or group has evolved over the last 3 or 4 years." A demo version is available online.

See also the recent coverage of a similar visualization, based on wikilinks instead of infoboxes: "The history of art mapped using Wikipedia"

Information retrieval scientists turn their attention to Wikipedia's page view logs

Found to be connected to the "#euro2012" hashtag by analyzing Wikipedia pageviews: Euro 2012 football championship

The Time-aware Information Access workshop at this year's SIGIR (Special Interest Group on Information Retrieval) conference brought a wave of attention to Wikipedia's public page-view logs. Detailing the number of page views per hour for every Wikipedia project, these files figure prominently in a variety of open-source intelligence applications presented at the workshop.

A group of researchers from ISLA, University of Amsterdam created an API providing access to this data and performing simple analysis tasks.[8] Though the site appears to be down at the time of writing, the API supports the retrieving a particular article's page-view time series as well as searching for other wikipedia articles based on the similarity of their time series. In addition to machine-readable JSON results, the API will supply simple plots in png format. While the idea of providing page specific time series is not new, support for finding other pages with similar viewing patterns highlights a fascinating new use for Wikipedia page views.

Two other papers are combining Wikipedia page-view information with external time-series data sets. On the intuition that Wikipedia page views should have a strong correlation with real-world events, researchers from the University of Glasgow and Microsoft built a system to detect which hashtags frequently queried on Bing Social Search were event-related.[9] For example, the hashtag #thingsthatannoyme doesn't clearly correspond to an event, whereas a hashtag like "#euro2012" is about the UEFA European Football Championship. After tokenizing the hashtags into a list of words, the researchers queried Wikipedia for those terms and correlated the time series of hashtag search popularity with the page-view time series for the articles which are returned. This correlation score can be used to indicate which hashtags are likely to be about events, a useful feature for web searches and any other temporally aware zeitgeist application.

In a similar vein, researchers from the University of Edinburgh and University of Glasgow used the Wikipedia page-view stream to tackle the problem known as first-story detection (FSD), which aims to automatically pick out the first publication relating to a new topic of interest.[10] While traditional techniques primarily focus on newswire or Twitter, the authors used a combination of Twitter and Wikipedia page views to construct an improved FSD system. To improve on state-of-the-art Twitter-only FSD systems, the authors aimed to filter out false positives by checking that the Twitter-based first stories corresponded to a Wikipedia page that was also experiencing heightened traffic during the same period.

Using a simple outlier detection method, the authors created a set of Wikipedia pages with unexpectedly high page views for each hour. Each Twitter-based first story (tweet) was then matched against the corresponding collection of Wikipedia outliers, employing an undisclosed metric of textual similarity that uses only the Wikipedia page titles. If the tweet failed to match any spiking Wikipedia page, it was down-weighted as a first story candidate. The authors showed that this combined approach improves FSD precision in comparison to a twitter-only baseline for all but the most popular twitter-based stories. Though this research makes advances on the difficult task of first-story detection, perhaps the most immediately useful finding is that Wikipedia page views appear to lag behind twitter activity by roughly two hours. In general, we can expect to see an increasing amount of joint models over various open-source intelligence streams as we learn exactly what each stream is useful for and the relationships between the streams.

See also the Signpost coverage of a small study of the highest hourly page views on the English Wikipedia during January-July 2010, and their likely causes: "Page view spikes"

The limits of amateur NPOV history

In "The inclusivity of Wikipedia and the drawing of expert boundaries: An examination of talk pages and reference lists"[11], information studies professor Brendan Luyt of Nanyang Technological University looks at History of the Philippines, a B-class article that had featured article status from October 2006 until it was delisted at the conclusion of its featured article review in January 2011.

Luyt argues that talk-page discussions, the types of sources cited, and the organization of the article itself, all point to a very traditional view of what constitutes history: in short, great man history concerned mainly with political and military events, and the actions of elites. This style of history does not capture the breadth of approaches used by professional historians, so does not live up to the ideal of NPOV in which all significant viewpoints published in reliable sources are represented fairly and proportionately. In practice, Luyt shows, editors (lacking sufficient knowledge of the relevant professional historical literature) end up using arguments over bias and NPOV to construct a limited and conservative historical narrative—for this article at the least, although a similar pattern could be found for many broad historical topics.

The sources cited are primarily what Luyt calls "textbookese" summaries, easily available online, which focus on bare facts without the historical debates that surround them. Between the valid sources and experts recognized by Wikipedia editors and the good-faith use of the NPOV principle to limit other viewpoints, Luyt concludes that—rather than being more inclusive of diverse views and sources than the typical "expert" community—Wikipedia in practice recognizes a considerably narrower set of viewpoints.

Three new papers about Wikipedia class assignments

An article titled "Assigning Students to edit Wikipedia: Four Case Studies"[12] presents the experiences of four professors who participated in the Wikipedia Education Program, in a total of six courses total (two of four instructors taught two classes each). The lessons from the assignments included: 1) the importance of strict deadlines, even for graduate classes; 2) having a dedicated class for acquiring skills in editing and for understanding Wikipedia policies, or spreading this over segments of several classes; 3) the benefits of having students interact with the campus ambassadors and the wider Wikipedia community.

Overall, the instructors saw that compared with their engagement in traditional assignments, students were more highly motivated, produced work of higher quality, and learned more skills (primarily, related to using Wikipedia, such as being able to better judge its reliability). Wikipedia itself benefited from several dozen created or improved articles, a number of which were featured as DYKs. The paper presents a useful addition to the emerging literature on teaching with Wikipedia, as one of the first serious and detailed discussions of specific cases of this new educational approach.

"Integrating Wikipedia Projects into IT Courses: Does Wikipedia Improve Learning Outcomes?"[13] is another paper that discusses the experiences of instructors and students involved in the recent Wikipedia:Global Education Program. Like most existing research in this area, the paper is roughly positive in its description of this new educational approach, stressing the importance of deadlines, small introductory assignments familiarizing students with Wikipedia early in the course, and the importance of close interactions with the community. A poorly justified (or explained) deletion or removal of content can be quite a stressful experience to students (and the newbie editors are unlikely to realize that an explanation may be left in an edit summary or page-deletion log). A valuable suggestion in the paper was that instructors (professors) make edits themselves, so they would be able to discuss editing Wikipedia with students with first-hand experience instead of directing students to ambassadors and how-to manuals; and to dedicate some class time to discussing Wikipedia, the assignment, and collective editing.

A four-page letter[14] in the Journal of Biological Rhythms by a team of 48 authors reported on a a similar undergraduate class project in early 2011, where 46 students edited 15 Wikipedia articles in the field of chronobiology, aiming at good article status. After their first edits, they were systematically given feedback by one "Wikipedia editor and 6 experts in chronobiology" before continuing their edits (in the paper's acknowledgements the authors also thank "innumerable Wikipedia editors who critiqued student edits"). Because of the high visibility of the results – most of the articles were ranked top in Google results – students found the experience rewarding. Topics were selected collaboratively by the class, and because students came up with a relatively small number of suggestions, one concern was that the project might, if repeated, run out of article topics in the given subject area.

A literature review presented at July's Worldcomp'12 conference in Las Vegas about "Wikipedia: How Instructors Can Use This Technology As A Tool In The Classroom"[15] also recommended to have students actively edit Wikipedia (as well as practicing to read it critically), and concluded that "it is time to embrace Wikipedia as an important information provider and one of the innovative learning tools in the educators' toolbox."

Substantive and non-substantive contributors show different motivation and expertise

"Investigating the determinants of contribution value in Wikipedia"[16] reports the results of a survey of Wikipedians who were asked their opinion about the "contribution value" of their edits (measured by agreement to statements such as "your contribution to Wikipedia is useful to others"), which was then related to various characteristics.

The researchers used Google to obtain a list of 1976 Wikipedia users’ email addresses (using keywords such as “gmail.com” or “hotmail.com”). They sent invitation emails that provided the URL to the online questionnaire. In six weeks, 234 editors completed all the questions. Of these, 205 – Nine females and 196 males – supplied a valid user name and were considered in the rest of the analysis (anonymous editors were removed).

A content analysis was performed of 50 randomly selected edits by each respondent (or all, if the user had fewer than 50 edits), classifying them as "substantive" changes (e.g. "add links, images, or delete inaccurate content") and "non-substantive changes" (e.g. "reorganizing existing content [or] correcting grammatical mistakes and formatting texts to improve the presentation"), corresponding to "two [proposed] new contributor types in Wikipedia to discriminate their editing patterns."

An attempt was made to relate this to the "contribution value" the respondents assigned to their own edits, and to their responses in two other areas:

The "breadth" of interests and resources was defined as the number of ratings above a certain threshold in each, and the "depth" as the highest rating assigned in each.

In an "important consideration for practitioners", the authors wrote that:

"[T]o produce valuable contributions, users with high depth of interests and resources should be encouraged to concentrate their efforts on substantive changes. Meanwhile, for users with high breadth of interests and resources, wiki practitioners should advise them to pay more attention to nonsubstantive changes. The findings imply that practitioners can try to identify two distinct types of users. To achieve this objective, they may develop certain algorithms in wikis to automatically detect the frequencies of substantive/non-substantive changes of users. ... For example, notification messages about wiki articles that need substantive changes can be sent to users who have high levels of depth of interests and resources. Similarly, well-prepared messages about articles that need non-substantive changes can be delivered to users who have high levels of breadth of interests and resources."

Is there systemic bias in Wikipedia's coverage of the Tiananmen protests?

Remembrance of the 20th anniversary of the June 4 events in Hong Kong (replica of the "goddess of democracy" statue)
Remembrance in the West (replica of the same statue at the University of British Columbia, Canada)

Wikipedia: Remembering in the digital age[17] is a masters dissertation by Simin Michelle Chen, examining collective memories as represented on the English Wikipedia; she looked at how significant events are portrayed (remembered) on the project, focusing on the Tiananmen Square Protests of 1989. She compared how this event was framed by the articles by New York Times and Xinhua News Agency, and in Wikipedia, where she focused on the content analysis of Talk:Tiananmen Square protests of 1989 and its archives.

Chen found that the way Wikipedia frames the event is much closer to that of The New York Times than the sources preferred by the Chinese government, which, she notes, were "not given an equal voice" (p. 152). This English Wikipedia article, she says, is of major importance to China, but is not easily influenced by Chinese people, due to language barriers, and discrimination against Chinese sources that are perceived by the English Wikipedia as unreliable – that is, more subject to censorship and other forms of government manipulation than Western sources. She notes that this leads to on-wiki conflicts between contributors with different points of views (she refers to them as "memories" through her work), and usually the contributors who support that Chinese government POV are "silenced" (p. 152). This leads her to conclude that different memories (POVs) are weighted differently on Wikipedia. While this finding is not revolutionary, her case study up to this point is a valuable contribution to the discussion of Wikipedia biases.

While Chen makes interesting points about the existence of different national biases, which impact editors' very frames of reference, and different treatment of various sources, her subsequent critique of Wikipedia's NPOV policy is likely to raise some eyebrows (pp. 48–50). She argues that NPOV is flawed because "it is based on the assumption that facts are irrefutable" (p. 154), but that those facts are based on different memories and cultural viewpoints, and thus should be treated equally, instead of some (Western) being given preference. Subsequently, she concludes that Wikipedia contributes to "the broader structures of dominance and Western hegemony in the production of knowledge" (p. 161).

While she acknowledges that official Chinese sources may be biased and censored, she does not discuss this in much detail, and instead seems to argue that the biases affecting those sources are comparable to the those affecting Western sources. In other words, she is saying that while some claim Chinese sources are biased, other claim that Western sources are biased, and because the English Wikipedia is dominated by the Western editors, their bias triumphs – whereas ideally, all sources should be acknowledged, to reduce the bias. The suggestion is that Wikipedia should reject NPOV and accept sources currently deemed as unreliable. Her argument about the English Wikipedia having a Western bias is not controversial, was discussed by the community before (although Chen does not seem to be aware of it, and does not use the term "systemic bias" in her thesis) and reducing this bias (by improving our coverage of non-Western topics) is even a goal of the Wikimedia Foundation. However, while she does not say so directly, it appears to this reviewer that her argument is: "if there are no reliable non-Western sources, we should use the unreliable ones, as this is the only way to reduce the Western bias affecting non-Western topics". Her ending comment that Wikipedia fails to leave to its potential and to deliver "postmodern approach to truth" brings to mind the community discussions about verifiability not truth (the existence of this debates she briefly acknowledges on p. 48).

Overall, Chen's discussion of biases affecting Wikipedia in general, and of Tiananmen Square Protests in particular, is useful. The thesis however suffers from two major flaws. First, the discussion of Wikipedia's policies such as reliable sources and verifiability (not truth ...) seems too short, considering that their critique forms a major part of her conclusions. Second, the argumentation and accompanying value-judgements that Wikipedia should stop discriminating against certain memories (POVs) is not convincing, lacking a proper explanation of the reasons why the Wikipedia community made those decisions favoring verifiability and reliable sources over inclusion of all viewpoints. Chen argues that Wikipedia sacrifices freedom and discriminates against some memories (contributors), which she seems to see as more of a problem that if Wikipedia was to accept unreliable sources and unverifiable claims.

"Low-hanging fruit hypothesis" explains Wikipedia's slowed growth?

A student paper titled "Wikipedia: nowhere to grow"[18] from a Stanford class about "Mining Massive Data Sets" argues for the "low-hanging fruit hypothesis" as one factor explaining the well-known observation that "since 2007, the growth of English Wikipedia has slowed, with fewer new editors joining, and fewer new articles created". The hypothesis is described as follows: "the larger [Wikipedia] becomes, and the more knowledge it contains, the more difficult it becomes for editors to make novel, lasting contributions. That is, all of the easy articles have already been created, leaving only more difficult topics to write about". The authors break this hypothesis into three smaller ones that are easier to test – that (1) there has been a slowing in edits across many languages with diverse characteristics; (2) older articles are more popular to edit; and (3) older articles are more popular to read. They find a support for all three of the smaller hypotheses, which they argue supports their main low-hanging fruit hypothesis.

While the overall study seems well-designed, the extrapolation from the three subhypotheses to the parent hypothesis seems problematic. The authors do not provide a proper operationalization of terms such as "novel", "lasting", and "easy/difficult", making it difficult to enter into a discourse without risking miscommunication. There may be at least four main issues in the work:

Overall, the paper presents four hypotheses, three of which seem to be well supported by data, and contribute to our understanding of Wikipedia, but their main claim seems rather controversial and poorly supported by their data and argumentation.

See also the coverage of a related paper in a precursor of this research report last year: "IEEE magazine summarizes research on sustainability and low-hanging fruit"

Briefly

References

  1. ^ Raper, Simon: Graphing the history of philosophy. Drunks and Lampposts, June 13, 2012
  2. ^ Brendan Griffen: The Graph Of Ideas. Griff's Graphs, July 3, 2012
  3. ^ Brendan Griffen: The Graph Of Ideas 2.0. Griff's Graphs, July 20, 2012
  4. ^ Hirst, Tony (2012). Visualising related entries in Wikipedia using Gephi. OUseful.Info, July 3, 2012
  5. ^ Hirst, Tony (2012). Mapping how Programming Languages Influenced each other According to Wikipedia. OUseful.Info, July 3, 2012
  6. ^ Hirst, Tony (2012). Mapping related Musical Genres on Wikipedia with Gephi. OUseful.Info, July 4, 2012
  7. ^ "Wikimaps: dynamic maps of knowledge" in Int. J. Organisational Design and Engineering, 2012, 2, 204–224
  8. ^ Peetz, M. H., Meij, E., & de Rijke, M. (2012). OpenGeist: Insight in the Stream of Page Views on Wikipedia. SIGIR 2012 Workshop on Time-aware Information Access (#TAIA2012). PDF Open access icon
  9. ^ Whiting, S., Alonso, O., & View, M. (2012). Hashtags as Milestones in Time. SIGIR 2012 Workshop on Time-aware Information Access (#TAIA2012). PDF Open access icon
  10. ^ Osborne, M., Petrovic, S., McCreadie, R., Macdonald, C., & Ounis, I. (2012). Bieber no more: First Story Detection using Twitter and Wikipedia. SIGIR 2012 Workshop on Time-aware Information Access (#TAIA2012). Open access icon
  11. ^ Luyt, B. (2012). The inclusivity of Wikipedia and the drawing of expert boundaries: An examination of talk pages and reference lists. Journal of the American Society for Information Science and Technology, 63(9), 1868–1878. doi:10.1002/asi.22671 Closed access icon
  12. ^ Carver, B., Davis, R., Kelley, R. T., Obar, J. A., & Davis, L. L. (2012). Assigning Students to Edit Wikipedia: four case studies. E-Learning and Digital Media, 9(3), 273–283. PDF Closed access icon
  13. ^ Patten, K., & Keane, L. (2012). Integrating Wikipedia Projects into IT Courses: Does Wikipedia Improve Learning Outcomes? AMCIS 2012 Proceedings. PDF Closed access icon
  14. ^ Chiang, C. D., Lewis, C. L., Wright, M. D. E., Agapova, S., Akers, B., Azad, T. D., Banerjee, K., et al. (2012). Learning chronobiology by improving Wikipedia. Journal of Biological Rhythms, 27(4), 333–36. HTML Closed access icon
  15. ^ Hogg, J. L. (2012). Wikipedia: How Instructors Can Use This Technology As A Tool In The Classroom. Worldcomp’12. PDF Open access icon
  16. ^ Zhao, S. J., Zhang, K.Z.K., Wagner, C., & Chen, H. (2012). Investigating the determinants of contribution value in Wikipedia. International Journal of Information Management. doi:10.1016/j.ijinfomgt.2012.07.006 Closed access icon
  17. ^ Chen, Simin Michelle (2012): Wikipedia: Remembering in the digital age. University of Minnesota MA thesis. June 2012. PDF Open access icon
  18. ^ Austin Gibbons, David Vetrano, Susan Biancani (2012). Wikipedia: Nowhere to grow Open access icon
  19. ^ Rienco Muilwijk: Trust in online information A comparison among high school students, college students and PhD students with regard to trust in Wikipedia. University of Twente, February 2012 PDF Open access icon
  20. ^ von Muhlen, M., & Ohno-Machado, L. (2012). Reviewing social media use by clinicians. Journal of the American Medical Informatics Association : JAMIA, 19(5), 777–81. doi:10.1136/amiajnl-2012-000990 Open access icon
  21. ^ Ferschke, O., Gurevych, I., & Rittberger, M. (2012). FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia. Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN) Workshop (PAN @CLEF 2012), Rome. PDF Open access icon
  22. ^ Suzuki, Y., & Yoshikawa, M. (2012), QualityRank: assessing quality of wikipedia articles by mutually evaluating editors and texts. 23rd ACM Conference on Hypertext and Social Media (HT 2012). DOI Open access icon

















Wikipedia:Wikipedia Signpost/2012-08-27/Recent_research