Submissions/The State of Wikimedia Scholarship: 2009-2010/Notes

From Wikimania 2010 • Gdańsk, Poland • July 9-11, 2010


Citations to the papers being discussed are bold and included in the text below.


  • Help give Wikimedians a window into the academic world studying Wikipedia by:
    • Giving people more details on papers they have heard about.
    • Introducing people to new papers or perspectives that they've not heard.
  • Start building bridges between academics and their specimens (that's us!).
  • Give Wikimedians ideas and inspiration for how they can benefit from those who are benefiting from their example.

Scope Conditions

One of the presenters suggested an earlier version of this project for Wikimania 2009 after doing a similar project for the Debian Project. That review was comprehensive and covered all academic articles on Debian. Foolishly, the idea was to do something similar for Wikipedia and Wikimedia projects. However, with hundreds of articles published in peer-reviewed journals, covering even the last year is prohibitively difficult.

As a result, we are not presenting everything (or even coming close!). There are many more papers than we could possibly review, read, or present in limited time. The result is a highly curated selection of only 10 papers (or groups of papers). Here is a description of how we've attempted to limit things:

  • Only work published between Wikimania 2009 and Wikimania 2010
  • Only Wikimedia/Wikipedia related works (i.e., not work on wikis in general)
  • Only papers that are strongly related to Wikimedia (i.e., not just one of several data sources used to test a theory)
  • Only articles written in English

This left us with about 400 articles, so we narrowed things down further using the following criteria:

  • A selection of articles with a bias toward being broad and representative:
    • e.g., if we have taken one paper from a particular workshop, we will try to not take others, even if they are wonderful
    • e.g., we've selected papers that aim to represent the wide variety of fields currently constituting Wikimedia/Wikipedia scholarship
  • Papers that are likely to be relevant, interesting, or useful to Wikimedia Community members
  • No papers done by people or research groups attending Wikimania 2010
    • As a result, authors can't criticize us for getting their paper wrong -- at least not right away. Hopefully, the threat of us reviewing others' papers creates an incentive for scholars to show up at Wikimania.

After that, we still had too many papers. As a result, we took a random selection from the piles of papers that remained. There are wonderful papers we've left out for no reason other than that we didn't have room. We apologize to all the authors whose great work was omitted.


How does Wikipedia evolve?

[F] Suh, Bongwon, Gregorio Convertino, Ed H. Chi, and Peter Pirolli. 2009. “The singularity is not near: Slowing growth of Wikipedia.” Pp. 1-10 in Proceedings of the 5th International Symposium on Wikis and Open Collaboration. Orlando, Florida: ACM doi:10.1145/1641309.1641322.

  • Wikipedia's growth (both in content and in number of editors) was exponential up to 2007.
  • Current results indicate that this growth has perhaps entered a new steady-state phase.
  • The number of very active editors (more than 100 and more than 1,000 edits) appears to have stabilized, and editors show some resistance to passing the 100-edit threshold.
  • The number of reverts of edits by less active editors has increased, but not for edits performed by very active editors (more than 100 edits).
  • The authors propose a Lotka-Volterra population growth model bounded by a limit K(t) that grows as a function of time.
  • Constraints affecting the slope of this K(t) could be: number of available volunteers, level of public outreach, editing tools and usability, etc.
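A minimal sketch of the kind of bounded growth model described above, assuming a logistic form dN/dt = r * N * (1 - N / K(t)) with a linearly growing limit K(t) = k0 + k_slope * t. These notes do not record the paper's exact equations or fitted values, so the function name `simulate` and every parameter below are hypothetical and chosen only for illustration.

```python
def simulate(n0, r, k0, k_slope, years, dt=0.01):
    """Euler integration of dN/dt = r * N * (1 - N / K(t)).

    n0      -- initial population (e.g. number of editors); hypothetical
    r       -- intrinsic growth rate per year; hypothetical
    k0      -- limit K at t = 0; hypothetical
    k_slope -- yearly growth of the limit K(t); hypothetical
    Returns a list of (t, N) pairs.
    """
    n = float(n0)
    t = 0.0
    trajectory = [(t, n)]
    steps = int(round(years / dt))
    for _ in range(steps):
        k = k0 + k_slope * t               # time-varying limit K(t)
        n += dt * r * n * (1.0 - n / k)    # logistic growth step
        t += dt
        trajectory.append((t, n))
    return trajectory
```

With a growing K(t), the simulated population first grows almost exponentially and then tracks the limit from below, so long-run growth is set by the slope of K(t) rather than by r. That is why the constraints listed above matter.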

Who contributes to Wikipedia?

Anonymous versus registered users

[M] Anthony, Denise, Sean W. Smith, and Timothy Williamson. 2009. “Reputation and Reliability in Collective Goods.” Rationality and Society 21:283-306. doi:10.1177/1043463109336804

This paper uses data on 7,058 random users from Dutch and French Wikipedia to ask how the editing patterns of anonymous and registered users differ. The authors are particularly interested in the "reliability" of users' work (measured as the number of characters retained from a user's edit in subsequent edits).
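One way to approximate a retention-based reliability measure like the one described above is sketched here. It is not the authors' procedure, which these notes do not describe. It assumes a character-level diff from Python's `difflib`, and it reports the share of added characters that survive rather than a raw count. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def inserted_spans(before, after):
    """Index ranges in `after` that the edit added or rewrote."""
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    return [(j1, j2)
            for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
            if tag in ("insert", "replace")]


def retained_chars(before, after, later):
    """Count characters added by the edit that still appear in `later`."""
    spans = inserted_spans(before, after)
    matcher = SequenceMatcher(None, after, later, autojunk=False)
    kept = 0
    for start, _b, size in matcher.get_matching_blocks():
        for s, e in spans:
            # Overlap between a surviving block and an inserted span.
            kept += max(0, min(e, start + size) - max(s, start))
    return kept


def reliability(before, after, later):
    """Fraction of the edit's added characters retained in `later`."""
    added = sum(e - s for s, e in inserted_spans(before, after))
    return retained_chars(before, after, later) / added if added else 0.0
```

Here `before` and `after` are the page text around one user's edit, and `later` is a subsequent revision. An edit that is fully reverted scores 0.0, and one whose additions all survive scores 1.0.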

They offer the following hypotheses:

  • Hypothesis 1: Registered users will make more contributions than non-registered users. (Supported)
  • Hypothesis 2: Registered users with many contributions will have higher reliability than both (a) registered users with fewer contributions, and (b) non-registered (i.e., anonymous) users. (Not supported)
  • Hypothesis 3: Above some threshold of contributions, reliability will decline for registered users. (Not supported)
  • Hypothesis 4: Anonymous users will contribute less content per edit than registered users. (Supported)
  • Hypothesis 5: Most anonymous contributors will contribute one time only. (Supported)
  • Hypothesis 6: Anonymous one-time contributors will have higher reliability than anonymous contributors with more than one contribution. (Supported)
  • Hypothesis 7: Reliability will decrease with number of contributions for anonymous users. (Supported)

Their basic results are that:

  • Edits are quite reliable in general!
  • Anonymous users are (in general) more reliable than non-anonymous users.
  • As users edit more, this relationship flips around.

What motivates contributions to Wikipedia?

[M] Antin, Judd, and Coye Cheshire. 2010. “Readers are not free-riders: reading as a form of participation on Wikipedia.” Pp. 127-130 in Proceedings of the 2010 ACM conference on Computer supported cooperative work. Savannah, Georgia, USA: ACM doi: 10.1007/BF01420590

Free riding is a concept from public goods economics: it describes people who benefit from a good without contributing to it, which for traditional economic goods can cause a tragedy of the commons. This short paper builds on that concept and argues that:

  1. Many readers aren't free-riding because they don't realize they can contribute
  2. Reading Wikipedia is a form of contribution.
  3. Reading Wikipedia is a form of legitimate peripheral participation through which users move toward contribution.

The paper describes a survey given to a random selection of university students about editing on Wikipedia.

70% of the respondents read Wikipedia multiple times per week, and 16% had ever edited. The authors also asked questions about people's knowledge of Wikipedia. For example:

  • 57% knew that one could edit Wikipedia
  • 18% knew that you don't need an account to edit Wikipedia
  • 4% knew details of the administrator creation procedure(!)

They also found that the more respondents knew, the more they edited -- although there may be causal issues here. There is evidence on the second point in the next paper.

[M] Zhang, Xiaoquan (Michael), and Feng Zhu. “Group Size and Incentives to Contribute: A Natural Experiment at Chinese Wikipedia.” American Economic Review. open access copy

In the economics literature, the relationship between group size and motivation in the construction of public goods is a very general and very important issue. The question is how the size of a group affects an individual member's desire to contribute. If you benefit more people, are you more motivated?

Wikipedia has been blocked in China a number of times. The authors argue that one of the blocks was particularly long and unexpected, and that most users at the time had no reason to suspect it would be lifted any time soon. Of course, users inside China were blocked, while other Chinese-speaking users outside were not. This constituted a natural experiment.

  • Worked with sysops from Chinese Wikipedia (User:Shizhao and User:Mountain).
  • They looked at the "third" block of Wikipedia, which started in October 2005 and lasted for more than a year.
  • Identified blocked and non-blocked users through editing patterns.
  • 46% decrease in the contributions of non-blocked users.

How do Wikipedia users interact?

[J] Choi, Boreum, Kira Alexander, Robert E. Kraut, and John M. Levine. 2010. “Socialization tactics in Wikipedia and their effects.” Pp. 107-116 in Proceedings of the 2010 ACM conference on Computer Supported Cooperative Work. Savannah, Georgia, USA: ACM doi:10.1145/1718918.1718940

This paper studies how newcomers get socialized online, particularly at en:Wikipedia. It describes two types of socialization: institutional and individual.

Institutional socialization includes

  • formal training
  • coherent sequences
  • fixed timetables
  • mentoring
  • building on existing skills

Individualistic socialization

  • informal
  • random
  • variable
  • disconnected
  • involves skill change.

Research shows that institutional socialization has numerous positive effects on newcomers, including a better understanding of their roles, feeling more accepted by the organization, and greater satisfaction and commitment. They're also more likely to stay, and to do better work. Recruiting processes and practices can also serve as socialisation, and affect retention. These include informal or formal discussions, observing newcomers, and recruiting from existing members' acquaintances.

With this theory in mind, the authors look into online socialization. Briefly, they note that institutional socialization is possible online; some online groups have formal training for new users, using cohorts, mentoring, clear sequences and incentives. Then they look at existing socialisation mechanisms on the English Wikipedia, and identify 7 tactics:

  1. invitations to join the project
  2. welcome messages
  3. requests to work on a particular task
  4. offers of assistance
  5. positive feedback on work
  6. constructive criticism of work
  7. personal comments

Using some sample WikiProjects, they reviewed the amount of communication with projects from 1 month before to 1 month after someone joins a WikiProject, and provide 3 models. They conclude that newcomers respond to constructive criticism with increased contribution, but that invitations may only introduce a temporary boost in motivation.

The paper begins to show differences between online and offline socialisation (on Wikipedia, personalised messages seem to work better; offline, standardized messages seem better). It offers suggestions especially for Welcoming Committees: personalise your messages, thank newcomers for particular contributions, suggest improvements, and offer assistance.

[F] Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. “A jury of your peers: quality, experience and ownership in Wikipedia.” Pp. 1-10 in Proceedings of the 5th International Symposium on Wikis and Open Collaboration. Orlando, Florida: ACM doi:10.1145/1641309.164133

  • Questions several natural assumptions regarding how editors acquire expertise in Wikipedia.
  • Uses an automated measure, "word persistence", for evaluating the quality of contributions.
  • Quantitative evidence that refutes the hypothesis: "contributions from editors with more experience are more accepted".
  • Probability of being reverted increases as the number of different authors "watching" the changes grows: "Stepping on toes".


[F] Ekstrand, Michael D., and John T. Riedl. 2009. “rv you're dumb: identifying discarded work in Wiki article history.” Pp. 1-10 in Proceedings of the 5th International Symposium on Wikis and Open Collaboration. Orlando, Florida: ACM doi:10.1145/1641309.1641317.

  • Traditional linear representations of revision history in wikis hide interesting editorial patterns.
  • Developed a visual interface to represent these hidden patterns, inspired by gitk
  • 3 design goals:
    • Clear distinction between accepted and rejected revisions
    • Clearly show reverts.
    • Indicate edits performed by same editor.
  • Integrated interface, optimal colouring, user interaction.
  • Visualize editing branches/merges.
  • Tested advanced heuristics for revert detection: cosine similarity (limited deep) and adoption coefficient (time consuming). Found no improvement over "traditional" revert detection.
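The contrast in the last point can be sketched as follows. This assumes that "traditional" revert detection means finding a revision whose text is identical to an earlier, non-adjacent revision, and that the cosine heuristic compares bag-of-words vectors within a limited window of earlier revisions. The notes define neither, so the function names, the `threshold` and `depth` parameters, and the tokenization are all illustrative assumptions, not the paper's implementation.

```python
import hashlib
import math
import re
from collections import Counter


def identity_reverts(revisions):
    """Flag (index, restored_index) where a revision repeats earlier text exactly."""
    last_seen = {}
    reverts = []
    for idx, text in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        earlier = last_seen.get(digest)
        if earlier is not None and earlier < idx - 1:
            reverts.append((idx, earlier))
        last_seen[digest] = idx
    return reverts


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values()) *
                     sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def near_reverts(revisions, threshold=0.9, depth=5):
    """Flag revisions that resemble an earlier one more than their predecessor."""
    reverts = []
    for idx in range(2, len(revisions)):
        to_previous = cosine_similarity(revisions[idx], revisions[idx - 1])
        for earlier in range(idx - 2, max(idx - depth, 0) - 1, -1):
            sim = cosine_similarity(revisions[idx], revisions[earlier])
            if sim >= threshold and sim > to_previous:
                reverts.append((idx, earlier))
                break
    return reverts
```

The identity check is cheap (one hash per revision), while the similarity check compares each revision with up to `depth` predecessors and needs a tuned threshold.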

[F] Fong, Peter Kin-Fong, and Robert P. Biuk-Aghai. 2010. “What did they do?: deriving high-level edit histories in Wikis.” Pp. 1-10 in Proceedings of the 6th International Symposium on Wikis and Open Collaboration. Gdansk, Poland: ACM doi:10.1145/1832772.1832775.

  • Proof of concept of enhanced tool for tracking and visualizing changes between 2 wiki revisions.
  • 3 steps:
    • Lexical analyzer
    • Text differencing engine
    • Action categorizer
  • One more step (history summarizer) under development
  • Rich action categorizer for Wikipedia: spell-checking, inter-lang links, wikify, categorizing, add references...
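The three steps above can be sketched as a small pipeline. This is a toy illustration, not the authors' tool: the token pattern, the use of Python's `difflib` as the differencing engine, the category rules, and all function names are assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Step 1: lexical analyzer. Wiki links and <ref> tags are kept as single tokens.
TOKEN = re.compile(r"\[\[[^\]]*\]\]|<ref[^>]*>.*?</ref>|\w+|\S")


def tokenize(wikitext):
    return TOKEN.findall(wikitext)


# Step 2: text differencing engine, working on tokens rather than characters.
def added_tokens(old, new):
    old_tokens, new_tokens = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return added


# Step 3: action categorizer, using hypothetical rules for a few action types.
def categorize(token):
    if token.startswith("[[Category:"):
        return "categorizing"
    if re.match(r"\[\[[a-z]{2,3}:", token):
        return "inter-lang link"
    if token.startswith("<ref"):
        return "add references"
    if token.startswith("[["):
        return "wikify"
    return "text change"


def describe_edit(old, new):
    """Return the sorted set of high-level actions found between two revisions."""
    return sorted({categorize(token) for token in added_tokens(old, new)})
```

Tokenizing before diffing lets the categorizer see a whole link or reference as one unit, so a rule can classify it without re-parsing the surrounding text.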

Wikipedia and the outside world

How do other people use and cite Wikipedia content, and what problems and issues are raised by Wikipedia's growing influence?

Commercialization of Wikipedia Content: Distorting and Subverting Free Knowledge?

[J] Langlois, Ganaele, and Greg Elmer. 2009. “Wikipedia leeches? The promotion of traffic through a collaborative web format.” New Media & Society 773-794.

"At age 78, I thought I was beyond surprise or hurt at anything negative said about me. I was wrong. One sentence in the biography was true. I was Robert Kennedy’s administrative assistant in the early 1960s. I also was his pallbearer. It was mind-boggling when my son, John Seigenthaler, journalist with NBC News, phoned later to say he found the same scurrilous text on and" (Seigenthaler, 2005)

Mirror websites commercialise Wikipedia content by:

  • simply adding ads to content
  • adding non-Wikipedia content, also serving ads
  • using content to improve search engine rankings:
    • spamdexing "webpages filled with Wikipedia content are created for the purpose of attracting traffic from search engines, but users are actually redirected automatically to another page full of advertisement." so "the content literally disappears to make way for sponsored advertising"
    • mirrors with query words in the URL
    • metatag stuffing

Ironically, this "freezes content and renders it static". Commercial entities can appropriate free content more easily with dynamic production techniques. Authorship is delegated to machine processes, shifting from the use of tags/categories to the use of whole sentences.

  • People use (and trust!) search engines for finding things
  • Manipulation of ranking results
  • Search engines are commercial entities, based on advertising, causing some conflicts of interest and paradoxes, for instance making money off of spamdexing (which undermines them)

Search engines are undermined by spamdexing while also making money from it, so "search engines cannot be considered as filters or mediators of content, but need to be acknowledged as commercial actors playing an important role in the informational politics of the web."

Citations to Wikipedia

[J] Stoddard, Morgan Michelle. 2009. “Judicial Citation to Wikipedia in Published Federal Court.” Masters in Library Science, Chapel Hill, NC: UNC .

This Master's thesis discusses the use of Wikipedia by judges, in U.S. court decisions, despite various problems with using Wikipedia for citation. Citations are extremely important in law, in establishing what is appropriate to cite: citations are used again, in later court decisions. "When a judge cites a particular source, she is indicating—intentionally or unintentionally—that the source is persuasive, legitimate authority"

Legal researchers have discussed the advantages of Wikipedia as well as its disadvantages, but regardless, citations are increasing. This study examines the types of citations being used, and concludes that a third of citations are about issues central to the case.

In one landmark case -- Badasa v. Mukasey (2008) -- "an immigration judge denied Badasa asylum because the documents she submitted did not prove her identity", based in part on evidence from the U.S. Department of Homeland Security drawn from the Wikipedia article on laissez-passer. The decision was sustained on appeal: even though Wikipedia was not considered an appropriate source, the decision was judged to be supported by other evidence.

In another case, Cobell v. Kempthorne (2008), the Wikipedia article was cited in the decision itself, as a place to get "Cliff Notes" on the ongoing debate, with the disclaimer that “the Court, of course, cannot vouch for its accuracy”.

Document Summarization

[M] Ramanathan, Krishnan, Yogesh Sankarasubramaniam, Nidhi Mathur, and Ajay Gupta. 2009. “Document Summarization using Wikipedia.” Pp. 254-260 in Proceedings of the First International Conference on Intelligent Human Computer Interaction. doi:10.1007/978-81-8489-203-1_25.

A large part of the developing world will first access the internet on mobile phones, but mobile phones don't have a lot of screen space.

A new document summarization approach is proposed on the basis of sentence extraction. The system runs as a proxy between the site and the mobile device, summarizing data en-route, and only sending the summary to the mobile phone.

  • First, sentences are mapped to Wikipedia concepts.
    • Wikipedia is indexed using Lucene. The sentence is then entered as a Lucene query. Finally, the titles of the Wikipedia pages (hits) are extracted from the Lucene results.
  • This process is repeated for each sentence. The number of times each wiki-page is hit is accumulated into a data structure (C++ multimap)
  • Those sentences that refer to the wiki-pages with the largest number of hits are selected to be output as part of a summary.
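The sentence-selection steps above can be sketched as follows. The paper uses a Lucene index over Wikipedia. This sketch swaps in a toy in-memory index (a dictionary from page title to a set of words) with a simple word-overlap score, and a `Counter` in place of the C++ multimap. The function names and the `k` and `n_pages` parameters are hypothetical.

```python
import re
from collections import Counter


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def top_pages(sentence, index, k=2):
    """Stand-in for a Lucene query: titles of the k pages sharing most words."""
    query = words(sentence)
    scored = [(len(query & page_words), title)
              for title, page_words in index.items()]
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: (-pair[0], pair[1]))
    return [title for _score, title in ranked[:k]]


def summarize(sentences, index, n_pages=1, k=2):
    """Keep sentences that map to the most frequently hit pages."""
    hits_per_sentence = [top_pages(s, index, k) for s in sentences]
    # Accumulate how many times each page is hit across all sentences.
    page_hits = Counter(t for hits in hits_per_sentence for t in hits)
    best = {title for title, _count in page_hits.most_common(n_pages)}
    return [s for s, hits in zip(sentences, hits_per_sentence)
            if best & set(hits)]
```

Sentences stay in document order, and a sentence is kept as soon as any one of its pages is among the top `n_pages`.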


  • A sentence can get chosen even if mapping to only one wiki page.
  • The authors would like to take into account the importance of the pages they map to.

[M] Sauper, Christina. 2010. “Automated creation of Wikipedia articles.” Thesis, Massachusetts Institute of Technology. (Also published as a paper in ACL 2009.)

Sauper's process:

  • Looks at the structure of existing Wikipedia articles on similar topics. For example, a disease page has sections on diagnosis, causes, symptoms, and treatment.
  • Does a web search for the new term (one without an article) and collects the results.
  • Uses a supervised learning algorithm to identify fragments of text on the web that match the kinds of text found in the corresponding sections of existing articles.
  • Finally reassembles things with an integer linear programming algorithm for redundancy elimination.

She evaluates the approach by generating articles of between 12 and 100 sentences, and they read great.
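The final reassembly step can be illustrated with a toy version of the selection problem: pick one scored excerpt per section so that the total score is highest and no two chosen excerpts are too similar. The thesis solves this with integer linear programming. The sketch below instead enumerates every combination by brute force, which only works for tiny inputs. The scores, the Jaccard overlap test, the `max_overlap` threshold, and the function names are assumptions for the example.

```python
from itertools import product


def jaccard(a, b):
    """Word-set overlap between two excerpts, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def assemble(candidates, max_overlap=0.5):
    """candidates maps section name -> list of (score, excerpt) pairs.

    Returns {section: excerpt} with the highest total score among the
    combinations in which no two excerpts overlap more than max_overlap.
    """
    sections = list(candidates)
    best_total, best_choice = float("-inf"), None
    for choice in product(*(candidates[s] for s in sections)):
        texts = [text for _score, text in choice]
        redundant = any(jaccard(texts[i], texts[j]) > max_overlap
                        for i in range(len(texts))
                        for j in range(i + 1, len(texts)))
        if redundant:
            continue
        total = sum(score for score, _text in choice)
        if total > best_total:
            best_total, best_choice = total, choice
    if best_choice is None:
        return {}
    return {s: text for s, (_score, text) in zip(sections, best_choice)}
```

In this setup the per-excerpt scores would come from the supervised learning step, and the redundancy constraint is what stops the same fragment from filling two sections.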