Submissions/Mining Wikipedia public data

From Wikimania 2010 • Gdańsk, Poland • July 9-11, 2010


This is an open submission for Wikimania 2010.

Title of the submission
Mining Wikipedia public data
Type of submission (workshop, tutorial, panel, presentation)
Author of the submission
Felipe Ortega
E-mail address or username (if username, please confirm email address in Special:Preferences)
Country of origin
Affiliation, if any (organization, company etc.)
Libresoft, University Rey Juan Carlos
Personal homepage or blog
Abstract (please use no less than 300 words to describe your proposal)

Wikipedia is today one of the largest open collaborative projects on the Web. It also publishes a huge set of data dumps, openly offering all of its content together with information about community activity, following the same approach pursued in other types of open communities such as FLOSS development projects. This vast data repository can be mined for many different purposes: acquiring value-added information to complement web services, enriching semantic search engines, augmented reality applications, and multimedia content management solutions, or simply learning more about the key elements that drive editorial activity and content production within the different Wikipedia communities.

However, when it comes to actually creating and deploying efficient solutions to extract and analyze all this information, many developers consistently run into the same problems: parsing dump files, configuring the hardware and software infrastructure to run all required processes, parallelizing analysis tasks to cut execution times, etc. Without appropriate guidance and a set of useful hints for overcoming problems that are well known to the subset of developers and researchers who work with these dumps on a daily basis, newcomers may give up on their initial intentions and abandon their projects. This is especially true for people without a strong technical background, for whom all these complexities represent an insurmountable problem.
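As an illustration of the dump-parsing problem mentioned above, here is a minimal Python sketch of streaming through a MediaWiki XML dump without loading it into memory. It is not taken from any of the tools discussed in this submission. The function names, the tiny sample dump, and the choice of metric (revisions per contributor) are all assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_revisions_by_contributor(stream):
    """Stream a dump and count revisions per username (or IP for anonymous edits)."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "revision":
            # The revision is complete here, so its contributor is available.
            for child in elem.iter():
                if local_name(child.tag) in ("username", "ip"):
                    counts[child.text] += 1
                    break
            elem.clear()  # discard the revision text to keep memory bounded
        elif name == "page":
            elem.clear()  # discard what is left of the finished page
    return counts


# Invented sample data that imitates the layout of a full-history dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example</title>
    <revision><id>1</id><contributor><username>Alice</username></contributor><text>a</text></revision>
    <revision><id>2</id><contributor><ip>192.0.2.1</ip></contributor><text>b</text></revision>
    <revision><id>3</id><contributor><username>Alice</username></contributor><text>c</text></revision>
  </page>
</mediawiki>"""

counts = count_revisions_by_contributor(io.BytesIO(SAMPLE))
print(counts.most_common())
```

For a real compressed dump, the file object returned by `bz2.open(path, "rb")` from the standard library could be passed in place of the in-memory sample. Clearing each element once it has been processed is what keeps memory use roughly constant, regardless of dump size.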

The aim of this tutorial session is twofold. On the one hand, it will present a set of already available tools and environments (all licensed as FLOSS) suitable for automating many of the tedious and complex tasks that must be undertaken to mine Wikipedia data repositories. This will constitute the main body of the session and should also be accessible to a non-technical audience. On the other hand, in a selection of cases it will offer deeper hints and recommendations for developers on best practices for setting up and configuring working environments capable of carrying out all the tasks needed to work with Wikipedia data dumps.

Throughout the presentation, the use of state-of-the-art tools such as pywikipediabot will be stressed, as well as some interesting applications already available, like wrdese, wikitrust or WikiXRay.

Attendees of the tutorial should learn the basic steps to set up and run these tools, how to adapt them to their own purposes, and which limitations to take into account when using them.

Session program

  • We will devote the first 10-15 mins. to listening to the issues and interests of the audience, trying to compose a list of recurrent problems and limitations that researchers and practitioners would love to overcome when they retrieve and use Wikipedia data.
  • Finally, we will go through the list, trying to apply best practices on processing Wikimedia public data dumps to answer questions and gain valuable insights (remaining 30 mins).
Track (People and Community/Knowledge and Collaboration/Infrastructure)
Knowledge and Collaboration
Will you attend Wikimania if your submission is not accepted?

Yes, in any case.

Slides or further information (optional)
  • Slides
  • It would be great to contact Dmitry Chichkov, in case he's interested in providing first-hand feedback about his tool (wrdese).

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).

  1. Jodi.a.schneider 08:44, 20 May 2010 (UTC)
  2. Kristo 05:17, 21 May 2010 (UTC)
  3. Psychology 01:54, 23 May 2010 (UTC)
  4. Marinna 15:02, 25 May 2010 (UTC)
  5. --Friedel Völker 15:27, 28 May 2010 (UTC)
  6. Kocio 13:03, 2 June 2010 (UTC)
  7. Jérôme 14:59, 15 June 2010 (UTC)