Talk:Submissions/Mining Wikipedia public data


copying from http://sync.in/60kOfEwBHA in case that gets lost somehow. Jodi.a.schneider 10:16, 13 July 2010 (UTC)

List of recurrent issues encountered by WP researchers - Wikimania 2010 workshop

What kinds of problems do you have when you research? Answer below.

How do I make the dump into something I can use? Dumps are very big. Can I get slices? Can I get a corpus or random sample of pages?
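
For a quick random sample without touching the dumps at all, the API's list=random module can be scripted; a minimal sketch using only the Python standard library (the sample size, namespace filter, and User-Agent string are arbitrary choices for illustration):

  import json
  import urllib.parse
  import urllib.request

  API = "https://en.wikipedia.org/w/api.php"

  def random_articles(n=10):
      """Return n random main-namespace page titles via list=random."""
      params = {
          "action": "query",
          "list": "random",
          "rnnamespace": 0,   # main (article) namespace only
          "rnlimit": n,       # the API caps this value
          "format": "json",
      }
      req = urllib.request.Request(
          API + "?" + urllib.parse.urlencode(params),
          headers={"User-Agent": "wiki-research-sample/0.1 (workshop example)"},
      )
      with urllib.request.urlopen(req) as resp:
          data = json.load(resp)
      return [p["title"] for p in data["query"]["random"]]

  if __name__ == "__main__":
      for title in random_articles():
          print(title)

Note this samples live pages, not revisions; sampling from history still means going through the dumps.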

How do I systematically look at some particular segment of data - mentions of a single keyword, only talk pages for policies, etc. (Want full-text, deep search through histories)

  • Reconstruct history of Talk pages - diffs are noisy
  • Same with templates - what did a page look like at a particular moment in time? Because templates are expanded at render time, the stored "history" isn't really the page as it appeared on a given date (a sketch for at least pulling the wikitext as of a date follows this list)
  • Lose history of archived Talk pages (LiquidThreads may help)
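
For the "page as of a date" problem, the API's revisions module can at least return the last revision at or before a timestamp (templates are still expanded against today's template code, as noted above). A minimal sketch in the same stdlib-only style as the sampling example; the page title and timestamp are placeholders:

  import json
  import urllib.parse
  import urllib.request

  API = "https://en.wikipedia.org/w/api.php"

  def wikitext_as_of(title, timestamp):
      """Fetch the wikitext of `title` as it stood at `timestamp` (ISO 8601)."""
      params = {
          "action": "query",
          "prop": "revisions",
          "titles": title,
          "rvlimit": 1,
          "rvdir": "older",       # walk backwards in time...
          "rvstart": timestamp,   # ...starting from this moment
          "rvprop": "ids|timestamp|content",
          "format": "json",
      }
      req = urllib.request.Request(
          API + "?" + urllib.parse.urlencode(params),
          headers={"User-Agent": "wiki-research-history/0.1 (workshop example)"},
      )
      with urllib.request.urlopen(req) as resp:
          data = json.load(resp)
      page = next(iter(data["query"]["pages"].values()))
      rev = page["revisions"][0]
      return rev["timestamp"], rev["*"]   # "*" holds the wikitext in this JSON format

  if __name__ == "__main__":
      when, text = wikitext_as_of("Talk:Gdańsk", "2010-07-09T00:00:00Z")
      print(when, len(text), "characters of wikitext")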

It's unclear what counts as a useful result...

What statistical methods make sense given the structure of this data? How do I sample data? What samples make sense?

How do you download the current content of pages in Wikipedia?

  • A: database dumps, http://download.wikimedia.org ; the most recent complete dump of enwiki is at http://download.wikimedia.org/enwiki/20100130/ ; you need to understand some things about the data
  • But there are many difficulties in processing: the dumps are huge, there is no off-the-shelf analysis software, and there are time lags (a dump can be months old by the time it finishes, in some cases)
  • Many researchers aren't programmers - a GUI would be useful
  • http://enwp.org/Special:Export allows you to export up to 1000 pages in XML; you can specify all pages in a category (wow! I needed to know that!)
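
The same kind of category export can be scripted against the API with a category-members generator; a minimal sketch (the category name is a placeholder, and large categories need the API's continuation parameters, which this sketch omits):

  import json
  import urllib.parse
  import urllib.request

  API = "https://en.wikipedia.org/w/api.php"

  def category_wikitext(category, limit=50):
      """Yield (title, wikitext) for up to `limit` current pages in a category."""
      params = {
          "action": "query",
          "generator": "categorymembers",
          "gcmtitle": "Category:" + category,
          "gcmlimit": limit,
          "prop": "revisions",
          "rvprop": "content",
          "format": "json",
      }
      req = urllib.request.Request(
          API + "?" + urllib.parse.urlencode(params),
          headers={"User-Agent": "wiki-research-export/0.1 (workshop example)"},
      )
      with urllib.request.urlopen(req) as resp:
          data = json.load(resp)
      for page in data.get("query", {}).get("pages", {}).values():
          revs = page.get("revisions")
          if revs:   # defensive: skip anything returned without revision content
              yield page["title"], revs[0]["*"]

  if __name__ == "__main__":
      for title, text in category_wikitext("Wikipedia policies"):
          print(title, len(text))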

Redirects are now tagged in the public data dumps.


Certain tables cannot be accessed because of privacy issues: deleted revisions and pages, access logs.


How can we share the data analysis that we're doing - shared corpus, data archiving, ...


What kind of data can I get and not get?

I want to know about deleted pages (to study first-mover advantage in article creation -- is it partly selection at creation time?) - there is currently no way to do this (which also means you can't know the history of a deleted page)

Can have a bot that monitors recent changes and grabs the text of every page when it is put on AfD, etc., but this is costly and makes a LOT of requests to the API (it also doesn't work for historical info -- but it's a good idea for a current study, thanks!)
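
A minimal sketch of that monitoring idea, polling the API's recentchanges list (the poll interval, namespaces, and what you actually do with each change depend on the study; a real bot should also use the API's continuation parameters instead of re-fetching and de-duplicating like this):

  import json
  import time
  import urllib.parse
  import urllib.request

  API = "https://en.wikipedia.org/w/api.php"
  HEADERS = {"User-Agent": "wiki-research-rc-monitor/0.1 (workshop example)"}

  def recent_changes(limit=50):
      """Return the most recent edits and page creations from list=recentchanges."""
      params = {
          "action": "query",
          "list": "recentchanges",
          "rcprop": "title|ids|timestamp|user|comment",
          "rctype": "edit|new",
          "rclimit": limit,
          "format": "json",
      }
      req = urllib.request.Request(API + "?" + urllib.parse.urlencode(params),
                                   headers=HEADERS)
      with urllib.request.urlopen(req) as resp:
          return json.load(resp)["query"]["recentchanges"]

  if __name__ == "__main__":
      seen = set()
      while True:
          for rc in recent_changes():
              if rc["rcid"] not in seen:
                  seen.add(rc["rcid"])
                  # here you would fetch the page text, check for an AfD notice, etc.
                  print(rc["timestamp"], rc["title"], rc.get("comment", ""))
          time.sleep(30)   # be gentle with the API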


What tools/best practices can we share/should we know about?


Tools for analyzing particular articles


Bots and code

Computer resources

Tools for dealing with particular dumps

What are these good for? (classify me)

Best practices:

  • 1: Be(a)ware of special pages

Redirects are now tagged (but this is new)

  • 2 Hardware:

  • Try to parallelize
  • Buy memory before hard drives - decompress on the fly (see the streaming sketch below)
  • RAID 10 in Linux can often work well (for average studies)
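
On "decompress on the fly": the pages-articles .bz2 dump can be streamed and parsed without ever writing the decompressed XML to disk. A minimal sketch with the Python standard library (the dump filename is a placeholder; the XML namespace is stripped rather than hard-coded so the sketch isn't tied to one dump schema version):

  import bz2
  import xml.etree.ElementTree as ET

  DUMP = "enwiki-20100130-pages-articles.xml.bz2"   # placeholder filename

  def local(tag):
      """Strip the XML namespace so element names can be matched directly."""
      return tag.rsplit("}", 1)[-1]

  def iter_pages(path):
      """Stream <page> elements out of a .bz2 dump without full decompression."""
      with bz2.BZ2File(path) as f:
          context = ET.iterparse(f, events=("start", "end"))
          _, root = next(context)   # the enclosing <mediawiki> element
          for event, elem in context:
              if event == "end" and local(elem.tag) == "page":
                  title, is_redirect = None, False
                  for child in elem:
                      name = local(child.tag)
                      if name == "title":
                          title = child.text
                      elif name == "redirect":   # redirects are tagged (see best practice 1)
                          is_redirect = True
                  yield title, is_redirect
                  root.clear()   # essential: keeps memory use flat on a huge dump

  if __name__ == "__main__":
      pages = redirects = 0
      for title, is_redirect in iter_pages(DUMP):
          pages += 1
          redirects += is_redirect
      print(pages, "pages,", redirects, "redirects")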

  • 3 Database

The standard database configuration will not work efficiently - MySQL's defaults are tuned for the average user, who needs transaction control, concurrent reads/writes among many users, etc., whereas a research bulk load is a very different workload.
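
To illustrate the underlying point (using SQLite from the Python standard library rather than MySQL, purely to keep the sketch self-contained): committing transactionally on every row is dramatically slower than batching a bulk load into one transaction.

  import os
  import sqlite3
  import tempfile
  import time

  def load(rows, batched):
      """Time inserting `rows` with per-row commits vs. one big transaction."""
      fd, path = tempfile.mkstemp(suffix=".db")
      os.close(fd)
      try:
          conn = sqlite3.connect(path)
          conn.execute("CREATE TABLE revs (page_id INTEGER, rev_id INTEGER)")
          start = time.time()
          if batched:
              with conn:   # one transaction for the whole load
                  conn.executemany("INSERT INTO revs VALUES (?, ?)", rows)
          else:
              for row in rows:   # commit (and sync to disk) after every row
                  conn.execute("INSERT INTO revs VALUES (?, ?)", row)
                  conn.commit()
          conn.close()
          return time.time() - start
      finally:
          os.remove(path)

  if __name__ == "__main__":
      rows = [(i, i * 10) for i in range(2000)]
      print("per-row commits: %.2fs" % load(rows, batched=False))
      print("one transaction: %.2fs" % load(rows, batched=True))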

  • 4 Code

Release your code - and use good coding practices (documentation, version releases, svn)

  • 5 Use the right "spell"

Use the http://meta.wikimedia.org/wiki/Pywikipediabot library for Python - it's awesome. It queries the API and makes data collection a lot easier. It was originally written for bots. 20x better performance than any other solution. It doesn't work with dump files; it queries the live site directly through the API. It doesn't do much for analysis, but it returns structured data that can be fed into your own database.
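
A minimal sketch, assuming the 2010-era pywikipedia ("compat") framework is checked out, on the path, and configured with a user-config.py; the calls below follow that framework (the later pywikibot rewrite renamed most of them), so check the current docs before relying on details such as the layout of the version-history tuples:

  # pywikipedia "compat" framework (Python 2-era codebase)
  import wikipedia   # the framework's core module

  site = wikipedia.getSite("en", "wikipedia")
  page = wikipedia.Page(site, "Gdansk")   # placeholder title

  print(page.get()[:200])   # current wikitext, first 200 characters

  # Revision metadata, ready to feed into your own database
  for rev in page.getVersionHistory():   # tuples of (id, timestamp, user, comment) or similar
      print(rev)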

  • 6 Avoid reinventing the wheel


  • 7 Automate everything!

Reproducibility! Humans make many, many mistakes. This is why we don't have graphical tools (BUT IMO graphical interfaces can help you refine the procedure and understand what you can do!)

  • 8 Always expect the worst

  • The average case isn't going to work - the data is too large - standard algorithms take a long time
  • Specific cases, caveats
  • Don't underestimate the amount of time it will take

  • 9 Not many graphical interfaces

Difficult to automate, hard to display real-time results

Communication channels

Literature reviews/summarizing what's been done - see the many bibliographies - Want to help integrate? Summarize articles?