Talk:Submissions/Mining Wikipedia public data
List of recurrent issues encountered by WP researchers - Wikimania 2010 workshop
What kinds of problems do you have when you research? Answer below.
How do I make the dump into something I can use? Dumps are very big. Can I get slices? Can I get a corpus or random sample of pages?
How do I systematically look at a particular segment of the data - mentions of a single keyword, only talk pages for policies, etc.? (This calls for full-text, deep search through histories)
- Reconstructing the history of Talk pages is hard - diffs are noisy
- Same with templates: because templates are expanded at render time, rendering an old revision today doesn't show what the page actually looked like on that date - the "history" isn't really the history as of a date in time
- The history of archived Talk pages is lost (LiquidThreads may help)
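The slicing question above can be approached by streaming the XML dump rather than loading it. A minimal sketch, assuming the pages-articles export format; the in-memory sample below is a stand-in for a real multi-gigabyte dump file, and the `Talk:` prefix filter is just an example of a "segment":

```python
# Sketch: stream-filter a MediaWiki XML dump without loading it into memory.
# SAMPLE_DUMP is a tiny stand-in for a real pages-articles dump file.
import io
import xml.etree.ElementTree as ET

SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>Talk:Foo</title><revision><text>keyword here</text></revision></page>
  <page><title>Bar</title><revision><text>nothing</text></revision></page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.4/}"

def slice_dump(stream, title_prefix="Talk:"):
    """Yield (title, text) for pages whose title starts with title_prefix."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            if title and title.startswith(title_prefix):
                yield title, text
            elem.clear()  # free memory as we go - crucial for multi-GB dumps

pages = list(slice_dump(io.BytesIO(SAMPLE_DUMP)))
```

For a real dump, pass an open (decompressed) file object instead of the `BytesIO` buffer; the filter predicate is where a keyword or namespace check would go.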
It's unclear what counts as a useful result...
What statistical methods make sense given the structure of this data? How do I sample data? What samples make sense?
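One standard answer to the sampling question is reservoir sampling: a uniform random sample of fixed size taken in a single pass over a stream whose length you don't know in advance (e.g. pages coming out of a dump parser). A sketch, not tied to any particular data source:

```python
# Sketch: one-pass uniform sampling from a stream of unknown length
# (classic reservoir sampling). `stream` could be pages from a dump parser.
import random

def reservoir_sample(stream, k, rng=random):
    """Return a uniform random sample of k items from an iterable."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with decreasing probability
    return sample

sample = reservoir_sample(range(1000), 10)
```

Each item ends up in the sample with probability k/n, so the result is a uniform sample of pages even though the total page count was never known.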
How do you download the current content of pages in Wikipedia?
- A: database dumps, http://download.wikimedia.org , most recent complete dump of enwiki is at http://download.wikimedia.org/enwiki/20100130/ ; you need to understand the structure of the data before using them
- But there are many difficulties in processing: the files are huge, there is no ready-made analytic software, and there are time lags (a dump can be months old by the time it finishes, in some cases)
- Many researchers aren't programmers - a GUI would be useful
- http://enwp.org/Special:Export allows you to export up to 1000 pages in XML; you can specify all pages in a category (wow! I needed to know that!)
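Besides Special:Export, the MediaWiki API (api.php) can return the current wikitext of a batch of pages. A minimal sketch that only builds the request URL - pass it to any HTTP client; the parameter names (`action`, `prop`, `rvprop`, `titles`) are standard API query parameters, and the page titles are just examples:

```python
# Sketch: build an api.php query URL for the current content of some pages.
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"

def current_content_url(titles):
    """URL asking the API for the latest revision text of the given pages."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": "|".join(titles),  # batch several titles per request
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = current_content_url(["Wikipedia:Notability", "Wikipedia:Verifiability"])
```

Batching titles into one request is much kinder to the servers than fetching pages one at a time.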
Redirects are now tagged in the public data dumps.
Certain tables cannot be accessed because of privacy issues: deleted revisions and pages, access logs.
How can we share the data analysis that we're doing - shared corpus, data archiving, ...
What kind of data can I get and not get?
I want to know about deleted pages (to study first mover advantage in article creation -- is it partly selection at creation time?) - currently there is no way to do this (which also means you can't know the history of a deleted page)
- You can have a bot that monitors recent changes and grabs the text of every page when it is put on AfD, etc., but this is costly and makes a LOT of requests to the API (it also doesn't work for historical info -- but it's a good idea for a current study, thanks!)
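The bot idea above boils down to a polling loop. A sketch of its skeleton: the two fetchers are stand-ins for real API calls (`list=recentchanges` for changed titles, a revisions query for text), which also makes the loop testable offline, and the AfD check is just one example of a predicate:

```python
# Sketch of a recent-changes monitor. fetch_changes/fetch_text are pluggable
# stand-ins for real API calls; is_interesting decides what to keep (e.g.
# pages put on AfD).
import time

def monitor(fetch_changes, fetch_text, is_interesting, polls=1, delay=0):
    """Collect {title: text} for interesting changed pages, with throttling."""
    grabbed = {}
    for _ in range(polls):
        for title in fetch_changes():
            if title not in grabbed and is_interesting(title):
                grabbed[title] = fetch_text(title)  # one API request per page!
        time.sleep(delay)  # keep the request rate polite
    return grabbed
```

In a real run, `delay` should be generous - this pattern is exactly where the "LOT of requests" warning bites.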
What tools/best practices can we share/should we know about?
Tools for analyzing particular articles
- http://toolserver.org/~daniel/WikiSense/Contributors.php - number of contributors
- http://toolserver.org/~mzmcbride/watcher/ - number of people who are watching a page
- http://stats.grok.se/en/201007/ - page view statistics: most viewed pages, largest number of editors in a month
- http://wikidashboard.parc.com - in-place visualization of editing activity
Bots and code
- http://meta.wikimedia.org/wiki/Pywikipediabot pywikipediabot - queries the Wikipedia API
- http://toolserver.org/~daniel/ Talk to Daniel about Toolserver accounts
Tools for dealing with particular dumps
- http://en.wikipedia.org/wiki/Wikipedia:Database_download - Information on downloading the database
What are these good for? (classify me)
- http://meta.wikimedia.org/wiki/WikiXRay quantitative analysis tool (from Felipe Ortega et al)
- http://meta.wikimedia.org/wiki/User:Micke/WikiFind search tool for database dumps
- http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/ - preprocessor for XML dumps, "eliminates some information and adds other useful information"
- http://www.mediawiki.org/wiki/Alternative_parsers - List of parsers
- http://static.wikipedia.org/ - Static HTML dumps
- 1: Be(a)ware of special pages
Redirects are now tagged (but this is new)
- 2: Hardware
Try to parallelize. Buy memory before hard drives - decompress on the fly. RAID 10 on Linux can often work well (for average studies).
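"Decompress on the fly" means never writing the unpacked dump to disk: read the .bz2 file in chunks and process lines as they come out of the decompressor. A sketch, with an in-memory buffer standing in for a multi-GB enwiki dump:

```python
# Sketch: stream a .bz2 dump line by line instead of unpacking it to disk.
# `sample` stands in for an open dump file of many gigabytes.
import bz2
import io

sample = bz2.compress(b"line one\nline two\nline three\n")

def stream_lines(fileobj, chunk_size=1 << 16):
    """Yield decoded lines from a bz2 stream, reading chunk by chunk."""
    decomp = bz2.BZ2Decompressor()
    buf = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buf += decomp.decompress(chunk)
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line.decode("utf-8")
    if buf:
        yield buf.decode("utf-8")  # trailing partial line, if any

lines = list(stream_lines(io.BytesIO(sample)))
```

Only one chunk plus a partial line is ever in memory, which is why RAM for buffers beats disk space for an unpacked copy.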
- 3: Database
The standard configuration of the database will not work efficiently - it is tuned for the average MySQL user, who needs transaction control, concurrent read/write among many users, etc., which a single-user research workload does not.
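To make the point concrete, a hedged illustration of the kind of settings involved - the values below are placeholders to adapt to your hardware and data, not tuned recommendations:

```ini
# Illustrative my.cnf fragment for a single-user, read-mostly research box.
[mysqld]
key_buffer_size = 2G                  # MyISAM index cache: RAM beats disk
innodb_buffer_pool_size = 4G          # if your tables are InnoDB
innodb_flush_log_at_trx_commit = 0    # favor speed; research data can be re-imported
skip-networking                       # local analysis only, no remote clients
```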
- 4: Code
Release your code - and use good coding practices (documentation, version releases, svn)
- 5: Use the right "spell"
Use the http://meta.wikimedia.org/wiki/Pywikipediabot library for Python - it's awesome. It queries the API and makes data collection a lot easier. It was originally written for bots, and has much better performance than the alternatives (reportedly 20x). It doesn't work with dump files; it queries the live site directly. It doesn't do much for analysis, but it returns structured data that can be fed into your own database.
- 6: Avoid reinventing the wheel
- 7: Automate everything!
Reproducibility! Humans make many, many mistakes. This is why we don't have graphical tools (BUT IMO Graphical interfaces can help you refine the procedure and understand what you can do!)
- 8: Always expect the worst
The average case isn't going to work - the data is too large, and standard algorithms take a long time. Watch out for specific cases and caveats. Don't underestimate the amount of time it will take.
- 9: Not many graphical interfaces
Graphical tools are difficult to automate, and it is hard to display real-time results.
Literature reviews/summarizing what's been done - see the many bibliographies - Want to help integrate? Summarize articles?