Reciprocal Enrichment between Wikipedia and Machine Translators
Mikel Iturbe
Unai Fdz. de Betoño
Galder Gonzalez
Arkaitz Zubiaga
Iñaki Alegria
Gorka Labaka
Kepa Sarasola
University of the Basque Country
One of Wikipedia's successes is the high number of languages in which it is available. Nonetheless, the rapid growth of the English Wikipedia is leaving most other languages behind, especially minority languages, where the number of contributors is far smaller. In this sense, partially automated processes relying on tasks such as machine translation offer a new option for easing article creation in many languages. Currently, the translations produced by existing machine translation systems are riddled with inaccuracies. Hence, they are more useful for understanding the meaning of the source text than for obtaining a correct translation, since the subsequent post-editing process requires considerable human effort.

In this context, we present 'OpenMT-2: Hybrid Machine Translation and advanced evaluation', a project in which Basque Wikipedia contributors collaborate with the University of the Basque Country (EHU) and the Polytechnic University of Catalonia (UPC), funded by the Spanish Ministry of Science and Innovation (TIN2009-14675-C03-01). Within this project, a set of 100 long articles from the Spanish Wikipedia will be selected and then translated into Basque using Matxin-Opentrad, an open-source rule-based machine translation system. The authors have presented their Spanish-to-Basque translation approaches in previous work[1].

The automatically translated articles will be full of errors. A group of users from Basque Wikipedia will therefore review them and correct the errors they find, a process known as post-editing. The changes made by these users will be logged, and the corrected articles will be added to Basque Wikipedia.

Researchers from the aforementioned universities will analyze the resulting post-editing logs. They can then improve their machine translation process either by manually refining the different modules of their MT system or by implementing an automated statistical post-editing process[2], which is also expected to improve translation accuracy for the Spanish-Basque language pair[3].
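As a rough illustration of what such a log analysis might involve, the following Python sketch compares a machine-translated sentence with its post-edited version and counts the word-level corrections that recur. It is only a minimal sketch under stated assumptions: the log is assumed to be a list of (MT output, post-edited text) pairs, and the function names word_edits and frequent_corrections are hypothetical. The project's actual log format and analysis method are not described here.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_edits(mt_output: str, post_edited: str):
    """Return (operation, mt_words, edited_words) for every changed word span.

    Assumption: whitespace tokenization is good enough for a first look.
    """
    src = mt_output.split()
    tgt = post_edited.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only 'replace', 'delete' and 'insert'
            edits.append((tag, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


def frequent_corrections(log):
    """Count how often each (MT span, corrected span) substitution occurs.

    Assumption: `log` is an iterable of (mt_output, post_edited) string pairs.
    """
    counts = Counter()
    for mt_output, post_edited in log:
        for tag, before, after in word_edits(mt_output, post_edited):
            if tag == "replace":
                counts[(before, after)] += 1
    return counts
```

Substitutions that recur across many articles could point to a module worth fixing by hand, and the same aligned pairs are the kind of data a statistical post-editing system would be trained on.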

At the moment, the researchers are examining different alternatives for building a human post-editing interface suitable for translating Wikipedia content by adapting existing free and open-source software: (1) OmegaT, a free translation memory application, seems suitable for the task; (2) the World Wide Lexicon (WWL) Translator is a Firefox add-on that detects the language of a page when its URL is opened and translates it using human and machine translations, and that lets users view and create translations for any website, although its post-editing interface does not yet work reliably; (3) the Google Translator Toolkit provides specific help for translating Wikipedia content, but it is not free and open-source software.

At Wikimania 2010, we would like to present the details of the OpenMT-2 project and show the benefits of collaboration between Wikipedia and universities, with the aim of increasing the resources available for information processing and generation.


  1. Alegria I., Arregi X., Díaz de Ilarraza A., Labaka G., Lersundi M., Mayor A., Sarasola K. 2008. Strategies for sustainable MT for Basque: incremental design, reusability, standardization and open-source. Proceedings of the IJCNLP-08, pp. 235-243. Hyderabad, India.
  2. Simard M., Ueffing N., Isabelle P., Kuhn R. 2007. Rule-based translation with statistical phrase-based post-editing. Proceedings of the Second Workshop on Statistical Machine Translation, pp. 203-206. Prague, Czech Republic.
  3. Díaz de Ilarraza A., Labaka G., Sarasola K. 2008. Statistical Post-Editing: A Valuable Method in Domain Adaptation of RBMT Systems. MATMT2008 workshop: Mixing Approaches to Machine Translation, pp. 35-40.
Knowledge and Collaboration
