Machine Translation Turing Test

Machine Translation
Will computers ever reach the quality of professional translators?

The U.S. government spent 4.5 billion USD from 1990 through 2009 on outsourcing translations and interpretations [8]. If these translations were automated, much of this money could have been spent elsewhere. The research field of Machine Translation (MT) tries to develop systems capable of translating verbal language (i.e. speech and writing) from a certain source language to a target language.

Because verbal language is broad, allowing people to express a great number of things, one must take into account many factors when translating text from a source language to a target language. Three main difficulties when translating are proposed in [12]: the translator must distinguish between general vocabulary and specialized terms, as well as various possible meanings of a word or phrase, and must take into account the context of the source text.

Machine Translation systems must overcome the same obstacles as professional human translators in order to translate text accurately. To achieve this, researchers have tried a variety of approaches over the past decades, such as [5, 1, 10]. At first, the knowledge-based paradigm was dominant. After promising results with a statistics-based system ([1, 2]), the focus shifted towards this new paradigm.

Knowledge-based MT

The Knowledge-based Machine Translation paradigm (also called Rule-based MT) typically focuses on sets of rules to translate texts [14]. It is based on linguistic information from languages, such as dictionaries and grammar. Nirenburg writes about three major types: direct, interlingua and transfer.

Direct translation systems rely on a set of rules that depend on the specific source and target language pair. In general, these rules describe the grammatical and lexical phenomena of the source language and state how to translate them into the target language. An example of a system implementing a direct approach is SYSTRAN [5].

In an interlingua system, the source and target language are not directly connected through rules [14]. Nirenburg writes that the system first expresses the source language text in a formal language to represent the meaning. Then, the system uses rules of the target language to express this meaning. Thus in this system the languages never come into contact with each other.
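The separation described above can be illustrated with a minimal Python sketch. It works on single words only, and the concept labels and lexicons are hypothetical; a real interlingua system would represent the meaning of whole sentences in a formal language. The point is only that no English-to-Dutch table exists anywhere in the code.

```python
# Hypothetical analysis lexicons: words of a language -> neutral concepts.
ANALYSIS = {"english": {"car": "CAR", "pencil": "PENCIL"}}

# Hypothetical generation lexicons: neutral concepts -> words of a language.
GENERATION = {"dutch": {"CAR": "auto", "PENCIL": "potlood"}}


def to_interlingua(word, source_language):
    """Express a source word as a language-neutral concept."""
    return ANALYSIS[source_language][word]


def from_interlingua(concept, target_language):
    """Express a language-neutral concept in the target language."""
    return GENERATION[target_language][concept]


def translate_word(word, source_language, target_language):
    # The two languages only meet through the intermediate concept.
    concept = to_interlingua(word, source_language)
    return from_interlingua(concept, target_language)


print(translate_word("car", "english", "dutch"))
```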

Transfer systems are similar to interlingua systems. However, where interlingua is fully language-pair independent, in transfer systems there is some dependence. The actual implementation of this is varied. In general, the system creates a formal representation of the meaning of the source text, and then expresses this in the target language. However, during the translation to the target, the system can make use of, for example, dictionaries between the two languages and grammatical rules to use in correspondence with the formal representation [14].

Statistical MT and Example-based MT

In Statistical Machine Translation (SMT), translations are made based on statistical models. The idea behind SMT comes from information theory. The models are trained on a large bilingual text corpus, which is a large and structured set of texts used for statistical analysis. These models are based on probability distributions p(e|f), where p(e|f) is the probability that a target language string e is the translation of the source language string f [15]. To translate f, the system computes the string \hat{e} that maximizes this probability:

  \hat{e} = \underset{e}{\text{argmax}} \left( p(e|f) \right)
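As a rough illustration of this argmax, here is a minimal Python sketch. It assumes a tiny hand-made table of p(e|f); the candidate strings and probabilities are invented for illustration and do not come from a trained model. A real SMT system has to search a vastly larger space of candidate strings.

```python
# Toy table of p(e|f): for each source string f, a few candidate
# translations e with made-up probabilities (hypothetical values).
P_E_GIVEN_F = {
    "dat is mijn auto": {
        "that is my car": 0.7,
        "that is mine car": 0.2,
        "this is my car": 0.1,
    },
}


def translate(f):
    """Return e_hat = argmax_e p(e|f) over the candidates listed for f."""
    candidates = P_E_GIVEN_F[f]
    return max(candidates, key=candidates.get)


print(translate("dat is mijn auto"))
```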

Example-based Machine Translation (EBMT) is similar to SMT in that it uses a bilingual text corpus to generate translations; examples of such systems are described in [4, 10]. The difference between EBMT and SMT is that, instead of using probability distributions, EBMT attempts to match the input against the corpus to extract examples by analogy, which are then merged to find the correct translation [18]. Consider the following example corpus:

English              | Dutch
That is my pencil.   | Die potlood is van mij.
That is my car.      | Die auto is van mij.

In this case, the machine would learn

  1. That is my X. ↔ Die X is van mij.
  2. pencil ↔ potlood
  3. car ↔ auto
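The generalization step above can be sketched in Python as follows. This is a minimal sketch under strong assumptions: the two sentence pairs must have the same number of words and differ in exactly one word on each side. The function name `generalize` is hypothetical, and real EBMT systems use far more robust matching.

```python
def split_diff(x, y):
    """Return (template, word_in_x, word_in_y) for two sentences that
    differ in exactly one word; the differing slot becomes X."""
    xs, ys = x.rstrip(".").split(), y.rstrip(".").split()
    if len(xs) != len(ys):
        raise ValueError("sentences must have the same number of words")
    diff = [i for i, (p, q) in enumerate(zip(xs, ys)) if p != q]
    if len(diff) != 1:
        raise ValueError("sentences must differ in exactly one word")
    i = diff[0]
    template = " ".join(xs[:i] + ["X"] + xs[i + 1:])
    return template, xs[i], ys[i]


def generalize(pair_a, pair_b):
    """Derive a bilingual template and word correspondences from two
    (source, target) sentence pairs."""
    (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b
    src_template, src_word_a, src_word_b = split_diff(src_a, src_b)
    tgt_template, tgt_word_a, tgt_word_b = split_diff(tgt_a, tgt_b)
    lexicon = {src_word_a: tgt_word_a, src_word_b: tgt_word_b}
    return (src_template, tgt_template), lexicon


template, lexicon = generalize(
    ("That is my pencil.", "Die potlood is van mij."),
    ("That is my car.", "Die auto is van mij."),
)
print(template)
print(lexicon)
```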

For some other language pairs, this would be slightly more difficult. For example, Russian demonstratives and possessive pronouns change with grammatical gender, and the word “is” has no counterpart in the Russian translations:

English              | Russian              | Transliteration
That is my pencil.   | Этот карандаш мой.   | Etot karandash moy.
That is my car.      | Эта машина моя.      | Eta mashina moya.

Translation Memory

The MT methods described above are all autonomous: once their rules have been implemented or they have been trained, they translate texts without human interaction. A different type of MT, Translation Memory (TM), is described in [18]. TM is similar to EBMT in that it has a database of bilingual examples that have been translated previously. Unlike EBMT, Translation Memory is not used to translate texts autonomously. Instead, people use it as a tool to help them translate texts: the TM system tries to find translations for individual sentences and shows them to the translator, who can then choose a translated sentence and modify it where necessary.
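A Translation Memory lookup can be sketched in Python as a fuzzy search over previously translated sentences. This is a minimal sketch, assuming character-level similarity from the standard library's `difflib.SequenceMatcher` and a hypothetical threshold of 0.6; the memory contents and the function name `suggest` are illustrative only.

```python
from difflib import SequenceMatcher

# Hypothetical memory of previously translated (source, target) sentences.
MEMORY = [
    ("That is my pencil.", "Die potlood is van mij."),
    ("That is my car.", "Die auto is van mij."),
]


def suggest(sentence, memory, threshold=0.6):
    """Return (score, source, target) suggestions, best match first.

    The human translator picks a suggestion and edits it where necessary.
    """
    scored = []
    for source, target in memory:
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score >= threshold:
            scored.append((score, source, target))
    return sorted(scored, reverse=True)


for score, source, target in suggest("That is my cat.", MEMORY):
    print(f"{score:.2f}  {source}  ->  {target}")
```

The system only proposes candidates here; nothing is translated without a person choosing and correcting a suggestion.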

Quality of Translation

An important part of Machine Translation research is evaluation of generated translations. A human could evaluate the translations, but this is expensive and subjective. As such, automated, quantitative translation evaluations are desired [11]. Such a system has been developed, called BLEU (bilingual evaluation understudy) [17].

BLEU works by taking a corpus of good-quality human translations of a text and calculating a numeric “translation closeness” metric between a machine translation of that text and the corpus. This closeness metric is modeled on the word error rate metric, which counts the number of substitutions, deletions and insertions of words needed to transform some text A into some text B; BLEU itself scores the overlap of word sequences (n-grams) between the machine translation and the reference translations [17].
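The word error rate mentioned above can be sketched in Python as a word-level edit distance divided by the length of the reference. This is a minimal sketch of word error rate only, not of BLEU; the function name and the example sentences are illustrative.

```python
def word_error_rate(hypothesis, reference):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn `hypothesis` into `reference`, divided by the number
    of words in `reference`."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edits needed to turn the hypothesis words seen so far
    # into the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # delete h
                cur[j - 1] + 1,          # insert r
                prev[j - 1] + (h != r),  # substitute h by r, or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("that is mine car", "that is my car"))
```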

BLEU is based on the assumption that a machine translation is better the closer it is to a professional human translation. As such, the remainder of this article investigates whether the current approaches to Machine Translation (SMT and EBMT) will ever enable machines to translate texts such that a human cannot distinguish the translation from a professional human translation.

The Turing test tests whether a machine has the ability to show intelligent behaviour, irrespective of its actual intelligence [16]. In the test, humans interrogate the system and other humans. The interrogators do not know whether the entity they are interrogating is a human or a machine. If the interrogators cannot reliably distinguish the machine from the humans, the machine is said to have passed the test. In the case of translation, one could say that an MT system passes the test if humans cannot reliably distinguish between the system’s translations and human translations of a text.
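One possible way to make “cannot reliably distinguish” concrete is a one-sided binomial test on the judges’ guesses. The Python sketch below is only an assumption about how such an evaluation could be scored: each judgment is treated as an independent guess between “human” and “machine”, and the significance level of 0.05 is a conventional choice, not something prescribed by the Turing test.

```python
from math import comb


def p_value_above_chance(correct, trials):
    """Probability of getting at least `correct` of `trials` judgments
    right by pure guessing (success probability 0.5 per judgment)."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


def indistinguishable(correct, trials, alpha=0.05):
    """True if the judges did not beat chance at significance level alpha."""
    return p_value_above_chance(correct, trials) >= alpha


print(indistinguishable(5, 10))  # judges right half of the time
print(indistinguishable(9, 10))  # judges right almost every time
```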


In this section we will review the problems of several current approaches to Machine Translation.


Both Statistical Machine Translation and Example-based Machine Translation require a database to learn how to translate texts from a source to a target language [15, 10]. There are a few issues regarding obtaining data and constructing these databases, however.

Parallel corpora are required for Example-based MT. These corpora have to be aligned, usually on a sentence-to-sentence or paragraph-to-paragraph basis [18]. Aligning manually is a slow process, and impractical for large databases. Preferably the alignment should be fully automated [6].

There are multiple methods of automatic alignment, such as the ones described in [1] and [6]. The latter program uses sentence length as a metric, on the assumption that there is a significant correlation between the lengths of two sentences that are translations of each other. However, it has been stated that the early success of automatic alignment was due to the fact that most researchers used the “well-behaved” English-French corpus of the Canadian parliament’s transcripts [19]. Here, “well-behaved” means that most sentences and paragraphs can be lined up (e.g. any one sentence is translated to exactly one sentence) and that there is little noise (e.g. footnotes are at the same relative position when the transcripts are digitized).
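Length-based alignment can be sketched in Python with dynamic programming. This is a simplified sketch, not the program of [6]: it assumes character counts as the length measure, uses the absolute length difference as a crude cost instead of a probabilistic model, and applies a hypothetical fixed penalty for sentences left unmatched. The example sentences are invented.

```python
def align_by_length(src, tgt, skip_penalty=10):
    """Align two lists of sentences by comparing character lengths.

    Allowed groupings (source sentences - target sentences):
    1-1, 1-0, 0-1, 2-1 and 1-2. Returns a list of (src_group, tgt_group).
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 0), (0, 1), (2, 1), (1, 2)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                penalty = skip_penalty if 0 in (di, dj) else 0
                new_cost = cost[i][j] + abs(len_src - len_tgt) + penalty
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest grouping.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return groups[::-1]


english = ["I came.", "I saw.", "I conquered everything in sight."]
dutch = ["Ik kwam, ik zag.", "Ik overwon alles wat ik zag."]
for src_group, tgt_group in align_by_length(english, dutch):
    print(src_group, "<->", tgt_group)
```

In this example the first two English sentences are grouped with the single Dutch sentence that covers both, because that grouping has the smallest total length difference.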

Furthermore, some metrics pose problems when aligning text pairs from other languages, such as English and Chinese [19]. Specifically, it is not clear whether sentence lengths are as strongly correlated in that case. Moreover, although counting characters works well for aligning English and French sentences, it is likely to yield poor results for English and Chinese, because the two languages use entirely different writing systems.

Analogously, counting words as a measure of sentence length may not yield satisfactory results for every language pair. For example, when aligning English and Dutch texts, the sentence length correlation can be weakened by the fact that Dutch forms compound words. As the table below shows, two English sentences of eight words each can have significantly different lengths in translation, depending on whether words are compounded in Dutch.

English                                   | Dutch
She is a writer of books for children.    | Zij is een kinderboekenschrijfster.
She is an avid reader of science books.   | Zij is een begerig lezer van wetenschapsboeken.
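The effect in the table can be checked with a few lines of Python. The sketch simply counts whitespace-separated words, which is an assumption about how a word-based length metric would tokenize the text.

```python
PAIRS = [
    ("She is a writer of books for children.",
     "Zij is een kinderboekenschrijfster."),
    ("She is an avid reader of science books.",
     "Zij is een begerig lezer van wetenschapsboeken."),
]


def word_count(sentence):
    """Number of whitespace-separated words in a sentence."""
    return len(sentence.split())


for english, dutch in PAIRS:
    ratio = word_count(dutch) / word_count(english)
    print(word_count(english), word_count(dutch), ratio)
```

Both English sentences have eight words, but the Dutch translations have four and seven words, so the Dutch-to-English word ratio is 0.5 for the first pair and 0.875 for the second.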

Researchers designing a statistical phrase-based translation system noticed that its performance varied significantly depending on the method used to create the phrase translation table [10]. They found that aligning texts on words gave better results than aligning on phrases. Furthermore, they observed that different heuristics gave the best result under different circumstances: not only the specific language pair but also the size of the training database influenced how well a heuristic performed.

Neither the suitability nor the required number of examples in a corpus is clear. In one experiment, an algorithm’s translation accuracy rose approximately linearly from roughly 30% to 75% as more examples were added to the corpus [13]. The researchers started with 100 examples and steadily added more, up to the full set of 774. It is assumed, however, that accuracy cannot be raised to 100% simply by adding more suitable examples [18].

Furthermore, adding more examples can have a negative effect as well as a positive one. A positive effect is possible for frequency-sensitive metrics, where more examples can raise the score of frequently used matches. In contrast, other systems may simply be presented with more options to “choose” from, which increases the chance of ambiguity. For these systems it might be better to have a human-picked set of relevant, high-quality examples.

Besides possibly harming translation quality, enlarging the database can also increase the time needed to compute a translation [13, 15]. There are cases where translation speed is important, such as when the system is part of a live speech translator. In this case, translating a sentence should on average take no longer than it takes to speak that sentence. Even if a system with millions of database entries produces near-perfect translations, it cannot be used for live translation if it cannot keep up with the speaker.

Translation Rules

All MT systems use rules to a certain degree. Knowledge-based MT is by definition based on rules, and even though the names “Example-based MT” and “Statistical MT” imply that rules are not used, most of these systems are still somewhat bound to rules. For example, Koehn et al. [10] designed a statistical phrase-based translation system, whereas Chiang [4] designed a hierarchical phrase-based system. Neither system is rule-based in essence, but both designs involve rules that tell the system how to handle data.

Chiang [4] states that a non-hierarchical phrase-based system might work adequately for language pairs with similar word order, such as English and French, but might not work as well for language pairs with different word order, such as English and Mandarin. Where “one of X” is the usual word order in English, in Mandarin the order would be “X one of”. Chiang explains that this poses a problem, as X could be longer than a simple phrase. In such a case, simple word reordering (e.g. flipping nouns and adjectives) would not work, as the sentences are hierarchical by nature.
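The “one of X” reordering can be illustrated with a small Python sketch. It is not Chiang’s model, which learns weighted rules from data; it only shows a single hand-written, hypothetical rule whose gap X may span any number of words, and it keeps the English words so that only the reordering is visible.

```python
import re

# Hypothetical rule with a gap X: source pattern -> target template.
RULES = [
    (re.compile(r"^one of (?P<X>.+)$"), "{X} one of"),
]


def reorder(phrase):
    """Apply the first matching rule; the gap X is reordered recursively,
    so X can itself be a phrase that matches a rule."""
    for pattern, template in RULES:
        match = pattern.match(phrase)
        if match:
            return template.format(X=reorder(match.group("X")))
    return phrase


print(reorder("one of the largest cities in the world"))
```

Because X is matched as a whole, the rule moves a phrase of any length, which a fixed reordering of adjacent words could not do.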

Because choices of design rules affect translation quality, and all MT systems are based on rules to some extent, some systems work better for certain language pairs than others. This is undesirable, as it leads to management of multiple translation systems, and the system is not flexible in that it cannot learn a new language by simply giving it appropriate text corpora.


If MT systems are to reach professional quality, aside from the system-specific problems described above, there is yet another problem the systems face: if the systems do not comprehend the texts they have to translate, part of the meanings can get lost in translation [3].

This is not necessarily a problem for legal texts, which are required to be unambiguous. In contrast, translation systems that translate websites, for example, have to handle ambiguous texts. A web page might be an article about microchips in which the word “chips” is used. A naive translation system designed to pick a word’s most common translation in the target language could mistakenly interpret the word as “potato chips” throughout the article.
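The difference between the naive strategy and a context-aware one can be sketched in Python. The sense inventory, the frequencies and the cue words below are all invented for illustration, and counting overlapping cue words is only one simple way to use context.

```python
# Hypothetical sense inventory for "chips" with made-up frequencies
# and context cue words.
SENSES = {
    "chips": [
        {"gloss": "potato chips", "frequency": 0.8,
         "cues": {"salt", "snack", "bag", "eat"}},
        {"gloss": "microchips", "frequency": 0.2,
         "cues": {"silicon", "processor", "circuit", "transistor"}},
    ],
}


def most_frequent_sense(word):
    """Naive strategy: always pick the most common sense."""
    return max(SENSES[word], key=lambda s: s["frequency"])["gloss"]


def sense_in_context(word, context):
    """Pick the sense whose cue words overlap most with the context,
    falling back on frequency when the overlap is tied."""
    tokens = set(context.lower().split())
    best = max(SENSES[word],
               key=lambda s: (len(s["cues"] & tokens), s["frequency"]))
    return best["gloss"]


article = "The processor contains silicon chips with billions of transistor gates"
print(most_frequent_sense("chips"))
print(sense_in_context("chips", article))
```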

Furthermore, a text can have meaning beyond its literal interpretation [9]. For example, a writer can phrase a sentence in a specific way to indicate sarcasm; in effect, the sentence then means the opposite of its literal interpretation. Human translators can pick up on these cues and preserve the original meaning, either by making sure their translation reflects the sarcasm or by making the writer’s intended meaning explicit. Current MT systems, however, would treat the sentence as literal and translate it as-is, which could leave the sarcasm unclear in the translation.

Especially if MT systems are going to be used to translate poetry and literature, it is important that the system captures as much of the original meaning as possible. A system for translating poetry has already been created; it translates text literally while attempting to maintain the original meter and rhyme scheme [7], so parts of the meaning of the original poem will still be lost. Little research has been done into the issue of preserving the actual meaning of texts with respect to context as described in this article.


We have seen a range of problems related to Machine Translation; some related to certain systems, and some, especially text comprehension, related to every MT system currently in development. These problems hinder the progress of Machine Translation quality.

If there is no “well-behaved” bilingual corpus available for a certain pair of languages, then the translation quality of an EBMT or SMT system between those languages will be poor [18]. Humans will surely be able to notice the difference between the machine’s translation and that of a professional translator in those languages.

Furthermore, we have seen that even if an EBMT or SMT system performs well on a certain language pair, one cannot assume it will perform as well on a different language pair, even if “well-behaved” corpora are available. This is because heuristics for corpus alignment are not applicable to every corpus: differences between languages make it hard, maybe even impossible, to find general rules for comparing sentences in different languages.

Another problem is the suitability of entries for the corpus. As discussed by [18], having a larger corpus does not always yield a better result, as it increases ambiguity and thus allows the machine to “choose” between possible translations, even if such a choice should not be available.

We have seen that rule-based systems have the problem that certain rules are not applicable to every language pair. If these rules are at the heart of the system (e.g. the way it parses corpora, or the way it looks at sentences), then such a system cannot be applied to many other language pairs [4]. Furthermore, it might simply be undesirable to have to change anything at all when applying a translation system to a different language.

Perhaps the most prominent problem for translation quality is the inability of current systems to comprehend the texts they are translating. This results in ambiguous parts in the source text being translated wrongly, where a human could find the correct translation from context. Moreover, if the writer of a text introduced ambiguity purposefully, such as sarcasm, current systems are likely to translate that part without taking the actual meaning into account.

As each approach has multiple problems to deal with, and no current approach has a solution to text comprehension, it is highly unlikely that systems developed with current approaches will pass the Turing test in multiple applications (e.g. translating to and from multiple languages, or translating different text types). There is a need to develop context-aware systems that have an understanding of the text they are translating, as purely statistical or rule-based translation is unable to cope with ambiguity and the deeper meaning of phrases and sentences in the text as a whole.


  • [1] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, “A statistical approach to machine translation,” Computational Linguistics, vol. 16, iss. 2, pp. 79-85, 1990.
  • [2] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer, “The mathematics of statistical machine translation: Parameter estimation,” Computational Linguistics, vol. 19, iss. 2, pp. 263-311, 1993.
  • [3] J. Brunning, A. De Gispert, and W. Byrne, “Context-dependent alignment models for statistical machine translation,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 110-118.
  • [4] D. Chiang, “A hierarchical phrase-based model for statistical machine translation,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005, pp. 263-270.
  • [5] D. A. Gachot, “The SYSTRAN renaissance,” in MT Summit II, 1989, pp. 66-71.
  • [6] W. A. Gale and K. W. Church, “A program for aligning sentences in bilingual corpora,” Computational Linguistics, vol. 19, iss. 1, pp. 75-102, 1993.
  • [7] D. Genzel, J. Uszkoreit, and F. Och, “Poetic statistical machine translation: rhyme and meter,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010, pp. 158-166.
  • [8] D. Isenberg, Translating For Dollars, 2010. [Online; accessed 2-October-2013]
  • [9] K. Jaszczolt and K. Turner, Meaning Through Language Contrast, vol. 2. John Benjamins Publishing, 2003, pp. 141-142.
  • [10] P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-based translation,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 2003, pp. 48-54.
  • [11] P. Marrafa and A. Ribeiro, “Quantitative evaluation of machine translation systems: sentence level,” in MT Summit VIII, 2001, pp. 39-43.
  • [12] A. K. Melby, Some difficulties in translation, 1999. [Online; accessed 2-October-2013]
  • [13] H. Mima, H. Iida, and O. Furuse, “Simultaneous interpretation utilizing example-based incremental transfer,” in Proceedings of the 17th International Conference on Computational Linguistics-Volume 2, 1998, pp. 855-861.
  • [14] S. Nirenburg, “Knowledge-based machine translation,” Machine Translation, vol. 4, iss. 1, pp. 5-24, 1989.
  • [15] F. J. Och and H. Ney, “Discriminative training and maximum entropy models for statistical machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 295-302.
  • [16] G. Oppy and D. Dowe, “The Turing Test,” in The Stanford Encyclopedia of Philosophy, Spring 2011 ed., E. N. Zalta, Ed., 2011. [Online; accessed 4-December-2013]
  • [17] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 311-318.
  • [18] H. Somers, “Review article: Example-based machine translation,” Machine Translation, vol. 14, iss. 2, pp. 113-157, 1999.
  • [19] H. Somers, “Further experiments in bilingual text alignment,” International Journal of Corpus Linguistics, vol. 3, iss. 1, pp. 115-150, 1998.
