Machine Translation Turing Test

Machine Translation
Will computers ever reach the quality of professional translators?

[bibshow file=machinetranslationturingtest.bib sort=firstauthor order=asc]

The U.S. government spent 4.5 billion USD from 1990 through 2009 on outsourcing translations and interpretations [bibcite key=USSpend]. If these translations were automated, much of this money could have been spent elsewhere. The research field of Machine Translation (MT) tries to develop systems capable of translating verbal language (i.e. speech and writing) from a certain source language to a target language.

Because verbal language is broad, allowing people to express a great number of things, one must take into account many factors when translating text from a source language to a target language. Three main difficulties when translating are proposed in [bibcite key=TTT1999]: the translator must distinguish between general vocabulary and specialized terms, as well as various possible meanings of a word or phrase, and must take into account the context of the source text.

Machine Translation systems must overcome the same obstacles as professional human translators in order to translate text accurately. To achieve this, researchers have taken a variety of approaches over the past decades, such as [bibcite key=Gachot1989,Brown1990,Koehn2003]. At first, the knowledge-based paradigm was dominant. After promising results with a statistics-based system ([bibcite key=Brown1990,Brown1993]), the focus shifted towards this new paradigm.

Knowledge-based MT

The Knowledge-based Machine Translation paradigm (also called Rule-based MT) typically focuses on sets of rules to translate texts [bibcite key=Nirenburg1989]. It is based on linguistic information from languages, such as dictionaries and grammar. Nirenburg writes about three major types: direct, interlingua and transfer.

Direct translation systems rely on a set of rules that depend on the specific pair of source and target languages. In general, these rules describe the grammatical and lexical phenomena of the source language and also state how to translate them into the target language. An example of a system implementing a direct approach is SYSTRAN [bibcite key=Gachot1989].

In an interlingua system, the source and target language are not directly connected through rules [bibcite key=Nirenburg1989]. Nirenburg writes that the system first expresses the source language text in a formal language to represent the meaning. Then, the system uses rules of the target language to express this meaning. Thus in this system the languages never come into contact with each other.

Transfer systems are similar to interlingua systems. However, where interlingua is fully language-pair independent, in transfer systems there is some dependence. The actual implementation of this is varied. In general, the system creates a formal representation of the meaning of the source text, and then expresses this in the target language. However, during the translation to the target, the system can make use of, for example, dictionaries between the two languages and grammatical rules to use in correspondence with the formal representation [bibcite key=Nirenburg1989].

Statistical MT and Example-based MT

In Statistical Machine Translation (SMT), translations are made based on statistical models. The idea behind SMT comes from information theory. The models are trained on a large bilingual text corpus, which is a large and structured set of texts used for statistical analysis. These models are based on probability distributions p(e|f), the probability that a target language string e is the translation of the source language string f [bibcite key=Och2002]. To translate f, the system computes \hat{e}, the target string that maximizes this probability:

\hat{e} = \underset{e}{\text{argmax}} \left( p(e|f) \right)
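As a minimal illustration of this decision rule, the Python sketch below picks the candidate with the highest probability from a toy table. The candidate strings and the probability values are invented for illustration; a real SMT system estimates p(e|f) from a bilingual corpus and searches a far larger space of candidates.

```python
# Toy conditional distribution p(e|f) for a single Dutch source sentence.
# All candidates and probabilities are hypothetical, not corpus estimates.
P_E_GIVEN_F = {
    "Die auto is van mij.": {
        "That is my car.": 0.60,
        "That car is mine.": 0.35,
        "That car is of me.": 0.05,
    },
}


def translate(f: str) -> str:
    """Return e_hat = argmax_e p(e|f) over the listed candidates."""
    candidates = P_E_GIVEN_F[f]
    return max(candidates, key=candidates.get)


print(translate("Die auto is van mij."))
```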

Example-based Machine Translation (EBMT) is similar to SMT in that it uses a bilingual text corpus to generate translations; examples of such systems are described by [bibcite key=Chiang2005,Koehn2003]. The difference between EBMT and SMT is that instead of using probability distributions, EBMT attempts to match input against the corpus to extract examples by analogy, which are then merged to find the correct translation [bibcite key=Somers1999]. Example corpus:

English | Dutch
That is my pencil. | Die potlood is van mij.
That is my car. | Die auto is van mij.

In this case, the machine would learn

  1. That is my X. ↔ Die X is van mij.
  2. pencil ↔ potlood
  3. car ↔ auto
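The generalisation step above can be sketched in Python. This is a minimal sketch, assuming the two example pairs differ in exactly one word on each side; the function names `learn` and `translate` are hypothetical, and real EBMT systems use far more robust matching.

```python
def _generalise(sent_a, sent_b):
    """Replace the single word in which two sentences differ by the slot X."""
    a = sent_a.rstrip(".").split()
    b = sent_b.rstrip(".").split()
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(a) != len(b) or len(diff) != 1:
        raise ValueError("sentences must differ in exactly one word")
    i = diff[0]
    return a[:i] + ["X"] + a[i + 1:], a[i], b[i]


def learn(pair_a, pair_b):
    """Learn a source template, a target template and a small lexicon."""
    src_tpl, src_a, src_b = _generalise(pair_a[0], pair_b[0])
    tgt_tpl, tgt_a, tgt_b = _generalise(pair_a[1], pair_b[1])
    return src_tpl, tgt_tpl, {src_a: tgt_a, src_b: tgt_b}


def translate(sentence, src_tpl, tgt_tpl, lexicon):
    """Translate a sentence that matches the learned source template."""
    words = sentence.rstrip(".").split()
    slot = src_tpl.index("X")
    matches = (
        len(words) == len(src_tpl)
        and words[:slot] == src_tpl[:slot]
        and words[slot + 1:] == src_tpl[slot + 1:]
    )
    if not matches:
        raise ValueError("sentence does not match the template")
    filler = lexicon[words[slot]]  # raises KeyError for unknown words
    return " ".join(filler if w == "X" else w for w in tgt_tpl) + "."


corpus = [
    ("That is my pencil.", "Die potlood is van mij."),
    ("That is my car.", "Die auto is van mij."),
]
src_tpl, tgt_tpl, lexicon = learn(*corpus)
print(" ".join(src_tpl), "<->", " ".join(tgt_tpl))
print(lexicon)
print(translate("That is my car.", src_tpl, tgt_tpl, lexicon))
```

Once the template is known, a new noun only requires a new lexicon entry, which is what makes the approach attractive for closely related sentence patterns.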

For some other language pairs, this would be slightly more difficult. In Russian, for example, demonstrative and possessive pronouns change with the grammatical gender of the noun, and the word “is” has no counterpart in the Russian translations:

English | Russian | Transliteration
That is my pencil. | Этот карандаш мой. | Etot karandash moy.
That is my car. | Эта машина моя. | Eta mashina moya.
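A single template with one slot is no longer enough here, because the words around the slot depend on the noun. A minimal sketch, assuming each lexicon entry also stores the grammatical gender of the Russian noun (карандаш, "pencil", is masculine; машина, "car", is feminine) and covering only the two genders that occur in the examples:

```python
# Target templates per grammatical gender, taken from the two example rows.
TEMPLATES = {
    "masculine": "Этот {noun} мой.",
    "feminine": "Эта {noun} моя.",
}

# Each entry stores the Russian noun and the gender the template depends on.
LEXICON = {
    "pencil": ("карандаш", "masculine"),
    "car": ("машина", "feminine"),
}

PREFIX = "That is my "


def translate(sentence: str) -> str:
    """Translate 'That is my <noun>.' by picking the gender-specific template."""
    if not (sentence.startswith(PREFIX) and sentence.endswith(".")):
        raise ValueError("sentence does not match the template")
    noun, gender = LEXICON[sentence[len(PREFIX):-1]]
    return TEMPLATES[gender].format(noun=noun)


print(translate("That is my pencil."))
print(translate("That is my car."))
```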

Translation Memory

The MT methods described above are all autonomous systems, such that when their rules have been implemented or they have been trained, they translate texts without human interaction. A different type of MT, Translation Memory (TM), is described in [bibcite key=Somers1999]. TM is similar to EBMT in that it has a database of bilingual examples that have been translated previously. As opposed to EBMT, Translation Memory is not used to translate texts autonomously. Instead, it is used as a tool by people to help them translate texts; the TM system tries to find translations for individual sentences and shows them to the person translating, who can then choose a translated sentence and modify it where necessary.
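The lookup step of a TM tool can be sketched with fuzzy string matching. The example below is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library; the memory contents and the 0.7 threshold are assumptions made for illustration, and commercial TM tools use their own matching schemes.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: previously translated sentence pairs.
MEMORY = {
    "That is my pencil.": "Die potlood is van mij.",
    "That is my car.": "Die auto is van mij.",
}


def suggest(sentence, memory, threshold=0.7):
    """Return (score, stored source, stored translation) tuples, best first.

    The human translator picks one suggestion and edits it where necessary.
    """
    scored = []
    for source, target in memory.items():
        score = SequenceMatcher(None, sentence, source).ratio()
        if score >= threshold:
            scored.append((score, source, target))
    return sorted(scored, reverse=True)


for score, source, target in suggest("This is my car.", MEMORY):
    print(f"{score:.2f}  {source}  ->  {target}")
```

The function only proposes candidates; deciding which one to use, and how to adapt it, stays with the person translating.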

Quality of Translation

An important part of Machine Translation research is evaluation of generated translations. A human could evaluate the translations, but this is expensive and subjective. As such, automated, quantitative translation evaluations are desired [bibcite key=Marrafa2001a]. Such a system has been developed, called BLEU (bilingual evaluation understudy) [bibcite key=Papineni2002].

BLEU works by comparing a machine translation of a text against a corpus of good-quality human translations of that text, and calculating a numeric “translation closeness” metric. This closeness metric is modeled on the word error rate metric, which counts the number of substitutions, deletions and insertions of words needed to transform some text A into some text B; BLEU itself scores how many word n-grams of the machine translation also occur in the reference translations.
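The word error rate mentioned above can be computed with a word-level edit distance. The following is a minimal sketch of that metric only, not of BLEU; the example sentences are invented.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance from hypothesis to reference, divided by
    the number of reference words."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: edits needed to turn the first i hypothesis words
    # into the first j reference words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i  # delete all i hypothesis words
    for j in range(len(ref) + 1):
        dist[0][j] = j  # insert all j reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            substitution = dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)


# One substitution (of -> mine) and one deletion (me): 2 edits / 4 words.
print(word_error_rate("That car is of me", "That car is mine"))  # 0.5
```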

The system is based on the assumption that a machine translation is better the closer it is to a professional human translation. As such, the focus of the remainder of this article is to investigate whether the current approaches to Machine Translation (SMT and EBMT) will ever enable machines to translate texts so well that a human cannot distinguish the result from a professional human translation.

The Turing test tests whether a machine has the ability to show intelligent behaviour, irrespective of its actual intelligence [bibcite key=TuringTest]. In the test, humans interrogate the system and other humans. The interrogators do not know whether the entity they are interrogating is a human or a machine. If the interrogators cannot reliably distinguish the machine from the humans, the machine is said to have passed the test. In the case of translations, it could be said that an MT system passes the test if humans cannot reliably distinguish between the system’s translations and human translations of a text.
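One way to make “reliably distinguish” concrete is a significance test on the judges’ guesses. The sketch below is an assumption about how such a test could be scored, not a procedure taken from the cited sources: each judge sees one translation and guesses whether it was made by a human or a machine, and the system passes if the number of correct guesses is not significantly above chance. The 5% significance level and the function names are illustrative choices.

```python
from math import comb


def p_value_above_chance(correct: int, total: int) -> float:
    """One-sided binomial p-value: P(X >= correct) if every guess is a coin flip."""
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    favourable = sum(comb(total, k) for k in range(correct, total + 1))
    return favourable / 2 ** total


def passes_test(correct: int, total: int, alpha: float = 0.05) -> bool:
    """The system passes if the judges are not significantly better than chance."""
    return p_value_above_chance(correct, total) >= alpha


print(passes_test(52, 100))  # judges barely above chance
print(passes_test(65, 100))  # judges clearly above chance
```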


In this section we will review the problems of the several current approaches to Machine Translation.


Both Statistical Machine Translation and Example-based Machine Translation require a database to learn how to translate texts from a source to a target language [bibcite key=Och2002,Koehn2003]. There are a few issues regarding obtaining data and constructing these databases, however.

Parallel corpora are required for Example-based MT. These corpora have to be aligned, usually on a sentence-to-sentence or paragraph-to-paragraph basis [bibcite key=Somers1999]. Aligning manually is a slow process, and impractical for large databases. Preferably the alignment should be fully automated [bibcite key=Gale1993].

There are multiple methods of automatic alignment, such as the ones described by [bibcite key=Brown1990] and [bibcite key=Gale1993]. The latter uses sentence length as a metric, on the assumption that the lengths of two sentences that are translations of each other are strongly correlated. However, it has been stated that early success with automatic alignment was due to the fact that most researchers used the “well-behaved” English-French corpus of the Canadian parliament’s transcripts [bibcite key=Somers1998]. Here “well-behaved” means that most sentences and paragraphs can be lined up (e.g. any one sentence is translated to exactly one sentence) and that there is little noise (e.g. footnotes are at the same relative position when the transcripts are digitized).
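The idea of aligning on sentence length can be sketched with dynamic programming. This is a simplified illustration, not the method of the cited work: the cost is simply the difference in character counts plus a fixed penalty for anything other than a one-to-one match, and the penalty values are arbitrary assumptions.

```python
# Allowed alignment units (source sentences, target sentences) and the
# hypothetical penalty for choosing each of them.
BEADS = {(1, 1): 0, (1, 0): 10, (0, 1): 10, (2, 1): 5, (1, 2): 5}


def align(src, tgt):
    """Align two lists of sentences by minimising length mismatch."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), penalty in BEADS.items():
                if i + di > n or j + dj > m:
                    continue
                src_len = sum(len(s) for s in src[i:i + di])
                tgt_len = sum(len(t) for t in tgt[j:j + dj])
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = candidate
                    back[i + di][j + dj] = (di, dj)
    # Walk back from the end to recover the chosen alignment units.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return beads[::-1]


english = ["That is my pencil.", "That is my car."]
dutch = ["Die potlood is van mij.", "Die auto is van mij."]
for src_part, tgt_part in align(english, dutch):
    print(src_part, "<->", tgt_part)
```

On a “well-behaved” corpus nearly every unit is one-to-one, which is exactly the situation in which such a simple cost works best.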

Further, some metrics pose problems when aligning text pairs from other languages, such as English and Chinese [bibcite key=Somers1998]. Specifically, it is not clear whether sentence lengths are correlated as strongly in that case. Because the two languages use entirely different writing systems, counting characters, which works well for aligning English and French sentences, is likely to yield poor results for English and Chinese.

Analogous to counting characters, it is conceivable that counting words as a metric for sentence length will also not yield satisfactory results for every language pair. For example, when aligning English and Dutch texts, the sentence length correlation could be weakened by the fact that Dutch has compound words. As can be seen in the table below, two English sentences of eight words each can have significantly different lengths when translated, depending on whether words are compounded in Dutch.

English | Dutch
She is a writer of books for children. | Zij is een kinderboekenschrijfster.
She is an avid reader of science books. | Zij is een begerig lezer van wetenschapsboeken.
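The word counts for the two sentence pairs in the table can be checked with a few lines of Python. Splitting on whitespace is a simplifying assumption about what counts as a word.

```python
pairs = [
    ("She is a writer of books for children.",
     "Zij is een kinderboekenschrijfster."),
    ("She is an avid reader of science books.",
     "Zij is een begerig lezer van wetenschapsboeken."),
]


def word_count(sentence: str) -> int:
    """Count whitespace-separated words."""
    return len(sentence.split())


for english, dutch in pairs:
    # Both English sentences have 8 words; the Dutch ones have 4 and 7.
    print(word_count(english), word_count(dutch))
```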

Researchers designing a statistical phrase-based translation system noticed that their performance varied significantly depending on the methods used to create the phrase translation table [bibcite key=Koehn2003]. They found that aligning texts on words provided better results than aligning on phrases. Furthermore, they observed that different heuristics provided the best result under different circumstances: not only the specific language pair but also the size of the training database influenced how well a heuristic performed.

Neither the suitability nor the required size of a corpus is clear. In one experiment, an algorithm’s translation accuracy rose approximately linearly, from roughly 30% to 75%, as more examples were added to the corpus [bibcite key=Mima1998]. The researchers started at 100 examples and steadily added examples up to the full set of 774. It is, however, assumed that accuracy cannot be increased up to 100% simply by adding more suitable examples [bibcite key=Somers1999].

Furthermore, adding more examples can have a negative effect as well as a positive one. A positive effect is possible for metrics sensitive to frequency, where more examples can increase the score of frequently used matches. In contrast, some other systems may just be presented with more options to “choose” from, thus increasing the chance of ambiguity. For these systems it might be better to have a human-picked set of relevant and high quality examples.

Besides having a possible negative effect on translation quality, increasing the database can also increase the time needed to compute a translation [bibcite key=Mima1998,Och2002]. There are cases where translation speed is important, such as when the system is part of a live speech translator. In this case the translation of a sentence should on average require no more time than it takes to speak that sentence. Even if the system has millions of entries in the database and produces near-perfect translations, it cannot be used for live translation if it cannot keep up with the speaker.
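The real-time requirement described above amounts to a simple comparison of averages. A minimal sketch, with invented timings in seconds and a hypothetical function name:

```python
def keeps_up(translation_seconds, speech_seconds):
    """True if the average translation time per sentence does not exceed
    the average time it takes to speak a sentence."""
    if len(translation_seconds) != len(speech_seconds) or not speech_seconds:
        raise ValueError("need one timing pair per sentence")
    n = len(speech_seconds)
    return sum(translation_seconds) / n <= sum(speech_seconds) / n


# Hypothetical measurements for three sentences.
print(keeps_up([1.0, 2.5, 1.5], [2.0, 2.0, 2.0]))
```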

Translation Rules

All MT systems use rules to a certain degree for translation. Knowledge-based MT is by definition based on rules, and even though the names “Example-based MT” and “Statistical MT” imply that rules are not used, most of these systems are still somewhat bound to rules. For example, [bibcite key=Koehn2003] designed a statistical phrase-based translation system, whereas [bibcite key=Chiang2005] designed a hierarchical phrase-based system. Neither system is rule-based in essence, but both designs involve rules that tell the system how to handle data.

Chiang states that a non-hierarchical phrase-based system might work adequately for language pairs that have similar word order, such as English and French, but might not work as well when used for different language pairs that have different word order, such as English and Mandarin. Where “One of X” is usual word order in English, in Mandarin the order would be “X one of”. Chiang explains that this poses a problem, as X could be longer than a simple phrase. In such a case, simple word reordering would not work (e.g. flipping nouns and adjectives), as the sentences are hierarchical by nature.
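The reordering problem can be illustrated with a toy rule that contains a gap. The sketch below works on English words only and rewrites “one of X” into the order “X one of”, treating everything after “one of” as X. That treatment of X is a simplifying assumption, and the code is not the cited system, which learns its rules from data.

```python
def reorder(words):
    """Rewrite 'one of X' as 'X one of', applying the rule recursively inside X."""
    for i in range(len(words) - 1):
        if words[i:i + 2] == ["one", "of"]:
            # X is everything after 'one of' and may itself need reordering.
            return words[:i] + reorder(words[i + 2:]) + ["one", "of"]
    return words


print(" ".join(reorder("one of the few countries".split())))
# the few countries one of
```

Because X can be arbitrarily long and can contain further reorderings, a rule of this kind cannot be replaced by swapping a fixed number of neighbouring words.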

Because choices of design rules affect translation quality, and all MT systems are based on rules to some extent, some systems work better for certain language pairs than others. This is undesirable, as it leads to the management of multiple translation systems, and such a system is not flexible: it cannot learn a new language simply by being given appropriate text corpora.


If MT systems are to reach professional quality, aside from the system-specific problems described above, there is yet another problem the systems face: if the systems do not comprehend the texts they have to translate, part of the meanings can get lost in translation [bibcite key=Brunning2009].

This is not necessarily a problem for legal texts, which are required to be unambiguous. In contrast, translation systems translating, for example, websites are handling ambiguous texts. A web page might be an article about microchips, where the word “chips” is used in the article. A naive translation machine designed to take a word’s most common translation into the target language could make the mistake of interpreting the word as “potato chips” throughout the article.
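The difference between the naive strategy and one that looks at the surrounding text can be sketched as follows. The cue words, the sense labels and the choice of “potato chips” as the most common sense are all assumptions made for illustration; counting cue words is a crude heuristic and not a form of text comprehension.

```python
# Hypothetical cue words that hint at each sense of an ambiguous word.
SENSES = {
    "chips": {
        "potato chips": {"snack", "salt", "potato", "crispy"},
        "microchips": {"silicon", "processor", "circuit", "transistor"},
    },
}
MOST_COMMON = {"chips": "potato chips"}  # assumed default sense


def naive_sense(word: str) -> str:
    """Always pick the most common sense, ignoring the text."""
    return MOST_COMMON[word]


def contextual_sense(word: str, text: str) -> str:
    """Pick the sense whose cue words occur most often in the text."""
    context = set(text.lower().replace(".", " ").replace(",", " ").split())
    scores = {sense: len(cues & context) for sense, cues in SENSES[word].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else MOST_COMMON[word]


article = "The new processor packs more logic per circuit, so these chips run cooler."
print(naive_sense("chips"))
print(contextual_sense("chips", article))
```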

Furthermore, a text can have meaning beyond its literal interpretation [bibcite key=Jaszczolt2003]. For example, a writer can phrase a sentence in a specific way to indicate that it is sarcastic; in effect the sentence means the opposite of its literal interpretation. Human translators can pick up on these cues and make sure the translated text keeps the original meaning, either by reflecting the sarcasm or by making the writer’s intended meaning explicit. Current MT systems would, however, take the sentence literally and translate it as-is, which could lead to the sarcasm not being clear in the translation.

Especially if MT systems are going to be used to translate poetry and literature, it is important that the system is able to capture as much of the original meaning as possible. One system for translating poetry has already been created; it translates text literally whilst attempting to maintain the original meter and rhyme scheme [bibcite key=Genzel2010], so parts of the meaning of the original poem will be lost. Little research has been done into the issue of preserving actual meaning of texts with respect to context as described in this article.


We have seen a range of problems related to Machine Translation; some related to certain systems, and some, especially text comprehension, related to every MT system currently in development. These problems hinder the progress of Machine Translation quality.

If there is no “well-behaved” bilingual corpus available for a certain pair of languages, then the translation quality of an EBMT or SMT system between those languages will be poor [bibcite key=Somers1999]. Humans will surely be able to notice the difference between the machine’s translation and the translation of a professional translator in those languages.

Furthermore, we have seen that even if an EBMT or SMT system performs well on a certain language pair, you cannot assume it will perform as well on a different language pair, even if “well-behaved” corpora are available. This is due to heuristics for corpus alignment not being applicable to every corpus, as differences between languages make it hard, maybe even impossible, to find general rules for comparing sentences in different languages.

Another problem is the suitability of entries for the corpus. As discussed by [bibcite key=Somers1999], having a larger corpus does not always yield a better result, as it increases ambiguity and thus allows the machine to “choose” between possible translations, even if such a choice should not be available.

We have seen that rule-based systems have the problem that certain rules are not applicable to every language pair. If these rules are at the heart of the system (e.g. the way it parses corpora, or the way it looks at sentences), then such a system cannot be applied to many other language pairs [bibcite key=Chiang2005]. Furthermore, it might simply be undesirable to have to change anything at all when applying a translation system to a different language.

Perhaps the most prominent problem for translation quality is the inability of current systems to comprehend the texts they are translating. This results in ambiguous parts in the source text being translated wrongly, where a human could find the correct translation from context. Moreover, if the writer of a text introduced ambiguity purposefully, such as sarcasm, current systems are likely to translate that part without taking the actual meaning into account.

As each approach has multiple problems to deal with, and all current approaches have no solution to text comprehension, it is highly unlikely for systems developed with current approaches to pass the Turing test in multiple applications (e.g. translating to and from multiple languages, or translating different text types). There is need to develop forms of context-aware systems that have an understanding of the text they are translating, as pure statistical or rule-based translation is unable to cope with ambiguity and the deeper meaning of phrases and sentences in the text as a whole.
