The U.S. government spent 4.5 billion USD from 1990 through 2009 on outsourcing translations and interpretations [bibcite key=USSpend]. If these translations were automated, much of this money could have been spent elsewhere. The research field of Machine Translation (MT) tries to develop systems capable of translating verbal language (i.e. speech and writing) from a certain source language to a target language.
Because verbal language is broad, allowing people to express a great number of things, one must take into account many factors when translating text from a source language to a target language. Three main difficulties when translating are proposed in [bibcite key=TTT1999]: the translator must distinguish between general vocabulary and specialized terms, as well as various possible meanings of a word or phrase, and must take into account the context of the source text.
Machine Translation systems must overcome the same obstacles as professional human translators in order to accurately translate text. To try to achieve this, researchers have had a variety of approaches over the past decades, such as [bibcite key=Gachot1989,Brown1990,Koehn2003]. At first, the knowledge-based paradigm was dominant. After promising results on a statistical-based system ([bibcite key=Brown1990,Brown1993]), the focus shifted towards this new paradigm.
Finding models that predict or explain relationships in data is a big focus in information science. Such models often have many parameters that can be tuned, and in practice we only have limited data to tune the parameters with. If we make measurements of a function f(x) at different values of x, we might find data like in Figure (a) below. If we now fit a polynomial curve to all known data points, we might find the model that is depicted in Figure (b). This model appears to explain the data perfectly: all data points are covered. However, such a model does not give any additional insight into the relationship between x and f(x). Indeed; if we make more measurements, we find the data in Figure (c). Now the model we found in (b) appears to not fit the data well at all. In fact, the function used to generate the data is f(x) = x + \epsilon with \epsilon Gaussian noise. The linear model f'(x) = x depicted in Figure (d) is the most suitable model to explain the found data and to make predictions of future data.
The overly complex model found in (b) is said to have overfitted. A model that has been overfitted fits the known data extremely well, but it is not suited for generalization to unseen data. Because of this, it is important to have some estimate of a model’s ability to be generalized to unseen data. This is where training, testing, and development sets come in. The full set of collected data is split into these separate sets.