Finding models that predict or explain relationships in data is a big focus in information science. Such models often have many parameters that can be tuned, and in practice we only have limited data to tune the parameters with. If we make measurements of a function f(x) at different values of x, we might find data like in Figure (a) below. If we now fit a polynomial curve to all known data points, we might find the model that is depicted in Figure (b). This model appears to explain the data perfectly: all data points are covered. However, such a model does not give any additional insight into the relationship between x and f(x). Indeed; if we make more measurements, we find the data in Figure (c). Now the model we found in (b) appears to not fit the data well at all. In fact, the function used to generate the data is f(x) = x + \epsilon with \epsilon Gaussian noise. The linear model f'(x) = x depicted in Figure (d) is the most suitable model to explain the found data and to make predictions of future data.

The overly complex model found in (b) is said to have *overfitted*. A model that has been overfitted fits the known data extremely well, but it is not suited for *generalization* to unseen data. Because of this, it is important to have some estimate of a model’s ability to be generalized to unseen data. This is where *training*, *testing*, and *development* sets come in. The full set of collected data is split into these separate sets.

Continue reading “Training, Testing and Development / Validation Sets”