The U.S. government spent 4.5 billion USD from 1990 through 2009 on outsourcing translations and interpretations [bibcite key=USSpend]. Had these translations been automated, much of this money could have been spent elsewhere. The research field of Machine Translation (MT) tries to develop systems capable of translating verbal language (i.e. speech and writing) from a given source language to a target language.
Because verbal language is broad, allowing people to express a great number of things, one must take into account many factors when translating text from a source language to a target language. Three main difficulties when translating are proposed in [bibcite key=TTT1999]: the translator must distinguish between general vocabulary and specialized terms, as well as various possible meanings of a word or phrase, and must take into account the context of the source text.
Machine Translation systems must overcome the same obstacles as professional human translators in order to translate text accurately. To try to achieve this, researchers have taken a variety of approaches over the past decades, such as [bibcite key=Gachot1989,Brown1990,Koehn2003]. At first, the knowledge-based paradigm was dominant. After promising results with a statistics-based system ([bibcite key=Brown1990,Brown1993]), the focus shifted towards this new paradigm.
In this post, we will explore an inductive data structure as a type for lists of elements. We define two recursive functions that can be applied to lists of this type; namely, an operation to append two lists and an operation to reverse a list. With these definitions, we set out to mathematically prove a number of properties that should hold for these functions by means of induction.
A list is a data structure of a collection of elements in a certain order. A list can be defined in multiple ways. One way is to define a list as being an element (the “head” of the list) followed by another list (the “tail” of the list). Inductively, this could be formalized as follows:
Inductive list a := cons a (list a) | nil.
This means that a list l with elements of type a is either cons x tail, with x of type a and tail of type list a, or the empty list nil. Let’s look at some example lists whose elements are whole numbers.
// Empty list; []
l = nil
// List with one element; [1]
l = cons 1 nil
// Different list with one element; [2]
l = cons 2 nil
// List with two elements; [1,2]
l = cons 1 (cons 2 nil)
// List with three elements; [1,2,3]
l = cons 1 (cons 2 (cons 3 nil))
Note that, because these are lists of integers, each list l above is of type list integer under our definition.
Multiple list operations can be defined, such as append and reverse. We defined our list inductively, and so it would make sense to define these operations inductively (also known as recursively) as well. Because of our neat data structure and operations, we should then be able to prove that certain properties of the operations hold.
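To make the idea concrete, here is a minimal Python sketch of such a list together with recursive append and reverse operations. The names `Cons`, `append` and `reverse`, and the use of `None` to stand in for nil, are choices made for this illustration only; the definitions are one common way to write these operations and are not necessarily the exact ones used in the proofs.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

A = TypeVar("A")


@dataclass(frozen=True)
class Cons(Generic[A]):
    """A non-empty list: a head element followed by a tail list."""
    head: A
    tail: "Optional[Cons[A]]"  # None plays the role of nil


def append(xs: Optional[Cons[A]], ys: Optional[Cons[A]]) -> Optional[Cons[A]]:
    """Return the list containing the elements of xs followed by those of ys."""
    if xs is None:  # base case: append nil ys = ys
        return ys
    # inductive case: keep the head, append the rest
    return Cons(xs.head, append(xs.tail, ys))


def reverse(xs: Optional[Cons[A]]) -> Optional[Cons[A]]:
    """Return the elements of xs in the opposite order."""
    if xs is None:  # base case: reverse nil = nil
        return None
    # inductive case: reverse the tail, then put the head at the end
    return append(reverse(xs.tail), Cons(xs.head, None))


one_two = Cons(1, Cons(2, None))  # [1,2]
three = Cons(3, None)             # [3]
print(append(one_two, three))
print(reverse(append(one_two, three)))
```

Both functions have one case per constructor of the list type, which is what makes proofs by induction over the list structure line up so naturally with the definitions.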
Most likely you will have heard that you should never go data fishing, meaning that you should not repeatedly test data. In the case of statistical significance tests, perhaps you will have heard that because of the nature of these tests you will find an effect at the 5% significance level in 5% of cases when there actually is no effect, and an effect at the 2% significance level in 2% of cases when there actually is no effect, and so on. It is less likely you will have heard not to continue looking for an effect after your current test concluded there was none. Here is why.
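One way to get a feel for the problem is a small calculation. The sketch below assumes that each test is independent and that there truly is no effect, so every test has probability alpha of reporting one anyway. The function name `chance_of_false_positive` is made up for this illustration.

```python
def chance_of_false_positive(alpha: float, num_tests: int) -> float:
    """Probability that at least one of num_tests independent tests reports
    an effect at significance level alpha when no effect exists."""
    # Each test stays (correctly) silent with probability 1 - alpha,
    # so all of them stay silent with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 2, 5, 10, 20):
    print(k, round(chance_of_false_positive(0.05, k), 3))
```

Under these assumptions, the chance of finding at least one spurious effect grows with every additional test that is run.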
Finding models that predict or explain relationships in data is a big focus in information science. Such models often have many parameters that can be tuned, and in practice we only have limited data to tune them with. If we make measurements of a function f(x) at different values of x, we might find data like that in Figure (a) below. If we now fit a polynomial curve to all known data points, we might find the model depicted in Figure (b). This model appears to explain the data perfectly: it passes through every data point. However, such a model does not give any additional insight into the relationship between x and f(x). Indeed, if we make more measurements, we find the data in Figure (c), and the model we found in (b) no longer fits the data well at all. In fact, the function used to generate the data is f(x) = x + \epsilon, where \epsilon is Gaussian noise. The linear model f'(x) = x depicted in Figure (d) is the most suitable model to explain the found data and to make predictions of future data.
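The effect can be reproduced in a few lines of plain Python. This is only a sketch under stated assumptions: the six training points and their fixed "noise" values of plus or minus 0.5 are invented for the example (they are not the data behind the figures), the new measurements are taken noise-free for simplicity, and the helper names `lagrange_fit`, `linear_fit` and `mse` are hypothetical.

```python
def lagrange_fit(xs, ys):
    """Return the polynomial of degree len(xs) - 1 through every (x, y) point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def linear_fit(xs, ys):
    """Return the least-squares straight line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed toy data: f(x) = x plus a fixed, alternating disturbance.
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [x + e for x, e in zip(train_x, noise)]

# New measurements, noise-free here to keep the example small.
new_x = [0.5, 2.5, 4.5, 6.0]
new_y = list(new_x)

poly = lagrange_fit(train_x, train_y)
line = linear_fit(train_x, train_y)

print("known data:", mse(poly, train_x, train_y), mse(line, train_x, train_y))
print("new data:  ", mse(poly, new_x, new_y), mse(line, new_x, new_y))
```

The polynomial has zero error on the known data yet a far larger error than the straight line on the new points, mirroring the step from Figure (b) to Figure (c).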
The overly complex model found in (b) is said to have overfitted. A model that has been overfitted fits the known data extremely well, but it is not suited for generalization to unseen data. Because of this, it is important to have some estimate of a model’s ability to be generalized to unseen data. This is where training, testing, and development sets come in. The full set of collected data is split into these separate sets.
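As a rough sketch of such a split: the function below shuffles the data and cuts it into three disjoint parts. The 80/10/10 proportions, the fixed seed and the name `split_dataset` are assumptions made for this example; suitable proportions depend on how much data is available.

```python
import random


def split_dataset(data, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle data and split it into disjoint training, development and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_fraction)
    n_dev = int(len(items) * dev_fraction)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))
```

Shuffling before cutting matters when the collected data is ordered in some way, since otherwise the three sets could differ systematically.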
[bibshow file=nativismvsempiricism.bib sort=firstauthor order=asc]
Note: This is an essay I wrote for the subject Philosophy of Cognitive Science that was part of my bachelor’s course. I think it might be interesting to others, so I’ve decided to publish it here. The format is adapted slightly to be more suitable for this blog; the content is unchanged.
In the field of artificial intelligence, humans are often used as prime examples of adaptable agents with general intelligence. The goal of some artificial intelligence researchers is to arrive at an artificial general, or human-level, intelligence. These agents should be able to perform many of the same tasks with the same adaptability as humans are able to. One of the few empirical certainties in the endeavour of creating such intelligent agents is that the natural, human intelligence works. Thus, there is merit to artificial intelligence research that strives to mimic human intelligence by modelling human mechanisms.
An intriguing and far-from-settled debate concerns the origin of human knowledge, skills, abilities and thought in general. The major theories can be identified as lying somewhere between the two extremes of full-blown nativism and full-blown empiricism [bibcite key=gross2012innateness]. Nativistic theorists would argue for innate knowledge; at least some of our capabilities arise from hard-wired pathways in our nervous system that are available at birth. In contrast, empiricists would argue that these capabilities are learned from experience utilizing the brain’s various capacities for learning. For example, a baby’s suckling after birth is likely innate, whereas the behavioural pattern of brushing your teeth is likely learned. It is still unknown which combination of these extremes in this seemingly easy distinction is correct.
When striving to model human capacities in an artificial intelligence, knowing which parts of human intelligence and other capabilities are hard-wired and which parts arise from experiences should be of particular interest to artificial intelligence researchers. In the following, we will look at the innateness (or lack thereof) of language parsing and acquisition. From this, recommendations will be made regarding the high-level design of an artificial natural language processor.
An artificial neural network (ANN) is a type of machine learning model. It is made up of a number of simple parts called units, or neurons. By combining a large number of these simple units, ANNs can solve real-world problems. For example, the main network that was used in my bachelor’s thesis research consisted of over 12,000 units. The name artificial neural network is slightly misleading: they’re mostly related to biological neural networks through the fact that both artificial and natural neural networks are made up of simple parts. Other than that they’re quite unrelated.
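To give an idea of how simple a single unit can be, here is a minimal sketch. It assumes one common design, in which a unit computes a weighted sum of its inputs plus a bias and passes the result through a sigmoid function; other activation functions are equally possible, and the names `unit` and `layer` are made up for this example.

```python
import math


def unit(inputs, weights, bias):
    """One unit: a weighted sum of the inputs, squashed into the range (0, 1)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation (an assumption)


def layer(inputs, weight_rows, biases):
    """A layer is just several units that all look at the same inputs."""
    return [unit(inputs, weights, bias)
            for weights, bias in zip(weight_rows, biases)]


print(layer([1.0, 2.0], [[0.5, -0.25], [0.1, 0.3]], [0.0, -1.0]))
```

A network is then built by feeding the outputs of one layer into the next as inputs.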
During the training of several neural networks for my bachelor’s thesis (more on that later, maybe!) I noticed something fun. The networks’ weights (in this case classification function parameters) are initialized with numbers drawn from the standard normal distribution, meaning the initial network state is random. Such randomness by its very nature has no actual structure, and thus has high entropy. This means that compressing the information to save it to disk is less effective than it is for other, more structured, information. Initially, saving one such network’s weights required approximately 19.5MB of disk space.
As the networks’ training progressed, the file sizes shrank! After a day of training, the space required for saving this network’s weights had decreased to 18.0MB, a 1.5MB decrease from the original value. I hadn’t thought about it before, but once I noticed it I soon realized what was happening. The whole point of training a network is to find structure in data. A neural network does this by learning some sort of representation of the data through continuously updating its weights while training. In other words, the weights are getting more structured as the network is getting smarter! When the weights’ structure increases the entropy decreases, making compression more effective and our disks happier. Or unhappier, perhaps.
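The link between structure and compressed size is easy to try out. The sketch below is not what happened to my networks: it uses `zlib` as an assumed compressor, and it fakes "structure" by rounding random weights to one decimal, which is only a crude stand-in for what training does. The name `compressed_size` is made up for this example.

```python
import random
import struct
import zlib


def compressed_size(weights):
    """Number of bytes needed to store the weights as compressed 32-bit floats."""
    raw = struct.pack(f"{len(weights)}f", *weights)
    return len(zlib.compress(raw))


rng = random.Random(0)
# Weights drawn from the standard normal distribution, as at initialization.
random_weights = [rng.gauss(0.0, 1.0) for _ in range(10000)]
# Crude stand-in for structure: only a small set of distinct values remains.
structured_weights = [round(w, 1) for w in random_weights]

print("random:    ", compressed_size(random_weights))
print("structured:", compressed_size(structured_weights))
```

Both lists take up the same space before compression; only the more repetitive one shrinks substantially.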