Machine Learning: The Question of 'Why?'
Author: Anita Faul, Teaching Associate Fellow, University of Cambridge & The Alan Turing Institute
"I would like to take the opportunity to be a guest blogger to pay tribute to Professor Mike Powell, my PhD supervisor."
Often, when asked what I am doing, I answer: “Well, I started out in Numerical Analysis and then moved into what is now commonly referred to as Machine Learning or AI, but it is all algorithms. We just interpret the numbers a bit differently now. It all comes down to number crunching.”
Mike was there from the beginning of number crunching and had an impact on the field in many ways. He was a practitioner who also gave algorithms a theoretical underpinning.
He completed his BA in Mathematics followed by the Diploma in Computer Science at Cambridge in 1959. He did not pursue a PhD, but worked for the Atomic Energy Research Establishment at Harwell, where he was tasked with writing computer programs to assist other researchers and technical staff. These were calculations in atomic physics and chemistry. It was in optimization that his work had the greatest impact.
Many machine learning algorithms involve the minimization of a cost function. The choice of cost function is one aspect. The aim is to choose it, and possibly penalty terms, such that a solution with the desired properties is teased out. Regularization is often introduced to reduce complexity in the solution. The law of parsimony, also known as Occam’s razor, should be a guiding principle: keep models simple while still explaining the data. However, other attributes of the solution can also be emphasized. Smoothness is an example. In one application, a smooth, gradual transition between different regions is desirable, while in another sharp edges are needed. Either can be encouraged with a suitable penalty term.
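As a minimal sketch of this idea (my own illustration, not taken from Mike’s work), the cost function below combines a least-squares data term with one of two hypothetical penalty terms on the solution: a quadratic difference penalty that encourages smooth, gradual transitions between neighbouring components, and an absolute difference penalty that tolerates a few sharp edges instead. The function name, the parameter `lam` and the particular penalties are assumptions for illustration only.

```python
import numpy as np

def cost(w, X, y, lam=0.1, penalty="smooth"):
    """Least-squares data fit plus a penalty term on the solution w.

    The quadratic difference penalty favours smooth, gradual transitions
    between neighbouring components of w; the absolute difference
    (total variation) penalty tolerates a few sharp jumps instead.
    """
    data_term = 0.5 * np.sum((X @ w - y) ** 2)
    diffs = np.diff(w)
    if penalty == "smooth":
        reg_term = lam * np.sum(diffs ** 2)      # gradual transitions
    else:                                        # penalty == "edges"
        reg_term = lam * np.sum(np.abs(diffs))   # a few sharp edges
    return data_term + reg_term
```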
The other aspect is the fast optimization of the cost function. Mike had an exceptional ability to understand the inner workings of algorithms. This understanding of why a technique was successful or not led to the development of many new and effective numerical methods. It also enabled him to come up with intriguing counter-examples to proposed methods, sometimes to the disappointment of others. Investigating the circumstances under which an algorithm would fail to converge, or converge very slowly, was the first task he set me as a PhD student. The next was to understand why, and to use the insight to prove the convergence of the algorithm.
This understanding of techniques did not come out of nowhere. Despite Mike’s vast experience, I would still encounter him playing through an algorithm with pencil and paper on a minimal example. It is something I still do myself and encourage; this way an intuitive feel is developed. In the days of punch cards, one needed a good justification for any modification to an algorithm. The ready availability of compute power nowadays makes it easy to play around and try out different approaches, to the detriment of understanding. Even when an algorithm is improved, it is often not known why.
Machine Learning often concentrates on the “How?”, but “Why?” is the much more important question. The lack of understanding of why an algorithm arrives at a particular solution makes it difficult, for example, to remove bias from an algorithm. Machines were used to help decision-making processes because it was thought a machine cannot be biased in the same way a human is. However, sometimes machines did not remove the bias; in some cases, they even amplified it. Often bias in the training data is blamed, but algorithms have been shown to be biased even when features such as gender or ethnicity are removed from the input. The algorithms still learnt correlations, showing bias when the output was analysed.
More, and especially unbiased, data will help. At the same time, it should be investigated why a data sample took, for example, one path instead of another through a neural network. Yet the complexity of neural networks, with many layers and neurons, is such that a human cannot follow it. Neural networks are so powerful because, given enough layers and neurons (in other words, enough rope to hang themselves), anything can be modelled.
In classification, a neural network with the step function from ZERO to ONE as the activation function specifies many different planes in the space in which the data lies. The output of the step function of one neuron in the first layer can be interpreted as a Boolean indicating on which side of the plane (specified by the synapses going into that neuron) the data sample lies: ZERO (or FALSE) for one side, ONE (or TRUE) for the other. The subsequent layers can be interpreted as AND and OR operations. On a piece of paper, any region, or even a set of regions, can be described by a set of lines and by noting on which side of each line a point of the region lies. This made neural networks for classification so powerful, but it was also their downfall. As was discovered, input data could be changed ever so slightly and the neural network would give a different result, while a human would still classify the data correctly as before. In the neural network, the data point had merely slipped to the other side of a plane.
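To make this concrete, here is a minimal sketch (with made-up weights, not taken from the post) of a two-layer network with step activations. Each first-layer neuron tests on which side of one line a point lies; the second layer is an AND of the three Booleans, so the network fires exactly for points inside a triangle. The last line shows how a tiny nudge across one line flips the answer.

```python
import numpy as np

def step(z):
    """Hard threshold: 0 on one side of the plane, 1 on the other."""
    return (z > 0).astype(float)

# First layer: each row of W1, together with b1, defines a line in the plane.
# The neuron outputs TRUE (1) if the point lies on the chosen side.
W1 = np.array([[ 1.0,  0.0],    # x > 0
               [ 0.0,  1.0],    # y > 0
               [-1.0, -1.0]])   # x + y < 1
b1 = np.array([0.0, 0.0, 1.0])

# Second layer: an AND of the three Booleans
# (fires only if all three inputs are 1).
W2 = np.array([[1.0, 1.0, 1.0]])
b2 = np.array([-2.5])

def classify(p):
    h = step(W1 @ p + b1)        # on which side of each line?
    return step(W2 @ h + b2)[0]  # AND of the three answers

print(classify(np.array([0.2, 0.2])))     # 1.0: inside the triangle
print(classify(np.array([0.2, -0.001])))  # 0.0: nudged across one line
```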
Replacing the hard step function with the logistic sigmoid possibly alleviates the problem. Instead of saying we are sure of being on one side or the other, a probability is given for which side the data sample lies on. If the logistic sigmoid returns one half, we are on neither side. The slope of the logistic sigmoid needs to be determined. A steep slope means the certainty increases quickly on both sides. A gentle slope signifies we remain unsure for quite a while when travelling away from the plane. Also, should the uncertainty decrease in the same way everywhere?
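A small sketch of this idea, with an illustrative plane and slope values of my own choosing: the sigmoid turns the hard Boolean into a probability, and the slope parameter controls how quickly certainty grows with distance from the plane.

```python
import numpy as np

def soft_side(p, w, b, slope=1.0):
    """Probability that point p lies on the positive side of the plane
    w . p + b = 0. The slope controls how quickly certainty grows with
    distance from the plane; a very large slope recovers the hard step."""
    return 1.0 / (1.0 + np.exp(-slope * (w @ p + b)))

w = np.array([0.0, 1.0])   # the plane y = 0, chosen for illustration
p = np.array([0.2, 0.05])  # a point just above the plane

print(soft_side(p, w, 0.0, slope=1.0))    # ~0.51: still very unsure
print(soft_side(p, w, 0.0, slope=100.0))  # ~0.99: almost certain
```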
While it is tempting to play with the choice of activation function and its parameters, perhaps effort is better spent on the question of whether there should be a plane at all. Less is more. I recommend investigating how few neurons and layers a neural network needs to still perform its task. The answer might be surprising.
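One hypothetical way to run such an investigation, using scikit-learn on a toy dataset (my choice, not the author’s): train networks with a single hidden layer of increasing size and see where the accuracy stops improving.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# How small can the network be and still separate two interleaved half-moons?
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_hidden in (1, 2, 4, 8, 16):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                        max_iter=5000, random_state=0)
    net.fit(X_train, y_train)
    print(n_hidden, "hidden neurons, test accuracy:",
          round(net.score(X_test, y_test), 3))
```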
In regression, it helps to think about the process which generated the data. If it is just a linear combination of the input, then linear activation functions suffice. If it also involves products and powers of the input, a polynomial activation function is necessary. If the process involves transcendental functions, such as a trigonometric function or an exponential, these can serve as activation functions. If functions of functions are required to describe the process, extra layers in the network accomplish this. Such considerations address the question of why in the design of algorithms.
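As a final sketch, assuming (purely for illustration) a periodic generating process: a one-hidden-layer model with fixed random inner weights, where only the output layer is fitted by least squares, does far better when the activation matches the process than when a linear activation is used. The dataset, weights and random-feature construction are my own assumptions, not a recipe from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data assumed (for illustration) to come from a periodic process.
x = np.linspace(0.0, 6.0, 200)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(x.size)

def fit_and_score(activation, n_hidden=30):
    """One hidden layer with fixed random inner weights; only the output
    layer is fitted, by linear least squares. The activation is chosen
    to reflect the suspected generating process."""
    W1 = rng.uniform(-3.0, 3.0, n_hidden)
    b1 = rng.uniform(-np.pi, np.pi, n_hidden)
    H = activation(np.outer(x, W1) + b1)        # hidden-layer outputs
    w2, *_ = np.linalg.lstsq(H, y, rcond=None)  # fit the output layer
    return np.mean((H @ w2 - y) ** 2)

for name, act in [("linear", lambda z: z), ("sinusoidal", np.sin)]:
    print(name, "activation, mean squared error:",
          round(fit_and_score(act), 4))
```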