Saturday, April 21, 2012

Pigeon Navigation (1): Paths and Probability

How do birds navigate successfully over huge distances from temperate to tropical regions and back every year? How do homing pigeons know how to get back to their owner's loft quickly enough to win a race? Is there some way to control the number of pigeons in Trafalgar Square [or insert your country's pigeon hotspot]?

All good questions. None of which really interest me.

How can we mash up the science of pigeon navigation and a bit of probability theory and come up with something fun and faintly ridiculous? Now you're talking...

For a bit over 10 years now, researchers have been attaching GPS devices to the backs of domestic homing pigeons (Columba livia to our classicist friends) before releasing them in more or less odd places. If and when these pigeons make it home, the devices can be removed and we can see exactly where the pigeon has been in the interim (typically at a resolution of a couple of metres, once every second).

This is what a pigeon looks like. That's a GPS tracker on its back.


A few pigeon paths recorded in the Oxford area. Those red dots sure look exciting, don't they? We'll be getting to them eventually...


With such data, our intrepid scientists have shown that pigeons probably use landmarks, learn routes home, seem to follow roads and often co-operate in getting home. Sadly, while these findings have revolutionised a popular field of study, been hugely cited and generally proved more than averagely seminal, they didn't include very much probability theory, so I'm going to go ahead and pretty much ignore them from here on.

But where there's data, there's a chance to get some machine learning going. So let's get to it...

Paths and Probability


There are many things we might want to learn from the data recorded by the GPS devices. In my research I try to frame learning as a test of various hypotheses, using data to adjudicate between them. For example, if we want to learn whether pigeons genuinely follow idiosyncratic routes (which we will), we need to know whether the data are more or less likely under this hypothesis than under the alternative. If we want to know if the pigeon uses landmarks, we need a way to say whether the GPS data are more or less likely given some hypothetical set of landmarks the bird might be using. We need probability theory as a link between our data and our theories.

The many recorded locations that a pigeon visits constitute elements of a path that the pigeon actually flies. As with anything probabilistic, we need to start off by finding a way to ask how likely the data (the recorded positions) are. How probable is it that the pigeon flew this path, rather than some alternative route? How can we place probabilities on observations of flight paths?

Well, let's try and get there one step at a time. First I'll just try to give you some idea of the approach we're going to take. In subsequent posts I'll flesh this out with some actual maths.

If I asked you to place a probability on where the middle of the path (say, the 50th of 100 locations) would be, how would you do it? A reasonable guess would be that on average it would be halfway between the release point and the loft. But as the picture above shows, it's likely to vary around that point quite a bit. Wherever you think it's going to be, you can specify this as a probability distribution: a Gaussian (Normal) distribution, centred on where you think it will be and with a standard deviation that represents your uncertainty.

Now imagine I ask you to put a similar probability on the locations 1/3rd and 2/3rds of the way along the path. We could just as easily make a guess and place Gaussian distributions at both of the points to represent where we think the bird will be. Likely these will be directly 1/3rd and 2/3rds of the way between release and loft. But look at that picture above. If the pigeon starts out to the left of the straight line, it's likely to stay out to the left later. So our two locations are going to be correlated: if one is left of centre, the other is likely to be too. They have a joint probability distribution.

The pictures below give some indication of how this joint distribution works. We have two correlated variables. Initially we are quite uncertain about both (A). Then we measure one, reducing its uncertainty to zero (B). In addition, the uncertainty in the second variable is reduced, and its expected value moves closer to the first measured value.

(A) Two correlated, unmeasured variables



(B) Variable 1 is measured, variable 2 is less uncertain
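For the mathematically inclined, the updating in picture (B) can be written down in two lines. Here is a minimal Matlab sketch of the standard Gaussian conditioning formulas, with made-up numbers for the means and the covariance:

mu = [0; 0]; %prior means of variables 1 and 2
Sigma = [1, 0.8; 0.8, 1]; %prior covariance: the two variables are strongly correlated
x1 = 1.5; %we measure variable 1
mu2_given_x1 = mu(2) + Sigma(2,1)/Sigma(1,1)*(x1 - mu(1)); %expected value moves towards x1
var2_given_x1 = Sigma(2,2) - Sigma(2,1)^2/Sigma(1,1); %uncertainty shrinks from 1 to 0.36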

Now, we can extend this to lots of different locations along the path. It is reasonable to imagine that locations will be more correlated the closer they lie along the path. Let's assume we can state a function, which we call the covariance function, k(t1, t2), that says how strongly two values (t1, x1) and (t2, x2) should be correlated, and that this correlation gets weaker as the separation of t1 and t2, dt = |t1 - t2|, becomes bigger, such as the functions in the figure below.

Correlations get weaker as the difference in t values increases. How fast the correlations decrease depends on the covariance function, k(dt).
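To make this concrete, here is a small Matlab sketch of one common choice, a squared-exponential covariance function. The length scale len is an assumed free parameter controlling how quickly correlations decay; nothing here is fitted to real data:

len = 5; %length scale: how quickly correlations decay with separation
k = @(t1, t2) exp(-(t1 - t2).^2/(2*len^2)); %squared-exponential covariance
k(0, 0) %= 1: a point is perfectly correlated with itself
k(0, 5) %~ 0.61: nearby points are strongly correlated
k(0, 20) %~ 0.0003: distant points are almost independent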

Making that assumption, and looking at 10 points, all jointly distributed, we might get figures like those below.

(A) 10 unmeasured variables, correlated according to separation


(B) Measure some variables, others become less uncertain in response.




Going one step further, we might take the number of points we are interested in to infinity, for a continuous path, and then measure just a few of those points.

A continuous range of variables, measured in 3 places

What we're getting to, through this exercise, is the concept of a Gaussian process, which is a probability distribution over continuous paths or functions. Much like the Gaussian distribution gives a probability of seeing any number, or set of numbers, a Gaussian process (GP) gives the probability of seeing any path, or any set of points measured on that path. The standard Gaussian distribution can describe any finite number of jointly distributed variables; the GP is simply a Gaussian distribution with an infinite number of variables, representing every possible point on the path.

Gaussian: P(x) = N(x; mean, variance)

Gaussian process: P(path) = GP(path; mean path, covariance function)


The most important property of a GP is that any subset of points on the path (such as the recorded positions from the GPS device - don't confuse GPs and GPS!) follow a multivariate Gaussian distribution,

P(recorded positions) = N(recorded positions; mean positions, covariance matrix)
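If you want to play with this property yourself, here is a minimal Matlab sketch: pick a finite set of times, build a mean and a covariance matrix, and draw a sample path from the resulting multivariate Gaussian. The numbers (a straight-line mean path, a length scale of 0.1) are illustrative assumptions, not anything fitted to real pigeons:

n = 100;
t = linspace(0, 1, n)'; %times along the flight, as a column vector
mu = 10*t; %assumed mean path: a straight line from release (0) to loft (10)
[T1, T2] = ndgrid(t, t);
K = exp(-(T1 - T2).^2/(2*0.1^2)); %covariance matrix built from k(t1, t2)
C = chol(K + 1e-8*eye(n), 'lower'); %a little jitter keeps the factorisation stable
path = mu + C*randn(n, 1); %one sampled path; plot(t, path) to see it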


We'll discuss exactly what the covariance matrix and the mean positions represent in the next post.

Great! We're on our way. If we can assign probabilities to paths in a consistent manner we can ask if observed paths are more or less likely based on different hypotheses, which allows us to use data to select between those hypotheses. In the next post I'll give a rundown of the properties of GPs and how they work.

[In a switch of textbook, for these pigeon navigation posts I'll be advising you to look at the definitive guide to GPs, Gaussian Processes for Machine Learning, by Rasmussen and Williams, and what I have to assume is the definitive work on using GPs to analyse pigeon flight paths, Prediction of Homing Pigeon Flight Paths using Gaussian Processes, by one R. P. Mann]

Sunday, April 15, 2012

Why Model Selection?: Bayes and Biased Coins

A key tool in the work I do is Bayesian model selection. These days 'Bayesian' is a buzzword that is often used colloquially as a synonym for all relatively sophisticated inference methods, and frequently deployed in papers to great effect, making them seem more exciting. However, model selection, and Bayesian model selection in particular, do have concrete meanings beyond the hype, and I think it is important to see, in a few simple examples, why model selection is often (and, I believe, usually) a better alternative to standard significance tests when trying to test scientific hypotheses.

While I refuse to be dogmatic in my use of inference methods, I do subscribe to what is known as the 'Bayesian interpretation' of probability theory, which views probabilities as degrees of belief in things, much as the odds at a horse race represent how much the bookmaker believes a given horse is going to win. Again, I hope it is possible to show, with a simple example, that this interpretation has important and useful implications for real inference.

A standard tool in teaching probability is the biased coin: a coin that is more likely to land on one side than the other. Let's look at a couple of examples of biased coin problems to illustrate why model selection can be more sensible than significance testing.

Disclaimer: Examples like this are common in textbooks, including my favourite. I make no claim to originality. Also please let me know if you find any errors in the maths; there's nothing worse than having your argument die by virtue of a missing factor of 2 somewhere! (which did eventually happen!)

The Setup

Imagine I offer you two coins, telling you that one is biased towards tails and the other is fair. The biased coin gives tails on average 4 times in 5; the fair coin gives tails half the time. You choose one and I ask you to try and guess whether it is fair or not. I tell you that you can toss the coin a maximum of 50 times. You toss the coin the allotted 50 times and you get 31 tails, 19 heads. Do you believe the coin is biased?

The Significance Test

The classic way to answer this question is a significance test. We ask 'are the data [the coin tosses] significantly different from what we expect by chance?' To do this we construct two hypotheses:

The test hypothesis (H1): The coin is biased.

The null hypothesis (H0): The coin is fair.

We then look to see whether our data are sufficiently unlikely under the null hypothesis that we might reject it. We do not consider the test hypothesis any further.

The question we ask, mathematically, is 'how likely was I to see 31 or more tails from 50 tosses, if the coin was fair?' The answer is:

P(31 or more tails | fair coin) = ∑_{i=31}^{50} binopdf(i, 50, 0.5) = 0.0325

Where binopdf is the binomial probability distribution. Traditionally we reject the null hypothesis if this probability is less than 0.05, so here we can claim a significant result. We would then say we believe the coin to be biased, and roughly speaking we would assume we had about a 3-4% chance of being wrong.

Problems

Woah there! Do you have any objections to what we just did? I hope so....

  • Why do we care about 31 or more tails? We didn't get more than (or less than) 31. Why are we basing our conclusions on things that never happened?
  • What the hell happened to H1? Surely we could/should have tested that hypothesis too. Nope: in significance tests we always choose the most 'random' possible hypothesis, label it the null and ignore the rest.
  • The setup was specified in some detail. There were exactly 2 coins, only one was biased, and the bias was known. You were asked to choose one at random. How come none of these details appear in our analysis? Are they really unimportant?

Model Selection

Let's deal with the first two of those objections. Why do we care about more than 31 tails? One can make arguments in terms of deciding 'what would convince me?' before doing the test, but the simplest answer is: because this test is ridiculous. Our conclusions should only ever be based on what actually happened and how likely that was, a position known as the likelihood principle.

Ok, so what about H1? Surely we could go back, label our hypotheses the other way round and test the biased hypothesis instead? Exactly! It's a quirk of the kind of effects scientists test, and how they test them, that we have become used to 'rejecting the null hypothesis'. Let's see how easy it is to test H1 directly against H0.

First, we'll calculate just the probability of what actually happened, i.e. 31 tails and 19 heads. If the coin is fair:

P(31 tails, 19 heads | fair coin) =  binopdf(31, 50, 0.5) = 0.0270

and if the coin is biased (remember, we know the bias):

P(31 tails, 19 heads | biased coin) =  binopdf(31, 50, 0.8) = 0.0016

And we can now simply state that the data is 17 (yes, 17!) times more likely if the coin is fair than if it is biased.

P(31 tails, 19 heads | fair coin)/P(31 tails, 19 heads | biased coin) = 17

While the significance test suggested the coin was biased, our direct model comparison shows this to be an absurd conclusion. Of course, I have chosen these numbers to illustrate how bad the significance test is...it will sometimes give the right answer. But why jump through the perverse hoops required for the significance test when the direct comparison is both more accurate and simpler?
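If you want to check this yourself, the whole comparison is three lines of Matlab (a sketch, assuming you have the Statistics Toolbox for binopdf):

p_fair = binopdf(31, 50, 0.5) %= 0.0270
p_biased = binopdf(31, 50, 0.8) %= 0.0016
ratio = p_fair/p_biased %= 17: the data favour the fair coin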

If we don't happen to know the exact bias of the biased coin in advance, we can still do a comparison by integrating over the possible biases the coin could have, from 0.5 to 1, with a prior distribution on p of pr(p) = 2 for 0.5 < p < 1 (thanks to a commenter for the correction).

P(31 tails, 19 heads | biased coin) = ∫_{p=0.5}^{1} binopdf(31, 50, p) pr(p) dp = 0.0372

P(31 tails, 19 heads | fair coin)/P(31 tails, 19 heads | biased coin) = 0.7

So now the biased coin is somewhat more likely, because the bias could be nearer the 0.6 necessary to give these results (though the probability of a fair coin is still ~40%, not the 3-4% we thought earlier).
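This version is just as easy to check numerically. A sketch, assuming the Statistics Toolbox and a recent enough Matlab for the integral function (quadgk works similarly in older versions):

pr = @(p) 2*ones(size(p)); %prior on the bias: pr(p) = 2 for 0.5 < p < 1
p_biased = integral(@(p) binopdf(31, 50, p).*pr(p), 0.5, 1) %= 0.0372
p_fair = binopdf(31, 50, 0.5);
ratio = p_fair/p_biased %= 0.7: now a slight edge to the biased coin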

Bayes

So direct comparison of two hypotheses is (I hope you'll agree) better than doing a significance test, if we have two or more clear hypotheses to test. And if we have no clear alternative to the null hypothesis, what are we even doing? Boldly rejecting the null when we have nothing to replace it with? Doesn't that sound like we're getting ahead of ourselves?

Now to deal with the third objection above. We saw in the last section that we can make use of the exact bias of the coin if it is known, and adjust if it is not. What about all the other info we have in the problem? And to introduce a further, connected problem: why are we judging these models on how likely they make the data? I don't know about you, but I want to know how likely the data makes the models... they are not the same thing!

Looking at our original problem setup, a key aspect we haven't considered is how many coins you were offered. It was exactly two, one of which was biased. What if I had offered you a bucket of 1000 coins, only one of which was biased? Even if I didn't tell you the exact bias, could we really repeat the analysis above and conclude that a coin giving 31 tails was likely to be biased?

Without further ado, let me introduce Bayes rule, which allows us to go from the probability of the data, D, to the probability of a hypothesis, H:

P(H | D) = P(D | H) x P(H)/P(D)

The first term on the right is known as the likelihood, and is what we have previously been using to judge our hypotheses. But, ignoring P(D), which will cancel out when comparing two hypotheses, the second term on the right, P(H), shows us that something else is going on. This is what we call the prior probability of H, i.e. how likely H was before we saw any data (tossed the coin). In our first example we could easily argue that P(H0) = P(H1) = 1/2, since there were two equally choosable coins. With 1000 coins on offer, and a random choice, we should assume there is only a 1/1000 chance that we picked the biased coin before we start testing it. Bayes rule says that this prior probability doesn't just disappear once we start doing tests. So if we calculate the ratio between the probabilities of H0 and H1, we need to include the ratio of prior probabilities:

P(fair | 31 tails) / P(biased | 31 tails)
= [P(31 tails | fair) x P(fair)] / [P(31 tails | biased) x P(biased)]
= (0.0270 x 0.999) / (0.0372 x 0.001)
≈ 725
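Or numerically, continuing the sketch from earlier (with p_fair and p_biased as computed for the unknown-bias case):

prior_fair = 999/1000; %999 of the 1000 coins in the bucket are fair
prior_biased = 1/1000; %only one is biased
posterior_odds = (p_fair*prior_fair)/(p_biased*prior_biased)
%comes out around 700: the fair coin remains far more probable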

So yes, the data is somewhat more probable from a biased coin, but the sheer unlikeliness of us having picked that coin out of the bucket to start with massively outweighs this slim evidence. If you don't believe this, I invite you to fly over to Sweden and we'll play some betting games, with you picking coins out of a bucket, and I'll finally have some disposable income!

If you still have doubts about the power of prior beliefs, watch the video below. Although the data your eyes receive is slightly more likely from a concave face, you see a convex face because your brain knows convex faces are hugely more likely.


Monday, April 9, 2012

How Fish Shoal (5): Separating Different Effects

In this final post on the analysis in our paper on fish shoaling, I'll show how we can adapt the technique of fitting a function with a neural network to isolate the various different cues that the fish respond to simultaneously.

In general we might imagine that a fish moving in a shoal is presented with an array of potential stimuli at any moment. Just for starters, it has the positions, movements and behaviours of its many neighbouring fish to consider. In addition to this there are other environmental cues, such as the positions of the walls of the fish tank in a laboratory experiment, or the possible locations of predators in the wild. We can expect that the behaviour of the focal fish at any moment will be due to a combination of all these effects (I'm intentionally avoiding the words 'sum' or 'product' for reasons that we'll soon see).

Now, in the last post we learned how we can fit the behaviour of a focal fish as a function of any input stimuli that we choose. We saw an example using just the position of the nearest neighbour to predict the acceleration response of the focal fish. There is nothing in that approach to stop us instead using far more input stimuli. We could, for example, construct a neural network to predict the acceleration from the positions of the nearest 3 neighbours.


This would, in principle, allow us to learn how the acceleration of the focal fish depends, in potentially very complex ways, on the positions of its 3 nearest neighbours (NNs - don't get confused with neural networks!). However, such an approach has a few significant drawbacks. Firstly, from a technical viewpoint, the larger we make the space of possible inputs (i.e. the more stimuli we use), the harder it becomes to train the neural network. The more inputs we have, the more combinations of those inputs are possible. It becomes less and less likely that our data will cover a large enough proportion of those possible combinations to allow us to learn the connections between inputs and outputs.

Secondly, even if we could learn a function of the 3 positions (6 variables in total, each position being an angle and distance), how are we to 'see' what we have learnt? We may be able to try new combinations of the inputs and find the predicted acceleration, but it is going to be almost impossible for us to visualise the function.

Finally, how can we relate this back to previously established theories of collective motion? If we learn some highly complex function of the positions of all the neighbouring fish, what does that tell us about the simple rules of interaction that have previously been the standard way of understanding these phenomena?

We can get around these difficulties by considering the sort of interaction rules that have been suggested before. These have almost exclusively considered additive responses - the response of the fish to 2 neighbours is simply the sum of the response to each neighbour individually. This is akin to most effects in physics - the force on a spaceship is the sum of the gravitational force from the Earth and the gravitational force from the Moon, and the gravitational force from the Sun...etc. The existence of the Moon doesn't change the force exerted on the spaceship by the Earth (at a specific moment in time).

So we can propose a model where the acceleration of the focal fish is a function, f, of the positions of 3 neighbours (p1, p2, p3), this function itself being the sum (see why we avoided that word earlier) of 3 simpler functions, g1, g2, g3.

Acceleration = f(p1, p2, p3) = g1(p1) + g2(p2) + g3(p3) + residual

Now, just as we can model a function using a neural network, we can model 3 functions using 3 neural networks. The eventual acceleration is now the sum of the outputs from those 3 networks. I'm going to stop drawing each network properly and just treat them as black boxes, as I encouraged you to do in the last post.
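In Netlab terms the combined prediction is literally just a sum of black-box outputs. A one-line sketch, where net1, net2 and net3 are hypothetical trained networks and p1, p2 and p3 are matrices of the measured neighbour positions:

acc_predicted = mlpfwd(net1, p1) + mlpfwd(net2, p2) + mlpfwd(net3, p3);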


So the task of estimating one complicated function by learning one big neural network is now changed to the task of estimating 3 hopefully simpler functions using 3 smaller networks. The question is how to learn all 3. There is a choice to be made - do you try and learn all 3 simultaneously, or do you prioritise some over others?

In our paper we took both approaches for different subsets of the stimuli. We considered the positions of the 3 nearest neighbours, but also the past behaviour of the focal fish and the position of the wall, giving us a schematic as below.


Or, in equation form:

Acceleration = gpast(past) + gwall(wall) + g1(p1) + g2(p2) + g3(p3) + residual

We made two biologically motivated arguments to decide how to approach learning this combination:

1. There is no reason to suspect, a priori, that the past behaviour of the fish, the position of the tank wall or the positions of the neighbouring fish are more or less important than each other.

2. It is implausible to imagine that the focal fish interacts with its 2nd or 3rd nearest neighbours but not with the first nearest. 

Therefore we will learn the first three networks (past, wall, NN1) simultaneously, since each of these should have the chance to be the primary factor in predicting acceleration. Networks 4 and 5 (NN2 and NN3) will be learned subsequently, using whatever part of the fishes' acceleration has not been accurately predicted by the first 3 networks. This means, for instance, that the interaction between a fish and its second nearest neighbour will only be allowed to account for what cannot be predicted by its interaction with its first nearest neighbour.

The process of learning multiple networks, either simultaneously or in succession, is relatively similar. We iteratively learn each network, while assuming that the others are already known. Let's look at how we learn the networks associated with the past, the wall and the nearest neighbour:

1. Start each network in a random configuration, just like when we learn a single network.

2. Now, we assume that the networks associated with the past and the wall are known and correct. We pass our measured values for these stimuli into the networks and record the output, getting a predicted acceleration_past+wall (recall the USE function on the black box from the last post)



3. We now learn the nearest neighbour network using our measured values of the nearest neighbour position as inputs, and the difference between the actual acceleration and the predicted acceleration as outputs


4. Once the NN1 network is learnt, we fix its state and then apply the same technique to learn the past network. Fixing the wall and NN1 networks in their current states we predict what the acceleration should be from the wall and nearest neighbour alone


5. Then, like in step 3, we now learn the past network, using the difference between the predicted and observed accelerations as the output


6. Having learnt the past network, we fix its new state, and we predict the acceleration from the past and the nearest neighbour networks



7. And now we learn the wall network, using the measured values of the wall position and the difference between the predicted acceleration and the observed acceleration


8. We can now test how good our 3 networks combined are at predicting the real acceleration by feeding the measured stimuli through all 3 in their current states.



9. Now we have learnt each of the first 3 networks once. However, we are not finished. From this point we go back to stage 2 and go through stages 2 to 8 again, repeating the whole process until the error between predicted acceleration and real acceleration stops improving.
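For the practically minded, steps 1 to 9 boil down to a short loop. Here is a minimal Matlab/Netlab sketch of this simultaneous stage, assuming past, wall and nn1 are matrices of measured stimuli (one row per observation, two columns each for simplicity) and acc is a column vector of measured accelerations; all variable names are illustrative rather than taken from our actual code:

nets = {mlp(2, 20, 1, 'linear'), mlp(2, 20, 1, 'linear'), mlp(2, 20, 1, 'linear')};
X = {past, wall, nn1}; %inputs for the past, wall and NN1 networks
options = zeros(1, 18); options(1) = 1; options(14) = 100;
for sweep = 1:50 %in practice, repeat until the prediction error stops improving
    for i = 1:3
        pred = zeros(size(acc)); %prediction from the other two networks...
        for j = [1:i-1, i+1:3]
            pred = pred + mlpfwd(nets{j}, X{j});
        end
        %...and train network i on whatever acceleration is left unexplained
        nets{i} = netopt(nets{i}, options, X{i}, acc - pred, 'scg');
    end
end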

So by this process we have learnt, simultaneously, the functions associated with the past, the wall and the nearest neighbour. Learning each network in this iterative fashion divides the observed accelerations into components associated with the different stimuli, in the way that best matches the measured values, assuming the additive model is correct. It is of course possible that there are more complicated interactions that depend on, for example, the position of the wall and the nearest neighbour in some complex non-additive fashion. We have found the best additive function that estimates the true process.

Now, having found these three networks, we are in a position to successively learn the interactions with the second and third neighbours. Remember that we do these after the first 3 since it is hard to imagine a fish consistently ignoring its nearest neighbour while attending to its second nearest neighbour. As such we only learn each of these once, rather than iterating (a code sketch of this sequential stage follows the numbered steps below).

10. Group all 3 of the networks we learned before and predict the acceleration


11. Now learn the second nearest neighbour network (NN2) based on the difference between this prediction and the observed acceleration


12. Now group all 4 of the networks we have learnt so far and predict the acceleration


13. Learn the 3rd nearest neighbour network based on the difference between these predictions and the observed accelerations


14. Phew! Now we have finally learnt every network. We can make predictions based on all 5 networks to test how well the combination predicts the real accelerations


15. But nicely, since each network has only 1 or 2 inputs, we can also input the range of possible values for each input into its respective network and plot the output, enabling us to visualise each component and solving our earlier problem of visualising a high-dimensional function. When we do so, we get something that looks a lot like the figure below. The results from the past network are not shown; only those from the wall and the 3 neighbours (from top to bottom respectively). On the left-hand side we have the predicted accelerations from each of these networks. On the right-hand side we have the predicted turning angle, using the same process but with measured turning angles instead of accelerations. Recall that each plot is a semicircle because the functions are assumed to be symmetric (acceleration) or anti-symmetric (turning angle).
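Continuing the sketch from step 9, the sequential stage of steps 10 to 14 needs no iteration; NN2 and NN3 are each trained once on the remaining residual (again with hypothetical names, nn2 and nn3 being the measured positions of the second and third nearest neighbours):

pred3 = mlpfwd(nets{1}, past) + mlpfwd(nets{2}, wall) + mlpfwd(nets{3}, nn1);
net_nn2 = netopt(mlp(2, 20, 1, 'linear'), options, nn2, acc - pred3, 'scg');
pred4 = pred3 + mlpfwd(net_nn2, nn2);
net_nn3 = netopt(mlp(2, 20, 1, 'linear'), options, nn3, acc - pred4, 'scg');
pred5 = pred4 + mlpfwd(net_nn3, nn3); %final combined prediction from all 5 networks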


As we saw in the last post, the interaction with the nearest neighbour replicates many of the features we expect, including distance-dependent repulsion and attraction between the two fish. The other interesting result here is how little structure there is in the functions associated with the second and third nearest neighbours. In our paper we interpret this as evidence that the fish primarily interact with their first neighbour only. Such a strong biological interpretation comes with necessary mathematical caveats. It is important to be clear that we have only learnt the best function for mapping stimuli to behaviour out of those which fit the additive structure we proposed. As shown in another paper published alongside ours, interactions may not always be additive. Also, technically what we have shown is that the positions of the second and third neighbours do not help us to predict the behaviour once we know the position of the first nearest neighbour. While this suggests that no interaction takes place, these are subtly different statements.

Despite those final caveats, learning combined functions in this manner is a powerful tool for separating effects in the data, and for learning patterns that may be obscured by other unknown factors (such as the response to the walls of the tank here). I hope you find this useful when considering your own data!

If you want to know about this sort of technique in more detail I can suggest starting on the Wikipedia page for Expectation-Maximisation, or let me plug again (unpaid and unrequested!) my suggested text: David Mackay's textbook, Information Theory, Inference and Learning Algorithms (free online).

Monday, April 2, 2012

How Fish Shoal (4): Using a Neural Network to Learn From Data

To recap: in the first post on this topic, I started by asking how we can use recorded data of fish movements in groups to learn how they interact. I stated that we can see this as inferring a function between the environment and the fish's behaviour, and in the subsequent posts we looked at how we might estimate functions using regression, arriving at the idea of a neural network as a highly flexible tool for performing non-linear regression. In this post we'll see how we can practically use neural networks (as one possible tool among many alternatives) to learn from data, and how this is actually coded in Matlab, to show how few of the details we need to concern ourselves with to start doing useful inference.

I'll be referring to code that utilises the Netlab toolbox in Matlab, which you can download for free, and which you can install simply by unzipping the downloaded file and adding the directory to your Matlab path. The code I will use is specific to Netlab, but the basic method applies to using any similar toolbox.

Inference always begins by deciding which outputs (behaviours) we want to predict from which inputs (stimuli, environment). The recorded positions of the various fish over time, taken from video tracking, are only useful once we make this assignment. In the case of our research we looked at the relative positions and directions of each fish's neighbours in the group as the inputs, and the fish's responses of acceleration (or deceleration) and turning angle as the outputs, as shown in the figure below taken from our paper.

The relative position and direction of a neighbour (yellow) from the focal fish (red)
So first we take all the recorded positions of the fish, and for every fish at every time step we calculate the following quantities:

1. The angle (theta) and distance (r) to the nearest neighbour, the second nearest neighbour, third nearest etc.

2. The direction (phi) of each neighbour relative to the focal fish

3. How much the fish accelerated (a) and turned (alpha) on the next time step

We also measure quantities associated with where the wall of the tank is relative to the fish, but I'll ignore these for now. Quantities 1 and 2 are our inputs, the stimuli; quantity 3 is the behaviour - what the fish did next in response to the stimuli.
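If you are wondering what this measuring step looks like in code, here is a rough Matlab sketch for the nearest neighbour only. It assumes pos is an nFish x 2 x nTime array of tracked (x, y) positions, heading is an nFish x nTime array of headings in radians, and speed is measured similarly; all of these names are hypothetical, not from our actual tracking code:

nFish = size(pos, 1);
f = 1; t = 10; %a focal fish at one time step
d = pos(:, :, t) - repmat(pos(f, :, t), nFish, 1); %vectors from the focal fish to all others
dist = sqrt(sum(d.^2, 2)); %distances to every fish
dist(f) = inf; %so the focal fish doesn't count as its own neighbour
[r, nn] = min(dist); %r: distance to the nearest neighbour
theta = atan2(d(nn, 2), d(nn, 1)) - heading(f, t); %angle to the neighbour, relative to the focal fish's heading
phi = heading(nn, t) - heading(f, t); %the neighbour's direction relative to the focal fish
a = speed(f, t+1) - speed(f, t); %acceleration on the next time step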

So, assuming that we've tracked our fish and measured the above, let's get inferring...

Let's try seeing how the acceleration of the focal fish is related to the position of the nearest neighbour. Once you've got Netlab installed, you can build a neural network in just one line:

my_nn = mlp(2, 10, 1, 'linear');


my_nn: is your neural network (remember from the last post, it can also be called a Multi-Layer Perceptron - mlp)

2: is the number of inputs we want to use. We will be using the angle and distance to the neighbour.

10: is the number of 'hidden nodes' - that's the number of nodes in the middle layer of the diagram we saw in the last post. We can change this number - more nodes make the network more flexible but harder to learn well. I find 20 tends to work ok, but always experiment! Each node will be a sigmoidal function of the inputs, but we're not going to worry about these details here.

'linear': means that the output will be a weighted sum of all the hidden nodes. The only real reason to change this is if the outputs are binary rather than continuous.

Now you have a neural network! But at the moment it doesn't do very much. It's been configured in a random state. You can try putting some numbers in and seeing what comes out using mlpfwd:


y = mlpfwd(my_nn, [x1, x2]);


where x1 and x2 are any possible values of the angle and distance to the nearest fish you want to try, and y is the predicted acceleration. At the moment those predictions will be meaningless, as the network hasn't learnt anything.

Now comes the useful bit. Assume we have three vectors containing the data: theta is a vector of angles to the nearest fish, r is a vector of the distances to the nearest fish, and a is a vector of how much the fish accelerated. Make sure these are column vectors. Then we can train the network using just a few more lines of code:

options = zeros(1, 18); options(1)=1; options(14) = 100;
my_nn = netopt(my_nn, options, [theta, r], a, 'scg');


netopt is a function that trains the network, based on the data it's given. It tries to find the values for all the parameters (like the 'slopes' in the last post) which will produce the best match between what actually comes out of the network when we put the inputs (position of the nearest neighbour) in, and the behaviours we tell it should come out (i.e. the measured accelerations). options is, as the name suggests, a number of possible options. Here we only use 2: the first tells Matlab to show the error values as the algorithm learns, and the 14th tells netopt to run 100 iterations of the learning algorithm. The learning algorithm is something called 'scaled conjugate gradients', which is the 'scg' at the end.

Now we can input any values of theta and r to the network and it should output a value of the expected acceleration that fits with the data it has already seen. That is about 90% of everything you need to know to start doing inference with a neural network today. All the diagrams and equations in the last post are nice to have in the back of your head while doing this, but essentially you can treat the neural network as a black box. You put data in, in the form of known inputs and outputs. You press a button to make the network 'learn', and then the box will tell you what output you should expect for any input you offer it.

First we show the network some known examples
..then we ask it to predict the output for other inputs
This is in fact the basis of pretty much all of machine-learning. Take a number of known examples of something, such as images of handwritten letters. Plug them into a learning algorithm (of which a neural network is but one among many) to train it. Then use the same algorithm to predict what some unknown examples are.

Now all that remains is to try inputting all the possible values of theta and r that we might be interested in. In our paper we made the further simplification that the function would be symmetric around the axis of the fish - i.e. if the fish will accelerate when a neighbour is ahead on the left, it will also do so if the neighbour is ahead on the right. So we test values of r between 0 and some maximum (like 40cm), and angles between 0 and pi (everywhere on the left of the fish). In Matlab we can make vectors of these test inputs like this:

r_test = linspace(0, 40, 100);
theta_test = linspace(0, pi, 100); %this gives us 100 values of each input


[r_grid, theta_grid] = ndgrid(r_test, theta_test); 
test_input = [theta_grid(:), r_grid(:)];
%this matches every value of theta to every value of r so we can test all pairs, in the same [theta, r] order we used for training


test_acc = mlpfwd(my_nn, test_input); %this puts our test inputs through the network we learned


test_acc = reshape(test_acc, size(r_grid)); 
%and this makes the output accelerations into a matrix so we can visualise it


X = r_test' * cos(theta_test); %rows index r, columns index theta, matching test_acc
Y = r_test' * sin(theta_test);
pcolor(X, Y, test_acc);
%This visualises the output on a nice semi-circle

And so finally we get a plot showing what the network thinks the fish will do for any given position of the nearest neighbouring fish

That B is because this comes from a multipanel image, as we'll see soon
So we confirm some previously held beliefs about how interactions like this work. The focal fish accelerates to catch up with a neighbour in front. It slows down to rejoin a neighbour behind. And if a neighbour is too close (near the centre), this is reversed to move the focal fish to a more comfortable distance.

So in a few lines of code by us, and a lot of preprogrammed code by the makers of Netlab, we have done some quite sophisticated inference with a minimum of real maths. Of course, there are some complications in getting from the 90% you already know to the 100% you need to be publication-ready. You'll need to concern yourself with things like multiple local minima of the squared error, cross-validation and suchlike. But these are things to worry about once you've got your hands a little dirty and started actually doing some inference... none of them mean you can't start applying these techniques to your data TODAY!

In the next and probably last post on this topic I'll show how we go from learning this relatively simple function with just 2 inputs, to a more complex function accounting for the positions of many neighbours, and we'll investigate the perils of correlation and confounding.

[Again, if you want to read more about the details of any of these techniques, I recommend David Mackay's textbook, Information Theory, Inference and Learning Algorithms (free online). Netlab also contains a large number of demo scripts, of which demomlp1.m is similar to this post]