Friday, June 16, 2017

Rethinking Retractions: Rethought

Four years ago I published what turned out to be one of my most popular blogposts: 'Rethinking Retractions'. In that post I related the story of how I managed to mess up the analysis in one of my papers, leading to a horrifying realisation when I gave my code to a colleague: a bug in my code had invalidated all our results. I had to retract the paper, before spending another year reanalysing the data correctly, and finally republishing our results in a new paper.

Since I wrote that blogpost I have found there are a lot of people out there who want to talk about retractions, the integrity of the scientific literature and the incentives researchers face around issues to do with scientific honesty.

Here are a few of the things that have resulted from that original blogpost:

Speaking at the World Conference on Research Integrity


Looking back now at the original blogpost, I can see the situation with some more distance and detachment. The most important thing I have to report, five years after the original cock-up and retraction, is that I never suffered any stigma from having to retract a paper. Sometimes scientists talk about retractions as if they are the end of the world. Of course, if you are forced to retract half of your life's work because you have been found to have been acting fraudulently then you may have to kiss your career goodbye. But the good news is that most scientists seem smart enough to tell the difference between an honest error and fraud! There are several proposals going around now to change the terminology around corrections and retractions of honest errors to avoid stigma, but I think the most important thing to say is that, by and large, the system works - if you have made an honest mistake you should go ahead and correct the literature, and trust your colleagues to see that you did the right thing.

Meanwhile, I'm just hoping I still have something to offer the scientific community beyond being 'the retraction guy'...

Analogues between student learning and machine learning

Back in 2014 I was trying to make some progress towards my docent (Swedish habilitation) by fulfilling the requirement to undertake formal pedagogic training. As it happens, I left Sweden before either could be completed, but I recently went back through my materials, and found this essay I had written as part of that course. In the absence of anything else to do with it, here it now lies...

Introduction

Over time people have developed increasingly sophisticated theories of learning and education, and correspondingly teaching methods have changed and adapted. As a result, much is now known about what activities most promote student learning, and the differences between individuals in their learning techniques and strategies.

At the same time, computer scientists have developed increasingly powerful artificial intelligences. The creation of powerful computational methods for learning patterns, making predictions and understanding signals has drawn attention to a more mathematical understanding of how learning happens and can be facilitated.

Some of the parallels between these fields are obvious. For example, the development of artificial neural networks was driven by the analogy between these mathematical structures and the neuronal structure of the brain, and encouraged scientists to describe the brain from a computational perspective (e.g. in [Kovács, 1995]). However, the analogies between theories of learning in education and computer science are deeper than these surface resemblances, and go to the heart of what we consider useful information and knowledge, and what we mean by understanding.

In this report I will review elements of both the pedagogical and machine learning literature to draw attention to specific examples of what I consider to be direct analogues in these two fields, and how these analogies help organise our knowledge of the learning process and motivate approaches to student learning.

Learning to learn

When computer scientists first began creating artificial intelligence, their first approach was to try to encode useful knowledge about the world directly in the machine, by explicit inclusion in the computer’s programming. For example, in attempting to create a computer vision system that could recognise handwritten letters, the programmer would try to describe in computer code what an ‘A’ or a ‘B’ looked like in terms that the computer could recognise in the images it received. However, this procedure generally proved dramatically ineffective. The sheer range of ways in which an ‘A’ can be written, the possible permutations on the basic design and the different angles and lighting that the computer could receive defeated the attempt to systematically describe the pattern in this top-down fashion.

Instead, success was first achieved in these tasks when researchers tried the radically different approach not of teaching the computer each concept individually, but instead teaching the computer how to learn itself. In 1959 Arthur Samuel defined machine learning as a ‘Field of study that gives computers the ability to learn without being explicitly programmed’ [Simon, 2013]. By providing the computer with algorithms that allowed it to observe examples of different letters, and learn to distinguish these itself from the examples, much greater success was possible in identifying the letters. In essence, by teaching the computer good methods for learning, the computer could gain much greater understanding itself, and with less input from the programmer.

The parallel here with the teacher-student relationship is very direct. A teacher is responsible, of course, for providing a great deal of information to a student. But the best teachers are more successful because they teach the students how to learn for themselves, how to fit new examples into their existing understanding and how to seek the new information and examples they need. At the higher levels of tuition, encouraging and enabling this self-directed learning is essential. Anne Davis Toppins argues that within 30 minutes ‘I can convince most graduate students that they are self-directed learners’ [Toppins, 1987]. However, much as programmers initially tried to directly tell computers what they needed to know, before realising the greater efficiency of teaching them to learn for themselves, so has the pedagogical approach taken a similar path [Gustafsson et al., 2011]:

'For some lecturers, thinking in terms of emphasising with and supporting the students’ learning and “teaching them to learn”, i.e. supporting them in their development of study skills, can constitute a new or different perspective. [...] Some teachers claim that since the students have studied for such a long time in other school situations, the higher education institution should not have to devote time to the learning procedure.'

In other words, there have been, and indeed still are, many lecturers who view their role primarily in terms of transmitting information, rather than in developing the students’ abilities to think and learn for themselves.

Conceptual understanding

In the modern teaching literature, much importance is placed on aiming for, and testing, students’ conceptual knowledge. That is, students are expected to learn not simply a series of factual statements, or isolated results, but instead to incorporate their knowledge into higher level abstract concepts that they can use to understand unfamiliar situations, solve unseen problems and extrapolate their knowledge to new domains. The prevailing doctrine of constructive alignment [Biggs, 1999] that forms the basis for recommended teaching approaches in European countries under the Bologna process is designed to make sure that teaching methods, student activities and assessment assignments all align towards this goal of promoting and testing whether students understand the ‘big picture’.

According to a computer scientist’s view of knowledge and information, there is a very good reason why we should aim to promote such a concept-centred approach for students. Identifying unifying principles that tie knowledge together and understanding how apparently different fields may link together reduces the amount and the complexity of the information that a student or computer must store, access and process, and maximises the effectiveness of extrapolating to new domains.

Consider as a simple example the data shown in figure 1. How can this data be effectively stored? The simplest method would be to record each pair of (x, y) co-ordinates. Assuming we use 4 bytes per number (single-precision floating point accuracy), this will take us 80 bytes (10 x’s, 10 y’s). But visually we can immediately recognise an important pattern: the data clearly lie along a straight line. If we know the gradient and intercept of this line we can immediately translate any value of x into a value of y. Therefore we can reproduce the whole data set by specifying just 12 numbers: the 10 values of x, one value for the intercept and one value for the gradient. Therefore by understanding one big idea, one concept about the data, that they lie along a line, we have almost halved the effort of learning and storing that information. Furthermore, we can now extrapolate to any new value of x, immediately knowing the correct corresponding value of y. If we had simply memorised the 10 pairs of co-ordinates we would have no way to do this.

In the field of machine learning this line of reasoning has been formalised into the principles of Minimum Message Length and Minimum Description Length, first proposed by Chris Wallace [Wallace and Boulton, 1968] and Jorma Rissanen [Rissanen, 1978] respectively. These state that the best model, or description, of a data set is the one which requires the least information to store. Modern texts on machine-learning theory focus heavily on the superiority of the simplest possible models that enable reconstruction of the necessary information and stress the connection to the well established principle of Occam’s Razor (e.g. [MacKay, 2003]). Applications of machine learning theory to animal behaviour have further suggested that animals apply the same principles to maximise the value of their limited processing and storage capabilities [Mann et al., 2011], so it is likely that humans also apply similar methods.

Figure 1: By observing conceptual patterns in the data we can reduce the amount of memory needed to store it, whether on a machine or in a human mind. In this simple example, by identifying the linear relation between the X and Y co-ordinates (Y = 2X), we need to store only the X values, the intercept and the gradient, reducing the number of stored numbers from 20 to 12.
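The storage argument above can be checked with a few lines of Python. This is only a minimal sketch: the ten x values are invented for illustration (the original figure's data are not reproduced here), and `fit_line` and `predict` are hypothetical helper names. It shows that the line's gradient and intercept, together with the x values, are enough to rebuild every y and to extrapolate to an unseen x.

```python
# Hypothetical data in the spirit of Figure 1: ten points on the line y = 2x.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = gradient * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

gradient, intercept = fit_line(xs, ys)

# Rote storage keeps every co-ordinate; the "concept" keeps x plus two numbers.
raw_count = len(xs) + len(ys)
compressed_count = len(xs) + 2

def predict(x):
    """Recover (or extrapolate) y from the stored gradient and intercept."""
    return gradient * x + intercept

reconstructed = [predict(x) for x in xs]
print(raw_count, compressed_count, predict(25.0))
```

With real, noisy data the reconstruction would no longer be exact, and one would also have to store the residuals or accept some error; that trade-off is what the description-length principles formalise.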

An analogous example in student learning might be seen in teaching mathematics students to solve equations. The most naive way for students to learn how to solve a particular type of problem in an exam would be to observe many, many examples of the problem, remember the solution to each one and then attempt to identify a match in the exam and recall the solution for the matching equation. Such an approach, while not entirely unknown among students cramming for final exams, is likely doomed to failure. It requires an enormous amount of (trustworthy!) memory to store even a fraction of the possible problems one might see in the exam, and if a new problem is encountered there is no way to generalise from the known solutions to other equations in order to solve it. A much more efficient method is to learn general techniques that can be applied to any possible equation. In this case the student need only remember a few core principles and how to apply them. They can then solve both equations they have seen before and new examples.
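The contrast between recall and principle can be made concrete with a toy sketch. It assumes, purely for illustration, that the exam only contains linear equations of the form ax + b = c; the memorised examples and both function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical "crammed" solutions, keyed by the coefficients (a, b, c)
# of equations of the form a*x + b = c.
memorised = {
    (2, 3, 7): Fraction(2),   # 2x + 3 = 7
    (5, -1, 9): Fraction(2),  # 5x - 1 = 9
}

def solve_by_recall(a, b, c):
    """Return the remembered answer, or None if this exact equation is new."""
    return memorised.get((a, b, c))

def solve_by_principle(a, b, c):
    """Apply the general rule x = (c - b) / a, valid for any a != 0."""
    if a == 0:
        raise ValueError("not a linear equation in x")
    return Fraction(c - b, a)

print(solve_by_recall(4, 1, 3))     # unseen equation: recall has nothing to offer
print(solve_by_principle(4, 1, 3))  # the general rule still applies
```

The lookup table grows with every new example, while the general rule stays the same size however many equations it is asked to solve.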

Strategic learning

A common characteristic of high-achieving students is a strategic approach to learning. They have a good overview of what they need to learn to achieve their life goals. They set realistic but challenging learning goals for themselves to the end of learning this material. And they actively seek out information from teachers, reading materials and other sources to aid their learning. Whether their goals are intrinsic (interest in the subject, desire for knowledge) or extrinsic (obtaining a degree, getting a job), this strategic approach to learning systematically produces better outcomes than passively receiving whatever information is offered.

Analogously, in the field of machine learning, recent developments have tended more and more towards ideas termed ‘active learning’ [Settles, 2010]. The previous paradigm of simply offering many examples to the computer to learn from and then assessing or using the results of that process has been overturned. Instead, the programmer/mathematician devises a strategy for the computer to seek out new examples, based on what it wants to achieve (e.g. identifying written letters successfully) and what it currently knows. For example, if the computer has a good idea how to recognise an ‘A’, but frequently confuses a ‘U’ and a ‘V’, it will seek out or request more examples of these letters so that it can improve its knowledge. This way it does not waste time learning redundant material, but maximises the result of its effort by focusing on the most rewarding areas.
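A deliberately simplified sketch of this idea follows. Instead of letters it uses a one-dimensional problem: the learner must find the boundary between two classes (think of a single feature separating ‘U’ from ‘V’), and each label request costs effort. The oracle, the boundary value and the pool of examples are all hypothetical, and the sketch assumes the labels are perfectly ordered along the feature. The learner always asks about the example it is least sure of, the one in the middle of the region it cannot yet classify.

```python
def label(x, true_threshold=0.6205):
    """Hypothetical oracle (the 'teacher'): True when x lies above the boundary."""
    return x >= true_threshold

def active_learn_boundary(pool, oracle):
    """Locate the class boundary in a sorted pool, asking as few labels as possible.

    lo is the index of the last example known to be negative,
    hi is the index of the first example known to be positive.
    """
    lo, hi = -1, len(pool)
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2  # the example the learner is least certain about
        queries += 1
        if oracle(pool[mid]):
            hi = mid
        else:
            lo = mid
    return lo, hi, queries

pool = [i / 1000 for i in range(1000)]  # 1000 unlabelled examples
lo, hi, queries = active_learn_boundary(pool, label)
print(pool[lo], pool[hi], queries)
```

A passive learner would need labels for all 1000 examples to pin the boundary down this precisely; by choosing its own questions the active learner needs no more than 10.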

Likewise a high-performing student will focus their attentions on areas where they are weak and/or particularly crucial concepts that provide a pivot for understanding. They will ask their teachers for more feedback on their efforts in these areas, spend more time on mastering them and prioritise them ahead of areas of less importance or that are already understood. McKeachie’s Teaching Tips [McKeachie and Svinicki, 2013] devotes a chapter to the importance of encouraging strategic and self-regulated learning. One of their descriptions of a strategic learner states:

‘Strategic learners know when they understand new information and, perhaps more important, when they do not. When they encounter problems studying or learning, they use help-seeking strategies’.

This emphasis on the importance of knowing where understanding is lacking, and the resultant help-seeking strategy, aligns perfectly with what information theory tells us is the optimal way to gain useful knowledge.

McKeachie’s Teaching Tips [McKeachie and Svinicki, 2013] also focuses on the importance of student learning goals. My own research in the field of active learning corroborates this view, demonstrating that even when a learner has a good learning strategy, the success of that strategy depends intimately on the goals that the learner sets themselves. Indeed, without a suitable goal the learner is unable to define a useful strategy [Garnett et al., 2012]. Thus, in order to develop students’ strategic learning skills, it is essential first to help them define and identify what their individual goals are. A student for whom this is an essential course, but who is otherwise uninterested, may be best helped by helping them to clarify what they wish to achieve (a certain final grade for instance), and then working with them to establish what strategy will most likely allow them to reach that outcome. A student with greater intrinsic motivation for the course may need help setting specific staged learning goals that enable a learning strategy. The teacher’s experience in understanding the most effective path through the material would therefore be essential in establishing effective goals that the student can then apply a strategy to achieve.

Discussion

While student and machine learning are clearly not direct parallels of each other (could one imagine a machine equivalent for tiredness, or skipping class to watch TV?), the analogies that do exist between the two help us to understand why certain approaches to student learning are more successful than others, via the large body of technical knowledge that exists regarding how machines can be taught. In this report I have analysed a selection of those analogies, aiming to draw conclusions about how students should be taught.

In particular, a common theme of modern pedagogical approaches is to move from information transfer to a student-directed learning approach. In a sense, computer scientists have been down this path already, switching from a programmer-led to a computer-led learning approach that has resulted in far superior learning outcomes. This should motivate and support the equivalent transition in student learning.

In teaching computers how to think and learn, we have also needed to help them establish goals and strategies for learning, and this is now the forefront of machine learning research. The dramatic improvement in computer learning outcomes when well-developed strategies are employed should remind us that it is the manner in which the student approaches new information and requests help and feedback that matters at least as much as the amount of information they are presented with. Such knowledge demands that we devote time to monitoring and developing students’ learning strategies and discussing what they hope to achieve via our courses.

Students, like all of us, are presented with a great deal more information than they can easily process and digest. If computer science in the 21st century has taught us anything, it is the importance of identifying general patterns in the vast body of information we are now exposed to via the media, the Internet and other sources. Without relatively simple general principles, information can easily become overwhelming. That the same principle applies in student learning should not surprise us. How is a student to retain all the information we attempt to transfer to them without organising it into general principles rather than a huge array of specific cases? The content of any course therefore should revolve as much around this organisational structure as the raw information itself, demanding generalised understanding rather than specific regurgitation. Thankfully this is the direction modern pedagogy is taking, with such concepts as constructive alignment and the SOLO taxonomy.


References
[Biggs, 1999] Biggs, J. (1999). What the student does: teaching for enhanced learning. Higher Education Research & Development, 18(1):57–75.
[Garnett et al., 2012] Garnett, R., Krishnamurthy, Y., Xiong, X., Schneider, J., and Mann, R. (2012). Bayesian optimal active search and surveying. In Proceedings of the International Conference on Machine Learning.
[Gustafsson et al., 2011] Gustafsson, C., Fransson, G., Morberg, Å., and Nordqvist, I. (2011). Teaching and learning in higher education: challenges and possibilities.
[Kovács, 1995] Kovács, I. (1995). Maturational windows and adult cortical plasticity, volume 24. Westview Press.
[MacKay, 2003] MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge: Cambridge University Press.
[Mann et al., 2011] Mann, R., Freeman, R., Osborne, M., Garnett, R., Armstrong, C., Meade, J., Biro, D., Guilford, T., and Roberts, S. (2011). Objectively identifying landmark use and predicting flight trajectories of the homing pigeon using gaussian processes. Journal of The Royal Society Interface, 8(55):210–219.
[McKeachie and Svinicki, 2013] McKeachie, W. and Svinicki, M. (2013). McKeachie’s teaching tips. Cengage Learning.
[Rissanen, 1978] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5):465–471.
[Settles, 2010] Settles, B. (2010). Active learning literature survey. University of Wisconsin, Madison, 52:55–66.
[Simon, 2013] Simon, P. (2013). Too Big to Ignore: The Business Case for Big Data. John Wiley & Sons.
[Toppins, 1987] Toppins, A. D. (1987). Teaching students to teach themselves. College Teaching, 35(3):95–99.
[Wallace and Boulton, 1968] Wallace, C. S. and Boulton, D. M. (1968). An information measure for classification. The Computer Journal, 11(2):185–194.

Wednesday, June 7, 2017

General election 2017: Opinion Polls vs Betting Markets

Update, June 9: The results are in, and the BBC gives the vote share for each party. Although the polls gave a wide variety of different predictions between different polling companies, the average of the polls appears to have outperformed the betting markets again!



See my recent article in The Conversation for some reasons why betting markets may have been performing so badly in predicting elections and referenda in recent years, or read the original research here.

------

Tomorrow is the polling day in the UK General Election 2017 (make sure you vote!). Today's news will be full of the latest opinion poll numbers, and pundits making predictions. Increasingly people are also looking to betting and prediction markets to get an idea of what is likely to happen as well. Both opinion polls and betting markets have made some very significant errors in recent years. Before the Brexit referendum I did an analysis of what bets on Betfair were telling us about the predicted vote share for Leave/Remain. Punters got that one wrong, just like the election of Trump in the USA, while polls were more accurate in predicting tight races.

Before we go to the polls tomorrow, let's compare what opinion polls and betting markets are telling us, so we can evaluate which is more accurate on this occasion. I'll focus simply on raw vote share for the two main parties (ignoring constituency effects), and I'll use the Financial Times poll-of-polls as a benchmark for the opinion polls and Betfair's vote share markets for betting markets.

First the opinion polls: https://ig.ft.com/elections/uk/2017/polls/



This gives a central forecast of Conservatives on 43%, Labour on 37%.

To calculate the predicted vote share from Betfair I'll be repeating the analysis I did here (see previous post for R code), fitting a beta-distribution to the vote share divisions given on the market. I've taken screenshots of the Conservative and Labour markets, as these will no doubt change after I post this:

Conservative:



Labour:



Performing the analysis to get the predicted vote share gives the following results:


This puts the Conservatives on 44% and Labour on 34% - almost identical to the opinion polls for the Conservatives, but somewhat lower for Labour.
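For readers who want a feel for the fitting step, here is a rough Python sketch of the general idea. It is not the R code from the earlier post. The bracket boundaries and decimal odds are hypothetical placeholders rather than the Betfair quotes in the screenshots, and the grid search and function names are my own simplifications. The odds on each vote-share bracket are converted to implied probabilities, and the beta distribution whose bracket probabilities match them most closely is selected; its mean is the predicted vote share.

```python
import math

def implied_probabilities(decimal_odds):
    """Turn decimal odds into probabilities that sum to one."""
    raw = [1.0 / odds for odds in decimal_odds]
    total = sum(raw)
    return [p / total for p in raw]

def beta_bracket_probs(mean, concentration, edges, steps=1000):
    """Probability that a Beta(mean*k, (1-mean)*k) vote share falls in each bracket."""
    a = mean * concentration
    b = (1.0 - mean) * concentration
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    width = 1.0 / steps
    probs = [0.0] * (len(edges) + 1)
    for i in range(steps):
        x = (i + 0.5) * width  # midpoint rule for the integral of the density
        density = math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
        bracket = sum(1 for edge in edges if x >= edge)
        probs[bracket] += density * width
    total = sum(probs)
    return [p / total for p in probs]

def fit_beta(market_probs, edges,
             means=tuple(i / 100 for i in range(30, 56)),
             concentrations=(50, 100, 200, 400)):
    """Coarse grid search for the beta distribution closest to the market."""
    best = None
    for m in means:
        for k in concentrations:
            model = beta_bracket_probs(m, k, edges)
            err = sum((p - q) ** 2 for p, q in zip(model, market_probs))
            if best is None or err < best[0]:
                best = (err, m, k)
    return best[1], best[2]

# Hypothetical market: brackets <40%, 40-45%, 45-50% and >50%.
edges = [0.40, 0.45, 0.50]
odds = [6.0, 2.2, 3.5, 15.0]
market = implied_probabilities(odds)
mean, concentration = fit_beta(market, edges)
print(mean, concentration)
```

A proper optimiser would do better than this coarse grid, but the grid keeps the sketch free of dependencies and makes the search easy to follow.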

Labour have recently surged in the polls from a very low position. It seems that the betting markets don't fully trust this. Come tomorrow night we'll have a good idea which of the polls or the market has been more accurate.



Tuesday, May 2, 2017

A simple reward system could make crowds a whole lot wiser

Richard Mann, University of Leeds

There’s a problem with the wisdom of crowds.

Market economies and democracies rely on the idea that whole populations know more about what is best for them than a small elite group. This knowledge is potentially so powerful it can even predict the future through stock markets, betting exchanges and special investment vehicles called prediction markets.

These markets allow people to trade “shares” in possible future outcomes, such as the winner of upcoming elections. Anyone with new information about the future has a financial incentive to spread it by buying these shares. Prediction markets now routinely inform bookmakers’ odds and are quoted in news coverage of elections alongside more traditional opinion polls.

But prediction markets are having a crisis of confidence in the abilities of the crowd. They have been systematically wrong about a series of high profile political decisions, including the UK general election of 2015, the Brexit referendum and the US presidential election of 2016.

We shouldn’t expect perfect accuracy on every occasion, just as we know opinion polls are often flawed. But to be wrong so consistently about such prominent events points to possible flaws in the assumptions we make about crowd intelligence. For example, people don’t always act on the information they have and so it might never become part of the crowd’s decision. The dynamics of crowds and markets might also stop people from paying attention to some sources of information at all.

However, there might be a way forward. My colleagues and I have come up with a model that overcomes this problem by giving people an incentive to seek out new sources of information, and an extra reason to share it.

An important question for markets is “where do individuals get their information?” Research shows that our opinions and activities very often match those of our peers. We also tend to look for information in the most obvious places, in line with everyone else.

To give an example, if you look around on any public transport in the City of London you’ll probably see people holding copies of the Financial Times. This is a problem because if everyone has the same information, the crowd is no smarter than a single individual. Studies show that having a diverse collection of opinions, especially including minority views, is crucial for creating a smart group.

Thinking the same. Shutterstock

So why do we tend to narrow the sources of our opinions? One reason is because we have an innate desire to imitate our peers, to behave in ways that are safe and acceptable within our community. But it may also be because of a rational, profit-seeking motivation.

We studied how theoretical profit-motivated people behave when faced with the types of rewards seen in market-like situations. To do this, we created a computer simulation of a prediction market, where people received a reward for making correct predictions. Rewards were larger when fewer people guessed the right answer, just like in a prediction market or a betting exchange.

The reward an individual received was a fixed amount divided by the number of other people who made a correct prediction. This was supposed to give people an incentive to look for right answers that other people wouldn’t find. But we found that people still gravitated towards a very small subset of the available information – just like London bankers with their copies of the Financial Times.

The more complex the situation was, the smaller the percentage of available information people actually used. The problem was that the more niche, unused information, though it might be useful to the group, was so rarely useful to the individual that possessed it that there was no incentive for them to seek it out.

New reward system

To counter this, we created a theoretical new prediction market system, where people would only be rewarded if they expressed accurate views but were also in the minority. For example, if someone predicted that Donald Trump would win the US election, against the consensus view, they would have received a reward once the result was known. Conversely, if most people accurately predict the Conservative Party will win the upcoming UK election then they wouldn’t receive any reward.

We found that this “minority reward” system, which explicitly favours those who go against popular opinion if they turn out to be correct, produced much more accurate collective decisions. This was especially the case when the situations were complex, influenced by many factors.
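The two reward rules can be written down in a few lines. This is an illustrative sketch, not the simulation code from the research. It assumes the fixed pot is split among everyone who predicted correctly, that the minority reward is a flat amount, and that "minority" means strictly fewer than half of the group; both function names are hypothetical.

```python
from collections import Counter

def market_rewards(predictions, outcome, pot=1.0):
    """Market-style rule: a fixed pot is shared among the correct predictors."""
    correct = [p == outcome for p in predictions]
    n_correct = sum(correct)
    if n_correct == 0:
        return [0.0] * len(predictions)
    return [pot / n_correct if c else 0.0 for c in correct]

def minority_rewards(predictions, outcome, reward=1.0):
    """Minority rule: pay only those who are correct AND were in the minority."""
    counts = Counter(predictions)
    in_minority = counts[outcome] < len(predictions) / 2
    return [reward if (p == outcome and in_minority) else 0.0 for p in predictions]

crowd = ["A", "A", "A", "B"]  # three follow the consensus, one dissents

print(market_rewards(crowd, "A"))    # consensus right: the pot is split three ways
print(minority_rewards(crowd, "A"))  # consensus right: nobody is paid
print(minority_rewards(crowd, "B"))  # dissenter right: only the dissenter is paid
```

Under the market rule, following the crowd still pays whenever the crowd is right; under the minority rule it never does, which is what pushes individuals towards less obvious sources of information.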

Intuitively, this makes sense. If your opinion supports the existing popular view, you can’t change whether the group will be correct or not. In our model, people have an incentive to go hunting for more esoteric sources of information about possible future outcomes. For example, rather than reading the Financial Times, they might follow obscure blogs, or read local newspapers looking for information on companies in the area.

They know that only by finding information that very few have access to will they have a chance to correctly go against the prevailing wisdom. This encourages the whole group to bring together a much wider set of information, leading to more accurate collective decisions.

Our results are so far confined to a theoretical model, but they give us an insight into why current forms of prediction markets may be prone to failure, and how we might try to improve them in future. We hope that these insights will be used to create more accurate prediction markets, as we could all benefit from better collective foresight.

Better predictions and collective decision making could help society decide which political ideas will or won’t work. Improving the ability of stock markets to predict which companies and ideas will do well could improve the return on investment and generate greater economic growth. Even academia is a large-scale exercise in collective wisdom. If changing the way that researchers are rewarded can improve the wisdom of this crowd, it could lead to more important scientific discoveries.

Richard Mann, University Academic Fellow in Data Analytics, University of Leeds

This article was originally published on The Conversation. Read the original article.

Tuesday, December 20, 2016

Cheap wins from data in healthcare

There have been many calls for a 'data revolution', or even a 'Big Data revolution' in healthcare. Ever since the completion of the Human Genome Project, there has been an assumption that we will be able to tailor individual treatments based on data from an individual's DNA. Meanwhile, others dream of using the masses of routinely collected clinical data to determine which treatments work and for whom through data mining. As individuals we are encouraged to record our health metrics using smartphones to optimise our lifestyles for better health.

Each of these aspects of data-driven healthcare has promise, but also problems. It is very difficult to reliably associate a disease or drug efficacy with a small number of testable gene alleles, and very easy to identify false positive gene associations. Routinely collected data is very difficult to make reliable inferences from in terms of cause and effect, because treatments are not randomly assigned to patients. Sophisticated analytics do not stop you needing to think about how your data was collected. Lifestyle optimisation via smartphones probably owes more to Silicon Valley's ideal of the hyper-optimised individual and a corporate desire for ever more personal data than any real health benefits beyond an increased motivation to exercise.

However, there are easy wins to be had from data. These are in prediction of future events that involve no medical intervention. It is difficult to predict how a drug will affect a patient, because you need to infer the drug's effect against a background of other potential causes. But it is much easier to tell if a patient arriving at the hospital for a specific operation will need to stay overnight; simply look at whether similar patients undergoing similar operations have done so. If this sounds exceptionally simple, that's because it is. However, the gains could be great. Hospitals routinely have to keep expensive beds available to deal with emergencies, or cancel planned operations to deal with unexpected bed shortages. A reliable system to estimate the length of patient stay after an operation with some accuracy would reduce the need for these expensive, time consuming and inconveniencing issues. On the ground staff already have a good sense for which patients will need to stay longer than others. However, in the maelstrom of an NHS hospital, anything that can help to systematise and automate the making and use of these estimates will reduce pressures on staff.
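The "look at similar patients" idea really is this simple. The sketch below is not the model used in our analysis: it groups past cases by procedure and a coarse age band and reports the median stay in each group. The records, the 20-year banding and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def age_band(age, width=20):
    """Map an age to the lower edge of its band, e.g. 72 -> 60."""
    return (age // width) * width

def build_lookup(records):
    """Median stay (in nights) for each (procedure, age band) group."""
    groups = defaultdict(list)
    for procedure, age, stay_nights in records:
        groups[(procedure, age_band(age))].append(stay_nights)
    return {key: median(stays) for key, stays in groups.items()}

def predict_stay(lookup, procedure, age, default=None):
    """Predicted stay for a new patient, or default if no similar cases exist."""
    return lookup.get((procedure, age_band(age)), default)

# Hypothetical historical records: (procedure, age, nights stayed).
history = [
    ("hip replacement", 72, 4),
    ("hip replacement", 68, 3),
    ("hip replacement", 75, 5),
    ("cataract", 70, 0),
    ("cataract", 66, 0),
    ("cataract", 79, 1),
]

lookup = build_lookup(history)
print(predict_stay(lookup, "hip replacement", 70))
print(predict_stay(lookup, "cataract", 61))
```

Nothing here says anything about why stays differ between groups; it only summarises what has happened to similar patients before, which is all a bed planner needs.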

Exploring this possibility, we performed an analysis of data the NHS routinely collects for patients and procedures, such as age, year, day and surgery duration (see figure below), and used this to predict stay duration. Our results showed that a substantial portion of the variability in stay duration could be predicted from these data, which would translate to a significant saving for the NHS if generally applied and combined with current estimates of stay given by experts on the ground from their past experience. Note, importantly, we are not suggesting any intervention on the individual as a result of this analysis. For instance we make no judgement on whether the variation by day indicates anything important about treatment, only that this helps planners to know what's likely to come up next. This work is not about whether the NHS should operate a full weekend service!


The variation in predicted stay duration based on four possible indicators. Black line indicates median prediction, grey region is a 95% confidence interval. From Mann et al. (2016) Frontiers in Public Health

As with numerical weather forecasts, we envisage this supplementing and supporting existing human expert judgement, rather than replacing it - there are clearly facets of the patient that we cannot capture in a simple data analysis. This provides a minimal cost use of existing data, with little or no complicating causal issues, that could save the NHS money on a daily basis. The size of the NHS means that small gains can be amplified on a national scale, while NHS data provides an enormous potential resource. It may be in these unglamorous aspects of healthcare provision that data analytics has immediate potential.
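As a rough illustration of the kind of prediction described above, here is a minimal sketch in R. Everything in it is an assumption made for illustration: the records are simulated, the column names (`age`, `weekday`, `surgery_mins`, `stay_days`) are hypothetical, and the Poisson regression is a simple stand-in rather than the model used in the paper.

```r
# Simulated stand-in for routinely collected records (all values hypothetical)
set.seed(1)
n_patients <- 2000
records <- data.frame(
  age = round(runif(n_patients, 18, 90)),
  weekday = factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri"),
                          n_patients, replace = TRUE)),
  surgery_mins = round(runif(n_patients, 20, 240))
)
# Assumed relationship, used only to generate example data
records$stay_days <- rpois(
  n_patients,
  lambda = exp(-1 + 0.015 * records$age + 0.004 * records$surgery_mins)
)

# Hold out the last 500 patients to check predictions on unseen cases
train <- records[1:1500, ]
holdout <- records[1501:2000, ]

# Poisson regression: a simple stand-in for a predictive model of stay duration
fit <- glm(stay_days ~ age + weekday + surgery_mins,
           family = poisson, data = train)
holdout$predicted <- predict(fit, newdata = holdout, type = "response")

# Share of out-of-sample variability captured by the predictions
r_squared <- cor(holdout$predicted, holdout$stay_days)^2
print(r_squared)
```

Note that the model is only asked to predict, not to explain: nothing here says why older patients or longer operations go with longer stays, and nothing needs to for bed planning.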

Monday, December 5, 2016

Machine-learning doesn't give you a free pass

A few weeks ago I read this paper on arXiv, purporting to use machine-learning techniques to determine criminality from facial expressions. The paper uses ID photos of "criminals and non-criminals" and infers quantifiable facial structures that separate these two classes. I had a lot of issues with it and was annoyed, if not surprised, when the media got excited by it. Last week I also saw this excellent review of the paper that echoes many of my own concerns, and in the spirit of shamelessly jumping on the bandwagon I thought I'd add my two cents.

As someone who has dabbled in criminology research, I was pretty disturbed by the paper from an ethical standpoint. I think this subject, even if it is declared fair game for research, ought to be approached with the utmost caution. The findings simply appeal too strongly to some of our more base instincts, and to historically dangerous ideas, to be treated casually. The sparsity of information about the data is troubling, and I personally find the idea of publishing photos of "criminals and non-criminals" in a freely-available academic paper to be extremely unsettling (I'm not going to reproduce them here). The paper contains no information on any ethical procedures followed.


Aside from these issues, I was also disappointed from a statistical perspective, and in a way that is becoming increasingly common in applications of machine-learning. The authors of this paper appear not to have considered any possible issues with the causality of what they are inferring. I have no reason to doubt that the facial patterns they found in the "criminal" photos are distinct in some way from those in the "non-criminal" set. That is, I believe they can, given a photo, with some accuracy predict which set it belongs to. However, they give no consideration to any possible causal explanation for why these individuals ended up in these two sets, beyond the implied idea that some individuals are simply born to be criminals and have faces to match.

Is it not possible, for example, that those involved in law enforcement are biased against individuals who look a certain way? Of course it is. It's not like there isn't research on exactly this question. Imagine what would happen if you conducted this research in western societies: do you doubt that the distinctive facial features of minority communities would be inferred as criminal, simply because of well-documented police and judicial bias against these individuals? In fact, you need not imagine, because this already happens: machine-learning software analyses prisoners' risk of reoffending, and entirely unsurprisingly attributes higher risk to black offenders, even though race is not explicitly included as a factor.
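The statistical point can be made with a small simulation. This is a sketch under stated assumptions, not a model of any real data: `feature` is a hypothetical measurement that has no effect on behaviour, everyone offends at the same rate, and the only thing that depends on the feature is the (assumed) chance of an offender being convicted.

```r
set.seed(42)
n <- 20000

# Hypothetical facial measurement with no effect on behaviour
feature <- rnorm(n)

# True offending is independent of the feature (10% for everyone)
offended <- rbinom(n, 1, 0.10)

# ASSUMED enforcement bias: offenders with a higher feature value
# are more likely to end up convicted
p_convicted <- plogis(-1 + 1.5 * feature)
convicted <- offended * rbinom(n, 1, p_convicted)

# A classifier trained on convictions finds the feature "predictive"
fit_label <- glm(convicted ~ feature, family = binomial)

# The same model fitted to true behaviour finds essentially nothing
fit_truth <- glm(offended ~ feature, family = binomial)

coef_label <- unname(coef(fit_label)["feature"])
coef_truth <- unname(coef(fit_truth)["feature"])
print(c(label = coef_label, truth = coef_truth))
```

The classifier is perfectly accurate about which set a person ends up in, and says nothing about why. The labels carry the bias of whoever produced them.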

If this subject matter was less troublesome, I would support the publication of such results as long as the authors presented the findings as suggesting avenues for future, more careful controlled studies. However, in this case the authors resolutely do not take this approach. Instead, they conclude that their work definitively demonstrates the link between criminality and facial features:
"We are the first to study automated face-induced inference
on criminality. By extensive experiments and vigorous
cross validations, we have demonstrated that via supervised
machine learning, data-driven face classifiers are able
to make reliable inference on criminality. Furthermore, we
have discovered that a law of normality for faces of noncriminals.
After controlled for race, gender and age, the
general law-biding public have facial appearances that vary
in a significantly lesser degree than criminals."

This paper remains unreviewed, and let us hope it does not get a stamp of approval from a reputable journal. However, it highlights a problem with the recent fascination with machine-learning methods. Partly because of the apparent sophistication of these methods, and partly because many in the field are originally computer scientists, physicists or engineers, rather than statisticians, there has been a reluctance to engage with statistical rigour and questions of causality. With many researchers hoping to be picked up by Google, Facebook or Amazon, the focus has been on predictive accuracy, and on computational efficiency in the face of overwhelming data. Some have even declared that the scientific method is dead now that we have Big Data. As Katherine Bailey has said: "Being proficient in the use of machine learning algorithms such as neural networks, a skill that’s in such incredibly high demand these days, must feel to some people almost god-like".

This is dangerous nonsense, as the claim to infer criminality from facial features shows. It is true that Big Data gives us many new opportunities. In some cases, accurate prediction is all we need, and as we have argued in a recent paper, prediction is easy, cheap and unproblematic compared to causal inference. Where simple predictions can help, we should go ahead. We absolutely should be bringing the methods and insights of machine-learning into the mainstream of statistics (this is a large part of what I try to do in my research). Neil Lawrence has said that Neural Networks are "punk statistics", and by God statistics could do with a few punks! But we should not pretend that simply having a more sophisticated model, and a huge data set, absolves us of the statistical problems that have plagued analysts for centuries when testing scientific theories. Our models must be designed precisely to account for possible confounding factors, and we still need controlled studies to carefully assess causality. As computer scientists should know: garbage in, garbage out.


This is not a plea for researchers to 'stay in their lane'. I think criminology and statistics both need fresh ideas, and many of the smartest people I know work in machine-learning. We should all be looking for new areas to apply our ideas in. But working in a new field comes with some responsibility to learn the basic issues in that area. Almost everyone in biology or social science has a story about a physicist who thought they could solve every problem in a new field with a few simple equations, and I don't want data scientists to do the same thing. I fear that if modern data science had been invented before the discovery of the Theory of Gravity, we would now have computers capable of insanely accurate predictions of ballistics and planetary motions, and absolutely no idea how any of it really worked.



Sunday, August 21, 2016

A Bayesian Olympics medals table

As the 2016 Rio Olympics draw to a close, much of the media coverage here in the UK focuses on how many medals Team GB has won, and how this compares to other countries and to previous Olympics. Team GB has done particularly well this year, rising to 2nd in the medal table (as of Sunday afternoon) and increasing the number of medals won compared to London - the first time a host country has improved its medal haul in the subsequent Olympics.

The medal table has become an increasingly prominent feature of the Olympics (at least in the UK). Many people have pointed out a simple flaw in looking at a country’s position in the table as a measure of its sporting ‘quality’ (whatever that means): larger countries win more medals, simply by having more people. The USA, China and in the past the Soviet Union have been large countries dominating the upper echelons of the table. The obvious way to compare countries ‘fairly’ is to look at a per capita medal table. One website that has done this places the Bahamas at the top of its list of per capita gold medals. On the one hand, correcting for population size in this way seems like a sensible thing to do if you want to know whether a country performed well for its size or not. But I can’t help noticing that of the top 10 countries in this list, none has a population of more than 10m people, and two have populations below 1m. A single gold medal for the Bahamas puts them top of the list. This suggests to me that places at the top of the per capita table are likely to be the result of statistical noise - whichever of the many small countries competing manages to win one gold tops the table.
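To see how much of this can be pure noise, here is a small simulation in R. It is a sketch with made-up inputs: the 200 countries, their populations and the total of 300 golds are all hypothetical, and every country is assumed to have exactly the same underlying rate of golds per person.

```r
set.seed(2016)

# Hypothetical world: 200 countries, populations from 100,000 to 1 billion
n_countries <- 200
population <- round(10^runif(n_countries, 5, 9))

# ASSUMPTION: every country has the same underlying golds-per-person rate
total_golds <- 300
rate <- total_golds / sum(population)
golds <- rpois(n_countries, rate * population)

# Rank by golds per capita and inspect the top ten
per_capita <- golds / population
top10 <- order(per_capita, decreasing = TRUE)[1:10]
print(data.frame(population = population[top10], golds = golds[top10]))

# Compare the typical size of a top-ten country with all gold winners
median_top10 <- median(population[top10])
median_winners <- median(population[golds > 0])
print(c(top10 = median_top10, winners = median_winners))
```

Even though no country here is better than any other, the top of the per capita table tends to fill up with the smaller countries that happened to win a gold.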

A more robust solution is to treat the medal table as a statistical sample that is generated from the underlying sporting quality of each country, and to try to infer this quality from the data that we observe. To do this we can use Bayesian inference. Let the quality of a country in Olympic sport be represented by a single number, \(q\), such that the expected number of gold medals that country will win is \(qN\), with \(N\) being the population of the country (I’ll ignore complications about differing proportions of athlete-age population). Bayes’ rule tells us that our belief about the quality of a country should be represented by a probability distribution that combines our prior beliefs about \(q\), \(P(q)\), and the likelihood of observing the medals we saw given a specific value of \(q\), \(P(\textrm{# Golds = g} \mid q)\): \[ P(q \mid \textrm{# Golds = g}) \propto P(q)P(\textrm{# Golds = g} \mid q) \] The likelihood is easy to define. Given that gaining a gold is a rare event, the number of golds won should follow a Poisson distribution. Therefore: \[ P(\textrm{# Golds = g} \mid q) = \frac{(qN)^g \exp(-qN)}{g!} \] For the prior distribution of \(q\) we can use the Principle of Maximum Entropy: we use a distribution that has the most uncertainty given the facts that we know. We know what the mean number of golds per person over the whole world must be, since the total number of golds, \(G\), and the world population, \(N_W\), are fixed at the time of the Olympics.
The maximum-entropy distribution defined over positive numbers and with a known mean is the exponential distribution: \[ P(q) = \frac{N_W}{G}\exp(-\frac{qN_W}{G}) \] Putting this together and discarding constants we get \[ P(q \mid \textrm{# Golds = g}) \propto q^g \exp \left(-q\left(N + \frac{N_W}{G}\right) \right) \] If we want a single number to represent this distribution we should use the mean value \(\bar{q} = \int_0^1 qP(q \mid \textrm{# Golds = g}) dq\), which we can calculate as below: \[ \bar{q} = \frac{\int_0^1 q^{g+1} \exp \left(-q\left(N + \frac{N_W}{G}\right)\right)dq}{\int_0^1q^{g} \exp \left(-q\left(N + \frac{N_W}{G}\right)\right)dq} \\ = \frac{g+1}{N + \frac{N_W}{G}}\frac{1-\exp(-(N + \frac{N_W}{G}))\sum_{i=0}^{g+1} \frac{(N + \frac{N_W}{G})^i}{i!}}{1-\exp(-(N + \frac{N_W}{G}))\sum_{i=0}^{g} \frac{(N + \frac{N_W}{G})^i}{i!}} \] where the final step is done using repeated integration by parts. In practice the exponential terms in the final expression tend to be extremely small, so this can be approximated as \(\bar{q} = \frac{g+1}{N + N_W/G}\). This shows what effect the Bayesian prior has: the simple per capita estimate is just \(\frac{g}{N}\); using the prior effectively increases the medal count by 1 and the population count by \(N_W/G\), the worldwide number of people per medal, so it is as if the country got one more gold medal at the cost of having an additional population of the worldwide average needed to do this.
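As a quick worked check of these formulas, the sketch below computes both the exact (truncated) posterior mean and the approximation. The function name `posterior_mean` is introduced here for illustration, and the value of \(G\) is an assumption, so the numbers need not match the table later in this post exactly. It relies on the fact that \(1-\exp(-a)\sum_{i=0}^{k} a^i/i!\) is the upper tail of a Poisson distribution, which R provides as `ppois(k, a, lower.tail = FALSE)`.

```r
# Posterior mean of q: exact (truncated to [0, 1]) and approximate forms.
# ppois(k, a, lower.tail = FALSE) equals 1 - exp(-a) * sum_{i=0}^{k} a^i / i!
posterior_mean <- function(g, N, N_W, G, exact = TRUE) {
  a <- N + N_W / G
  approx <- (g + 1) / a
  if (!exact) return(approx)
  approx * ppois(g + 1, a, lower.tail = FALSE) /
    ppois(g, a, lower.tail = FALSE)
}

N_W <- 7.4e9  # world population
G <- 307      # ASSUMED total number of golds, for illustration only

# Bahamas: 1 gold, population 388,019
per_capita <- 1 / 388019
q_exact <- posterior_mean(1, 388019, N_W, G)
q_approx <- posterior_mean(1, 388019, N_W, G, exact = FALSE)
print(c(per_capita = per_capita, exact = q_exact, approx = q_approx))
```

For any realistic population the two versions agree to within floating-point precision, and both sit far below the raw per capita figure: one gold is not enough evidence to move a tiny country far from the prior.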

So I’m sure if you’ve slogged through the mathematics this far you’re dying to know what the Bayesian medal table actually looks like. Here is the R code used to do the above calculations, and then finally the medal table:

library(knitr)

#Read in data
medal_table = read.delim("medal_table.txt")
medal_table$Population = as.numeric(gsub(",", "", as.character(medal_table$Population)))


#Define prior distribution mean parameter
world_pop = 7.4e9
prior_mean = sum(medal_table$Gold)/world_pop

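#Define useful function for calculating posterior mean:
#returns 1 - exp(-n) * sum_{i=0}^{k} n^i / i!, working in logs for stability
myf <- function(n, k){
  s = rep(0, k + 1)
  for (ii in 0:k){
    #R vectors are 1-indexed, so the i = 0 term is stored at position 1
    s[ii + 1] = -n + ii*log(n) - lfactorial(ii)
  }
  
  y = 1 - sum(exp(s))
  return(y)
}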

#Loop over countries and calculate posterior mean 
medal_table$Quality = rep(NA, dim(medal_table)[1])
for (i in 1:dim(medal_table)[1]){
  k = medal_table$Gold[i]
  n = medal_table$Population[i] + 1/prior_mean
  
  #Calculate mean of the posterior distribution
  medal_table$Quality[i] = ((k+1)/n)*myf(n, k+1)/myf(n, k)
  
}

#Order results by quality and print
medal_table_print=medal_table[order(medal_table$Quality, decreasing=TRUE), c("Country", "Gold", "Population", "Quality")]
#Print only countries with quality higher than the prior
medal_table_print = medal_table_print[which(medal_table_print$Quality > prior_mean), ]
row.names(medal_table_print) <-NULL

kable(medal_table_print, digits = 9)
Country Gold Population Quality
Great Britain 27 65138232 3.13e-07
Hungary 8 9844686 2.64e-07
Jamaica 6 2725941 2.60e-07
Netherlands 8 16936520 2.19e-07
Croatia 5 4224404 2.11e-07
Australia 8 23781169 1.88e-07
New Zealand 4 4595700 1.74e-07
Germany 17 81413145 1.70e-07
Cuba 5 11389562 1.69e-07
United States 46 321418820 1.36e-07
South Korea 9 50617045 1.34e-07
Switzerland 3 8286976 1.23e-07
France 10 66808385 1.21e-07
Russian Federation 19 144096812 1.19e-07
Greece 3 10823732 1.14e-07
Spain 7 46418269 1.13e-07
Georgia 2 3679000 1.08e-07
Italy 8 60802085 1.06e-07
Slovakia 2 5424050 1.01e-07
Denmark 2 5676002 1.00e-07
Kenya 6 46050302 1.00e-07
Serbia 2 7098247 9.60e-08
Kazakhstan 3 17544126 9.60e-08
Uzbekistan 4 31299500 9.00e-08
Sweden 2 9798871 8.80e-08
Japan 12 126958472 8.60e-08
Belgium 2 11285721 8.50e-08
Canada 4 35851774 8.30e-08
Bahamas 1 388019 8.10e-08
Fiji 1 892145 8.00e-08
Bahrain 1 1377237 7.80e-08
Kosovo 1 1859203 7.70e-08
Slovenia 1 2063768 7.60e-08
Armenia 1 3017712 7.40e-08
Puerto Rico 1 3474182 7.20e-08
Singapore 1 5535002 6.70e-08
Jordan 1 7594547 6.30e-08
Tajikistan 1 8481855 6.10e-08
North Korea 2 25155317 6.10e-08
Belarus 1 9513000 5.90e-08
Argentina 3 43416755 5.90e-08
Azerbaijan 1 9651349 5.90e-08
Czech Republic 1 10551219 5.80e-08
Colombia 3 48228704 5.50e-08
Poland 2 37999494 4.80e-08
Romania 1 19832389 4.50e-08
Ukraine 2 45198200 4.30e-08
Cote d’Ivoire 1 22701556 4.30e-08
Taiwan 1 23510000 4.20e-08

Team GB tops the chart! Mathematically, this is because GB combines a high rate of medals per capita with a large population. Therefore it has the statistical weight to move the inferred value of \(q\) away from the prior expectation. Smaller countries with several golds, like Jamaica, also do well, but tiny Bahamas is now much further down the list - 1 gold medal just isn’t enough information to tell you much about the underlying rate at which a country tends to win golds.

You could easily extend this analysis by aggregating the results of previous Olympics too. With data from more years there would be more evidence to move the quality of smaller countries away from the prior. In terms of predicting the future performance of countries, you would need to decide on an appropriate weighting of past results, which you could in principle do by trying to make a predictive model for the 2016 results from 2012, 2008 etc. Data from Rio and previous Olympics is available here.
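Here is a minimal sketch of what that pooling could look like, using the approximate posterior mean. The weighting scheme is my own assumption (each past Games contributes its golds and population multiplied by a weight between 0 and 1), the function name `pooled_quality` is introduced for illustration, and the country data and value of \(G\) are hypothetical.

```r
# Pooled posterior mean of q across several Games (approximate form).
# weights = 1 pools all Games equally; smaller weights discount older results.
pooled_quality <- function(golds, population, weights, N_W, G) {
  (sum(weights * golds) + 1) / (sum(weights * population) + N_W / G)
}

N_W <- 7.4e9  # world population
G <- 307      # ASSUMED golds per Games, for illustration only

# HYPOTHETICAL small country: golds and population at three successive Games
golds <- c(1, 0, 1)
population <- c(380000, 385000, 388000)

single_games <- pooled_quality(golds[3], population[3], 1, N_W, G)
equal_weights <- pooled_quality(golds, population, c(1, 1, 1), N_W, G)
discounted <- pooled_quality(golds, population, c(0.25, 0.5, 1), N_W, G)
print(c(single = single_games, pooled = equal_weights,
        discounted = discounted))
```

With more Games in the pool, a small country that keeps winning the odd gold moves further from the prior than it could on the evidence of a single Olympics. Choosing the weights is the part that would need a proper predictive model.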

Additional note: this is my first blog post written entirely in R Markdown.