# How To Understand Statistics

Created | Updated Jul 25, 2011

The world is littered with statistics, and the average person is bombarded with five statistics a day^{1}. Statistics can be misleading and sometimes deliberately distorting. There are three kinds of commonly recognised untruths:

Lies, damn lies and statistics.

- Mark Twain

This quote, which Mark Twain popularised and himself credited to Benjamin Disraeli, is accurate; statistics are often used to lie to the public because most people do not understand how statistics work. The aim of this entry is to acquaint the reader with the basics of statistical analysis and to help them determine when someone is trying to pull a fast one.

Think about how stupid the average person is; now realise half of them are dumber than that.

- George Carlin

There are many books that teach statistics, but most are big, heavy, expensive mathematical texts that may require a degree in the subject to understand anyway. For many years there has been a need for a *Statistics for Dummies* book, and in fact there is one, written by Deborah Rumsey. Information on how to understand statistics can also be found on the Internet, though most sites cater for medical students who need to examine experimental drug studies. A great online starting place is RobertNiles.com, which explains how to examine statistics for errors and how to create your own statistics correctly.

### Some Examples of Misleading Statistics

#### Women are Better Drivers Than Men

This is not the same thing as saying that all women are better drivers than men, although many people, by the look of some insurance company advertisements, seem to think that that is exactly what it means. In fact it simply shows that, on average, a woman between the ages of 20 and 65 who drives a car will have had fewer accidents than a man of the same age, driving the same car. The data is drawn almost exclusively from insurance company statistics. It may not, however, be accurate, as few people bother to alert their insurers if they clip the wing mirror or scratch the paint.

Here is another, rather famous use of distorting statistics...

#### Toddlers who Attend Pre-school Exhibit Aggressive Behaviour

A study was conducted on four-year-olds, comparing those who went to pre-school and socialised with other children with those who stayed at home with their mothers. It measured aggressive behaviour such as stealing toys, pushing other children and starting fights.

It showed that children who went to pre-school were three times more likely to be aggressive than those who stayed at home with their mothers. The statistics were well documented and were, technically, accurate. The report used these statistics to persuade parents to keep their children at home until they start school, aged five.

What the study failed to mention was that aggressive behaviour is normal in four-year-olds. Parents who keep their children at home but take them to toddler groups also observe their children being aggressive. Psychologists say it is the child learning about society's 'pecking order'. The children who stayed at home and did not attend pre-school were less aggressive, but that made their behaviour the abnormal one. A follow-up survey (done by another group) demonstrated that the children who stayed at home before attending school ended up being more aggressive at a later age than those who had gone to pre-school.

In other words, the children who attended pre-school were 'normal', for want of a better word. The ones who stayed at home with their mothers were not.

The initial study was funded by a mother support group. They used the statistics to promote their own, pre-determined agenda. This illustrates the first rule of dealing with statistics: *always ask who's paying for a study*^{2}.

#### First World War Head Injuries

Another strange statistical anomaly followed the introduction of tin helmets to the front line. In the First World War the number of head injuries was very high and soldiers took a long time to recover. To begin with, the soldiers only had cloth hats to wear, but after the introduction of tin hats the number of injuries to the head increased dramatically. No one could explain it, until it was revealed that the records only counted injuries, not fatalities. After the introduction, the number of fatalities dropped dramatically, but the number of injuries went up: the tin helmet was saving soldiers' lives, and men who would previously have been killed were now recorded as injured instead. This demonstrates the second rule of statistical interpretation: *which question is being asked?* A leading or misleading question used to gather statistics can result in misleading statistics.

The examples above demonstrate that statistical conclusions can be misleading, and can even be used to make something false appear to be true. A good eye for spotting irregularities in statistical interpretation is a useful skill.

### Things to Look Out For

47.3% of all statistics are made up on the spot.

- Steven Wright

Where did the data come from? Who ran the survey? Do they have an ulterior motive for having the result go one way?

How was the data collected? What questions were asked? How did they ask them? Who was asked?

Be wary of comparisons. Two things happening at the same time are not necessarily related, though statistics can be used to show that they are. This trick is used a lot by politicians wanting to show that a new policy is working.

Be aware of numbers taken out of context. This is called 'cherry-picking': the analysis concentrates only on the data that support a foregone conclusion and ignores everything else.
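A rough sketch shows why cherry-picking works so well when enough things are measured. It assumes, purely for illustration, that every comparison is tested at the conventional 5% significance level, that the comparisons are independent, and that none of the effects is real. The function name and the constant are made up for this example.

```python
ALPHA = 0.05  # assumed significance level, chosen for illustration


def chance_of_spurious_finding(num_comparisons: int, alpha: float = ALPHA) -> float:
    """Probability that at least one of several independent tests of
    non-existent effects looks 'significant' purely by chance."""
    return 1 - (1 - alpha) ** num_comparisons


for n in (1, 10, 100, 1000):
    p = chance_of_spurious_finding(n)
    expected_flukes = n * ALPHA  # average number of chance 'findings'
    print(f"{n:>5} comparisons: P(at least one fluke) = {p:.3f}, "
          f"expected flukes = {expected_flukes:.1f}")
```

Under these assumptions, a study that looks at a thousand variables can expect dozens of impressive-looking results that mean nothing, which is plenty to fill a press release.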

A survey on the effects of passive smoking, sponsored by a major tobacco manufacturer, is hardly likely to be impartial, but on the other hand neither is one carried out by a medical firm with a vested interest in promoting health products.

If a survey on road accidents claims that cars with brand X tyres were less likely to have an accident, check who took part. The brand X tyres may be new, and only fitted to new cars, which are less likely to be in accidents anyway.

Check the area covered by a survey linking nuclear power plants to cancer. The survey may have excluded sufferers who fall outside a certain area, or have excluded perfectly healthy people living inside the area.

Do not be fooled by graphs. The scale can be manipulated to make a perfectly harmless bar chart look worrying. Be wary of the use of colours. A certain chewing gum company wanted to show that chewing gum increases saliva. The chart showed the increase in danger to the gums after eating in red and safe time after chewing in blue. However the chart showed that the act of chewing would have to go on for 30 minutes to take the line *out of* the danger zone. The curve was just coloured in a clever way to make it look like the effect was faster.
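The effect of a manipulated scale can be put into numbers. The sketch below uses hypothetical figures (50 incidents one year, 52 the next) and a made-up helper, `apparent_ratio`, to compare how tall the two bars are drawn when the vertical axis starts at zero and when it starts just below the smaller value.

```python
def apparent_ratio(small: float, large: float, axis_start: float = 0.0) -> float:
    """How many times taller the larger bar is drawn than the smaller one
    when the vertical axis begins at axis_start."""
    if axis_start >= small:
        raise ValueError("the axis must start below the smaller value")
    return (large - axis_start) / (small - axis_start)


# Hypothetical data: a rise from 50 to 52, which is a 4% increase.
honest = apparent_ratio(50, 52)         # axis starts at zero
stretched = apparent_ratio(50, 52, 49)  # axis starts at 49

print(f"axis from 0:  second bar drawn {honest:.2f} times as tall")
print(f"axis from 49: second bar drawn {stretched:.2f} times as tall")
```

The data are identical in both cases; only the starting point of the axis changes, yet the second chart makes a 4% rise look like a tripling.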

Perhaps the most important thing to check for is sample size^{3} and margin of error. With small samples, a change in a single data item can completely change the results. Small samples can sometimes be the only way to get the analysis done, but generally the bigger the sample size, the more accurate the results are and the less likely a single error in sampling will affect the analysis. For example, people will go on about how 95% of children passed their exams at one school and 92% passed at another, but the sample sizes are not actually big enough for the difference to be statistically significant: in a year group of 100, a gap of three percentage points is a difference of three students, which is too few to rule out chance.
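The school example can be checked with a standard two-proportion z-test. This is a minimal sketch rather than a full analysis: it assumes both year groups have exactly 100 pupils, and it uses a normal approximation that is only rough when pass rates are this close to 100%. The function name is invented for the example.

```python
import math


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates,
    using the pooled two-proportion z-test (normal approximation)."""
    rate_a, rate_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


small_schools = two_proportion_p_value(95, 100, 92, 100)
large_schools = two_proportion_p_value(9500, 10000, 9200, 10000)

print(f"100 pupils each:    p = {small_schools:.3f}")
print(f"10,000 pupils each: p = {large_schools:.3g}")
```

With 100 pupils per school the p-value is far above the usual 0.05 threshold, so the gap could easily be chance. The same percentages from 10,000 pupils per school would give a vanishingly small p-value, which is why the sample size matters as much as the headline figures.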

### The Problem with Statistics

The main problem with statistics is that people like favourable numbers to back up a decision. For example, when choosing an Internet provider, most people will choose the one with the most customers. But that statistic does not tell you other useful things, such as what their customer turnover might be, how reliable their connection is, or what the mean time taken to answer a technical fault call is. People will simply assume that a lot of customers means the company should be all right. Generally this is true, but there are companies which work by having a large body of customers, providing bad service and making it hard for people to cancel their agreements. Just because a company is the most popular does not automatically mean it is the best.

Common sense can cloud statistical results. For instance, a technology firm discovered that 40% of all sick days were taken on a Friday or a Monday. They immediately clamped down on sick leave before they realised their mistake. Forty per cent represents two days out of a five-day working week and is therefore a normal spread, rather than a reflection of swathes of feckless opportunists trying to extend their weekends.
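The firm's mistake is easy to check before any clamping down. The sketch below assumes, for illustration, that illness is equally likely to strike on any working day; the function name is made up for this example.

```python
def expected_share(days_of_interest, working_days):
    """Share of sick days expected on the chosen days if illness is
    equally likely on every working day."""
    chosen = set(days_of_interest) & set(working_days)
    return len(chosen) / len(working_days)


week = ["Mon", "Tue", "Wed", "Thu", "Fri"]
baseline = expected_share(["Mon", "Fri"], week)
observed = 0.40  # the figure the firm found

print(f"expected by chance: {baseline:.0%}, observed: {observed:.0%}")
```

A figure is only surprising when it differs from the baseline, so the baseline has to be worked out first.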

Fundamental to much of the mathematics of probability is the assumption that events are independent of each other, as dice rolls or coin flips are. If they are not independent, the simple maths stops working and the answers stop making sense. However, a lot of statistics are worked out at a distance from the core events, so checking whether the results are valid can be next to impossible. The opposite mistake is made by the gambler who thinks his luck must change soon because he couldn't continue to have bad luck all night. This is wrong; the dice have no memory, and there's nothing to say they should start rolling your way based on previous behaviour.
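The gambler's mistake can be demonstrated without any simulation by listing every possible outcome. This sketch assumes a fair coin, with heads counted as a win and tails as a loss; the function name is invented for the example.

```python
from fractions import Fraction
from itertools import product


def win_chance_after_losing_streak(streak: int) -> Fraction:
    """Exact chance that a fair coin lands heads (a win) immediately after
    `streak` tails (losses) in a row, found by listing every equally
    likely sequence of streak + 1 flips."""
    sequences = list(product("HT", repeat=streak + 1))
    # Keep only the sequences that begin with the losing streak.
    after_streak = [s for s in sequences if all(flip == "T" for flip in s[:streak])]
    wins = [s for s in after_streak if s[-1] == "H"]
    return Fraction(len(wins), len(after_streak))


for streak in (0, 1, 5, 10):
    print(f"after {streak:>2} losses in a row: chance of a win = "
          f"{win_chance_after_losing_streak(streak)}")
```

However long the losing streak, the chance of winning the next flip stays at exactly one half, because independent events carry no memory of what went before.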

#### Legal History

A more serious problem was highlighted in a court case in which an innocent man, who denied being at a crime scene, was faced with fingerprint evidence. The prosecution presented a fingerprint expert in court and asked:

**Prosecution** - 'Assuming that the defendant did not commit this crime, what is the probability of the defendant and the culprit having identical fingerprints?'

**Expert** - 'One in several billion.'

**Prosecution** - 'Thank you.'

**Defence lawyer** - 'Let me ask you a different question. What is the probability that a fingerprint lifted from a crime scene would be wrongly identified as belonging to someone who wasn't there?'

**Expert** - 'Oh, about 1 in 100.'

It's all about the question asked. The defendant's fingerprints had been incorrectly identified as being the same as the ones lifted from the scene. Several subsequent expert examinations showed that the fingerprints were not the same, even though the fingerprint evidence was submitted in court as fact. Fingerprint identification is not a fact; it is a science, and is governed by probabilities.

Other cases involving cot deaths have raised serious questions about the presentation of statistics from experts in court. All too often these are presented as fact in a case. One such case is the story of Sally Clark, who served three years in prison before having her conviction overturned by the Appeal Court in January 2003. In her case, as with several others in recent years, evidence from expert witnesses stating that multiple cot deaths in a single family were almost impossible led to the assumption that the deaths were murders. This was presented as scientific fact, and the jury did not analyse the statistics. In actual fact multiple cot deaths in a family are not independent^{4}, so the odds against a second death are much lower than the court was told, to such an extent that when the third child dies, cot death is the most likely cause even before a post mortem is carried out. Calling mothers of multiple cot deaths serial murderers is analogous to assuming all air crashes are caused by pilot error.
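The difference that independence makes can be sketched with the multiplication rule, P(A and B) = P(A) × P(B given A). The single-death figure of 1 in 8,543 is the one widely reported from the Clark trial. The tenfold multiplier for a second death is purely hypothetical and is there only to show the direction and size of the effect; it is not a measured value.

```python
from fractions import Fraction


def chance_of_two_events(p_first: Fraction, p_second_given_first: Fraction) -> Fraction:
    """P(A and B) = P(A) * P(B given A). Only when the events are
    independent does P(B given A) equal the plain P(B)."""
    return p_first * p_second_given_first


p_single = Fraction(1, 8543)  # widely reported single-death figure

# Treating the deaths as independent: simply square the single-death figure.
naive = chance_of_two_events(p_single, p_single)

# HYPOTHETICAL: suppose shared genes or environment make a second death
# ten times more likely once a first has occurred.
HYPOTHETICAL_RISK_MULTIPLIER = 10
dependent = chance_of_two_events(p_single, p_single * HYPOTHETICAL_RISK_MULTIPLIER)

print(f"assuming independence: 1 in {round(1 / naive):,}")
print(f"allowing dependence:   1 in {round(1 / dependent):,}")
```

Squaring is only valid if the two deaths are unrelated. Any shared cause shrinks the odds against a second death, and the correction grows with every further event in the same family.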

#### No Average

The main thing statistics shows is that there is no such thing as average. If the average in question is the median, then half of a company's employees must be below average in productivity, however productive they all are. Changing the definition of productivity will not help: half will always fall below the median. The same holds for the mean only when the distribution is symmetrical, as in bell curve graphs.

This demonstrates another problem people have in interpreting statistics. Many people try to make their statistics fit the normal distribution, but there are also *non-normal* distributions, and the statistics used for normal distributions are often inappropriate when the distribution is patently non-normal.

Many people think that 'mean' means the same thing as 'average'. It doesn't; mean is a mathematical term. Average is often used as a description for a person or data item, but in mathematics it means 'a number that typifies a set of numbers of which it is a function'. In other words, average can mean mean, median or mode.

- The median is the middle value in a distribution, above and below which lie an equal number of values.
- The mean is what most people have in mind when they say 'average': for the arithmetic mean, the sum of the values divided by how many values there are. Other kinds exist, such as the geometric mean.
- Mode is the value or item occurring most frequently in a series of observations or statistical data.

Example data 1: 2, 5, 5, 6, 9, 12, 15

Analysing the data, we get mean: 7.71, median: 6, mode: 5

Example data 2: 4, 5, 5, 5, 8, 12, 86

Analysing this data, we get mean: 17.86, median: 5, mode: 5. The single large value of 86 drags the mean above six of the seven values, while the median and mode stay put.
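The three kinds of average can be checked with Python's standard `statistics` module. This is a minimal sketch; the helper names `three_averages` and `share_below_mean` are made up for the example.

```python
import statistics

data_1 = [2, 5, 5, 6, 9, 12, 15]
data_2 = [4, 5, 5, 5, 8, 12, 86]


def three_averages(values):
    """Return the mean (rounded to two places), median and mode."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


def share_below_mean(values):
    """Fraction of the values that lie below the arithmetic mean."""
    mean = statistics.mean(values)
    return sum(1 for v in values if v < mean) / len(values)


for name, data in (("data 1", data_1), ("data 2", data_2)):
    print(name, three_averages(data), f"below mean: {share_below_mean(data):.0%}")
```

In the second data set six of the seven values sit below the mean, which shows that 'half are below average' is only guaranteed when the average in question is the median.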

Statistics do have a sort of magical appeal. They appear to the untrained eye to be based on complex maths that is difficult to understand. This is rubbish: statistics are easy to create. Accurate statistics are much more difficult to calculate.

Statistics are governed by a term used to describe computer problems: 'GIGO', or 'Garbage In, Garbage Out'. If the survey asked the wrong question, asked the wrong group of people or was subject to any other major problem, there is no statistical analysis method in the world that can create meaningful information from the raw data. There are some techniques that can correct small errors, but the more small errors corrected, the less accurate the results will be.

### Fun With Statistics

Statistics can create some unusual mental games, with interesting answers. They can be great conversation starters at parties and can be fun to baffle your friends. They're a bit like mathematical magic tricks.

### More Information

For more information on statistics, check out National Statistics Online or RobertNiles.com, a good source of statistical analysis for the beginner. There's also *Cartoon Guide to Statistics*, by Larry Gonick.

If you enjoyed reading this entry, you may like to read: Things to Consider when Reading Medical Research

^{1}This is an example of a made-up statistic.

^{2}So often a company will collect statistics on hundreds of variables and perhaps calculate 1000 more from those original hundreds, and then present only the two or three most positive findings to the public.

^{3}That is to say, the total number of things surveyed for the purposes of a study.

^{4}There may even be a cot death gene that affects child mortality.