My Journey as a Librarian: Statistics - the absolute basics and when to ignore numbers

Average - this is a number we choose to best represent others. There are 3 kinds of average. I learnt this in primary school.

- mean: this is the most used. It is simply the sum of all the numbers divided by the no. of values (I mean the numbers, not the ones your mom should have given you). The mean is useful when you want to share something out equally, since it is what each would get if the total were split evenly. It is most representative when there are no extreme numbers that deviate from the rest.

- median: this simply refers to the middle value when all the numbers are arranged in sequence. If there is an even no. of values, we use the mean of the middle two. Given a median, you know that there are as many values above it as below it. A median is useful when there are some outlying values (values that are uncharacteristically large or small) that might distort your average value if you use the mean. Eg. if you calculate the prices of public housing around the world, the median and the mean may be quite different because of Singapore HDB prices. The median would likely be more representative.

- mode: this refers to the value that occurs most often. There can be more than one mode or none at all (ie. all values occur the same no. of times). If you pick a number at random from the set, it is more likely to be the mode than any other single value.

Using all three can tell us different things about the data set.
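As a rough illustration of the three averages, here is a small Python sketch using the standard `statistics` module. The figures in `daily_loans` are made up for the example (they are not real library data), with one unusually busy day thrown in to show how an outlier pulls the mean.

```python
from statistics import mean, median, multimode

# Hypothetical figures: loans per day over one week, with one unusually busy day
daily_loans = [12, 15, 15, 18, 20, 22, 150]

loans_mean = mean(daily_loans)        # sum of the values divided by how many there are
loans_median = median(daily_loans)    # middle value once the list is sorted
loans_modes = multimode(daily_loans)  # every value tied for "occurs most often"

print("mean:", loans_mean)
print("median:", loans_median)
print("mode(s):", loans_modes)
```

With these made-up numbers the mean works out to 36 while the median is 18 and the mode is 15, so the single busy day doubles the mean but leaves the other two untouched.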

Deviation - this tells us how much the various values differ from the mean. Usually, standard deviation is used.
eg {5, 10, 15} and {0, 10, 20} both have a mean of 10 but drastically different deviation.
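The two sets in the example can be checked with a few lines of Python. This is only a sketch: it uses `pstdev`, which treats each list as the whole population; for a sample drawn from a larger group, `stdev` would be the usual choice instead.

```python
from statistics import mean, pstdev

narrow = [5, 10, 15]
wide = [0, 10, 20]

for values in (narrow, wide):
    # pstdev = population standard deviation (the list is treated as all the data there is)
    print(values, "mean:", mean(values), "std dev:", round(pstdev(values), 2))
```

Both lists share the same mean, but the second one has twice the standard deviation of the first.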

Statistical Significance - this tells us how sure we are that a particular result is true and not due to randomness. For example,
if we throw a dice 6 times and we get "three" 4 out of the 6 times, how confident are we that the dice is biased rather than it being a matter of 'luck'?
Similarly, how many more people should we find have borrowed from the romance section than the cookery section for us to be 95% sure that romance books are really more popular?
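For the dice example, one way to put a number on 'luck' is to ask how often a fair dice would show "three" at least 4 times in 6 throws. The sketch below works this out with the binomial formula. The helper name `prob_at_least` is my own, and the calculation assumes we had singled out "three" before throwing; if we only noticed whichever face came up most, the chance would be higher.

```python
from math import comb

def prob_at_least(k, n, p):
    """Chance of k or more successes in n independent tries, each with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Fair dice: chance of "three" on any one throw is 1 in 6
p_value = prob_at_least(4, 6, 1 / 6)
print(round(p_value, 4))
```

The result is a little under 1%, so a fair dice would rarely behave this way, though it is not impossible.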

This calculation is needed because typically we cannot obtain all the data (eg survey every single person) but use a sample instead, so randomness comes into the picture.

Caution: Statistical significance depends on a true random sample. Otherwise the significance level can be misleading. If we are obtaining feedback from all participants, there is no need for this, unless we are using their feedback to infer the preferences of the population at large. Then more will need to be done statistically.

Examples of Applications:
1) The use of deviation in loans or in-library collection size can help us figure out how much shelf space we really need. After all, the amount of shelf space depends more directly on how many books are left in the library at any time than on the overall collection size.

2) We may receive feedback that a particular timing is preferred for a programme. Given that the feedback is from a random sample, how sure are we that the indicated timing is really better for the community?

3) Librarians are told to pick 5 children from every session of a programme for feedback (because it is logistically impossible to poll everyone). If the resulting positive feedback has a 0.05 significance level (ie 95% confidence) can we be sure the programme is good? This actually depends on how randomly the librarian selected the children. Usually a bias towards picking favourable and more forthcoming children will come into play.

4) There seems to have been a recent rise in loans. Can we be 99% sure that this is a trend rather than a 'random' variation?

Ignoring Statistics
Limits of Statistics - There will be times when we wish to ignore the statistics given. These may be times when broad strokes alone are not helpful. For example, a parent whose child was murdered won't care that their neighbourhood is actually very safe or that their child's death does not add significantly to the mortality rate. To insist on the numbers in such a case is to allow a tyranny of the majority over the needs of the minority. Some needs are so important that zero tolerance may be the way to go.
Statistics are descriptive. Whether or not something is statistically significant does not change reality; it only describes how sure we are that the result reflects reality. Statistically, it is impossible to be a hundred percent sure unless we use the entire population as a sample. Thus there can be occasions when the best statistics are wrong and the worst prove to be right. To assess by statistics alone is to play a game purely on probability, and a strategy that is very likely to succeed can still fail (see below). All things being equal, the larger the sample size the more sure we can be, but the flip side is that the cost of collecting the data rises with it.
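To put rough numbers on the link between sample size and certainty, here is a small sketch of the usual approximate margin of error for a survey proportion. It assumes a simple random sample and the normal approximation; the function name `margin_of_error` and the sample sizes are just for illustration.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p seen in a random sample of size n."""
    return z * sqrt(p * (1 - p) / n)

# p = 0.5 is the worst case (widest margin) for a yes/no question
for n in (25, 100, 400, 1600):
    print(n, round(margin_of_error(0.5, n), 3))
```

Under these assumptions the margin only halves each time the sample is made four times larger, so extra certainty gets more expensive the further we push.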

Moreover, the brain is more complex than mere statistics and is capable of heuristics that can be chillingly more efficient and accurate, depending on the relevant experience a person has. Humans as a group can exhibit irrationality that defies simplistic mathematical models, and the seemingly improbable is not impossible. Quantum mechanics and the normal curve include an implicit awareness of this. Statistically air travel is safer than land travel, but planes still crash every so often. It is near impossible for an individual to strike the lottery, yet someone does win. Marketing is deeply aware of this, so you will rarely see an ad that appeals by logic alone. Every marketing technique I've read about appeals to intuition and emotions (that is how many people make decisions). The statistics marketers use are direct results from testing the effectiveness of various methods. So while statistics can be helpful, making purely logical inferences from them can be misleading. I have no examples here: this is an area I am only beginning to explore. This is the limit of my understanding so far.

Incomplete picture - One typical manner in which statistics can be misleading is when only certain sides are presented. We may say that a particular organisation only has a 5% penetration rate in a particular section of the market. But what's missing is how large and profitable the market is, what the ROI is for various products, and how saturated the market is. Time and manpower (including training hours) are often neglected when computing costs. The scalability and replicability of a model are important but are currently not included in most calculations used.

Beware of probability - we have all heard of how a dice gives an equal chance for any number to appear. Yet we have all had times when we threw a succession of 6's or 1's. The skinny on probability is that it is ultimately a statistic that assumes the event is repeated an infinite number of times. Thus any programme based solely on probability can be dangerous unless there are enough resources and time to run it many, many times. Even then, there may often be better ways. And many types of events are independent, ie. past results do not and never will predict future ones. You could have thrown 100 6's on a dice and the probability of getting a 6 (or any number from 1 to 5) on the next throw is still 1 in 6.
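The point about independent throws can be checked by brute force rather than taken on faith. The sketch below lists every possible result of three throws of a fair dice, keeps only those that began with two 6's, and counts how many of those also end in a 6. Three throws are used only to keep the list short; the variable names are my own.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Every equally likely result of three throws of a fair dice (6 x 6 x 6 = 216 of them)
throws = list(product(faces, repeat=3))

# Keep only the results that started with two 6's in a row
after_two_sixes = [t for t in throws if t[0] == 6 and t[1] == 6]
# Of those, the ones where the third throw is also a 6
third_is_six = [t for t in after_two_sixes if t[2] == 6]

chance = Fraction(len(third_is_six), len(after_two_sixes))
print(chance)
```

The chance of a 6 on the third throw comes out as 1 in 6, exactly what it would be with no streak behind it.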

Thursday, February 16, 2012
