The Problem with Statistics
Here in Canada, our upcoming federal election is sparking a veritable barrage of polls and statistics. Oftentimes the figures just don't seem to agree. What are we to make of all this? How do we wade through all of this information to glean any meaningful insights? Moreover, how does one evaluate which studies are most accurate, or, more to the point, truthful? Now that the information age is truly upon us, it's perhaps time to re-evaluate the purpose of statistics in our daily lives and what they mean. All too often, it's not what you think.
Today, I'd like to run through a few of the more troubling aspects of statistics and how these may be used to advance an agenda or skew the facts to someone's favor.
Epidemic or Random Clustering?
Examine the location of the Aces in a recently shuffled deck of cards. Are they spaced a similar number of cards apart or are they all within close proximity of each other? Say that the four Aces were all within ten cards. Would you simply accept that it’s a perfectly normal consequence of randomization or would you suspect that cheating might be involved? It might depend on how the cards were shuffled. You would probably want to observe the person or machine doing the shuffling.
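Surprisingly, tight clustering like this happens more often than intuition suggests. Here is a rough Monte Carlo sketch (the function name and trial count are my own choices) that estimates how often a fair shuffle leaves all four Aces within a ten-card window:

```python
import random

def aces_within_span(trials=100_000, span=10):
    """Estimate how often all four Aces land within `span` consecutive cards."""
    deck = list(range(52))  # cards 0-3 stand in for the four Aces
    hits = 0
    for _ in range(trials):
        random.shuffle(deck)
        positions = sorted(deck.index(ace) for ace in range(4))
        if positions[-1] - positions[0] < span:  # all four inside a 10-card window
            hits += 1
    return hits / trials

print(aces_within_span())
```

Runs of this sketch typically land in the neighborhood of 1–2% of shuffles, so a tight cluster by itself is weak evidence of cheating.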
The same concept becomes much more fraught when people's lives are involved. Every year, we hear about small towns that have an unusually high rate of a disease such as cancer. Should a smoking gun, like a chemical plant, be found nearby, you have the makings of a sensational story.
In truth, determining the presence of a disease cluster is quite difficult. In many reported cases, the clustering is due to the "bull's-eye effect", which is something akin to drawing a target on the wall after the darts have been thrown. So next time the media reports a cluster scare, don't jump to conclusions until a thorough investigation has been performed by the relevant authorities.
Why Average Describes both Everyone and No One
I used to wonder why life expectancies were so much shorter historically than they are now, given how many old people I had read about in books and seen portrayed on the screen. Were those people exceptionally lucky to have lived to a ripe old age?
The answer lies in how averages are calculated: you add up a collection of values and divide by the number of values. Hence, if you had ten people and they lived to 69, 90, 89, 45, 78, 67, 79, 82, 76, 64 their average lifespan would be equal to 73.9 years. That's
(69 + 90 + 89 + 45 + 78 + 67 + 79 + 82 + 76 + 64) / 10 = 73.9
That works very well, so what’s the problem?
The flaw becomes apparent when we use an example that factors in the higher rates of infant mortality. For instance, during the Middle Ages, infant mortality was likely to be around 30%, and may even have been as high as 50%! 
Let's try that again with some much lower numbers - 69, 90, 1, 45, 78, 4, 79, 5, 76, 64.
That gives us an average of 51.1 years:
(69 + 90 + 1 + 45 + 78 + 4 + 79 + 5 + 76 + 64) / 10 = 51.1
So who in the above group expired at 51 years old? Did anyone even live close to that age? The closest match is the fourth entry, at 45 years. That is why averages often describe everyone and no one at the same time. Without knowing how many outliers were included in the calculation, it's impossible to know how typical the resulting figure is of real-world outcomes.
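One common remedy is to report the median alongside the mean. A minimal sketch using Python's standard statistics module, with the same ten lifespans:

```python
from statistics import mean, median

lifespans = [69, 90, 1, 45, 78, 4, 79, 5, 76, 64]

print(mean(lifespans))    # 51.1 -- matches no one in the group
print(median(lifespans))  # 66.5 -- the midpoint of the sorted values
```

Because the median ignores how extreme the outliers are, it lands much closer to the adult lifespans in the group than the mean does.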
Sampling Gone Bad
Getting back to the polling mentioned in the introduction of this article: for economic and logistical reasons, it's simply not feasible to ask every adult in the country who he or she will be voting for. Instead, data is obtained from a sample that hopefully shares the characteristics of the larger group. For example, if pollsters were to ask 100 people who they are going to vote for in the next election, and 45 of them say they will vote for Johnson, we might extrapolate that about 45% of all voters will vote for Johnson.
Sampling provides many benefits, but it's not without some important limitations.
The first issue is that the pollsters could have just happened to talk to an unusually large percentage of Johnson supporters by blind luck. This is the problem of sample size: the smaller the sample, the greater the influence of luck on the results we get. For a population of millions, a sample of one hundred participants is far too small.
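To see just how much luck matters at this sample size, here is a sketch (the 45% support figure is the hypothetical one from above, and the function name is my own) that simulates many independent polls of 100 people:

```python
import random

def run_polls(true_support=0.45, sample_size=100, polls=1_000):
    """Simulate repeated polls and return the lowest and highest observed support."""
    results = []
    for _ in range(polls):
        supporters = sum(random.random() < true_support for _ in range(sample_size))
        results.append(supporters / sample_size)
    return min(results), max(results)

low, high = run_polls()
print(f"Observed support ranged from {low:.0%} to {high:.0%}")
```

Even though the true support is fixed at 45%, individual polls of this size routinely swing by ten points or more in either direction.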
Another issue is that the way the people in the sample were picked might bias the result. If Johnson supports spending a lot of money on the arts, and the pollsters approach people attending a free concert, we might find an atypically high percentage of Johnson supporters. On the other hand, if those same pollsters were to sample people at a bar frequented mostly by single people, they might find a much lower percentage of Johnson supporters. In either case, the results will be unreliable.
Finally, another type of bias results when the people being sampled are free to choose whether or not to respond. A radio talk show might ask people to call in and vote on some issue. If the issue is especially contentious, people may be more likely to vote one way than the other. For example, if people were to vote about placing a new landfill in their neighborhood, chances are good that most people who called in would be opposed. This tendency is known as voluntary response bias.
Recognizing that the statistics presented to us are frequently flawed doesn't mean that statistics are useless. On the contrary, statistics by and large offer excellent evidence, and are often the easiest and most concise way to express it. Just be aware that the burden of examining the figures for relevance, validity and authority falls squarely on your shoulders. Taking statistics at face value, on the other hand, might lead you to draw erroneous conclusions.
Rob Gravelle resides in Ottawa, Canada, and is the founder of GravelleConsulting.com. Rob has built systems for intelligence-related organizations such as Canada Border Services and CSIS, as well as for numerous commercial businesses. In his spare time, Rob has become an accomplished guitar player and has released several CDs. His former band, Ivory Knight, was rated as one of Canada's top hard rock and metal groups by Brave Words magazine (issue #92).