Statistics: How to Lie with Statistics

An example of misleading with statistics:

 

Wired magazine recently declared that the Web is dead, backing it up with this chart that shows the history of the proportion of total Internet bandwidth used by various types of traffic.

According to Wired’s Chris Anderson, the proportion of Internet traffic taken up by Web requests is getting smaller, so “the Web is dead”.

It seems that Wired, in its quest to come up with outrageous linkbait, decided that the percentage of traffic carried over the Internet is the determiner of the viability of the Web and other forms of Internet traffic.

 

Did Podcasting Kill The Web?

By Wired’s way of thinking, you could argue that podcasting killed the Web. Audio and video podcasts are several orders of magnitude larger than a typical Web page.

What Wired ignores, though, is that the volume of Web traffic continues to grow by leaps and bounds.
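The gap between a shrinking share and a growing volume is easy to see with a little arithmetic. The sketch below uses invented traffic figures (they are not Wired's or Boing Boing's data) to show how the Web's proportion of traffic can fall sharply while its absolute volume triples.

```python
# Hypothetical traffic volumes in arbitrary units, invented for illustration only.
traffic = {
    2005: {"web": 100, "video": 20},
    2010: {"web": 300, "video": 900},
}


def web_share(year_data):
    """Return the Web's fraction of total traffic for one year."""
    return year_data["web"] / sum(year_data.values())


for year, data in traffic.items():
    # Volume goes up between the two years while the share goes down.
    print(year, "web volume:", data["web"], "web share:", round(web_share(data), 2))
```

A chart of shares alone would show only the second column and hide the first.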

 

Boing Boing counters Wired’s ludicrous argument with this chart, which more accurately shows that the Web is growing at an amazing rate, but media file traffic is growing faster – because media are a lot bigger:

Unfortunately, Wired’s need to come up with a new meme may keep people from considering a real trend – that devices like the iPad are forking the Web.

Just when it looked like the Web was going to kill off all other media – the iPad is demonstrating the benefits of dedicated Internet-enabled applications.

Instead of the one-size-fits-all approach of the computer Web browser, Apple is creating a platform of dedicated apps that do one thing really well.


Common Techniques for producing misleading statistical analyses:

 

There are a number of techniques to consider in lying with statistics:

Use a small, biased sample

4 out of 5 doctors surveyed recommend using our product!

Of course, there are 4 doctors on our board. We asked them first. Then we called one at random from the AMA's member directory. Any correlation between our survey method and the results is pure serendipity.

 

Use a self-selecting population

4 out of 5 of our female subscribers who responded to our survey cheated on their husbands!

Sure, but you have two self-selections. First, you only surveyed your subscribers, not a random sample of all women. Second, since people had to choose to respond, the ones most likely to respond were the ones who had cheated.
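A rough sketch of how much the second self-selection can distort a result. The response rates below are assumptions made up for the example, and `observed_rate` is a hypothetical helper, not part of any survey toolkit.

```python
def observed_rate(true_rate, respond_if_yes, respond_if_no):
    """Expected share of 'yes' answers among the people who choose to respond."""
    yes_responders = true_rate * respond_if_yes
    no_responders = (1 - true_rate) * respond_if_no
    return yes_responders / (yes_responders + no_responders)


# Assumed for illustration: 10% of subscribers would truthfully answer 'yes',
# 40% of those bother to respond, but only 1% of everyone else does.
rate = observed_rate(true_rate=0.10, respond_if_yes=0.40, respond_if_no=0.01)
print(f"share of 'yes' among respondents: {rate:.0%}")
```

Under these assumed rates, a true rate of 1 in 10 shows up as roughly 4 out of 5 among respondents. When both groups respond at the same rate, the function returns the true rate unchanged.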

 

Tie one data set to an earlier one, implying causality

100% of all crack addicts drank water before becoming addicted to crack. Water kills!

The more interesting (and likely truthful) statistic is "How many water drinkers become addicted to crack?" By tying two unrelated (or even semi-related) groups together, nearly anything can be proved. Try these out for size:

·         Being baptized leads to sinning

·         Eating causes cancer

·         Being asked, "Would you like fries with that?" causes road rage.

·         Scantily clad people lead to overpopulation
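The water example works by quoting one conditional proportion while implying the reverse one. A minimal sketch with invented counts (assumed purely for illustration) makes the difference explicit:

```python
# Invented counts, for illustration only.
water_drinkers = 1_000_000          # assume everyone in the population drinks water
addicts = 1_000                     # assumed number of addicts
addicts_who_drank_water = 1_000     # every addict drank water first

# The quoted statistic: share of addicts who drank water.
p_water_given_addict = addicts_who_drank_water / addicts

# The statistic that matters: share of water drinkers who became addicts.
p_addict_given_water = addicts_who_drank_water / water_drinkers

print(p_water_given_addict)   # 1.0
print(p_addict_given_water)   # 0.001
```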

Use a lower confidence to gain a higher probability

Over 80% of America watches our show!

A sample is merely a predictor of the population at large. If 80% of a sample has some attribute, then the statement "at least 80% of the population has this attribute" holds only with some confidence level. The higher the percentage you claim, the lower your confidence in the claim; the lower the percentage you claim, the higher your confidence.
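The trade-off can be sketched numerically. The function below is a rough illustration, assuming a simple random sample and the usual normal approximation for a sample proportion; the sample size of 100 is invented, and `confidence_at_least` is a hypothetical helper name.

```python
import math


def confidence_at_least(sample_p, n, claimed_p):
    """Approximate one-sided confidence that the population proportion is
    at least claimed_p, given a sample proportion sample_p from n people.
    Uses the normal approximation, so it is only a rough guide."""
    std_err = math.sqrt(sample_p * (1 - sample_p) / n)
    z = (sample_p - claimed_p) / std_err
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Assumed survey: 80 of 100 people watch the show.
for claim in (0.70, 0.80, 0.85):
    conf = confidence_at_least(0.80, 100, claim)
    print(f"claim 'at least {claim:.0%}': confidence about {conf:.0%}")
```

Under these assumptions, "over 80%" is a coin flip at best, while a more modest "at least 70%" could be stated with high confidence.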

 

Use obscure definitions and data sets

50% of Yankees are let go from their jobs at least once a year. The Yankee work ethic makes it hard to keep a job.

First, what's a Yankee? An American, a New Englander, a Vermonter, a woodchuck? What does "let go" mean? Perhaps this statistic started as "50% of rural Vermonters have a second job in a seasonal industry, which supplements their annual income."

 

Compare a statistic that affects most of the population to one that affects a small portion of the population

You are more likely to be hit by lightning thrice than attacked by a shark.

Well, let's see. Who is vulnerable to lightning strikes? Just about everyone in the world. Who is vulnerable to shark attack? Only those who swim in shark-infested waters. If you have a population in which 100% of the members have a 0.05% chance of event A happening to them, and only 5% of the members have a 2% chance of event B happening (the other 95% have no chance of it at all), then the following three facts are true:

1.    Event B happens with greater frequency than event A.

2.    People in the 5% group are more likely to have event B happen than event A.

3.    People in the other 95% are more likely to have event A happen than event B.

Unless you know which group someone is in, you can't really predict which is more likely to happen.
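The three facts can be checked directly from the numbers in the example. The only added assumption is that people outside the 5% group have zero chance of event B.

```python
p_a = 0.0005            # 0.05% chance of event A for everyone
exposed_share = 0.05    # 5% of the population is exposed to event B
p_b_exposed = 0.02      # 2% chance of event B within the exposed group
p_b_unexposed = 0.0     # assumption: nobody else can experience event B

# Overall frequency of each event across the whole population.
overall_a = p_a
overall_b = exposed_share * p_b_exposed + (1 - exposed_share) * p_b_unexposed

print("1. B more frequent overall:", overall_b > overall_a)
print("2. exposed group: B more likely than A:", p_b_exposed > p_a)
print("3. everyone else: A more likely than B:", p_a > p_b_unexposed)
```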

 

 

The first way in which statistics can be distorted is through the type of sample used. Rarely is an entire population surveyed in a study. More often, a sample is taken and the data from that sample is extrapolated onto the rest of the population. It is thus vitally important that the sample be judiciously chosen.

Let's imagine we are looking for the average height of Canadians. We choose to sample three random Canadians and we get their heights. The mean value we find for the height of Canadians is a random variable (the term has an important technical meaning here). It could be any number within a particular range, but the values in that range are not equally probable. Let's imagine that we take 50 samples of three people each and plot the resulting 50 sample means on a histogram. The mean of this histogram (the average of the fifty averages) is the population mean as nearly as we can determine it.

Each of the fifty data sets could also be plotted on a histogram. The small size of each sample means that there is a relatively high chance of an unusually tall or short person turning up in it and thus making its mean dramatically different from the mean of the entire population. As our sample size gets larger, one unusual person matters less, so the distribution of the sample averages will have less spread.
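A small simulation can illustrate the shrinking spread. Everything here is assumed for the sake of the sketch: the population is synthetic (heights drawn from a normal distribution with mean 170 cm and standard deviation 10 cm, not real Canadian data), and `spread_of_sample_means` is a hypothetical helper.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, trials=50, seed=0):
    """Draw `trials` random samples and return the standard deviation
    of their means, i.e. the spread of the histogram of sample averages."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


# Synthetic population of heights in cm (assumed, not real data).
pop_rng = random.Random(42)
population = [pop_rng.gauss(170, 10) for _ in range(10_000)]

small = spread_of_sample_means(population, sample_size=3)
large = spread_of_sample_means(population, sample_size=300)
print(f"spread with samples of 3:   {small:.2f} cm")
print(f"spread with samples of 300: {large:.2f} cm")
```

With fifty trials each, the averages of three-person samples scatter far more widely than the averages of three-hundred-person samples.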

If we knew the true mean height of the Canadian population, we could put it on the histogram from one trial. It will usually be either too large or too small when compared with our estimated value. How far off it will be depends on the sample size of the trial. Since a larger sample represents the whole population more effectively, it makes sense that it would do a better job of estimating the true value.

95% of the time, the true value of the mean height of the Canadian population will be within two standard deviations of the estimated value. The standard deviation of the histogram will become smaller as the sample becomes larger. This means that the area in which the true value almost certainly lies on a histogram becomes smaller when a larger sample size is used. This concept may be more familiar than you think.

Consider polls. When a poll result is stated, it is usually in the form: "55% of Canadians say Jean Chretien should play more golf, plus or minus 5%, 19 times out of 20." The "19 times out of 20" is the same 95% from the above paragraph. This means that 5% represents twice the standard deviation for the set from which the 55% value is determined. The pollsters are giving you the standard deviation in disguise!
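Unpacking the disguise takes two lines of arithmetic. The sketch below uses the poll figures from the text and the factor of two given there; the step from standard deviation to sample size assumes a simple random sample and the standard formula sqrt(p(1-p)/n) for a proportion, which the text does not state.

```python
margin = 0.05   # "plus or minus 5%"
p = 0.55        # "55% of Canadians"

# The margin is twice the standard deviation, so halve it.
std_dev = margin / 2

# Assumed formula: std_dev = sqrt(p * (1 - p) / n), solved for n.
implied_n = p * (1 - p) / std_dev ** 2

print(f"standard deviation: {std_dev:.3f}")
print(f"implied sample size: about {round(implied_n)} people")
```

Under those assumptions the pollster spoke to roughly four hundred people.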

Another common method by which statistics are fudged is conditioning. This is the process of selecting specific subsamples within a data set for comparison. An example is the average male wage compared with the average female wage. The manner in which this is done affects the results you get.

Studies have shown that kids who go to private schools earn 10% more, on average, than those who go to public schools. What does this mean? If we change the conditioning to examine neighbourhood and background, we see the difference reduced to zero. This essentially means that the marginal impact of going to private school if you already live in a good area (high average income, low unemployment) is quite small. By contrast, students coming from a poor area stand to gain 10% in their average income by going to private school. Such statistical evidence (keep in mind that this is just an example) can lead to government policy decisions. The above conclusion would support a proposal for vouchers allowing poor kids to go to private school, for example.
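Here is a minimal sketch of how a raw 10% gap can disappear once the comparison is conditioned on neighbourhood. The figures are invented: incomes depend only on the area, and private-school students simply come from good areas more often.

```python
# Invented groups: (school, area, average income in thousands, number of people).
groups = [
    ("private", "good", 60, 75),
    ("private", "poor", 40, 25),
    ("public", "good", 60, 50),
    ("public", "poor", 40, 50),
]


def mean_income(school, area=None):
    """Weighted mean income for a school type, optionally within one area."""
    rows = [
        (income, count)
        for s, a, income, count in groups
        if s == school and (area is None or a == area)
    ]
    total = sum(income * count for income, count in rows)
    people = sum(count for _, count in rows)
    return total / people


print("unconditioned:", mean_income("private"), "vs", mean_income("public"))
for area in ("good", "poor"):
    print(area, "area:", mean_income("private", area), "vs", mean_income("public", area))
```

Unconditioned, private-school graduates earn 55 against 50, a 10% gap. Within either neighbourhood the gap is zero, because in this made-up data the school never mattered.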

One final statistical trick I shall examine is that of scale. Somebody can call an increase in inflation from 2% to 3% (as calculated by the Consumer Price Index, for example) a "50%" jump. In actuality, the change was rather small: one percentage point. Whenever percentage changes are used to examine changes in small values, alarmingly large percentage changes can result. For this reason, if you are presented with very large percentage changes, you ought to keep in mind that they may simply represent small variations in small quantities. For the GDP of Luxembourg to grow by 10 or even 50% represents very little actual growth compared with the GDP of the United States growing even 1%.
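The two ways of describing the same change can be put side by side. The inflation figures come from the text; the GDP figures are round, invented numbers used only to show the effect of scale, not actual national accounts.

```python
def relative_change(old, new):
    """Change expressed as a fraction of the starting value."""
    return (new - old) / old


# Inflation rising from 2% to 3%.
inflation_old, inflation_new = 2.0, 3.0
point_change = inflation_new - inflation_old            # in percentage points
rel = relative_change(inflation_old, inflation_new)     # as a relative change

print(f"{point_change:.0f} percentage point, or a {rel:.0%} jump")

# Hypothetical GDP figures in billions (invented for illustration).
small_gdp, large_gdp = 50.0, 15_000.0
small_growth = small_gdp * 0.50   # small economy grows 50%
large_growth = large_gdp * 0.01   # large economy grows 1%

print(f"small economy adds {small_growth:.0f} billion")
print(f"large economy adds {large_growth:.0f} billion")
```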

 

Example:

 From Microsoft: How to Lie With Statistics

 

Posted on June 27, 2010 by dsarna

 

Microsoft’s Frank X Shaw created a buzz with his memo (uncritically parroted all over the Internet) which purports to show that Microsoft is still at the top of its game, outdistancing its competitors in every category. Though “sources” for the numbers are cited, most of the citations are uncheckable numbers from Microsoft’s shills and water-carriers, such as IDC. The biggest problem, however, is that the numbers, even if true on their face, don’t distinguish between a license currently in use and a license bought decades ago for a server long since buried in a landfill somewhere. For example, according to Shaw:

24%: Linux Server market share in 2005. [source]

33%: Predicted Linux Server market share for 2007 (made in 2005). [source]

21.2%: Actual Linux Server market share, Q4 2009. [source]

 

For the truth, see Netcraft’s June 2010 Web Server Survey.

Apache grew out of the NCSA HTTPd server originally written by Robert McCool in the early 1990s, and since April 1996 Apache has been the most popular HTTP server software in use. As of June 2010, 54% of the 111 million HTTP servers used Apache[1] running under Linux. If you look at those with the bulk of the traffic, then the numbers are even more impressive:

http://googlegazer.files.wordpress.com/2010/06/market-share-for-active-servers.png?w=444&h=128

The numbers are actually worse than inverted. 66.61% of the traffic is Apache/Linux and only 17.04% runs on a Microsoft platform.


Not quite what Microsoft asserted!

 

Some academic papers about misleading statistics:


Statistical Science

2005, Vol. 20, No. 3, 210–214

DOI 10.1214/088342305000000232

© Institute of Mathematical Statistics, 2005

 

Lies, Calculations and Constructions: Beyond How to Lie with Statistics

Joel Best

Abstract. Darrell Huff’s How to Lie with Statistics remains the best-known, nontechnical call for critical thinking about statistics. However, drawing a distinction between statistics and lying ignores the process by which statistics are socially constructed. For instance, bad statistics often are disseminated by sincere, albeit innumerate advocates (e.g., inflated estimates for the number of anorexia deaths) or through research findings selectively highlighted to attract media coverage (e.g., a recent study on the extent of bullying). Further, the spread of computers has made the production and dissemination of dubious statistics easier. While critics may agree on the desirability of increasing statistical literacy, it is unclear who might accept this responsibility.

Key words and phrases: Darrell Huff, social construction, statistical literacy

 

Statistical Science

2005, Vol. 20, No. 3, 223–230

DOI 10.1214/088342305000000296

© Institute of Mathematical Statistics, 2005

How to Confuse with Statistics or: The Use and Misuse of Conditional Probabilities

Walter Krämer and Gerd Gigerenzer

Abstract. This article shows by various examples how consumers of statistical information may be confused when this information is presented in terms of conditional probabilities. It also shows how this confusion helps others to lie with statistics, and it suggests both confusion and lies can be exposed by using alternative modes of conveying statistical information.

Key words and phrases: Conditional probabilities, natural frequencies,

 

Statistical Science

2005, Vol. 20, No. 3, 231–238

DOI 10.1214/088342305000000269

© Institute of Mathematical Statistics, 2005

How to Lie with Bad Data

Richard D. De Veaux and David J. Hand

Abstract. As Huff’s landmark book made clear, lying with statistics can be accomplished in many ways. Distorting graphics, manipulating data or using biased samples are just a few of the tried and true methods. Failing to use the correct statistical procedure or failing to check the conditions for when the selected method is appropriate can distort results as well, whether the motives of the analyst are honorable or not. Even when the statistical procedure and motives are correct, bad data can produce results that have no validity at all. This article provides some examples of how bad data can arise, what kinds of bad data exist, how to detect and measure bad data, and how to improve the quality of data that have already been collected.

Key words and phrases: Data quality, data profiling, data rectification, data consistency, accuracy, distortion, missing values, record linkage, data warehousing, data mining.

Statistical Science

2005, Vol. 20, No. 3, 239–241

DOI 10.1214/088342305000000250

© Institute of Mathematical Statistics, 2005

How to Accuse the Other Guy of Lying with Statistics

Charles Murray

Abstract. We’ve known how to lie with statistics for 50 years now. What we really need are theory and praxis for accusing someone else of lying with statistics. The author’s experience with the response to The Bell Curve has led him to suspect that such a formulation already exists, probably imparted during a secret initiation for professors in the social sciences. This article represents his best attempt to reconstruct what must be in it.

Key words and phrases: Public policy, regression analysis, lying with statistics

 

Statistical Science

2005, Vol. 20, No. 3, 215–222

DOI 10.1214/088342305000000241

© Institute of Mathematical Statistics, 2005

Lying with Maps

Mark Monmonier

Abstract. Darrell Huff’s How to Lie with Statistics was the inspiration for How to Lie with Maps, in which the author showed that geometric distortion and graphic generalization of data are unavoidable elements of cartographic representation. New examples of how ill-conceived or deliberately contrived statistical maps can greatly distort geographic reality demonstrate that lying with maps is a special case of lying with statistics. Issues addressed include the effects of map scale on geometry and feature selection, the importance of using a symbolization metaphor appropriate to the data and the power of data classification to either reveal meaningful spatial trends or promote misleading interpretations.

Key words and phrases: Classification, deception, generalization, maps, statistical graphics.

 

Statistical Science

2005, Vol. 20, No. 3, 242–248

DOI 10.1214/088342305000000214

© Institute of Mathematical Statistics, 2005

Ephedra

Sally C. Morton

Abstract. In February 2004, the U.S. Food and Drug Administration (FDA) prohibited the sale of dietary supplements containing ephedrine alkaloids (ephedra), stating that such supplements present an unreasonable risk of illness or injury. The Dietary Supplement Health and Education Act (DSHEA) of 1994 (21 USC §301, 1994) governs dietary supplement regulation in the U.S. DSHEA places the burden of proof for safety on the government rather than on the manufacturer and thus differs significantly from regulations that govern the marketing of drugs. Part of the evidence the FDA used in reaching its decision was a systematic review of the efficacy and safety of ephedra conducted by the Southern California Evidence-Based Practice Center. In addition to a meta-analysis of controlled trial data, the review contained an evaluation of observational case report data, a study design that has limited inferential abilities regarding cause and effect.

How did the FDA decide what data were relevant to its decision? How did the FDA argument for the ban differ from a decision based solely on statistical hypothesis testing? This paper will address these questions by describing the systematic review approach, the evidence presented, the interpretation of that evidence by those on both sides of the argument and the process by which the decision was made.

Key words and phrases: Dietary supplements, meta-analysis, research synthesis