Statistics: How to Lie with Statistics
An example of misleading with statistics:

Wired magazine recently declared that the Web is dead, backing it up with this chart that shows the history of the proportion of total Internet bandwidth used by various types of traffic.
According to Wired’s Chris Anderson, the proportion of Internet traffic taken up by Web requests is getting smaller, so “the Web is dead”.
It seems that Wired, in its quest to come up with outrageous linkbait, decided that a medium's share of total Internet traffic determines the viability of the Web and other forms of Internet traffic.
Did Podcasting Kill The Web?
By Wired’s way of thinking, you could argue that podcasting killed the Web. Audio and video podcasts are several orders of magnitude larger than a typical Web page.
What Wired ignores, though, is that the volume of Web traffic continues to grow by leaps and bounds.
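The gap between share and volume can be sketched in a few lines of Python. The figures are hypothetical, in arbitrary units, and are not taken from Wired's or Boing Boing's charts; `traffic` and `web_share` are names made up for this illustration.

```python
# Hypothetical traffic volumes in arbitrary units (NOT real measurements).
traffic = {
    2000: {"web": 10.0, "video": 1.0},
    2005: {"web": 100.0, "video": 50.0},
    2010: {"web": 1000.0, "video": 3000.0},
}


def web_share(year_data):
    """Fraction of the year's total traffic that is Web traffic."""
    return year_data["web"] / sum(year_data.values())


for year in sorted(traffic):
    data = traffic[year]
    print(f"{year}: web volume = {data['web']:7.1f}, web share = {web_share(data):.0%}")
```

With these made-up numbers, Web volume grows a hundredfold while its share of the total falls from about 91% to 25%. A shrinking slice of a fast-growing pie is not a shrinking quantity.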

Boing Boing counters Wired’s ludicrous argument with this chart, which more accurately shows that the Web is growing at an amazing rate, but media file traffic is growing faster – because media are a lot bigger:
Unfortunately, Wired’s need to come up with a new meme may keep people from considering a real trend – that devices like the iPad are forking the Web.
Just when it looked like the Web was going to kill off all other media – the iPad is demonstrating the benefits of dedicated Internet-enabled applications.
Instead of the one-size-fits-all approach of the computer Web browser, Apple is creating a platform of dedicated apps that do one thing really well.
Common Techniques for producing misleading statistical analyses:
There are a number of techniques to consider in lying with statistics:
4 out of 5 doctors surveyed recommend using our product!
Of course, there are 4 doctors on our board. We asked them first. Then we called one at random from the AMA's member directory. Any correlation between our survey method and the results is pure serendipity.
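To see why the stacked panel cannot lose, here is a tiny sketch. The function name `stacked_survey` is hypothetical, and the panel sizes simply mirror the story above.

```python
def stacked_survey(guaranteed_yes, random_answers):
    """Fraction of 'yes' answers when hand-picked supporters are mixed
    with a few genuinely random respondents (True means 'recommends')."""
    total = guaranteed_yes + len(random_answers)
    return (guaranteed_yes + sum(random_answers)) / total


# Four board members plus one random doctor who says no, then one who says yes.
print(stacked_survey(4, [False]))
print(stacked_survey(4, [True]))
```

Whatever the random doctor answers, the headline can never drop below 4 out of 5.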
Use a self-selecting population
4 out of 5 of our female subscribers who responded to our survey cheated on their husbands!
Sure, but you have two self-selections. First: You only surveyed your subscribers, not a random sample of all women. Second, since people had to respond, the ones most likely to respond were the ones who had cheated.
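A rough sketch of how self-selection inflates a result follows. All three rates are assumptions invented for illustration (a true rate of 20%, and people with something to report being far more likely to answer); `observed_rate` is a hypothetical helper, not a standard formula name.

```python
def observed_rate(true_rate, respond_if_yes, respond_if_no):
    """Share of 'yes' among those who respond, given different
    response rates for the 'yes' and 'no' groups."""
    yes_responders = true_rate * respond_if_yes
    no_responders = (1 - true_rate) * respond_if_no
    return yes_responders / (yes_responders + no_responders)


# Assumed: 20% true rate; 40% of the 'yes' group respond, 2.5% of the 'no' group.
print(f"{observed_rate(0.20, 0.40, 0.025):.0%}")

# If both groups respond at the same rate, the survey recovers the true rate.
print(f"{observed_rate(0.20, 0.30, 0.30):.0%}")
```

Under these assumptions a true rate of 1 in 5 shows up in the survey as 4 out of 5, without anyone lying on the form.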
Tie one data set to an earlier one, implying causality
100% of all crack addicts drank water before becoming addicted to crack. Water kills!
The more interesting (and likely truthful) statistic is "How many water drinkers become addicted to crack?" By tying together two unrelated (or even semi-related) groups, nearly anything can be proved. Try these out for size:
· Being baptized leads to sinning
· Being asked, "Would you like fries with that?" causes road rage.
· Scantily clad people lead to overpopulation
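The trick in all of these is quoting a conditional probability in the wrong direction. A small sketch with invented counts (the population and addict numbers are purely hypothetical) shows how different the two directions are:

```python
# Hypothetical counts, chosen only to illustrate the direction of conditioning.
water_drinkers = 1_000_000          # assume everyone in the population drinks water
addicts = 1_000                     # assumed number of crack addicts
addicts_who_drank_water = 1_000     # all of them drank water first

# "100% of crack addicts drank water": P(water | addict)
p_water_given_addict = addicts_who_drank_water / addicts

# The question that matters: P(addict | water)
p_addict_given_water = addicts_who_drank_water / water_drinkers

print(f"P(water | addict) = {p_water_given_addict:.0%}")
print(f"P(addict | water) = {p_addict_given_water:.1%}")
```

The first number is 100%, the second is a tenth of a percent, and only the second says anything about whether water is dangerous.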
Use a lower confidence to gain a higher probability
Over 80% of America watches our show!
Sample sets are merely predictors of the population at large. If 80% of a sample set has some attribute, then there is a confidence level associated with saying, "at least 80% of the population has this attribute." The higher the percentage you claim for the population, the lower the confidence with which you can claim it; the lower the percentage, the higher the confidence.
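As a rough illustration of that trade-off, the sketch below uses a normal approximation to the sampling distribution of a proportion. The sample (80 of 100 viewers) is hypothetical, `confidence_at_least` is a made-up helper name, and the approximation is only one of several ways to attach a confidence level to such a claim.

```python
from math import sqrt
from statistics import NormalDist


def confidence_at_least(successes, n, threshold):
    """Approximate one-sided confidence that the population proportion is at
    least `threshold`, given `successes` out of `n` (normal approximation)."""
    p_hat = successes / n
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return NormalDist().cdf((p_hat - threshold) / standard_error)


# Hypothetical sample: 80 of 100 people surveyed watch the show.
for threshold in (0.80, 0.75, 0.70):
    conf = confidence_at_least(80, 100, threshold)
    print(f"'at least {threshold:.0%} watch': confidence about {conf:.0%}")
```

With this sample, "at least 80%" can be claimed with only about 50% confidence, while "at least 70%" can be claimed with about 99%. The advertiser quotes the bigger number and keeps quiet about the confidence.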
Use obscure definitions and data sets
50% of Yankees are let go from their jobs at least once a year. The Yankee work ethic makes it hard to keep a job.
First, what's a Yankee? An American, a New Englander, a Vermonter, a woodchuck? What does "let go" mean? Perhaps this statistic started as "50% of rural Vermonters have a second job in a seasonal industry, which supplements their annual income."
Compare a statistic that affects most of the population to one that affects a small portion of the population
You are more likely to be hit by lightning thrice than attacked by a shark.
Well, let's see. Who is vulnerable to lightning strikes? Just about everyone in the world. Who is vulnerable to shark attack? Only those who swim in shark-infested waters. Suppose that in some population, 100% of the members have a 0.05% chance of event A happening to them, while only 5% of the members have a 2% chance of event B happening to them (and the other 95% have essentially no chance of it). Then the following three facts are true:
1. Event B happens with greater frequency than event A.
2. People in the 5% group are more likely to have event B happen than event A.
3. People in the other 95% are more likely to have event A happen than event B.
Unless you know which group someone is in, you can't really predict which is more likely to happen.
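The three facts can be checked with the numbers from the example. The one added assumption, marked in the code, is that event B has zero probability outside the 5% group; the group labels are invented.

```python
# Shares of the population in each group (labels are made up for illustration).
group_share = {"exposed": 0.05, "unexposed": 0.95}

# Event A: 0.05% chance for everyone.
p_a = {"exposed": 0.0005, "unexposed": 0.0005}

# Event B: 2% chance inside the 5% group.
# Assumption: zero chance for everyone else.
p_b = {"exposed": 0.02, "unexposed": 0.0}

overall_a = sum(group_share[g] * p_a[g] for g in group_share)
overall_b = sum(group_share[g] * p_b[g] for g in group_share)

print(f"Overall: A = {overall_a:.2%}, B = {overall_b:.2%}")
for g in group_share:
    likelier = "B" if p_b[g] > p_a[g] else "A"
    print(f"{g}: A = {p_a[g]:.2%}, B = {p_b[g]:.2%}, more likely: {likelier}")
```

Across the whole population B is twice as frequent as A (0.10% against 0.05%), yet for 95% of individuals A is the likelier event.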
The first way in which statistics can be distorted is through the type of sample used. Rarely is an entire population surveyed in a study. More often, a sample is taken and the data from that sample is extrapolated onto the rest of the population. It is thus vitally important that the sample be judiciously chosen.
Let's imagine we are looking for the average height of Canadians. We choose to sample three random Canadians and we get their heights. The mean value we find for the height of Canadians is a random variable (the term random variable has an important meaning here). The height of Canadians could be any number within a particular range but it is not equally probable that it would be any of them.
Let's imagine that we take 50 samples of three people each and plot the resulting sample means on a histogram. The mean of this histogram (the average of the fifty averages) is the population mean as nearly as we can determine it.
Each of the fifty data sets could also be plotted on a histogram. The small size of the sample means that there is a relatively high chance of an unusually tall or short person turning up in our data set and thus making its mean dramatically different from the mean of the entire population. Therefore, as our sample size gets larger, the distribution of the sample averages will have less spread.
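A quick simulation makes the shrinking spread visible. The population here is invented (heights drawn from a normal distribution with mean 170 cm and standard deviation 10 cm, which is an assumption and not Canadian data), and `sample_means` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_means(sample_size, trials=50, mu=170.0, sigma=10.0):
    """Means of `trials` random samples of `sample_size` heights each,
    drawn from a made-up normal population."""
    return [
        mean([rng.gauss(mu, sigma) for _ in range(sample_size)])
        for _ in range(trials)
    ]


spread_small = stdev(sample_means(3))     # fifty samples of three people
spread_large = stdev(sample_means(300))   # fifty samples of three hundred people

print(f"spread of sample means, n = 3:   {spread_small:.2f} cm")
print(f"spread of sample means, n = 300: {spread_large:.2f} cm")
```

The fifty averages from samples of three scatter widely, while the fifty averages from samples of three hundred cluster tightly around the population mean.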
If we knew the true mean height of the Canadian population, we could put it on the histogram from one trial. It will usually be either too large or too small when compared with our estimated value. How far off it will be depends on the sample size of the trial. Since a larger sample represents the whole population more effectively, it makes sense that it would do a better job of estimating the true value.
95% of the time, the true value of the mean height of the Canadian population will be within two standard deviations of the estimated value, where the standard deviation in question is that of the histogram of sample means. That standard deviation will become smaller as the sample becomes larger. This means that the area in which the true value almost certainly lies on a histogram becomes smaller when a larger sample size is used. This concept may be more familiar than you think.
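The 95% figure can be checked by simulation. This is only a sketch under stated assumptions: heights come from an invented normal population (mean 170 cm, standard deviation 10 cm), each trial uses a sample of 100 people, and the spread of the sample mean is estimated as the sample standard deviation divided by the square root of the sample size. The rule of thumb works less well for very small samples such as three people.

```python
import random
from math import sqrt
from statistics import mean, stdev


def coverage(sample_size=100, trials=2000, mu=170.0, sigma=10.0, seed=1):
    """Fraction of trials in which the true mean `mu` lies within two
    estimated standard deviations of the sample mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        standard_error = stdev(sample) / sqrt(sample_size)
        if abs(mean(sample) - mu) <= 2 * standard_error:
            hits += 1
    return hits / trials


print(f"true mean captured in {coverage():.1%} of trials")
```

The printed fraction should land close to 95%, which is the "19 times out of 20" that pollsters quote.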
Consider polls. When a poll result is stated, it is usually in the form: "55% of Canadians say Jean Chretien should play more golf, plus or minus 5%, 19 times out of 20." The "19 times out of 20" is the same 95% from the above paragraph. This means that 5% represents twice the standard deviation for the set from which the 55% value is determined. The pollsters are giving you the standard deviation in disguise!
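Unpacking the disguise takes one division, and a second step gives a rough idea of how many people were polled. The sample-size formula assumes a simple random sample and the usual formula for the standard deviation of a proportion; both function names are made up for this sketch.

```python
def implied_standard_deviation(margin, multiplier=2.0):
    """Standard deviation hidden in a 'plus or minus `margin`,
    19 times out of 20' statement (two standard deviations ~ 95%)."""
    return margin / multiplier


def implied_sample_size(proportion, margin, multiplier=2.0):
    """Rough sample size behind the poll, assuming a simple random sample
    where the standard deviation is sqrt(p * (1 - p) / n)."""
    sd = implied_standard_deviation(margin, multiplier)
    return proportion * (1 - proportion) / sd ** 2


print(f"standard deviation: {implied_standard_deviation(0.05):.3f}")
print(f"implied sample size: about {implied_sample_size(0.55, 0.05):.0f}")
```

For the golf poll the hidden standard deviation is 2.5 percentage points, which under these assumptions corresponds to a sample of roughly 400 people.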
Another common method by which statistics are fudged is conditioning. This is the process of selecting specific subsamples within a data set for comparison. An example is the average male wage compared with the average female wage. The manner in which this is done affects the results you get.
Studies have shown that kids who go to private schools earn 10% more, on average, than those who go to public schools. What does this mean? If we change the conditioning to examine neighbourhood and background, we see the difference reduced to zero. This essentially means that the marginal impact of going to private school if you already live in a good area (high average income, low unemployment) is quite small. By contrast, students coming from a poor area stand to gain 10% in their average income for going to private school. Such statistical evidence (keep in mind that this is just an example) can lead to government policy decisions. The above conclusion would support a proposal for vouchers allowing poor kids to go to private school, for example.
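How a gap can vanish under conditioning is easy to show with toy data. The records below are entirely invented (incomes in thousands, two neighbourhood types) and are built so that the raw private-school premium is 10% while the premium within each neighbourhood is zero; they model only that part of the example, not any real study. `mean_income_by` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Invented records: (school type, neighbourhood, income in thousands).
records = (
    [("private", "affluent", 60)] * 3
    + [("private", "poor", 40)]
    + [("public", "affluent", 60)] * 2
    + [("public", "poor", 40)] * 2
)


def mean_income_by(rows, key):
    """Average income for each group produced by `key(school, area)`."""
    groups = defaultdict(list)
    for school, area, income in rows:
        groups[key(school, area)].append(income)
    return {group: mean(incomes) for group, incomes in groups.items()}


raw = mean_income_by(records, lambda school, area: school)
conditioned = mean_income_by(records, lambda school, area: (area, school))

print("raw:", raw)
print("conditioned on neighbourhood:", conditioned)
```

In this toy data the whole 10% premium comes from private schools drawing more of their students from the affluent neighbourhood. Compare like with like and it disappears.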
One final statistical trick I shall examine is that of scale. Somebody can call an increase in inflation from 2% to 3% (as calculated by the Consumer Price Index, for example) a "50%" jump. In actuality, the change was rather small: one percentage point. Whenever percentage changes are used to examine changes in small values, alarmingly large percentage changes can result. For this reason, if you are presented with very large percentage changes, you ought to keep in mind that they may simply represent small variations in small quantities. For the GDP of Luxembourg to grow by 10% or even 50% represents very little actual growth compared with the GDP of the United States growing by even 1%.
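The arithmetic behind both examples is short. The inflation figures come from the text; the two GDP values are round hypothetical numbers (in billions) standing in for a small and a large economy, not actual statistics for Luxembourg or the United States.

```python
def relative_change(old, new):
    """Change expressed as a fraction of the starting value."""
    return (new - old) / old


# Inflation going from 2% to 3%.
print(f"relative change: {relative_change(2.0, 3.0):.0%}")
print(f"absolute change: {3.0 - 2.0:.0f} percentage point")

# Hypothetical economies, in billions: a small one and a large one.
small_gdp, large_gdp = 50.0, 15_000.0
small_gain = small_gdp * 0.50   # the small economy grows by 50%
large_gain = large_gdp * 0.01   # the large economy grows by 1%
print(f"small economy +50%: {small_gain:.0f} billion")
print(f"large economy +1%:  {large_gain:.0f} billion")
```

The "50% jump" in inflation is one percentage point, and with these stand-in figures a 1% rise in the large economy adds six times as much as a 50% rise in the small one.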
Example:
From Microsoft: How to Lie With Statistics
Posted on June 27, 2010 by dsarna
Microsoft’s Frank X Shaw created a buzz with his memo (uncritically parroted all over the Internet) which purports to show that Microsoft is still at the top of its game, outdistancing its competitors in every category. Though “sources” for the numbers are cited, most of the citations are uncheckable numbers from Microsoft’s shills and water-carriers, such as IDC. The biggest problem, however, is that the numbers, even if true on their face, don’t distinguish between a license bought decades ago for a server long since buried in a landfill somewhere and a license currently in use. For example, according to Shaw:
24%: Linux Server market share in 2005. [source]
33%: Predicted Linux Server market share for 2007 (made in 2005). [source]
21.2%: Actual Linux Server market share, Q4 2009. [source]
For the truth, see Netcraft’s
The Apache web server grew out of the NCSA HTTPd server written by Robert McCool in the early 1990s, and since April 1996 Apache has been the most popular HTTP server software in use. As of June 2010, 54% of the 111 million HTTP servers used Apache[1] running under Linux. If you look at those with the bulk of the traffic, then the numbers are even more impressive:
The numbers are actually worse than inverted. 66.61% of the traffic is Apache/Linux and only 17.04% runs on a Microsoft platform.



Not quite what Microsoft asserted!
Some academic papers about misleading statistics:
Statistical Science
2005, Vol. 20, No. 3, 210–214
DOI 10.1214/088342305000000232
© Institute of Mathematical Statistics, 2005
Lies, Calculations and Constructions: Beyond How to Lie with Statistics
Joel Best
Abstract. Darrell Huff’s How to Lie with Statistics remains the best-known,
nontechnical call for critical thinking about statistics. However, drawing a
distinction between statistics and lying ignores the process by which statistics
are socially constructed. For instance, bad statistics often are disseminated by
sincere, albeit innumerate advocates (e.g., inflated estimates for the number
of anorexia deaths) or through research findings selectively highlighted to attract
media coverage (e.g., a recent study on the extent of bullying). Further,
the spread of computers has made the production and dissemination of dubious
statistics easier. While critics may agree on the desirability of increasing
statistical literacy, it is unclear who might accept this responsibility.
Key words and phrases: Darrell Huff, social construction, statistical literacy
Statistical Science
2005, Vol. 20, No. 3, 223–230
DOI 10.1214/088342305000000296
© Institute of Mathematical Statistics, 2005
How to Confuse with Statistics or: The Use and Misuse of Conditional Probabilities
Walter Krämer and Gerd Gigerenzer
Abstract. This article shows by various examples how consumers of statistical
information may be confused when this information is presented in terms
of conditional probabilities. It also shows how this confusion helps others to
lie with statistics, and it suggests both confusion and lies can be exposed by
using alternative modes of conveying statistical information.
Key words and phrases: Conditional probabilities, natural frequencies,
Statistical Science
2005, Vol. 20, No. 3, 231–238
DOI 10.1214/088342305000000269
© Institute of Mathematical Statistics, 2005
How to Lie with Bad Data
Richard D. De Veaux and David J. Hand
Abstract. As Huff’s landmark book made clear, lying with statistics can be
accomplished in many ways. Distorting graphics, manipulating data or using
biased samples are just a few of the tried and true methods. Failing to use the
correct statistical procedure or failing to check the conditions for when the
selected method is appropriate can distort results as well, whether the motives
of the analyst are honorable or not. Even when the statistical procedure and
motives are correct, bad data can produce results that have no validity at all.
This article provides some examples of how bad data can arise, what kinds
of bad data exist, how to detect and measure bad data, and how to improve
the quality of data that have already been collected.
Key words and phrases: Data quality, data profiling, data rectification, data
consistency, accuracy, distortion, missing values, record linkage, data warehousing,
data mining.
Statistical Science
2005, Vol. 20, No. 3, 239–241
DOI 10.1214/088342305000000250
© Institute of Mathematical Statistics, 2005
How to Accuse the Other Guy of Lying with Statistics
Charles Murray
Abstract. We’ve known how to lie with statistics for 50 years now. What
we really need are theory and praxis for accusing someone else of lying with
statistics. The author’s experience with the response to The Bell Curve has
led him to suspect that such a formulation already exists, probably imparted
during a secret initiation for professors in the social sciences. This article
represents his best attempt to reconstruct what must be in it.
Key words and phrases: Public policy, regression analysis, lying with statistics
Statistical Science
2005, Vol. 20, No. 3, 215–222
DOI 10.1214/088342305000000241
© Institute of Mathematical Statistics, 2005
Lying with Maps
Mark Monmonier
Abstract. Darrell Huff’s How to Lie with Statistics was the inspiration for
How to Lie with Maps, in which the author showed that geometric distortion
and graphic generalization of data are unavoidable elements of cartographic
representation. New examples of how ill-conceived or deliberately contrived
statistical maps can greatly distort geographic reality demonstrate that lying
with maps is a special case of lying with statistics. Issues addressed include
the effects of map scale on geometry and feature selection, the importance
of using a symbolization metaphor appropriate to the data and the power
of data classification to either reveal meaningful spatial trends or promote
misleading interpretations.
Key words and phrases: Classification, deception, generalization, maps,
statistical graphics.
Statistical Science
2005, Vol. 20, No. 3, 242–248
DOI 10.1214/088342305000000214
© Institute of Mathematical Statistics, 2005
Ephedra
Sally C. Morton
Abstract. In February 2004, the U.S. Food and Drug Administration (FDA)
prohibited the sale of dietary supplements containing ephedrine alkaloids
(ephedra), stating that such supplements present an unreasonable risk of illness
or injury. The Dietary Supplement Health and Education Act (DSHEA)
of 1994 (21 USC §301, 1994) governs dietary supplement regulation in the
U.S. DSHEA places the burden of proof for safety on the government rather
than on the manufacturer and thus differs significantly from regulations that
govern the marketing of drugs. Part of the evidence the FDA used in reaching
its decision was a systematic review of the efficacy and safety of ephedra
conducted by the Southern California Evidence-Based Practice Center. In
addition to a meta-analysis of controlled trial data, the review contained an
evaluation of observational case report data, a study design that has limited
inferential abilities regarding cause and effect.
How did the FDA decide what data were relevant to its decision? How did
the FDA argument for the ban differ from a decision based solely on statistical
hypothesis testing? This paper will address these questions by describing
the systematic review approach, the evidence presented, the interpretation
of that evidence by those on both sides of the argument and the process by
which the decision was made.
Key words and phrases: Dietary supplements, meta-analysis, research synthesis