Statistics theory
Statistics is a mathematical approach to describe something, predict events, or analyze the relationship between things. Statistical analysis can, for example, describe the average income of a population, test whether two groups have the same average income, or analyze factors that might explain the income level for a particular group.
Statistics uses mathematics to describe, predict, explain or analyze things. Use of mathematics makes it possible to test relationships between two or more groups or to test how observations compare to a prediction. Some of the statistical concepts include mean (average), standard deviation (how concentrated or spread out things are), and correlation (how related two different variables are). These concepts are further explained in this article.
A current example of the use of statistics is trying to figure out how to get the economy growing again. We build a 'model' of the economy, adding in or taking out many different factors or variables, and seeing which variable, or set of variables, results in growth. The model can be used to "predict" growth. However, building a model is not a simple task. Different people make different assumptions about the model and about the different variables to be used in the model. Some may believe, for example, that job creation will result in increased income and consequently increased spending. Thus, they will argue, the role of the government is to create jobs. On the other hand, others may believe that reducing taxes will result in more wealth going to people, who will then spend more. Thus, they believe, the job of the government is to reduce taxes. The point is that statistics is a tool to help build models, analyze relationships, and often predict outcomes. However, behind most statistical analysis is a set of assumptions and beliefs that people bring to the modeling, and those assumptions and beliefs are, very often, not statistical. The first step in using statistics, then, is to make clear the assumptions and beliefs that are being used, so that others can understand the starting points of the analysis.
Statistics is used in a very wide variety of fields. For example, statistics is used to develop and analyze psychological tests and public opinion surveys, in program evaluation to determine whether a program works or how it can be improved, in medicine with clinical trials to test the safety and effectiveness of new drugs, in engineering to look for outliers and anomalies and to test underlying assumptions[1] and in many other areas.
The usefulness of statistical analysis depends crucially upon the validity of the methods by which data are collected, whether the appropriate statistical techniques are used, whether basic assumptions are met, and how the results are interpreted. Generally, a good deal of professional training is required to appropriately apply statistics. The training should ideally also include training on how to present the results of statistical analysis, so that the general lay public can correctly understand the results.
Some basic concepts
Some basic concepts of statistics are easy for anyone to understand.
The first is the measurement of central tendency. Most people know the mean, which in everyday language is the average. Suppose there are five people who have, respectively, 1 TV in their house, 4 TVs, 2 TVs, no TVs and 3 TVs. The mean number of TVs per house is 2, because (1+4+2+0+3)/5 = 2.
Another measure is the median. The median is the point at which half are above and half are below. In the above example, the median is also 2, because in this group of 5 people, 2 people have more than 2 TVs in their house and 2 people have fewer than 2 TVs in their house.
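The TV example can be checked with a short Python sketch that uses the standard library's statistics module. The variable names are illustrative assumptions and are not part of the original example.

```python
import statistics

# Number of TVs owned by each of the five people in the example.
tvs_per_house = [1, 4, 2, 0, 3]

# Mean: the sum of the values divided by how many values there are.
mean_tvs = statistics.mean(tvs_per_house)

# Median: the middle value once the data are sorted (0, 1, 2, 3, 4).
median_tvs = statistics.median(tvs_per_house)

print("mean:", mean_tvs)
print("median:", median_tvs)
```

In this small group the mean and the median happen to agree, which is not true in general.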
The median is often used in describing incomes of populations because the mean can be misleading. For example, say 9 people have an income of $10,000 a year, and one person has an income of $2,000,000 a year. The mean is $2,090,000/10 = $209,000 a year, while the median is $10,000 a year. Only one person actually has a very high income; the other nine earn far less than the mean. In this case, the mean gives a misleading picture of the typical income, and the median is usually a better indicator of where the bulk of incomes are.
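A minimal sketch of the income example, again in Python with the standard statistics module, shows how far a single large value pulls the mean away from the median. The list name incomes is an assumption made for illustration.

```python
import statistics

# Nine people earning $10,000 a year and one person earning $2,000,000 a year.
incomes = [10_000] * 9 + [2_000_000]

mean_income = statistics.mean(incomes)      # total income divided by 10 people
median_income = statistics.median(incomes)  # midpoint of the sorted incomes

print("mean income:  ", mean_income)
print("median income:", median_income)
```

The mean is more than twenty times the median here, even though nine of the ten people earn exactly the median amount.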
A slightly more difficult concept is the measurement of variation. The standard deviation (SD) is a measure of variation or scatter, that is, how much values are spread out from the mean. A simple example could again use income. Suppose a community had people with mostly the same incomes, varying from $30,000 to $40,000. Say the average is $35,000. In this case there is little variation or spread, and so the standard deviation would be small. On the other hand, suppose there is another community where there are a number of wealthy people, some middle income people, and a group of people with low incomes. In this community, the incomes vary from $10,000 a year to $2,000,000 a year. Again, suppose the average is $35,000 a year. This second community is clearly different from the first. The two communities have the same average income, but very different distributions of incomes. In the second community, the standard deviation would be quite large, which is useful information because it captures the large difference between the communities.
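The contrast between the two communities can be sketched in Python. The individual incomes below are hypothetical numbers invented for illustration; they were chosen only so that both communities have a mean of $35,000 and fall within the ranges described above.

```python
import statistics

# Hypothetical community A: incomes tightly clustered between $30,000 and $40,000.
community_a = [30_000, 32_000, 35_000, 38_000, 40_000]

# Hypothetical community B: many low incomes, many middle incomes,
# and a single very high income of $2,000,000.
community_b = [10_000] * 50 + [30_000] * 143 + [2_000_000]

mean_a = statistics.mean(community_a)
mean_b = statistics.mean(community_b)

# Sample standard deviation (divides by n - 1).
sd_a = statistics.stdev(community_a)
sd_b = statistics.stdev(community_b)

print("means:", mean_a, mean_b)
print("standard deviations:", round(sd_a), round(sd_b))
```

Both means are identical, yet the standard deviation of community B is many times larger than that of community A, which is the difference the mean alone cannot show.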
The final basic concept is statistical testing. Suppose a researcher wanted to know whether the two communities described above were the same or different. The means are the same, but the standard deviations are very different. The question of interest is whether they are different enough to say they are significantly different. Statistical testing compares the actual difference to the difference that theory says would be expected if it were due to chance alone. If the observed difference is large enough, the conclusion can be made that there is a real difference, and that it is not just the result of chance variation.
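One way to make the idea of "chance alone" concrete is a permutation test, sketched below in Python. This is one illustrative technique among many and is not necessarily the test a researcher would choose for the communities above. The function name, the two groups, the number of shuffles and the random seed are all assumptions made for the example. The sketch repeatedly shuffles the pooled data into two arbitrary groups and counts how often the shuffled difference in means is at least as large as the observed one.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often random regrouping gives a difference in means
    at least as large as the observed one (two-sided permutation test)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


# Hypothetical incomes, in thousands of dollars, for two small groups.
group_a = [30, 32, 35, 38, 40]
group_b = [52, 55, 58, 60, 65]

p_value = permutation_p_value(group_a, group_b)
print("estimated p-value:", p_value)
```

A small value means that random regrouping rarely produces a difference as large as the one observed, which is the sense in which a difference is judged unlikely to be due to chance alone.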
Transforming data
Statisticians may transform data by taking the logarithm, square root, reciprocal, or other function if the data do not fit a normal distribution.[2][3] Results calculated on the transformed scale, such as confidence intervals, need to be transformed back to the original scale before they are presented.[4]
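As a minimal sketch of a logarithmic transformation, the Python code below computes a mean and an approximate 95% confidence interval on the log scale and then back-transforms the results with the exponential function. The function name and the sample values are hypothetical, and the multiplier 1.96 assumes a normal approximation; for a small sample a quantile of the t distribution would normally be used instead.

```python
import math
import statistics


def back_transformed_mean_ci(values, z=1.96):
    """Return (geometric mean, lower limit, upper limit) on the original scale.

    The mean and confidence limits are computed on the log scale and then
    back-transformed with exp(). Assumes every value is positive.
    """
    logs = [math.log(v) for v in values]
    mean_log = statistics.fmean(logs)
    sem_log = statistics.stdev(logs) / math.sqrt(len(logs))
    lower_log = mean_log - z * sem_log
    upper_log = mean_log + z * sem_log
    return math.exp(mean_log), math.exp(lower_log), math.exp(upper_log)


# Hypothetical right-skewed measurements.
measurements = [12, 15, 18, 22, 30, 45, 80, 150]

center, lower, upper = back_transformed_mean_ci(measurements)
print("geometric mean:", round(center, 1))
print("approximate 95% CI:", round(lower, 1), "to", round(upper, 1))
```

The back-transformed mean of the logarithms is the geometric mean, and the back-transformed interval is not symmetric around it.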
Summary statistics
Measurements of central tendency
- Mean: In everyday language, the mean is the average. Suppose there are five people who have, respectively, 1 TV in their house, 4 TVs, 2 TVs, no TVs and 3 TVs. The mean number of TVs per house is 2, because (1+4+2+0+3)/5 = 2.
- Median: The median is the point at which half are above and half are below. In the above example, the median is also 2, because in this group of 5 people, 2 people have more than 2 TVs in their house and 2 people have fewer than 2 TVs in their house.
Measurements of variation
- Standard deviation (SD) is a measure of variation or scatter. The standard deviation does not systematically shrink or grow as the sample size increases.
- Variance is the square of the standard deviation: $\mathrm{Variance} = \mathrm{SD}^2$
- Standard error of the mean (SEM) measures how accurately you know the mean of a population and is smaller than the SD for any sample with more than one observation.[5] The SEM becomes smaller as the sample size increases. The sample standard deviation (S) and SEM are related by: $\mathrm{SEM} = S / \sqrt{n}$, where $n$ is the sample size.
- The 95% confidence interval for the mean is approximately the mean ± 1.96 × SEM.
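The measures of variation listed above can be computed together in a short Python sketch. It reuses the five TV counts from the earlier example; the variable names are illustrative assumptions. The interval uses the multiplier 1.96, which assumes a normal approximation and is only rough for a sample this small.

```python
import math
import statistics

# The five TV counts from the earlier example.
sample = [1, 4, 2, 0, 3]
n = len(sample)

mean = statistics.fmean(sample)
sd = statistics.stdev(sample)           # sample standard deviation (divides by n - 1)
variance = statistics.variance(sample)  # square of the sample standard deviation
sem = sd / math.sqrt(n)                 # standard error of the mean

# Approximate 95% confidence interval for the mean.
ci_low = mean - 1.96 * sem
ci_high = mean + 1.96 * sem

print("mean:", mean)
print("SD:", round(sd, 3), "variance:", round(variance, 3))
print("SEM:", round(sem, 3))
print("approximate 95% CI:", round(ci_low, 3), "to", round(ci_high, 3))
```

Collecting more observations would leave the SD roughly where it is but would shrink the SEM, and with it the width of the confidence interval.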
Inferential statistics and hypothesis testing
Problems in reporting of statistics
In medicine, common problems in the reporting and usage of statistics have been inventoried.[6] These problems tend to exaggerate treatment differences.
References
- NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, 2006.
- Bland JM, Altman DG (March 1996). "Transforming data". BMJ 312 (7033): 770. PMID 8605469. PMC 2350481.
- Bland JM, Altman DG (May 1996). "The use of transformation when comparing two means". BMJ 312 (7039): 1153. PMID 8620137. PMC 2350653.
- Bland JM, Altman DG (April 1996). "Transformations, means, and confidence intervals". BMJ 312 (7038): 1079. PMID 8616417. PMC 2350916.
- What is the difference between "standard deviation" and "standard error of the mean"? Which should I show in tables and graphs?. Retrieved on 2008-09-18.
- Pocock SJ, Hughes MD, Lee RJ (August 1987). "Statistical problems in the reporting of clinical trials. A survey of three medical journals". N. Engl. J. Med. 317 (7): 426–32. PMID 3614286.