History and Philosophy of Psychology Essay
TOPIC 6 - Significance testing, as a method of statistical inference, was imported into psychology because it ‘provided the illusion of an objective, mechanized form of inductive inference’. Critically discuss.
Significance testing was imported into psychology to analyze variance in small samples. It gives us an objective tool for deciding whether a result deviates from the expected outcome enough to conclude that there is a statistical difference between the experimental mean and the hypothesized mean. Though the test is objective, the parameters set are subjective, as is the scientist’s decision to accept or reject a hypothesis. The key issue, therefore, is the illusion of objectivity held by researchers who are unaware of the limitations of significance tests. In the above quote, Gigerenzer refers to significance testing as a mechanized form of inference. This mechanization lies mainly in the way the test is currently used by a majority of researchers, not in the test itself. Significance testing is a tool for decision making, not a tool that makes the decision for the researcher. There are ways, however, of limiting this subjectivity and making the statistical test into a robust tool with modern applications.
Analyzing data in psychology poses some unique problems. In particular, when you randomly select a sample from a population, the sample mean will inevitably differ from the mean of the population, even when no treatment has been applied that could affect the dependent variable (DV). This can be caused by individual differences and confounding factors. Statistical testing was introduced to psychology to help researchers decide whether the experimental mean differs sufficiently from the population mean to suggest an effect of the treatment. In other words, it is a way of "coping with the question of chance" (Falk & Greenbaum, 1995, pg. 88). What Gigerenzer (1987, 1993) argues, however, is that psychologists have misinterpreted the purpose of the significance test, and have tried to convert it into an "illusion of an objective mechanical form of inductive inference" (1987, pg. 11). By looking at the history and evolution of significance testing we can see how this illusion came to be created. Gigerenzer does not, however, reject all use of null hypothesis significance testing (NHST). Rather, he warns against the inadequate, mechanical use of statistical testing. Significance testing can be a useful tool for researchers, if used in the appropriate conditions.
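The point that untreated samples differ from the population mean by chance alone can be illustrated with a small simulation. This is only a sketch under assumed conditions: the population mean of 100, standard deviation of 15, sample size of 20 and the function name `sample_mean_gaps` are hypothetical choices made for illustration, not values taken from any of the sources discussed here.

```python
import random
import statistics

def sample_mean_gaps(pop_mean=100.0, pop_sd=15.0, n=20, draws=1000, seed=0):
    """Draw `draws` untreated samples of size n from a normal population
    and return, for each sample, its mean minus the population mean."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    gaps = []
    for _ in range(draws):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        gaps.append(statistics.fmean(sample) - pop_mean)
    return gaps

gaps = sample_mean_gaps()

# Spread of the sample means around the population mean, from the simulation.
typical_gap = statistics.pstdev(gaps)

# Standard error of the mean predicted by theory: sd / sqrt(n).
theoretical_gap = 15.0 / 20 ** 0.5

print(f"simulated spread of sample means: {typical_gap:.2f}")
print(f"theoretical standard error:       {theoretical_gap:.2f}")
```

No treatment is applied anywhere in this sketch, yet every sample mean misses the population mean by some amount. The question a significance test addresses is whether an observed gap is larger than this kind of chance variation would readily produce.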
The origins of the significance test lie with Sir R. Fisher in the early twentieth century. Fisher came upon the problem while analyzing agricultural data. In analyzing crop effectiveness, it soon became obvious that variables such as weather and soil quality could not be controlled in the way that variables in laboratory experiments can (Cowles, 1989). Fisher set about finding a way of distinguishing a substantial difference from one produced purely by chance. Thus significance testing was born. In 1925, Fisher published Statistical Methods for Research Workers, of which "the majority of text books on methodology and statistics in the social sciences are the offspring" (pg. 6). Fisher proposed that a null hypothesis be set up that reflects the expected effect size in a population (generally zero, though not necessarily (Cohen, 1994, Rozeboom, 1960)). Analysis of the data produces a p-value, which is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true, that is, from chance variation around the expected population mean alone. Given the p-value, the researcher can then decide whether the evidence makes it statistically sound to take the results as significantly different from the null.
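As a rough illustration of how a p-value is obtained, the sketch below computes a two-sided p-value for a single sample mean. It assumes the simplest textbook case, a z-test in which the population standard deviation is known. This is a simplification chosen for clarity and is not a reconstruction of Fisher’s own procedures; the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, pop_sd, n):
    """Probability, if the null hypothesis were true, of a sample mean
    at least as far from null_mean as the one observed (z-test,
    population standard deviation assumed known)."""
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Area in both tails of the standard normal beyond |z|.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: 25 participants, observed mean 106, null mean 100, sd 15.
p = two_sided_p_value(sample_mean=106.0, null_mean=100.0, pop_sd=15.0, n=25)
print(f"p = {p:.4f}")  # z is exactly 2.0 here, so p is about 0.0455
```

The calculation itself returns only a number. Whether 0.0455 is small enough to act upon is a separate judgment that the procedure does not make for the researcher.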
This elegant theory did not rest unchallenged for very long. Two young mathematicians from the same university as Fisher tried to give the procedure a more definite application. Neyman and Pearson, not completely satisfied with Fisher’s work, added components to the theory. The main difference is the introduction of Type I and Type II errors, a framing which Fisher opposed. They also introduced the alternative hypothesis, which was mutually exclusive with the null. Neyman and Pearson believed that significance testing could be used to give objective measures of variance: by setting an acceptably low probability of a Type I and a Type II error a priori, you could then decide whether to reject the null hypothesis or accept it as true. Fisher and Neyman (Pearson was never as polarized in his views as were Neyman and Fisher (Cowles, 1989)) disagreed bitterly over these points, to a vindictive level which eventually cost Neyman his position at University College, London (pg. 192).
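The two error rates that Neyman and Pearson proposed to fix in advance can be illustrated by simulating many repetitions of the same experiment. The sketch below is a minimal illustration under assumed conditions: a two-sided z-test with known standard deviation, a sample size of 25, and a hypothetical true effect of 0.4 standard deviations. None of these values comes from the sources.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejection_rate(true_mean, null_mean=0.0, sd=1.0, n=25,
                   alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated experiments in which a two-sided z-test
    rejects the null hypothesis at level alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        z = (fmean(sample) - null_mean) / (sd / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# Null hypothesis true: every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=0.0)

# Null hypothesis false (true mean 0.4): every failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mean=0.4)

print(f"Type I error rate:  {type_i_rate:.3f}")
print(f"Type II error rate: {type_ii_rate:.3f}")
```

Under these assumptions the Type I rate stays close to the chosen alpha, while the Type II rate depends on the sample size and on the size of the true effect, which the researcher does not know in advance.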
Simply stated, the difference of opinion concerned the expected outcome of the significance test. According to Fisher, all decisions are provisional, and theories have to be constantly evaluated and re-tested (Cowles, 1989). Furthermore, Fisher states that "A test of significance contains no criterion for accepting a hypothesis. According to circumstances it may or may not influence its acceptability" (Fisher, 1959, cited Mulaik et al., pg. 78). Neyman and Pearson, unlike Fisher, held that if you gave all the relevant statistics (including power and error rates), significance testing could decide which hypothesis to accept. Neyman and Pearson both had a frequentist view of statistics, and thus defined their Type I and Type II errors in terms of repeated sampling from the same population. Fisher argued that this did not work with experiments where the population is infinitely regressing (Gigerenzer, 1993). Fisher did concede, however, that Neyman and Pearson’s theory has relevance to applications such as quality control in manufacturing. It is beyond the scope of this essay to delve critically into the mathematical pros and cons of either the Fisherian or the Neyman-Pearson approach. The point is that the two were competing ideas, and in a sense mutually exclusive.
The problem that many writers have pointed out (Cohen, 1994, Gigerenzer, 1987, 1993, Rozeboom, 1960) is that modern psychology is using a hybrid version of the two philosophies. This hybrid leaves everyone unsatisfied, and is in large part the cause of the deep-rooted problems of statistical testing (Gigerenzer, 1993). Neither mathematician would be happy with the result (Cowles, 1989). Significance testing is seen largely in modern literature as manna from heaven that produces decisions about H0. To those who are misled, it gives the false sense that all experiments can be evaluated by the same yardstick. This fallacy merely gives researchers a way to compare research mechanically, i.e. "your "p" is bigger than mine".
There are several reasons why this might be so. Rozeboom (1960) points out that psychologists are not mathematicians, and therefore do not feel comfortable trying to understand or criticize the mathematical principles involved. From the outside, statistical testing looks like a routine that you can apply to any situation and that will spot a true effect if there is one. This misconception might arise because of the language used to explain significance. Normal speech lacks mathematical clarity, and can therefore distort the meaning of the test (Falk & Greenbaum, 1995).
Gigerenzer (1993) points out that from 1920 to 1940, psychologists were performing quantitative experiments on single subjects. But during the 1950s, the use of significance testing rose from 25% to 80% (Danziger, 1990). Danziger postulates that this could have been done to legitimize the practicality of psychology, in order to serve what was then the largest market for psychology, educational administrators. Note also that Neyman and Pearson’s adaptations to significance testing are well suited to quality control and cost efficiency.
The way in which significance testing has been used for the past 50 years has led some to believe that it is not an objective measure. I propose that the test is objective. However, since the parameters on which it depends are subjective (sample size, alpha level, test construction), only a sound analysis of these parameters can lead us to an objective measurement. Gigerenzer does not question this; instead he warns against the mechanical use and acceptance of the test. NHST has resulted in less experimental ingenuity and construct reliability (Gigerenzer, 1987). Grayson (1998) points out that researchers do not always know the underlying mechanisms of the statistics. This is further aggravated by the common notion that significance testing is just a hurdle for publication (Harris, 1997).
Neyman predicted this subjectivity, stating that "there is a subjective element in the theory itself in the choice of significance level you are going to require... it is not very satisfying and rather pragmatic" (Cowles, pg. 186). Fisher also acknowledges this problem: "although the obtained associated probability is objective, its meaning is not exhausted by the relative frequency upon which it is based" (Fisher, 1973, cited Oakes, 1986). Fisher’s position that significance testing is objective is shared even by strong advocates of abolishing it. Schmidt and Hunter (1997) agree that "objectivity of the Significance test is sound". Significance testing is a measure that was created outside the data, and thus can be applied in an objective way. Furthermore, the outcome will be the same no matter who does the calculation. The decision you make from this result, however, is the subjective part.
Subjectivity in significance testing is easy to see. As mentioned earlier, there are three main subjective factors: alpha level, Type II errors and test construction. We can begin with the arbitrary way in which we come to choose the alpha value. Why is it that we choose the magical number 0.05 or 0.01 as the level of significance? Fisher is commonly credited with being the person who set the alpha at 0.05, but there is evidence that Pearson used this convention as early as 1900 (Cowles, 1989). Pollard (1993) summarizes the absurdity of the alpha level: "It seems rather harsh that... we often find ourselves in a position of not being able to conclude we have observed an effect simply because we have 5, rather than say 7, digits in our hands" (pg. 458).
Cohen (1994) argues that NHST tests are unhelpful because of the high Type II error rates that most experiments show. Yet if you look at the objectives and limitations of statistical testing, you can see that Type II errors can be minimized. Mulaik et al. (1997) question the validity of Cohen’s arguments, stating that the power that is worked out is related arithmetically to the effect size, which can only be arbitrarily assigned by Cohen. Furthermore, a Type II error can only occur when the null hypothesis is false, so if the null hypothesis has a high probability of being right, the likelihood of a Type II error diminishes and Type I errors become the main concern. In this way, significance testing can be a good measure of variance (Pollard, 1993, Rozeboom, 1960 and Hunter, 1997).
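The arithmetic link between power and effect size can be sketched for the simplest case. The example below assumes a two-sided, one-sample z-test with a standardized effect size (the true difference divided by the standard deviation); the function name `power_z_test` and the effect sizes and sample sizes used are hypothetical illustrations, not figures from Cohen or Mulaik et al.

```python
from math import sqrt
from statistics import NormalDist

def power_z_test(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test, where effect_size is the
    true mean difference in standard-deviation units."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)  # how far the true effect moves the z statistic
    # Probability that z lands beyond either critical value.
    return (1 - normal.cdf(z_crit - shift)) + normal.cdf(-z_crit - shift)

# The Type II error rate is 1 - power, so it changes with the assumed
# effect size and with the sample size the researcher chooses.
for effect_size in (0.2, 0.5, 0.8):
    for n in (10, 32, 100):
        power = power_z_test(effect_size, n)
        print(f"effect {effect_size:.1f}, n = {n:3d}: "
              f"power = {power:.2f}, Type II rate = {1 - power:.2f}")
```

Because the true effect size is unknown, any stated power rests on an assumed value, which is the sense in which the figure is assigned rather than observed. The same formula also shows that a larger sample reduces the Type II error rate for any fixed assumed effect.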
Part of the reason why researchers report such high Type II error rates is that tests are being poorly designed. Good data analysis means collecting good, appropriate data. Errors need to be minimized before tests are conducted if the analysis is to produce worthwhile results (Gigerenzer, 1987). Significance tests cannot be seen as a way of sanctifying a hypothesis, nor can they be seen as a substitute for good experimental design (Abelson, 1997). Significance testing was designed to filter out the noise that is unavoidable in data, not to detect experimental flaws, as I suspect is happening with so many Type II errors. Granted, the tests were built to deal with noise in variance, but they cannot cope with bad experiments. Gigerenzer (1993) suggests that early in the 1900s there was a shift from blaming errors on external factors to blaming them on internal factors, and postulates that significance testing was responsible.
Significance testing can be compared to correlation (in fact, Fisher developed NHST from intra-class correlation (Cowles, 1989)). Few would argue that correlation is subjective, yet our interpretation of it certainly is. Consider the positive correlation between height and IQ discussed by Cohen (Cohen, 1994, cited Oakes, 1986): the coefficient itself is not in dispute, but what we make of it is. So does significance testing have a place in modern psychology? The answer is a resounding yes. As we have seen, there are ways of making it a more robust measure of significant difference. Inference testing helps advance our research. Of course, what it cannot do is tell you whether your hypothesis is right or not. Though it can help to evaluate the data, the decision is always a subjective one. Statistics do not drive advancements in science, theories do. Meehl (1978, cited Gigerenzer, 1987) calls significance testing "one of the worst things [that] ever happened in the history of psychology" (pg. 314), while Estes (1997) and Schmidt & Hunter (1997) both declare that the Type II errors in statistical testing "retard progress". But they only retard progress if there are misconceptions about the role of significance tests in statistics. After all, statistics don’t kill theories, theories kill theories. By minimizing the mechanical application of significance testing and the blind faith placed in it, there is yet hope for it as a statistical tool of value.
References
Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8, 12-15.
Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997-1003.
Cowles, M. (1989). Statistics in Psychology: An Historical Perspective. New Jersey: Erlbaum.
Danziger, K. (1990). Constructing the Subject: Historical Origins of Psychological Research. Cambridge: Cambridge University Press.
Estes, W. K. (1997). Significance testing in psychological research: Some persisting issues. Psychological Science, 8, 18-19.
Falk, R. & Greenbaum, C. (1995). Significance tests die hard. Theory & Psychology, 5, 75-98.
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In G. Gigerenzer and M. S. Morgan, The Probabilistic Revolution, Vol. 2 (pp. 11-33). Cambridge, Mass: MIT Press.
Gigerenzer, G. (1993). The superego, the ego and the id in statistical reasoning. In G. Keren and C. Lewis, A Handbook for Data Analysis in the Behavioural Sciences: Methodological Issues (pp. 311-339). New Jersey.
Grayson, D. A. (1998). The frequentist façade and the flight from evidential inference. British Journal of Psychology.
Harris, R. J. (1997). Significance tests have their place. Psychological Science, 8, 8-11.
Hunter, J. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-6.
Mulaik et al. (1997). There is a time and a place for significance testing. In Mulaik et al., What if there were no significance tests? (pp. 64-115). New Jersey.
Oakes, M. (1986). Statistical Inference: A Commentary for the Social and Behavioural Sciences. New York: Wiley.
Pollard (1987). How significant is significant? In G. Gigerenzer and M. S. Morgan, The Probabilistic Revolution, Vol. 2 (pp. 11-33). Cambridge, Mass: MIT Press.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.