Statistics is as much moral philosophy and epistemology as it is mathematical analysis of data. One must make judgments about what one knows, what one thinks one knows, what the data represent, and how one thinks the world works. One must also have a strong moral sense of right and wrong, fair and unfair. The act of data analysis involves a huge set of judgments and decisions that must be made along the way, and almost all of them go unstated when the answer is presented to others. Lying with statistics is real, and most frequently one is really lying to oneself. Most importantly, I think every statistical analysis should be capable of telling the analyst that she or he is completely wrong – in other words, that none of the models being considered fit the data.
I’m a pragmatist by philosophical bent, which means that prediction is the basis of my knowing the world. Can I predict what will happen if I do something or observe something? As a consequence, I am also an experimentalist by nature. Can I design an experiment to test some idea or hypothesis? As such, I am always very leery of data mining or model selection exercises when they are used for more than hypothesis generation. To me, the ultimate test of an idea is to put it on the line, make a prediction, and then see if you are right by doing the hard work of collecting new data from some experimental manipulation. Experiments are by far the best way to evaluate whether some hypothesis is wrong or possibly right.
However, not all scientific hypotheses can be evaluated experimentally. In many areas and for many ideas, experimental tests are not feasible or even possible. For example, some areas of evolutionary biology are not open to experimentation – we can’t set up replicate worlds that have different treatments to see if we can predict the causes of major evolutionary trends. We also cannot set up replicate planets to experimentally test the causes of climate change.
As a consequence, many different types of statistics have been developed to analyze data collected about the world, to generate and test hypotheses, and to explore relationships in those data. At present, three main “types” of statistics are marshaled for analyses. These are commonly known as “frequentist”, “Bayesian”, and “model selection” approaches. Bayesian and model selection approaches are very popular today, but I have a great deal of trepidation about them, mainly because they have no way to reject a hypothesis outright. In other words, they can’t tell the analyst that she or he is wrong.
Bayesian statistics takes a model or set of models, specifies a prior probability for each model, and then calculates the posterior probabilities for those models. A Bayesian analysis essentially asks, “given what I already know, is there anything in these data that will change what I think?” Thus, a Bayesian analysis is influenced by your prior beliefs or knowledge about the validity of an hypothesis. There is considerable justification for including prior information in a statistical test. However, if your prior belief is wrong, how much data is needed to expunge that prior belief? I prefer each test to be independent of what has been done before. If one has a series of independent tests that all point to the same conclusion, is that not a more rigorous analysis of an hypothesis than a series of coupled and therefore dependent tests, where the results of one test define the starting point of the next? Moreover, the posterior probabilities are used only to rank the models included in the analysis.
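The question of how much data it takes to expunge a wrong prior can be made concrete with a small sketch. The following is my own illustrative example, not from the original analysis: a Beta-Binomial model of a coin’s bias, where the analyst starts with a strong but mistaken prior and the data come in 50/50. All of the numbers are assumptions chosen for illustration.

```python
# Illustrative sketch (not from the post): how slowly a strong, wrong
# prior is washed out by data under Beta-Binomial conjugacy.

def posterior_mean(prior_a, prior_b, heads, tails):
    """Posterior mean of a coin's bias under a Beta(prior_a, prior_b)
    prior after observing `heads` heads and `tails` tails."""
    return (prior_a + heads) / (prior_a + prior_b + heads + tails)

# Strong wrong prior: the analyst is "sure" the coin lands heads ~90%
# of the time (equivalent to having already seen 90 heads, 10 tails).
prior_a, prior_b = 90, 10

# The coin is actually fair; suppose the data arrive exactly 50/50.
for n in (10, 100, 1000):
    heads = tails = n // 2
    print(n, round(posterior_mean(prior_a, prior_b, heads, tails), 3))
# → 10 0.864
# → 100 0.7
# → 1000 0.536
```

Even after a thousand fair flips, the posterior mean is still pulled noticeably toward the mistaken prior, which is the dependence between tests that worries me.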
Model selection methods take a different tack. With model selection, the analyst defines the collection of possible models to be evaluated and then asks which is most consistent with the data being analyzed. This consistency is typically evaluated using a statistic such as the Akaike information criterion (AIC). The AIC does not provide a “goodness of fit” for each model, but simply a comparative statistic among models. Thus, what the analyst essentially calculates are the parameters of the model with the best AIC score. In essence, the analyst is asking, “assuming that I’m right, what are the parameters of the model most consistent with the data I have?”
My main problem with current model selection and Bayesian techniques is that they have no way of actually evaluating “goodness-of-fit”. There will always be a “best” model or set of “best” models among those included in the selection analysis. However, the “best” model may not be a good model. Thus, the possibility of rejecting all of the models being considered is simply not in the cards.
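The point that AIC always crowns a “best” model, good or not, can be sketched with a toy example of my own devising (the data and candidate models are assumptions for illustration): the data are strongly quadratic, but the only candidates offered are a flat line and a straight line. AIC dutifully ranks them, even though neither fits.

```python
# Illustrative sketch (not from the post): AIC always ranks some model
# "best" among the candidates, even when every candidate fits poorly.
import math

def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors, constants
    dropped: n * ln(RSS / n) + 2k, with k estimated parameters."""
    return n * math.log(rss / n) + 2 * k

# Strongly quadratic "data"; the candidate set holds only a constant
# model and a straight-line model.
xs = list(range(10))
ys = [x * x for x in xs]
n = len(xs)

# Candidate 1: constant model (the mean of y).
mean_y = sum(ys) / n
rss_const = sum((y - mean_y) ** 2 for y in ys)

# Candidate 2: ordinary least-squares straight line.
mean_x = sum(xs) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
rss_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# The straight line "wins" on AIC, yet neither model is a good one.
print(aic_least_squares(rss_line, n, 2) < aic_least_squares(rss_const, n, 1))
# → True
```

Nothing in the comparison itself tells the analyst that the winning straight line still leaves large, patterned residuals; that verdict requires a separate goodness-of-fit check.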
The techniques that are currently not in vogue are the oldest statistical methods – typically called “frequentist” methods, because they involve frequencies. Frequentist methods are those that eventually result in the dreaded P-value. The P-value is the probability of obtaining data at least as extreme as those observed if the null hypothesis were true; rejecting the null only when the P-value is small limits the chance of rejecting it wrongly (i.e., of a type I statistical error). The frequentist analyst is asking, “given my hypothesis, can I find support for it here?” And the frequentist approach is a model fitting approach if done correctly. However, a frequentist analysis can also reject, and it can lead the analyst to the conclusion that none of the possible models being considered has any support! In other words, one can decide that what one thinks is actually wrong! This critical conclusion is not a possibility in a Bayesian or model selection analysis.
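The “I’m wrong” conclusion can be illustrated with a classic goodness-of-fit test. The following is my own sketch, not from the post: the observed counts and the two candidate models are invented for illustration, and a chi-square test rejects every candidate on offer.

```python
# Illustrative sketch (not from the post): a frequentist goodness-of-fit
# test can reject *every* candidate model, unlike AIC or posterior ranks.
from scipy.stats import chisquare

observed = [5, 8, 45, 42]          # invented counts in four categories
total = sum(observed)

# Two invented candidate models for the category probabilities.
candidates = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "40/30/20/10": [0.40, 0.30, 0.20, 0.10],
}

for name, probs in candidates.items():
    expected = [p * total for p in probs]
    stat, p_value = chisquare(observed, f_exp=expected)
    # A small P-value says: data this extreme are very unlikely if this
    # model were true - so reject it.
    print(name, p_value < 0.05)
# → uniform True
# → 40/30/20/10 True
```

Both candidates are rejected at the 0.05 level, so the analysis hands back the verdict a ranking never can: none of the models under consideration fits these data.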
Which brings me back to my original premise in this post. Statistics is about moral philosophy and the way one sees the world. Admitting that what you think is wrong has to be part of any good scientist’s thinking. Thus, the statistical analyses we use to evaluate hypotheses need to have “I’m wrong” as a clear and essential possible conclusion. Many problems exist with the use and interpretation of P-values in frequentist approaches, to be sure. However, the essential feature of frequentist approaches that is completely lacking from the other types of analyses is the possibility of “I’m wrong” as a conclusion.