Correspondence

More on Torturing Data

N Engl J Med 1994; 330:861-862. March 24, 1994

To the Editor:

We enjoyed Mills's recent Sounding Board article (Oct. 14 issue)[1] deploring the torture of data and would like to add a few points. Multiple-hypothesis testing (data torturing) by a desperate or inexperienced investigator is analogous to the indiscriminate ordering of diagnostic tests by a floundering clinician -- they both waste time, money, and energy as red herrings are pursued. But occasionally an unexpected yet valid result emerges from the tortured data, like a lungfish from a puddle of Devonian mud. The finding is no less important if the investigator muddled through numerous puddles before stumbling upon it than if he or she knew what to fish for and caught it on the first try.

The problem arises in separating evolutionary advances (true positives) from Darwinian dead ends (false positives). The determination is not based merely on the presence or absence of a P value less than 0.05 or, as Mills suggests, a confidence interval. Contrary to what the author states, because a confidence interval is data-based, it cannot provide more information than a P value in the determination of the likelihood that a result represents a true positive. Instead, a 95 percent confidence interval represents the range of values that are consistent with the study result; any value outside the confidence interval would be rejected if P was less than 0.05.[2]

To separate the advances from the dead ends, one must consider two other factors: the prior probability of the hypothesis and the degree of desired certainty. In determining the prior probability, the reader should take into account not only the biologic plausibility but also previous evidence supporting similar hypotheses and the existence of alternative explanations. As for how certain one wishes to be, that depends in large part on the potential costs and benefits of accepting a study's results.
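The role of prior probability can be illustrated with Bayes' rule. The following is only a sketch: the function name `posterior_true_positive` and the default power of 0.80 are assumptions chosen for illustration and do not come from the letter.

```python
def posterior_true_positive(prior, alpha=0.05, power=0.80):
    """Probability that a 'significant' result reflects a real effect.

    prior: assumed probability, before the study, that the hypothesis is true
    alpha: type I error rate (false-positive rate when the hypothesis is false)
    power: assumed probability of a significant result when the hypothesis is true
    """
    true_pos = prior * power            # real effects that reach significance
    false_pos = (1.0 - prior) * alpha   # null effects that reach significance
    return true_pos / (true_pos + false_pos)


# Hypothetical priors, from an implausible to an even-odds hypothesis.
for prior in (0.01, 0.10, 0.50):
    print(f"prior={prior:.2f} -> posterior={posterior_true_positive(prior):.3f}")
```

Under these assumed values, a significant result for a hypothesis with a prior probability of 0.10 has a posterior probability of 0.64 of being a true positive, so the same P value supports very different conclusions depending on the prior.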

Daniel B. Stryer, M.D.
Warren Browner, M.D., M.P.H.
Veterans Affairs Medical Center, San Francisco, CA 94121

Thomas Newman, M.D., M.P.H.
University of California, San Francisco, CA 94143

References

  1. Mills JL. Data torturing. N Engl J Med 1993;329:1196-1199.

  2. Browner WS, Newman TB. Confidence intervals. Ann Intern Med 1986;105:973-974.

To the Editor:

Mills's demand that hypotheses be generated before data collection and that studies be limited to questions that make sense is contrary to the spirit of scientific inquiry and the history of discovery. Newton first looked at the data and then formulated his hypothesis about gravity. It was and is a senseless hypothesis:

That gravity should be innate, inherent, essential to matter, so that one body may act upon another at a distance through a vacuum, without the mediation of anything else, by and through which the action and force may be conveyed from one to another, is to me so great an absurdity that I believe no man who has in philosophical matters a competent faculty of thinking can ever fall into it.[1]

In a similar manner, quantum mechanics represents a hypothesis in response to data. It makes no sense but it works, and that is sufficient. If editors had acceded to Mills's demands for “a clear biologic mechanism that could account for an effect in one subgroup but not in others” and that hypotheses be generated before data analysis, Mendel's laws would have remained unpublished and the double helix undiscovered.

Douglas Dix, Ph.D.
University of Hartford, West Hartford, CT 06117

References

  1. Kline M. Mathematics: the loss of certainty. New York: Oxford University Press, 1980:55-6.

To the Editor:

I would like to suggest a few more things to be concerned about in data analysis and reporting. First, I would argue that “proof” is not an objective phenomenon but rather lies in the mind of the thinker or reader. I find inappropriate Mills's remark that “study data . . . can be made to prove whatever the investigator wants to prove.” Support the investigator's hypothesis? Perhaps yes. But serve as proof to skeptical readers? From one study? Not often. Look at how long and how many studies it took to convince most people (but even now, not all people) that cigarette smoking is a cause of lung cancer, or even fewer people that super-radical mastectomy is not the treatment of choice for breast cancer. Proof usually requires a confirmation of a well-established, well-accepted theory or, in the absence of such theory, several confirmatory studies. (“Several” may turn out to be a large number, as in the history of studies of cigarette smoking and lung cancer).

Second, I raise a problem of inference. As Mills points out in one of his examples, given 158 possibly independent comparisons at the 0.05 level, one would expect to find 7 or 8 comparisons “statistically significant.” The research cited found 9, leading Mills to comment that “eight of those nine results could easily have occurred by chance.” Does that mean 1 of the 9 is real? Which one? Only 1? How does one find out? How could I identify that one (or more) in the absence of theory or replication, or both?
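The arithmetic behind "7 or 8" can be checked with a short sketch. It assumes the 158 comparisons are independent and that every null hypothesis is true; the helper `binomial_tail` is a hypothetical name introduced here, not something from the letter or the original article.

```python
from math import comb


def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_tests, alpha, observed = 158, 0.05, 9

# Expected number of "significant" comparisons if no real effect exists.
expected = n_tests * alpha

# Chance of seeing at least the observed number of significant comparisons
# under the same all-null, independent-tests assumption.
tail = binomial_tail(n_tests, alpha, observed)

print(f"expected false positives: {expected:.1f}")
print(f"P(at least {observed} significant by chance): {tail:.3f}")
```

The expected count is 7.9, which matches the "7 or 8" quoted above. The calculation says nothing about which of the nine findings, if any, is real; that is the inference problem raised in this paragraph.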

Third, the probability of 0.05 applies only to type I errors -- false positives. There are also false negatives -- the failure to find a real effect when one truly exists. Most often, this involves the power of a study, and the smaller the study the lower the power. Thus, in the study cited by Mills in which 158 comparisons were made, I would expect that some of these comparisons were among groups with very few subjects -- making the probability of finding an effect significant, especially a small effect, most unlikely. Fractionating data is often a way of producing “negative” results. I regard this fractionating a form of data torturing, too. In a spasm of cynicism a colleague once remarked bitterly, “If you want to show no significant effect it is easy. Just do a small, sloppy experiment.”
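The link between group size and power can be sketched with a normal approximation. This assumes a two-sided, two-sample z-test with equal group sizes and a standardized effect size; the function name and the example numbers are illustrative assumptions, not details of the study Mills cited.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size: assumed difference in means divided by the common standard deviation
    n_per_group: number of subjects in each of the two groups
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed effect.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical subgroup sizes for a moderate assumed effect of 0.5 SD.
for n in (8, 16, 32, 64):
    print(f"n per group={n:3d} -> power={approx_power(0.5, n):.2f}")
```

Under these assumptions, 64 subjects per group give a power of roughly 0.8 for an effect of 0.5 standard deviations, and power falls steadily as the groups shrink, which is how fractionating data can manufacture "negative" results.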

As for P values and confidence limits, once you report confidence limits I see little need to also report P values. But if your computer program computes these for you, and if the editor does not object, no harm will be done by publishing both.

Marvin A. Schneiderman, Ph.D.
National Research Council, Washington, DC 20418

To the Editor:

Mills wrongly calculates the probability that the differences in the birth-defects study would all be real, as 0.95^158. Instead, the probability sought would be obtained by multiplying n factors, n being the number of differences found (which, according to the article, is different from 158). Each factor would be equal to 1 minus the P value obtained for the corresponding difference. The figure given by the author is the a priori probability that 158 differences, significant at the 0.05 level, would all be real.
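The calculation proposed in this letter can be written out as a sketch. It takes the letter's framing at face value, treating 1 minus each P value as the chance that the corresponding difference is real; the P values below are invented for illustration and are not taken from the birth-defects study.

```python
from math import prod


def prob_all_real(p_values):
    """Product of (1 - P) over the differences actually found,
    following the calculation described in the letter."""
    return prod(1.0 - p for p in p_values)


# Hypothetical P values for nine significant differences (not real study data).
found = [0.04, 0.03, 0.049, 0.01, 0.02, 0.045, 0.005, 0.03, 0.04]
print(f"letter's calculation for 9 findings: {prob_all_real(found):.3f}")

# The a priori figure attributed to Mills: 158 factors of 0.95.
print(f"158 factors of 0.95: {prob_all_real([0.05] * 158):.6f}")
```

The two results differ by orders of magnitude because the first multiplies only the differences found, whereas the second multiplies one factor for every comparison made.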

Frank De Geeter, M.D.
Saint-John's General Hospital, 8000 Brugge, Belgium

Author/Editor Response

Dr. Mills replies:

To the Editor: Dr. Schneiderman points out that one study is not accepted as proof. As I indicated, those who censor data (Procrustean data torturing) often choose a popular hypothesis, hoping that theirs will be seen as the study that “proves” the point. Such studies are more dangerous than those that produce “exciting new findings” by opportunistic data torturing, because scientists will demand confirmation of the latter.

False negatives were not addressed in my paper. Anyone who has tried to publish a negative paper knows why data torturing to produce negative results is not a major problem. However, readers should be on the lookout for negative data torturing when studies report no adverse or toxic effects. Inadequate sample size and misclassification of exposure or outcome should be considered as possible explanations for negative findings in such studies.

I indicated that a confidence interval is useful in determining “the precision of an estimate and the likely values of a measure (such as the relative risk of disease).” I did not suggest that confidence intervals are superior to P values for determining whether an observation is due to chance.

Drs. Schneiderman and De Geeter discuss the problem of identifying true findings when multiple tests are performed. When one makes 158 independent comparisons, using a P value of 0.05, the probability of not making a type I error (calling a negative finding positive) is only 3 in 10,000. Apparently, this point was not stated clearly. As Schneiderman points out, when multiple comparisons are made, it can be very difficult to determine which of the positive findings are wheat and which are chaff. I would argue, as do Stryer et al., that the biologic plausibility of a finding is helpful in identifying true results. Replication of the finding is critical.
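The "3 in 10,000" figure follows from multiplying the chance of avoiding a type I error across all comparisons. A minimal sketch, assuming independent comparisons and that every null hypothesis is true (the function name is hypothetical):

```python
def prob_no_type_i_error(n_comparisons, alpha=0.05):
    """Chance of zero false positives across independent comparisons,
    assuming no real effect exists in any of them."""
    return (1.0 - alpha) ** n_comparisons


for m in (1, 10, 50, 158):
    p_none = prob_no_type_i_error(m)
    print(f"{m:3d} comparisons: P(no type I error)={p_none:.4f}, "
          f"P(at least one)={1 - p_none:.4f}")
```

For 158 comparisons the result is about 0.0003, in agreement with the figure quoted above.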

To some extent Dr. Dix and I have more of a semantic than a real disagreement. Hypotheses do not spring forth fully formed, like Athena from the forehead of Zeus. Clearly, they result from the investigator's earlier observations, clinical experience, or data. Our real disagreement is over his suggestion that studies not be limited to questions that make sense. Newton himself wasted a great deal of time on alchemy. Alchemy seemed illogical, and by golly, it was illogical. In my experience, for every fishing expedition (a study without a good scientific hypothesis) that comes back with a coelacanth (or a lungfish), there are dozens that catch only boots.

James L. Mills, M.D., M.S.
National Institutes of Health, Bethesda, MD 20892
