Correspondence

Appropriateness Studies

N Engl J Med 1994;330:432-434. February 10, 1994

To the Editor:

Phelps is correct to warn of the methodologic limitations of consensus-based definitions of appropriateness (Oct. 21 issue)1. His suggestion that a panel's decisions can be affected by the composition of its membership is true. We found striking differences between the views of two panels (one composed of surgeons, the other of general practitioners, radiologists, internists, and surgeons) on the appropriateness of cholecystectomy2. An additional concern is how agreement and disagreement are defined, which can have a profound effect on the results3.

I do not, however, share Phelps's puzzlement over the observation that estimated rates of inappropriateness explain little of the wide geographic variation in rates of use of discretionary interventions or over the finding that rates of inappropriateness do not vary among insurance plans. These findings simply confirm the central role of clinical judgment in clinical behavior and suggest that clinicians experience similar difficulties coping with uncertainty regardless of where they work or how they are paid. In a prospective study of prostatectomy for benign disease in the United Kingdom, we found that surgery performed in patients with private insurance was considered as appropriate (defined in terms of the severity of symptoms, clinical findings, and health status) as surgery in patients in the public sector, where such procedures are commonly rationed4.

Nick Black, M.D.
London School of Hygiene and Tropical Medicine, London WC1E 7HT, United Kingdom

References
  1. Phelps CE. The methodologic foundations of studies of the appropriateness of medical care. N Engl J Med 1993;329:1241-1245.
  2. Scott EA, Black N. Appropriateness of cholecystectomy in the United Kingdom -- a consensus panel approach. Gut 1991;32:1066-1070.
  3. Scott EA, Black N. When does consensus exist in expert panels? J Public Health Med 1991;13:35-39.
  4. Black NA, Petticrew MP, McPherson CK. Comparisons of public and private patients undergoing elective TURP for benign prostatic hypertrophy. Qual Health Care 1993;2:11-16.

To the Editor:

Phelps correctly points out that the appropriateness method would benefit from further development and testing. (What would not?) He also suggests a framework for thinking about possible biases in the method, which may appeal to some people. But he incorrectly concludes that estimated rates of inappropriate treatment are likely to be biased substantially upward by false positive rates. He cites recent appropriateness studies that undermine his conclusion1,2 but ignores their implications.

There are four categories of cases, as shown in Table 1 (Categories of Estimated and Actual Cases of Inappropriate Treatment). The actual number of inappropriate cases is the sum of the true positive cases and the false negative cases. The estimated number of inappropriate cases is the sum of the true positive cases and the false positive cases. The estimate is biased unless the number of false negative cases exactly equals the number of false positive cases.

As Phelps and one of the reports he cites3 point out, one needs repeated classifications to estimate false negative and false positive cases. However, one can establish a boundary for the estimates on the basis of a single study3. The recent studies of the appropriateness of cardiac procedures in New York State1,2 provide a tight upper bound for false positive cases.

The estimated rate of inappropriate coronary-artery bypass surgery in New York in 1990 was 2.4 percent,1 which equals the sum of true positive and false positive cases. Thus, false positive cases cannot exceed 2.4 percent. This is an upper bound on the bias in the estimated inappropriateness rate for three reasons: there are apt to be at least some true positive cases in the 2.4 percent, the upward bias is reduced if there are any false negative cases, and the bias is smaller for higher true rates of inappropriateness.
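The bounding argument above can be illustrated with a minimal sketch. It assumes a standard misclassification model in which the review method has a fixed sensitivity and specificity; that parameterization, the function names, and the numeric inputs are illustrative assumptions, not values reported in the studies cited. Only the 2.4 percent estimate comes from the text.

```python
def estimated_rate(true_rate, sensitivity, specificity):
    """Share of cases flagged as inappropriate: true positives plus false positives."""
    true_pos = true_rate * sensitivity
    false_pos = (1.0 - true_rate) * (1.0 - specificity)
    return true_pos + false_pos


def bias(true_rate, sensitivity, specificity):
    """Estimated rate minus actual rate, i.e. false positives minus false negatives."""
    return estimated_rate(true_rate, sensitivity, specificity) - true_rate


if __name__ == "__main__":
    # Hypothetical method: 90% sensitivity; specificity chosen so that the
    # estimate is 2.4% when no case is truly inappropriate (the worst case).
    sens, spec = 0.90, 0.976
    for true_rate in (0.0, 0.01, 0.02):
        est = estimated_rate(true_rate, sens, spec)
        b = bias(true_rate, sens, spec)
        # The bias never exceeds the estimate, and it shrinks as the
        # true rate of inappropriateness rises.
        print(f"true={true_rate:.3f} estimated={est:.4f} bias={b:.4f}")
```

Under these assumptions the bias equals the full estimate only when every flagged case is a false positive and no case is missed; any true positive or false negative case moves it below that ceiling.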

The methods are nearly identical for all the cardiology procedures and are generally the same for other procedures, so the misclassification rates should be similar. An adjustment of the previously reported inappropriateness rates to account for the 2.4 percent upper-bound bias does not change the import of any of them.

We support Phelps's call for additional validation of appropriateness methods, though we have already done more than Phelps acknowledges4 and further work is under way. For comparison, we would like to know more about misclassification rates for individual physicians exercising their clinical judgment unfettered by guidelines.

Rolla Edward Park, Ph.D.
Robert H. Brook, M.D., Sc.D.
Rand Corporation, Santa Monica, CA 90407

References
  1. Leape LL, Hilborne LH, Park RE, et al. The appropriateness of use of coronary artery bypass graft surgery in New York State. JAMA 1993;269:753-760.
  2. Hilborne LH, Leape LL, Bernstein SJ, et al. The appropriateness of use of percutaneous transluminal coronary angioplasty in New York State. JAMA 1993;269:761-765.
  3. Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 1988;41:923-937.
  4. Brook RH. The RAND/UCLA appropriateness method. In: McCormick KM, Moore S, Siegel RA, eds. Clinical practice guideline development: methodologic perspectives. Rockville, Md.: Public Health Service (in press).

To the Editor:

Our experience with the use of appropriateness criteria suggests that the method has sufficient validity if it is implemented in a fashion that overcomes specific limitations. Furthermore, implemented carefully, appropriateness criteria can improve decisions about patient care.

Our company has developed a prospective data-collection tool to assess the appropriateness of proposed major procedures. We have developed criteria using the Rand method. Health maintenance organizations, insurers, and utilization-review companies use this system to assess the care received by 12 million people. Over 450,000 cases have been analyzed. We have endeavored to overcome some of the limitations outlined by Phelps.

First, we explicitly incorporate patients' preferences. For example, review nurses collect information on a prospective basis directly from the patients (including the desire to preserve fertility in cases of a proposed hysterectomy). The review nurses also collect data directly from the physicians or members of their office staff.

Second, individual cases of potentially inappropriate treatment are subjected to a closer review by specially trained physician advisers. These advisers (who are often peer-matched specialists) contact the attending physicians, verify pertinent data, and explore extenuating circumstances. This important step overcomes the limitations inherent in retrospective reviews based solely on information contained in the medical record.

Third, clients use a conservative approach and focus only on clearly inappropriate cases (those rated 1 to 3 without disagreement according to the system described by Phelps). Furthermore, potentially inappropriate treatment in cases undergoing physician review is approved when extenuating circumstances exist, even when those cases fail to match one of the explicit indications developed by the consensus panel.

Using this multistep review process, we have found a 19 percent rate of inappropriateness for hysterectomy (19,900 cases). These data closely mirror the Rand data for health maintenance organizations1. For the nine most commonly reviewed procedures, we have found an 11 percent rate of inappropriateness (150,467 cases)2.

In addition to finding substantial rates of inappropriateness, groups using this system have also observed marked changes in decision making. By examining population-based rates of procedures before and after instituting this approach and comparing them with rates for control (unreviewed) procedures, health maintenance organizations and other managed-care plans documented net reductions in use ranging from 15 to 37 percent2.

Robert W. Dubois, M.D., Ph.D.
Value Health Sciences, Santa Monica, CA 90404

References
  1. Bernstein SJ, McGlynn EA, Siu AL, et al. The appropriateness of hysterectomy: a comparison of care in seven health plans. JAMA 1993;269:2398-2402.
  2. Kosecoff JK, Patricelli RE, Dubois RW, et al. How guidelines change physician behavior. Presented at the 10th Annual Meeting of the Association for Health Services Research, Washington, D.C., June 28, 1993. abstract.

To the Editor:

Phelps assumes that the method for determining appropriateness has a purely descriptive function -- namely, to identify instances of inappropriate care. Instead, if we view the method as having the partially normative (prescriptive) function of defining how treatment should be rendered, then Phelps's approach to evaluation loses its relevance.

Evaluating the use of the appropriateness method as a purely descriptive enterprise assumes, in Phelps's terms, a “gold standard” against which the method could, in theory, be validated. We know, however, that in many instances there is no such gold standard. You suggest in your editorial that individual clinical discretion is superior to consensus guidelines,1 but individual clinical decisions are the very subjects of measurement. Therefore, they cannot be the source for validation. In the absence of data from rigorous outcomes research, we lack any Archimedean perspective -- a third point in intellectual space -- from which to determine whether group consensus or individual variation produces the more “correct” results.

The consensus-panel method remains important as a normative exercise. It purports not merely to identify instances of bad medical practice but also to define good and bad practice. This normative dimension cannot be subjected to an analysis of sensitivity and specificity, because it is its own gold standard. Appropriate care is what a panel of experts says is appropriate when the panel is assembled and instructed in the manner the method specifies.

This happens to be how juries determine the standard of care in malpractice cases. Often, there is no single, established custom; juries must choose which among competing customs they think is right. Although the legal standard of care purports to identify the prevailing custom, the legal process actually derives a standard of care by enforcing the result that a jury reaches after hearing certain evidence and deliberating under certain instructions. According to this view, it is not possible to prove the jury wrong. Its pronouncement is law.

Mark A. Hall, J.D.
Bowman Gray School of Medicine, Winston-Salem, NC 27157-1063

References
  1. Kassirer JP. The quality of care and the quality of measuring it. N Engl J Med 1993;329:1263-1265.

Author/Editor Response

Dr. Phelps replies:

To the Editor: Park and Brook conclude that the 2.4 percent estimated inappropriateness rate for coronary-artery bypass grafts in New York provides a “tight upper bound” on the false positive rate that applies generally, asserting “The methods are . . . generally the same for other procedures, so the misclassification rates should be similar.” Their conclusion contains three logical errors. First, estimated inappropriateness rates depend on the quality of the clinical data, which surely varies according to the particular clinical and geographic setting. Second, the criteria used by any single panel may or may not represent well the universe of possible panels' results, the distribution of which is relevant for an understanding of the value of the appropriateness method. Black cites two studies showing how the composition of a panel and basic definitions of agreement affect appropriateness criteria. Repeated applications of a single panel's appropriateness criteria, as described by Dubois, do not illuminate the issue. Instead, researchers must employ multiple independent panels (a minimum of three1) to measure all relevant factors. Eventually, it is desirable to measure actual patient outcomes by appropriateness criteria. Third, the ultimate accuracy of any appropriateness criterion depends on the quality of the scientific evidence available to panels as they construct ratings of appropriateness. The scientific evidence available for judging the appropriateness of coronary-artery bypass grafts (the source of the 2.4 percent upper bound) exceeds that available for many other procedures to which appropriateness criteria have been or could be applied. For these reasons, no logic supports the extrapolation of the 2.4 percent maximal error rate to other appropriateness studies. The question must be studied separately in each clinical setting.

Hall raises a different point, asserting that, like juries in legal settings, appropriateness panels define the truth and hence are their own gold standard. He overlooks a critical difference between legal cases and the use of medical interventions: strong scientific evidence (for example, data from outcomes studies or randomized, controlled trials) can provide a gold standard unlike that provided by juries or, indeed, unlike appropriateness criteria based on only weak scientific evidence.

Dubois, Park, and Brook believe, and I concur, that using appropriateness-based guidelines may well improve clinical care. This does not mean that current medical opinion, no matter how carefully codified, should ultimately be substituted for better science to support guidelines. Remember that gastric freezing, routine tonsillectomies, and thymus irradiation, among other now-discredited procedures, were once standard practice. We should continue to learn the strengths and weaknesses of appropriateness methods and, more important, continue to develop a stronger scientific basis for medical decision making, including outcomes studies, randomized, controlled trials, meta-analyses, clinical epidemiologic studies, and formal decision models to structure all the available evidence.

Charles E. Phelps, Ph.D.
University of Rochester Medical Center, Rochester, NY 14642

References
  1. Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 1988;41:923-937.

Author/Editor Response

Dr. Kassirer replies:

I have no doubt that procedures are used inappropriately by many physicians, and in previous commentaries I explored some of the reasons for such misuse1-3. Like Phelps, however, I believe that the methods currently used to assess appropriateness still need independent validation. Such validation must come from some source other than consensus -- for example, from observations of the outcomes of decisions deemed to be appropriate, inappropriate, or something in between.

It is refreshing to learn that appropriateness criteria are being applied in a flexible fashion, at least by one commercial vendor. Yet companies that market these services do not always disclose the methods they use to judge the appropriateness of procedures. I worry that some physicians may be subjected to erroneous criticism on the basis of a flawed application of existing methods. To prevent such misinterpretations, the methods for these studies should be explicit and in the public domain. I believe these issues need to be addressed before current methods to detect inappropriate use are expanded4.

Jerome P. Kassirer, M.D.

References
  1. Kassirer JP, Pauker SG. Should diagnostic testing be regulated? N Engl J Med 1978;299:947-949.
  2. Kassirer JP. Our stubborn quest for diagnostic certainty: a cause of excessive testing. N Engl J Med 1989;320:1489-1491.
  3. Kassirer JP, Kopelman RI. Cognitive errors in diagnosis: instantiation, classification, and consequences. Am J Med 1989;86:433-441.
  4. Brook RH. Maintaining hospital quality: the need for international cooperation. JAMA 1993;270:985-987.
