
Draft Guidance with Comments

Comments in boldface

Comments on Docket No. 97D-0174

ICH: Draft Guideline on Statistical Principles for Clinical Trials

Center for Drug Development Science
Georgetown University Medical Center
Washington, DC

Department of Health and Human Services
Food and Drug Administration

International Conference on Harmonization; Draft Guideline on Statistical Principles for Clinical Trials; Notice of Availability


DEPARTMENT OF HEALTH AND HUMAN SERVICES

Food and Drug Administration

[Docket No. 97D-0174]

International Conference on Harmonisation; Draft Guideline on Statistical Principles for Clinical Trials; Availability

AGENCY: Food and Drug Administration, HHS.

ACTION: Notice.

SUMMARY: The Food and Drug Administration (FDA) is publishing a draft guideline entitled "Statistical Principles for Clinical Trials.'' The draft guideline was prepared under the auspices of the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). The draft guideline is intended to provide recommendations to sponsors and scientific experts regarding statistical principles and methodology which, when applied to clinical trials for marketing applications, will facilitate the general acceptance of analyses and conclusions drawn from the trials.

DATES: Written comments by June 23, 1997.

ADDRESSES: Submit written comments on the draft guideline to the Dockets Management Branch (HFA-305), Food and Drug Administration, 12420 Parklawn Dr., rm. 1-23, Rockville, MD 20857. Copies of the draft guideline are available from the Drug Information Branch (HFD-210), Center for Drug Evaluation and Research, Food and Drug Administration, 5600 Fishers Lane, Rockville, MD 20857, 301-827-4573. Single copies of the draft guideline may be obtained by mail from the Office of Communication, Training and Manufacturers Assistance (HFM-40), Center for Biologics Evaluation and Research (CBER), 1401 Rockville Pike, Rockville, MD 20852-1448 or by calling the CBER Voice Information System at 1-800-835-4709 or 301-827-1800. Copies may be obtained from CBER's FAX Information System at 1-888-CBER-FAX or 301-827-3844.

FOR FURTHER INFORMATION CONTACT: Regarding the guideline: Robert T. O'Neill, Center for Drug Evaluation and Research (HFD-700), Food and Drug Administration, 5600 Fishers Lane, Rockville, MD 20857, 301-827-3195. Regarding the ICH: Janet J. Showalter, Office of Health Affairs (HFY-20), Food and Drug Administration, 5600 Fishers Lane, Rockville, MD 20857, 301-827-0864.

SUPPLEMENTARY INFORMATION: In recent years, many important initiatives have been undertaken by regulatory authorities and industry associations to promote international harmonization of regulatory requirements. FDA has participated in many meetings designed to enhance harmonization and is committed to seeking scientifically based harmonized technical procedures for pharmaceutical development. One of the goals of harmonization is to identify and then reduce differences in technical requirements for drug development among regulatory agencies.

ICH was organized to provide an opportunity for tripartite harmonization initiatives to be developed with input from both regulatory and industry representatives. FDA also seeks input from consumer representatives and others. ICH is concerned with harmonization of technical requirements for the registration of pharmaceutical products among three regions: The European Union, Japan, and the United States. The six ICH sponsors are the European Commission, the European Federation of Pharmaceutical Industries Associations, the Japanese Ministry of Health and Welfare, the Japanese Pharmaceutical Manufacturers Association, the Centers for Drug Evaluation and Research and Biologics Evaluation and Research, FDA, and the Pharmaceutical Research and Manufacturers of America. The ICH Secretariat, which coordinates the preparation of documentation, is provided by the International Federation of Pharmaceutical Manufacturers Associations (IFPMA).

The ICH Steering Committee includes representatives from each of the ICH sponsors and the IFPMA, as well as observers from the World Health Organization, the Canadian Health Protection Branch, and the European Free Trade Area.

On January 17, 1997, the ICH Steering Committee agreed that a draft guideline entitled "Statistical Principles for Clinical Trials'' should be made available for public comment. The draft guideline is the product of the Efficacy Expert Working Group of the ICH. Comments about this draft will be considered by FDA and the other regulatory agency members of the Efficacy Expert Working Group.

The draft guideline addresses principles of statistical methodology applied to clinical trials for marketing applications. The draft guideline provides recommendations to sponsors in the design, conduct, analysis, and evaluation of clinical trials of an investigational product in the context of its overall clinical development. The draft guideline also provides guidance to scientific experts in preparing application summaries or assessing evidence of efficacy and safety, principally from late Phase II and Phase III clinical trials. Application of the principles of statistical methodology is intended to facilitate the general acceptance of analyses and conclusions drawn from clinical trials.

This draft guideline represents the agency's current thinking on statistical principles for clinical trials of drugs and biologics. It does not create or confer any rights for or on any person and does not operate to bind FDA or the public. An alternative approach may be used if such approach satisfies the requirements of the applicable statute, regulations, or both.

Interested persons may, on or before June 23, 1997, submit to the Dockets Management Branch (address above) written comments on the draft guideline. Two copies of any comments are to be submitted, except that individuals may submit one copy. Comments are to be identified with the docket number found in brackets in the heading of this document. The draft guideline and received comments may be seen in the office above between 9 a.m. and 4 p.m., Monday through Friday.

An electronic version of this draft guideline is available on the Internet using the World Wide Web (WWW) (http://www.fda.gov/cder/guidance.htm) or through the CBER home page (http://www.fda.gov/cber/cberftp.html).

The text of the draft guideline follows:

Statistical Principles for Clinical Trials

Note: A Glossary of terms and definitions is provided as an annex to this guideline.

Table of Contents

I. Introduction
1.1 Background and Purpose
1.2 Scope and Direction

II. Considerations for Overall Clinical Development
2.1 Study Context
2.1.1 Development Plan
2.1.2 Confirmatory Trial
2.1.3 Exploratory Trial
2.2 Study Scope
2.2.1 Population
2.2.2 Primary and Secondary Variables
2.3 Design Techniques to Avoid Bias
2.3.1 Blinding
2.3.2 Randomization

III. Study Design Considerations
3.1 Study Configuration
3.1.1 Parallel Group Design
3.1.2 Cross-Over Design
3.1.3 Factorial Designs
3.2 Multicenter Trials
3.3 Type of Comparison
3.3.1 Trials to Show Superiority
3.3.2 Trials to Show Equivalence or Non-inferiority
3.3.3 Dose-Response Designs
3.4 Group Sequential Designs
3.5 Sample Size
3.6 Data Capture and Processing

IV. Study Conduct
4.1 Trial Monitoring
4.2 Changes in Inclusion and Exclusion Criteria
4.3 Accrual Rates
4.4 Sample Size Adjustment
4.5 Interim Analysis and Early Stopping
4.6 Role of Independent Data Monitoring Committee (IDMC)

V. Data Analysis
5.1 Prespecified Analysis Plan
5.2 Analysis Sets
5.2.1 All Randomized Subjects
5.2.2 Per Protocol Subjects
5.2.3 Roles of the All Randomized Subjects Analysis and the Per Protocol Analysis
5.3 Missing Values and Outliers
5.4 Data Transformation/Modification
5.5 Estimation, Confidence Intervals and Hypothesis Testing
5.6 Adjustment of Type I Error and Confidence Levels
5.7 Subgroups, Interactions and Covariates
5.8 Integrity of Data and Computer Software

VI. Evaluation of Safety and Tolerability
6.1 Scope of Evaluation
6.2 Choice of Variables and Data Collection
6.3 Set of Subjects to be Evaluated and Presentation of Data
6.4 Statistical Evaluation
6.5 Single Study versus Integrated Summary

VII. Reporting
7.1 Evaluation and Reporting
7.2 Summarizing the Clinical Database
7.2.1 Efficacy Data
7.2.2 Safety Data

Annex 1 Glossary

I. Introduction

1.1 Background and Purpose

The efficacy and safety of medicinal products should be demonstrated by clinical trials that follow the guidance in "Good Clinical Practice: Consolidated Guideline (E6)'' adopted by the ICH, May 1, 1996. The role of statistics in clinical trial design and analysis is acknowledged as essential in that ICH guideline. The proliferation of statistical research in the area of clinical trials coupled with the critical role of clinical research in the drug approval process and health care in general necessitate a succinct document on statistical issues related to clinical trials. This guideline is written primarily to attempt to harmonize the principles of statistical methodology applied to clinical trials for marketing applications submitted in Europe, Japan, and the United States.

As a starting point, this guideline utilized the CPMP (Committee for Proprietary Medicinal Products) Note for Guidance entitled "Biostatistical Methodology in Clinical Trials in Applications for Marketing Authorizations for Medicinal Products'' (December 1994). It was also influenced by "Guidelines on the Statistical Analysis of Clinical Studies'' (March 1992) from the Japanese Ministry of Health and Welfare and the U.S. FDA document entitled "Guideline for the Format and Content of the Clinical and Statistical Sections of New Drug Applications'' (July 1988). Some topics related to statistical principles and methodology are also embedded within other ICH guidelines, particularly those listed below. The specific guideline that contains related text will be identified in various sections of this document.

E1: The Extent of Population Exposure to Assess Clinical Safety
E2A: Clinical Safety Data Management: Definitions and Standards for Expedited Reporting
E2B: Clinical Safety Data Management: Data Elements for Transmission of Individual Case Safety Reports
E2C: Clinical Safety Data Management: Periodic Safety Update Reports for Marketed Drugs
E3: Structure and Content of Clinical Study Reports
E4: Dose-Response Information to Support Drug Registration
E5: Ethnic Factors in the Acceptability of Foreign Clinical Data
E6: Good Clinical Practice: Consolidated Guideline
E7: Studies in Support of Special Populations: Geriatrics
E8: General Considerations for Clinical Trials
E10: Choice of Control Group in Clinical Trials
M1: Standardization of Medical Terminology for Regulatory Purposes
M3: Nonclinical Safety Studies for the Conduct of Human Clinical Trials for Pharmaceuticals

This guideline is intended to give direction to sponsors in the design, conduct, analysis, and evaluation of clinical trials of an investigational product in the context of its overall clinical development. The document will also assist scientific experts charged with preparing application summaries or assessing evidence of efficacy and safety, principally from late Phase II and Phase III clinical trials.

1.2 Scope and Direction

The focus of this guideline is on statistical principles. It does not address the use of specific statistical procedures or methods. Specific procedural steps to ensure that principles are implemented properly are the responsibility of the sponsor. Integration of data across clinical trials is discussed, but is not a primary focus of this guideline. Selected principles and procedures related to data management or clinical trial monitoring activities are covered in other ICH guidelines and are not addressed here.

This guideline should be of interest to individuals from a broad range of scientific disciplines. However, it is assumed that the actual responsibility for all statistical work associated with clinical trials will lie with an appropriately qualified and experienced statistician, as indicated in the "ICH Guideline for Good Clinical Practice.'' The involvement of the statistician, in collaboration with other clinical trial professionals, is to ensure that statistical principles are applied appropriately in clinical trials supporting drug development. Thus, the statistician should have a combination of education/training and experience sufficient to implement the principles articulated in this guideline.

CDDS endorses the involvement of "an appropriately qualified and experienced statistician", especially one practicing advanced, state-of-the-art statistical techniques, in an advisory or participatory role in designing, analyzing, and interpreting clinical trials. However, it is inappropriate for FDA to insist that the "responsibility for all statistical work associated with clinical trials" be restricted to "an appropriately qualified and experienced statistician", exclusive of other scientists who are knowledgeable and experienced with statistical methods. Rather, it is the quality, validity, and currency of the "statistical work" that should be subject to regulatory guidance and standards.

The "ICH Guideline for Good Clinical Practice", section 5.4.1, recommends: "The sponsor should utilize qualified individuals (e.g. biostatisticians, clinical pharmacologists, and physicians) as appropriate, throughout all stages of the trial process, from designing the protocol and CRFs and planning the analyses to analyzing and preparing interim and final clinical trial/study reports." CDDS believes that this provides sufficient guidance for the professional qualifications of personnel involved in the statistical aspects of clinical trials and underscores the value of collaboration between statisticians and scientists who are experts in the subject matter that may be the object of statistical analyses.

However, if the Agency insists on regulating the professional qualifications of personnel involved in the statistical work of clinical trials, CDDS believes that such qualifications should include knowledge and experience in modern methods of clinical trial data analysis that extend beyond the scope of the traditional statistical principles described in this draft guideline (see subsequent comments). In addition, the statistician should embrace input from drug development scientists, such as clinical pharmacologists and therapeutic area experts, so as to minimize pharmacologically naive statistical advice. By overemphasizing the role of the traditional statistician in assuring the statistical validity of a clinical trial, the draft guidance fails to recognize the importance of pharmacological and therapeutic area knowledge in statistical aspects of trial design, analysis, and interpretation (note 1).

All important details of the design, conduct, and proposed analysis of each clinical trial contributing to a marketing application should be clearly specified in a protocol written before the trial begins. The extent to which the procedures in the protocol are followed and the primary analysis is planned a priori will contribute to the degree of confidence in the final results and conclusions of the trial. The protocol and subsequent amendments should be approved by the responsible personnel, including the trial statistician. The trial statistician should ensure that the protocol and any amendments cover all relevant statistical issues clearly and accurately, using technical terminology as appropriate.

While CDDS agrees with the ideal recommendations of (a) "all important details of the ... and proposed analysis of each clinical trial ... should be clearly specified in a protocol written before the trial begins" and (b) a priori specification of the primary analysis, FDA should encourage the most informative analyses to be applied to data from a completed trial, regardless of whether they were specified in advance. What if an inappropriate or suboptimal analysis is prespecified, or advances in science support an improved analysis? While FDA should encourage a "best effort" at compliance with (a) and (b) above, flexibility and receptivity to optimal analysis would best protect the public health. Recognizing the vulnerability to unintentional biases that can derive from overzealous post hoc "mining" of data, all such analyses should, as far as possible, be explained and justified as recommended below in comments concerning enhanced pre-trial descriptions of planned analyses, or at the "blind analysis" stage defined in Annex 1 (Glossary).

The principles outlined in this guideline are primarily relevant to clinical trials conducted in the later phases of development, many of which are confirmatory trials of efficacy. In addition to efficacy, confirmatory trials may have as their primary variable a safety variable (e.g., an adverse event, a clinical laboratory variable, or an electrocardiographic measure) or a pharmacodynamic or pharmacokinetic variable (as in a confirmatory bioequivalence trial). Furthermore, some confirmatory findings may be derived from data integrated across studies, and selected principles in this guideline are applicable in this situation. Finally, although the early phases of drug development consist mainly of clinical trials that are exploratory in nature, statistical principles are also relevant to these clinical trials. Hence, the substance of this document should be applied as far as possible to all phases of clinical development.

The objectives of exploratory (learning) clinical trials undertaken in phase I (clinical pharmacology studies to investigate safety, ADME (Absorption, Disposition, Metabolism, and Elimination), disease state, and drug-drug interactions) and phase IIb (clinical pharmacology in patients to investigate dose-response on clinical or surrogate pharmacologic effects) are best achieved without all of the restrictions applied to confirmatory trials described in this draft guidance (e.g., intention-to-treat analysis, restriction to prespecified analyses, etc.). Nevertheless, we highly recommend that exploratory clinical trials be designed as scientifically rigorous investigations, employing whenever possible randomization, blinding, inclusion of controls, assessments of actual drug exposure, etc., and that the resulting data be thoroughly analyzed using both prospective and post hoc "explanatory" data analyses. (We prefer the adjective "explanatory" to "exploratory" to highlight the scientific, as opposed to purely pragmatic, goal of the analysis.)

Many of the principles delineated in this guideline deal with minimizing bias and maximizing precision. As used in this guideline, the term "bias" describes the systematic tendency of any factors associated with the design, conduct, analysis, and interpretation of the results of clinical trials to make the estimate of a treatment effect deviate from its true value. It is important to identify potential sources of bias to the extent possible so that attempts to limit such bias may be made. The presence of bias may seriously compromise the ability to draw valid conclusions from clinical studies.

Some sources of bias arise from the design of the trial, for example an assignment of treatments such that subjects at lower risk are systematically assigned to one treatment. Other sources of bias arise during the conduct and analysis of a clinical trial. For example, protocol violations and exclusion of subjects from analysis based upon knowledge of subject outcomes are possible sources of bias that may affect the accurate assessment of treatment effect. Because bias can occur in subtle or unknown ways and its effect is not measurable directly, it is important to evaluate the robustness of the results and primary conclusions of the trial. Robustness is a concept that refers to the sensitivity of the overall conclusions to various limitations of the data, assumptions, and analytic approaches to data analysis. Robustness implies that, if a variety of analyses of the data that take into account changing assumptions were to be performed, the treatment effect and primary conclusions of the trial would be consistent. The interpretation of statistical measures of uncertainty of the treatment effect and treatment comparisons should involve consideration of the potential contribution of bias to the p-value, confidence interval, or inference.
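The design-related bias described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a method taken from the guideline: the outcome model, the sample size, the seed, and the function names (estimated_effect, randomized, risk_based) are all hypothetical. The true treatment effect is set to zero, so any nonzero estimate reflects chance variation or bias introduced by the assignment rule.

```python
import random
import statistics


def estimated_effect(assign, n_subjects=2000, seed=1):
    """Estimate the treatment effect (treated mean minus control mean)
    in a simulated trial whose true treatment effect is exactly zero."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_subjects):
        risk = rng.gauss(0.0, 1.0)            # baseline risk of the subject
        outcome = risk + rng.gauss(0.0, 1.0)  # higher = worse; no treatment term
        if assign(risk, rng):
            treated.append(outcome)
        else:
            control.append(outcome)
    return statistics.mean(treated) - statistics.mean(control)


def randomized(risk, rng):
    """Assign by a fair coin, ignoring baseline risk."""
    return rng.random() < 0.5


def risk_based(risk, rng):
    """Systematically give the treatment to lower-risk subjects."""
    return risk < 0.0


effect_randomized = estimated_effect(randomized)
effect_risk_based = estimated_effect(risk_based)
print(f"randomized assignment: {effect_randomized:+.2f}")
print(f"risk-based assignment: {effect_risk_based:+.2f}")
```

Under these assumptions the randomized estimate should lie close to zero, while the risk-based assignment produces a large apparent benefit even though the treatment does nothing. The spurious difference does not shrink as the sample size grows, which is what distinguishes bias from imprecision.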

This guideline largely refers to the use of frequentist methods when discussing hypothesis testing and/or confidence intervals. However, the use of Bayesian or other approaches may be considered when the reasons for their use are clear and when the resulting conclusions are sufficiently robust compared to alternative assumptions.

In view of advances in understanding of the sciences of drug action in man (clinical pharmacology) and drug development science that have occurred in recent decades, and the extent of knowledge derived in phases 1 and 2, wholesale insistence on standard frequentist procedures applied to empirical phase 3 confirmatory testing should be relaxed. These advances form the foundation for a mechanistic understanding of the clinical effects of most new drugs that can now be efficiently incorporated into clinical trials that comprise the basis for providing substantial evidence of effectiveness. A relatively full accounting of the underlying scientific knowledge cannot be made, in many cases, without resorting to a full-probability-model-based analysis; that is, a Bayesian, or at a minimum, likelihood-based analysis. Application of Bayesian methods should therefore be encouraged; such methods, to be acceptable, should achieve valid frequentist performance standards under reasonable (pharmacologically sound) assumptions.
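One way to examine whether a Bayesian decision rule meets a frequentist performance standard is to compute its type I error directly. The sketch below assumes a deliberately simple conjugate normal model with known variance; the model, the priors, the 0.975 posterior threshold, and the function name type_one_error are illustrative assumptions and do not come from the draft guideline or from this commentary.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_one_error(prior_mean, prior_sd, n, sigma=1.0, threshold=0.975):
    """Frequentist type I error of the rule 'claim benefit when the
    posterior probability that theta > 0 exceeds `threshold`'.

    Assumed model: sample mean ~ Normal(theta, sigma^2 / n),
    prior theta ~ Normal(prior_mean, prior_sd^2), null value theta = 0.
    """
    z = STD_NORMAL.inv_cdf(threshold)
    precision = 1.0 / prior_sd**2 + n / sigma**2  # posterior precision
    # Smallest sample mean for which the posterior probability passes the threshold.
    cutoff = (z * sqrt(precision) - prior_mean / prior_sd**2) * sigma**2 / n
    # Probability of exceeding that cutoff when theta is truly 0.
    return 1.0 - STD_NORMAL.cdf(cutoff / (sigma / sqrt(n)))


vague = type_one_error(prior_mean=0.0, prior_sd=1e6, n=50)
skeptical = type_one_error(prior_mean=0.0, prior_sd=1.0, n=50)
optimistic = type_one_error(prior_mean=0.5, prior_sd=0.5, n=50)
print(f"vague prior:      {vague:.4f}")
print(f"skeptical prior:  {skeptical:.4f}")
print(f"optimistic prior: {optimistic:.4f}")
```

In this toy setting a very vague prior reproduces the one-sided 2.5% error rate of the usual frequentist test, a prior centered on no effect is slightly conservative, and an optimistic prior inflates the error rate above 2.5%. Whether such a prior is "pharmacologically sound" is a scientific judgment that the calculation itself cannot settle.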

II. Considerations for Overall Clinical Development

2.1 Study Context

2.1.1 Development Plan

The broad aim of the process of clinical development of a new drug is to find out whether there is a dose range and schedule at which the drug can be shown to be simultaneously safe and effective, to the extent that the risk-benefit relationship is acceptable. The particular subjects who may benefit from the drug and the specific indications for its use also need to be defined.

Satisfying these broad aims usually requires an ordered program of clinical trials, each with its own specific objectives. This should be specified in a clinical plan, or a series of plans, with appropriate decision points and flexibility to allow modification as knowledge accumulates. A marketing application should clearly describe the main content of such plans, and the contribution made by each trial. Interpretation and assessment of the evidence from the total program of trials involves synthesis of the evidence from the individual trials (see section 7.2). This is facilitated by ensuring that common standards are adopted for a number of features of the trials, such as dictionaries of medical terms, definition and timing of the main measurements, handling of protocol deviations, and so on. A statistical overview or meta-analysis may be informative when medical questions are addressed in more than one trial. Where possible, this should be envisaged in the plan so that the relevant trials are clearly identified and any necessary common features of their designs are specified in advance. Other major statistical issues (if any) that are expected to affect a number of trials in a common plan should be addressed in that plan.
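As a concrete illustration of a statistical overview across trials, the sketch below pools treatment-effect estimates with a fixed-effect, inverse-variance weighted average. This is only one of several possible meta-analytic approaches and is not prescribed by the guideline; the three estimates, their standard errors, and the function name fixed_effect_pool are hypothetical.

```python
import math


def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]  # more precise trials weigh more
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical treatment-effect estimates from three trials sharing a common endpoint.
estimates = [0.30, 0.10, 0.20]
std_errors = [0.10, 0.20, 0.10]

pooled, pooled_se = fixed_effect_pool(estimates, std_errors)
ci_low = pooled - 1.96 * pooled_se
ci_high = pooled + 1.96 * pooled_se
print(f"pooled effect {pooled:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

The pooling is meaningful only if the trials measure the same variable at comparable times and handle protocol deviations in comparable ways, which is why the common standards mentioned above need to be specified in advance.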

2.1.2 Confirmatory Trial

A confirmatory trial is a controlled trial in which a hypothesis is stated in advance and evaluated. As a rule, confirmatory trials are necessary to provide firm evidence of efficacy or safety. In such trials, the key hypothesis of interest follows directly from the trial's primary objective, is always predefined, and is the hypothesis that is subsequently tested when the trial is complete. In a confirmatory trial, it is equally important to estimate with due precision the size of the effects attributable to the treatment of interest and to relate these effects to their clinical significance.

The above definition of "confirmatory trial" is very broad, extending beyond the scope of the usual meaning applied in the context of a phase 3 confirmatory trial for formal documentation of a drug's effectiveness (see Sheiner, 1997). Defining a confirmatory trial, as this document does, as one for which a hypothesis is stated in advance, as opposed to one for which there is prior evidence in support of that hypothesis, confounds two very different situations. If "confirmatory" is restricted to situations where there is a pre-existing hypothesis with prior supporting evidence, then the conditions for use of Bayesian approaches are in place. If a confirmatory trial is any trial for which a hypothesis has been stated in advance (with no constraints on the basis for the hypothesis), it is reasonable to employ frequentist approaches. In addition to defining "confirmatory trial" more precisely, the guideline should define what constitutes reasonable evidence for a prior hypothesis that would support Bayesian approaches to trial design and analysis. That is, when and in what circumstances must we set aside our prior evidence and resort to a purely empirical test, and when may we let our prior evidence enter into our "confirmatory" trial design and analysis? While a spectrum of opinion is of course possible on this point, we are concerned to see that there is no discussion of it, especially in light of the FDA's draft "Guidance for Industry--Providing Clinical Evidence of Effectiveness for Human Drug and Biological Products", which is particularly concerned with the balancing of various sources of evidence in forming a final judgment of effectiveness.

Confirmatory trials are intended to provide firm evidence in support of claims. Therefore, adherence to their planned design and procedures is particularly important; unavoidable changes should be explained and documented, and their effect examined. A justification of the design of each such trial and of all other statistical aspects, such as the planned analysis, should be set out in the protocol. Each trial should address only a limited number of questions.

To say that each trial should address only a limited number of questions is an unwarranted restriction. Rather, FDA should emphasize that the questions to be addressed by a particular trial should not impact or be confounded by each other. In fact, in some cases, CDDS suggests that more questions (of a congruent nature) should be asked in a single clinical trial. For example, enrolling a diverse population in a large clinical trial (optimally with stratification) rather than a narrowly defined population, provides an opportunity to simultaneously ask several relevant questions, regarding gender, age, ethnic and other demographic effects on treatment outcomes.

Firm evidence in support of claims requires that the results of the confirmatory trials demonstrate that the investigational product under test has clinical benefits. The confirmatory trials should therefore be sufficient to answer each key clinical question relevant to the efficacy or safety claim clearly and definitively. In addition, it is important that the basis for generalization to the intended patient population is understood and explained; this may also influence the number and type of centers and/or trials needed. The results of the confirmatory trial(s) should be robust. In some circumstances, the weight of evidence from a single confirmatory trial may be sufficient.

To reiterate our previous point, is (are) the confirmatory trial(s) to be the sole basis, disregarding other supportive evidence, for concluding that the investigational product under test has clinical benefits? Regarding this point, and the more specific one of reliance on a single confirmatory trial to confirm effectiveness, reference is made to the extensive comments made by CDDS on the FDA's draft "Guidance for Industry--Providing Clinical Evidence of Effectiveness for Human Drug and Biological Products" and submitted to the Agency on May 30, 1997.

2.1.3 Exploratory Trial

As introduced above in our third comment in section 1.2, the word "explanatory" is preferable to "exploratory" in conveying the intent of this type of trial and applicable data analytic techniques.

The rationale and design of confirmatory trials nearly always rests on earlier clinical work carried out in a series of exploratory studies. Like all clinical trials, these exploratory studies should have clear and precise objectives. However, in contrast to confirmatory trials, their objectives may not always lead to simple tests of predefined hypotheses. In addition, exploratory trials may sometimes require a more flexible approach to design so that changes can be made in response to accumulating results. Their analysis may entail data exploration; tests of hypothesis may be carried out, but the choice of hypothesis may be data dependent. Such trials cannot be the basis of the formal proof of efficacy, although they may contribute to the total body of relevant evidence.

It is inadvisable to deny, as a matter of principle, the admissibility as "formal proof of efficacy" (a term left undefined) of data and conclusions from "explanatory" (exploratory) trials, even when hypotheses and analyses may be data dependent. As articulated in the CDDS commentary on FDA's draft evidence of effectiveness guidance, evidence of effectiveness should be derived from all clinical trials in a development program, interpreted as a whole. So-called exploratory (explanatory) trials may provide scientifically sound information that is consistent supporting evidence of effectiveness.

Any individual trial may have both confirmatory and exploratory aspects. For example, in most confirmatory trials the data are also subjected to exploratory analyses which serve as a basis for explaining or supporting their findings and for suggesting further hypotheses for later research. The protocol should make a clear distinction between the aspects of a trial that will be used for confirmatory proof and the aspects that will provide data for exploratory analysis.

See our second comment in Section 1.2 above. Lack of a priori designation of explanatory and confirmatory elements of a clinical trial should not interfere with acceptance of valid analyses and interpretations that are consistent with the study design and scientific principles. It should not be necessary or required for a protocol to specify in advance all aspects of the trial that will provide data for explanatory analysis. Data from any and all aspects of the trial should be considered available for explanatory analyses. It is not the conduct of the explanatory analysis that should be subject to FDA consideration but rather the regulatory uses of possible interpretations.

2.2 Study Scope

2.2.1 Population

In the earlier phases of drug development, the choice of subjects for a clinical trial may be heavily influenced by the wish to maximize the chance of observing specific clinical effects of interest. Hence, they may come from a very narrow subgroup of the total patient population for which the drug may eventually be indicated. However, by the time the confirmatory trials are undertaken, the subjects in the trials should more closely mirror the intended users. In these trials, it is generally helpful to relax the inclusion and exclusion criteria as much as possible within the target indication, while maintaining sufficient homogeneity to permit a successful trial to be carried out. No individual clinical trial can be expected to be totally representative of future users because of the possible influences of geographical location, the time when it is conducted, the medical practices of the particular investigator(s) and clinics, and so on. However, the influence of such factors should be reduced wherever possible and subsequently discussed during the interpretation of the trial results.

The objective of encouraging heterogeneity in confirmatory trials may be in tension with the earlier precaution regarding the number of questions to be asked in such a trial. If diverse populations are to be included in the trial, not only is representativeness of the intended user population enhanced, but useful information may emerge from an appropriate analysis of the impact of those factors contributing to the heterogeneity on the results of the trial (e.g. antihypertensive effectiveness in blacks vs. whites).

2.2.2 Primary and Secondary Variables

The primary variable ("target'' variable, primary endpoint) should be the variable capable of providing the most clinically relevant and convincing evidence directly related to the primary objective of the trial. There should generally be only one primary variable. This will usually be an efficacy variable, because the primary objective of most confirmatory trials is to provide strong scientific evidence regarding efficacy. Safety/tolerability may sometimes be the primary variable, and will always be an important consideration.

Newer, Bayesian-motivated methods of analysis can deal directly with multiple outcomes without destroying frequentist validity (see also recent papers by Adrian Smith, Amy Racine-Poon, and Els Goetghebeur in JASA and JRSSB).

Measurements relating to quality of life and health economics are further potential primary variables. The selection of the primary variable should reflect the accepted norms and standards in the relevant field of research. The use of a reliable and validated variable with which experience has been gained either in earlier studies or in published literature is recommended. There should be sufficient evidence that the primary variable can provide a valid and reliable measure of some clinically relevant and important treatment benefit in the subject population described by the inclusion and exclusion criteria. The primary variable should generally be the one used when estimating the sample size (see section 3.5).

In many cases, and especially when treatment is directed at a chronic rather than an acute process, the approach to assessing subject outcome may not be straightforward and should be carefully defined. For example, it is inadequate to specify mortality as a primary variable without further clarification; mortality may be assessed by comparing proportions alive at fixed points in time, or by comparing overall distributions of survival times over a specified interval. Another common example is a recurring outcome. The measure of treatment effect may again be a simple dichotomous variable (any occurrence during a specified interval), time to first occurrence, or rate of occurrence (events per time units of observation), to give a few possibilities. The assessment of functional status over time in studying treatment for chronic disease presents other challenges in selection of the primary variable. There are many possible approaches, such as comparisons of the assessments done at the beginning and end of the interval of observation, comparison of slopes calculated from all assessments throughout the interval, or comparisons of the proportions of subjects exceeding or declining beyond a prespecified threshold. To avoid multiplicity concerns, it is critical to specify in the protocol the precise definition of the primary variable as it will be used in the statistical analysis. In addition, the clinical relevance of the specific primary variable selected and the validity of the associated measurement procedures will generally need to be addressed and justified in the protocol.
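The alternative measures for a recurring outcome listed above can be made concrete with a minimal sketch. The data layout (one tuple of event days and follow-up days per subject) and the function names are illustrative assumptions, not part of the guideline, and the sketch assumes every subject is observed for at least the chosen window.

```python
from typing import List, Tuple

# Hypothetical layout: (days on which events occurred, total days observed).
Subject = Tuple[List[float], float]

def any_occurrence(subjects: List[Subject], window: float) -> float:
    """Proportion of subjects with at least one event within `window` days.

    Assumes every subject was followed for at least `window` days.
    """
    hits = sum(1 for events, _ in subjects if any(d <= window for d in events))
    return hits / len(subjects)

def time_to_first(subjects: List[Subject]) -> List[Tuple[float, bool]]:
    """Per subject: (time, event_observed); censored at end of follow-up."""
    return [
        (min(events), True) if events else (follow_up, False)
        for events, follow_up in subjects
    ]

def event_rate(subjects: List[Subject]) -> float:
    """Events per unit of observation time, pooled over all subjects."""
    total_events = sum(len(events) for events, _ in subjects)
    total_time = sum(follow_up for _, follow_up in subjects)
    return total_events / total_time
```

The three functions give different summaries of the same records, which is why the protocol needs to state which one defines the primary variable.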

The primary variable should be specified in the protocol, along with the rationale for its selection. Redefinition of the primary variable after unblinding will almost always be unacceptable, since the biases this introduces are difficult to assess. When relevant, the validity and reliability of the primary variable should be described. Secondary variables are either supportive measurements related to the primary objective or measurements of effects related to the secondary objectives. Their predefinition in the protocol is also important, as well as an explanation of their relative importance and roles in interpretation of trial results. When the clinical effect defined by the primary objective is to be measured in more than one way, the protocol should identify one of the measurements as the primary variable on the basis of clinical relevance, importance, objectivity, and/or other relevant characteristics, whenever such selection is feasible. Another strategy that may be useful in some situations is to integrate or combine the multiple measurements into a single or "composite'' variable, using a predefined algorithm. Indeed, the primary variable sometimes arises as a combination of multiple clinical measurements (e.g., the rating scales used in arthritis, psychiatric disorders, and elsewhere). This approach addresses the multiplicity problem without requiring adjustment for multiple comparisons. The method of combining the multiple measurements should be specified in the protocol, and an interpretation of the resulting scale should be provided in terms of the size of a clinically relevant benefit. When composite variables are used as primary variables, the individual components of these variables are often analyzed separately. When a rating scale is used as a primary variable, it is especially important to address factors such as content validity, inter- and intrarater reliability, and sensitivity for discriminating different medical conditions.

In some cases, "global assessment'' variables are developed to measure the overall safety, overall efficacy, and/or overall usefulness of a treatment. This type of variable integrates objective variables and the investigator's overall impression about the state or change in the state of the subject, and is usually a scale of ordered categorical ratings. Global assessments of overall effectiveness are well established in many therapeutic areas, especially psychotropic drugs and nonsteroidal anti-inflammatory drugs.

Global assessment variables generally have a subjective component. When a global assessment scale is used as a primary or secondary variable, fuller details should be included in the protocol with respect to:

(1) The relevance of the global scale to the primary objective of the trial;

(2) The basis for the validity of the scale;

(3) How to utilize the data collected on an individual subject to assign him/her to a unique category of the global assessment scale;

(4) How to uniquely categorize subjects with missing data. If objective variables are considered by the investigator when making a global assessment, then those objective variables should be considered additional primary or, at least, important secondary variables.

Overall usefulness integrates components of both benefit and risk and reflects the decision making process of the treating physician, who must weigh benefit and risk in making product use decisions. A problem with global usefulness scales is that their use could in some cases lead to the result of two products being declared equivalent despite having very different profiles of beneficial and adverse effects. For example, judging the global usefulness of a treatment as equivalent or superior to an alternative may mask the fact that it has little or no efficacy but fewer adverse effects. Therefore, if usefulness is used as a primary variable, it is important to consider specific efficacy and safety outcomes separately as additional primary variables.

It may sometimes be desirable to use more than one primary variable, each of which (or a subset of which) could be a sufficient basis for marketing approval, to cover the range of effects of the therapies. The planned manner of interpretation of this type of evidence should be carefully spelled out. For example, it should be clear whether an impact on any of the variables, some minimum number of them, or all of them, would be considered necessary for approval. The primary hypothesis or hypotheses should be clearly stated with respect to the primary variables identified and the approach to testing the hypotheses described. This should include specification of the statistical parameters being tested (e.g., mean, percentage, distribution). The effect on the Type I error should be explained because of the potential for multiple comparison problems (see section 5.6); the method of controlling Type I error should be given in the protocol. The extent of intercorrelation among the proposed primary variables may be considered in evaluating the impact on Type I error. If the success of the trial depends upon demonstrating effects on all of the designated primary variables, then there is no need for adjustment of the Type I error, but the impact on Type II error and sample size needs should be carefully considered.
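As an illustration of the decision rules described above, the following sketch contrasts two familiar adjustments for the case where an effect on any one primary variable suffices (Bonferroni and Holm) with the unadjusted rule for the case where all primary variables must show an effect. The guideline does not prescribe these particular procedures; they are assumed here only as well-known examples, and the function names are hypothetical.

```python
from typing import List

def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure: test ordered p-values against alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

def all_must_win(p_values: List[float], alpha: float = 0.05) -> bool:
    """Success only if every primary variable is significant at the full alpha.

    No Type I adjustment is applied, but power falls as variables are added.
    """
    return all(p <= alpha for p in p_values)
```

Both adjusted rules control the overall Type I error when any single success counts. Under the all-must-win rule, each test keeps the full alpha, and the cost appears as increased Type II error.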

When direct assessment of the clinical benefit to the subject through observing actual clinical efficacy is not practical, indirect criteria (surrogate variables) may be considered. Commonly accepted surrogate variables are used in a number of indications where they are believed to be reliable predictors of clinical benefit. There are two principal concerns with the introduction of any proposed surrogate variable. First, it may not be a true predictor of the clinical outcome of interest. For example, it may measure treatment activity along one particular pathway, but may not provide full information on the range of actions and ultimate effects of the treatment, whether positive or negative. There have been many instances where treatments showing a highly positive effect on a proposed surrogate have ultimately been shown to be detrimental to the subjects' clinical status; conversely, there are cases of treatments conferring clinical benefit without measurable impact on proposed surrogates. Additionally, proposed surrogate variables may not yield a quantitative measure of clinical benefit that can be weighed directly against adverse effects. Statistical criteria for validating surrogate variables have been proposed, but the experience with their use is relatively limited. In practice, the strength of the evidence for surrogacy depends upon the biological plausibility of the relationship, the demonstration in epidemiological studies of the prognostic value of the surrogate for the clinical outcome, and evidence from clinical trials that treatment effects on the surrogate correspond to effects on the clinical outcome. Relationships between clinical and surrogate variables for one product do not necessarily apply to a product with a different mode of action for treating the same disease.

The Agency is encouraged to publish a full rationale, principles and policy on use and validation of surrogate endpoints that may substitute for clinical endpoints, such as the unpublished, draft document "Principles to Guide the Use of Surrogate Endpoints for Assessing Drug Efficacy", developed in 1991 by the FDA Task Force on Use of Surrogate Endpoints as a Basis for Drug Approval.

Dichotomization or other categorization of continuous or ordinal variables may sometimes be desirable. Criteria of "success'' and "response'' are common examples of dichotomies that should be specified precisely in terms of, for example, a minimum percentage improvement (relative to baseline) in a continuous variable or a ranking categorized as at or above some threshold level (e.g., "good'') on an ordinal rating scale. The reduction of diastolic blood pressure below 90 mmHg is a common dichotomization. Categorizations are most useful when they have clear clinical relevance. The criteria for categorization should be predefined and specified in the protocol, as knowledge of trial results could easily bias the choice of such criteria. Because categorization normally implies a loss of information, a consequence will be a loss of power in the analysis; this should be accounted for in the sample size calculation.
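The loss of power from dichotomization can be gauged with the usual normal-approximation sample size formulas. The sketch below is an assumption-laden illustration: it supposes a normally distributed outcome with equal standard deviation in both groups, a two-sided test, and a "response" defined as a value below a threshold; the function names and the numbers in the usage note are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def _z_sum(alpha: float, power: float) -> float:
    """Sum of the two-sided critical value and the power quantile."""
    return _STD_NORMAL.inv_cdf(1 - alpha / 2) + _STD_NORMAL.inv_cdf(power)

def n_continuous(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects per group to compare two means differing by `delta`."""
    return ceil(2 * (_z_sum(alpha, power) * sd / delta) ** 2)

def n_dichotomized(mean_control: float, mean_treated: float, sd: float,
                   threshold: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects per group when the same outcome is reduced to
    'response = value below threshold' and response proportions are compared."""
    p_control = NormalDist(mean_control, sd).cdf(threshold)
    p_treated = NormalDist(mean_treated, sd).cdf(threshold)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil(_z_sum(alpha, power) ** 2 * variance / (p_treated - p_control) ** 2)
```

For example, with assumed group means of 95 and 90 mmHg, a standard deviation of 10 mmHg, and a response threshold of 90 mmHg, `n_dichotomized(95, 90, 10, 90)` exceeds `n_continuous(5, 10)`, which is the extra sample size the dichotomized analysis would need for the same power.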

2.3 Design Techniques to Avoid Bias

The two most important design techniques for avoiding bias in clinical trials are blinding and randomization, and these should be a normal feature of most controlled clinical trials intended to be included in a marketing application. Most such trials follow a double-blind approach in which treatments are prepacked in accordance with a suitable randomization schedule and supplied to the trial center(s) labeled only with the subject number and the treatment period, so that no one involved in the conduct of the trial is aware of the specific treatment allocated to any particular subject, not even as a code letter. This approach will be assumed in section 2.3.1 and most of section 2.3.2, exceptions being considered at the end. The protocol should also specify procedures aimed at minimizing any anticipated irregularities in study conduct that might impair a satisfactory analysis, including various types of protocol violations, withdrawals, and missing values. The protocol should consider ways both to reduce frequency of such problems and to handle the problems that do occur in the analysis of data.

2.3.1 Blinding

Blinding is intended to limit the occurrence of conscious and unconscious bias in the conduct and interpretation of a clinical trial arising from the influence that knowledge of treatment may have on the recruitment and allocation of subjects, their subsequent care, the attitudes of subjects to the treatments, the assessment of end points, the handling of withdrawals, the exclusion of data from analysis, and so on. The essential aim is to prevent identification of the treatments until all such opportunities for bias have passed.

A double-blind trial is one in which neither the subject nor any of the investigator or sponsor staff involved in the treatment or clinical evaluation of the subjects is aware of the treatment received. This includes anyone determining subject eligibility, evaluating endpoints, or assessing compliance with the protocol. This level of blinding is maintained throughout the conduct of the trial; only when the data are cleaned to an acceptable level of quality will appropriate personnel be unblinded. If any of the sponsor staff who are not involved in the treatment or clinical evaluation of the subjects are required to be unblinded to the treatment code (e.g., bioanalytical scientists, auditors, those involved in serious adverse event reporting), the sponsor should have adequate standard operating procedures (SOP's) to guard against inappropriate dissemination of treatment codes. In a single-blind trial the investigator and/or his staff are aware of the treatment but the subject is not. In an open-label trial the identity of treatment is known to all. The double-blind trial is the optimal approach. This requires that the treatments to be applied during the trial cannot be distinguished in any way (appearance, taste, etc.) either before or during administration, and that the blind is maintained appropriately during the whole trial.

Difficulties in achieving the double-blind ideal can arise because: (1) The treatments may be of a completely different nature, for example, surgery and drug therapy; (2) two drugs may have different formulations and, although they could be made indistinguishable by the use of capsules, changing the formulation might also change the pharmacokinetic and/or pharmacodynamic properties, so that bioequivalence of the formulations may need to be established; (3) the daily pattern of administration of two treatments may differ. One way of achieving double-blind conditions under these circumstances is to use a "double dummy'' technique. This technique may sometimes force an administration scheme that is sufficiently unusual to influence adversely the motivation and compliance of the subjects. Ethical difficulties may also interfere with its use when, for example, it entails dummy operative procedures. Nevertheless, extensive efforts should be made to overcome these difficulties.

In some clinical trials, although double blinding is planned, it may be partially compromised by apparent treatment induced effects. In such cases, blinding may be improved by blinding investigators to certain test results (e.g., selected clinical laboratory measures). Similar approaches (see below) to minimizing bias in open-label trials should be considered in trials where unique or specific treatment effects may lead to unblinding individual patients.

If a double-blind trial is not feasible, then the single-blind option should be considered. In some cases only an open-label trial is practically or ethically possible. Single-blind and open-label trials provide additional flexibility, but it is particularly important that the investigator's knowledge of the next treatment should not influence the decision to enter the subject; this decision should precede knowledge of the randomized treatment. Also, under either of these circumstances, clinical assessments should be made by medical staff who are not involved in treating the subjects and who remain blind to treatment. In single-blind or open-label trials, every effort should be made to minimize the various known sources of bias and primary variables should be as objective as possible. The reasons for the degree of blinding adopted, as well as steps taken to minimize bias by other means, should be explained in the protocol.

Breaking the blind (for a single subject) should be considered only when knowledge of the treatment assignment is deemed essential by the subject's physician for the subject's care. Any intentional or unintentional breaking of the blind should be reported and explained at the end of the trial, irrespective of the reason for its occurrence. The procedure and timing for revealing the treatment assignments should be documented.

In this document, the blind review of data refers to the checking of data during the period of time between trial completion (the last observation on the last subject) and the breaking of the blind. If specific sponsor staff need to be unblinded during this period to ensure the integrity of the database or the suitability of statistical assumptions, appropriate SOP's should be developed to describe how the treatment code will be protected from broader dissemination.

2.3.2 Randomization

Randomization introduces a deliberate element of chance into the assignment of treatments to subjects in a clinical trial. During subsequent analysis of the trial data, it provides a sound statistical basis for the quantitative evaluation of the evidence relating to treatment effects. It also tends to produce treatment groups in which the distributions of prognostic factors (known and unknown) are similar. In combination with blinding, randomization helps to avoid possible bias in the selection and allocation of subjects arising from the predictability of treatment assignments.

The randomization schedule of a clinical trial documents the random allocation of treatments to subjects. In the simplest situation, it is a sequential list of treatments (or treatment sequences in a crossover trial) or corresponding codes by subject number. The logistics of some trials, such as those with a screening phase, may make matters more complicated, but the unique preplanned assignment of treatment, or treatment sequence, to subject should be clear. Different trial designs should have different procedures for generating randomization schedules. The randomization schedule should be capable of being reproduced (if the need arises). Whenever possible, this should be accomplished through the use of the same random number table, or the same computer routine and seed for its random number generator.

Although unrestricted randomization is an acceptable approach, some advantages can generally be gained by randomizing subjects in blocks. This helps to increase the comparability of the treatment groups particularly when subject characteristics may change over time, as a result, for example, of changes in recruitment policy. It also provides a better guarantee that the treatment groups will be of nearly equal size. In cross-over trials, it provides the means of obtaining balanced designs with their greater efficiency and easier interpretation. Care should be taken to choose block lengths that are sufficiently short to limit possible imbalance, but long enough to avoid predictability towards the end of the sequence in a block. Investigators should generally be blind to the block length; the use of two or more block lengths, randomly selected for each block, can achieve the same purpose. (Theoretically, in a double-blind trial predictability does not matter, but the pharmacological effects of drugs often provide the opportunity for intelligent guesswork.)
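A minimal sketch of blocked randomization with randomly selected block lengths follows. It assumes two treatments labeled A and B, equal allocation within each block, and Python's standard pseudorandom generator; the function name, parameters, and default seed are hypothetical. The fixed seed is what allows the schedule to be reproduced if the need arises.

```python
import random
from typing import List, Sequence

def blocked_schedule(n_subjects: int,
                     treatments: Sequence[str] = ("A", "B"),
                     block_multiples: Sequence[int] = (2, 3),
                     seed: int = 12345) -> List[str]:
    """Build a randomization list from permuted blocks of varying length.

    Each block holds every treatment `multiple` times, so with two treatments
    and multiples (2, 3) the block lengths are 4 and 6, chosen at random.
    """
    rng = random.Random(seed)  # same seed and routine regenerate the same list
    schedule: List[str] = []
    while len(schedule) < n_subjects:
        multiple = rng.choice(block_multiples)
        block = list(treatments) * multiple
        rng.shuffle(block)
        schedule.extend(block)
    # Truncating the last block may leave a small imbalance at the end.
    return schedule[:n_subjects]
```

Calling `blocked_schedule(24, seed=7)` twice returns the same list, while the varying block length makes the end of any block harder to anticipate.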

In multicenter trials, the randomization procedures should be organized centrally. It is advisable to have a separate random scheme for each center, i.e., to stratify by center or to allocate several whole blocks to each center. More generally, stratification by important prognostic factors measured at baseline (e.g., severity of disease, age, sex, etc.) may sometimes be valuable in order to promote balanced allocation within strata; this has greater potential benefit in small trials. The use of more than two or three stratification factors is rarely necessary, is less successful at achieving balance, and is logistically troublesome. Where it is necessary, the use of a dynamic allocation procedure (see below) may help to achieve balance across all factors simultaneously, provided the rest of the trial procedures can be adjusted to accommodate an approach of this type.

The next subject to be randomized into a study should always receive the treatment corresponding to the next free number in the appropriate randomization schedule (in the respective stratum, if randomization is stratified). The appropriate number and associated treatment for the next subject should only be allocated when entry of that subject to the randomized part of the trial has been confirmed. These tasks will normally be carried out by staff at the investigator's center, who will then dispense the relevant blinded trial supplies.
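The rule that each subject receives the next free number in the schedule for his or her stratum, and only after entry is confirmed, can be sketched as below. The class name, the stratum labels, and the representation of a schedule as a list of treatment codes are illustrative assumptions.

```python
from typing import Dict, List, Tuple

class StratifiedAllocator:
    """Hands out the next free number in each stratum's pre-generated schedule."""

    def __init__(self, schedules: Dict[str, List[str]]):
        # Hypothetical input: stratum name -> ordered list of treatment codes.
        self._schedules = {stratum: list(codes) for stratum, codes in schedules.items()}
        self._next_index = {stratum: 0 for stratum in schedules}

    def allocate(self, stratum: str, entry_confirmed: bool) -> Tuple[int, str]:
        """Return (randomization number, treatment code) for the next subject."""
        if not entry_confirmed:
            raise ValueError("subject entry must be confirmed before allocation")
        index = self._next_index[stratum]
        schedule = self._schedules[stratum]
        if index >= len(schedule):
            raise IndexError(f"schedule for stratum {stratum!r} is exhausted")
        self._next_index[stratum] = index + 1
        return index + 1, schedule[index]
```

Because numbers are consumed strictly in order within each stratum, no number is skipped or reused, and refusing allocation before confirmed entry keeps knowledge of the next treatment from influencing the entry decision.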

Details of the randomization which facilitate predictability (e.g., block length) should not be contained in the study protocol. The randomization schedule itself should be filed securely by the sponsor or an independent party in a manner that ensures that blindness is properly maintained throughout the trial. Access to the randomization schedule during the trial should take into account the possibility that, in an emergency, the blind may have to be broken for any subject, either partially or completely. The procedure to be followed, the necessary documentation, and the subsequent treatment and assessment of the subject should all be described in the protocol.

Dynamic allocation is an alternative randomization procedure in which the allocation of treatment to a subject is influenced by the current balance of allocated treatments and, in a stratified trial, by the stratum to which the subject belongs and the balance within that stratum. Every effort should be made to retain the double-blind status of the trial. For example, knowledge of the treatment code may be restricted to a central trial office from where the dynamic allocation is controlled, generally through telephone contact. This in turn permits additional checks of eligibility criteria and establishes entry into the trial, features that can be valuable in certain types of multicenter trials. The usual system of prepacking and labeling drug supplies for double-blind trials can then be followed, but the order of their use is no longer sequential. It is desirable to use appropriate computer algorithms to keep personnel at the central trial office blind to the treatment code. The complexity of the logistics and potential impact on the analysis should be carefully evaluated when considering dynamic allocation.
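One familiar form of dynamic allocation is minimization, in which the next subject is steered toward the treatment that would leave the smallest imbalance across his or her prognostic factor levels. The sketch below is a simplified illustration under stated assumptions, not the procedure defined by the guideline: imbalance is measured as the range of treatment counts summed over factors, the favored treatment is chosen with a hypothetical probability `p_best`, and all names are invented for the example.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

def minimization_assign(new_subject: Dict[str, str],
                        allocated: List[Tuple[Dict[str, str], str]],
                        treatments: Sequence[str] = ("A", "B"),
                        p_best: float = 0.8,
                        rng: Optional[random.Random] = None) -> str:
    """Pick a treatment for `new_subject` given prior (factors, treatment) pairs."""
    rng = rng or random.Random()
    imbalance = {}
    for candidate in treatments:
        total = 0
        for factor, level in new_subject.items():
            # Count earlier subjects who share this factor level, per treatment.
            counts = {t: 0 for t in treatments}
            for factors, treatment in allocated:
                if factors.get(factor) == level:
                    counts[treatment] += 1
            counts[candidate] += 1  # hypothetically add the new subject
            total += max(counts.values()) - min(counts.values())
        imbalance[candidate] = total
    best_score = min(imbalance.values())
    best = [t for t in treatments if imbalance[t] == best_score]
    if len(best) == len(treatments):
        return rng.choice(list(treatments))  # complete tie: plain randomization
    if rng.random() < p_best:
        return rng.choice(best)
    return rng.choice([t for t in treatments if t not in best])
```

Keeping `p_best` below 1 retains a random element, so the next assignment cannot be deduced with certainty from the current balance. In practice such a routine would run at the central trial office so that site staff never see the treatment code.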

III. Study Design Considerations

3.1 Study Configuration

3.1.1 Parallel Group Design

The most common clinical trial design for confirmatory trials is the parallel group design in which subjects are randomized to one of two or more arms, each arm being allocated a different treatment. These treatments will include the investigational product at one or more doses, and one or more control treatments, such as placebo and/or an active comparator. The assumptions underlying this design are less complex than for most other designs. However, there may be additional features of the design which complicate the analysis and interpretation (e.g., covariates, repeated measurements over time, interactions between design factors, protocol violations, dropouts, and withdrawals).

3.1.2 Cross-Over Design

In the cross-over design, each subject is randomized to a sequence of two or more treatments and hence acts as his own control for treatment comparisons. This simple maneuver is attractive primarily because it reduces the number of subjects and, usually, the number of assessments needed to achieve a specific power, sometimes to a marked extent. In the simplest 2x2 cross-over design, each subject receives each of two treatments in randomized order in two successive treatment periods, often separated by a washout period. The most common extension of this entails comparing n(>2) treatments in n periods, each subject receiving all n treatments. Numerous variations exist, such as designs in which each subject receives a subset of n(>2) treatments, or designs in which treatments are repeated within a subject.

Cross-over designs have a number of problems which can invalidate their results. The chief difficulty concerns carryover, that is, the residual influence of treatments in subsequent treatment periods. In an additive model, the effect of unequal carryover will be to bias direct treatment comparisons. In the 2x2 design, the relevant contrast cannot be statistically distinguished from the interaction between treatment and period, and the test for either of these lacks power because it is a "between subject'' contrast. This problem is less acute in higher order designs, but cannot be entirely dismissed.
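The bias from unequal carryover in the 2x2 design can be seen in a small sketch of the standard estimates under an additive model. It assumes sequence groups AB and BA with one (period 1, period 2) pair of responses per subject; the function name and data layout are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Pair = Tuple[float, float]  # (period 1 response, period 2 response)

def crossover_2x2(seq_ab: List[Pair], seq_ba: List[Pair]) -> Tuple[float, float]:
    """Return (treatment estimate A minus B, carryover contrast).

    The treatment estimate uses within-subject period differences; it is
    unbiased only when the carryover of A and of B are equal. The carryover
    contrast uses subject totals, so it is a between-subject comparison.
    """
    diff_ab = [y1 - y2 for y1, y2 in seq_ab]
    diff_ba = [y1 - y2 for y1, y2 in seq_ba]
    total_ab = [y1 + y2 for y1, y2 in seq_ab]
    total_ba = [y1 + y2 for y1, y2 in seq_ba]
    treatment = (mean(diff_ab) - mean(diff_ba)) / 2
    carryover = mean(total_ab) - mean(total_ba)
    return treatment, carryover
```

With noise-free responses built from a true treatment difference of 2 and no carryover, the treatment estimate is 2. Adding a carryover of 1 from A alone shifts the treatment estimate to 1.5, that is, by half the carryover difference, while the carryover contrast itself relies on the less precise between-subject totals.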

Therefore, when the cross-over design is used, it is important to avoid carryover. This is best done by selective and careful use of the design on the basis of adequate knowledge of both the disease area and the new medication. The disease under study should be chronic and stable. The relevant effects of the medication should develop fully within the treatment period. The washout periods should be sufficiently long for complete reversibility of drug effect. The fact that these conditions are likely to be met should be established in advance of the trial by means of prior information and data.

The above discussion of carry-over effects underscores the need for statisticians consulting on clinical trials to work collaboratively with clinical pharmacologists so that knowledge of a drug's pharmacokinetics, pharmacodynamics and effect kinetics can be coupled with that of the disease being treated in designing a cross-over trial. This allows for the trial design to rely on scientifically acceptable assumptions (established by prior research), and frequentist validity of analysis need only apply under these assumptions.

A common, and generally satisfactory, use of the 2x2 cross-over design is to demonstrate the bioequivalence of two formulations of the same medication. In this particular application in healthy volunteers, carryover effects on the relevant pharmacokinetic variable are rather unlikely to occur if the wash-out time between the two periods is sufficiently long. However, it is still important to check this assumption during analysis on the basis of the data obtained, for example, by demonstrating that no drug is detectable at the start of each period.

There are additional problems that need careful attention in cross-over trials. The most notable of these are the complications of analysis and interpretation arising from the loss of subjects. Also, the potential for carryover leads to difficulties in assigning adverse events that occur in later treatment periods to the appropriate treatment. These and other issues are described in the ICH E4 topic on "Dose-Response Information to Support Drug Registration.'' The cross-over design should generally be restricted to situations where losses of subjects from the trial are expected to be small.

The above concern regarding negative effects of drop-outs can be met by analyzing a cross-over trial with drop-outs using Bayesian techniques and "multiple imputation." This is an example of an analysis method that should give a frequentist no problems, that utilizes the data more fully than standard methods, and that is extensively used in other disciplines, yet has not been applied in drug regulation. It is our hope that this guideline (ICH E9) will indicate useful new approaches as well as those traditionally used.

3.1.3 Factorial Designs

In a factorial design, two or more treatments are evaluated simultaneously in the same set of subjects through the use of varying combinations of the treatments. The simplest example is the 2x2 factorial design in which subjects are randomly allocated to one of the four possible combinations of two treatments, A and B. These are: A alone; B alone; both A and B; neither A nor B. In many cases this design is used for the specific purpose of examining the interaction of A and B. The statistical test of interaction is model dependent and may lack power to detect an interaction if the sample size was calculated based on the test for main effects. This consideration is important when this design is used for examining the joint effects of A and B, in particular, if the treatments are likely to be used together.
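The main effects and the interaction in the 2x2 factorial design are simple contrasts of the four cell means, as the following sketch shows. The function name and argument names are hypothetical, and the sketch deals only with point estimates, not with the test of interaction.

```python
from typing import Dict

def factorial_2x2_contrasts(mean_neither: float, mean_a: float,
                            mean_b: float, mean_both: float) -> Dict[str, float]:
    """Contrasts of cell means for treatments A and B in a 2x2 factorial."""
    effect_a_without_b = mean_a - mean_neither
    effect_a_with_b = mean_both - mean_b
    effect_b_without_a = mean_b - mean_neither
    effect_b_with_a = mean_both - mean_a
    return {
        # Main effects average the effect of one treatment over levels of the other.
        "main_a": (effect_a_without_b + effect_a_with_b) / 2,
        "main_b": (effect_b_without_a + effect_b_with_a) / 2,
        # Interaction: how much the effect of A changes when B is also given.
        "interaction": effect_a_with_b - effect_a_without_b,
    }
```

The interaction contrast combines all four cell means with full weight while each main effect averages two differences, so its variance is larger. This is one way to see why a sample size chosen for the main effects may leave the interaction test underpowered.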

Another important use of the factorial design is to establish the dose-response characteristics of a combination product, e.g., one combining treatments C and D, especially when the efficacy of each monotherapy has been established at some dose in prior studies. A number, m, of doses of C is selected, usually including a zero dose (placebo), and a similar number, n, of doses of D. The full design then consists of mn treatment groups, each receiving a different combination of doses of C and D. The resulting estimate of the response surface may then be used to help identify an appropriate combination of doses of C and D for clinical use.
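The full design with m doses of C and n doses of D can be enumerated directly, as in this brief sketch. The dose values in the usage note and the field names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence

def dose_grid(doses_c: Sequence[float], doses_d: Sequence[float]) -> List[Dict[str, float]]:
    """List every combination of a dose of C with a dose of D (m * n groups)."""
    return [
        {"group": index + 1, "dose_c": c, "dose_d": d}
        for index, (c, d) in enumerate(product(doses_c, doses_d))
    ]
```

For instance, three doses of C (including zero) and two doses of D (including zero) give six treatment groups, one of which is the double placebo.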

Estimating the parameters of a response surface from a factorial-design drug combination clinical trial constitutes empirical modeling, whether wholly data dependent or supported by a priori assumptions. FDA is encouraged to recognize this approach as a valid strategy both for confirming effectiveness and for simultaneously identifying optimal dosage combinations for clinical use, even when the optimal dosages lie unstudied within the boundaries of studied doses defining the response surface. We also encourage FDA to recommend the use of response surface models based on pharmacology for such analyses, rather than the entirely empirical models often chosen by data analysts unfamiliar with the application domain.

In some cases, the 2x2 design may be used to make efficient use of clinical trial subjects by evaluating the efficacy of the two treatments with the same number of subjects as would be required to evaluate the efficacy of either one alone. This strategy has proved to be particularly valuable for very large mortality studies. The efficiency of this approach depends upon the absence of interaction between treatments A and B so that the effects of A and B on the primary efficacy variables follow an additive model, hence the effect of A is virtually identical whether or not it is additional to the effect of B. As for the cross-over trial, evidence that this condition is likely to be met should be established in advance of the trial by means of prior information and data.

FDA should also include consideration of statistical principles applied to confirmatory trials employing dose-titration designs (with analysis using mixed-effects modeling as discussed in the ICH E4 document), and to effect-controlled or concentration-controlled trials.

3.2 Multicenter Trials

Multicenter trials are carried out for two main reasons. First, a multicenter trial is an accepted way of evaluating a new medication more efficiently; under some circumstances, it may present the only practical means of accruing sufficient subjects to satisfy the trial objective within a reasonable timeframe. Multicenter trials of this nature may, in principle, be carried out at any stage of clinical development. They may have several centers with a large number of subjects per center or, in the case of a rare disease, they may have a large number of centers with very few subjects per center.

Second, a trial may be designed as a multicenter (and multi-investigator) trial primarily to provide a better basis for the subsequent generalization of its findings. This arises from the possibility of recruiting the subjects from a wider population and of administering the medication in a broader range of clinical settings, thus presenting an experimental situation that is more typical of future use. In this case, the involvement of a number of investigators also gives the potential for a wider range of clinical judgement concerning the value of the medication. Such a trial would be a confirmatory trial in the later phases of drug development and would be likely to involve a large number of investigators and centers. It might sometimes be conducted in a number of different countries to facilitate generalizability even further.

Missing from the above discussion of the attributes of the modern multicenter clinical trial is consideration of the enhanced confidence in the statistical findings that such a trial offers in comparison with a single-center or single-investigator trial. In effect, a contemporary multicenter clinical trial comprises several or many independent clinical trials, using uniform procedures, integrated into one large trial. The enhanced confidence in such a study derives from both statistical power and, when present, confidence-inspiring replication across centers.

If a multicenter trial is to be meaningfully interpreted and extrapolated, then the manner in which the protocol is implemented should be clear and similar at all centers. Furthermore, the usual sample size and power calculations depend upon the assumption that the differences between the compared treatments in the centers are unbiased estimates of the same quantity. It is important to design the common protocol and to conduct the trial with this background in mind. Procedures should be standardized as completely as possible. Variation of evaluation criteria and schemes can be reduced by investigator meetings, by the training of personnel in advance of the study, and by careful monitoring during the study. Good design should generally aim to achieve the same distribution of subjects to treatments within each center and good management should maintain this design objective. Trials which avoid excessive variation in the numbers of subjects per center and trials which avoid a few very small centers have advantages if it is later found necessary to examine the heterogeneity of the treatment effect from center to center, because they reduce the differences between different weighted estimates of the treatment effect. (This point does not apply to trials in which all centers are very small and in which center does not feature in the analysis.) Failure to take these precautions, combined with doubts about the homogeneity of the results, may, in severe cases, reduce the value of a multicenter trial to such a degree that it cannot be regarded as giving convincing evidence for the sponsor's claims.

The problem of high variation in the numbers of subjects per center can be approached by modern "hierarchical Bayesian analyses."
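
As a rough illustration of the hierarchical idea, the following Python sketch shrinks center-level treatment effect estimates toward an overall mean, with small centers (large standard errors) shrunk the most. It is an empirical Bayes approximation to a normal-normal hierarchical model using a moment estimate of the between-center variance, not a full Bayesian analysis, and it is not a method prescribed by the guideline or by this comment. The function name and the input values in the usage note are hypothetical.

```python
def shrink_center_effects(effects, std_errors):
    """Empirical Bayes shrinkage of per-center treatment effects.

    effects:    observed treatment difference in each center
    std_errors: standard error of each difference (larger for small centers)
    Returns (overall_mean, between_center_variance, shrunken_effects).
    """
    k = len(effects)
    # Inverse-variance weights that ignore between-center variation.
    weights = [1 / se ** 2 for se in std_errors]
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Moment (DerSimonian-Laird type) estimate of the between-center variance.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Overall mean re-weighted to allow for between-center variation.
    re_weights = [1 / (se ** 2 + tau2) for se in std_errors]
    overall = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)

    # Each center is pulled toward the overall mean in proportion to
    # how noisy its own estimate is relative to the between-center spread.
    shrunken = []
    for y, se in zip(effects, std_errors):
        b = se ** 2 / (se ** 2 + tau2)
        shrunken.append(b * overall + (1 - b) * y)
    return overall, tau2, shrunken
```

For example, `shrink_center_effects([2.0, 8.0, 5.0], [1.0, 4.0, 1.0])` pulls the imprecise middle estimate (standard error 4.0) much closer to the overall mean than the two precise estimates.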

In the simplest multicenter trial, each investigator will be responsible for the subjects recruited at one hospital, so that "center'' is identified uniquely by either investigator or hospital. In many trials, however, the situation is more complex. One investigator may recruit subjects from several hospitals; one investigator may represent a team of clinicians (subinvestigators) who all recruit subjects from their own clinics at one hospital or at several associated hospitals. Whenever there is room for doubt about the definition of center in a statistical model, the statistical section of the protocol (see section 5.1) should clearly define the term (e.g., by investigator, location, or region) in the context of the particular trial. In most instances, centers can be satisfactorily defined through the investigators. (ICH Guideline E6 provides relevant guidance in this respect.) In cases of doubt, the aim should be to define centers to achieve homogeneity in the important factors affecting the measurements of the primary variables and the influence of the treatments. Any rules for combining centers in the analysis should be justified and specified prospectively in the protocol where possible, but in any case decisions concerning this approach should always be taken blind to treatment, for example, at the time of the blind review. It is sometimes possible to characterize the centers by historical measures of response to the control treatment or to other standard treatments, and this information may help to support decisions concerning the combination of centers for analysis.

The statistical model to be adopted for the comparison of treatments should be described in the protocol. The main treatment effect may be investigated first using a model that allows for center differences, but does not include a term for center by treatment interaction. In the absence of a true center by treatment interaction, the routine inclusion of interaction terms in the model reduces the efficiency of the test for the main effects. In the presence of a true center by treatment interaction, the interpretation of the main treatment effect is controversial.

In some studies, for example, some large mortality studies with very few subjects per center, there may be no reason to expect the centers to have any influence on the primary or secondary variables because they are unlikely to represent influences of clinical importance. In other studies, it may be recognized from the start that the limited numbers of subjects per center will make it impracticable to include the center effects in the statistical model. In these cases, it is not appropriate to include a term for center in the model, because in this situation randomization is rarely stratified by center.

If positive treatment effects are found in a trial with appreciable numbers of subjects per center, there should generally be a subsequent exploration of treatment by center interaction, as this may affect the generalizability of the conclusions. Marked treatment by center interaction may be identified by graphical display of the results of individual centers or by analytical methods, such as a significance test of the interaction. When using such a statistical significance test, it is important to recognize that this generally has low power in a trial designed to detect the main effect of treatment.

If a treatment by center interaction is found, this should be interpreted with care and vigorous attempts should be made to find an explanation in terms of other features of trial management or subject characteristics. Such an explanation will usually define the appropriate further analysis and interpretation. In the absence of an explanation, marked quantitative interactions imply that alternative estimates of the treatment effect may be needed, giving different weights to the centers, in order to substantiate the robustness of the estimates of treatment effect. It is even more important to understand the basis of any marked qualitative interactions, and failure to find an explanation may necessitate further clinical trials before the treatment effect can be reliably predicted.

3.3 Type of Comparison

3.3.1 Trials to Show Superiority

Scientifically, efficacy is most convincingly established by demonstrating superiority to placebo in a placebo-controlled trial, by showing superiority to an active control treatment, or by demonstrating a dose-response relationship. This type of trial is referred to as a "superiority'' trial (see section 5.2.3). In this guideline, superiority trials are generally assumed unless explicitly stated otherwise.

For serious illnesses, when a therapeutic treatment that has been shown to be efficacious by superiority trial(s) exists, a placebo-controlled trial may be considered unethical. In that case, the scientifically sound use of the active control should be considered. The appropriateness of placebo control versus active control should be considered on a study-by-study basis.

3.3.2 Trials to Show Equivalence or Noninferiority

In some cases, an investigational product is compared to a reference treatment without the objective of showing superiority. This type of trial is divided into two major categories according to its objective; one is an "equivalence'' trial and the other is a "noninferiority'' trial.

Bioequivalence trials fall into the former category. In some situations, clinical equivalence trials are also undertaken for other regulatory reasons, such as demonstrating the clinical equivalence of a generic product to the marketed product when the compound is not absorbed and therefore not present in the blood stream.

Many active control trials are designed to show that the efficacy of an investigational product is no worse than that of the active comparator, and hence fall into the latter category. Another possibility is a "relative potency assay,'' which is a study where multiple doses of the investigational drug are compared with the recommended dose or multiple doses of the standard drug.

Active control equivalence or noninferiority trials may also incorporate a placebo, thus pursuing multiple goals in one trial, for example, establishing superiority to placebo, thereby validating the study design and evaluating the degree of similarity of efficacy and safety to the active comparator. There are well-known limitations associated with the use of the active control equivalence (or noninferiority) trials that do not incorporate a placebo. These relate to the implicit lack of any measure of internal validity (in contrast to superiority trials), thus making external validation necessary. The equivalence (or noninferiority) trial is not conservative in nature, so many flaws in the design or conduct of the trial will tend to bias the results towards a conclusion of equivalence. For these reasons, the design features of such trials should receive special attention.

FDA should acknowledge that confirmation of an expected dose-response relationship among two or more doses of an active comparator validates the study in much the same way as showing a significant difference from placebo.

Active comparators should be chosen with care. An example of a suitable active comparator would be a widely used therapy whose efficacy in the relevant indication has been clearly established and quantified in well-designed and well-documented superiority trial(s) and that can be reliably expected to exhibit similar efficacy in the contemplated active control study. To this end, the new trial should have the same important design features (primary variables, the dose of the active comparator, eligibility criteria, etc.) as the previously conducted superiority trials in which the active comparator clearly demonstrated clinically relevant efficacy.

It is vital that the protocol of a trial designed to demonstrate equivalence or noninferiority contain a clear statement that this is its explicit intention. An equivalence margin should be specified in the protocol; this margin is the largest difference which can be judged as being clinically acceptable. For the active control equivalence trial, both the upper and the lower equivalence margins are needed, while for the active control noninferiority trial, only the lower margin is needed. There should be clinical justification for the choice of equivalence margins.

Statistical analysis is generally based on the use of confidence intervals (see section 5.5). For equivalence trials, the two-sided 1-2α (alpha) confidence limits should be used. Equivalence is inferred when the entire confidence interval falls within the equivalence margins. This is equivalent to the method of using two simultaneous one-sided tests to test the (composite) null hypothesis that the treatment difference is outside of the equivalence margins versus the (composite) alternative that the treatment difference is within the limits. With this method, the Type I error is controlled at a level of α. For noninferiority trials, the one-sided 1-α interval should be used. The confidence interval approach has a one-sided hypothesis test counterpart testing the null hypothesis that the treatment difference (investigational product minus control) is equal to the lower equivalence margin versus the alternative that the treatment difference is greater than the lower equivalence margin. Sample size calculations should be based on these methods (see section 3.5). The choice of α should be a consideration separate from the choice of a one-sided or two-sided test.
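
The confidence interval rules in the preceding paragraph can be sketched in Python as follows. This is a minimal illustration that assumes a normally distributed estimate of the treatment difference with a known standard error; the function names, margins, and numerical inputs are hypothetical and do not come from the guideline.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def is_equivalent(diff, se, lower_margin, upper_margin, alpha=0.05):
    """Equivalence: the two-sided 1-2*alpha interval lies inside the margins."""
    z = STD_NORMAL.inv_cdf(1 - alpha)  # a 1-2*alpha interval uses the 1-alpha quantile
    return lower_margin < diff - z * se and diff + z * se < upper_margin

def is_noninferior(diff, se, lower_margin, alpha=0.05):
    """Noninferiority: the one-sided 1-alpha lower limit exceeds the lower margin."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return diff - z * se > lower_margin

def tost_p_value(diff, se, lower_margin, upper_margin):
    """Two one-sided tests: report the larger of the two one-sided p-values."""
    p_lower = 1 - STD_NORMAL.cdf((diff - lower_margin) / se)  # H0: diff <= lower margin
    p_upper = STD_NORMAL.cdf((diff - upper_margin) / se)      # H0: diff >= upper margin
    return max(p_lower, p_upper)
```

Under these assumptions, `is_equivalent(...)` returns True exactly when `tost_p_value(...)` is below `alpha`, which is the correspondence between the interval and the two one-sided tests described above.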

The above hypothesis-testing orientation of equivalence trial analyses should be expanded to include evaluation of equivalence or nonsuperiority via analysis of dose-response relationships when the study design includes two or more active doses of the investigational and/or reference products.

It is inappropriate to conclude equivalence or noninferiority based on observing a nonsignificant test result of the null hypothesis that there is no difference between the investigational product and the active comparator.

There are also special issues in the choice of analysis sets. Subjects who withdraw or drop out of the treatment group or the comparator group will tend to have a lack of response, hence the analysis of all randomized subjects may be biased toward demonstrating equivalence (see section 5.2.3).

3.3.3 Dose-Response Designs

How response is related to the dose of a new investigational product is a question to which answers may be obtained in all phases of development and by a variety of approaches (see ICH E4). Dose-response studies may serve a number of objectives, among which the following are of particular importance: The confirmation of efficacy; the investigation of the shape and location of the dose-response curve; the estimation of an appropriate starting dose; the identification of optimal strategies for individual dose adjustments; the determination of a maximal dose beyond which additional benefit would be unlikely to occur. These objectives should be addressed using the data collected at a number of doses under investigation, including a placebo (zero dose) wherever appropriate. For this purpose, the application of estimation procedures, including the construction of confidence intervals and of graphical methods is as important as the use of statistical tests. The hypothesis tests that are used may need to be tailored to the natural ordering of doses or to particular questions regarding the shape of the dose-response curve (e.g., monotonicity). The details of the planned statistical procedures should be given in the protocol.

The dose-response relationship is complex to estimate, and Bayesian methods are well suited to the task because the objective is to summarize a dose-response relationship (not just to test the null relationship as confirmation of effectiveness). Moreover, dose-response relationships are complex in that they have a scientific (physiological-pharmacological) causal basis only at the level of the individual patient; population dose-response is simply the marginal dose-response and may not have any valid scientific description except as the average of individual dose-response. Dose-response thus naturally involves a hierarchical model view, motivating use of likelihood or Bayesian methods. We would contest the suggestion that any amount of "tailoring" of hypothesis tests can do justice to dose-response data, and would discourage the FDA from suggesting otherwise.

3.4 Group Sequential Designs

Group sequential designs are used to facilitate the conduct of interim analysis (see section 4.5). While group sequential designs are not the only acceptable types of designs permitting interim analysis, they are the most commonly applied because it is more practicable to assess grouped subject outcomes at periodic intervals during the trial than on a continuous basis as data from each subject become available. The statistical methods should be fully specified in advance of the availability of information on treatment outcomes and subject treatment assignments (i.e., blind breaking, see section 4.5). An independent data monitoring committee (IDMC) may be used to conduct the interim analysis of data arising from a group sequential design (see section 4.6). While the design has been most widely and successfully used in large, long-term trials of mortality or major nonfatal endpoints, its use is growing in other circumstances. In particular, it is recognized that safety must be monitored in all trials; therefore, the need for formal procedures to cover early stopping for safety reasons should always be considered.

3.5 Sample Size

The number of subjects in a clinical trial should always be large enough to provide a reliable answer to the questions addressed. This number is usually determined by the primary objective of the trial. If the sample size is determined on some other basis, this should be made clear and justified. For example, a trial sized on the basis of safety questions or requirements may need larger numbers of subjects than one sized on the basis of efficacy questions. (See, for example, ICH E1A "Population Exposure: The Extent of Population Exposure to Assess Clinical Safety.'')

When determining the appropriate sample size, the following items should be specified: A primary variable; the test statistic; the null hypothesis; the alternative ("working'') hypothesis at the chosen dose(s) (embodying consideration of the treatment difference to be detected or rejected at the dose and in the subject population selected); the probability of erroneously rejecting the null hypothesis (the Type I error) and the probability of erroneously failing to reject the null hypothesis (the Type II error); as well as the approach to dealing with treatment withdrawals and protocol violations. In some instances, the event rate is of primary interest for evaluating power, and assumptions should be made to extrapolate from the required number of events to the eventual sample size for the trial.
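
The extrapolation from a required number of events to a total sample size, mentioned at the end of the preceding paragraph, can be illustrated with a short Python sketch. It assumes a two-sided test of a hazard ratio with equal allocation and uses the widely cited Schoenfeld approximation for the number of events; the guideline itself does not name a formula. The function names and the assumed overall event probability are hypothetical.

```python
import math
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate events needed to detect a hazard ratio (1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided Type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - Type II error
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)

def total_sample_size(events, event_probability):
    """Subjects needed so that the expected number of events reaches the target.

    event_probability is the assumed chance that a randomized subject has
    the event during follow-up, averaged over both arms.
    """
    return math.ceil(events / event_probability)
```

With a hazard ratio of 0.75, a two-sided Type I error of 5 percent, and a Type II error of 20 percent, `required_events(0.75)` gives 380 events; if 30 percent of subjects are assumed to have an event, `total_sample_size(380, 0.3)` gives 1267 subjects.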

The approach recommended above for determining sample size is oriented to simple hypothesis-testing clinical trials. While a dose-response trial may be viewed as a test of the zero dose-response hypothesis, generally the objective is to estimate the parameters of the dose-response function (specifically, the population distribution of the parameters of individual dose-response), not only for the purpose of establishing effectiveness (rejection of the zero dose-response hypothesis via parameter estimates that differ significantly from zero) but also for the purpose of confirming the shape of the dose-response relationship and for interpolation to determine optimal doses. Hence, FDA is encouraged to include a discussion of sample size determination in the context of the estimation objectives of the dose-response trial.

The method by which the sample size is calculated should be given in the protocol, together with the estimates of any quantities used in the calculations (such as variances, mean values, response rates, event rates, difference to be detected). The basis of these estimates should also be given. It is important to investigate the sensitivity of the sample size estimate to a variety of deviations from these assumptions and this may be facilitated by providing a range of sample sizes appropriate for a reasonable range of deviations from assumptions.
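
One way to present the sensitivity analysis described above is a small table of sample sizes over a grid of assumptions. The Python sketch below does this for a two-group comparison of means using the usual normal approximation; the choice of endpoint, the formula, the function names, and the grid values are illustrative assumptions rather than requirements of the guideline.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Subjects per group to detect a mean difference delta (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)

def sensitivity_table(deltas, sds, alpha=0.05, power=0.80):
    """Sample size per group for every combination of assumed difference and SD."""
    return {(d, s): n_per_group(d, s, alpha, power) for d in deltas for s in sds}

if __name__ == "__main__":
    # Hypothetical planning values: difference of 5 units, SD between 8 and 12.
    for (d, s), n in sorted(sensitivity_table([5.0], [8.0, 10.0, 12.0]).items()):
        print(f"difference={d}, SD={s}: {n} per group")
```

In this hypothetical case the requirement rises from 41 to 63 to 91 subjects per group as the assumed standard deviation moves from 8 to 10 to 12, which shows how strongly the result depends on the variance assumption.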

This is an area where simulations may be useful in defining the allowable range of variability for crucial assumptions within a given sizing strategy, as an alternative to sizing the study to account for the maximum possible variability in the trial. A risk judgement could then be made by the sponsor (perhaps in collaboration with the Agency) to accept a trial of smaller size if the outcome is within a prespecified range.
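
A minimal simulation of the kind suggested above might look like the following Python sketch. It estimates the power of a fixed trial size under different assumed standard deviations, using simulated normal outcomes and a large-sample z-test; the outcome model, the test, the function name, and all numerical inputs are hypothetical simplifications.

```python
import random
from statistics import NormalDist, mean, stdev

def simulated_power(n_per_group, delta, sd, alpha=0.05, n_sims=2000, seed=1):
    """Fraction of simulated trials that reject the null hypothesis of no difference."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(delta, sd) for _ in range(n_per_group)]
        # Standard error of the difference in means from the sample variances.
        se = ((stdev(control) ** 2 + stdev(treated) ** 2) / n_per_group) ** 0.5
        if abs(mean(treated) - mean(control)) / se > z_crit:
            rejections += 1
    return rejections / n_sims

if __name__ == "__main__":
    # How far can the true SD drift from the planning value of 10
    # before a trial with 63 subjects per group loses too much power?
    for sd in (8.0, 10.0, 12.0):
        print(f"SD={sd}: estimated power {simulated_power(63, 5.0, sd):.2f}")
```

Scanning a range of values in this way gives the sponsor an explicit picture of which departures from the planning assumptions a smaller trial could tolerate.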

In confirmatory studies, assumptions should normally be based on published data or on the results of earlier studies. The treatment difference to be detected may be based on a judgement concerning the minimal effect that has clinical relevance in the management of patients or on a judgement concerning the anticipated effect of the new treatment, where this is larger. Conventionally, the probability of Type I error is set at 5 percent or less or as dictated by any adjustments made necessary for multiplicity considerations; the precise choice is influenced by the prior plausibility of the hypothesis under test and the desired impact of the results. The probability of Type II error is conventionally set at 20 percent or less; it is in the sponsor's interest to keep this figure as low as feasible, especially in the case of studies that are difficult or impossible to repeat.

FDA is encouraged to provide a justification for the 5% or less Type I error level prescribed above and to consider under what conditions a 1- or 2-tailed test is appropriate. In addition, when two independent confirmatory clinical trials are required or undertaken, the impact of such replication (or substantiation) on the overall Type I error regarding effectiveness should be considered.

Sample size calculations should refer to the number of subjects required for the primary analysis. If this is the "all randomized subjects'' set, estimates about the effect size may need to be reduced compared to the per protocol set. This is due to the diluting effect of the inclusion of treatment withdrawals. The assumptions of variability may also need to be revised.

FDA should consider more generally the impact of protocol deviations on sample sizing based on the "all randomized subjects'' set, especially the dilutional effect of less than full compliance with ingestion or administration of assigned treatments, and the serious biases that may result if compliance is related to drug effects; i.e., outcomes. Consideration should be given to how to incorporate into trial design and analysis quantitative information on non-compliance with treatment assignments, and the causes for compliance deviations, obtained during the course of a clinical trial.

The sample size of an equivalence trial or a noninferiority trial (see section 3.3.2) should normally be based on the objective of obtaining a confidence interval for the treatment difference that shows that the treatments differ at most by a clinically acceptable difference. For equivalence trials, the power is usually assessed at a true difference of zero, but the sample size needed to achieve this power can be underestimated if the true difference is not zero. For noninferiority trials, the power is usually assessed at an expected (nonzero) difference, but the sample size needed can be underestimated if the true difference is less than expected. The choice of a "clinically acceptable'' difference needs justification, and may be smaller than the "clinically relevant'' difference referred to above in the context of superiority trials designed to establish that a difference exists.
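
The dependence of equivalence-trial power on the true difference can be made concrete with a short Python sketch. It assumes a difference in means with known standard deviation, symmetric equivalence margins, and the confidence interval rule of section 3.3.2; the function name and the numerical inputs are hypothetical.

```python
from statistics import NormalDist

def equivalence_power(true_diff, margin, sd, n_per_group, alpha=0.05):
    """Probability that the 1-2*alpha interval falls within (-margin, +margin)."""
    std_normal = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5  # standard error of the difference in means
    z = std_normal.inv_cdf(1 - alpha)
    # Equivalence is declared when -margin + z*se < estimate < margin - z*se.
    upper = (margin - true_diff) / se - z
    lower = (-margin - true_diff) / se + z
    return max(0.0, std_normal.cdf(upper) - std_normal.cdf(lower))

if __name__ == "__main__":
    # Hypothetical design: margin of 5 units, SD of 10, 100 subjects per group.
    for true_diff in (0.0, 1.0, 2.0):
        power = equivalence_power(true_diff, 5.0, 10.0, 100)
        print(f"true difference {true_diff}: power {power:.2f}")
```

Under these assumptions, a design that has high power when the treatments are truly identical loses power quickly as the true difference moves toward the margin, so a trial sized at a true difference of zero can be too small.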

The sample size in a group sequential trial cannot be fixed in advance because it depends upon the play of chance in combination with the chosen stopping rule and the true treatment difference. The design of the stopping rule should take into account the consequent distribution of the sample size, usually embodied in the expected and maximum sample sizes.

When event rates are lower than anticipated or variability is larger than expected, methods for sample size reestimation are available without unblinding data or making treatment comparisons (see section 4.4).

3.6 Data Capture and Processing

The collection of data and transfer of data from the investigator to the sponsor can take place through a variety of media, including paper case record forms, remote site monitoring systems, medical computer systems, and electronic transfer. Whatever data capture instrument is used, the form and content of the information collected should be in full accordance with the protocol and should be established in advance of the conduct of the clinical trial. It should focus on the data necessary to implement the analysis plan, including the context information (such as timing assessments relative to dosing) necessary to confirm protocol compliance or identify important protocol deviations. "Missing values'' should be distinguishable from the "value zero'' or "characteristic absent.''

The process of data capture, through to database finalization, should be carried out in accordance with good clinical practice (GCP) (see ICH E6, section 5). Specifically, timely and reliable processes for recording data and rectifying errors and omissions are necessary to ensure delivery of a quality database and the achievement of the trial objectives through the implementation of the analysis plan.

IV. Study Conduct

4.1 Trial Monitoring

Careful conduct of a clinical trial according to the protocol has a major impact on the credibility of the results. Careful monitoring can ensure that difficulties are noticed early and their occurrence or recurrence minimized.

There are two distinct types of monitoring that generally characterize confirmatory clinical trials sponsored by the pharmaceutical industry. Both types of trial monitoring, in addition to entailing different staff responsibilities, involve access to different types of study data and information, thus different principles apply for the control of potential statistical and operational bias.

One type of monitoring concerns the oversight of the quality of the trial, including whether the protocol is being followed, acceptability of data being accrued, success of planned accrual targets, checking the design assumptions, etc. (see sections 4.2 to 4.4). This type of monitoring does not require access to information on comparative treatment effects, nor unblinding of data, and therefore has no impact on Type I error. The monitoring of a trial for this purpose is the responsibility of the sponsor and can be carried out by the sponsor or an independent group selected by the sponsor. The period for this type of monitoring usually starts with the selection of the study sites and ends with the collection and cleaning of the last subject's data.

The other type of trial monitoring involves breaking the blind to make treatment comparisons. It therefore involves the accruing of comparative treatment results, which requires that a protocol (or appropriate amendments prior to a first analysis) contain statistical plans to prevent certain types of bias. This type of trial monitoring involves unblinded (i.e., key breaking) access to treatment group assignment (actual treatment assignment or identification of group assignment) and comparative treatment group summary information. This type of monitoring is discussed in sections 4.5 and 4.6.

4.2 Changes in Inclusion and Exclusion Criteria

Inclusion and exclusion criteria should remain constant, as specified in the protocol, throughout the period of subject recruitment. Occasionally, however, changes may be appropriate; in long-term studies, for example, growing medical knowledge either from outside the trial or from interim analyses may suggest a change of entry criteria. Changes may also result from the discovery by monitoring staff that regular violations of the entry criteria are occurring, or that seriously low recruitment rates are due to over-restrictive criteria. Changes should be made without breaking the blind and should always be described by a protocol amendment that should cover any statistical consequences, such as sample size adjustments arising from different event rates, or modifications to the analysis plan, such as stratifying the analysis according to modified inclusion/exclusion criteria.

4.3 Accrual Rates

In studies with a long time-scale for the accrual of subjects, the rate of accrual should be monitored; if it falls appreciably below the projected level, the reasons should be identified and remedial actions taken to protect the power of the trial and allay concerns about selective entry and other aspects of quality. In a multicenter trial, these considerations apply to the individual centers.

4.4 Sample Size Adjustment

In long-term trials, there will usually be an opportunity to check the assumptions which underlie the original design and sample size calculations. This may be particularly important if the trial specifications have been made on preliminary and/or uncertain information. An interim check conducted on the blinded data may reveal that overall response variances, event rates, or survival experience are not as anticipated. A revised sample size may then be calculated using suitably modified assumptions, and should be justified and documented in a protocol amendment and in the final report. The steps taken to preserve blindness and the consequences, if any, for the Type I error and the width of confidence intervals should be explained. The potential need for reestimation of the sample size should be envisaged in the protocol whenever possible (see section 3.5).

4.5 Interim Analysis and Early Stopping

Any analysis intended to compare treatment arms with respect to efficacy or safety at any time prior to formal completion of a trial is an interim analysis. Because the number, methods, and consequences of these comparisons affect the interpretation of the trial, all interim analyses should be carefully planned in advance and described in the protocol, or otherwise specified in amendments prior to unblinded access to treatment comparison data. When an interim analysis is planned with the intention of deciding whether or not to terminate a trial, this is usually accomplished by the use of a group sequential design that employs statistical monitoring schemes as guidelines (see section 3.4). The goal of such an interim analysis is to stop the trial early if the superiority of the treatment under study is clearly established, if the demonstration of a relevant treatment difference has become unlikely, or if unacceptable adverse effects are apparent. Generally, boundaries for monitoring efficacy require more evidence to terminate a trial early (i.e., more conservative) than do boundaries to terminate a trial for safety reasons. When the trial design and monitoring objective involve multiple endpoints, then this aspect of multiplicity may also need to be taken into account.

The schedule of interim analyses, or at least the considerations which will govern its generation, should be stated in the protocol or a protocol amendment before the time of the first interim analysis; as flexible statistical methods are available to conduct interim analyses according to a variety of needs (early or late in a trial), the stopping guidelines and their properties should be clearly stated in the protocol or amendments. This material should be written or approved by the data monitoring committee, when the study has one (see section 4.6). Deviations from the planned procedure always bear the potential of invalidating the study results. If it becomes necessary to make changes to the trial, any consequent changes to the statistical procedures should be specified in an amendment to the protocol at the earliest opportunity, especially discussing the impact on any analysis and inferences that such changes may cause. The procedures selected should always ensure that the overall probability of Type I error is controlled.

The execution of an interim analysis should be a completely confidential process because unblinded data and results are potentially involved. All staff involved in the conduct of the trial should remain blind to the results of such analyses because of the possibility that their attitudes to the trial will be modified and cause changes in recruitment patterns or biases in treatment comparisons. This principle applies to the investigators and their staff and to staff employed by the sponsor that come into contact with clinic staff or subjects. Investigators should be informed only about the decision to continue or to discontinue the trial, or to implement modifications to trial procedures.

Most clinical trials intended to support the efficacy and safety of an investigational product should proceed to full completion of planned sample size accrual; trials should be stopped early only for ethical reasons or if the power is no longer acceptable. However, it is recognized that drug development plans involve the need for sponsor access to comparative treatment data for a variety of reasons, such as planning other studies or when only a subset of trials will involve the study of serious life-threatening outcomes or mortality which may need sequential monitoring of accruing comparative treatment effects for ethical reasons. In either of these situations, plans for interim statistical analysis should be in place in the protocol or in protocol amendments prior to the unblinded access to comparative treatment data in order to deal with the potential statistical and operational bias that may be introduced.

For many clinical trials of investigational products, especially those that have major public health significance, the responsibility for monitoring comparisons of efficacy and/or safety outcomes should be assigned to an external, independent group, often called an independent data monitoring committee (IDMC), a data and safety monitoring board, or a data monitoring committee, whose responsibilities should be clearly described.

When a sponsor assumes the role of monitoring efficacy or safety comparisons and therefore has access to unblinded comparative information, particular care should be taken to protect the integrity of the trial and the sharing of information. The sponsor should ensure and document that the internal monitoring committee has complied with written SOPs and that minutes of decision-making meetings are maintained.

Any interim analysis that is not planned in the protocol or specified in an amendment to the protocol prior to unblinding the data (with or without the consequences of stopping the trial early) may flaw the results of a trial and possibly weaken confidence in the conclusions drawn. Therefore, such analyses should be avoided. If unplanned interim analysis is conducted, the study report should explain why it was necessary and the degree to which blindness had to be broken, and provide an assessment of the potential magnitude of bias introduced and the impact on the interpretation of the results.

4.6 Role of Independent Data Monitoring Committee (IDMC)

(see sections 1.25 and 5.5.2 of ICH Guideline E6)

An IDMC may be established by the sponsor to assess at intervals the progress of a clinical trial, safety data, and critical efficacy variables and recommend to the sponsor whether to continue, modify, or terminate a trial. The IDMC should have written operating procedures and maintain records of its meetings. The independence of the IDMC is intended to control the sharing of important comparative information and to protect the integrity of the clinical trial from adverse impact resulting from access to trial information. The IDMC is a separate entity from an institutional review board (IRB) or an ethics board, and its composition should include clinical trial scientists knowledgeable in the appropriate disciplines, including statistics.

When there are sponsor representatives on the IDMC, their role should be clearly defined in the operating procedures of the committee (for example, covering whether or not they can vote on key issues). Since these sponsor staff would have access to unblinded information, the procedures should also address the control of dissemination of interim trial results within the sponsor organization.

V. Data Analysis

5.1 Prespecified Analysis Plan

When designing a clinical trial, the principal features of the eventual statistical analysis of the data should be described in the statistical section of the protocol. This section should include all features of the proposed confirmatory analysis of the primary variable(s) and the way in which anticipated analysis problems will be handled. In the case of exploratory trials, this section could describe more general principles and directions.

Subsequently, a statistical analysis plan may be written as a separate document. In this document, a more technical and detailed elaboration of the principal features stated in the protocol may be included. The statistical analysis plan is usually an internal document and may include detailed procedures for executing the statistical analysis. The statistical analysis plan should be reviewed and possibly updated as a result of the blind review of the data (see section 7.1 for definition).

While CDDS endorses the concept of prespecified analysis planning, FDA should encourage the most informative analyses to be applied to data from a completed trial, regardless of whether they were specified in advance (see second CDDS comment in Section 1.2).

Nevertheless, since documentation of all analyses that eventually are applied to all study results is usually meager, incomplete, or absent, CDDS recommends that detailed plans for all anticipated analyses of all observations be outlined in the protocol in advance of the study (not "subsequently" to the design phase of trial planning, as if an "afterthought", as proposed above). Simulation of study results during the trial design phase may be helpful in identifying the appropriateness of data analytic techniques and even the value (or lack thereof) of candidate observations. In addition to improving the data analysis of clinical trials, imposition of this documentary requirement may have a positive impact on over-observed trials by identifying for elimination those observations that may have insufficient value or may be unanalyzable or uninterpretable.
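
A minimal sketch of design-phase simulation follows. It assumes a hypothetical two-arm parallel trial with a normally distributed primary variable, a hypothetical effect of 0.5 standard deviations, 64 subjects per arm, and a large-sample z test; none of these values comes from the guideline. The same routine run with a zero effect gives a check on the false positive rate of the planned analysis.

```python
import math
import random
import statistics

def simulated_power(effect, sd, n_per_arm, n_sims=2000, seed=7):
    """Fraction of simulated trials in which the two-sided z test rejects at the 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        active = [rng.gauss(effect, sd) for _ in range(n_per_arm)]
        diff = statistics.fmean(active) - statistics.fmean(control)
        se = math.sqrt((statistics.variance(active) + statistics.variance(control)) / n_per_arm)
        if abs(diff / se) >= 1.96:  # large-sample approximation, an assumption of this sketch
            hits += 1
    return hits / n_sims

power = simulated_power(effect=0.5, sd=1.0, n_per_arm=64)
false_positive = simulated_power(effect=0.0, sd=1.0, n_per_arm=64)
print(f"power: {power:.2f}   false positive rate: {false_positive:.2f}")
```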

In the case of exploratory trials, it is not sufficient to consider only "more general principles and directions". It is equally important to anticipate the data analyses during the planning phase, since input from clinical pharmacologists and disease area experts is essential in selecting the most informative designs and questions to be explored by the analyses.

If the blind review suggests changes to the principal features stated in the protocol, these should be documented in a protocol amendment. Otherwise, it will suffice to update the statistical analysis plan with the considerations suggested from the blind review. Only results from analyses envisaged in the protocol (including amendments) can be regarded as confirmatory.

The statistical methodology, including when in the clinical trial process methodology decisions were made, should be clearly described in the statistical section of the clinical study report (see ICH E3).

5.2 Analysis Sets

The set of subjects whose data are to be included in the main analyses should be defined in the statistical section of the protocol. In addition, documentation for all subjects for whom study procedures (e.g., run-in period) were initiated may be useful. The content of this subject documentation depends on detailed features of the particular trial, but at least demographic and baseline data on disease status should be collected whenever possible.

If all subjects randomized into a clinical trial satisfied all entry criteria, followed all trial procedures perfectly with no losses to followup, and provided complete data records, then the set of subjects to be included in the analysis would be self-evident. The design and conduct of a trial should aim to approach this ideal as closely as possible, but, in practice, it is doubtful if it can ever be fully achieved. Hence, the statistical section of the protocol should address any anticipated problems prospectively in terms of how these affect the subjects and data to be analyzed. The protocol should also specify procedures aimed at minimizing any anticipated irregularities in study conduct that might impair a satisfactory analysis, including various types of protocol violations, withdrawals, and missing values. The protocol should consider ways both to reduce the frequency of such problems and to handle the problems that occur in the analysis of data. The blind review of data to identify possible amendments to the analysis plan due to the protocol violations should be carried out before unblinding. It is desirable to identify any important protocol violation with respect to the time when it occurred, its cause, and its influence on the trial result. The frequency and type of protocol violations, missing values, and other problems should be documented in the study report and their potential influence on the trial results should be described (see ICH E3).

Decisions concerning the analysis set should be guided by the following principles: (1) To minimize bias and (2) to avoid inflation of Type I error.

5.2.1 All Randomized Subjects

The intention-to-treat principle implies that the primary analysis should include all randomized subjects. In practice, this ideal may be difficult to achieve, for reasons to be described. Hence, analysis sets referred to as "all randomized subjects'' may not, in fact, include every subject. For example, it is common practice to exclude from the all randomized set any subject who failed to take at least one dose of trial medication or any subject without data post randomization. No analysis is complete unless the potential biases arising from these exclusions are addressed and can be reasonably dismissed.

In many clinical trials, the "all randomized subjects'' approach is conservative and also gives estimates of treatment effects that are more likely to mirror those observed in subsequent practice. Randomization prevents biased allocation of subjects to treatments and provides the foundation of statistical tests. The problems associated with the analysis of all randomized subjects lie in the handling of protocol violations and the subtleties that this can involve.

The "estimates of treatment effects that are more likely to mirror those observed in subsequent practice" derived from analysis of "all randomized subjects" (including full or partial treatment non-compliers) are usually downward-biased population effect size estimates for both beneficial and toxic effects, and they teach little about what the individual recipient of the drug should expect, especially one who is fully compliant. This may lead to "mislabeling", in that downward-biased estimates of both beneficial and toxic effects published in an approved label represent untruthful advice for the prescriber or recipient of the drug. Moreover, we would be interested in the evidence that supports the view expressed in the phrase just quoted: it seems equally plausible that compliance with a newly released medicine, now authoritatively asserted to be effective, will be greater than that seen in a clinical trial of what is viewed at the time as a medicine of unknown efficacy. FDA should provide guidance on methods to estimate what future individual patients should expect to experience regarding a drug's efficacy and safety, rather than relying upon estimates of effects based on population averages.

While the intention-to-treat approach may appear to be conservative from the point of view of protection against false proof of effectiveness, the typically underestimated adverse reaction rates create a falsely optimistic safety estimate and are thus dangerous. The FDA should reconcile this problem by providing guidance on how to estimate corrected treatment effects (both beneficial and toxic) in the face of observed full or partial medication non-compliance. Stated as a question: how does the FDA suggest one estimate the effect of a drug in the individual patient when taken as prescribed, as opposed to being satisfied only with confirmation of the effect of the prescription policy in a population; or does the FDA consider this to be an unimportant or non-regulatory question?

There are two types of major protocol violations. One is violation of entry criteria. The second is violation of the protocol after randomization. Subjects who fail to satisfy an objective entry criterion measured prior to randomization, but who enter the trial, may be excluded from analysis without introducing bias into the treatment comparison, assuming all subjects receive equal scrutiny for eligibility violations. (This may be difficult to ensure if the data are unblinded.) Not all entry criteria are sufficiently objective for this to be done satisfactorily. Reasons for excluding subjects from the analysis of all randomized subjects should be justified.

Other problems occur after randomization (error in treatment assignment, use of excluded medications, poor compliance, loss to followup, missing data, and other protocol violations). These problems are especially difficult when their occurrence is related to treatment assignment. It is good practice to assess the pattern of such problems with respect to frequency and time to occurrence among treatment groups. Subjects withdrawn from treatment may introduce serious bias and, if they have provided no data after withdrawal, there is no obvious solution. Severe protocol violation, such as use of excluded medication, may also introduce serious bias into measurements after such a violation. The necessary inclusion of such subjects in the analysis may require some redefinition of the primary variable or some assumptions about the subjects' outcomes.

Rather than simply warn of difficulty and bias when a commonplace post-randomization problem occurs such as poor compliance, FDA should encourage quantitative assessment of medication compliance and provide guidance on incorporation of such observations into trial data analyses, e.g. by providing guidance on how to estimate corrected treatment effects (both beneficial and toxic) in the face of observed full or partial medication non-compliance.

Measurements of primary variables made at the time of the loss to followup of a subject for any reason or at the time of a severe protocol violation, or subsequently collected in accordance with the protocol, are valuable in the context of all randomized subjects analysis. Their use in analysis should be described and justified in the statistical section of the protocol and their collection described elsewhere in the protocol. However, the use of imputation techniques can lead to biased estimates of treatment effects, particularly when the likelihood of the loss of a subject is related to treatment or response. Any other methods to be employed to ensure the availability of measurements of primary variables for every subject in the all randomized subjects analysis should be described.

While empirical imputation techniques such as "last observation carried forward", which often confer conservative biases, are to be discouraged, FDA should encourage pharmacologically valid modeling as an imputation technique or for incorporation into advanced data analysis approaches such as mixed effects modeling. Pharmacologically valid modeling and multiple imputation techniques include modeling based on pharmacokinetic, pharmacodynamic, and clinical effect observations made in other scientific investigations in the drug development program. Model-based analysis methods are particularly applicable here.
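
To make the concern about "last observation carried forward" concrete, the following minimal sketch applies it to a hypothetical subject who improves by 2 units per visit and drops out after the third of five visits. The data and the function name are illustrative assumptions.

```python
def locf(visits):
    """Replace each missing value (None) with the most recent observed value."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical subject: steady improvement of 2 units per visit, then dropout.
observed = [10.0, 12.0, 14.0, None, None]
print(locf(observed))  # [10.0, 12.0, 14.0, 14.0, 14.0]
```

If the trend had continued, the final value would have been 18.0, so the carried-forward value of 14.0 understates the change from baseline. The direction and size of such a bias depend on the true trajectory and on who drops out, which is why a model of the trajectory may be preferable.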

Because of the unpredictability of some problems, it may sometimes be preferable to defer detailed consideration of the manner of dealing with irregularities until the blind review of the data at the end of the study and, if so, this should be stated in the protocol.

Not mentioned here is the potential value of measurements of compliance or the use of population pharmacokinetic data (e.g. prespecified measurements of plasma or urine drug levels during the trial) and the contribution of such measurements to analysis of effectiveness and safety. While this does not eliminate the potential for bias due to differences in compliance attributable to treatment effects, it provides a method for estimating the magnitude of any such effects and their potential impact on the outcome of the trial. Additional data on compliance and "surrogates" can contribute to analyses that will place bounds on the possible true treatment effects; i.e. such measurements enable assessment of the sensitivity of as-randomized estimates to possible non-compliance.

5.2.2 Per Protocol Subjects

The "per protocol'' set of subjects, sometimes described as the "valid cases,'' the "efficacy'' sample, or the "evaluable subjects'' sample, defines a subset of the data used in the all randomized subjects analysis and is characterized by the following criteria:

(i) The completion of a certain prespecified minimal exposure to the treatment regimen;

(ii) The availability of measurements of the primary variable(s);

(iii) The absence of any major protocol violations, including the violation of entry criteria where the nature of and reasons for these protocol violations should be defined and documented before breaking the blind.
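
The three criteria above can be expressed as a prespecified filter on the all randomized subjects set. The sketch below is only illustrative: the record fields, the exposure threshold, and the subject data are hypothetical, and in a real trial the violation flags would be assigned before breaking the blind.

```python
from dataclasses import dataclass

@dataclass
class Subject:
    subject_id: str
    doses_taken: int              # exposure to the treatment regimen
    has_primary_measurement: bool
    major_violation: bool         # determined and documented while still blinded

MIN_DOSES = 10  # hypothetical prespecified minimal exposure

def per_protocol(subjects, min_doses=MIN_DOSES):
    """Return the subset meeting criteria (i) to (iii)."""
    return [s for s in subjects
            if s.doses_taken >= min_doses      # (i) minimal exposure
            and s.has_primary_measurement      # (ii) primary variable available
            and not s.major_violation]         # (iii) no major protocol violation

all_randomized = [
    Subject("001", 14, True, False),
    Subject("002", 3, True, False),    # insufficient exposure
    Subject("003", 14, False, False),  # primary variable missing
    Subject("004", 14, True, True),    # major protocol violation
]
pp_ids = [s.subject_id for s in per_protocol(all_randomized)]
print(pp_ids)  # ['001']
```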

This set may maximize the opportunity for a new treatment to show additional efficacy in the analysis, and most closely reflects the scientific model underlying the protocol. However, it may or may not be conservative, depending on the study, and may be subject to bias (possibly severe) because the subjects adhering most diligently to the study protocol may not be representative of the entire study population.

Despite the inherent potential for a biased estimate of efficacy and side effects, analysis of the "per protocol'' set of subjects affords a simple, although biased, assessment of safety that is crucial when quantitative observations of medication compliance are available to affirm level of exposure. However, the implicit suggestion that the analysis be carried out by re-assigning patients to "treatment" groups according to treatment received and then analyzing as though these were the randomized assignments is certainly not the only, nor the best method of attempting to determine true treatment effects; the use of an instrumental variable (e.g. the assignment to different doses) analysis can be considered in this instance.
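
The contrast between analysis by assignment, analysis by treatment received, and an instrumental variable analysis can be sketched by simulation. The example below uses randomized assignment itself as the instrument (rather than assignment to different doses) and forms the simple ratio estimate: the as-randomized effect divided by the difference in treatment uptake between arms. All settings are assumptions made for illustration: a true effect of 2.0 in those who take the drug, 60% compliers, non-compliers with a worse prognosis, no access to the drug in the control arm, and no effect of assignment other than through drug taken.

```python
import random
import statistics

def simulate(n=20000, true_effect=2.0, p_comply=0.6, seed=3):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        assigned = rng.random() < 0.5
        complier = rng.random() < p_comply
        baseline = 0.0 if complier else -1.0  # assumed worse prognosis in non-compliers
        took_drug = assigned and complier     # assumed: no drug available in the control arm
        outcome = baseline + (true_effect if took_drug else 0.0) + rng.gauss(0.0, 1.0)
        rows.append((assigned, took_drug, outcome))
    return rows

def mean_outcome(rows, index, flag):
    return statistics.fmean(r[2] for r in rows if r[index] == flag)

rows = simulate()
itt = mean_outcome(rows, 0, True) - mean_outcome(rows, 0, False)
as_treated = mean_outcome(rows, 1, True) - mean_outcome(rows, 1, False)
uptake = (statistics.fmean(1.0 if r[1] else 0.0 for r in rows if r[0])
          - statistics.fmean(1.0 if r[1] else 0.0 for r in rows if not r[0]))
iv = itt / uptake  # ratio (instrumental variable) estimate of the effect in compliers
print(f"as randomized: {itt:.2f}  as treated: {as_treated:.2f}  instrumental variable: {iv:.2f}")
```

Under these assumptions the as-randomized estimate is diluted toward 1.2, the as-treated comparison is inflated because those who take the drug differ in prognosis, and the ratio estimate is close to the true effect of 2.0. The result depends on the stated assumptions holding.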

Protocol Analysis

In general, it is advantageous to demonstrate a lack of sensitivity of the principal trial results to alternative choices of the set of subjects analyzed. In confirmatory trials, it is usually appropriate to plan to conduct both all randomized subjects and per protocol analyses, so that any differences between them can be the subject of explicit discussion and interpretation. In some cases, it may be desirable to plan further exploration of the sensitivity of conclusions to the choice of the set of subjects analyzed. When the all randomized subjects and the per protocol analyses come to essentially the same conclusions, confidence in the study results is increased, bearing in mind, however, that the need to exclude a substantial proportion of subjects from the per protocol analysis throws some doubt on the overall validity of the study.

The value of concurrence of conclusions derived from analysis of both subject sets (all randomized and per protocol subjects) depends upon the objectives and assumptions of the trial. A pure prescribing policy is tested adequately by using the intention-to-treat principle (all randomized subjects); concurrence of analysis of per protocol subjects provides little added confidence, since overall effect size estimates are biased, albeit often in opposite directions. However, when a trial objective is quantitative estimation of outcome magnitudes ascribable to actual exposure to one or more active doses of a drug, both subject sets may yield biased estimates of actual treatment effects. Concurrence here provides only enhanced confidence in a simple test of treatment policy (or policies) but does not correct biased effect size estimates.

All randomized subjects and per protocol analyses play different roles in superiority trials (which seek to show the investigational product to be superior) and in equivalence or noninferiority trials (which seek to show the investigational product to be comparable, see section 3.3.2). In superiority studies, the all randomized subjects analysis usually tends to avoid the optimistic estimate of efficacy which may result from a per protocol analysis, since the noncompliers included in an all randomized subjects analysis will generally diminish the overall treatment effect. However, in an equivalence or noninferiority trial, the all randomized subjects analysis is no longer conservative and its role should be considered very carefully.

5.3 Missing Values and Outliers

Missing values represent a potential source of bias in a clinical trial. Hence, every effort should be undertaken to fulfill all the requirements of the protocol concerning the collection and management of data. However, in reality there will almost always be some missing data. A study may be regarded as valid, nonetheless, provided the methods of dealing with missing values are sensible, particularly if those methods are predefined in the analysis plan of the protocol. Predefinition of methods may be facilitated by updating this aspect of the analysis plan during the blind review. Unfortunately, no universally applicable methods of handling missing values can be recommended. An investigation should be made concerning the sensitivity of the results of analysis to the method of handling missing values, especially if the number of missing values is substantial.

Incorporation of missing values in trial analyses may be approached using multiple imputation techniques.
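
Once the same analysis has been run on each of several imputed data sets, the results are commonly combined with Rubin's rules: the pooled estimate is the mean of the per-data-set estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below shows only this pooling step; the five estimates and variances are hypothetical, and the imputation model itself is outside its scope.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical treatment effect estimates from m = 5 imputed data sets.
estimates = [1.8, 2.1, 2.4, 1.9, 2.3]
variances = [0.25, 0.25, 0.25, 0.25, 0.25]
pooled, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.2f}   standard error: {pooled_se:.3f}")
```

The pooled standard error exceeds the single-data-set value of 0.5 because it also reflects the uncertainty about the missing values.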

A similar approach should be adopted to exploring the influence of outliers, the statistical definition of which is, to some extent, arbitrary. Clear identification of a particular value as an outlier is most convincing when justified medically as well as statistically, and the medical context will then often define the appropriate action. Any outlier procedure set out in the protocol should not favor any treatment group a priori. Once again, this aspect of the analysis plan can be usefully updated during blind review. If no procedure for dealing with outliers was foreseen in the study protocol, one analysis with the actual values and at least one other analysis eliminating or reducing the outlier effect should be performed and differences between their results discussed.
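
One way to carry out the paired analyses described above is to report the treatment difference on the actual values alongside a difference computed after winsorizing, that is, after replacing the most extreme values in each arm with the nearest remaining value. The sketch below applies the same rule to both arms so that it does not favor either group a priori. The data, the choice of winsorizing, and the number of values replaced are assumptions made for illustration.

```python
import statistics

def winsorize(values, k=1):
    """Replace the k smallest and k largest values with the nearest remaining value."""
    if len(values) <= 2 * k:
        raise ValueError("need more than 2*k values")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

active = [4.1, 3.8, 4.5, 4.0, 19.0]  # 19.0 is a hypothetical suspect value
control = [3.0, 3.2, 2.9, 3.1, 3.3]

raw_diff = statistics.fmean(active) - statistics.fmean(control)
robust_diff = statistics.fmean(winsorize(active)) - statistics.fmean(winsorize(control))
print(f"actual values: {raw_diff:.2f}   winsorized: {robust_diff:.2f}")
```

With these hypothetical data the two analyses give differences of 3.98 and 1.12, a discrepancy large enough that it would need to be discussed in the study report.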

FDA's requests for investigations "concerning the sensitivity of the results of analysis to the method of handling missing values" and "to exploring the influence of outliers" can be satisfied by simulation of clinical trial results under varying distributions of missing values and outliers, using candidate techniques for handling missing values or outliers. FDA may wish to provide a list of other known methods that can be employed to satisfy these recommendations. FDA should consider Bayesian approaches to outliers, such as using "long-tailed" (non-normal) error distributions.

5.4 Data Transformation/Modification

The decision to transform key variables prior to analysis is best made during the design of the trial on the basis of similar data from earlier clinical trials. Transformations (e.g., square root, logarithm) should be specified in the protocol and a rationale provided, especially for the primary variable(s). The general principles guiding the use of transformations to ensure that the assumptions underlying the statistical methods are met are to be found in standard texts; conventions for particular variables have been developed in a number of specific clinical areas. The decision on whether and how to transform a variable should be influenced by the preference for a scale that facilitates clinical interpretation.
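
As an example of a transformation that preserves a clinically interpretable scale, a difference in means of log-transformed values back-transforms to a ratio of geometric means. The sketch below uses hypothetical data and a hypothetical function name, and it assumes strictly positive measurements.

```python
import math
import statistics

def geometric_mean_ratio(active, control):
    """Back-transform the difference in mean log values to a ratio of geometric means."""
    log_diff = (statistics.fmean(math.log(a) for a in active)
                - statistics.fmean(math.log(c) for c in control))
    return math.exp(log_diff)

# Hypothetical positive-valued measurements (e.g., a concentration) in each arm.
active = [2.0, 4.0, 8.0]
control = [1.0, 2.0, 4.0]
ratio = geometric_mean_ratio(active, control)
print(f"ratio of geometric means: {ratio:.2f}")
```

Reporting the result as a ratio (here, values in the active arm are about twice those in the control arm) is often easier to interpret clinically than a difference on the log scale.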

Similar considerations apply to other data modifications sometimes used to create a variable for analysis, such as the use of change from baseline, percentage change from baseline, the "area under the curve'' of repeated measures, or the ratio of two different variables. Subsequent clinical interpretation should be carefully considered, and the modification should be justified in the protocol. Closely related points are made in section 2.2.2.

Data transformations may also be justified (or deliberately avoided) based on consideration of an underlying biological model for the data, as well as the data error structure (variance model). Care should be taken to avoid unnecessary data transformations that confound, obscure, or mislead interpretations of trial results that are designed and/or interpreted in the light of known biological or variance models.

5.5 Estimation, Confidence Intervals, and Hypothesis Testing

The statistical section of the protocol should specify the hypotheses that are to be tested and/or the treatment effects that are to be estimated to satisfy the objectives of the trial. The statistical methods to be used to accomplish these tasks should be described for the primary (and preferably the secondary) variables, and the underlying statistical model should be made clear. Estimates of treatment effects should be accompanied by confidence intervals, whenever possible, and the way in which these will be calculated should be identified. The plan should also describe any intentions to use baseline data to improve precision and to adjust estimates for potential baseline differences, for example, by means of analysis of covariance. The reporting of precise p-values (e.g., "P=0.034'') should be envisaged in the plan, rather than exclusive reference to critical values (e.g., "P<0.05''). It is