Validation is much more than a regulatory requirement. The guidelines that the Clinical Laboratory Improvement Amendments (CLIA) promulgated are—by the very nature of regulations—minimum requirements. No effort was made to address what are best practices. Mayo Clinic’s intent is to offer referral laboratory services that meet the most exacting scientific and medical practice expectations. That goal demands the establishment and implementation of development and validation activities that set performance targets beyond minimum requirements, document the achievement of those objectives, and maintain performance over time. The validation process must include the use of properly collected and processed specimens, while ensuring accurate analytic procedures and result interpretation.
This article describes the essence of the concepts and practices applied to validate our tests that are used to evaluate patients, whether those patients are at Mayo Clinic or any of the hospitals and clinics we serve.
Analytic validation serves 2 purposes. First, it ensures that the staff can perform the test within the required performance characteristics (set for clinical practice). Second, it clearly documents that the laboratory is meeting performance goals and regulatory requirements. In this article, we will not discuss in detail the second purpose. In brief, documentation must contain sufficient background and detail such that experimental designs are clear, and the data must be appropriately presented to support conclusions. Laboratories may have a variety of unique forms of documentation to meet their internal needs, resources, and clinical practice expectations.
Assay development, validation, and verification are each distinct processes. During assay development (which includes assay optimization and kit evaluation), an analytical process or idea is defined and optimized into a robust, reproducible procedure that delivers results as intended for its clinical purpose in that laboratory. A standard operating procedure (SOP) is written, the reagents are optimized, and the assay is shown to be reproducible. After successful completion of the assay development phase, and prior to implementation, the next step is a verification/validation study.
Assay verification is the demonstration that a laboratory is capable of replicating a standard assay procedure, at a minimum, within the manufacturer’s performance specifications, for a US Food and Drug Administration (FDA)-cleared test that is used according to the manufacturer’s established procedure. To meet regulatory requirements, the testing laboratory must document the accuracy, precision, and reportable range of the assay on its specific testing system (within the manufacturer’s stated parameters). To make use of the manufacturer’s reference range, the laboratory must verify that the clinical population the laboratory serves falls within that stated range.
Validation, in contrast, is required for tests that are modified from the FDA-cleared version, and for laboratory-developed tests (LDTs), including many of the tests offered by Mayo Medical Laboratories. During verification or validation, the laboratory is not trying to “find out” anything new: the facts of the analytical performance should already be known before entering the validation cycle. The necessary components of assay validation are described within this article.
While the rest of this article will focus on the validation of non-FDA-approved tests, the concepts apply to verification of FDA-approved tests as well.
Evaluation of Method Performance
For a modified test system, a laboratory-developed (in-house) assay, or a test system for which performance specifications are not defined, validation covers the parameters described below.
The regulatory minimum requirements of accuracy, precision, and reportable range (for FDA-approved kits), plus analytical sensitivity and analytical specificity (for modified kits or for LDTs) are described, as well as other parameters pertinent to proper implementation of a test.
Accuracy: Three types of accuracy assessment are most commonly used:
A method comparison study is often presented as an accuracy assessment. This is appropriate only when one knows that the comparator itself is accurate. To use a method comparison study as a measure of accuracy, the clinical data that form the basis for interpretation of results must be traceable to the comparator method. If they are not, the comparison is a poor assessment of accuracy. When doing a method comparison, include samples from patients that span the entire analytical range. Use unweighted regression analysis (eg, Passing-Bablok), and set acceptance criteria such that the bias is within experimental imprecision limits for the test (see Figure).

If an accepted reference calibrator (pure or of known purity, free from potential interferents) is available, the accuracy should be documented based on that information. Recovery of spiked pure material should be consistent across samples to show that matrix differences do not affect the analysis. The spike should not be the calibrator used in the assay, and it should be spiked into the actual sample matrix, not an artificial matrix. Note, however, that recovery is dependent on equivalence of the spiked material with the endogenous analyte in all aspects, including binding to endogenous binders. For a protein, this can be problematic, since the spiking agent rarely has the same tertiary structure or substitution level (ie, glycosylation, phosphorylation, or sulfation) as the native substance. Hence, poor recovery of a spike does not necessarily correlate with poor accuracy.

If a standard reference material (SRM) or certified reference material (CRM) is available, analysis of that sample in the test system provides a further check of accuracy.
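The recovery calculation described above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name, the sample values, and the idea of summarizing consistency as the spread between the highest and lowest recovery are all assumptions made for the example.

```python
def percent_recovery(unspiked, spiked, amount_added):
    """Percent of the added analyte that the assay actually measured."""
    if amount_added <= 0:
        raise ValueError("amount_added must be positive")
    return 100.0 * (spiked - unspiked) / amount_added


# Hypothetical results: (unspiked, spiked) for three patient samples,
# each spiked with 10.0 units of pure material in its own matrix.
samples = {
    "patient_1": (5.0, 14.6),
    "patient_2": (12.0, 21.9),
    "patient_3": (30.0, 39.2),
}

recoveries = {
    name: percent_recovery(base, spiked, 10.0)
    for name, (base, spiked) in samples.items()
}

# Consistency across samples matters more than any single value: a wide
# spread suggests that matrix differences affect the analysis.
spread = max(recoveries.values()) - min(recoveries.values())

for name, rec in recoveries.items():
    print(f"{name}: {rec:.1f}% recovery")
print(f"spread across samples: {spread:.1f} percentage points")
```

Because recovery depends on the spiked material behaving like the endogenous analyte, a low value from a sketch like this is a prompt for investigation rather than proof of inaccuracy.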
Figure. These graphs are derived from a set of data comparing serum cortisol results, in mcg/dL, from 2 different laboratories. The upper graph uses the Passing-Bablok approach, an unweighted regression analysis, while the lower graph uses a linear regression method, which gives more weight to the points at the extremes. The difference in slope and intercept is considerable. In this case, the unweighted regression more accurately reflects the relative performance of the laboratories in the region of interest, below 30 mcg/dL.
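To show why the Passing-Bablok approach is less sensitive to extreme points, the following is a simplified sketch of its point estimates: the slope is a shifted median of all pairwise slopes, and the intercept is the median of the residual offsets. It is an illustration only. It omits the confidence intervals and the linearity test that a full analysis would include, it assumes the two methods are positively correlated, and the function name and data are hypothetical.

```python
from itertools import combinations
from statistics import median


def passing_bablok(x, y):
    """Return (slope, intercept) point estimates for paired results x, y.

    Simplified sketch: assumes positively correlated methods and at least
    two distinct points; no confidence intervals are computed.
    """
    slopes = []
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # identical points carry no slope information
        if dx == 0:
            slopes.append(float("inf") if dy > 0 else float("-inf"))
            continue
        s = dy / dx
        if s != -1:  # slopes of exactly -1 are excluded by the method
            slopes.append(s)

    slopes.sort()
    n = len(slopes)
    k = sum(1 for s in slopes if s < -1)  # offset that shifts the median

    if n % 2:
        slope = slopes[(n - 1) // 2 + k]
    else:
        slope = 0.5 * (slopes[n // 2 - 1 + k] + slopes[n // 2 + k])

    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Hypothetical paired results; the last pair is a gross outlier.
method_a = [1, 2, 3, 4, 5]
method_b = [3, 5, 7, 9, 100]
slope, intercept = passing_bablok(method_a, method_b)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Because every pair of points contributes one slope and the median is taken, a single discordant result at the top of the range shifts the fit far less than it would in a least-squares regression.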
Precision: Between-run estimates of imprecision should include across-operator assessments, so that several persons running the test contribute to the data. Precision estimates from artificial quality control (QC) pools are often better than those from actual patient samples, so authentic samples should be included in the validation (and in routine QC, if possible). In addition, at least 1 QC material that was not supplied by the test manufacturer should always be used.
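One way to summarize such data is to separate within-run from between-run imprecision. The sketch below uses a simple balanced variance-components calculation; it assumes every run has the same number of replicates, and the function name and results are hypothetical. Runs performed by different operators can simply be entered as separate runs so that operator variation contributes to the between-run term.

```python
from math import sqrt
from statistics import mean, variance


def precision_components(runs):
    """Estimate within-run, between-run, and total SD and total CV (%).

    runs: list of runs, each a list of replicate results for one sample.
    Assumes a balanced design (same replicate count in every run).
    """
    n = len(runs[0])
    if len(runs) < 2 or n < 2 or any(len(r) != n for r in runs):
        raise ValueError("need >= 2 runs with the same number (>= 2) of replicates")

    within_var = mean(variance(r) for r in runs)  # pooled within-run variance
    run_means = [mean(r) for r in runs]
    # The variance of run means includes within-run noise divided by n;
    # subtract it, and floor the result at zero.
    between_var = max(0.0, variance(run_means) - within_var / n)

    grand_mean = mean(run_means)
    total_sd = sqrt(within_var + between_var)
    return {
        "within_sd": sqrt(within_var),
        "between_sd": sqrt(between_var),
        "total_sd": total_sd,
        "total_cv_pct": 100.0 * total_sd / grand_mean,
    }


# Hypothetical patient-pool results: 4 runs (2 operators x 2 days), duplicates.
runs = [[10.1, 10.4], [9.8, 10.0], [10.6, 10.3], [10.2, 9.9]]
for name, value in precision_components(runs).items():
    print(f"{name}: {value:.3f}")
```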
Linearity: This evaluation is often misused because it is done without conscious thought as to its purpose. Consider the goal:
How the linearity experiment is performed is dependent upon which of these hypotheses is being tested. Most commonly, the desired conclusion is that one can reliably dilute samples that are above the upper limit of the AMR, so as to extend the reporting range of the test, or dilute samples whose volume is too low to run directly. In either case, one must either use a specimen for dilution that is devoid of the analyte (a tough task unless the analyte is a drug), or show that using the diluent has no matrix effect. Documenting the AMR using linearity is a challenge because samples are usually scarce at the extreme upper end of the curve for many analytes. If samples cannot be obtained to verify linearity, any sample for which the signal is beyond the linear range must be diluted to a region of the curve previously shown to be linear. Subsequently, that high sample can be used to extend the linearity.
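For the dilution question, the arithmetic is a comparison of each dilution-corrected result with the undiluted result. The sketch below assumes a high sample whose undiluted (neat) result still falls within the AMR, so that it can serve as the expected value. The 10% tolerance, the function names, and the data are assumptions for illustration; the acceptance limit should come from the performance targets set for the test.

```python
def dilution_recovery(neat_result, diluted):
    """Return (factor, corrected result, % of neat) for each dilution.

    diluted: list of (dilution_factor, measured_result) pairs.
    """
    rows = []
    for factor, measured in diluted:
        corrected = measured * factor
        rows.append((factor, corrected, 100.0 * corrected / neat_result))
    return rows


def dilutes_linearly(neat_result, diluted, tolerance_pct=10.0):
    """True if every corrected result is within tolerance of the neat result."""
    return all(
        abs(pct - 100.0) <= tolerance_pct
        for _, _, pct in dilution_recovery(neat_result, diluted)
    )


# Hypothetical serial dilution of one patient sample in the chosen diluent.
neat = 480.0
series = [(2, 236.0), (4, 121.0), (8, 58.5)]

for factor, corrected, pct in dilution_recovery(neat, series):
    print(f"1:{factor}  corrected={corrected:.1f}  recovery={pct:.1f}%")
print("acceptable:", dilutes_linearly(neat, series))
```

A result from one specimen does not settle the question for all specimens, so a check like this is best repeated on several patient samples.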
Stability: Analyte stability is typically determined during the development phase. It is not really part of assay validation since, with rare exceptions, stability is not a function of the method of analysis. Stability assessment requires evaluation of various storage conditions (temperature and time). The conditions evaluated should reflect the reality of one’s own business. For example, if it is impossible for one’s clients to freeze and store a specimen at -70 °C, it is not necessary to evaluate specimens prepared in that manner. But conditions that reflect usual specimen transit times and temperatures should be evaluated. The real purpose of stability assessment is to demonstrate equivalence of the test result with the level of the analyte in the physiologic matrix as it is in the body. Therefore, one must be careful about assuming that no analyte changes occur while the sample is being prepared for the study. For example, the use of anticoagulant in a plasma sample allows immediate specimen processing, while serum is prepared from a blood sample left at room temperature for 30 to 60 minutes to allow the blood to clot. Comparing the two would allow one to determine whether there is a time-dependent change in the analyte.
Specimen Type Comparisons: Each specimen type that will be accepted for a specific assay must be validated. At Mayo Medical Laboratories, we commonly evaluate multiple specimen types to allow for add-on tests and common multiple analyte co-orders. The comparison between serum and plasma should be performed, since both are in common use and, unfortunately, are considered interchangeable by many nonlaboratory health care staff. Even if one finds that plasma, for example, is not acceptable, it is useful to know the qualitative relationship of plasma to serum. If the analyte has a reservoir in a circulating cell (eg, myeloperoxidase in neutrophils), then the level found in serum will be dramatically different from that found in EDTA or other plasma specimens, due to the “squeezing” of the cells during clot formation. So, decide which level to measure, total or circulating levels, and validate the appropriate matrix.
Analytical Specificity: There are 2 types of specificity issues—one for endogenous look-alikes and the other for exogenous substances, usually drugs. Assays should be tested for the drugs commonly used in the patient population to be tested, and the drugs should be tested at the maximum expected concentration. The look-alikes require testing with pure materials spiked into the test matrix, a task made more complex by the frequently encountered problem of poor solubility or lack of equilibrium. In any event, this interference data is determined during development and need not be a component of the validation phase, unless interferents are removed during the assay by operator-dependent techniques such as an extraction step. Immunoassays pose a unique challenge. For example, one may observe a dilution curve for a single specimen that is nonlinear, even though the experimental data demonstrates that the assay itself is linear. This indicates that there is an interferent with differing cross-reactivity to the capture antibody from the calibrator, but it doesn’t identify the interferent. When analyzing for the presence or concentration of an antibody towards an endogenous substance (autoantibody), nonlinearity is often observed. A common cause for this is the natural dissociative process that occurs upon dilution of a solution of a ligand and its binding agent, complicated by the polyclonality most commonly seen in such autoantibodies. Lower affinity antibodies lose their antigens at different rates from higher affinity antibodies. Therefore, expect nonlinearity, document it, and be grateful if it is not seen. But keep in mind that not all patient specimens are created equal. One may dilute linearly while another may not.
Analytical Sensitivity: If the lower or upper limit of the analytical measurement range is a critical parameter for results interpretation, then there should be a QC pool at that level and it must be a component of the validation. There should also be a QC pool at any level at which a critical clinical decision is to be made. The lower limit of detection (LLOD) is defined as the lowest detectable non-zero signal (this may also be called the limit of blank, or LOB). The lower limit of quantification (LLOQ) is defined as the lowest concentration that can reliably be reported. One comment on LLOD: laboratories are required to document the LLOD, seen as a certain degree of separation between the blank signal and the lowest detectable non-zero signal, but it doesn’t mean a great deal as long as the LLOD is significantly less than the LLOQ. If the LLOD approximates the LLOQ, it must be part of the validation and redesign of the assay should be seriously considered to increase the separation.
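The relationship between the two limits can be illustrated with a short sketch. Two assumptions are made here that are not stated in the text above: the blank-based threshold is taken as the mean of blank replicates plus a multiple of their SD (1.645 is one commonly used multiplier), and "reliably reported" is taken to mean a replicate CV at or below 20%. Both values, along with the function names and data, are illustrative placeholders that each laboratory would replace with its own criteria.

```python
from statistics import mean, stdev


def blank_threshold(blank_results, k=1.645):
    """Blank-based detection threshold: mean of blanks + k * SD (assumed form)."""
    return mean(blank_results) + k * stdev(blank_results)


def cv_percent(values):
    """Coefficient of variation of replicate results, in percent."""
    return 100.0 * stdev(values) / mean(values)


def estimate_lloq(replicates_by_level, max_cv=20.0):
    """Lowest tested level at which it and every higher level meet max_cv.

    replicates_by_level: dict mapping nominal concentration -> replicates.
    Returns None if even the highest level fails.
    """
    lloq = None
    for level in sorted(replicates_by_level, reverse=True):
        if cv_percent(replicates_by_level[level]) > max_cv:
            break
        lloq = level
    return lloq


# Hypothetical replicate data.
blanks = [0.02, 0.05, 0.03, 0.04, 0.01]
low_pools = {
    0.5: [0.2, 0.8, 0.5],
    1.0: [0.9, 1.1, 1.0],
    2.0: [1.9, 2.1, 2.0],
}

threshold = blank_threshold(blanks)
lloq = estimate_lloq(low_pools)
print(f"blank-based threshold: {threshold:.3f}")
print(f"estimated LLOQ: {lloq}")
```

If the two numbers printed by a calculation like this are close together, that is the situation in which the text above recommends reconsidering the assay design.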
Reference Ranges: Establishing a reference range is not the same as verifying a reference range. For an FDA-approved kit, if the manufacturer has quoted a range, the laboratory must simply verify that its population gives results similar to those obtained by the manufacturer in order to adopt that range. Check with a statistician (or reference publications) regarding the number of samples required (1,4). This will depend on the characteristics and completeness of the data supplied by the manufacturer.

Evaluate not only the range itself, but also the distribution of the population data. The distribution can be either Gaussian (the data are distributed symmetrically around the mean) or non-Gaussian (ie, skewed). When the data are distributed in a Gaussian manner, the reference range is calculated as the mean ± 2 standard deviations, which encompasses approximately 95% of the observations in healthy individuals. When determining a reference range from non-Gaussian data, the data can be mathematically transformed, eg, to logarithms, to yield a Gaussian distribution. The mean ± 2 standard deviations of the log-transformed data, converted back to the original units, can then be used for reference range determination (the back-transformed mean is the geometric mean). However, in most instances, percentiles are used to create reference ranges from non-Gaussian data: the 2.5th and 97.5th percentiles are used as the lower and upper limits of the reference range. Calculating the mean of a skewed distribution and then taking the mean ± 2 standard deviations will not give a proper assessment of the range.

Most importantly, consider the abnormal range and how that population may overlap with the normal population. The laboratory is responsible for defining the ranges such that clinical decisions are optimal. The laboratorian must, in concert with clinical colleagues, determine how to select the ranges to optimize clinical sensitivity or specificity, or both, so that actions taken by the clinicians are appropriate.
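The three calculations described above can be compared side by side. The sketch below uses only the Python standard library; the function names and the data are hypothetical, and the percentile estimate relies on the default interpolation of `statistics.quantiles`, which is one of several accepted ways to estimate percentiles. It says nothing about how many reference individuals are needed.

```python
from math import exp, log
from statistics import mean, quantiles, stdev


def gaussian_range(values):
    """Mean +/- 2 SD; appropriate only for symmetric, Gaussian data."""
    m, s = mean(values), stdev(values)
    return m - 2 * s, m + 2 * s


def log_gaussian_range(values):
    """Mean +/- 2 SD on the log scale, converted back to original units."""
    low, high = gaussian_range([log(v) for v in values])
    return exp(low), exp(high)


def percentile_range(values):
    """Nonparametric 2.5th and 97.5th percentiles."""
    cuts = quantiles(values, n=40)  # 39 cut points, spaced 2.5% apart
    return cuts[0], cuts[-1]


# Hypothetical right-skewed results from healthy individuals.
results = [
    0.1, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6,
    1.8, 2.0, 2.2, 2.5, 2.9, 3.4, 4.0, 4.8, 6.0, 9.5,
]

for label, fn in [
    ("mean +/- 2 SD", gaussian_range),
    ("log-transformed", log_gaussian_range),
    ("percentiles", percentile_range),
]:
    low, high = fn(results)
    print(f"{label:>16}: {low:.2f} to {high:.2f}")
```

With skewed data such as these, the untransformed mean ± 2 SD produces a negative lower limit, which is impossible for a concentration; this is the problem the text above warns about.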
It is important to remember that action limits are not equivalent to reference ranges or normal limits. For some analytes, values within the normal range (eg, high normal) can be very clinically relevant, whereas for other analytes, values may not be clinically relevant unless increased several-fold above the upper limit of the reference range. One last point: the reference range need not be the range found in a normal population. For example, for hemoglobin A1c, the key range in most laboratories is the range used as the treatment target, such as <7.0%, not the range found in nondiabetics, or 4.0% to 6.0%.
The performing laboratory is responsible for ensuring that sufficient information is available to direct personnel on how to properly collect, prepare, handle, and ship the specimen to the performing laboratory, wherever it is. However, this does not mean that the laboratory must (or is even able to) determine all sources of preanalytical variation and their consequences. For example, the performing laboratory need not evaluate all specimen types for a specific test. Nor must the laboratory evaluate all possible transportation or storage temperatures. At a minimum, the laboratory must be able to state 1 acceptable specimen type and 1 transport temperature at which validation has shown the analyte to be stable.
Optimally, and over time, data are gathered on alternate specimen types, different storage and transit temperatures, preservatives tried (proven or disproven), lag times between phlebotomy and separation, and a myriad of other aspects of the handling between phlebotomy and analysis. Because Mayo Medical Laboratories receives specimens from across the world, we routinely determine acceptable alternate specimen types and stability at a minimum of 3 temperatures (ambient, refrigerated, and frozen at -20 °C), as well as through 3 freeze/thaw cycles. The goal is to identify conditions under which the specimen can be collected and transported to Mayo Medical Laboratories within 7 days and remain stable.
Lewis Carroll said “If you don’t know where you are going, any road will get you there.” This is most appropriate to this topic. Each component of the performance should have a target set for it prior to beginning validation or even prior to beginning development. The targets should be set based on the clinical use of the test. Include additional important issues in the criteria parameters that are relevant to the use of the test, but not covered here, such as cost, turnaround time, and freedom from patent restrictions.
The dependence upon a laboratory information system (LIS) is universal, and across the United States, very few in-house–developed computer systems are still in use. All manufacturers must provide evidence of acceptable validation of the information technology (IT) systems they sell, assuring that the output is reliable, if the input is. In brief, the laboratory is responsible for ensuring that:
Annual reviews are mandatory and must include a performance review of the test. QC and proficiency performance are a minimum, but the laboratory should also review the validation documentation to identify any “fragile” components of the test, address them individually, and assess all aspects of performance for “slippage.” Commercial samples for proficiency testing are not available for all tests and when an in-house program is used, diligence is required to ensure that the program being used adequately reflects an objective assessment of test performance without selection bias and without conscious, or subconscious, testing bias. This process should also consider the clinical questions and the literature that have arisen since the last review to see if the acceptance criteria established previously are still valid.
Again, an easy answer: yes, unequivocally. Personalized medicine begins when the diagnosis is made. Laboratories will be used to characterize disease in the individual, select patient-specific pharmacologic therapy and dose, monitor therapeutic effects, and detect toxicity. Interpretation of the laboratory data postdiagnosis will require knowledge of biological variation, physiologically active forms of the analyte, and the biochemical responses to an insult such as administration of a drug. These areas are expanding our classical view of analytical and clinical validation, and represent whole new vistas for research.
The experiments performed for validation of a non-FDA-approved test must be designed based on good science and the practice of medicine, and not solely guided by minimum regulatory expectations. Like any scientific experiment, have the question to be answered clearly in mind when designing the experiment, particularly for complex tasks and confusing data such as can be obtained from a linearity experiment. The validation process is end-to-end, from phlebotomy to result interpretation. As laboratorians, we are responsible for understanding, evaluating, and documenting all parts of the validation process, thus assuring that reliable results are reported.
Authored by Lawrence K. Oliver, PhD