Archives of Physical Medicine and Rehabilitation
Volume 90, Issue 8 , Pages 1340-1348, August 2009

Clinical Interpretation of Computerized Adaptive Test–Generated Outcome Measures in Patients With Knee Impairments

  • Ying-Chih Wang, OTR, PhD
    • Focus On Therapeutic Outcomes Inc, Knoxville, TN
    • Sensory Motor Performance Program, Rehabilitation Institute of Chicago, Chicago, IL
    • Corresponding author. Reprint requests to Ying-Chih Wang, OTR, PhD, Rehabilitation Institute of Chicago, Sensory Motor Performance Program, 345 E Superior, Ste 1312, Chicago, IL, 60611
  • Dennis L. Hart, PT, PhD
    • Focus On Therapeutic Outcomes Inc, Knoxville, TN
  • Paul W. Stratford, PT, MS
    • School of Rehabilitation Science and Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
  • Jerome E. Mioduski, MS
    • Focus On Therapeutic Outcomes Inc, Knoxville, TN

Abstract 

Wang Y-C, Hart DL, Stratford PW, Mioduski JE. Clinical interpretation of computerized adaptive test–generated outcome measures in patients with knee impairments.

Objective

To describe meaningful interpretations of functional status (FS) outcomes measures estimated using a body-part specific computerized adaptive test (CAT).

Design

A prospective observational cohort study.

Setting

Outpatient physical therapy clinics (291 clinics) in 30 U.S. states.

Participants

Sample of 21,896 patients with knee impairments receiving outpatient physical therapy.

Interventions

Not applicable.

Main Outcome Measure

FS estimated using CAT administration.

Results

We investigated 4 approaches to clinically meaningful interpretations of outcomes data: (1) 95% confidence interval for each score estimate, (2) percentile rank of FS scores, (3) responsiveness, and (4) functional staging. Overall, precision of a single score was estimated by FS score ±5. Based on score distribution, percentile ranks at 25th, 50th, and 75th percentiles corresponded to intake FS scores of 33, 42, and 51 and discharge FS scores of 51, 61, and 74, respectively. Results showed that 9 or higher FS change units represented statistically and clinically important improvement. Patients were classified into 6 hierarchical levels of FS using functional staging.

Conclusions

Results suggest how CAT-generated outcomes measures can be interpreted to assist clinicians and patients during rehabilitation.

Key Word: Rehabilitation

List of Abbreviations: CAT, computerized adaptive test; CI, confidence interval; FS, functional status; GROC, Global Rating of Change; IRT, item response theory; LEFS, Lower Extremity Functional Scale; MCII, minimal clinically important improvement; MDC90, 90% confidence interval of the minimal detectable change; MDC95, 95% confidence interval of the minimal detectable change; PRO, patient-reported outcome; RCI, reliable change index; ROC, receiver operating characteristic; RSM, Rating Scale Model

 

Measurement of patient-reported outcomes (PROs) in health care continues to evolve, with health status questionnaires increasingly used in medical research and clinical practice.1, 2, 3, 4, 5 PROs provide a self-report of patients' perceived health status, which allows clinicians to demonstrate effectiveness of intervention and monitor patient treatment.4, 5 The Institute of Medicine6 and National Cancer Institute have recommended patient-centered measurement techniques.7 During the past decade, the number of initiatives designed to promote quality or pay for performance has increased,8 one of which has been simulated in outpatient rehabilitation using PROs.9 These initiatives make clinical interpretation of PRO scores important.5 Many outcomes measures have been tested for their reliability (internal consistency, test-retest, intrarater and interrater reliability) and validity (content, construct, predictive validity), but relatively few PRO measures have been shown to present clinically meaningful information, which impedes clinician use during patient treatment and erodes face validity in payment initiatives.10

The current study builds on previous work in which we developed, simulated, and applied a body part–specific CAT11, 12 using retrospective data analyses. The primary difference between traditional fixed-length questionnaires and CAT is that the adaptive tests tailor PRO item administration, so each patient responds to selected items dynamically according to their ability estimates, which minimizes the number of items administered while measure precision is maintained. Here, we examine clinically meaningful interpretations of FS measures estimated using the condition-specific CAT for patients with knee impairments seeking rehabilitation in outpatient therapy clinics11, 12 participating with Focus On Therapeutic Outcomes, Inc, an international medical rehabilitation outcomes database management company.13, 14 FS, as estimated using the knee CAT,11 is operationally defined as the patient's perception of ability to perform functional tasks. The item bank was developed using items from the LEFS,15 a scale with strong psychometrics15, 16, 17, 18 and broad clinical and research acceptance as well as clinical meaning.19 FS, as assessed using the knee CAT, represents the activity dimension of the World Health Organization's International Classification of Functioning, Disability, and Health.20

The purpose of this study was to assess clinical meaning of change in FS measures estimated using the knee CAT. We assessed clinical interpretation of the FS measures by (1) constructing a 95% CI for each score point estimate, (2) establishing percentile ranks of FS scores, (3) using 2 threshold approaches to defining individual patient-level change (ie, statistically reliable change and clinically important change), and (4) using a functional staging approach. The first 3 methods provided statistical indices, and the fourth method provided graphical presentation to guide interpretation of the patient's improvement in functional status. These methods (ie, SE, percentile, responsiveness indices, functional staging) have been suggested5, 21, 22 to derive more meaningful outcome measures in rehabilitation settings, especially using a CAT administration.5

Methods 

Data Collection 

Data were collected with Patient Inquiry computer software,a,13, 14 which was used for routine data collection during patient management in Focus On Therapeutic Outcomes, Inc, participating clinics. Patients seeking rehabilitation entered demographic data using a computer in the clinic prior to initial evaluation. Data for this study were selected from the CAT database if the patient had knee impairment and the patient completed the knee CAT. When the patient completed the CAT prior to initial evaluation, those data were labeled “intake.” When the patient completed the CAT at the end of rehabilitation, the data were labeled “discharge.” FS score change was determined by subtracting intake from discharge scores (FS score change=discharge FS–intake FS). Clinical staff entered demographic data at intake and discharge. This project was approved by the Institutional Review Board for the Protection of Human Subjects from Focus On Therapeutic Outcomes, Inc.

Knee Computerized Adaptive Testing 

Development,11 simulation,11 and use12 of the knee CAT have been described. Briefly, the adaptive test started by administering the most informative23 item at median level difficulty (ie, walking 2 blocks). Patient FS ability estimates with associated SEs were calculated using maximum likelihood estimation employing a Newton-Raphson estimation technique.24 There were 2 stopping rules: (1) SE for the provisional ability was less than 4 out of 100 FS units (ability estimates were scaled 0–100 with higher measures representing higher functioning), and (2) each change in provisional ability estimates for the last 3 administered items was less than 1 out of 100. If a stopping rule was not met, the computer selected the most informative item given the current FS estimate.
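The two stopping rules can be sketched as follows. This is a minimal illustration, not the authors' software: the function and constant names are hypothetical, the test is assumed to stop as soon as either rule is met, and rule 2 is read as three consecutive changes in the provisional estimate (which requires four estimates).

```python
from typing import Sequence

SE_THRESHOLD = 4.0      # rule 1: SE below 4 FS units on the 0-100 scale
CHANGE_THRESHOLD = 1.0  # rule 2: each recent change below 1 FS unit
RECENT_ITEMS = 3        # rule 2 considers the last 3 administered items


def should_stop(estimates: Sequence[float], current_se: float) -> bool:
    """Return True if either stopping rule is met (assumed interpretation).

    `estimates` holds the provisional FS estimate after each administered
    item, oldest first; `current_se` is the SE of the latest estimate.
    """
    if current_se < SE_THRESHOLD:
        return True
    # Three changes require four consecutive provisional estimates.
    if len(estimates) >= RECENT_ITEMS + 1:
        recent = estimates[-(RECENT_ITEMS + 1):]
        changes = [abs(b - a) for a, b in zip(recent, recent[1:])]
        return all(change < CHANGE_THRESHOLD for change in changes)
    return False
```

If neither rule is met, the adaptive test would go on to select the next most informative item; that selection step is not shown here.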

Knee CAT items were presented by asking the patient, “Today, do you or would you have any difficulty at all with,” followed by activities such as “walking 2 blocks” or “performing light activities around your home.” Five response categories were used: (1) extreme difficulty, (2) quite a bit of difficulty, (3) moderate difficulty, (4) a little bit of difficulty, and (5) no difficulty. In addition, the patient could elect “not applicable” for any item, which was recorded as missing data and not used in FS estimation.

Results from our previous study11 indicated that the knee CAT met essential IRT assumptions of unidimensionality and local independence.23 Test information functions and SEs supported FS measure precision. The knee CAT used on average 7 items to produce precise estimates of FS that adequately covered the content range with negligible floor and ceiling effects and discriminated well among patients in known clinical groups. Patients who were older, had more chronic symptoms, had a history of surgery, had more comorbidities, and did not exercise prior to receiving rehabilitation reported lower discharge FS than other patients after controlling for intake FS.12

Approaches to Deriving Meaningful Measurements 

Interpreting a single scale score 

CAT-estimated FS scores represented point estimates of lower-extremity FS. A more informative estimate of the patients' ability levels can be described by constructing the 95% CI associated with the point estimate FS score (ie, point estimate±1.96·SE). The width of the CI band represents an estimate of the precision of the measure. If a patient obtained a discharge FS score overlapping the 95% CI range of the intake FS score, the score change may not represent improvement but measurement error.
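As a small sketch of this check, the interval and the overlap test can be written as below. The function names are hypothetical, and the example SE of 2.5 is only an illustrative value.

```python
from typing import Tuple


def ci95(score: float, se: float) -> Tuple[float, float]:
    """95% CI around a point estimate: score +/- 1.96 * SE."""
    half_width = 1.96 * se
    return score - half_width, score + half_width


def within_intake_ci(intake: float, intake_se: float, discharge: float) -> bool:
    """True if the discharge score falls inside the intake score's 95% CI,
    in which case the change may reflect measurement error."""
    low, high = ci95(intake, intake_se)
    return low <= discharge <= high


low, high = ci95(32.0, 2.5)  # roughly 27.1 to 36.9
```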

Establishing the percentile rank of functional status score 

Percentile ranks of point estimate FS scores were calculated. The percentile rank of a raw score is commonly interpreted as the percentage of examinees in the norm group who scored below the score of interest. The knee CAT was designed to be administered to patients with knee impairments that result in reduced lower extremity function. Therefore, the norm for our percentile comparisons is similar people with knee impairments. Although a more general norm of people with no functional deficit secondary to a knee impairment may seem appropriate, we felt a more general norm would probably respond “no difficulty” to LEFS items or obtain nearly maximum scores on the knee CAT (ie, FS scores are close to 100). To accommodate differences in functional status scores at intake compared with discharge, we generated 2 percentile ranks: a percentile rank for patients at intake, and 1 at discharge. The ranks refer to the proportion of scores in a distribution that a specific score is greater than or equal to and can be used to compare an individual score with similar patients at 2 times during treatment. Percentile ranks provide additional information regarding the relative location of the patient's condition along the functional continuum related to a similar group of patients.
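The two definitions mentioned above (scores strictly below versus scores at or below the score of interest) differ only in how ties are counted. The sketch below shows both; the function name and the norm sample are hypothetical and are not study data.

```python
from typing import Sequence


def percentile_rank(score: float, norm_scores: Sequence[float],
                    strict: bool = False) -> float:
    """Percentage of norm-group scores at or below `score`.

    With strict=True, only scores strictly below `score` are counted.
    """
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    if strict:
        count = sum(1 for s in norm_scores if s < score)
    else:
        count = sum(1 for s in norm_scores if s <= score)
    return 100.0 * count / len(norm_scores)


# Hypothetical norm group of 10 intake scores (illustration only).
norm = [20, 30, 30, 40, 50, 60, 70, 80, 90, 100]
```

Separate norm groups would be passed for intake and discharge to obtain the two ranks described in the text.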

Using 2 threshold approaches to defining individual patient-level change 

We used 2 threshold approaches to defining individual patient-level change: statistically reliable change and clinically important change.21

Statistically reliable change reflects the statistical significance of individual change and is commonly estimated using the standard error of measurement. The standard error of measurement (SEM) is calculated as

$$\mathrm{SEM} = s_x \sqrt{1 - r_{xx}}$$

where $s_x$ is the SD of score at intake and $r_{xx}$ is the reliability coefficient. Here we used the RCI,22, 25 a responsiveness index based on the standard error of measurement concept, to assess statistically reliable change. The RCI is a test of a longitudinal change between intake and discharge with the following mathematical formula:

$$\mathrm{RCI} = \frac{X_2 - X_1}{\sqrt{2}\,\mathrm{SEM}}$$

where $X_2$ is the score at discharge and $X_1$ is the score at intake. If the RCI is 1.96 or larger, change is considered statistically reliable or significant (at P<.05).22 As computed, the minimal score change required for a statistically reliable change (ie, RCI) is equivalent to the upper limit of MDC95, which represents the smallest threshold for identifying change greater than measurement error.26

Clinically important change27, 28, 29 is another threshold approach used to assess individual change. Patients were expected to improve their FS measures by different amounts, so as recommended,30 we employed an anchor-based longitudinal method31, 32 using GROC to assess responsiveness.28 Patient self-reported GROC data were collected after CAT administration at discharge. If clinicians elected at intake to have their patients answer questions regarding their improvement perceived at discharge, patients were asked how much better or worse their lower extremity functional status was on a 15-point Likert scale (−7 to 7) at discharge.28 We calculated a nonparametric ROC33 curve that compared the patient's self-report of GROC to the FS change (ie, FS score change) to estimate the MCII12 in FS measures. We dichotomized patients by their GROC scores as patients who did not improve (ie, GROC scores<3) versus patients who improved (ie, GROC scores≥3).12, 34 The ROC analysis generates a series of sensitivity and 1 minus specificity values across the FS score range, and the optimal MCII discrimination threshold could be identified using the largest average of sensitivity and specificity. The MCII was estimated (1) using all patients regardless of intake FS measure, and (2) using patients grouped by quartile of baseline FS scores.12 As defined, MCII represents the minimal improvement score in FS that patients perceived as beneficial.26 This specific result had been presented in the previous study.12 Here, we used the result to assist in clinical interpretation of the FS derived from the knee CAT.
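The threshold search described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' analysis: the function name is hypothetical, a patient is predicted "improved" when FS change is at or above the candidate cut, and the data are invented for the example.

```python
from typing import Sequence, Tuple

IMPROVED_GROC = 3  # GROC >= 3 is treated as improved, as in the text


def mcii_threshold(fs_changes: Sequence[float],
                   groc: Sequence[int]) -> Tuple[float, float, float]:
    """Return (cut, sensitivity, specificity) with the largest average of
    sensitivity and specificity; ties keep the lowest cut."""
    improved = [g >= IMPROVED_GROC for g in groc]
    n_pos = sum(improved)
    n_neg = len(improved) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both improved and non-improved patients")
    best = None
    for cut in sorted(set(fs_changes)):
        # True positives: improved patients with change at or above the cut.
        tp = sum(1 for c, imp in zip(fs_changes, improved) if imp and c >= cut)
        # True negatives: non-improved patients with change below the cut.
        tn = sum(1 for c, imp in zip(fs_changes, improved) if not imp and c < cut)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or (sens + spec) / 2 > (best[1] + best[2]) / 2:
            best = (cut, sens, spec)
    return best


# Invented example data (not from the study).
changes = [2, 4, 5, 8, 10, 12, 15, 20]
ratings = [0, 1, 2, 1, 4, 5, 3, 6]
result = mcii_threshold(changes, ratings)
```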

Using functional staging approach 

Functional staging presents a graphical image of the clinically meaningful interpretations of outcomes within the context of questions asked. Functional staging has been used to describe the structure of outcomes measures,35 influence practice,36 assist therapists formulating short-term and long-term goals,37 and classify patients into different functional status categories.5, 38, 39, 40

To classify lower-extremity function, we employed the classification by Perry et al41 of walking as the framework to construct the staging levels. As defined,41 lower-extremity function can be classified into several functional levels: physiologic ambulator, limited household ambulator, independent household ambulator, limited community ambulator, and independent community ambulator. If a person can walk for exercise only, the person is classified as a physiologic ambulator (stage 1). A limited household ambulator is a person who is able to walk between rooms (stage 2). An independent household ambulator refers to a person who can walk continuously for distances that are considered reasonable for inside the home but is limited by endurance, strength, or safety when walking in the community (stage 3). A person who can walk outside the home is referred to as a community ambulator. A limited community ambulator can walk regularly in the home and occasionally in the community (stage 4). An independent community ambulator can walk for distances of at least 400m (.25mi) independently in the community without safety concerns (stage 5). To distinguish further the functional performance of patients at the high end of functioning, we added active community ambulator for patients who not only can walk a mile with no difficulty but also can run on even ground with little difficulty (stage 6). A higher functional stage indicates that a patient can move more independently in the patient's own surroundings, including home and community.

After characterizing the operational definition of each functional stage, we determined the cut-scores between functional stages along the 0 to 100 FS continuous scale using the following procedures. First, we analyzed the original knee CAT item bank11 using IRT23 employing the Andrich42 RSM. Within the RSM, each item is characterized by its item characteristic curve, which illustrates the probability of endorsing an item's response at a given level of ability. Using the RSM, we estimated the threshold probabilities between each pair of contiguous item responses across the continuum of ability. The threshold probabilities represent the point on the ability continuum where the probability of endorsing one of the pair of contiguous responses is .50. As defined, the threshold probabilities provide the data necessary to develop the cut-scores between different functional stages.
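To make the threshold idea concrete, the sketch below computes Rating Scale Model category probabilities in the standard Andrich parameterization and checks that, at a threshold, two contiguous responses are equally likely (so each has probability .50 within the pair). The parameter values are hypothetical and are expressed in logits; the linear rescaling to the 0 to 100 FS metric is not given in the text and is not modeled here.

```python
import math
from typing import List, Sequence


def rsm_probabilities(theta: float, item_difficulty: float,
                      taus: Sequence[float]) -> List[float]:
    """Category probabilities for categories 0..m under the Rating Scale Model.

    The log-weight of category k is the sum over j <= k of
    (theta - item_difficulty - tau_j); category 0 has log-weight 0.
    """
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - item_difficulty - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds(item_difficulty: float, taus: Sequence[float]) -> List[float]:
    """Ability levels at which contiguous categories are equally likely."""
    return [item_difficulty + tau for tau in taus]


# Hypothetical item with 5 response categories (4 thresholds), in logits.
delta = 0.3
taus = [-1.5, -0.5, 0.5, 1.5]
cuts = thresholds(delta, taus)
probs_at_second_cut = rsm_probabilities(cuts[1], delta, taus)
```

At `cuts[1]`, the second and third response categories have equal probability, which is the property used to place a cut-score between two stages.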

We then determined 5 cut-scores for our 6 stages of functional staging by matching the operational definition of each staging to the threshold probabilities of select items. Because our functional staging was determined based on the framework of the classification of walking ability, we selected 3 specific items as our primary interest: walking between rooms, walking a mile, and running on even ground.

Based on the conceptual framework,41 the cut-scores between a physiologic and a limited household ambulator, a limited household ambulator and an independent household ambulator, and an independent household ambulator and a limited community ambulator, were determined by finding the thresholds between the 1 and 2, 2 and 3, and 3 and 4 responses for the item “walking between rooms.” The cut-score between a limited community ambulator and an independent community ambulator was determined by finding the threshold between the 2 and 3 responses for the item “walking a mile.” Finally, the cut-score between an independent community ambulator and an active community ambulator was determined by finding the threshold between the 3 and 4 responses for the item “running on even ground.” We made the classifications a priori before the IRT analysis.

Results 

Subjects 

A sample of 21,896 patients with knee impairments receiving outpatient physical therapy in 291 clinics in 30 U.S. states (2005–2007) was analyzed. Table 1 presents the patient characteristics at rehabilitation intake.

Table 1. Patient Characteristics at Rehabilitation Intake (N=21,896)
Characteristic | Mean ± SD | (Min, Max) | Value
Age (y) | 50±17 | (18, 95) |
Sex (female %) | | | 55
FS at intake | 44±15 | (0, 100) |
FS at discharge | 61±18 | (0, 100) |
Diagnoses (%) | | |
  Soft tissue disorders of muscle, synovium, tendon, bursa, or enthesopathies (ICD-9-CM codes 725–729) | | | 24
  Postsurgical conditions (CPT codes 20150–29999, including repair, torn collateral ligament) | | | 21
  Arthropathies (ICD-9-CM codes 710–716, including osteoarthroses, rheumatoid arthritis) | | | 3
  Disorders of the bone and cartilage (ICD-9-CM codes 730–739, including chondromalacia) | | | 3
  Sprains and strains (ICD-9-CM codes 840–848, including sprain of medial collateral ligament, unspecified sprain or strain) | | | 2
  Fractures (ICD-9-CM codes 800–829, including patellar fractures) | | | 1
  Other (not otherwise classified) | | | 1
  Missing | | | 45

Abbreviations: CPT, current procedural terminology; ICD-9-CM, International Classification of Diseases–9th Revision–Clinical Modifications; Max, maximum; Min, minimum.

Diagnoses are groups of ICD-9-CM codes or surgical CPT codes.

Enthesopathies are disorders of peripheral ligamentous or muscular attachments.

Interpreting a Single Scale Score 

Figure 1 presents the estimated SEs and percentage of patients associated with each point estimate score at intake, which were similar to discharge results. Under IRT models, SEs are dependent on level of FS, and extreme FS scores are likely to have larger SEs, demonstrating that less information is available to estimate item/person parameters with precision. The average SE for all levels of FS was 3.1, but the average SE for the 92% of patients with FS intake scores between 20 and 70 was 2.5. The full width of the CI is twice 1.96·SE, that is, 1.96·SE on either side of the point estimate. For all levels of FS, the 95% CI was FS point estimate ±6, and for patients with FS scores between 20 and 70, the 95% CI was FS point estimate ±5.

Fig 1. SE and percent of patients associated with each point estimate score at intake. Abbreviations: FS score, FS score estimated by the knee CAT; Percent, percentage of patients measured; SE, SE of the estimate associated with each point estimate.

Establishing the Percentile Rank of Functional Status Score 

The mean ± SD FS scores at intake and discharge were 44±15 and 61±18, respectively (N=21,896). Based on score distribution at intake and discharge, the percentile ranks at the 25th, 50th, and 75th percentiles corresponded to the FS scores of 33, 42, and 51, and 51, 61, and 74, at intake and discharge, respectively (table 2).

Table 2. Knee CAT Percentile Ranks Based on Intake and Discharge FS Scores (0–100 Scale)
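Score | PRi | PRd | Score | PRi | PRd | Score | PRi | PRd
≤20 | 4 | 0 | 46 | 61 | 23 | 66 | 93 | 65
25 | 8 | 1 | 48 | 66 | 26 | 68 | 94 | 68
30 | 18 | 3 | 50 | 70 | 29 | 70 | 96 | 71
32 | 22 | 4 | 52 | 75 | 34 | 72 | 97 | 75
34 | 28 | 6 | 54 | 78 | 38 | 74 | 97 | 78
36 | 33 | 8 | 56 | 82 | 44 | 77 | 98 | 81
38 | 37 | 9 | 58 | 85 | 49 | 85 | 99 | 88
40 | 41 | 11 | 60 | 87 | 53 | 98 | 100 | 95
42 | 48 | 15 | 62 | 89 | 57 | ≥99 | 100 | 100
44 | 52 | 17 | 64 | 91 | 61 | | |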

NOTE. Values are percentages.

Abbreviations: Score, FS score estimated by the knee CAT at either intake or discharge; PRi, percentile rank at intake; PRd, percentile rank at discharge.

Using Threshold Approaches to Define Individual Patient-Level Change 

Statistically reliable change 

Derived from our previous full-length LEFS dataset of 949 patients with knee impairments, the internal consistency reliability coefficient was 0.96.11 With the SD at intake of 14.96, the standard error of measurement of the scale was 2.99 FS score units, and the minimal score change (discharge FS–intake FS) for RCI to be 1.96 or larger was 8.29 FS score units. Consequently, RCI analyses showed that 9 or more FS change units represented statistically reliable change.
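The arithmetic behind these values can be reproduced with a short sketch. It assumes the standard forms of the two quantities (standard error of measurement = SD × √(1 − reliability); RCI = change ÷ (√2 × standard error of measurement)), which agree with the figures reported above; the function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def rci(intake: float, discharge: float, sem_value: float) -> float:
    """Reliable change index: change divided by sqrt(2) * SEM."""
    return (discharge - intake) / (math.sqrt(2.0) * sem_value)


def minimal_reliable_change(sem_value: float, z: float = 1.96) -> float:
    """Smallest score change for which RCI reaches z."""
    return z * math.sqrt(2.0) * sem_value


sem_value = standard_error_of_measurement(14.96, 0.96)  # about 2.99
threshold = minimal_reliable_change(sem_value)          # about 8.29
whole_units_needed = math.ceil(threshold)               # 9 FS change units
```

Rounding the threshold up to whole FS units gives the 9-unit criterion stated in the text.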

Clinically important change 

There were 3270 patients with both GROC and FS change data: 428 (13%) reported no change (ie, GROC scores<3), and 2842 (87%) reported improvement (ie, GROC scores≥3). Results from ROC analyses showed that 9 or more FS change units represented the MCII.12 When patients were grouped by baseline FS measures and 4 separate ROC analyses were conducted (1 per quartile of FS intake measures), MCII was dependent on intake FS, with patients perceiving improvement with fewer FS units as intake FS scores increased. Results suggested that 13 or more, 9 or more, 6 or more, and 4 or more FS change scores represented clinically meaningful improvement for patients in the first (intake FS, 0–34), second (intake FS>34–45), third (intake FS>45–54), and fourth (intake FS>54–100) quartile of FS intake measures, respectively.
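The quartile-specific thresholds reported above lend themselves to a simple lookup. The sketch below is illustrative only; the function names are hypothetical and the boundaries are taken directly from the quartile ranges in the text.

```python
def mcii_for_intake(intake_fs: float) -> int:
    """Quartile-specific MCII (FS change units) for a given intake FS score."""
    if not 0 <= intake_fs <= 100:
        raise ValueError("intake_fs must be between 0 and 100")
    if intake_fs <= 34:
        return 13  # first quartile (intake FS 0-34)
    if intake_fs <= 45:
        return 9   # second quartile (intake FS >34-45)
    if intake_fs <= 54:
        return 6   # third quartile (intake FS >45-54)
    return 4       # fourth quartile (intake FS >54-100)


def clinically_important(intake_fs: float, discharge_fs: float) -> bool:
    """True if FS change meets the MCII for the patient's intake quartile."""
    return (discharge_fs - intake_fs) >= mcii_for_intake(intake_fs)
```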

Using Functional Staging Approach 

Functional staging of lower-extremity function is displayed in figure 2. This figure shows the expected response (the gray-scale horizontal bars) to a given item as a function of the underlying lower-extremity ability (ie, FS) estimated by the knee CAT. In this figure, all the knee CAT items are listed in descending order of difficulty in the left column.11 Beneath the figure is the FS score continuum ranging from 0 to 100 (higher values, toward the right, represent higher functioning), separated into levels of functional staging from stage 1 (left or lower functioning) to stage 6 (right or higher functioning). Lower stages (eg, stage 1–2) describe the patient's lower-extremity function as limited in walking between rooms, and higher stages (eg, stage 5–6) indicate patients are more independent walking within the community. The threshold probability results identified 19 as the cut-score between functional stages 1 and 2, 29 as the cut-score between functional stages 2 and 3, and so forth.

Using the functional staging method, we can compare the patient's FS score to the functional stages to better interpret the score. For example, patients classified in FS stage 2 (scores 20–29) report being limited household ambulators having quite a bit of difficulty walking between rooms and report being unable to walk 2 blocks or a mile. Patients classified in FS stage 3 (scores 30–37) are classified as independent household ambulators and report having moderate difficulty walking between rooms, having a lot of difficulty walking 2 blocks, and being unable to walk for a mile. Table 3 provides a simple guideline to interpret the FS stage levels.

Table 3. Functional Staging: Expected Performance at Each Functional Stage Level
 | Stage I | Stage II | Stage III | Stage IV | Stage V | Stage VI
FS score range | 0–19 | 20–29 | 30–37 | 38–47 | 48–62 | >62
Activity | Physiologic ambulator | Limited household ambulator | Independent household ambulator | Limited community ambulator | Independent community ambulator | Active community ambulator
Running on even ground | Unable | Unable | Unable | Unable | Moderate difficulty | A little difficulty
Walking for a mile | Unable | Unable | Unable | A lot of difficulty | A little difficulty | No difficulty
Walking 2 blocks | Unable | Unable | A lot of difficulty | Moderate difficulty | A little difficulty | No difficulty
Walking between rooms | Unable | A lot of difficulty | Moderate difficulty | A little difficulty | No difficulty | No difficulty

In our sample, the percentages of patients in each functional stage at intake and discharge were 3.6% and 0.3% (stage 1), 11.8% and 2.1% (stage 2), 19.9% and 6.5% (stage 3), 27.3% and 15.1% (stage 4), 28.6% and 37.2% (stage 5), and 8.7% and 38.8% (stage 6), respectively.
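The staging in Table 3 amounts to a lookup against five cut-scores. A minimal sketch follows; the names are hypothetical, and it assumes scores are reported as whole numbers, so the handling of fractional scores between two ranges (for example, 19.5) is an assumption.

```python
from bisect import bisect_left

# Inclusive upper bounds of stages 1-5 from Table 3; scores above 62 are stage 6.
STAGE_UPPER_BOUNDS = [19, 29, 37, 47, 62]
STAGE_NAMES = [
    "Physiologic ambulator",
    "Limited household ambulator",
    "Independent household ambulator",
    "Limited community ambulator",
    "Independent community ambulator",
    "Active community ambulator",
]


def functional_stage(fs_score: float) -> int:
    """Return the functional stage (1-6) for an FS score on the 0-100 scale."""
    if not 0 <= fs_score <= 100:
        raise ValueError("fs_score must be between 0 and 100")
    # bisect_left counts the upper bounds strictly below the score, so a
    # score equal to a bound stays in the lower stage.
    return bisect_left(STAGE_UPPER_BOUNDS, fs_score) + 1


def stage_name(fs_score: float) -> str:
    return STAGE_NAMES[functional_stage(fs_score) - 1]
```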

Clinical Example 

To illustrate how to use these strategies to enhance clinical interpretation, consider 1 patient, age 62 years, with a knee sprain. Intake FS score was 32, and discharge FS was 70 (FS score change, 39). She considered her overall improvement as “a great deal better” and reported the global rating of change of +6. We plotted all her CAT responses in figure 3. The structure of figure 3 is equivalent to that in figure 2 but with circled patient's responses: light gray circles identify responses at intake; dark gray circles are responses at discharge.

Fig 3. Clinical example. The patient's responses are circled on the figure: lighter gray circles identify the responses at intake, and darker gray circles identify responses at discharge. Notation: 1, extreme difficulty or unable to perform activity; 2, quite a bit of difficulty; 3, moderate difficulty; 4, a little bit of difficulty; 5, no difficulty; the “:” is the threshold cut-score between contiguous responses per item. Abbreviation: FSCH, FS discharge–FS admission.

The 95% CI estimate of her intake score was 27 to 37 (intake FS±5). Compared with other patients with a variety of knee impairments, the patient's intake percentile rank was 22, indicating that her lower-extremity function exceeded only 22% of similar patients. The functional staging algorithm classified the patient as an independent household ambulator (stage 3).

At discharge, the 95% CI estimate for the patient's discharge score was 65 to 75 (discharge FS±5), which supports significant improvement over her intake FS score. Compared with other patients at discharge, this patient's percentile rank at discharge was 73. The functional staging classification suggested the patient was now an active community ambulator (stage 6). With an improvement of 39 FS score units, the patient's improvement was considered statistically reliable (RCI>1.96) and clinically meaningful (FS score change>13 adjusted by quartile).
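As a worked check of the reliable-change statement for this patient, assuming the standard RCI form (change divided by √2 times the standard error of measurement) and the standard error of measurement of 2.99 reported in the Results:

```latex
\mathrm{RCI} = \frac{\text{FS score change}}{\sqrt{2}\times 2.99}
             = \frac{39}{4.23}
             \approx 9.2 > 1.96
```

The same change of 39 units also exceeds the first-quartile MCII of 13 units that applies to an intake score of 32.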

Clinical Application 

By adding clinical interpretation of functional outcome measures, the additional information could provide an expanded knowledge base the clinical team could use in the process of care planning, including setting short-term and long-term goals. For example, short-term or long-term goals can be established in several ways: (1) improve 1 SD of FS score beyond patients' baseline FS measures, (2) progress 10 percentiles based on the FS score distribution, (3) advance FS change units equal to or exceeding the minimum clinically important improvement perceived by the patient, or (4) reach the next level of functional staging.

Discussion 

CATs are common in standardized testing for licensure, certification, and admission tests43, 44 but have only recently begun to be used to collect routine clinical data in busy outpatient rehabilitation clinics.9, 12, 45, 46, 47, 48, 49, 50 The current study followed recommended approaches5, 21, 22 to derive clinically meaningful interpretations of outcomes measures. Results suggest that CATs have good potential to efficiently generate estimates of functional status that are precise, valid, sensitive to change, and responsive, and that CAT ability estimates can be interpreted in clinically meaningful ways. These conclusions are encouraging for clinicians, managers of outpatient rehabilitation clinics, and researchers who cannot be burdened with onerous outcomes data collection processes but need to report patient outcomes.

To enhance the clinical interpretability of outcomes measures, numerous methods have been proposed, most of which emphasize statistically reliable, meaningful change.21, 22 However, there does not appear to be a consensus on what constitutes a responsive measure, nor how responsive a measure should be.51 Consequently, researchers commonly report multiple responsiveness indices and select a criterion to support their conclusion of what change represents meaningful change. In our study, we used 2 threshold approaches to define individual patient-level change: (1) statistically reliable change estimated by RCI, and (2) clinically important change estimated by MCII using patients' perceived GROC. We estimated the standard error of measurement using the internal consistency reliability coefficient (.96), which was slightly higher than test-retest reliability reported by Binkley et al15 (r=.94). Both our RCI and MCII responsiveness index showed that 9 or more FS change units represented statistically reliable and clinically meaningful improvement.

Several studies have found that MCII is dependent on baseline scores.12, 49, 50, 52, 53, 54, 55, 56, 57 For example, Stratford et al54 demonstrated that MCIIs varied between 3 and 13 units of a 0 to 80 scale depending on the range of scores at baseline. Intuitively, although we may consider that patients with very low scores would find a little bit of improvement more meaningful, results from the GROC analysis demonstrated the opposite. Patients with lower intake FS scores needed larger FS score change (ie, more improvement) to perceive meaningful improvement (ie, GROC≥3), while patients whose intake FS was high perceived meaningful improvement with a smaller FS score change. Although the underlying reason is still unknown, one can consider several possible explanations. First, when baseline scores (intake FS) are negatively correlated with change (FS score change), regression to the mean is common.58 However, regression to the mean may imply that equal amounts of FS score change would be associated with equal amounts of GROC, which we did not see. Second, it is logical that patients with lower intake FS scores have the opportunity to improve more than patients who have high intake FS scores, which would imply the potential for differences in MCII related to intake FS scores. Third, there may be differences in perceptions of improvement when a patient moves from a dysfunctional state to a more functional state, which would imply differences in MCII that are dependent on intake FS scores. In any event, further research on the relation between intake FS scores and MCII is recommended.

Under IRT models, measurement reliability varies by score level. Hence, the RCI calculations based on a reliability estimate for the whole scale may be misleading. In the current study, we used a single internal consistency reliability coefficient to estimate the standard error of measurement and RCI, which could be compared with IRT-based estimates of change. The minimal score changes for RCI to be 1.65 or larger (equivalent to MDC90) and 1.96 or larger (equivalent to MDC95) were 7.0 and 8.3, respectively. We then computed the MDC90 and MDC95 per FS range to take advantage of different IRT-based SEs for each score level using the same data.12 On average, the upper CI limits of MDC90 and MDC95 values for all patients were 11.6 and 13.8, respectively, which were influenced by the SEs for extreme high and low FS scores. Nonetheless, the MDC90 and MDC95 upper CI limit values for the 92% of patients with FS intake scores between 20 and 70 were 6.2 and 7.3, respectively. IRT-based estimates are slightly better than estimates based on entire scale reliability estimates, but the differences were not dramatic. These findings suggest that the knee CAT was more precise and sensitive to change in the middle range of FS scores. Future studies should investigate issues related to sensitivity to change at extreme scores.

A related concern is that the estimated SEs vary across the score range under the IRT model. Very high or very low functional status scores are likely to have a larger standard error of measurement because less information is available to estimate these scores. As a result, the larger SEs at the extremes of the measurement range make conclusions about patient ability difficult.
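
Why SEs grow at the extremes can be illustrated with a small sketch. It assumes a dichotomous Rasch model, a simplification chosen for brevity rather than the model used for the knee CAT, and a hypothetical fixed item bank with evenly spaced difficulties. Values are in logits, not FS units, and a real CAT would select items adaptively rather than administer a fixed set.

```python
import math


def rasch_item_information(theta: float, difficulty: float) -> float:
    """Fisher information of one dichotomous Rasch item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return p * (1.0 - p)


def standard_error(theta: float, difficulties: list[float]) -> float:
    """SE of the ability estimate: 1 / sqrt(total test information)."""
    info = sum(rasch_item_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(info)


# Hypothetical item bank: difficulties spread evenly from -2 to +2 logits.
HYPOTHETICAL_DIFFICULTIES = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

for theta in (-4.0, -2.0, 0.0, 2.0, 4.0):
    se = standard_error(theta, HYPOTHETICAL_DIFFICULTIES)
    print(f"theta={theta:+.1f}  SE={se:.2f}")
```

In this sketch the SE is smallest where the items are concentrated and increases as ability moves away from the range the items cover, which mirrors the pattern described for extreme FS scores.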

One of the strengths of this study is the functional staging performed to enhance clinical interpretation of the knee CAT-generated outcomes measures. Functional staging supports clinical interpretation through a visual display of IRT results. IRT methods allowed us to quantify the functional staging cut-scores based on the rating scale thresholds, and thus we could predict expected responses at different staging levels. When these results are displayed, clinicians can apply them directly. In developing the functional staging, we used an a priori conceptual model to determine the cut-scores of the functional staging classification. As in the study by Tao et al40 that applied an exploratory bookmark analysis with multiple judges examining item threshold values for functional staging, researchers risk identifying arbitrary thresholds based on polytomous responses that might not be clinically relevant. Therefore, functional staging classification systems need to be validated for clinical interpretation. Classification accuracy may be examined by asking staging questions in a dichotomous format (eg, “Are you able to walk for exercise only?” “Are you able to walk between rooms?” “Is your ability to walk in the community limited because of fatigue, lack of strength, or concerns about safety?”). In addition, further studies should validate our functional staging classification and examine whether different functional staging classification systems should be established for distinct diagnostic groups of patients.
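
Once cut-scores are fixed, assigning a patient to one of the 6 hierarchical stages is a simple lookup, sketched below. The cut-scores are placeholders and are NOT the study's values, which are not given in this section. The 0 to 100 FS range and the convention that a score equal to a cut-score falls in the higher stage are also assumptions, and the function names are hypothetical.

```python
from bisect import bisect_right

# Placeholder cut-scores separating 6 stages; NOT the values derived in the study.
HYPOTHETICAL_CUT_SCORES = [20, 35, 50, 65, 80]


def functional_stage(fs_score: float, cut_scores=HYPOTHETICAL_CUT_SCORES) -> int:
    """Return a stage from 1 (lowest function) to len(cut_scores) + 1."""
    if not 0 <= fs_score <= 100:  # assumed FS scale range
        raise ValueError("FS score must lie between 0 and 100")
    # Count the cut-scores at or below the score, then shift to 1-based stages.
    return bisect_right(cut_scores, fs_score) + 1


def stages_gained(intake_fs: float, discharge_fs: float) -> int:
    """Number of stages gained (negative if lost) between intake and discharge."""
    return functional_stage(discharge_fs) - functional_stage(intake_fs)


# Median intake and discharge FS scores reported in the abstract (42 and 61).
print(functional_stage(42), functional_stage(61), stages_gained(42, 61))
```

A validation study of the kind proposed above could compare the stage returned by such a lookup with patients' answers to the dichotomous staging questions.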

Study Limitations 

This study has several limitations. First, it was a secondary analysis of data prospectively collected via a proprietary database management company, Focus On Therapeutic Outcomes, Inc. The researchers were not in control of the data collection procedure. To ensure the quality of the data, we checked carefully for data entry errors when data were imported into the database and rechecked during data analysis. In addition, generalizability of results may be limited to participating clinics because there was the potential for patient selection bias related to differences in the characteristics of patients treated in participating clinics compared with nonparticipating clinics. However, because the knee CAT was administered in 291 outpatient physical therapy clinics in 30 U.S. states and the data set was large, we believe the potential for patient selection bias was reduced, although the true impact is unknown. Results presented in this study were based on a sample broadly defined as patients with knee impairment. Given that identification of medical or surgical diagnoses was optional in the data collection, a substantial amount of diagnostic data was missing. Future studies should endeavor to use more complete data sets.

Last, we noted that clinical goals could be established from these data by multiple methods, such as improvement of 1 SD from baseline, improvement of 10 percentiles from intake, change greater than the MCII, or change of 1 level in functional staging. Choosing one method over another would lead to different criteria for monitoring patient improvement. Future studies should investigate which methods best assist patient treatment when CAT-generated outcomes measures are used.

Conclusions 

We demonstrated how CAT outcomes measures can be interpreted to assist clinicians and patients with knee impairments during outpatient rehabilitation. The knee CAT currently is used routinely in many outpatient rehabilitation clinics across the United States and Israel, attesting to its efficiency and usability. Our results should improve the clinical interpretation of the measures and stimulate future studies.

Supplier

References 

  1. Fliege H, Becker J, Walter OB, Bjorner JB, Klapp BF, Rose M. Development of a computer-adaptive test for depression (D-CAT). Qual Life Res. 2005;14:2277–2291
  2. Kosinski M, Bjorner JB, Ware JE, Sullivan E, Straus WL. An evaluation of a patient-reported outcomes found computerized adaptive testing was efficient in assessing osteoarthritis impact. J Clin Epidemiol. 2006;59:715–723
  3. Ware JE, Bjorner JB, Kosinski M. Practical implications of item response theory and computerized adaptive testing: a brief summary of ongoing studies of widely used headache impact scales. Med Care. 2000;38(9 Suppl):II73–II82
  4. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL, Bouter LM. Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health Qual Life Outcomes. 2006;4:54
  5. Jette AM, Tao W, Norweg A, Haley S. Interpreting rehabilitation outcome measurements. J Rehabil Med. 2007;39:585–590
  6. Institute of Medicine. Crossing the quality chasm: a new health system for the 21st century. Washington (DC): National Academy Pr; 2001
  7. U.S. Department of Health and Human Services FDA Center for Drug Evaluation and Research. Guidance for industry: patient-reported outcome measures: use in medical product development to support labeling claims: draft guidance. Health Qual Life Outcomes. 2006;4:79–98
  8. Johnson DA. Pay for performance: ACG guide for physicians. Am J Gastroenterol. 2007;102:2119–2122
  9. Hart DL, Connolly JB. Pay-for-performance for physical therapy and occupational therapy: Medicare Part B services. Health and Human Services/Centers for Medicare and Medicaid Services; 2006. http://www.cms.hhs.gov/TherapyServices/downloads/P4PFinalReport06-01-06.pdf Accessed April 29, 2009
  10. Liang MH. Longitudinal construct validity: establishment of clinical meaning in patient evaluative instruments. Med Care. 2000;38(9 Suppl):II84–II90
  11. Hart DL, Mioduski JE, Stratford PW. Simulated computerized adaptive tests for measuring functional status were efficient with good discriminant validity in patients with hip, knee, or foot/ankle impairments. J Clin Epidemiol. 2005;58:629–638
  12. Hart DL, Wang YC, Stratford PW, Mioduski JE. Computerized adaptive test for patients with knee impairments produced valid and responsive measures of function. J Clin Epidemiol. 2008;61:1113–1124
  13. Dobrzykowski EA, Nance T. The Focus On Therapeutic Outcomes (FOTO) Outpatient Orthopedic Rehabilitation Database: results of 1994–1996. J Rehabil Outcomes Meas. 1997;1:56–60
  14. Swinkels IC, van den Ende CH, de Bakker D, et al. Clinical databases in physical therapy. Physiother Theory Pract. 2007;23:153–167
  15. Binkley JM, Stratford PW, Lott SA, Riddle DL. The Lower Extremity Functional Scale (LEFS): scale development, measurement properties, and clinical application (North American Orthopaedic Rehabilitation Research Network). Phys Ther. 1999;79:371–383
  16. Alcock GK, Stratford PW. Validation of the Lower Extremity Functional Scale on athletic subjects with ankle sprains. Physiother Can. 2002;54:233–240
  17. Stratford PW. Getting more from the literature: estimating the standard error of measurement from reliability studies. Physiother Can. 2004;56:27–30
  18. Stratford PW, Binkley JM, Watson J, Heath-Jones T. Validation of the LEFS on patients with total joint arthroplasty. Physiother Can. 2000;52:97–105
  19. Stratford PW, Hart DL, Binkley JM, Kennedy DM, Alcock GK, Hanna SE. Interpreting lower extremity functional status scores. Physiother Can. 2005;57:154–162
  20. World Health Organization. International classification of functioning, disability, and health. Geneva: World Health Organization; 2001
  21. Schmitt JS, Di Fabio RP. Reliable change and minimum important difference (MID) proportions facilitated group responsiveness comparisons using individual threshold criteria. J Clin Epidemiol. 2004;57:1008–1018
  22. Hays RD, Brodsky M, Johnston MF, Spritzer KL, Hui KK. Evaluating the statistical significance of health-related quality-of-life change in individual patients. Eval Health Prof. 2005;28:160–171
  23. Lord FM. Applications of item response theory to practical testing problems. Hillsdale: Lawrence Erlbaum Associates; 1980
  24. Linacre JM. Estimating measures with known polytomous item difficulties. Rasch Meas Trans. 1998;12:638
  25. Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol. 1991;59:12–19
  26. Beaton DE, Bombardier C, Katz JN, Wright JG. A taxonomy for responsiveness. J Clin Epidemiol. 2001;54:1204–1217
  27. Deyo RA, Patrick DL. The significance of treatment effects: the clinical perspective. Med Care. 1995;33(4 Suppl):AS286–AS291
  28. Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials. 1989;10:407–415
  29. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care. 1989;27(3 Suppl):S178–S189
  30. Stratford PW, Riddle DL. Assessing sensitivity to change: choosing the appropriate change coefficient. Health Qual Life Outcomes. 2005;3:23
  31. Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol. 2003;56:395–407
  32. Hsieh YW, Wang CH, Wu SC, Chen PC, Sheu CF, Hsieh CL. Establishing the minimal clinically important difference of the Barthel Index in stroke patients. Neurorehabil Neural Repair. 2007;21:233–238
  33. Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chronic Dis. 1986;39:897–906
  34. Stratford PW, Binkley JM, Riddle DL. Health status measures: strategies and analytic methods for assessing change scores. Phys Ther. 1996;76:1109–1123
  35. Arnould C, Penta M, Renders A, Thonnard JL. ABILHAND-Kids: a measure of manual ability in children with cerebral palsy. Neurology. 2004;63:1045–1052
  36. Kielhofner G, Dobria L, Forsyth K, Basu S. The construction of keyforms for obtaining instantaneous measures from the occupational performance history interview rating scales. Occup Ther J Res. 2005;25:1–10
  37. Woodbury ML, Velozo CA. Potential for outcomes to influence practice and support clinical competency. OT Pract. 2005;10:7–8
  38. Stineman MG, Ross RN, Fiedler R, Granger CV, Maislin G. Staging functional independence validity and applications. Arch Phys Med Rehabil. 2003;84:38–45
  39. Stineman MG, Ross RN, Fiedler R, Granger CV, Maislin G. Functional Independence Staging: conceptual foundation, face validity, and empirical derivation. Arch Phys Med Rehabil. 2003;84:29–37
  40. Tao W, Haley SM, Coster WJ, Ni P, Jette AM. An exploratory analysis of functional staging using an item response theory approach. Arch Phys Med Rehabil. 2008;89:1046–1053
  41. Perry J, Garrett M, Gronley JK, Mulroy SJ. Classification of walking handicap in the stroke population. Stroke. 1995;26:982–989
  42. Andrich DA. A rating formulation for ordered response categories. Psychometrika. 1978;43:561–573
  43. Mills CN, Potenza MT, Fremer JJ, Ward WC, editors. Computer-based testing: building the foundation for future assessments. Mahwah: Lawrence Erlbaum Associates; 2002
  44. Wainer H, editor. Computerized adaptive testing: a primer. 2nd ed. Mahwah: Lawrence Erlbaum Associates; 2000
  45. Jette AM, Haley SM, Tao W, et al. Prospective evaluation of the AM-PAC-CAT in outpatient rehabilitation settings. Phys Ther. 2007;87:385–398
  46. Deutscher D, Hart DL, Horn SD, Dickstein R, Gutvirtz M. Implementing an integrated electronic outcomes and electronic health record process to create a foundation for clinical practice improvement. Phys Ther. 2008;88:270–285
  47. Gandek B, Sinclair SJ, Jette AM, Ware JE. Development and initial psychometric evaluation of the participation measure for post-acute care (PM-PAC). Am J Phys Med Rehabil. 2007;86:57–71
  48. Kopec JA, Badii M, McKenna M, Lima VD, Sayre EC, Dvorak M. Computerized adaptive testing in back pain: validation of the CAT-5D-QOL. Spine. 2008;33:1384–1390
  49. Hart DL, Wang YC, Stratford PW, Mioduski JE. Computerized adaptive test for patients with hip impairments produced valid and responsive measures of function. Arch Phys Med Rehabil. 2008;89:2129–2139
  50. Hart DL, Wang YC, Stratford PW, Mioduski JE. Computerized adaptive test for patients with foot or ankle impairments produced valid and responsive measures of function. Qual Life Res. 2008;17:1081–1091
  51. Husted JA, Cook RJ, Farewell VT, Gladman DD. Methods for assessing responsiveness: a critical review and recommendations. J Clin Epidemiol. 2000;53:459–468
  52. Stratford PW, Binkley J, Solomon P, Finch E, Gill C, Moreland J. Defining the minimum level of detectable change for the Roland-Morris questionnaire. Phys Ther. 1996;76:359–368
  53. Riddle DL, Stratford PW, Binkley JM. Sensitivity to change of the Roland-Morris back pain questionnaire: part 2. Phys Ther. 1998;78:1197–1207
  54. Stratford PW, Binkley JM, Riddle DL, Guyatt GH. Sensitivity to change of the Roland-Morris back pain questionnaire: part 1. Phys Ther. 1998;78:1186–1196
  55. Goldsmith CH, Boers M, Bombardier C, Tugwell P. Criteria for clinically important changes in outcomes: development, scoring and evaluation of rheumatoid arthritis patient and trial profiles (OMERACT Committee). J Rheumatol. 1993;20:561–565
  56. Lauridsen HH, Hartvigsen J, Manniche C, Korsholm L, Grunnet-Nilsson N. Responsiveness and minimal clinically important difference for pain and disability instruments in low back pain patients. BMC Musculoskelet Disord. 2006;7:82
  57. Ostelo RW, Deyo RA, Stratford P, et al. Interpreting change scores for pain and functional status in low back pain: towards international consensus regarding minimal important change. Spine. 2008;33:90–94
  58. Bland JM, Altman DG. Statistics notes: regression towards the mean. Br Med J. 1994;308:1499
  a. Focus On Therapeutic Outcomes, Inc, 2910 Tazewell Pike, Ste E, Knoxville, TN 37918.

 A commercial party having a direct financial interest in the results of the research supporting this article has conferred or will confer a financial benefit on one or more of the authors.

PII: S0003-9993(09)00286-X

doi:10.1016/j.apmr.2009.02.008
