Archives of Physical Medicine and Rehabilitation
Volume 89, Issue 11 , Pages 2129-2139, November 2008

A Computerized Adaptive Test for Patients With Hip Impairments Produced Valid and Responsive Measures of Function

  • Dennis L. Hart, PT, PhD

      Affiliations

    • Department of Consulting and Research, Focus On Therapeutic Outcomes, Inc, Knoxville, TN
    • Corresponding author. Reprint requests to Dennis L. Hart, PT, PhD, Director of Consulting and Research, Focus On Therapeutic Outcomes, Inc, 551 Yopps Cove Rd, White Stone, VA, 22578-2403
  • Ying-Chih Wang, OT, PhD

      Affiliations

    • Department of Consulting and Research, Focus On Therapeutic Outcomes, Inc, Knoxville, TN
    • Sensory Motor Performance Program, Rehabilitation Institute of Chicago, Chicago, IL
  • Paul W. Stratford, PT, MS

      Affiliations

    • Department of Clinical Epidemiology and Biostatistics, School of Rehabilitation Science, McMaster University, Toronto, ON, Canada
  • Jerome E. Mioduski, MS

      Affiliations

    • Department of Information Technology, Focus On Therapeutic Outcomes, Inc, Knoxville, TN


Abstract 

Hart DL, Wang Y-C, Stratford PW, Mioduski JE. A computerized adaptive test for patients with hip impairments produced valid and responsive measures of function.

Objectives

To describe the use of a computerized adaptive test (CAT) in routine clinical practice and evaluate content coverage and construct validity, sensitivity to change, and responsiveness of hip CAT functional status (FS) measures.

Design

Longitudinal, prospective observational cohort study.

Setting

Two hundred fifty-seven outpatient rehabilitation clinics in 31 states (United States).

Participants

Two samples were examined: intake and discharge rehabilitation FS data from patients (N=8714) treated for hip impairments between January 2005 and June 2007 and data from patients (N=444) used to develop the hip CAT were examined for comparison (2002–2004).

Interventions

Not applicable.

Main Outcome Measures

Hip functional status and global rating of change.

Results

The CAT used on average 7 items to produce precise estimates of FS that adequately covered the content range with negligible floor and slight ceiling effects. Test information functions and SEs supported FS measure precision. FS measures discriminated patients in clinically logical ways. Sixty-one percent of patients obtained discharge FS measures greater than or equal to minimal detectable change (95% confidence intervals). Change of 6 FS units (scale: 0–100) represented minimal clinically important improvement, which 64% of patients obtained.

Conclusions

The hip CAT was efficient; produced valid, responsive measures of FS for patients receiving therapy for hip impairments; and functioned well in routine clinical application but would benefit from more difficult items.

Key Word: Rehabilitation

List of Abbreviations: CAT, computerized adaptive test; CI, confidence interval; DIF, differential item functioning; FCI, Functional Comorbidity Index; FOTO, Focus On Therapeutic Outcomes, Inc.; FS, functional status; GROC, global rating of change; IER, item exposure rate; LEFS, Lower Extremity Functional Scale; MCII, minimal clinically important improvement; MDC, minimal detectable change; MDC90, minimal detectable change at the 90% CI; MDC95, minimal detectable change at the 95% CI; PRO, patient-reported outcome; ROC, receiver operating characteristic analyses

 

THE MEASUREMENT OF PROs in healthcare is evolving with psychometricians capitalizing on the administrative efficiency and measure precision of CAT.1, 2, 3 The primary advantages of CAT administrations4, 5, 6 are that the adaptive tests tailor PRO item administration, so each patient responds to items whose location (ie, item's position on the latent continuum7, 8 or difficulty) approximates the patients' ability estimates, and the number of items administered is minimized.8, 9, 10, 11 In this era of clinical efficiency that is increasingly influenced by the need for the collection of PROs that may influence payment policy,12, 13, 14 measure precision and efficiency are important.

CAT has its origins in mental,15 educational,16 and military5 testing. Recently, CAT applications have been introduced in healthcare.3, 17 As described by Jette et al,18 the application of CATs in healthcare has been recommended for nearly a decade,19, 20, 21 but only recently have CATs been used in data simulations,22, 23, 24, 25, 26, 27 research demonstrations,24, 28, 29 and prospective data collections.18, 30, 31, 32 Although the development of CATs in healthcare in general is growing,1, 2, 3, 10, 33 the development and use of CATs in outpatient rehabilitation specifically18, 25, 26, 27, 30, 31, 34, 35 are rapidly advancing. Several factors positively influence that growth. Therapists commonly measure PROs because national associations recommend outcomes collection and analysis.36 Outcomes measures are common in outpatient rehabilitation.36 Most patients have functional limitations that improve with therapy and can be measured.18, 37, 38 Payers expect justification for treatment.39

The current study builds on previous work in which we developed, simulated, and prospectively tested body part-specific CATs25, 26, 27, 31 by using retrospective and prospective data analyses. Here, as we have previously performed for patients with knee impairments,31 we prospectively evaluated the practical and psychometric adequacy of a CAT used to assess the FS of patients with hip impairments seeking rehabilitation in outpatient therapy clinics participating with FOTO, an international medical rehabilitation outcomes database management company.40, 41 For the purpose of this study, we operationally define hip impairments as the constellation of medical conditions that are associated with functional deficits of the hip and immediate area (ie, pelvis and upper leg). As previously described,26 for the purpose of this study of patients with hip impairments, the latent trait of interest is lower-extremity FS, which we operationally define as the patients' perception of their ability to perform functional tasks described in the FS items. Therefore, FS represents the level of ability of the patient, which represents the patient's location on the underlying latent trait of functional status.7, 8 The item bank for the hip CAT26 was developed by using items from the LEFS,42 a scale with strong psychometric properties42, 43, 44, 45 and broad clinical and research acceptance.46 FS, as assessed by using LEFS items, represents the activity dimension of the World Health Organization's International Classification of Functioning, Disability and Health.47

Our purposes were to (1) assess the practicality of using a CAT in routine clinical practice; (2) perform a psychometric evaluation of content range coverage and test precision; and (3) assess known group construct validity, sensitivity to change, and responsiveness of the FS measures estimated by using the hip CAT. Analyses will be used to determine strengths and weaknesses of the CAT and help direct future research designed to improve the CAT's practicality and psychometric properties.


Methods 

Design and Setting 

We conducted a longitudinal, prospective observational cohort study of patients with hip impairments before and at the conclusion of rehabilitation. The FOTO Institutional Review Board for the Protection of Human Subjects approved the project.

Participants 

To accomplish our purposes, we analyzed data from 2 samples. Primary analyses were conducted by using prospectively collected data (sample 1) from patients with hip impairments who represent a sample of convenience (table 1). Secondarily, some psychometric property comparisons were made between results of analyses on data from sample 1 and results from analyses on the original sample (sample 2) that was used to develop the hip CAT.26 Both samples represent data collected routinely from practices participating with the FOTO data-collection process.30, 41 Routine data collection before patient evaluation mandated that clinicians enter into the computer the body part associated with the impairment treated for each patient. The computer used this information to select which CAT to administer (eg, the hip CAT was administered to patients with hip impairments). Secondarily, clinicians could elect to identify medical or surgical diagnoses (described later). To create sample 1, we selected patients from the FOTO dataset whose clinicians identified them as being treated for a hip impairment. Therefore, all patients were identified as having a hip impairment. Some patients (50%) also had medical/surgical codes. For those with data, the most common diagnosis was pain in the hip (International Classification of Diseases–9th Revision, 719.45).48 The 7 clusters of medical/surgical codes displayed in table 1 represent 96% of patients with diagnostic codes. The researchers did not influence data collection. The patients received rehabilitation in 257 outpatient clinics in 31 states (U.S.) between January 2005 and June 2007. Clinics were participating with FOTO.40, 41 Analyses required that we select several subsets of patients from sample 1, and some analyses required that we use both sample 1 and 2. When subsets of patients were used, the selected patients represented all patients who had the data necessary for the analyses.

Table 1. Patient Characteristics at Rehabilitation Intake

Characteristic | Value
Diagnoses (%) |
  Soft tissue disorders of muscle, synovium, tendon, bursa, or enthesopathies (ICD-9 codes 725–729) | 30
  Postsurgical conditions (CPT codes 20150–29999, including total hip replacement, open treatment of greater trochanteric fracture) | 6
  Spine pathology (ICD-9 codes 720–724) | 3
  Sprains and strains of hip (ICD-9 codes 840–848 including unspecified sprain or strain) | 2
  Arthropathies (ICD-9 codes 710–716, including osteoarthroses, rheumatoid arthritis) | 2
  Fractures (ICD-9 800–829 including femoral fractures) | 2
  Disorders of the bone and cartilage (ICD-9 codes 730–739 including osteoporosis of femur) | 1
  Other (not otherwise classified) | 1
  Missing | 50
Age, mean ± SD (range) (y) | 56±17 (18–102)
  18 to <45 (%) | 24
  45 to 65 (%) | 39
  >65 (%) | 30
  Missing (%) | 7
Sex (% women) | 63
  Missing | 7
Acuity of symptoms (%) |
  Acute (0–21d) | 15
  Subacute (22–90d) | 26
  Chronic (>90d) | 52
  Missing | 7
Surgical history (%) |
  None | 74
  One | 18
  Two | 4
  Three | 1
  Four or more | 2
  Missing | 1
Exercise history (%) |
  At least 3 times a week | 21
  1 to 2 times a week | 12
  Seldom or never | 19
  Missing | 48
Payer source (%) |
  Indemnity | 2
  Litigation | <1
  Medicaid | 2
  Medicare part B | 14
  Patient private pay | <1
  HMO | 12
  PPO | 12
  Workers' compensation | 3
  Other | 4
  Missing | 51
Number of functional comorbidities (%) |
  None | 21
  One | 22
  Two | 19
  Three or more | 35
  Missing | 3
Global rating of change (%) |
  Improved (≥3 to 7) | 14
  Not improved (–7 to <3) | 3
  Missing | 83

NOTE. N=8714 patients.

Abbreviations: CPT, current procedural terminology93; HMO, health maintenance organization; ICD-9, International Classification of Diseases–9th Revision48; max, maximum; min, minimum; PPO, preferred provider organization.

Diagnoses are groups of International Classification of Diseases–9th Revision–Clinical Modifications codes or surgical CPT codes.

Enthesopathies are disorders of peripheral ligamentous or muscular attachments.

Functional comorbidities are medical conditions shown to affect physical functioning.57

Data Collection 

Data were collected by using Patient Inquiry Software,a which the participating clinics used for routine data collection during patient management. Patients seeking rehabilitation entered demographic data before initial evaluation. The hip CAT was administered at 3 possible times during rehabilitation. Each patient completed the CAT before the initial evaluation; those data were labeled intake. When the CAT was administered at the end of rehabilitation, the data were labeled discharge. Therapists could elect to have patients complete the CAT during rehabilitation (between intake and discharge), and, if they did, those data were labeled status. Clinical staff entered demographic data at intake and discharge.

Hip Functional Status CAT 

The hip CAT has been described and simulated.26 Briefly, LEFS items represent functional activities, such as making sharp turns while running fast. Patients rate their ability to perform each activity on a 5-level rating scale ranging from extreme difficulty or unable to perform activity to no difficulty.42 Unidimensionality and local independence of the 18-item bank from the LEFS were supported.26 Fit to the Andrich49 rating scale model, a 1-parameter item response theory model, was supported. Item location parameters supported a clinically logical hierarchic structure of the item bank. The LEFS items showed DIF50 by the affected body part of the lower extremity (hip, knee, or foot/ankle); that is, item location parameter estimates varied by body part in clinically logical ways. DIF is present when the relationship between item responses and the functional status measured differs systematically between groups of patients after controlling for the patients' underlying FS.51 Therefore, the hip CAT was developed with items calibrated by using data from patients with hip impairments (sample 2),26 which makes the hip CAT a body-part–specific or condition-specific CAT.

The CAT was developed following the logic of Thissen and Mislevy52 by using software developed specifically for the purpose (CAT Development and Testing Software, version 2.1.0).53,a The adaptive test started by administering the most informative54 item (going up or down 10 stairs), which represented an item location of median functional status difficulty. Patient FS ability estimates with associated SEs were calculated by maximum likelihood estimation with a Newton-Raphson estimation technique.55 There were 2 stopping rules: (1) SE for the provisional ability was less than 4 out of 100 FS units (functional status estimates were scaled 0–100 by using a linear transformation with higher FS measures representing higher functioning), and (2) change in provisional ability estimates for the last 3 items was less than 1 out of 100. The SE of less than 4 represents less than .30 SD of the scale range. The CAT stopped when either stopping rule was met. Real-data simulations5 were conducted by using the original data in which all patients answered all LEFS items. The CAT was simulated by using the original data so FS levels estimated by using all LEFS items (computer administered) could be compared with FS levels estimated by using the CAT (computer adaptive). The simulation of the hip CAT suggested that on average 6 items would be used before a stopping rule was met. Known group construct validity and precision of the simulated CAT FS estimates were supported.26
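The 2 stopping rules can be sketched in a few lines. This is a minimal illustration and not the CAT Development and Testing Software: the function name `should_stop` and its parameters are hypothetical, and reading "change in provisional ability estimates for the last 3 items" as the spread (maximum minus minimum) of the last 3 provisional estimates is an assumption.

```python
def should_stop(estimates, standard_errors, se_limit=4.0, change_limit=1.0, window=3):
    """Return True when either stopping rule described in the text is met.

    estimates / standard_errors: provisional FS estimates and their SEs
    (0-100 scale), one entry per item administered so far.
    All names here are illustrative, not taken from the original software.
    """
    if not estimates:
        return False
    # Rule 1: SE of the current provisional estimate is below 4 FS units.
    if standard_errors[-1] < se_limit:
        return True
    # Rule 2 (assumed reading): the provisional estimate moved by less than
    # 1 FS unit across the last 3 items, measured here as max - min.
    if len(estimates) >= window:
        recent = estimates[-window:]
        if max(recent) - min(recent) < change_limit:
            return True
    return False
```

Because the test stops when either rule is met, a patient whose estimate stabilizes can finish with an SE that is still above 4 FS units.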

During routine administration in the clinic, the hip CAT items were presented by the computer by using the 5 original LEFS response categories. In addition, the patient could elect not applicable for any item, which was recorded as missing data and not used in the item-exposure rate or FS estimation.

Over the testing time, the hip CAT was administered 12,676 times: 8813 times at intake and 3863 during rehabilitation or at discharge (table 2). There were 8714 patients who completed the hip CAT at intake. Of these patients, 99 patients returned for another rehabilitation episode over the testing time for 8813 intake episodes from 8714 patients. There were 3841 patients who completed the hip CAT at rehabilitation discharge. Of these patients, 22 patients completed status administrations. These data represent a completion rate (ie, percent of patients with intake data who also have discharge data) of 44%.30
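The administration counts reported above can be cross-checked with simple arithmetic. The sketch below only restates numbers from the text; the variable names are illustrative.

```python
# Counts taken directly from the text.
intake_administrations = 8813
later_administrations = 3863      # status or discharge administrations
patients_with_intake = 8714
patients_with_discharge = 3841
status_administrations = 22

total_administrations = intake_administrations + later_administrations
repeat_episodes = intake_administrations - patients_with_intake
# Completion rate: percent of patients with intake data who also have discharge data.
completion_rate = 100 * patients_with_discharge / patients_with_intake

print(total_administrations, repeat_episodes, round(completion_rate))
```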

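Table 2. Scale Distributions and Minimal Detectable Change of the Hip CAT Functional Status Estimates

Group | Measure | n | Value
Patients with admission data | Admission | 8708 | 49±15 (0–100)
Patients with both admission and discharge data | Admission | 3838 | 50±14 (0–98)
Patients with both admission and discharge data | Discharge | 3838 | 63±18 (0–100)
Patients with both admission and discharge data | Difference | 3838 | 13±15 (−51–96)
Utilization data | Visits | 4566 | 9±6 (1–68)
Utilization data | Duration | 4386 | 36±28 (2–272)
Percent of patients with change scores ≥MDC | MDC90 | 3838 | 64
Percent of patients with change scores ≥MDC | MDC95 | 3838 | 61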

NOTE. Values are mean ± SD (range) except for MDC90 and MDC95, which are the percent of patients with FS change estimates equal to or greater than 90% or 95% MDC.

Data Analyses 

Efficiency 

To evaluate the practicality of hip CAT administration, we assessed the CAT's efficiency, which we defined as the number of CAT items administered per assessment (sample 1).

Content range coverage and item usage 

We assessed the content range of the scale to determine how well the hip CAT item bank captured the range of FS reported by the patients (sample 1). We assessed floor and ceiling effects of the scale at intake and discharge. We operationally defined a floor effect as a measure from 0 to 5 and a ceiling effect as a measure from 95 to 100. To better assess how many patients were at the extremes of the measurement continuum, we also tabulated the number of patients who answered every item administered by the CAT with the lowest response choice (the floor) or with the highest response choice (the ceiling). We assessed the percentage of not applicable responses per item. We plotted the item exposure rate (ie, the ratio of the total number of times an item was administered over the total number of test occasions) against item location levels to determine if the adaptive test administered different items to patients with differing levels of FS.
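The floor and ceiling definitions and the item exposure rate translate directly into code. The following is a minimal sketch under those definitions; the function names and the toy item labels are hypothetical, and an item is counted at most once per test occasion.

```python
from collections import Counter

def floor_ceiling_percent(scores, floor_max=5, ceiling_min=95):
    """Percent of FS scores in the floor (0-5) and ceiling (95-100) ranges."""
    n = len(scores)
    floor = sum(1 for s in scores if 0 <= s <= floor_max)
    ceiling = sum(1 for s in scores if ceiling_min <= s <= 100)
    return 100 * floor / n, 100 * ceiling / n

def item_exposure_rates(administrations):
    """Item exposure rate: times an item was administered / number of test occasions.

    administrations: one list of item identifiers per test occasion.
    """
    counts = Counter(item for items in administrations for item in set(items))
    occasions = len(administrations)
    return {item: count / occasions for item, count in counts.items()}

# Toy data, for illustration only.
rates = item_exposure_rates([["stairs", "walking"], ["stairs", "running"]])
```

Plotting these rates against item location, separately for lower- and higher-functioning patients, is one way to check that a CAT is selecting different items for different patients.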

Test precision 

We assessed test precision 3 ways. First, we used information, an important concept in item response theory that reflects measure precision at each functional status level,56 by plotting the test information function.7, 54 The test information function is the sum of the item information functions at each FS level along the latent trait continuum. The amount of information provided by a test at each FS level is inversely related to the error with which functional status is estimated at the given FS point.7 We plotted the test information function generated by using data from the original sample26 of patients (intake data from sample 2) used to develop the hip CAT (n=444) in which all patients answered all items and superposed 3 additional test information function plots by using the same data (item calibrations were anchored to all intake data from sample 2): (1) using the 7 most commonly administered items to all patients, (2) using the 7 most commonly administered items for patients with lower FS scores, and (3) using the 7 most commonly administered items for patients with higher FS scores. The most common items were identified by item exposure rate analyses using intake CAT data (sample 1). Although the test information function using 7 items should be of less magnitude compared with the test information function using 18 items, the shape of the 7-item test information function compared with the 18-item test information function should provide qualitative assessment of the level of test precision for the CAT compared with the original scale over the measurement range, and the test information functions for different groups of patients by level of ability will assist in understanding if the CAT is functioning adaptively.
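To make the test information function concrete, the sketch below computes it for a rating scale model, where the information an item provides at a given ability equals the variance of the item score at that ability, and the test information is the sum over items. It works in logits rather than the 0–100 FS scale, and the item locations and thresholds are invented for illustration; they are not the hip CAT calibrations.

```python
import math

def category_probabilities(theta, location, thresholds):
    """Rating scale model category probabilities for one item (categories 0..m)."""
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - location - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, location, thresholds):
    """Item information = variance of the item score at ability theta."""
    probs = category_probabilities(theta, location, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

def total_information(theta, locations, thresholds):
    """Test information = sum of the item information functions at theta."""
    return sum(item_information(theta, b, thresholds) for b in locations)

# Hypothetical 18-item bank with 5 response categories (4 thresholds).
thresholds = [-1.5, -0.5, 0.5, 1.5]
bank = [-2 + 4 * i / 17 for i in range(18)]
info_full = total_information(0.0, bank, thresholds)
info_subset = total_information(0.0, bank[:7], thresholds)
se_full = 1 / math.sqrt(info_full)  # SE in logits is inversely related to information
```

As the text anticipates, a 7-item subset always yields less information than the full bank at every ability level, so the comparison of interest is the shape and peak location of the curves rather than their height.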

Second, to quantify measure precision at the FS level, we averaged the SE of the person's FS estimates from the original data (n=444 from sample 2) at 10 levels (ie, 0–10, 11–20, . . . , 91–100) of the FS continuum. We plotted the 95% CI of the average SE estimates (average SE per FS level times 1.96) to assess measure precision.

Third, to quantify measure precision for CAT-generated measures of FS, we plotted SEs of the CAT-generated FS estimates by FS level.
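Averaging SEs by FS level and scaling by 1.96 can be sketched as follows. The function name and the handling of non-integer scores at the level boundaries are assumptions made for illustration.

```python
import math

def average_se_by_level(fs_scores, standard_errors, z=1.96):
    """Average the SEs within 10 FS levels (0-10, 11-20, ..., 91-100).

    Returns {level_index: (mean_se, z * mean_se)}; only levels with data appear.
    """
    grouped = {}
    for fs, se in zip(fs_scores, standard_errors):
        # Level 0 holds 0-10; level k holds scores above 10k up to 10k + 10.
        level = 0 if fs <= 10 else (math.ceil(fs) - 1) // 10
        grouped.setdefault(level, []).append(se)
    return {
        level: (sum(ses) / len(ses), z * sum(ses) / len(ses))
        for level, ses in sorted(grouped.items())
    }

# Toy data, for illustration only.
result = average_se_by_level([5, 10, 15, 100], [4.0, 6.0, 3.0, 8.0])
```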

Construct validity 

We used known group construct validity methods to assess the ability of the CAT-generated FS measures to discriminate groups of patients in clinically logical ways (sample 1 for patients with both intake and discharge data). The independent variables assessed included age, symptom acuity, surgical history, condition complexity, and prior exercise history. To perform the analyses, we used analyses of covariance in which the dependent variable was discharge FS, and the covariate was intake FS. Age was categorized as 18 to 44, 45 to 64, and 65 years or older. Symptom acuity, which we operationally defined as the number of calendar days from the date of onset of the condition being treated in therapy to the date of initial therapy evaluation, was categorized as acute (<22d), subacute (22–90d), and chronic (>90d). Surgical history was categorized as none, 1, 2, 3, or 4 or more surgeries related to the condition being treated. Condition complexity was assessed by using the FCI,57 which is a list of 18 medical conditions, such as arthritis, asthma, diabetes, and heart attack, that have been shown to affect physical functioning. The FCI was developed specifically to assess the complexity of the patient's medical condition related to the prediction of functional ability in 2 samples of patients similar to those in the current study. To group patients by their condition complexity, the number of comorbidities was summed. In previous unpublished studies performed by using FOTO FCI data, the distribution of the number of comorbidities was not normal; therefore, we tested the distribution of the number of comorbidities by using the Shapiro-Wilk W statistic.58 We determined patient classification groupings by quartiles of the number of comorbidities. Exercise history before receiving therapy was categorized as exercising 3 times a week or more, exercising 1 to 2 times a week, or exercising seldom or never. Each variable (age,37, 59, 60, 61 symptom acuity,18, 37, 59, 61, 62 surgical history,18, 37, 59, 60, 61 condition complexity,57, 59 prior exercise history,37, 61, 63, 64 and intake FS37, 59, 60, 61, 63) has been shown to influence discharge FS in similar datasets. Scheffé post hoc pairwise analyses were conducted for each significant independent variable of 3 or more levels.

Sensitivity to change 

Sensitivity to change is defined as the ability of an instrument to accurately detect change when it has occurred regardless of whether it is relevant or meaningful to the decision maker.65, 66, 67 Responsiveness deals with the clinical meaningfulness and importance of change in scores for the measure of interest.65, 66, 67, 68, 69 These terms are often used interchangeably to describe the ability of a measure to detect clinical change.42, 70, 71 Because the hip CAT was designed to assess clinically meaningful change within individual patients, we used the term sensitivity to change to denote the ability of a measure to detect change greater than measurement error in patients' status over time.42

Sensitivity to change was assessed in several ways. First, we calculated the MDC per FS ability. The MDC is the amount of change that is likely to be greater than the measurement error,42 which has been defined as meaningful change.72 To assess 90% CI and 95% CI MDCs, we obtained the SE per FS intake score by using original data (n=444; sample 2),26 multiplied the average SE per FS score by a z score corresponding to a 90% CI (1.65) or 95% CI (1.96), multiplied the result by the square root of 2,42, 73 and calculated the percent of patients with FS change estimates equal to or greater than MDC90 or MDC95 in the current sample (sample 1 with patients who had both intake and discharge data). The MDCs represent the minimal change detectable beyond measurement error.65
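The MDC calculation is the SE multiplied by a z score and by the square root of 2. A minimal sketch follows; the SE of 4 FS units used in the example is a hypothetical input chosen only because it matches the CAT's SE stopping threshold, not a value reported for any FS level.

```python
import math

def minimal_detectable_change(se, z):
    """MDC = SE x z x sqrt(2), with z = 1.65 for MDC90 and 1.96 for MDC95."""
    return se * z * math.sqrt(2)

def percent_reaching_mdc(changes, mdcs):
    """Percent of patients whose FS change is equal to or greater than their MDC.

    changes: discharge FS minus intake FS per patient.
    mdcs: the MDC that applies to each patient's intake FS level.
    """
    reached = sum(1 for change, mdc in zip(changes, mdcs) if change >= mdc)
    return 100 * reached / len(changes)

# Hypothetical SE of 4 FS units.
mdc90 = minimal_detectable_change(4.0, 1.65)  # about 9.3 FS units
mdc95 = minimal_detectable_change(4.0, 1.96)  # about 11.1 FS units
```

Because the SE varies along the FS continuum, the MDC computed this way also depends on the patient's intake FS level.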

Second, patients were expected to improve their FS measures by different amounts, so, as recommended by Stratford and Riddle,74 we used an anchor-based longitudinal method73, 75 using the GROC76 as the comparison standard to which FS change (discharge FS minus intake FS) was compared to estimate the MCII in FS measures (sample 1 with patients with both intake and discharge data as well as GROC data). Jaeschke et al76 defined the minimal clinically important difference as the smallest difference in the measure of interest that patients perceive as beneficial. Because our interest is assessing how the hip CAT FS estimates detect meaningful change (ie, greater than measurement error) in patients for the purpose of influencing interpretation of FS change during treatment and because the GROC item asked about improvement perceived, we described MCII as the minimal clinically important improvement77, 78, 79 representing a meaningful change in the FS from the perspective of the patient72 or responsiveness.66 As calculated, MCII represents the smallest observed change in those estimated to have important change.65

We assessed the MCII by analyzing the GROC data several ways. For each MCII assessment, patients were dichotomized by their GROC scores on a 15-point scale (–7 to 7) as patients who did not improve (ie, GROC scores <3) versus patients who improved (ie, GROC scores ≥3).72 Once patients were dichotomized, we estimated the MCII using all patients regardless of intake FS measure and using patients grouped by quartile of baseline FS scores.

For each of the 5 sets of data for GROC analyses, we used nonparametric ROC analyses to quantify the diagnostic accuracy80 of the FS change estimates to discriminate between patients whose functional status had changed in a clinically meaningful way compared with patients whose FS had not changed.81 The MCII cut point for each analysis was identified by selecting the FS change score with the largest average specificity and sensitivity values ([specificity + sensitivity]/2). Area under the ROC, SE, and 95% CI were used to describe the ROC results. The percent of patients whose FS change was equal to or greater than MCII was calculated for each of the 5 ROC methods.
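The cut-point selection can be sketched as follows: patients are dichotomized at GROC ≥3, and the FS change score that maximizes the average of sensitivity and specificity is taken as the MCII. The function names, the toy data, and the tie-breaking rule (the lowest qualifying cut point wins) are assumptions; the sketch also assumes both groups are non-empty.

```python
def mcii_cut_point(changes, improved):
    """Return (cut, sensitivity, specificity) maximizing (sensitivity + specificity) / 2.

    A patient is classified as improved when FS change >= cut.
    """
    positives = sum(improved)
    negatives = len(improved) - positives
    best = None
    for cut in sorted(set(changes)):
        tp = sum(1 for c, y in zip(changes, improved) if y and c >= cut)
        tn = sum(1 for c, y in zip(changes, improved) if not y and c < cut)
        sens, spec = tp / positives, tn / negatives
        score = (sens + spec) / 2
        if best is None or score > best[0]:  # ties keep the lowest cut (assumption)
            best = (score, cut, sens, spec)
    return best[1], best[2], best[3]

def area_under_roc(changes, improved):
    """Nonparametric area under the ROC curve (ties count one half)."""
    pos = [c for c, y in zip(changes, improved) if y]
    neg = [c for c, y in zip(changes, improved) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, for illustration only.
groc = [0, 1, 2, -1, 3, 4, 5, 7]
fs_change = [1, 3, 5, 7, 6, 9, 11, 13]
improved = [g >= 3 for g in groc]   # dichotomize at GROC >= 3
cut, sens, spec = mcii_cut_point(fs_change, improved)
auc = area_under_roc(fs_change, improved)
```

Repeating the same procedure within each quartile of baseline FS gives the quartile-specific cut points.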


Results 

Efficiency 

Scale distributions, number of treatment visits, and treatment duration are displayed in table 2. Patients who had both intake and discharge data, compared with patients with just intake data, were older (57±17y vs 54±17y, t=7.7, df=7772, P<.001), exercised more (χ²=7.5, df=2, P=.024), were more likely to be receiving Medicare Part B and less likely to be receiving Medicaid benefits (χ²=32.7, df=11, P=.001), were more likely to have had 1 surgical procedure to their hip (χ²=10.5, df=4, P=.032), and had higher FS intake scores (t=2.7, df=8469, P=.015). Groups did not differ by sex (χ²=.002, df=1, P=.967), acuity of symptoms (χ²=2.7, df=2, P=.259), number of comorbidities (χ²=2.6, df=3, P=.457), or medication use at intake for their condition (χ²=1.9, df=1, P=.166).

All data (n=12,676) were used for efficiency testing. When administered, on average ± SD, the CAT used 7.2±2.4 items (minimum=3, maximum=18, median=7). Fifteen (0.1%) patients used all 18 items.

Content Range Coverage and Item Usage 

Of the 8714 patients with intake FS estimates, 6 had a score of 0 and 66 (0.8%) had scores between 0 and 5, which we judged as a negligible floor effect. Of those same intake data, 1 patient had a score of 100, and 98 (1.1%) had scores between 95 and 100 for a negligible ceiling effect. For the 3841 patients with discharge data, 3 patients had a score of 0, and 6 had scores between 0 and 5, which we judged as a negligible floor effect. Nine (0.2%) patients had a score of 100, and 330 (8.6%) had scores between 95 and 100 for a negligible ceiling effect. Sixty-four (0.8%) and 2 (0.1%) patients selected all the lowest responses for all items administered by the CAT at intake and discharge, respectively, and 151 (1.8%) and 550 (15.4%) patients selected all the highest responses for all items administered by the CAT at intake and discharge, respectively, which represents a slight ceiling effect at discharge. The functional status score distributions at admission (ie, sample 1, patients with admission data) and discharge (ie, sample 1, patients with discharge data) are plotted in figure 1, suggesting a potential for a ceiling effect for patients at discharge.

Fig 1. Hip CAT functional status score distributions at admission (FS Intake) and discharge (FS Discharge). NOTE. Intake data from the 8714 patients with intake FS estimates and discharge data from the 3841 patients with discharge data.

Display of the item exposure rates (plot available on request) for different levels of FS (sample 1) supported that the hip CAT was functioning adaptively as expected: easier items were administered more often to lower-functioning patients, and harder items were administered more often to higher-functioning patients.

Patients used the not applicable response for each item. As a percent of item use for intake and discharge data, patients tended to select not applicable for harder items; making sharp turns while running fast, running on uneven ground, running on even ground, and hopping were the 4 items with the most not applicable responses. The number of items used by the CAT tended to increase when patients used the not applicable response, particularly for older patients with higher FS scores.

Test Precision 

The test information functions regardless of the FS level (fig 2) had similar maximum FS abilities (FS=47 when using all items and FS=42 when using the 7 items with the largest item exposure rates). The test information functions cover the majority of the FS range similarly. These data were interpreted as similar patterns of information per FS continuum. In addition, the test information functions by patient ability peaked at appropriately higher and lower FS abilities, suggesting that the CAT discriminated patients by the level of FS. A plot of SEs of CAT-generated FS estimates (fig 3) suggested good measure precision for the majority of the FS range.

Fig 2. Test information functions. NOTE. Test information function curves were estimated by using item parameter estimates from the original data (n=444; sample 2).26 Abbreviations: Top 7 IER (Low Function Patients), test information function using the 7 most commonly used items in the hip CAT for patients with lower FS scores; Top 7 IER (High Function Patients), test information function using the 7 most commonly used items in the hip CAT for patients with higher FS scores; Full Test (P&P Version), test information function using all items in the hip CAT item bank for all patients; Top 7 IER (Entire Sample), test information function using the 7 most commonly used items in the hip CAT for all patients regardless of FS scores; IER, item exposure rate.

The upper limits of 95% CIs for the SE (scale 0–100) and percent of patients per functional status range are plotted in figure 4 for patients (N=8714) completing the hip CAT at intake. Ninety-six percent of the patients had FS scores between 20 and 80, whereas the upper level 95% CIs for the SEs were between 4.3 and 8.7 (out of 100).

Construct Validity 

Hip CAT FS estimates discriminated patients in clinically logical ways for age, symptom acuity, surgical history, condition complexity, and prior exercise history (table 3). By clinically logical ways, we mean that the groups expected to report worse FS did so: patients who were older, had more chronic symptoms, had more surgeries, had more comorbidities, or did not exercise before receiving rehabilitation reported worse (ie, lower) discharge FS compared with other patients in each independent variable after controlling for intake FS. The number of functional comorbidities was not normally distributed (Shapiro-Wilk W, P<.001), so the number of comorbidities was grouped by quartiles (ie, 0, 1, 2, and 3 or more comorbidities).

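Table 3. Known Group Construct Validity Assessment of Hip CAT Functional Status Change Estimates

Variable | Level of Independent Variable | n | Mean ± SE | F (df), P
Age group (y) | 18–44 | 802 | 67±.52 | F(2,3573)=52, P<.001
Age group (y) | 45–64 | 1491 | 63±.38 |
Age group (y) | ≥65 | 1284 | 61±.41 |
Symptom acuity | Acute | 575 | 67±.61 | F(2,3571)=51, P<.001
Symptom acuity | Subacute | 1031 | 64±.45 |
Symptom acuity | Chronic | 1969 | 61±.33 |
Surgical history | None | 2795 | 64±.28 | F(1,3834)=33, P<.001
Surgical history | One or more | 1042 | 61±.47 |
Comorbidities | 0 | 769 | 68±.54 | F(3,3709)=48, P<.001
Comorbidities | 1 | 855 | 65±.50 |
Comorbidities | 2 | 752 | 63±.53 |
Comorbidities | ≥3 | 1338 | 60±.41 |
Exercise history | 3/wk or more | 1696 | 65±.36 | F(1,3834)=25, P<.001
Exercise history | Seldom or never to 2–3/wk | 2141 | 62±.32 |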

NOTE. Covariate was intake functional status, and each was significant (P<.001); values are means ± SE, unless otherwise noted.

Abbreviations: Fdf, analysis of covariance F statistic for independent variable with df; P, the probability of the F statistic of the analysis of covariance.

Scheffé post hoc pairwise comparison probabilities all P<.01.

Scheffé post hoc pairwise comparisons probability for 1 or 2 comorbidities was P=.273, but all other pairwise comparisons probabilities were P<.01.

Post hoc analyses showed that discharge FS scores discriminated patients by surgical history (ie, none, 1, 2, 3, 4 or more) (F4,3831=10, P<.001), but there were no differences (P>.05) in discharge FS between patients who had 1 or more surgical procedures so patients were grouped by no surgery versus 1 or more surgeries for the final analysis. Similar results were observed for exercise history. The first analysis showed discharge FS estimates discriminated patients by exercise history (F2,3833=14, P<.001), but there was no difference (P=.24) in discharge FS for patients who exercised seldom or never (62±.41) compared with patients who exercised 1 to 2 times a week (63±.52) (adjusted least square means ± SE), so patients were grouped by exercising 3 times a week or more versus not exercising to exercising 2 times a week for the final analysis.
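The analyses of covariance summarized in table 3 can be sketched as a comparison of two least-squares models: one with intake FS only and one with intake FS plus group indicators. The sketch below is an illustration under that assumption, written with the standard library only; the function names and the toy data are hypothetical and are not taken from the study dataset.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def rss(rows, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, v in zip(rows, y))

def ancova_f(groups, covariate, outcome):
    """F test of the group effect after adjusting for one covariate."""
    levels = sorted(set(groups))
    n, k = len(outcome), len(levels)
    reduced = [[1.0, c] for c in covariate]
    # Full model adds one indicator per level beyond the reference level.
    full = [[1.0, c] + [1.0 if g == lv else 0.0 for lv in levels[1:]]
            for g, c in zip(groups, covariate)]
    rss_reduced, rss_full = rss(reduced, outcome), rss(full, outcome)
    df1, df2 = k - 1, n - k - 1
    f = ((rss_reduced - rss_full) / df1) / (rss_full / df2)
    return f, df1, df2

# Toy data (hypothetical): 3 age groups, 4 patients each.
groups = ["18-44"] * 4 + ["45-64"] * 4 + ["65+"] * 4
intake = [40, 50, 60, 45, 42, 52, 58, 47, 41, 49, 61, 46]
discharge = [66, 70, 75, 69, 62, 67, 69, 64, 58, 63, 68, 60]

f_stat, df1, df2 = ancova_f(groups, intake, discharge)
print(f"F({df1},{df2}) = {f_stat:.1f}")
```

The degrees of freedom follow the same rule as in table 3: with 3 age groups and 802+1491+1284=3577 patients, the error df are 3577−3−1=3573.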

Sensitivity to Change and Responsiveness 

A majority (64% and 61%) of patients with FS discharge scores reported FS change scores equal to or greater than MDC90 and MDC95, respectively (see table 2). Plots of MDC90 and MDC95 per FS range (fig 5) showed MDC was dependent on the FS level. Ninety-six percent of the patients whose FS discharge scores reached MDC90 had intake FS scores between 20 and 80. Ninety-one percent of the patients whose FS discharge scores reached MDC95 had intake FS scores between 20 and 70. On average, the upper CI limit of MDC90 and MDC95 values for all patients were 14 and 16.7, respectively. However, the average MDC90 upper CI limit for the 96% of patients with FS intake scores between 20 and 80 was 7.1, and the average MDC95 upper CI limit for the 91% of patients with FS intake scores between 20 and 70 was 7.6.

There were 1564 patients with both GROC and FS change data. Of these, 254 (16%) reported no change, and 1310 (84%) reported improvement. Results of the ROC analyses are displayed in table 4. When using all patients, ROC analyses supported a change of 6 or more FS units as representing clinically meaningful improvement, and 2615 (64%) patients with discharge data reported FS change equal to or greater than the MCII. When patients were grouped by baseline FS measures and 4 ROC analyses were run (1 per quartile of FS intake measures), 2621 (65%) patients reported FS change scores equal to or greater than the MCII. Results suggested that hip FS measures were responsive, and the MCII was dependent on the intake FS, with patients perceiving improvement with fewer FS units as intake FS scores increased.

Table 4. Responsiveness Assessment Using ROC Analyses
Sample by Intake FS Score | Improved (change ≥3), n | No Change (change <3), n | ROC Cut Point | AUC | SE | 95% CI | ≥MCII (%)
All patients | 1310 | 254 | 6 | .716 | .016 | .684–.748 | 64
Intake FS 0 to 40 | 267 | 66 | 11 | .714 | .035 | .645–.783 | 62
Intake FS >40 to 48 | 322 | 62 | 6 | .710 | .035 | .641–.779 | 67
Intake FS >48 to 59 | 347 | 56 | 4 | .750 | .030 | .692–.808 | 65
Intake FS >59 to 100 | 348 | 66 | 2 | .736 | .030 | .677–.795 | 64

Abbreviation: AUC, area under the curve.
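The AUC and SE columns of table 4 can be illustrated with a short sketch. The AUC is computed as the proportion of improved versus unchanged patient pairs in which the improved patient has the larger FS change (ties count one half). For the SE we assume the Hanley and McNeil approximation; the section above does not state which estimator the analysis software used, so agreement with the table is not guaranteed. The toy change scores are hypothetical.

```python
import math

def auc(improved, unchanged):
    """Probability that a random improved patient has a larger FS change
    than a random unchanged patient (ties count one half)."""
    wins = 0.0
    for a in improved:
        for b in unchanged:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(improved) * len(unchanged))

def hanley_mcneil_se(area, n_pos, n_neg):
    """Approximate SE of an AUC (Hanley and McNeil approximation)."""
    q1 = area / (2.0 - area)
    q2 = 2.0 * area ** 2 / (1.0 + area)
    variance = (area * (1.0 - area)
                + (n_pos - 1) * (q1 - area ** 2)
                + (n_neg - 1) * (q2 - area ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)

# Toy FS change scores (hypothetical, not study data).
toy_auc = auc(improved=[8, 12, 5, 20], unchanged=[2, 5, -1])

# AUC and group sizes from the "All patients" row of table 4.
se_all = hanley_mcneil_se(0.716, n_pos=1310, n_neg=254)
ci_low, ci_high = 0.716 - 1.96 * se_all, 0.716 + 1.96 * se_all
print(f"toy AUC = {toy_auc:.3f}")
print(f"SE = {se_all:.3f}, 95% CI = {ci_low:.3f} to {ci_high:.3f}")
```

With the group sizes and AUC reported for all patients, this approximation gives an SE of roughly .016, close to the tabled value; the quartile rows need not agree as closely if a different estimator was used.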


Discussion 

CATs are common in standardized testing for licensure, certification, and admissions tests4, 6 but have only recently begun to be used to collect routine clinical data in busy outpatient rehabilitation clinics in the United States18, 31, 32 and Israel.30 Results of the current study are consistent with similar analyses of prospectively collected knee CAT data31 and support Jette et al's18 conclusions that the potential is good for CATs to efficiently generate estimates of functional status that are precise, valid, and sensitive to change. These conclusions are encouraging for clinicians, managers of outpatient rehabilitation clinics, and researchers who cannot be burdened with an onerous outcomes data collection process. With improved data-collection efficiency available with CAT administrations, clinicians, researchers, and policy makers will benefit as more constructs are assessed by using adaptive technology, which will improve the assessment of multi-dimensional patient outcomes, such as adjusting FS outcomes by level of fear avoidance of physical activities.82

The hip CAT was designed for the purpose of assessing FS change in patients with hip impairments receiving therapy in rehabilitation clinics. The research team started with an existing paper-and-pencil survey that was conceptually well designed and generated measures of FS with strong psychometric properties42, 43, 45 and applied modern psychometric methods to retrospectively collected data.26 Results in this study confirm the psychometric strength of the hip CAT suggested by simulations.26 Current analyses suggest that the CAT functions well in a large sample of prospectively collected data. Typically, the CAT administered 7 items and produced precise FS estimates that adequately covered the range of FS, identified patients who improved, and discriminated patients in clinically logical ways. Floor effects were negligible, but data suggest the potential for a slight ceiling effect for patients at discharge. In addition, in a previous study,30 the average CAT survey entry time ± SD for patients (n=170 surveys) using the hip CAT was measured as 3:07±1:54 min:s (median=2:46, minimum=0:50, maximum=19:04).

One concern for the hip CAT is the small, 18-item item bank. Although large CAT item banks are recommended as ideal,83 there is no standard definition for how many items represent a large item bank. In previous studies, CATs have been developed by using 25 (ref 27), 30 (ref 84), 37 (refs 2, 25), 53 (ref 3), 64 (ref 1), 65 (ref 18), and 120 (ref 18) items, and all research teams provided statistics supporting the conclusion that the CATs functioned efficiently and produced precise measures over the latent trait continuum. However, more important than the total number of items is the number of items that target the range of trait levels in the assessed samples.25, 27 The current analyses suggest that the 18-item hip CAT covered the FS content range adequately with the possible exception of the highest-functioning patients at discharge. The results suggest that it would be prudent to add more items targeting higher levels of FS. As previously shown,27, 85 combining existing instruments can produce an item bank that covers the content range with more precision and improved sensitivity to change. Work is underway that concurrently calibrates physical-functioning items from the Medical Outcomes 36-Item Short Form Health Survey physical functioning scale86 and lower-functioning items from Hart and Wright87 to the hip CAT items. Adding items to item banks for the purpose of expanding the coverage of trait assessment is common for item response theory-developed item banks.1, 3, 27, 85, 88, 89 The advantage of a large item bank is the availability of items that, although rarely administered, target uncommon levels of functional status when patients at those levels are assessed. Therefore, large item banks assist in precisely measuring all examinees at all levels of FS.

The items used in the hip CAT address the activity dimension of the World Health Organization's International Classification of Functioning, Disability and Health.47 The items are worded for specific attribution related to hip impairment. We elected to develop a more condition-specific scale because earlier work comparing generic with condition-specific scales suggested greater sensitivity to change of FS measures estimated when using condition-specific scales,42 and we wanted the hip CAT FS measures to be as sensitive to change as possible because of its intended use as a routine outcomes instrument in busy rehabilitation clinics. Although the hip CAT estimated FS measures that were sensitive to change, the ceiling effect at discharge probably eroded the potential sensitivity to change of the FS measures.

Of interest, the LEFS items showed DIF by body part treated in a sample of patients receiving outpatient therapy for lower-extremity impairments.26 For example, patients with hip impairments perceived squatting to be easier than patients with knee impairments, and patients with hip impairments perceived lifting an object like a bag of groceries from the floor to be harder than patients with knee, foot, or ankle impairments.26 There are many ways to assess DIF,50 which when present can erode measure validity.90 Given our previous results suggesting items assessing functional status are commonly affected by body part DIF,26, 91 meticulous testing for DIF, including DIF by body part, in FS item banks is recommended.

The paper-and-pencil version of the LEFS has been shown to produce FS estimates that are sensitive to change from intake to discharge with MDC90 of 9 scale points, which for the original LEFS 0 to 80 scale represents 11.25% of the scale range,42 without regard to the variability of SE estimates over the scale range. The current results and those of Jette et al18 show that SE estimates vary over the FS range. When SE is used to estimate the MDC, the MDC estimates therefore also vary over the FS range, as shown by the current data.
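The dependence of the MDC on the SE can be made concrete with a small worked example. The sketch assumes one widely used convention, MDC = z × √2 × SE, where z is 1.645 for 90% and 1.960 for 95% confidence; the exact calculation used in this study is defined in its methods and may differ. The SE values per FS range below are hypothetical and serve only to show the pattern.

```python
import math

# z values for two-sided 90% and 95% confidence.
Z = {90: 1.645, 95: 1.960}

def mdc(se, confidence=90):
    """Minimal detectable change under the assumed convention
    z * sqrt(2) * SE (an assumption, not the study's stated formula)."""
    return Z[confidence] * math.sqrt(2.0) * se

# Paper-and-pencil LEFS: MDC90 of 9 points on a 0 to 80 scale.
lefs_percent = 9 / 80 * 100

# Hypothetical SEs (0 to 100 scale) by FS range, larger at the extremes.
se_by_range = {"0-20": 9.0, "20-40": 5.0, "40-60": 4.5,
               "60-80": 5.5, "80-100": 10.0}
mdc_by_range = {r: mdc(se, 90) for r, se in se_by_range.items()}

print(f"LEFS MDC90 as percent of scale range: {lefs_percent:.2f}%")
for fs_range, value in mdc_by_range.items():
    print(f"FS {fs_range}: MDC90 = {value:.1f}")
```

Any formula that multiplies the SE by a constant produces the same qualitative result: a single MDC for the whole scale overstates precision at the extremes and understates it in the middle of the FS range.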

Of more interest is the estimate of the percent of patients whose FS change from intake to discharge equaled or exceeded the MDC. Jette et al18 used a test-retest reliability estimate from a paper-and-pencil short form to calculate the MDC, and, in the current study, we used SE per FS range to calculate the MDC. Neither of these CATs has published test-retest reliability estimates. Jette et al reported that 66% of patients with lower-extremity impairments obtained MDC90 without regard to the variability of reliability estimates over the FS range. Using varying SEs over the FS range for the hip CAT data to calculate MDCs, 64% and 61% of patients obtained MDC90 and MDC95, respectively, which are similar to earlier results18 despite several differences between the studies: the hip CAT assessed FS change over a longer treatment duration, used items with disease attribution wording, and estimated SEs for specific FS ranges. The influence of these differences could not be tested given the current design, and, therefore, we await future investigations.

Binkley et al42 reported that 11.25% of the scale range (a 9-unit scale improvement) represents a meaningful change for FS estimates from the paper-and-pencil LEFS compared with 6% (MCII=6) of the scale range using the hip CAT FS estimates. No method of MCII assessment is devoid of limitations,75 and given that our study design was observational using retrospective data analyses, we used the data available for MCII assessment in the FOTO dataset. Our results suggest that the MCII is dependent on baseline FS measures, which was not unexpected.69 Adjusting for baseline FS, the percent of patients with FS change equal to or greater than the MCII ranged from 62% to 67%, which suggests that the FS estimates from the hip CAT were responsive.

The hip CAT and the activity measure for postacute care item bank and CAT assessment platform18 used similar item-selection routines but different item response theory models, theta estimation routines, and stopping rules. Despite these operational characteristic differences, practical applications of the CATs were strikingly similar; both averaged approximately 7 items and less than 3 minutes per CAT administration.

Patients had the option of selecting not applicable while using the hip CAT. When used, not applicable was more likely to be associated with more difficult items, which was not our intention for adding this response. We expected patients to answer that they cannot perform the activity if the activity appears too difficult. It appears the availability of the not applicable response and the limited item bank increased the average number of items used in the CAT, which warrants further study.

The hip CAT was developed by using the 1-parameter Andrich rating scale item response theory model.49 The rating scale model assumes similar item discrimination parameters and assumes that item location is the only item characteristic that influences examinee performance.7, 8, 11 There is evidence that 2-parameter item response theory models that account for differences in item-discrimination parameters are important in functional status items.89 Our initial assessments of practical implications of 1- versus 2-parameter item response theory models in similar outpatient rehabilitation data produced equivocal results with patient FS ability estimates from 1- and 2-parameter item response theory models being highly correlated.25, 27 Further investigations relating to the practical application of different item response theory models are warranted.
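The difference between the two model families can be illustrated with a minimal sketch. In the rating scale model every item has a discrimination of 1, so only item location shifts the response probabilities. Multiplying each step by an item-specific discrimination, as in a generalized partial credit-style extension, changes how steeply the expected item score rises with the person measure. The parameterization, function names, and numeric values below are assumptions chosen for illustration and are not the models fitted in the cited studies.

```python
import math

def category_probs(theta, location, thresholds, discrimination=1.0):
    """Category probabilities for one polytomous item.

    discrimination=1.0 gives the rating scale model; other values give
    a generalized partial credit-style item with shared thresholds."""
    logits = [0.0]
    for tau in thresholds:
        step = discrimination * (theta - location - tau)
        logits.append(logits[-1] + step)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, location, thresholds, discrimination=1.0):
    """Expected item score at a given person measure."""
    probs = category_probs(theta, location, thresholds, discrimination)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical 4-category item located at 0 logits.
thresholds = [-1.0, 0.0, 1.0]
rise = {}
for a in (1.0, 2.0):
    low = expected_score(-0.5, 0.0, thresholds, a)
    high = expected_score(0.5, 0.0, thresholds, a)
    rise[a] = high - low
    print(f"discrimination {a}: score rises {rise[a]:.2f} over 1 logit")
```

In this example the item with the larger discrimination separates patients near its location more sharply, which is the item characteristic that the 1-parameter model assumes to be equal across items.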

One advantage of using item response theory methods and CAT applications to assess PROs is the flexibility in the theta range score transformations that can be accomplished instantaneously via a computer. Measures can be transformed many ways. For example, hip CAT measures were linearly transformed from logits (mean ± SD, 0±1) into a 0 to 100 score range for the patients assessed. Others transformed logits into a norm-based mean ± SD of 50±10 by using patient assessments as the norm.18 Others transform PRO scores such that the resultant scale range has a mean ± SD of 50±10 norm based on samples from the general U.S. population.92 In each, higher values represent more functioning. However, the more important issue is the interpretation of the scores. The 0 to 100 scale, which may be intuitively acceptable to clinicians, can be misunderstood as a percent of a construct that cannot be defined (ie, what is 100% of functional status?). Scales based on population norms provide comparisons commonly based on SDs of a sample, but interpretations of these measures have not been adopted readily by clinicians because of a lack of clinical meaningfulness for the scores and a lack of direct clinical applicability to individual patients.42 In the current study, we provided interpretations of the hip CAT scale that were meaningful to patients (ie, minimal clinically important improvement), which we believe encourages a reasonable method of measure interpretation.
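The two kinds of transformation described above are both linear and can be sketched in a few lines. The logit anchors for the 0 to 100 scale and the norm mean and SD below are hypothetical values; the constants actually used for the hip CAT and for the norm-based scales are not given in this section.

```python
def to_0_100(theta, theta_min, theta_max):
    """Linear map of a logit measure onto a 0 to 100 scale."""
    return 100.0 * (theta - theta_min) / (theta_max - theta_min)

def to_norm_score(theta, norm_mean, norm_sd):
    """Norm-based score with mean 50 and SD 10 in the norming sample."""
    return 50.0 + 10.0 * (theta - norm_mean) / norm_sd

# Hypothetical anchors and norms, for illustration only.
THETA_MIN, THETA_MAX = -5.0, 5.0
NORM_MEAN, NORM_SD = 0.0, 2.0

theta = 1.0  # one patient's measure in logits
fs_0_100 = to_0_100(theta, THETA_MIN, THETA_MAX)
fs_norm = to_norm_score(theta, NORM_MEAN, NORM_SD)
print(f"0 to 100 scale: {fs_0_100:.1f}; norm-based scale: {fs_norm:.1f}")
```

Because both maps are linear, they preserve the ordering and relative spacing of patients; only the interpretation of the resulting numbers differs, which is the issue the paragraph above raises.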

The percent of patients with intake data who also had discharge data, or completion rate,30 was 44%. In an observational study, this raises the potential for patient selection bias.61 Patients who had both intake and discharge data compared with patients with just intake data were different by age, exercise history, surgical history, and intake FS scores, which might have influenced results. In addition, only 50% of the patients had diagnostic or surgical procedural codes, which erodes our ability to clearly describe the patients in the samples, and 83% of the patients did not have GROC data for ROC analyses, which erodes the generalizability of the responsiveness results beyond the sample studied. Future studies should endeavor to reduce this potential selection bias and improve patient description by using more complete data.

Finally, data from the hip CAT were collected by using a proprietary database management company, FOTO,40, 41 and, as in almost all investigations, there is the potential for bias. However, the use of proprietary database-management companies offers the opportunity for studies related to practical application and psychometric adequacy of CATs in large samples30, 31 that would not be available under routine extramural funding projects.

Study Limitations 

The datasets analyzed in this study represented routine daily clinical data. This type of data is affected by clinician expectations, clinic administrative issues, and patient realities over which the study design offered no control. Therefore, some data describing the patients were missing, including global rating of change data. Although the sample size was admirable, the missing data might have affected the results. Future studies should endeavor to collect more complete data to reduce the potential for bias related to these concerns.


Conclusions 

Data from a large convenience sample of patients treated in outpatient rehabilitation clinics for impairments to their hip suggest that the 18-item hip CAT produced precise, valid, sensitive, and responsive FS measures efficiently. The hip CAT is currently used routinely in many outpatient rehabilitation clinics across the United States and Israel, which attests to its efficiency. Expansion of the item bank is warranted.

Supplier


Acknowledgment 

The authors thank Karon F. Cook, PhD, for her insightful comments regarding statistical analyses, results, and manuscript edits.


References 

  1. Fliege H, Becker J, Walter OB, Bjorner JB, Klapp BF, Rose M. Development of a computer-adaptive test for depression (D-CAT). Qual Life Res. 2005;14:2277–2291
  2. Kosinski M, Bjorner JB, Ware JE, Sullivan E, Straus WL. An evaluation of a patient-reported outcomes found computerized adaptive testing was efficient in assessing osteoarthritis impact. J Clin Epidemiol. 2006;59:715–723
  3. Ware JE, Bjorner JB, Kosinski M. Practical implications of item response theory and computerized adaptive testing: a brief summary of ongoing studies of widely used headache impact scales. Med Care. 2000;38(9 Suppl):II73–II82
  4. Mills CN, Potenza MT, Fremer JJ, Ward WC. Computer-based testing (Building the foundation for future assessments). Mahwah: Lawrence Erlbaum Associates; 2002
  5. Sands WA, Waters BK, McBride JR. Computerized adaptive testing (From inquiry to operation). Washington (DC): American Psychological Assoc; 1997
  6. Wainer H. Computerized adaptive testing (A primer). 2nd ed. Mahwah: Lawrence Erlbaum Associates; 2000
  7. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response theory. Newbury Park: Sage; 1991
  8. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care. 2000;38(9 Suppl):II28–II42
  9. Bjorner JB, Kosinski M, Ware JE. The feasibility of applying item response theory to measures of migraine impact: a re-analysis of three clinical studies. Qual Life Res. 2003;12:887–902
  10. Cella D, Yount S, Rothrock N, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap cooperative group during its first two years. Med Care. 2007;45(5 Suppl 1):S3–S11
  11. Hambleton RK. Emergence of item response modeling in instrument development and data analysis. Med Care. 2000;38(9 Suppl):II60–II65
  12. Institute of Medicine. Crossing the quality chasm: a new health system for the 21st century. Washington (DC): National Academy Pr; 2001
  13. Institute of Medicine. Rewarding provider performance: aligning incentives in Medicare. Washington (DC): National Academies Pr; 2006
  14. Porter ME, Teisberg EO. Redefining health care (Creating value-based competition on results). Boston: Harvard Business School Pr; 2006
  15. Lord FM, Novick MR. Statistical theories of mental test scores. Reading: Addison-Wesley; 1968
  16. Lord F. Some test theory for tailored testing. In: Holtzman W, editor. Computer-assisted instruction, testing, and guidance. New York: Harper and Row; 1970. p. 139–183
  17. Ware JE, Kosinski M, Bjorner JB, et al. Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Qual Life Res. 2003;12:935–952
  18. Jette AM, Haley SM, Tao W, et al. Prospective evaluation of the AM-PAC-CAT in outpatient rehabilitation settings. [published erratum appears in Phys Ther 2007;87:617] Phys Ther. 2007;87:385–398
  19. McHorney CA. Generic health measurement: past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med. 1997;127(8 Pt 2):743–750
  20. Patrick DL, Chiang YP. Convening health outcomes methodologists. Med Care. 2000;38(9 Suppl):II3–II6
  21. Revicki DA, Cella DF. Health status assessment for the twenty-first century: item response theory, item banking and computer adaptive testing. Qual Life Res. 1997;6:595–600
  22. Dijkers MP. A computer adaptive testing simulation applied to the FIM instrument motor component. Arch Phys Med Rehabil. 2003;84:384–393
  23. Gardner W, Kelleher KJ, Pajer KA. Multidimensional adaptive testing for mental health problems in primary care. Med Care. 2002;40:812–823
  24. Haley SM, Ni P, Hambleton RK, Slavin MD, Jette AM. Computer adaptive testing improved accuracy and precision of scores over random item selection in a physical functioning item bank. J Clin Epidemiol. 2006;59:1174–1182
  25. Hart DL, Cook KF, Mioduski JE, Teal CR, Crane PK. Simulated computerized adaptive test for patients with shoulder impairments was efficient and produced valid measures of function. J Clin Epidemiol. 2006;59:290–298
  26. Hart DL, Mioduski JE, Stratford PW. Simulated computerized adaptive tests for measuring functional status were efficient with good discriminant validity in patients with hip, knee, or foot/ankle impairments. J Clin Epidemiol. 2005;58:629–638
  27. Hart DL, Mioduski JE, Werneke MW, Stratford PW. Simulated computerized adaptive test for patients with lumbar spine impairments was efficient and produced valid measures of function. J Clin Epidemiol. 2006;59:947–956
  28. Haley SM, Fragala-Pinkham M, Ni P. Sensitivity of a computer adaptive assessment for measuring functional mobility changes in children enrolled in a community fitness programme. Clin Rehabil. 2006;20:616–622
  29. Ware JE, Gandek B, Sinclair SJ, Bjorner J. Item response theory in computer adaptive testing: implications for outcomes measurement in rehabilitation. Rehabil Psychol. 2005;50:71–78
  30. Deutscher D, Hart DL, Dickstein R, Horn SD, Gutvirtz M. Implementing an integrated electronic outcomes and electronic health record process to create a foundation for clinical practice improvement. Phys Ther. 2008;88:270–285
  31. Hart DL, Wang YC, Stratford PW, Mioduski JE. Computerized adaptive test for patients with knee impairments produced valid and responsive measures of function. J Clin Epidemiol. 2008;Jul 9. [Epub ahead of print]
  32. Hart DL, Connolly JB. Pay-for-Performance for Physical Therapy and Occupational Therapy: Medicare Part B Services (Health & Human Services/Centers for Medicare & Medicaid Services; 2006). www.cms.hhs.gov/TherapyServices/downloads/P4PFinalReport06-01-06.pdf. Accessed August 27, 2008
  33. Ader DN. Developing the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45(5 Suppl 1):S1–S2
  34. Haley SM, Coster WJ, Andres PL, Kosinski M, Ni P. Score comparability of short forms and computerized adaptive testing: simulation study with the activity measure for post-acute care. Arch Phys Med Rehabil. 2004;85:661–666
  35. Haley SM, Coster WJ, Andres PL, et al. Activity outcome measurement for postacute care. Med Care. 2004;42(1 Suppl):I49–I61
  36. American Physical Therapy Association. Guide to physical therapist practice. Phys Ther. 2001;81:1–768
  37. Resnik L, Hart DL. Using clinical outcomes to identify expert physical therapists. Phys Ther. 2003;83:990–1002
  38. Werneke MW, Hart DL. Centralization phenomenon as a prognostic factor for chronic low back pain and disability. Spine. 2001;26:758–764
  39. 2007 Physician Quality Reporting Initiative (PQRI). Physician Quality Measures (Centers for Medicare and Medicaid Services; 2007). http://www.cms.hhs.gov/PQRI/EmailUpdates/list.asp#TopOfPage. Accessed August 27, 2008
  40. Dobrzykowski EA, Nance T. The Focus On Therapeutic Outcomes (FOTO) Outpatient Orthopedic Rehabilitation Database: results of 1994–1996. J Rehabil Outcomes Meas. 1997;1:56–60
  41. Swinkels IC, van den Ende CH, de Bakker D, et al. Clinical databases in physical therapy. Physiother Theory Pract. 2007;23:153–167
  42. Binkley JM, Stratford PW, Lott SA, Riddle DL. The Lower Extremity Functional Scale (LEFS): scale development, measurement properties, and clinical application (North American Orthopaedic Rehabilitation Research Network). Phys Ther. 1999;79:371–383
  43. Alcock GK, Stratford PW. Validation of the Lower Extremity Functional Scale on athletic subjects with ankle sprains. Physiother Can. 2002;54:233–240
  44. Stratford PW. Getting more from the literature: estimating the standard error of measurement from reliability studies. Physiother Can. 2004;56:27–30
  45. Stratford PW, Binkley JM, Watson J, Heath-Jones T. Validation of the LEFS on patients with total joint arthroplasty. Physiother Can. 2000;52:97–205
  46. Stratford PW, Hart DL, Binkley JM, Kennedy DM, Alcock GK, Hanna SE. Interpreting lower extremity functional status scores. Physiother Can. 2005;57:154–162
  47. World Health Organization. International Classification of Functioning, Disability and Health. Geneva: World Health Organization; 2001
  48. Hart AC, Stegman MS. ICD-9-CM 2008 expert. 6th ed. Salt Lake City: Ingenix; 2007
  49. Andrich D. A rating formulation for ordered response categories. Psychometrika. 1978;43:561–573
  50. Millsap RE, Everson HT. Methodology review: statistical approaches for assessing measurement bias. Appl Psychol Meas. 1993;17:287–334
  51. Crane PK, Hart DL, Gibbons LE, Cook KF. A 37-item shoulder functional status item pool had negligible differential item functioning. J Clin Epidemiol. 2006;59:478–484
  52. Thissen D, Mislevy RJ. Testing algorithms. In: Wainer H, editor. Computerized adaptive testing: a primer. 2nd ed. Mahwah: Lawrence Erlbaum Associates; 2000. p. 101–134
  53. Hart DL, Mioduski JE. CAT development and testing software user's guide. Knoxville: FOTO Inc; 2006
  54. Lord FM. Applications of item response theory to practical testing problems. Hillsdale: Lawrence Erlbaum Associates; 1980
  55. Linacre JM. Estimating measures with known polytomous item difficulties. Rasch Meas Trans. 1998;12:638
  56. Folk VG, Smith RL. Models for delivery of CBTs. In: Mills CN, Potenza MT, Fremer JJ, Ward WC, editors. Computer-based testing: building the foundation for future assessments. Mahwah: Lawrence Erlbaum Associates; 2002. p. 41–66
  57. Groll DL, To T, Bombardier C, Wright JG. The development of a comorbidity index with physical function as the outcome. J Clin Epidemiol. 2005;58:595–602
  58. Shapiro S, Wilk MB. An analysis of variance test for normality. Biometrika. 1965;52:591–611
  59. Jette DU, Jette AM. Physical therapy and health outcomes in patients with spinal impairments. Phys Ther. 1996;76:930–945
  60. Jette DU, Jette AM. Physical therapy and health outcomes in patients with knee impairments. Phys Ther. 1996;76:1178–1187
  61. Resnik L, Feng Z, Hart DL. State regulation and the delivery of physical therapy services. [published erratum appears in Phys Ther 1997;77:113] Health Serv Res. 2006;41(4 Pt 1):1296–1316
  62. Hart DL. The power of outcomes: FOTO Industrial Outcomes Tool-Initial Assessment. Work. 2001;16:39–51
  63. Hart DL, Dobrzykowski EA. Influence of orthopaedic clinical specialist certification on clinical outcomes. J Orthop Sports Phys Ther. 2000;30:183–193
  64. Hart DL, Dobrzykowski EA. Impact of exercise history on health status outcomes in patients with musculoskeletal impairments. Orthop Phys Ther Clin N Am. 2000;9:1–16
  65. Beaton DE, Bombardier C, Katz JN, Wright JG. A taxonomy for responsiveness. J Clin Epidemiol. 2001;54:1204–1217
  66. Liang MH. Longitudinal construct validity: establishment of clinical meaning in patient evaluative instruments. Med Care. 2000;38(9 Suppl):II84–II90
  67. Wright JG, Young NL. A comparison of different indices of responsiveness. J Clin Epidemiol. 1997;50:239–246
  68. Kirshner B, Guyatt G. A methodological framework for assessing health indices. J Chronic Dis. 1985;38:27–36
  69. Riddle DL, Stratford PW, Binkley JM. Sensitivity to change of the Roland-Morris Back Pain Questionnaire: part 2. Phys Ther. 1998;78:1197–1207
  70. Beaton DE, Hogg-Johnson S, Bombardier C. Evaluating changes in health status: reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol. 1997;50:79–93
  71. Liang MH. Evaluating measurement responsiveness. J Rheumatol. 1995;22:1191–1192
  72. Stratford PW, Binkley FM, Riddle DL. Health status measures: strategies and analytic methods for assessing change scores. Phys Ther. 1996;76:1109–1123
  73. Hsieh YW, Wang CH, Wu SC, Chen PC, Sheu CF, Hsieh CL. Establishing the minimal clinically important difference of the Barthel Index in stroke patients. Neurorehabil Neural Repair. 2007;21:233–238
  74. Stratford PW, Riddle DL. Assessing sensitivity to change: choosing the appropriate change coefficient. Health Qual Life Outcomes. 2005;3:23
  75. Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol. 2003;56:395–407
  76. Jaeschke R, Singer J, Guyatt GH. Measurement of health status (Ascertaining the minimal clinically important difference). Control Clin Trials. 1989;10:407–415
  77. Lingard EA, Riddle DL. Impact of psychological distress on pain and function following knee arthroplasty. J Bone Joint Surg Am. 2007;89:1161–1169
  78. Tubach F, Ravaud P, Baron G, et al. Evaluation of clinically relevant changes in patient reported outcomes in knee and hip osteoarthritis: the minimal clinically important improvement. Ann Rheum Dis. 2005;64:29–33
  79. Tubach F, Ravaud P, Beaton D, et al. Minimal clinically important improvement and patient acceptable symptom state for subjective outcome measures in rheumatic disorders. J Rheumatol. 2007;34:1188–1193
  80. Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chronic Dis. 1986;39:897–906
  81. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36
  82. George SZ, Fritz JM, Bialosky JE, Donald DA. The effect of a fear-avoidance-based physical therapy intervention for patients with acute low back pain: results of a randomized clinical trial. Spine. 2003;28:2551–2560
  83. Chakravarty EF, Bjorner JB, Fries JF. Improving patient reported outcomes using item response theory and computerized adaptive testing. J Rheumatol. 2007;34:1426–1431
  84. Dodd BG, Koch WR, De Ayala RJ. Operational characteristics of adaptive testing procedures using the Graded Response Model. Appl Psychol Meas. 1989;13:129–143
  85. Martin M, Kosinski M, Bjorner JB, Ware JE, Maclean R, Li T. Item response theory methods can improve the measurement of physical function by combining the modified health assessment questionnaire and the SF-36 physical function scale. Qual Life Res. 2007;16:647–660
  86. Ware JE, Sherbourne CD. The MOS 36-item short-form health survey (SF-36) (I. Conceptual framework and item selection). Med Care. 1992;30:473–483
  87. Hart DL, Wright BD. Development of an index of physical functional health status in rehabilitation. Arch Phys Med Rehabil. 2002;83:655–665
  88. Jette AM, Haley SM, Ni P. Comparison of functional status tools used in post-acute care. Health Care Financ Rev. 2003;24:13–24
  89. McHorney CA, Cohen AS. Equating health status measures with item response theory: illustrations with functional status items. Med Care. 2000;38(9 Suppl):II43–II59
  90. Steinberg L, Thissen D, Wainer H. Validity. In: Wainer H, editor. Computerized adaptive testing: a primer. 2nd ed. Mahwah: Lawrence Erlbaum Associates; 2000. p. 185–229
  91. Hart DL. Assessment of unidimensionality of physical functioning in patients receiving therapy in acute, orthopedic outpatient centers. J Outcome Meas. 2000;4:413–430
  92. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45(5 Suppl 1):S22–S31
  93. American Medical Association. Current Procedural Terminology (CPT) 2007, Professional Edition. Chicago: American Medical Assoc; 2006
  • a FOTO Inc, PO Box 11444, Knoxville, TN 37939.

 Supported by Focus On Therapeutic Outcomes, Inc.

 A commercial party having a direct financial interest in the results of the research supporting this article has conferred or will confer a financial benefit on the author or 1 or more of the authors. Hart, Wang, and Mioduski are employees of and Hart is an investor in Focus On Therapeutic Outcomes, Inc, which distributes the hip CAT discussed in this study.

PII: S0003-9993(08)00829-0

doi:10.1016/j.apmr.2008.04.026
