
Volume 89, Issue 2, Pages 275-283 (February 2008)



Computerized Adaptive Testing for Follow-Up After Discharge From Inpatient Rehabilitation: II. Participation Outcomes

Stephen M. Haley, PhD (corresponding author), Barbara Gandek, MS, Hilary Siebens, MD, Randie M. Black-Schaffer, MD, Samuel J. Sinclair, PhD, Wei Tao, BS, Wendy J. Coster, PhD, Pengsheng Ni, MD, Alan M. Jette, PhD

Abstract 

Haley SM, Gandek B, Siebens H, Black-Schaffer RM, Sinclair SJ, Tao W, Coster WJ, Ni P, Jette AM. Computerized adaptive testing for follow-up after discharge from inpatient rehabilitation: II. Participation outcomes.

Objectives

To measure participation outcomes with a computerized adaptive test (CAT) and compare CAT and traditional fixed-length surveys in terms of score agreement, respondent burden, discriminant validity, and responsiveness.

Design

Longitudinal, prospective cohort study of patients interviewed approximately 2 weeks after discharge from inpatient rehabilitation and 3 months later.

Setting

Follow-up interviews conducted in patient’s home setting.

Participants

Adults (N=94) with diagnoses of neurologic, orthopedic, or medically complex conditions.

Interventions

Not applicable.

Main Outcome Measures

Participation domains of mobility, domestic life, and community, social, & civic life, measured using a CAT version of the Participation Measure for Postacute Care (PM-PAC-CAT) and a 53-item fixed-length survey (PM-PAC-53).

Results

The PM-PAC-CAT showed substantial agreement with PM-PAC-53 scores (intraclass correlation coefficient, model 3,1, .71–.81). On average, the PM-PAC-CAT was completed in 42% of the time and with only 48% of the items as compared with the PM-PAC-53. Both formats discriminated across functional severity groups. The PM-PAC-CAT had modest reductions in sensitivity and responsiveness to patient-reported change over a 3-month interval as compared with the PM-PAC-53.

Conclusions

Although continued evaluation is warranted, accurate estimates of participation status and responsiveness to change for group-level analyses can be obtained from CAT administrations, with a sizeable reduction in respondent burden.

Article Outline

Abstract

Methods

Sample

Participation Item Banks

CAT Construction

Participation Fixed-Length Form

Data collection

Statistical Analyses

Results

Score Comparability and Validity

CAT Item Usage and Respondent Burden

Responsiveness

Discussion

Study Limitations

Conclusions

References

Copyright

MEASUREMENT OF REHABILITATION outcomes is evolving, as there is increasing recognition of the importance of collecting information that reflects participation in major life activities.1, 2, 3, 4 This requires rehabilitation outcome measurement to extend beyond basic and intermediate activities of daily living,5, 6 and demands that measures be applicable to the home and community settings where participation in major life activities is assessed. In addition, increasing financial pressures and health care quality concerns are compelling the rehabilitation sector to work toward uniform, valid, and systematic measurement of long-term participation outcomes.7

Several theoretical models have led to various approaches to measuring participation outcomes.8, 9 The World Health Organization’s recent International Classification of Functioning, Disability and Health (ICF) is based on a broad conceptual model that is being used increasingly to guide the development and utilization of rehabilitation outcomes instruments in both acute hospital and postacute health care settings.10, 11 The ICF framework expands on its predecessor, the International Classification of Impairment, Disability and Handicap,12 by including positive and negative experiences; organizing the former impairments, disabilities, and handicaps into 4 components of body functions, body structures, activity, and participation; and including environmental and personal factors. Researchers have been using this framework to guide the development and choice of outcome measures for rehabilitation research.13, 14

Defining and measuring participation (those activities that reflect a person’s involvement in a life situation) for research and clinical purposes requires distinguishing it from the ICF’s concept of activity, which is defined as the “execution of a task or action by an individual.” Whereas a person may be unable to walk around the neighborhood (activity), the person may be able to use public transportation, assistance from family or friends, or adaptive technology to achieve full access to the neighborhood (participation). A person’s level of participation may vary by life domain. The ICF specifically delineates these domains, which include learning and applying knowledge, general tasks and demands, communication, mobility, self-care, domestic life, interpersonal interactions and relationships, major life areas, and community, social, and civic life.15 Of these ICF domains, our early work had identified 2 major composite domains of participation: community and social and/or home.1 However, in this article, we propose 3 major participation domains using the ICF terminology: (1) mobility, (2) domestic life, and (3) community, social, & civic life, based on new calibration data described in this article.

Efficient measurement of participation outcomes is challenging after patients have returned to the community. Computerized adaptive testing (CAT) has been proposed as an alternative to traditional fixed-length instruments that have been used to monitor rehabilitation outcomes16, 17 and appears to be the preferred alternative for future health care outcome assessments.17, 18, 19, 20, 21, 22 CAT systems can tailor item selection to each individual respondent, thus providing breadth of content to individual patients while minimizing the number of items administered in any one assessment.23 During an individual assessment, item selection can be tailored based on responses to prior items; thus, item selection is adapted to meet the expected level of patient functioning. The adapted item selection has the result of reducing respondent burden, measuring all persons on the same metric regardless of care setting, and determining change in a more precise manner because items are targeted to a particular level of functioning.24, 25, 26, 27, 28, 29, 30

A number of empirical simulations of CAT assessments have been performed in rehabilitation and related fields.24, 25, 26, 27, 28, 29, 30, 31, 32, 33 These simulations use item responses from previously collected data to replicate a CAT session by using the most informative items for each person. The simulations clearly indicate that CAT software has potential to minimize response burden and produce accurate group-level scores. Simulation studies, however, tend to overestimate CAT results in patient care settings, because the same items that are used to build the CAT are then used to estimate person scores.16, 33 Prospective studies of CAT in clinical follow-up environments are needed to further evaluate the accuracy, validity, and responsiveness of the outcome measures generated using CAT software. Such studies have been less common than simulation studies in the rehabilitation field to date, but are increasing in frequency.34, 35, 36 Recently, in a companion follow-up study, Haley et al37 found that both full length and CAT versions of activity measures in adults after inpatient hospital rehabilitation provided accurate and responsive estimates for functional activity group-level changes. In general, prospective CAT programs for activity concepts, particularly physical functioning, have performed very well as compared with fixed-length forms.16, 37

Our objective in the current project was to examine the psychometric and operational performance of a CAT in measuring participation outcomes for patients who have recently been discharged from inpatient rehabilitation. Concepts underlying participation are ultimately more complex and may be more difficult to model than activity functioning.3, 15, 38 However, early work has shown that participation items can be placed in meaningful constructs that will meet the general assumptions of item response theory (IRT) models.1 Once items are placed along a continuum and meet basic IRT assumptions,21 they can be developed into a CAT application. We examined the agreement between CAT-generated scores and those derived from a 53-item fixed-length form that measured the identical 3 participation domains of mobility, domestic life, and community, social, & civic life. Second, we examined CAT item usage and the reduction in respondent burden related to CAT. Third, we evaluated the ability of CAT scores to discriminate between patient groups classified using a global clinician-rated severity index. Finally, we examined the sensitivity of the CAT to detect change and examined the responsiveness of CAT in relation to patient-based estimates of their own functional change during the follow-up period. We expected that the advantage of a CAT-based assessment of participation outcomes would be reduced respondent burden with only small losses in score accuracy, discriminant validity, sensitivity and responsiveness, as compared with the longer fixed-length form assessment.

Methods 


Sample 

The initial sample for this study consisted of 149 patients whom we recruited at discharge from an inpatient program at a major rehabilitation hospital. Of these, 111 completed an initial assessment administered by a trained clinician at home at approximately 2 weeks after discharge, and 94 completed a follow-up home visit, approximately 3 months after the initial home visit. The final longitudinal sample of 94 (mean age ± standard deviation [SD], 61.7±17y; range, 20–90y) was stratified based on functional severity, to enroll patients at 2 severity levels—slight to mild impairment (41.5%) and moderate to severe impairment (58.5%), based on scores from an adapted Modified Rankin Scale (MRS).39 This is the identical sample reported in an earlier study on activity outcomes using CAT programs,37 and full details of the sample are presented in that companion article. The institutional review boards of Boston University and Spaulding Rehabilitation Hospital approved the study and all persons signed informed consent forms prior to participation.

Participation Item Banks 

CAT requires a bank of items that measure the participation domains of interest. The Participation Measure for Postacute Care (PM-PAC-CAT) was developed from a separate item calibration study of 518 patients (age, 65±16y; 60% female; 90% white) who were receiving rehabilitation services in outpatient or home care settings. Of the 518 patients, 29.7% had neurologic, 48.7% had orthopedic, and 21.6% had complex medical conditions. Approximately 50% had moderate or severe disability on the MRS at the time of the interview. Short-Form 8-Item Health Survey40 physical component summary mean scores indicated that the physical health (39.8±10) of the sample was 1 SD below the U.S. general population norm of 50. However, the mental component summary (49.9±10.1) was comparable with the general population norm of 50.

Three PM-PAC item banks were constructed to measure the ICF domains of mobility (k=24), domestic life (k=22), and community, social, & civic life (k=33). These banks included items from the Community Integration Questionnaire,41 Functional Status Questionnaire,42 Impact on Participation and Autonomy Questionnaire,43 Medical Outcomes Study,44 National Health Interview Survey (NHIS) 2001 participation module,45 NHIS 1994 disability module,46 PM-PAC,1 Reintegration to Normal Living Index,47 Sickness Impact Profile,48 and U.S. Census.49 In addition, items from the Arthritis Impact Measurement Scale,50 Nottingham Health Profile,51 and Stroke Impact Scale52 were adapted for the item bank. All patients answered the items from both NHISs and the U.S. Census, and core items from the PM-PAC. Approximately one third of the patients answered each of the other measures.

Prior to building CATs for each domain, we evaluated assumptions about the underlying IRT models on which the CAT was based. Unidimensionality, or the assumption that all items within an item bank are measuring the same concept, was evaluated using confirmatory factor analysis of categorical data53 and multitrait scaling methods that evaluate the strength of an item as a measure of 1 domain as opposed to all other hypothesized domains in a particular measure.54 In summary, multitrait scaling methods supported the assignment of the participation items to their hypothesized item banks. A confirmatory factor analysis of the core items completed by all respondents supported a 3-factor model (comparative fit index, .945; root mean square error of approximation, .078; 2% of residual correlations were > ±.20, but the correlations among the 3 factors were high [.70–.89]). Response option characteristic curves for each item were examined to determine whether each response category provided unique information, because IRT models function optimally when each response option has a distinct relationship to the latent continuum.55 This was evaluated using nonparametric statistical methods and the TestGraf software.56,a When analyses indicated that adjacent response options were not unique for a particular item, they were collapsed before fitting the model. The generalized partial-credit model (GPCM) and the maximum marginal likelihood estimation procedure were used to fit IRT models for each item bank, using Parscale software57, 58, 59,b and the weighted maximum likelihood procedure.60 A 2-parameter GPCM was used instead of a 1-parameter model such as the partial credit model because the data did not support the requirement of common item slopes.
Item fit was examined by comparing model predictions to observed data, using a method originally developed for dichotomous items61 and subsequently adapted for polytomous items.62 To ease interpretation, we transformed all original logit scores by multiplying by 10 and adding 50. The effective range of the 50/10 scale differed slightly among the 3 participation domains, but was approximately ±3 SDs (range, 20–80) for each scale.
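The model and score transformation described above can be illustrated with a minimal sketch. It assumes the standard GPCM parameterization (one slope and a set of step thresholds per item); the function names and the item parameters in the usage lines are hypothetical and are not taken from the PM-PAC calibration.

```python
import math

def gpcm_probs(theta, slope, thresholds):
    """Category probabilities under the generalized partial-credit model.

    theta      -- latent trait in logits
    slope      -- item discrimination
    thresholds -- step parameters b_1..b_m for an item with m+1 categories
    """
    # Cumulative sums of slope * (theta - b_v); category 0 has an empty sum.
    cumulative = [0.0]
    for b in thresholds:
        cumulative.append(cumulative[-1] + slope * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def to_reporting_scale(theta_logits):
    """Transform a logit score to the 50/10 metric (multiply by 10, add 50)."""
    return 50.0 + 10.0 * theta_logits

# Hypothetical 4-category item, evaluated at the middle of the scale.
probs = gpcm_probs(theta=0.0, slope=1.2, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
print(to_reporting_scale(-3.0), to_reporting_scale(3.0))  # 20.0 80.0
```

Under this transformation, logit scores of −3 and +3 map to 20 and 80, which matches the approximate effective range reported for each scale.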

CAT Construction 

Once the IRT models were estimated for the 3 participation domains, they were incorporated into the DYNHA softwarec developed at QualityMetric Inc.16, 62 The PM-PAC-CAT was constructed for use on a laptop computer using the Windows operating system. An initial item with a high information function value in the middle of the scoring range, which was appropriate for all patients, was selected to be the start item for each of the 3 PM-PAC-CATs. The response to this question generates an initial score estimate and determines the next most informative item for that respondent from the item bank. The response to this second item updates the score estimate, and the next most informative item from the item bank is again selected. At each step, the patient’s level of participation is re-estimated along with a patient-specific confidence interval (CI). When a predetermined maximum number of items has been administered or a specified level of precision has been achieved, the PM-PAC-CAT stops or begins assessing another participation domain. In this study, the PM-PAC-CAT for a domain concluded when a 95% CI of ±5 points was reached or a maximum of 10 items per domain had been answered. The stop rule was based on the standard error (SE) of a particular score; on our scale, 5 points corresponds to 0.5 SD.
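The selection and stopping logic described above might be sketched as follows. This is not the DYNHA implementation: the function names and item bank are hypothetical, the information function is the standard GPCM result (squared slope times the variance of the category score), and the sketch assumes the precision rule means a 95% CI half-width of 1.96 × SE no larger than 5 points.

```python
import math

MAX_ITEMS = 10        # per-domain cap stated in the text
CI_HALF_WIDTH = 5.0   # points on the 50/10 scale
Z_95 = 1.96           # assumption: CI half-width = 1.96 * SE

def should_stop(se_points, items_given):
    """Stop when the item cap is reached or the 95% CI is narrow enough."""
    return items_given >= MAX_ITEMS or Z_95 * se_points <= CI_HALF_WIDTH

def gpcm_information(theta, slope, thresholds):
    """Fisher information of a GPCM item: slope^2 * Var(category score)."""
    cumulative = [0.0]
    for b in thresholds:
        cumulative.append(cumulative[-1] + slope * (theta - b))
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return slope ** 2 * variance

def next_item(theta, bank, administered):
    """Pick the unused item with the highest information at theta."""
    candidates = [name for name in bank if name not in administered]
    return max(candidates, key=lambda n: gpcm_information(theta, *bank[n]))

# Hypothetical 3-item bank: name -> (slope, thresholds in logits).
bank = {
    "easy": (1.0, [-2.0, -1.5]),
    "middle": (1.5, [-0.5, 0.5]),
    "hard": (1.0, [1.5, 2.0]),
}
print(next_item(0.0, bank, administered=set()))    # middle
print(should_stop(se_points=2.4, items_given=6))   # True
print(should_stop(se_points=4.0, items_given=6))   # False
print(should_stop(se_points=4.0, items_given=10))  # True
```

A full CAT would re-estimate the score and its SE after every response; that estimation step is omitted here to keep the sketch short.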

Participation Fixed-Length Form 

To compare the PM-PAC-CAT programs with more traditional fixed-length forms, we selected 53 items from the participation item banks for inclusion in the PM-PAC-53. Items were selected to represent the full content of the item banks, limit the number of response scales that needed to be evaluated by the respondent, and to minimize ceiling and floor effects. Eighteen items were selected for the mobility scale (75% of the item bank), 15 items for the domestic life scale (68% of the item bank), and 20 items for the community, social, & civic life scale (61% of the item bank). Scale scores were derived for each domain using the parameter estimates generated from the IRT models described previously.

Data collection 

Initial and follow-up interviews were conducted by trained interviewers at the patient’s living location. The initial interview was scheduled to be about 2 weeks after discharge; those who were not interviewed within 6 weeks of hospital discharge were excluded from the study. The order of PM-PAC-53 fixed-form and PM-PAC-CAT administration was systematically alternated to avoid order effects, such that each enrolled patient was preassigned to receive the CAT first and fixed-length form second on initial visit, and fixed-length form first and CAT second on follow-up visit; or the reverse pattern. For the CAT administration, patients who were not computer literate watched the computer screen with the data collectors. If the patient chose to do so, he/she completed the CAT directly. However, most frequently the data collector used the mouse to record responses for the patient. Each interview lasted 45 to 60 minutes. The actual time (to the closest minute) was recorded for administration of the fixed-length questionnaire; the CAT had an internal clock to track the time of the CAT administration and the number of items answered for each domain.

At the end of the follow-up interview, patients rated their amount of change (worse, about the same, better) in their overall level of daily functioning since the start of the study, using a previously validated scale.63, 64, 65 Patients first provided an overall assessment of change (worse, about the same, better). Those patients who rated their overall change as worse also scored the amount of change on a 7-point scale ranging from −7 (a very great deal worse) through −1 (hardly worse at all). Similarly, patients who rated their change as better also scored the amount of change on a 7-point scale ranging from 1 (hardly better at all) to 7 (a very great deal better). See full description of global rating of change in Haley et al.37

Statistical Analyses 

To assess how well PM-PAC-CAT scores reproduced an IRT-based estimate of the latent trait, intraclass correlation coefficients, model 3,1 (ICC3,1) between the CAT scores and the fixed-length form scores were calculated at both the initial and the follow-up visits. ICCs were calculated as a ratio of the variance of scores between subjects to the total variance of scores between and within subjects. Reliability is considered high if the ICC is greater than .80, substantial between .61 and .80, moderate between .41 and .60, and poor to fair if less than or equal to .40 for group estimates.66
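As an illustration of the agreement statistic, the following sketch computes ICC(3,1) from a subjects-by-methods score matrix using the usual two-way ANOVA mean squares (between-subject mean square BMS and residual mean square EMS). The function name and the example scores are hypothetical; the study data are not reproduced here.

```python
import numpy as np

def icc_3_1(scores):
    """ICC(3,1): two-way mixed model, consistency, single measurement.

    scores -- array-like of shape (n_subjects, k_methods)
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Partition the total sum of squares into subjects, methods, and error.
    ss_subjects = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_methods = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_error = ss_total - ss_subjects - ss_methods
    bms = ss_subjects / (n - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)

# Made-up scores: column 0 = CAT, column 1 = fixed-length form.
example = [[48, 50], [55, 53], [61, 64], [42, 45], [58, 57]]
print(round(icc_3_1(example), 3))
```

Because model 3,1 measures consistency rather than absolute agreement, a constant offset between the two formats does not lower the coefficient.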

The ability of the PM-PAC-CAT compared with the PM-PAC-53 scales to discriminate between groups of patients was examined using follow-up data. First, patients were classified on the basis of initial disability severity (using the adapted MRS).67 The discrimination of PM-PAC-CAT versus PM-PAC-53 scores was evaluated by a series of independent sample t tests.

Second, we used a series of paired t tests to examine differences between follow-up and initial scores, as evaluated with 2 sensitivity indices. For this study, we defined sensitivity as the ability to detect change. The simple effect size for correlated samples is the average change between initial and follow-up measurements, divided by the SD of the initial measurement.68 The standardized response mean (SRM) is the ratio of the mean change to the SD of the change score.69 For the same sample, the standardized response mean is identical to the Cohen d statistic for correlated samples, which is calculated by taking the t statistic and dividing it by the square root of the sample size.69 The effect size and SRM often yield the same ranking of measures, but the absolute values can be different. The magnitude of effect size is dependent on the variability of the initial scores. If the correlation between the initial and follow-up scores is large, the SRM is considerably larger than the effect size.70 We also constructed 95% bootstrap CIs for both effect size and SRM using 2000 random samples with replacement. If the CIs for the PM-PAC-53 and PM-PAC-CAT overlap, no significant difference exists.
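The 2 sensitivity indices and their bootstrap intervals can be sketched as below. The function names are hypothetical, the data are simulated for illustration only, and a percentile bootstrap is assumed because the text does not state which bootstrap interval was used.

```python
import numpy as np

def effect_size(initial, follow_up):
    """Mean change divided by the SD of the initial scores."""
    change = follow_up - initial
    return change.mean() / initial.std(ddof=1)

def srm(initial, follow_up):
    """Standardized response mean: mean change divided by SD of change."""
    change = follow_up - initial
    return change.mean() / change.std(ddof=1)

def bootstrap_ci(stat, initial, follow_up, n_boot=2000, seed=0):
    """Percentile 95% CI from resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    n = len(initial)
    values = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample paired observations
        values[i] = stat(initial[idx], follow_up[idx])
    return np.percentile(values, [2.5, 97.5])

# Simulated scores on the 50/10 scale for 94 hypothetical patients.
rng = np.random.default_rng(1)
initial = rng.normal(48.0, 9.0, size=94)
follow_up = initial + rng.normal(5.0, 8.0, size=94)

print("effect size:", round(effect_size(initial, follow_up), 2))
print("SRM:", round(srm(initial, follow_up), 2))
print("SRM 95% CI:", bootstrap_ci(srm, initial, follow_up).round(2))
```

Resampling whole patients (both visits together) preserves the correlation between initial and follow-up scores, which is what makes the SRM and effect size differ.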

To assess response burden, we used 1-sample t tests to examine the difference between the number of items required in the PM-PAC-CAT versus the fixed-length 53 items on the alternative format and a series of paired t tests to examine differences in the amount of time needed for the PM-PAC-CAT (as measured by the internal computer clock) and the PM-PAC-53 (timed by interviewers).

Finally, we examined responsiveness by examining mean change scores for the PM-PAC-CAT and PM-PAC-53 survey in relation to the overall ratings of change in daily functioning between initial and follow-up visits provided by the patients. For this study, we defined responsiveness as change compared with an external anchor (in this case, patient report of change). The absolute value of the ratings (either worse or better) was used to classify patients into 3 groups: same or ±1 (no change), ±2 to 5 (small to medium, or some, change), and ±6 to 7 (large change). Rationale for using absolute values is provided in Haley et al.37 We used 3 rather than the 4 categories that have been used in previous studies due to the small numbers of patients reporting scores of ±2 to 3.63, 64, 65 We used analysis of variance to compare mean PM-PAC-CAT and PM-PAC-53 form scores for each change category. In addition, we contrasted the responsiveness of PM-PAC-CAT and PM-PAC-53 formats by producing receiver operator characteristic (ROC) curves for detecting at least a small to medium change (either worse or better) based on global patient ratings.71 The construction of paired-ROC curves involved plotting sensitivity against (1 − specificity) along multiple cutpoints, based on the absolute change values of patient scores from either the PM-PAC-CAT or PM-PAC-53 scales. A series of change cutpoints using absolute values (≈89 a domain) were used to develop the ROC curves. The true-positive rate (sensitivity) is the proportion of patients exceeding each absolute change cutpoint among those who reported making at least a small to medium change based on their global rating. The false-positive rate (ie, 1 − specificity) is the proportion of patients exceeding each absolute change cutpoint among those who did not report at least a small to medium change.
Chi-square tests were conducted to examine if the areas under the ROC curves were different than expected by chance alone (ie, different than the diagonal)72 and if the area under the curves between paired PM-PAC-CAT and PM-PAC-53 ROC curves was statistically different, taking into account the correlated nature of the data.73
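A minimal sketch of the anchor-based ROC construction follows. The grouping thresholds come from the text; the function names and the example data are hypothetical, and the area under the curve is computed with the Mann-Whitney formulation rather than with the chi-square procedures cited above.

```python
def change_group(rating):
    """Classify a global rating (-7..7, 0 = about the same) by absolute value."""
    size = abs(rating)
    if size <= 1:
        return "no change"
    if size <= 5:
        return "some change"
    return "large change"

def roc_points(abs_change, changed):
    """(cutpoint, sensitivity, 1 - specificity) at every observed cutpoint."""
    positives = [c for c, y in zip(abs_change, changed) if y]
    negatives = [c for c, y in zip(abs_change, changed) if not y]
    points = []
    for cut in sorted(set(abs_change)):
        sensitivity = sum(c >= cut for c in positives) / len(positives)
        false_pos = sum(c >= cut for c in negatives) / len(negatives)
        points.append((cut, sensitivity, false_pos))
    return points

def auc(abs_change, changed):
    """Area under the ROC curve: P(changed patient has the larger score)."""
    positives = [c for c, y in zip(abs_change, changed) if y]
    negatives = [c for c, y in zip(abs_change, changed) if not y]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical patients: global rating and absolute score change.
ratings = [0, -1, 4, -6, 1, 7]
abs_change = [1.0, 2.5, 8.0, 6.0, 0.5, 9.5]
changed = [change_group(r) != "no change" for r in ratings]

print([change_group(r) for r in ratings])
print(roc_points(abs_change, changed)[0])  # lowest cutpoint: (0.5, 1.0, 1.0)
print(auc(abs_change, changed))            # 1.0 for this separable example
```

An area of .5 corresponds to the diagonal (chance), which is the reference against which the reported areas were tested.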

Results 


Score Comparability and Validity 

ICCs between PM-PAC-CAT and PM-PAC-53 scores indicate a substantial degree of correspondence between measures, with ICCs ranging from .71 to .81 across domains, times, and methods, with an average of .76 (table 1). There were no substantial differences in score agreement between the initial (ICC mean, .75) and follow-up (ICC mean, .77) tests.

Table 1.

Score Comparability Between the PM-PAC-CAT and PM-PAC-53

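| Participation Domains | Interview | ICC3,1 (N=94) | ICC3,1 (average within domain) | ICC3,1 (average across domains) |
|---|---|---|---|---|
| Mobility | Initial | .812 | .78 | .76 |
| | Follow-up | .742 | | |
| Community, social & civic life | Initial | .720 | .75 | |
| | Follow-up | .782 | | |
| Domestic life | Initial | .709 | .75 | |
| | Follow-up | .785 | | |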

Forty-two (44.7%) patients were coded as having moderate to severe disability on the follow-up visit. Both the PM-PAC-CAT and PM-PAC-53 discriminated between known severity groups in each of the 3 participation domains (table 2). Mean scores on the PM-PAC-CAT closely approximated scores on the PM-PAC-53 scales for all 3 participation domains. If we divide the t statistic in the PM-PAC-CAT version by the t statistic in the PM-PAC-53 version, the PM-PAC-CAT discriminated 85% as well as the PM-PAC-53 survey in measures of mobility, 65% in community, social, & civic life, and 83% in domestic life.

Table 2.

Discriminant Validity of the PM-PAC-CAT Versus the PM-PAC-53 Test Formats on Severity of Physical Disability, Follow-Up Data

| Participation Domains | Test Format | Mean Severe/Moderate ± SE (n=42) | Mean Mild/Slight ± SE (n=52) | Difference | t | P |
|---|---|---|---|---|---|---|
| Mobility | PM-PAC-53 | 49.18±1.01 | 58.20±0.96 | 9.02 | 6.41 | ≤.001 |
| | PM-PAC-CAT | 48.32±0.89 | 59.14±1.77 | 10.82 | 5.47 | ≤.001 |
| Community, social & civic life | PM-PAC-53 | 47.55±1.39 | 60.50±1.24 | 12.95 | 6.96 | ≤.001 |
| | PM-PAC-CAT | 47.37±1.79 | 59.46±1.90 | 12.09 | 4.55 | ≤.001 |
| Domestic life | PM-PAC-53 | 48.58±1.34 | 58.27±1.10 | 9.69 | 5.65 | ≤.001 |
| | PM-PAC-CAT | 46.61±1.30 | 56.80±1.74 | 10.19 | 4.70 | ≤.001 |
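The relative discrimination percentages quoted in the text can be reproduced from the t statistics in table 2. The short sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# t statistics from table 2: (PM-PAC-CAT, PM-PAC-53) for each domain.
t_stats = {
    "mobility": (5.47, 6.41),
    "community, social, & civic life": (4.55, 6.96),
    "domestic life": (4.70, 5.65),
}

# Relative discrimination: CAT t statistic as a percentage of the fixed-form t.
ratios = {
    domain: round(100 * t_cat / t_fixed)
    for domain, (t_cat, t_fixed) in t_stats.items()
}
for domain, pct in ratios.items():
    print(f"{domain}: {pct}%")  # 85%, 65%, 83%
```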

CAT Item Usage and Respondent Burden 

The average number of items required for the PM-PAC-CAT at the initial visit was 6.6±2.5 for mobility and 8.7±1.2 for community, social, & civic life. The minimum number of items per domain was 4 for mobility and 6 for community, social, & civic life, and the maximum (established by the item-stop rule) was 10. Thirty-one percent of PM-PAC-CAT administrations stopped when the 10-item stop rule was reached for mobility and 28% stopped at 10 items for community, social, & civic life. In the follow-up PM-PAC-CAT administration, the mean number of mobility items administered was 7.4±2.5 and of community, social, & civic life items was 8.7±1.2, and similar proportions of administrations stopped at the 10-item maximum. All PM-PAC-CAT domestic life administrations stopped when 10 items had been reached in both the initial and follow-up CAT administrations. Across both PM-PAC-CAT administrations, the average number of items per person across the 3 domains was 25.7±3.0 (table 3). The time to complete the PM-PAC-53 survey averaged 13.6±4.5 minutes across both administrations, compared with 5.7±2.4 minutes for the PM-PAC-CAT. Overall, the CAT resulted in large decreases in respondent burden, requiring 48% of the number of items and 42% of the administration time of the full-length survey.

Table 3.

Respondent Burden of Participation on the PM-PAC-53 and PM-PAC-CAT Surveys

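Initial and Follow-Up Tests (N=94)

| Survey | Measure | Mean ± SD | Range | CAT as % of Fixed |
|---|---|---|---|---|
| PM-PAC-53 | Time (min) | 13.60±4.50 | 6–36 | NA |
| PM-PAC-CAT | Time (min) | 5.66±2.42 | 1.5–18 | 41.6 |
| PM-PAC-CAT | No. of items | 25.7±2.96 | 20–30 | 48.5 |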

Abbreviation: NA, not applicable.

Time (CAT vs fixed-length form): paired t test; t=16.59; P<.005.

No. of items (CAT vs 53 items): one-sample t test; t=111.09; P<.005.
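The "CAT as % of Fixed" column in table 3 follows directly from the reported means, as the brief check below shows. The variable names are illustrative only.

```python
# Means reported in table 3.
fixed_items = 53          # items in the PM-PAC-53
cat_items_mean = 25.7     # mean items administered by the PM-PAC-CAT
fixed_time_min = 13.60    # mean PM-PAC-53 administration time (minutes)
cat_time_min = 5.66       # mean PM-PAC-CAT administration time (minutes)

# Burden of the CAT expressed as a percentage of the fixed-length form.
items_pct = round(100 * cat_items_mean / fixed_items, 1)
time_pct = round(100 * cat_time_min / fixed_time_min, 1)

print(items_pct)  # 48.5
print(time_pct)   # 41.6
```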

For the mobility domain, 20 of the 24 items were selected 1 or more times across both initial and follow-up administrations, 17 of the 22 items were selected from the domestic life domain, and 27 of 33 items were selected from the community, social, & civic life domain. Within each of the participation domains, the 10 most frequently selected items accounted for 91% of the item administrations for mobility, 81% for domestic life, and 79% for community, social, & civic life.

Responsiveness 

Evaluation of change scores for the PM-PAC-CAT and PM-PAC-53 forms showed that both were able to detect change, and that comparable scores (eg, CAT and fixed-length scores at initial visit) had similar mean values, within 1 to 2 points (table 4). However, the SD for the PM-PAC-CAT scores was larger than for the PM-PAC-53, for all domains. Though the effect size and SRM for all 3 domains favored the PM-PAC-53 over the PM-PAC-CAT, the bootstrap CIs for all 3 domains overlap between the CAT and fixed-length forms, indicating no statistically significant difference between the forms in their sensitivity to detect changes between visits.

Table 4.

Sensitivity of the PM-PAC-CAT Versus the PM-PAC-53 Test Formats Over a 3-Month Interval

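| Participation Domains | Test Format | Initial Visit (mean ± SD) | Follow-Up (mean ± SD) | Change | t | P | Effect Size | SRM |
|---|---|---|---|---|---|---|---|---|
| Mobility | PM-PAC-53 | 49.33±7.50 | 54.17±8.11 | 4.87 | 6.90 | <.005 | .65 | .71 |
| | PM-PAC-CAT | 48.94±9.17 | 54.30±11.52 | 5.36 | 4.27 | <.005 | .58 | .44 |
| Community, social & civic life | PM-PAC-53 | 46.62±9.85 | 54.72±11.02 | 8.10 | 7.90 | <.005 | .82 | .81 |
| | PM-PAC-CAT | 47.88±12.41 | 54.06±14.10 | 6.18 | 4.29 | <.005 | .50 | .44 |
| Domestic life | PM-PAC-53 | 47.98±7.64 | 53.94±9.55 | 5.81 | 7.62 | <.005 | .76 | .79 |
| | PM-PAC-CAT | 47.22±10.02 | 52.25±11.96 | 5.02 | 4.06 | <.005 | .50 | .42 |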

The bar charts in figure 1 show the amount of absolute change in the PM-PAC-CAT and PM-PAC-53 that corresponds to no change (n=14), some change (small to medium, n=38), or large change (n=42), as defined by the patient’s overall rating of change in daily functioning. For the no change group, the change scores are nearly identical between the PM-PAC-CAT and PM-PAC-53 for the domestic life domain and the community, social, & civic life domain. In the mobility domain, however, the PM-PAC-CAT detected a statistically larger difference than the PM-PAC-53 (difference=3.74, t13=2.906, P=.012). Of the 38 patients who reported some change, the average change for the mobility PM-PAC-CAT (mean, 6.3±5.3) and corresponding PM-PAC-53 (mean, 7.6±7.8) differed by 1.3 points. For domestic life, scores for patients reporting “some change” were similar overall on the PM-PAC-CAT (mean, 7.1±5.2) and PM-PAC-53 (mean, 7.6±5.8). For community, social, & civic life, the PM-PAC-CAT (mean, 8.8±6.1) and PM-PAC-53 scores (mean, 8.9±6.9) were nearly identical for the group reporting some change. Paired t tests show that, in this group, the change scores did not differ statistically between the PM-PAC-CAT and PM-PAC-53 in any of the 3 domains. However, in the patients who reported large change over the 3-month interval, the PM-PAC-CAT uniformly detected statistically larger differences than the PM-PAC-53. In the mobility domain, the change detected by the PM-PAC-CAT form is 4.15 points higher (t40=3.177, P=.003); in the domestic life domain, the PM-PAC-CAT change score is 4.07 points higher (t41=2.810, P=.008); and in the community, social, & civic life domain, the PM-PAC-CAT change score is 2.79 points higher (t41=2.515, P=.016).



Fig 1. Comparison of changes (absolute values) detected by CAT and fixed-length formats of the PM-PAC based on categories of patient-reported ratings of change. Error bars are ±2 SEs of the mean.


Using the some change category as the cutpoint for examining paired-ROC curves to detect minimal levels of patient-reported responsiveness, we found in general that the PM-PAC-CAT and PM-PAC-53 performed comparably. For community, social, & civic life, the areas under both the PM-PAC-CAT (area ± SE, .684±.074; P=.034) and PM-PAC-53 (area, .690±.072; P=.029) ROC curves were statistically different from chance levels, and there was no statistically significant difference between the paired-ROC curves. For domestic life, the PM-PAC-CAT (area, .655±.077; P=.076) was stronger than the PM-PAC-53 (area, .599±.07; P=.255), although neither result was significant at the .05 level. For mobility, the PM-PAC-53 (area, .749±.064; P=.004) outperformed the PM-PAC-CAT (area, .587±.079; P=.318), although the difference between the curves was not statistically significant at the .05 level.

Discussion 


CAT programs that measure rehabilitation outcomes are potentially a major technologic step forward by reducing response burden with relatively modest compromises in accuracy or sensitivity to change. Our study results suggest that CAT programs built to measure participation may show some of the same advantages in reducing respondent burden and maintaining accuracy and validity as we have seen with measures of activity.37

The overall level of score agreement between PM-PAC-CAT and PM-PAC-53 was slightly less for participation (.76 vs .82) than we found with activity concepts.37 We note that all PM-PAC-CAT domestic life administrations required the full 10 items and community, social, & civic life required an average of nearly 9 items. Agreement could likely be improved in future studies by setting a more stringent level of precision in the CAT stop rule, by allowing a higher maximum number of items before the stop rule is engaged, or both. If participation scores are to be interpreted at the individual level, then more items are likely needed to provide more precision. However, this study provides evidence that the CAT programs, even at a fairly low level of precision and number of items administered, can provide valid and responsive data over a 3-month follow-up period after inpatient rehabilitation when participation changes are most likely to occur.3, 74 As we found with the 3 activity concepts,37 the CAT format was able to discriminate between patients with different severity levels in all the participation scales.

Effect sizes for participation ranged from moderate to high over the 3-month interval. These were substantially higher than the corresponding effect sizes seen for changes in activity during the same interval.37 This was somewhat unexpected, although it might be that activity changes peak at or near discharge, whereas the peak change period for participation among patients recently discharged from inpatient rehabilitation may be in the first few months after returning home.3, 74 Longer longitudinal follow-up with CAT programs, and in different postacute groups, is needed to address these questions more systematically.

Our analysis and interpretation of change are complicated by the fact that about one third of the patients had lower scores (indicating lower participation performance) at the 3-month follow-up in at least 1 of the 3 participation domains on either the PM-PAC-CAT or the PM-PAC-53. The vast majority of these patients (85%) were in either the neurologic or the complex medical impairment groups. Neurologic and medical conditions severe enough to require acute rehabilitation hospital admission are more likely to worsen over time than musculoskeletal or orthopedic conditions. In addition, 24.5% of the entire sample experienced a hospitalization (n=17), an intercurrent illness (n=23), or both between the first and follow-up interviews. These health problems also occurred primarily in patients with neurologic or complex medical conditions (91%). In future studies, larger numbers are needed to tease out differential sensitivity across major impairment groups. Because of the large number of patients who lost function during the study, absolute change values were used to evaluate responsiveness of the PM-PAC-CAT. Of course, the mean change scores, on which the sensitivity analyses were based, combined results from patients who improved, stayed the same, and deteriorated. In fact, in about 10% to 15% of the cases, depending on the participation domain, we found inconsistencies between the direction of self-report ratings of change and CAT scores. Our main results represent a general trend, recognizing there is variability in the direction of change in each person’s self-report and CAT scores. We assumed that the meaning of change, whether an improvement or deterioration, was equally important clinically in assessing rehabilitation outcomes. It should also be noted that because this was an observational design, the final functional status of each patient included changes due to rehabilitation services, the passage of time, and other factors specific to that patient. Future studies in which the direction of change is important to assess will need larger samples to run these separate analyses.75

A consistent finding in all 3 participation domains was that the PM-PAC-CAT scores better reflected the scores of persons who indicated they made a large change in their global functioning than did the PM-PAC-53 scores. We had noted this finding in only 1 of the activity domains (personal care & instrumental).37 Our impression is that because the CAT format has the flexibility to administer different, more relevant items at serial assessment points, it is better able to capture changes in those patients who report large absolute changes. Future studies in both clinical and home environments should continue to examine the potential efficiency gains from use of the CAT platform for functional assessments, balanced against the level of scoring precision needed for either group- or individual-level analysis.

Study Limitations 

We note some limitations to the study. There was some mixture in modes of administration of the CAT, depending on the willingness and computer experience of the follow-up patients. As noted in the methods, most patients chose to interact with the data collector when answering the CAT questions, but others wanted to use the CAT computer interface without assistance from the data collector. This may have created an interview-bias effect in some patients, and clearer guidelines for use of the CAT in field studies will be advantageous in future work. The range of ICCs between the fixed form and the CAT was generally lower than expected, and although we believe this is an acceptable level of agreement for group-level studies, it does point toward considerable measurement variability between the CAT and the fixed-length forms. Finally, the decision to create a 10-item stop-rule was based on findings from previous work and on the practical aim of minimizing response burden. Thus the large reduction in response burden reported here is somewhat forced by the stop-rule we imposed. The determination of the balance between response burden and the accuracy and precision of the CAT, which depends largely on the purpose of measurement, should be an important focus of future CAT research.
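For readers less familiar with the agreement statistic, the following Python sketch shows one way an intraclass correlation coefficient of model 3,1 in the Shrout and Fleiss66 sense can be computed from a two-way layout, with patients as rows and the 2 formats (CAT and fixed form) as columns. The paired scores and the function name are hypothetical, and the sketch is offered only to make the statistic concrete, not as the analysis code used in this study.

```python
def icc_3_1(scores):
    """ICC(3,1): two-way mixed model, single measure, consistency.

    `scores` is a list of rows, one per patient, each row holding that
    patient's score on every format (for example [cat_score, fixed_score]).
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of formats
    grand = sum(sum(row) for row in scores) / (n * k)

    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-patients mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# Hypothetical paired scores (NOT data from this study): [CAT, fixed form].
paired = [[52, 55], [61, 58], [45, 49], [70, 66], [58, 60]]
print(f"ICC(3,1) = {icc_3_1(paired):.2f}")
```

Because model 3,1 removes the mean difference between formats before assessing agreement, a constant offset between CAT and fixed-form scores does not lower the coefficient; only patient-by-format inconsistency does.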

Conclusions 

The results of this study and its companion study on activity outcomes37 support the continued evaluation and development of CAT-based functional assessments in rehabilitation. It appears that participation outcomes can be modeled effectively onto a CAT platform, and initial results in this 3-month follow-up study are encouraging. The current CAT programs used item stop-rules developed for group-level analyses and not for care planning or identifying limitations of individuals in specific areas of functioning. The efficiency gains, coupled with the promising psychometric performance of the participation scales noted in this article and similar results on CAT-based activity outcomes, suggest that CAT assessments may provide an important tool for future follow-up studies and group monitoring in rehabilitation and postacute care programs.

Suppliers

References 

1. Gandek B, Sinclair J, Jette A, Ware JE. Development and initial testing of the Participation Measure for Post-Acute Care (PM-PAC). Am J Phys Med Rehabil. 2007;86:57–71.

2. Gray D, Hollingsworth H, Stark S, Morgan K. Participation survey/mobility: psychometric properties of a measure of participation for people with mobility impairments and limitations. Arch Phys Med Rehabil. 2006;87:189–197.

3. Jette AM, Keysor J, Coster W, Ni P, Haley S. Beyond function: predicting participation in a rehabilitation cohort. Arch Phys Med Rehabil. 2005;86:2087–2094.

4. Noreau L, Desrosiers J, Robichaud L, Fougeyrollas P, Rochette A, Viscogliosi C. Measuring social participation: reliability of the LIFE-H in older adults with disabilities. Disabil Rehabil. 2004;26:346–352.

5. Dijkers MP, Whiteneck G, El-Jaroudi R. Measures of social outcomes in disability research. Arch Phys Med Rehabil. 2000;81(12 Suppl 2):S63–S80.

6. Whiteneck GC, Charlifue SW, Gerhart KA, Overholser JD, Richardson GN. Quantifying handicap: a new measure of long-term rehabilitation outcomes. Arch Phys Med Rehabil. 1992;73:519–526.

7. Hearing before the Subcommittee on Health of the House Committee on Ways and Means on Standardized Payment and Patient Assessments in Post-Acute Care, 109th Cong, 1st Sess (testimony of Herb Kuhn, director, Center for Medicare Management, Centers for Medicare & Medicaid Services).

8. Nagi S. A study in the evaluation of disability and rehabilitation potential: concepts, methods, and procedures. Am J Public Health. 1964;54:1568–1579.

9. Pope AM, Tarlov A. Disability in America: toward a national agenda for prevention. Washington (DC): Natl Acad Pr; 1991.

10. World Health Organization. International classification of functioning, disability, and health. Geneva: WHO; 2001.

11. Stucki G, Ewert T, Cieza A. Value and application of the ICF in rehabilitation medicine. Disabil Rehabil. 2003;25:628–634.

12. World Health Organization. International classification of functioning, disability and handicap (ICF). Geneva: WHO; 2001.

13. Stucki G. International classification of functioning, disability, and health (ICF): a promising framework and classification for rehabilitation medicine. Am J Phys Med Rehabil. 2005;84:733–740.

14. Weigl M, Cieza A, Andersen C, Kollerits B, Amann E, Stucki G. Identifications of relevant ICF categories in patients with chronic health conditions: a Delphi exercise. J Rehabil Med. 2004;36:12–21.

15. Brown M, Dijkers M, Gordon W, Ashman T, Charatz H, Cheng Z. Participation objective, participation subjective: a measure of participation combining outsider and insider perspectives. J Head Trauma Rehabil. 2004;19:459–481.

16. Ware JE, Gandek B, Sinclair SJ, Bjorner B. Item response theory in computer adaptive testing: implications for outcomes measurement in rehabilitation. Rehabil Psychol. 2005;50:71–78.

17. Jette AM, Haley SM. Contemporary measurement techniques for rehabilitation outcomes assessment. J Rehabil Med. 2005;37:339–345.

18. Cella D, Gershon R, Lai JS, Choi S. The future of outcomes measurement: item banking, tailored short forms, and computerized adaptive assessment. Qual Life Res. 2007;16(Suppl 1):133–141.

19. Fries J, Bruce B, Cella D. The promise of PROMIS: using item response theory to improve assessment of patient-reported outcomes. Clin Exp Rheumatol. 2005;23:S53–S57.

20. Cella D, Young S, Rothrock N, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap cooperative group during its first two years. Med Care. 2007;45:S3–S11.

21. Hambleton RK. Applications of item response theory to improve health outcomes assessment: developing item banks, linking instruments, and computer-adaptive testing. In: Lipscomb J, Gotay CC, Snyder C, editors. Outcomes assessment in cancer. Cambridge: Cambridge Univ Pr; 2005. p. 445–464.

22. Fayers P. Applying item response theory and computer adaptive testing: the challenges for health outcomes assessment. Qual Life Res. 2007;16(Suppl 1):187–194.

23. Wainer H. Computerized adaptive testing: a primer. Mahwah: Lawrence Erlbaum Assoc; 2000.

24. Dijkers MP. A computer adaptive testing simulation applied to the FIM instrument motor component. Arch Phys Med Rehabil. 2003;84:384–393.

25. Revicki DA, Cella DF. Health status assessment for the twenty-first century: item response theory, item banking and computer adaptive testing. Qual Life Res. 1997;6:595–600.

26. Haley SM, Ni P, Hambleton RK, Slavin MD, Jette AM. Computer adaptive testing improves accuracy and precision of scores over random item selection in a physical functioning item bank. J Clin Epidemiol. 2006;59:1174–1182.

27. Haley SM, Coster WJ, Andres PL, Kosinski M, Ni P. Score comparability of short-forms and computerized adaptive testing: simulation study with the activity measure for post-acute care. Arch Phys Med Rehabil. 2004;85:661–666.

28. Andres PL, Black-Schaffer RM, Ni PS, Haley SM. Computer adaptive testing: a strategy for monitoring stroke rehabilitation across settings. Top Stroke Rehabil. 2004;11:33–39.

29. Siebens H, Andres PL, Ni P, Coster WJ, Haley SM. Measuring physical function in patients with complex medical and postsurgical conditions: a computer adaptive approach. Am J Phys Med Rehabil. 2005;84:741–748.

30. Hart D, Mioduski J, Werneke M, Stratford P. Simulated computerized adaptive test for patients with lumbar spine impairments was efficient and produced valid measures of function. J Clin Epidemiol. 2006;59:947–956.

31. Hart DL, Cook KF, Mioduski JE, Teal CR, Crane PK. Simulated computerized adaptive test for patients with shoulder impairments was efficient and produced valid measures of function. J Clin Epidemiol. 2006;59:290–298.

32. Hart DL, Mioduski JE, Stratford PW. Simulated computerized adaptive tests for measuring functional status were efficient with good discriminant validity in patients with hip, knee, or foot/ankle impairments. J Clin Epidemiol. 2005;58:629–638.

33. Kosinski M, Bjorner JB, Ware JE, Sullivan E, Straus WL. An evaluation of a patient-reported outcomes found computerized adaptive testing was efficient in assessing osteoarthritis impact. J Clin Epidemiol. 2006;59:715–723.

34. Haley SM, Raczek AE, Coster WJ, Dumas HM, Fragala-Pinkham MA. Assessing mobility in children using a computer adaptive testing version of the Pediatric Evaluation of Disability Inventory. Arch Phys Med Rehabil. 2005;86:932–939.

35. Haley SM, Fragala-Pinkham MA, Ni P. Sensitivity of a computer adaptive assessment for measuring functional mobility changes in children enrolled in a community fitness program. Clin Rehabil. 2006;20:616–622.

36. Jette A, Haley S, Tao W, et al. Prospective evaluation of the AM-PAC-CAT in outpatient rehabilitation settings. Phys Ther. 2007;87:385–398.

37. Haley S, Siebens H, Coster W, et al. Computerized adaptive testing for follow-up after discharge from inpatient rehabilitation: I. Activity outcomes. Arch Phys Med Rehabil. 2006;87:1033–1042.

38. Jette AM. Toward a common language for function, disability, and health. Phys Ther. 2006;86:726–734.

39. van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJ, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke. 1988;19:604–607.

40. Ware J, Kosinski M, Dewey J, Gandek B. How to score and interpret single-item health status measures: a manual for users of the SF-8 health survey. Lincoln: QualityMetric; 1999.

41. Willer B, Ottenbacher KJ, Coad ML. The community integration questionnaire: a comparative examination. Am J Phys Med Rehabil. 1994;73:10–11.

42. Jette AM, Davies AR, Cleary PD, et al. The Functional Status Questionnaire: reliability and validity when used in primary care. J Gen Intern Med. 1986;1:143–149.

43. Cardol M, de Haan RJ, de Jong BA, van den Bos GA, de Groot IJ. Psychometric properties of the Impact on Participation and Autonomy Questionnaire. Am J Phys Med Rehabil. 2001;82:210–216.

44. Stewart AL, Ware JE, editors. Measuring function and well-being: the measuring outcomes study approach. Durham: Duke Univ Pr; 1992.

45. U.S. Department of Health and Human Services, National Center for Health Statistics. National Health Interview Survey, 1994: Second Longitudinal Study on Aging, Wave 3 [computer file]. Hyattsville: DHHS, NCHS; 2003.

46. U.S. Department of Health and Human Services, National Center for Health Statistics. National Health Interview Survey on Disability, 1994: Phase I, disability outcome supplement. Hyattsville: DHHS, NCHS; 1997.

47. Wood-Dauphinee SL, Opzoomer MA, Williams JI, Marchand B, Spitzer WO. Assessment of global function: the Reintegration to Normal Living Index. Arch Phys Med Rehabil. 1988;69:583–590.

48. Bergner M, Bobbitt RA, Carter WB, Gilson BS. The Sickness Impact Profile: development and final revision of a health status measure. Med Care. 1981;19:787–805.

49. U.S. Census Bureau. United States Census 2000: Demographic profiles: 100-percent and sample data. http://www.census.gov/Press-Release/www/2002/demoprofiles.html. Accessed October 16, 2007.

50. Meenan RF, Gertman PM, Mason JH. Measuring health status in arthritis (The arthritis impact measurement scales). Arthritis Rheum. 1980;23:146–152.

51. Wiklund I. The Nottingham Health Profile: a measure of health-related quality of life. Scand J Prim Health Care Suppl. 1990;1:15–18.

52. Duncan PW, Wallace D, Lai SM, Johnson D, Embretson S, Laster LJ. The stroke impact scale version 2.0 (Evaluation of reliability, validity, and sensitivity to change). Stroke. 1999;10:2131–2140.

53. Muthen BO, Muthen L. Mplus user’s guide. Los Angeles: Muthen & Muthen; 1998.

54. Ware JE, Gandek B. Methods for testing data quality, scaling assumptions, and reliability: the IQOLA Project approach (International Quality of Life Assessment). J Clin Epidemiol. 1998;51:945–952.

55. Ware JE, Bjorner JB, Kosinski M. Practical implications of item response theory and computerized adaptive testing: a brief summary of ongoing studies of widely used headache impact scales. Med Care. 2000;38(9 Suppl):II73–II82.

56. Ramsay JO. TestGraf: a program for the graphical analysis of multiple choice test and questionnaire data. Montreal: McGill Univ; 1995.

57. Muraki E. A generalized partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. Berlin: Springer; 1997. p. 153–164.

58. Muraki E, Bock RD. PARSCALE: IRT item analysis and test scoring for rating-scale data. Chicago: Scientific Software; 1997.

59. Muraki E, Bock RD. PARSCALE: IRT based test scoring and item analysis for graded open-ended exercises and performance tasks. Chicago: Scientific Software; 1996.

60. Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika. 1981;46:443–459.

61. Orlando M, Thissen D. Likelihood-based item-fit indices for dichotomous item response theory models. Appl Psychol Meas. 2000;24:50–64.

62. Ware J, Kosinski M, Bjorner J, et al. Applications of computerized adaptive testing of headache impact. Qual Life Res. 2003;12:935–952.

63. Jaeschke R, Singer J, Guyatt GH. Measurement of health status: ascertaining the minimal clinically important difference. Control Clin Trials. 1989;10:407–415.

64. Osoba D, Rodrigues G, Myles J, Zee B, Pater J. Interpreting the significance of changes in health-related quality-of-life scores. J Clin Oncol. 1998;16:139–144.

65. Wiebe S, Matijevic S, Eliasziw M, Derry PA. Clinically important change in quality of life in epilepsy. J Neurol Neurosurg Psychiatry. 2002;73:116–120.

66. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428.

67. de Haan R, Horn J, Limburg M, Van Der Meulen J, Bossuyt P. A comparison of five stroke scales with measures of disability, handicap, and quality of life. Stroke. 1993;24:1178–1181.

68. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care. 1989;27(3 Suppl):S178–S189.

69. Rosenthal R, Rosnow R. Essentials of behavioral research: methods and data analysis. 2nd ed. New York: McGraw-Hill; 1991.

70. Dunlap W, Cortina J, Vaslow J, Burke M. Meta-analysis of experiments with matched groups or repeated measures designs. Psychol Methods. 1996;1:170–177.

71. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983;148:839–843.

72. Weinstein MC, Berwick DM, Goldman PA, Murphy JM, Barsky A. A comparison of three psychiatric screening tests using receiver operating characteristic (ROC) analysis. Med Care. 1989;27:593–607.

73. DeLong E, DeLong D, Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.

74. Keysor J, Jette A, Coster W, et al. Association of environmental factors with levels of home and community participation. Arch Phys Med Rehabil. 2006;87:1566–1575.

75. Cella D, Eton DT, Lai JS, Peterman AH, Merkel DE. Combining anchor and distribution-based methods to derive minimal clinically important differences on the Functional Assessment of Cancer Therapy (FACT) anemia and fatigue scales. J Pain Symptom Manage. 2002;24:547–561.

a Health and Disability Research Institute, School of Public Health, Boston University Medical Center, Boston, MA

b Department of Occupational Therapy and Rehabilitation Counseling, Sargent College of Health and Rehabilitation Sciences, Boston University Medical Center, Boston, MA

c Health Assessment Lab, Waltham, MA

d Department of Physical Medicine and Rehabilitation, University of Virginia at Charlottesville, Charlottesville, VA

e Spaulding Rehabilitation Hospital and Department of Physical Medicine and Rehabilitation, Harvard Medical School, Boston, MA.

Reprint requests to Stephen Haley, PhD, Health and Disability Research Institute, Boston University School of Public Health, Boston University Medical Center, 580 Harrison Ave, 4th Fl, Boston, MA 02118-2639.

Supported in part by the National Institute of Child Health and Human Development and the Agency for Healthcare Research and Quality (grant no. R01 HD043568), and an Independent Scientist Award (grant no. K02 HD45354-01).

A commercial party having a direct financial interest in the results of the research supporting this article has conferred or will confer a financial benefit on the author or 1 or more of the authors. Haley and Jette have stock interests in CRE Care LLC, which distributes the AM-PAC products discussed in this study.

a Department of Psychology, McGill University, 1205 Penfield Ave, Montreal, QC H3A 1B1, Canada.

b Scientific Software International Inc, 7383 N Lincoln Ave, Ste 100, Lincolnwood, IL 60712-1747.

c QualityMetric Inc, 640 George Washington Hwy, Lincoln, RI 02865.

PII: S0003-9993(07)01690-5

doi:10.1016/j.apmr.2007.08.150

