Department of Rehabilitation Medicine, Braeside Hospital, Hope Healthcare, Sydney, Australia; Department of Rehabilitation Medicine, Liverpool Hospital, Sydney, Australia; School of Public Medicine and Community Health, University of New South Wales, Sydney, Australia; Department of Rehabilitation Medicine, Fairfield Hospital, Sydney, Australia
Department of Rehabilitation Medicine, Liverpool Hospital, Sydney, Australia; School of Public Medicine and Community Health, University of New South Wales, Sydney, Australia; Department of Ambulatory Care, Liverpool Hospital, Sydney, Australia
Kohler F, Redmond H, Dickson H, Connolly C, Estell J. Interrater reliability of functional status scores for patients transferred from one rehabilitation setting to another.
To report the interrater reliability of FIM total score, FIM motor subscore, and FIM cognitive subscore from scoring that occurred in routine clinical practice in 2 closely linked inpatient rehabilitation services in Sydney, Australia.
A natural-experiment blind clinical interrater reliability cohort study of the FIM across 2 rehabilitation units.
This study is set in 2 inpatient rehabilitation units immediately adjacent to each other in southwestern Sydney, New South Wales, Australia.
All patients (N=143) who were transferred between the 2 rehabilitation units between August 2006 and October 2007 were included in the study.
Discharge FIMs were scored by the first unit and an admission FIM was scored independently by the second unit within a few days. The FIM scores were analyzed for agreement and systematic bias.
Main Outcome Measure
Intraclass correlation coefficients, kappa statistic, weighted kappa statistic, and Bland-Altman plots were used.
There were 143 sets of scores identified. The range of differences between the 2 FIM totals was −32 to 50, between the FIM motor subscores was −22 to 43, and between the FIM cognitive subscores was −14 to 21. Bland-Altman plots demonstrated poor agreement. Few FIM totals were perfectly matched. The intraclass correlation coefficients ranged from .872 for the FIM total to .830 for the cognitive subscales. Values for kappa ranged from −.007 (FIM motor subscore) to .123 (FIM cognitive subscore). Values for weighted kappa ranged from .465 (FIM cognitive subscore) to .521 (FIM total).
There was no systematic scoring bias evident. Intraclass correlation coefficients were high, but tests of agreement demonstrated poor agreement. These findings have implications for the use of the FIM and any patient classification or funding system based on the FIM, especially if poor levels of agreement were found in the presence of all staff being FIM credentialed and standardization of methods of assessment. This study indicates that further investigation of agreement of both FIM totals and FIM item scores in the clinical setting is warranted.
INCREASING EMPHASIS ON patient classification and funding systems using activities of daily living scales in rehabilitation medicine mandates a good understanding of the underlying reliability of the scale used in the classification. Activities of daily living scales such as the FIM, which can be used as proxy measures of outcome, have been used in classification and funding systems for rehabilitation patients for about 15 years.
Fundamental requirements of a classification system include accuracy and easy reproducibility of allocation into the classes. Accuracy of class allocation in rehabilitation is in turn dependent on the underlying measures used in the classification. Clinicians should be aware of the strengths and limitations of the underlying measurements, which form the basis of any classification, because this influences or determines the accuracy and reliability of the classification system.
The FIM, which is commonly used in the inpatient setting to measure the functional level of patients, assesses performance during tasks that can be broadly categorized as activities of daily living, mobility, and cognition. The FIM has a total of 18 items, for which a score ranging from 1 to 7 is given, with 7 signifying complete independence or normative function and 1 signifying complete dependence or requiring total assistance. The total maximum score of the FIM is 126, which implies total independence; the minimum score is 18, which implies full assistance is required for all 18 items.
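The scoring arithmetic described above can be sketched as follows. This is a minimal illustration, not part of the study: the function name is hypothetical, and the split into 13 motor and 5 cognitive items (with the motor items listed first) is an assumption about how the ratings are ordered.

```python
from typing import Dict, Sequence

N_MOTOR_ITEMS = 13      # assumption: the first 13 ratings are the motor items
N_COGNITIVE_ITEMS = 5   # assumption: the last 5 ratings are the cognitive items


def fim_scores(items: Sequence[int]) -> Dict[str, int]:
    """Return FIM total, motor, and cognitive scores from 18 item ratings."""
    if len(items) != N_MOTOR_ITEMS + N_COGNITIVE_ITEMS:
        raise ValueError("expected 18 item ratings")
    if any(not 1 <= rating <= 7 for rating in items):
        raise ValueError("each item must be rated from 1 to 7")
    motor = sum(items[:N_MOTOR_ITEMS])
    cognitive = sum(items[N_MOTOR_ITEMS:])
    return {"total": motor + cognitive, "motor": motor, "cognitive": cognitive}


# Hypothetical patients at the two extremes of the scale.
fully_independent = fim_scores([7] * 18)
fully_dependent = fim_scores([1] * 18)
```

With every item rated 7 the total is 126, and with every item rated 1 it is 18, matching the maximum and minimum stated above.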
The prospective payment system in the United States uses the admission FIM motor score to allocate patients into a case mix group that ultimately determines funding to the inpatient rehabilitation facility.
The Australian National Subacute and Non Acute Patient Classification version 2 (AN-SNAP v2) has been promoted for use in funding. This classification system uses the total FIM score as well as the motor or the cognitive subscores in different parts of its classification. A detailed description of AN-SNAP v2 or other classifications based on the FIM is beyond the scope of this article, but for ease of understanding, a copy of the AN-SNAP v2 classification is included as appendix 1. Of particular relevance is the smallest range of FIM points for allocation into specific classes. For FIM totals, the smallest range defining a class is 24 points; for FIM motor, the smallest range is 10 points; and for FIM cognitive, the smallest range is 4 points. Good interrater agreement of FIM ratings is essential to ensure reliability of the classification.
Reliability refers to the degree that a scale is free from random error. It is the stability or consistency of measurement. Two components of reliability that are commonly examined are test-retest reliability or reproducibility, and interrater reliability.
Test-retest reliability is a measure of the consistency of scores on repeated testing, and interrater reliability refers to the consistency of scores when 2 different raters score the patient. In clinical practice, reliability is generally not routinely measured. However, clinical reliability is important in the context of using outcome measures for benchmarking or classification and funding purposes. Differences in assessments or poor clinical reliability of measurements, when used as a surrogate measure of unit performance or to determine unit funding, may result in a skewed perception of a unit's performance.
A review of international literature on the interrater reliability of instruments for measuring functional dependence discusses some of the shortfalls of the interrater reliability studies on the FIM.
The review suggests that a more intensive study of reliability and validity of these instruments is required, and in particular that interrater studies with multiple raters need to be carried out. Some published studies of FIM reliability have been carried out in controlled settings using standardized patients and patient descriptions or videos of patients.
In a study of 20 community patients that demonstrated good interrater reliability as measured by ICCs, with values ranging from .90 to .99, the mean FIM cognitive score difference was 6, the mean FIM motor score difference was 17, and the mean FIM total score difference was 23.
This suggests that there is poor underlying agreement between the raters, although this was not reported in the study. One study has reviewed interinstitutional agreement in patients transferred from the acute setting to the rehabilitation setting. It reported on both the reliability of individual FIM items and the reliability coefficient for the total FIM scores. It reported that the reliability coefficient varied from .49 to .87 depending on the subgroup; however, the subgroups were quite small. Although some of the limitations of correlation coefficients were acknowledged and agreement was measured for individual FIM items, results on agreement of the total FIM scores were not included in the published article.
When disability is measured by the FIM, the mathematical relationship is that the more independent a patient is, the higher the item scores and therefore the higher the total FIM score. However, even if there is a good correlation between 2 measurements, the actual values (in this case, the FIM scores) may not be in agreement with each other.
Because the total FIM spans a range of 108 points (18 to 126), even a variation of 5 FIM points in a subject's score is a relatively small proportion of that range. This would explain a high correlation between the scores even if there was relatively low absolute agreement. Agreement between raters ultimately becomes a question of what difference in score is clinically relevant.
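The distinction between correlation and agreement can be illustrated with a small sketch. The six pairs of FIM totals below are invented for illustration only and are not study data; each pair differs by exactly 5 points.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def exact_agreement(x, y):
    """Proportion of pairs in which the two scores are identical."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical FIM totals from two raters; every pair differs by 5 points.
rater_a = [30, 45, 60, 90, 105, 120]
rater_b = [35, 40, 65, 85, 110, 115]

r = pearson_r(rater_a, rater_b)
agreement = exact_agreement(rater_a, rater_b)
```

In this invented example the correlation is close to 1 even though no pair of scores matches, because the 5-point discrepancies are small relative to the spread of scores across patients.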
The aim of this article is to report the FIM interrater reliability for FIM total score, FIM motor subscore, and FIM cognitive subscore from assessments that occurred in routine clinical practice in 2 closely linked inpatient rehabilitation services in Sydney, Australia.
This study was set in 2 rehabilitation units immediately adjacent to each other in southwestern Sydney, New South Wales, Australia. The subacute unit is a 20-bed combined geriatric/rehabilitation ward within a 200-bed acute hospital. Patients admitted to this unit are generally not sufficiently medically stable to allow admission directly to the rehabilitation unit. The rehabilitation unit is a 36-bed mixed general rehabilitation unit in an immediately adjacent subacute hospital.
Data Collection Process
All patients who are admitted to either unit have their functional levels measured using the FIM within 72 hours of admission and in the 72 hours prior to discharge as part of routine patient care. For most patients, the admission FIM is assessed within 24 hours of admission to the unit. In the case of FIM mobility items, this is done by the physiotherapist, usually within 2 or 3 hours of admission to the rehabilitation unit as part of a detailed physical assessment. FIM scoring and data collection are performed independently by various therapists from different disciplines, each concentrating on their area of expertise. The physiotherapists rate the mobility items; the occupational therapists rate the self-care items; the nursing staff rate bowel, bladder, and cognition; and the speech therapists rate the language items. In the subacute unit, the team consists of 1 physiotherapist, 1 occupational therapist, and a complement of nurses, of whom 5 are regularly involved in measuring the patients' activity for FIM scores. Up to 7 people were involved in the data collection in the subacute unit. In the rehabilitation unit, there are 3 physiotherapists, 3 occupational therapists, and 8 nurses who are involved in collecting the FIM data. In both units, some of the raters are FIM credentialed and some are not. The identity and number of individual staff involved in collecting the FIM data for any particular patient are not recorded. This process for scoring FIM items occurs in many rehabilitation units.
The individual ratings are collected on the FIM data sheet and are subsequently entered into a database, usually by a member of the clerical staff.
All patients included in this study commenced their rehabilitation in the subacute unit and completed it in the rehabilitation unit. These patients have a discharge FIM scored by the subacute unit staff within 3 days of discharge but most frequently in the last 2 days before discharge. The patients then have an independent admission FIM scored by the rehabilitation unit always within 3 days of admission, but most frequently in the first 24 hours. All patients therefore have independent assessments within a few days of each other, with a maximum of 6 days between assessments. Because the patients who are transferred to the rehabilitation unit are medically stable and are usually part way into their rehabilitation program, it would be expected that there would be minimal to no difference between the 2 scores, because it is unlikely, although certainly not impossible, for any significant functional change to occur in the interim.
All patients who were transferred between the 2 units between August 2006 and October 2007 were included in this study.
Relationship of Ratings
Copies of functional and mobility assessments, but not the actual FIM scores, accompanied the patients on transfer between the 2 hospitals. The general practice is for a complete assessment to be carried out on all patients who are admitted to either of the units. The FIM scores are collected and processed independently in the 2 units and are not available for the clinical staff to peruse.
The clinicians working on the 2 units were unaware of the study and were thus blind, continuing their usual practice of FIM scoring. The transfer of patients from one unit to the other therefore constituted a natural experiment allowing the interrater properties of the FIM to be studied.
Approval for the study was gained from the respective human research ethics committees of the hospitals. The FIM data were routinely collected for the purpose of demonstrating patient improvement as well as service quality and funding purposes. Patient identifiers were not required for the purpose of the analysis, and no staff involved in the data collection could be identified from the data. All staff involved in FIM scoring and data collection were in agreement with the study when they were informed. The need to seek individual consent for analysis of the data and publication of the study was waived by the human research ethics committees because loss of data by refusal to participate would jeopardize the integrity of the study. The study presented no risk to participants.
Information on patient demographics, diagnostic groupings, FIM totals, and FIM motor and FIM cognitive subscores were recorded. The differences between the totals were calculated by subtracting the admission FIM score or subscore from the relevant discharge FIM score or subscore. The minimum and maximum values from this calculation were taken as the extremes of the range.
In view of the ongoing debate regarding appropriate measures of agreement and their use, we analyzed the data in a number of ways including calculation of the 1-way random model ICC, kappa statistic, weighted kappa statistic, and Bland-Altman plots. The Bland-Altman plot is a graphic presentation of the 2 sets of scores in which the differences between the 2 scores are plotted against the averages of the 2 scores. Horizontal lines are drawn at the mean difference and at the limits of agreement, which are defined as the mean difference plus or minus 1.96 times the SD of the differences.
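As a minimal sketch of the Bland-Altman calculation, assuming the conventional multiplier of 1.96 and the sample SD of the differences, the plotted points and horizontal lines could be derived as below. The function name and the four score pairs are hypothetical; the study itself used SPSS, not this code.

```python
from statistics import mean, stdev


def bland_altman(discharge, admission, z=1.96):
    """Return plot points, mean difference, and limits of agreement.

    Differences are discharge minus admission, as in the text.
    """
    if len(discharge) != len(admission):
        raise ValueError("score lists must have the same length")
    diffs = [d - a for d, a in zip(discharge, admission)]
    averages = [(d + a) / 2 for d, a in zip(discharge, admission)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the differences
    return {
        "points": list(zip(averages, diffs)),  # x = average, y = difference
        "bias": bias,
        "lower": bias - z * sd,
        "upper": bias + z * sd,
        "range": (min(diffs), max(diffs)),
    }


# Hypothetical paired FIM totals (not study data).
result = bland_altman([100, 90, 80, 70], [95, 95, 75, 75])
```

The `range` entry corresponds to the minimum and maximum differences, and the `lower` and `upper` entries are where the limit-of-agreement lines would be drawn.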
Power calculations were completed and included in the discussion. The analysis and graphing were performed using SPSS Statistics 17.0 for Windows.
The average age of the patients was 76 years, and the median age was 79 years. Most of the patients, 63%, had an orthopedic condition, and 13% of patients had a stroke, with the rest coming from diverse groups.
The results for FIM totals and FIM motor and FIM cognitive subscores are summarized in table 1, and a summary of the distribution of differences is outlined in table 2.
Table 1. Summary of Results for FIM Total Scores and Subscores for Discharge, Admission, and Differences
There was considerable difference between the 2 FIM total scores, with a range of −32 to 50. The results are shown graphically in figure 1. The Bland-Altman plot of FIM total scores is shown in figure 2. On statistical analysis of the FIM total scores, the ICC was .872 (CI, .822–.908), the kappa was .011, and the weighted kappa was .521.
There was considerable difference between the 2 FIM motor subscores with a range of −22 to 43. The results are shown graphically in a Bland-Altman plot in figure 3. On statistical analysis of the FIM motor subscores, the ICC was .854 (CI, .797–.895), the kappa was −.007, and the weighted kappa was .493.
There was considerable difference between the 2 FIM cognitive subscores with a range of −14 to 21. Of the 35 scores in perfect agreement, most had agreement because of the ceiling effect of the measure; 28 of these patients had the highest possible score of 35. The results are shown graphically in a Bland-Altman plot in figure 4. On statistical analysis of the FIM cognitive scores, the ICC was .830 (CI, .764–.878), the kappa was .123, and the weighted kappa was .465.
All 143 patients who were transferred between the 2 units over the 15-month period were included in the study. No patients needed to be excluded from the review for incomplete data because we place significant emphasis on having completed FIM scores for our patients.
In our setting, only the FIM total and the motor and cognitive subscores are clinically or administratively relevant. We considered it appropriate to concentrate on these 3 elements in our study. There was no systematic bias evident between the scoring practices in the 2 units, as demonstrated by the small median differences between the FIM total as well as the FIM subscores. We are confident that the scoring reflects our normal practice, because the staff were not informed that the data for these patients would be analyzed in this manner prior to the commencement of the study and were thus blind during the study. It has been suggested that scoring of functional status could be open to manipulation to maximize apparent improvement; however, these results show no evidence of manipulation.
From the clinical point of view, a difference of 4 points in the FIM cognition score, 10 points in the FIM motor score, or 20 points in the FIM total score between test and retest may be acceptable. However, it could also mean the difference between a patient being independent or not independent, determined by the actual underlying difference in the FIM item scores. In view of the narrow range in the AN-SNAP v2 classification, such a difference would ensure that the patient falls into a different class. If the patient is allocated into a different class, this might alter the predicted length of stay and associated cost weights for funding. In this case, such a difference (FIM cognitive, motor, or total) is highly relevant.
One published approach uses the SEM and the ICC to construct CIs and to determine the minimal difference required in order to be confident that there has been a real change in the performance. When this method is applied to the total FIM scores, a minimum difference of 20 FIM points in a repeated total FIM score is required before the difference could be considered to be a real change. Twenty FIM points in either direction signifies considerable clinical variability within the spectrum of nonsignificant change. However, even when such a wide range of statistical nonsignificance is applied, there are still 18 of the 143 patients who have a real change in total FIM scores. The broad statistical range suggests that great care needs to be taken when measuring and interpreting FIM improvement and efficiency.
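A common formulation of this kind of calculation, offered here only as an assumption about the general approach and not as the exact procedure behind the 20-point figure, takes the SEM as the SD multiplied by the square root of (1 − ICC), and the minimal difference as 1.96 × √2 × SEM. The function names and the SD value in the example are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), a commonly used formulation."""
    if not 0 <= icc <= 1:
        raise ValueError("ICC is expected to lie between 0 and 1")
    return sd * sqrt(1 - icc)


def minimal_difference(sd, icc, z=1.96):
    """Smallest change between two scores exceeding measurement error.

    The sqrt(2) factor reflects that error is present in both scores.
    """
    return z * sqrt(2) * standard_error_of_measurement(sd, icc)


# ICC for the FIM total as reported in the text; the SD of 20 points is a
# hypothetical value chosen for illustration, not taken from the study.
md_total = minimal_difference(sd=20, icc=0.872)
```

The lower the ICC or the larger the SD of scores, the larger the difference needed before a change can be treated as real.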
There has been considerable discussion in the literature regarding the best tools and methods for analyzing categorical data for reliability and agreement both with respect to the FIM and from a general statistical point of view.
A detailed discussion would go well beyond the limits of this work, and interested readers are referred to the literature. No clear best measure of agreement is evident, and ultimately the issue of agreement is a matter of the clinical importance of differences in repeated measures.
The ICC is a measure of correlation or association rather than absolute agreement. The values we observed, .872 for the FIM total score, .854 for the FIM motor subscore, and .830 for the FIM cognitive subscore, fit into a category of high correlation (with 1.0 signifying complete correlation) between the paired measurements. This would be expected, because patients would be grouped into similar bands of functional independence but not necessarily scored exactly the same. The correlation coefficient for total FIM in this study lies at the lower end of the range (.83–.99) published in the literature in a review of 11 studies of validity of the FIM.
It is noteworthy that the correlation coefficients decrease (although they remain high) as the number of possible scores attainable decreases, in line with the properties of correlation coefficients as outlined in the introductory section.
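For readers who want to see how a 1-way random model ICC for single ratings is computed, a minimal sketch follows. It uses the standard ANOVA form, (MSB − MSW) / (MSB + (k − 1) × MSW); the function name and the example pairs are hypothetical, and the study's own values came from SPSS.

```python
def icc_one_way(ratings):
    """1-way random effects ICC for single ratings.

    `ratings` is a list of per-subject tuples, each holding k ratings.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need at least 2 subjects with k >= 2 ratings each")
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    )
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical paired FIM totals; every pair differs by 5 points.
pairs = [(30, 35), (45, 40), (60, 65), (90, 85), (105, 110), (120, 115)]
icc = icc_one_way(pairs)
```

In this invented example the ICC is very high although no pair of scores is identical, because the between-patient variance dominates the within-patient variance.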
Weighted kappa statistics approximate ICCs and therefore are also more of a measure of association than agreement.
In this study, the weighted kappa values fall into the fair to moderate agreement range. As regards the methodology of weighting, it is not clinically sensible for FIM scores, which vary by considerable amounts, such as greater than 5 or 6 FIM points, to contribute to improved agreement. However, this is exactly the case in calculating weighted kappa.
Unweighted kappa values are a better reflection of pure agreement, but there are well published concerns about their limitations.
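The difference between unweighted and weighted kappa can be sketched as below. The weighting scheme used in the study is not stated in the text, so the linear and quadratic options here are assumptions for illustration, as are the function name and the example scores.

```python
from collections import Counter


def cohen_kappa(a, b, weights=None):
    """Cohen's kappa for two raters scoring the same subjects.

    weights: None (unweighted), "linear", or "quadratic". Weighted
    versions give partial credit to near misses.
    """
    if len(a) != len(b):
        raise ValueError("both raters must score the same subjects")
    values = sorted(set(a) | set(b))
    span = values[-1] - values[0]
    if span == 0:
        raise ValueError("kappa is undefined when only one score is used")
    n = len(a)

    def disagreement(x, y):
        if weights is None:
            return 0.0 if x == y else 1.0
        d = abs(x - y) / span
        return d if weights == "linear" else d * d

    count_a, count_b = Counter(a), Counter(b)
    observed = sum(disagreement(x, y) for x, y in zip(a, b)) / n
    expected = sum(
        disagreement(x, y) * count_a[x] * count_b[y]
        for x in values
        for y in values
    ) / (n * n)
    return 1 - observed / expected


# Hypothetical scores: rater B is always exactly 1 point higher.
rater_a = [1, 2, 3, 4]
rater_b = [2, 3, 4, 5]
kappa_unweighted = cohen_kappa(rater_a, rater_b)
kappa_linear = cohen_kappa(rater_a, rater_b, weights="linear")
```

In this invented example the unweighted kappa is negative because no scores match exactly, whereas the linearly weighted kappa is positive because every disagreement is only 1 point.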
Bland and Altman suggested that a high correlation for any 2 methods designed to measure the same property is in itself just a sign that one has chosen a widely spread sample. A high correlation does not automatically imply that there is good agreement between the 2 methods. They described a method of data plotting used in analyzing the agreement between 2 different measurements. The Bland-Altman plot shows the difference between the variables against the average of the variables. Limits of agreement, usually set at the mean difference ±1.96 SD of the differences, measure the agreement between the variables. Wide limits of agreement show poor agreement between the variables. In this study, the Bland-Altman plots show that there is no obvious agreement between the FIM differences and the means of the FIM scores for the total FIM score as well as the motor and cognitive FIM subscores. For the FIM total, the limits of agreement are 26.2 points on either side of the mean. Therefore, based on the values in this study, even a difference of 52 FIM total points would fall within the limits of agreement. Fifty-two FIM total points translate to an average difference of as much as 3 FIM points per FIM item. For the FIM motor subscore, the limits of agreement are 20.7 points on either side of the mean, indicating a difference as much as 4 FIM points per FIM motor item, and for the FIM cognitive subscore, the limits of agreement are 10.9 points on either side of the mean, or a difference as much as 6 FIM points per FIM cognitive item. Although there is fair to good correlation between the measures, there is very poor agreement.
Possible contributing factors to the poor agreement in our study include the degree of attention or rigor given to the FIM scoring by the staff, the staff level of training, and staff experience both with rehabilitation patients and using a functional outcome measure.
All patients in this study had total FIM assessments completed, suggesting that staff were aware of the importance of scoring.
The need for staff training or FIM credentialing/certification for accurate scoring has been highlighted in previous studies and by the owners of the FIM copyright.
Not all of the staff were FIM-certified at the time of this study, and this is a potential weakness of our clinical practice. However, in many units, in Australia at least, with regular staff turnover and staff leave, at any one time there might be some staff who have not been FIM-credentialed but who perform patient assessments. If all staff were FIM-credentialed, the results might potentially be different. This issue needs further investigation. However, the reality of clinical practice is reflected in our results and indicates a challenge faced by all rehabilitation providers.
Another possible limitation in the study could be the variable period between the 2 measurements. Generally, in these units, admission FIMs are scored within 24 hours of admission as part of a holistic comprehensive patient assessment.
It is possible that patients might be scored on a combination of performance (at the actual time of measurement) and capacity (based on previous performance in sessions with the therapists), because the therapists are better acquainted with the patients at the time of discharge. Capacity would not be known at the time of admission, and therefore the score would be based solely on performance in the environment at admission. There is also a possibility of bias being evoked when the FIM is scored by therapists who are treating the patient rather than by more objective evaluators.
between the 2 units. There are no published studies on the effects of performance and capacity on FIM scores.
There might have been a change in the patients' functional status in the time between the measurements. Based on the natural history of the diseases and progression of function in rehabilitation units, one would expect that this change would be toward improvement. However, uniform improvement was not evident in the total FIM scores of the group as a whole or in individual patients.
The scoring of the FIM in this study, as in other studies in clinical settings, by a number of members of a team rather than a single rater might also be a source of bias, but this has not been evaluated in the literature.
A further potential contribution to the variability could be the different number of raters in the 2 institutions. There were twice as many raters involved in the rehabilitation unit. This has not been evaluated in the literature.
Individual FIM item scores might be scored differently because it might be difficult to distinguish between standby assistance and independence, or there might be variability of performance between assessments. Subtle differences might always be a problem in functional assessments, but if the measure is sufficiently stable, then one would hope that this would balance out over the 18 items and would not be unidirectional with any subject. A separate analysis of agreement of individual item scores of the FIM
the expected ICC should be 0.8 or higher. With conventional power parameters alpha equal to .05 and beta equal to 0.2, and 2 trials, the number of subjects required is approximately 46, and our sample of 143 is sufficiently large to draw conclusions regarding the findings.
There was no systematic scoring bias evident. ICCs were high, but tests of agreement demonstrated poor agreement. There were wide limits of agreement. These findings may have implications for the use of the FIM, especially if poor levels of agreement were found to exist in the presence of all staff being FIM-credentialed and standardization of methods of assessment. Patient classification and funding systems based on FIM scores with any potentially inherent difficulties and inaccuracies would reflect these difficulties and inaccuracies in classification.
This study indicates that further investigation of agreement of both FIM totals and FIM item scores in the clinical setting is warranted.
a. SPSS Inc, 233 S Wacker Dr, 11th Fl, Chicago, IL 60606.
No commercial party having a direct financial interest in the results of the research supporting this article has or will confer a benefit on the authors or on any organization with which the authors are associated.