Criteria for success after surgery for cervical radiculopathy — estimates for a substantial amount of improvement in core outcome measures

BACKGROUND CONTEXT: Defining clinically meaningful success criteria from patient-reported outcome measures (PROMs) is crucial for clinical audits, research and decision-making. PURPOSE: We aimed to define criteria for a successful outcome 3 and 12 months after surgery for cervical degenerative radiculopathy on recommended PROMs. STUDY DESIGN: Prospective cohort study with 12 months follow-up. PATIENT SAMPLE: Patients operated at one or two levels for cervical radiculopathy included in the Norwegian Registry for Spine Surgery (NORspine) from 2011 to 2016. OUTCOME MEASURES: Neck disability index (NDI), Numeric Rating Scale for neck pain (NRS-NP) and arm pain (NRS-AP), health-related quality-of-life EuroQol


Introduction
The last decade's advances in surgical technique and equipment have increased the effectiveness and safety of surgical intervention for cervical degenerative radiculopathy (CDR), making operations for disc herniation and spondylotic foraminal stenosis high-volume procedures [1,2]. Since surgery is a costly treatment with potential risks, there has been a need to define criteria for substantial benefit to facilitate doctor-patient communication and assess quality of surgical care [3,4]. In this way, the introduction of patient-reported outcome measures (PROMs) [5] and the concept of minimal important change (MIC) have been important to establish evidence-based practice. The MIC represents the smallest difference in PROM score that is clinically beneficial within a patient group, as recommended by consensus-based standards for the selection of health status measurement instruments [5,6]. Other similar concepts are currently being used, like minimal clinically important difference (MCID) [7].
The concept of success, representing a more optimal treatment goal than the MIC, can be used both in communication with patients in clinical practice and in research but is often poorly defined or surgeon-reported. One way to assess it more accurately is to align it with the concept of substantial improvement, which was first described for patients undergoing lumbar surgery [8] and later assessed for heterogeneous patient populations undergoing surgery for degenerative spine conditions [9,10]. For CDR patients, however, PROM-based definitions of substantial change after surgery have not been well defined.
The aim of this study was to define success criteria after surgery for cervical radiculopathy performed in daily clinical practice based on frequently used PROMs: the neck disability index (NDI), the EuroQol (EQ-5D-3L) with visual analogue scale (EQ-VAS), and numeric rating scale for arm pain (NRS-AP) and neck pain (NRS-NP).

Data source
All data were collected through the Norwegian Registry for Spine Surgery (NORspine). NORspine is a government-funded comprehensive clinical registry receiving no industry funding and used for quality assessment and research. Informed consent is obtained from all patients before they enter the registry. Currently, all centers performing cervical spine surgery in Norway report data to NORspine (coverage=100%) and the operation recording rate is 78% (completeness) [11].
The board of NORspine allowed us to access the data after the Norwegian Committee for Medical and Health Research Ethics Midt approved our research protocol (2014/344).

Design
This is a prospective cohort study with follow-up at 3 and 12 months. This report is consistent with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement [12] and the methods used are in accordance with the Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) recommendations [6].

Eligibility criteria
Of 4,229 consecutive patients operated for degenerative disorders in the cervical spine between January 2011 and August 2016 in ten private or public clinics, 2,868 were included for the main investigation. Eligible patients were those who had undergone surgery with either anterior cervical discectomy and fusion (ACDF) or arthroplasty (ACDA) (n=2,640) or posterior cervical foraminotomy or hemilaminectomy (n=228) at one or two levels due to CDR, excluding patients with more complex pathology, verified or possible myelopathy, and former operation(s) at the index level (Fig. 1).
Two diagnostic subgroups were investigated separately: patients with disc herniation (n=1,182) and patients with spondylotic foraminal stenosis (n=430). Since these degenerative changes often coexist, we excluded patients operated for both diagnoses. Patients operated at more than one level, indicating more widespread cervical spondylosis, were also excluded from these subgroup analyses. We chose this strategy because it may be difficult to decide the clinical relevance of multiple nerve root compressions found on MRI. Therefore, the total number of patients in the two diagnostic subgroups (n=1,612) does not add up to the number of patients for the whole material (n=2,868) in Fig. 1.

Measurements
The comprehensive NORspine self-administered questionnaire consists of information about sociodemographic factors, lifestyle, work, pain location and duration of symptoms in addition to PROMs. Patients complete it at admission for surgery (baseline) and at home 3 and 12 months after surgery, after receiving it by postal mail. To avoid selective reporting, the NORspine central unit collects follow-up data without involvement of the treating hospitals. The patient receives a reminder with a new questionnaire if he or she does not respond.
After the operation, the surgeon completes a separate form with information about diagnosis, treatment, comorbidity (including the American Society of Anesthesiologists (ASA) physical status), surgical indication (radiculopathy, myelopathy, pain, paresis and others) and type of operation.
The following PROMs were included at all time points: Neck disability index (NDI) [13] is a measure of neck pain related disability, containing 10 items (pain, personal care, lifting, reading, headaches, concentration, work, driving, sleeping and recreation), all scored on a 6-point ordinal scale (0−5). The 10 items are summed and recalculated to a percentage score ranging from 0 to 100 (no to maximum disability).
EuroQol (EQ-5D-3L) [14] is a generic, preference-weighted measure of health-related quality-of-life based on five dimensions: mobility, self-care, usual activity, pain/discomfort and anxiety/depression. For each dimension the patient chooses one of three possible levels (3L) of problems: "none," "mild to moderate," and "severe." The score ranges from −0.59 to 1, where 1 corresponds to perfect health, 0 to death, and negative values to states worse than death. In the second part, called the EQ-VAS, the patient is asked to indicate overall health on a vertical analogue scale, ranging from 0 to 100 ("worst" to "best imaginable health").
Numeric rating scale for arm (NRS-AP) and neck pain (NRS-NP) [15,16] assesses pain severity ranging from 0 to 10 ("no" to "worst conceivable pain") on two separate scales. Information about joint pain is not collected.
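The NDI scoring described above (ten items scored 0−5, summed and rescaled to 0−100) can be sketched as follows. This is a minimal illustration and not the registry's scoring routine: the function name `ndi_percentage` and the example patient are hypothetical, and the handling of missing items, which the text does not describe, is deliberately left out.

```python
def ndi_percentage(items):
    """Convert ten NDI item scores (each 0-5) to a 0-100 percentage score.

    Assumption: all ten items are answered; missing items are not handled.
    """
    if len(items) != 10:
        raise ValueError("the NDI has exactly 10 items")
    if any(not 0 <= score <= 5 for score in items):
        raise ValueError("each NDI item is scored from 0 to 5")
    raw = sum(items)          # raw sum ranges from 0 to 50
    return raw * 100 / 50     # rescale to 0 (no) .. 100 (maximum disability)


example = [2, 1, 3, 2, 2, 1, 3, 2, 2, 2]   # hypothetical patient, raw sum = 20
print(ndi_percentage(example))              # 40.0
```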
Included in the two follow-up questionnaires is also the Global Perceived Effect scale (GPE) [17], which measures the patient perceived benefit of an operation by asking how the situation is for the patient after the procedure. There are seven response categories: (1) "completely recovered," (2) "much improved," (3) "slightly improved," (4) "unchanged," (5) "slightly worse," (6) "much worse," and (7) "worse than ever." In this study, the GPE scale was applied as an external criterion to define cut-offs for success on the PROM scales. Patients reporting to be "completely recovered" or "much improved" (1−2) were classified as having a "successful outcome," while those who considered themselves to be "slightly improved," "unchanged" or worse (3−7) were classified as having a "nonsuccessful" outcome. The same method has previously been applied on several datasets from NORspine [18−21].
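The dichotomization of the GPE anchor can be written out in a few lines. The sketch below is only an illustration of the rule stated above (categories 1−2 are a success, 3−7 a nonsuccess); the names `GPE_LABELS` and `is_success` are hypothetical.

```python
# The seven GPE response categories as listed in the text.
GPE_LABELS = {
    1: "completely recovered",
    2: "much improved",
    3: "slightly improved",
    4: "unchanged",
    5: "slightly worse",
    6: "much worse",
    7: "worse than ever",
}


def is_success(gpe):
    """Return True for GPE categories 1-2 (success), False for 3-7 (nonsuccess)."""
    if gpe not in GPE_LABELS:
        raise ValueError("GPE must be an integer from 1 to 7")
    return gpe <= 2


for category, label in GPE_LABELS.items():
    print(category, label, "success" if is_success(category) else "nonsuccess")
```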

Statistical analyses
All statistical analyses were performed with the Statistical Package for the Social Sciences (SPSS, version 25). Baseline characteristics and preoperative PROMs were reported as means and standard deviations of continuous variables and as percentages of categorical variables. The patient cohort was analyzed as a whole, then separately for 3- and 12-month follow-ups, procedural groups (the posterior approach group and the anterior approach group) and diagnostic groups (the disc herniation group and the spondylotic foraminal stenosis group).
We calculated the change score as the absolute difference between the pre- and postoperative scores. The percentage change score equals the absolute difference divided by the baseline score, multiplied by 100.
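A small worked example may make the two definitions concrete. The sketch below assumes a sign convention that the text does not spell out: for NDI and NRS, where lower scores are better, the change is taken as baseline minus follow-up so that positive values mean improvement. The function names are hypothetical, and returning `None` for a baseline of zero (where the percentage is undefined) is likewise an assumption.

```python
def change_score(baseline, follow_up):
    """Change from baseline to follow-up (assumed sign: positive = improvement
    on scales where lower is better, such as NDI and NRS)."""
    return baseline - follow_up


def percentage_change(baseline, follow_up):
    """Change score divided by the baseline score, multiplied by 100.

    Returns None when the baseline is 0, because the ratio is undefined.
    """
    if baseline == 0:
        return None
    return 100 * (baseline - follow_up) / baseline


# Hypothetical patient: NDI 40 before surgery, 24 at follow-up.
print(change_score(40, 24))        # 16
print(percentage_change(40, 24))   # 40.0
```

The same 16-point change corresponds to a smaller percentage change for a patient starting at a higher baseline, which is why the two scores can classify patients differently.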
The distributions of the 3- and 12-month scores, that is the follow-up, mean change and percentage change scores according to each of the response alternatives of the GPE scale, were analyzed by analysis of variance (ANOVA). Because the EQ-5D-3L values range from −0.6 to 1.0 and can therefore be zero or negative at baseline, a percentage change cannot be meaningfully calculated for this measure. However, the percentage change score was calculated for EQ-VAS (0−100). The correlations between the ordinal GPE scale and the PROMs were analyzed by the Spearman rank correlation coefficient, rho.
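For readers unfamiliar with the Spearman coefficient, the sketch below shows one standard way to compute it: both variables are converted to ranks (tied values share their average rank, which matters for a seven-category scale such as the GPE) and the Pearson correlation of the ranks is taken. It is an illustration in plain Python, not the SPSS procedure used in the study, and the data are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: GPE category and NDI follow-up score for eight patients.
gpe = [1, 1, 2, 2, 3, 4, 5, 7]
ndi = [4, 10, 8, 20, 26, 30, 44, 60]
print(round(spearman_rho(gpe, ndi), 2))
```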
Receiver operating characteristic (ROC) curves were used to assess the discriminative ability of the PROMs and to define the optimal cut-off with the highest sensitivity and specificity. ROC curves were made by plotting the sensitivity against (1-specificity) for each possible cut-off value for success. The sensitivity refers to the probability of correctly classifying an individual replying "completely recovered" or "much improved" into the group with a successful outcome (1−2) based on the simultaneously reported PROM score. Correspondingly, the specificity refers to the probability of correctly classifying a patient reporting anything less than "much improved" into the "nonsuccessful" group (3−7). The area under the ROC curve (AUC) with 95% confidence interval was used as the measure of discriminative ability, as it describes the test's accuracy in correctly classifying a case according to the anchor. The larger the area under the curve, the greater the accuracy of the test. The AUC is classified as "excellent" from 1.0 to 0.90, "good" from 0.90 to 0.80, "fair" from 0.80 to 0.70, "poor" from 0.70 to 0.60, and "failed" from 0.60 to 0.50 [22].
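The ROC procedure can be sketched in plain Python for a follow-up score where lower values indicate a better outcome (for example the NDI). Several choices here are assumptions rather than the authors' documented method: the "optimal" cut-off is taken as the one maximizing Youden's index (sensitivity + specificity − 1), the AUC is computed by the pairwise (Mann-Whitney) formulation, and boundary values such as 0.90 are assigned to the higher class. All function names and data are hypothetical.

```python
def sens_spec(scores, success, cutoff):
    """Sensitivity and specificity when 'score <= cutoff' predicts success."""
    tp = sum(1 for s, y in zip(scores, success) if y and s <= cutoff)
    fn = sum(1 for s, y in zip(scores, success) if y and s > cutoff)
    tn = sum(1 for s, y in zip(scores, success) if not y and s > cutoff)
    fp = sum(1 for s, y in zip(scores, success) if not y and s <= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores, success):
    """Cut-off maximizing Youden's index (an assumed optimality criterion)."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sens_spec(scores, success, cutoff)
        youden = sens + spec - 1
        if best is None or youden > best[0]:
            best = (youden, cutoff, sens, spec)
    return best[1], best[2], best[3]


def auc(scores, success):
    """Probability that a random success has a lower score than a random
    nonsuccess, counting ties as one half."""
    pos = [s for s, y in zip(scores, success) if y]
    neg = [s for s, y in zip(scores, success) if not y]
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classify_auc(value):
    """Verbal AUC classes from the text; boundaries go to the higher class."""
    if value >= 0.90:
        return "excellent"
    if value >= 0.80:
        return "good"
    if value >= 0.70:
        return "fair"
    if value >= 0.60:
        return "poor"
    return "failed"


# Invented follow-up scores; True = GPE 1-2 (success), False = GPE 3-7.
scores = [4, 10, 16, 20, 22, 30, 26, 36, 44, 60]
success = [True] * 6 + [False] * 4

cutoff, sens, spec = best_cutoff(scores, success)
area = auc(scores, success)
print(cutoff, round(sens, 2), round(spec, 2), round(area, 2), classify_auc(area))
```

With real data the confidence interval for the AUC would also be needed; it is omitted here to keep the sketch short.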

Results
Out of the 4,229 patients operated for CDR in the NORspine registry, 2,868 patients met the inclusion criteria. Of these patients, 2,640 patients had undergone either anterior cervical discectomy and fusion (n=2,609) or anterior cervical discectomy and arthroplasty (n=31). Another 228 patients were operated with posterior approach procedures, meaning either unilateral or bilateral posterior cervical foraminotomy (n=227) or hemilaminectomy (n=1).
A total of 66% and 64% of the patients responded to the 3- and 12-month follow-up, respectively (Fig. 1). The nonresponding patients were slightly older, were more likely to be men, to smoke, to have less comorbidity and low ASA level, and to score slightly poorer on levels of pain severity, disability, and health-related quality-of-life (Table 1). Baseline characteristics of the whole radiculopathy group and of the two diagnostic subgroups operated on at one level (disc herniation and foraminal stenosis group) are presented in Table 2. The spondylotic foraminal stenosis group had a higher proportion of men, higher age and ASA level, and more degenerative changes in the neck and comorbidity than the disc herniation group. Patients with disc herniation had more severe symptoms at baseline than patients with spondylotic foraminal stenosis, as well as lower health condition scores. There were minor differences in the baseline PROM scores between the two diagnostic subgroups. For the procedural groups, patients operated with posterior approach procedures had significantly better PROM scores than the anterior approach group: NDI 35.3 versus 41.7, p<.001; NRS-AP 5.5 versus 6.4, p<.001; NRS-NP 5.8 versus 6.1, p<.001; EQ-5D-3L 0.4 versus 0.5, p=.005; EQ-VAS 56.6 versus 49.8, p<.001.
The mean follow-up scores of PROMs at 12 months according to each GPE category are presented in Fig. 2A−E. For all PROMs, there was a stepwise decrease in follow-up scores for patients who reported themselves to be completely recovered and much better compared to those reporting no change or worsening. The results of the mean change scores and the mean percentage change scores at 12 months showed a similar pattern (Appendix A), as did the follow-up score, change score and percentage change score at 3 months (available on request). The correlations between the PROMs and the GPE were moderate to strong, especially for NDI and NRS-AP follow-up scores and percentage change scores (0.7−0.8), but weaker for mean change scores (0.5−0.7). The correlations were generally weaker for the NRS-NP, EQ-5D-3L and EQ-VAS (0.4−0.7) scores.
We found minor differences in AUC and cut-off values between 3- and 12-month scores. Therefore, further analysis of the data is presented only for PROMs at 12-month follow-up. The 3-month scores can be found in Appendix B. AUC for NDI and NRS-AP follow-up scores and percentage change scores showed "good" to "excellent" test accuracy (Table 3). NRS-NP, EQ-5D-3L and EQ-VAS showed either "good" or "fair" test accuracy. In general, AUC was slightly lower for the change scores than for the follow-up scores and the percentage change scores.
In Table 3, we present the cut-off values for follow-up scores, change scores and percentage change scores with highest sensitivity and specificity for the PROMs at 12 months. The cut-off values for the NDI and NRS-AP had the highest sensitivity and specificity. For example, at follow-up an NDI percentage change score of 35% or more provided a sensitivity and specificity of 84% in distinguishing between a successful outcome or not. The NRS-AP had a larger percentage change score of 47%, whereas the NRS-NP score was 39%. Both these PROMs had slightly lower accuracy estimates. The EQ-5D-3L and EQ-VAS showed the poorest discriminative ability of success versus nonsuccess. For the subgroup analyses there were only minor variations across the two diagnoses. Finally, we also found minor differences between anterior approach and posterior approach procedural groups regarding cut-off scores (Table 4) and AUC (Appendix C).

Discussion
We found good to excellent discriminative ability in distinguishing between success and nonsuccess following neck surgery due to radiculopathy for the most commonly used PROMs. The NDI and the NRS-AP had the highest discriminative ability at 3 and 12 months. The NRS-NP, EQ-5D-3L and EQ-VAS showed markedly lower accuracy. We found a better discriminative ability for the percentage change scores and the follow-up scores compared to the change scores. This finding is in line with previous studies conducted on surgery for lumbar disc herniation [18] and lumbar spinal stenosis [19,20]. Furthermore, the use of change scores for benchmarking has been criticized for not taking into account the patient's baseline score [23−25]. The percentage change score, on the other hand, expresses the improvement relative to where the patient started. Also, our impression is that patients seem to put more emphasis on the follow-up score than on the change score in clinical practice. We therefore recommend using the cut-offs for success on follow-up and percentage change scores in clinical practice and future studies.
We found only minor differences in cut-off values across the two diagnostic groups and between 3 and 12 months after surgery. This means that the same cut-off scores can be applied at different time intervals and across subgroups of patients operated for CDR. One exception was the cut-off value for the NRS-NP percentage change score. Patients with spondylotic foraminal stenosis had to undergo a considerably greater change for the procedure to be considered a success (43.7%) than patients with disc herniation (35.4%). Since this is the only major difference between the two diagnostic groups, the result should be interpreted carefully.
For the two procedural groups, one cut-off score can be used. This is supported by findings in recent studies [26,27]. However, the posterior approach group was small in comparison to the anterior approach group (n=228 vs. n=2,640), and caution is warranted in drawing conclusions on the basis of our results alone.
Conceptually, "success," implying a substantial improvement, is different from the MIC. Therefore, we chose to use "much better" or "completely recovered" as success criteria on the GPE (1−2) and defined "slightly better" and the other categories (GPE 3−7) as a "nonsuccess." Substantial improvement has previously been assessed for populations comprising both radiculopathy and myelopathy patients [9,10] and on lumbar spine surgery cohorts [8,19,21], but not for radiculopathy patients alone. Fig. 2 illustrates that our definitions were reasonable.
Often in studies of MIC/MCID, the category "slightly better" is placed in the "improved" class [28]. This distinction is important to consider when interpreting our results. For instance, the cut-off value for the NDI change score was 13.5 points, which is in line with previous definitions of MIC for neck patients [10,29−31]. Similar concordance with MIC was also found for the other PROMs. Also, in previous NORspine studies on lumbar surgery patients, cut-off values for a successful outcome assessed by the Oswestry Disability Index, NRS leg pain and NRS back pain were found to be at the same or slightly higher level as compared to NDI, NRS-AP and NRS-NP in this study [19,21].

Limitations and strengths of study
The main limitation of this study is using the GPE scale as an anchor, since it is a self-reported scale, influenced by the current health status of the patient [17]. Using a more objective anchor could be advisable [32,33]. However, no objective gold standard currently exists. The psychometric properties of the GPE seem to be good [17,34−36]. It has therefore been recommended, despite its limitations [23,37].
Another limitation is the nonrespondent rate of approximately 35%. Although it may be regarded as acceptable for a spine registry [38], it might introduce a selection bias. Some of the baseline characteristics of the nonrespondents (Table 1) have been associated with poorer outcomes [39], though others have not. Also, two previous studies found no differences in outcome when comparing respondents and nonrespondents at follow-up [40,41].
A major strength of this study is the large sample size of patients operated in daily clinical practice [11] indicating a high external validity of our results.

Conclusion
In conclusion, the best ability to distinguish between a successful and nonsuccessful outcome 12 months after surgery was found for an NDI follow-up score lower than 24 or a percentage change score larger than 35%, and for an NRS-AP follow-up score lower than 2.5 or a percentage change score larger than 47%. In this cohort, these criteria were stable at both 3 and 12 months of follow-up, and across subgroups of patients operated for CDR. Further research is needed to see if these scores are similar for other cohorts.
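As a hedged illustration of how these criteria might be applied to an individual patient, the sketch below evaluates the follow-up criterion and the percentage change criterion separately, since the two need not agree. The thresholds are those stated above; the names `CRITERIA` and `meets_criteria`, the strict inequalities, and the sign convention (baseline minus follow-up) are assumptions made for the example.

```python
# Thresholds taken from the conclusion; the dictionary layout is hypothetical.
CRITERIA = {
    "NDI": {"follow_up_below": 24, "pct_change_above": 35},
    "NRS-AP": {"follow_up_below": 2.5, "pct_change_above": 47},
}


def meets_criteria(prom, baseline, follow_up):
    """Evaluate both success criteria for one patient and one PROM.

    Returns a dict with one boolean per criterion; the percentage change
    criterion is None when the baseline score is 0 (percentage undefined).
    """
    limits = CRITERIA[prom]
    if baseline == 0:
        pct_ok = None
    else:
        pct = 100 * (baseline - follow_up) / baseline
        pct_ok = pct > limits["pct_change_above"]
    return {
        "follow_up": follow_up < limits["follow_up_below"],
        "percentage_change": pct_ok,
    }


# Hypothetical patients.
print(meets_criteria("NDI", 40, 20))    # both criteria met
print(meets_criteria("NDI", 40, 28))    # neither criterion met
print(meets_criteria("NRS-AP", 6, 3))   # {'follow_up': False, 'percentage_change': True}
```

The third patient shows why the choice of criterion matters: a halving of arm pain from 6 to 3 satisfies the percentage change criterion but not the follow-up criterion.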

Fig. 2. (A−E). Boxplots of global perceived effect scale (GPE) and follow-up scores of patient-reported outcome measures (PROMs) at 12 months. Values which are more than three box lengths from either end of the box are denoted by asterisks ("*"). Values which are between one and a half and three box lengths from either end of the box are denoted by "o" (outliers). (A): Boxplot of neck disability index (NDI) and GPE at 12 months. (B): Boxplot of numeric rating scale for arm pain (NRS-AP) and GPE at 12 months. (C): Boxplot of numeric rating scale for neck pain (NRS-NP) and GPE at 12 months. (D): Boxplot of health-related quality-of-life by EuroQol (EQ-5D-3L) and GPE at 12 months. (E): Boxplot of general health status by EuroQol (EQ-VAS) and GPE at 12 months.

Table 1
Baseline characteristics of respondents and nonrespondents to follow-up at 12 months. * Standard deviation. † Neck disability index (0−100).

Table 2
Baseline characteristics. Characteristics of the whole radiculopathy group and of the two diagnostic groups operated on at one level and with either disc herniation or spondylotic foraminal stenosis. * Standard deviation. † Neck disability index (0−100).

Table 3
Area under the curve and cut-off values for "success" for all patient-reported outcome measures at 12 months. † Area under the curve.

Table 4
Cut-off values with sensitivity and specificity for all patient-reported outcome measures in the two diagnostic subgroups and the two procedural groups. Estimates for the 12-month follow-up score, and the change score and percentage change score from baseline to 12-month follow-up