Do clinical trials prepare to fail by failing to prepare? An examination of MS trials and recommendations for patient-reported outcome measure selection

Background: Many clinical trials use patient-reported outcome (PRO) measures, which can influence treatment decision-making, drug approval and label claims. Given that many PRO measure options exist, and there are conceptual and contextual complexities with PRO measurement, we aimed to evaluate how and why specific PRO measures have been selected for pivotal multiple sclerosis (MS) clinical trials. Specifically, we aimed to identify the reasons documented for PRO measure selection in contemporary phase III MS disease-modifying treatment (DMT) clinical trials.
Methods: We searched for phase III clinical trials of MS DMTs published between 2015 and 2021 and evaluated trial protocols, or primary publications where available, for PRO measure selection information. Specifically, we examined study documents for their clarification of clinical concepts measured, definitions of concepts measured, explanations of which PRO measures were considered, why specific PRO measures were chosen, and trade-offs in PRO measure selection.
Results: We identified 1705 abstracts containing 61 unique phase III MS DMT clinical trials. We obtained and examined 27/61 trial protocols. Six protocols were excluded: four contained no mention of PRO measures and two contained redacted sections preventing adequate assessment, leaving 21 protocols for assessment. For the remaining 34 trials (61 – 27), we retrieved 31 primary publications; 15 primary publications mentioned the use of a PRO measure. None of the 36 clinical trials that mentioned the use of PRO measures (21 protocols and 15 primary publications) documented clear PRO or clinical outcome assessment (COA) measurement strategies, provided clear justifications for PRO selection, or reasons why specific PRO measures were selected when alternatives existed.
Conclusion: PRO measure selection for clinical trials is not evidence-based or underpinned by structured systematic approaches. This represents a critical area for study design improvement as PRO measure results directly affect patient care, PRO measurement has conceptual and contextual complexities, and there is a wide range of options when selecting a PRO measure. We recommend trial designers use formal approaches for PRO measure selection to ensure PRO measurement-based decisions are optimised. We provide a simple, logical, five-stage approach for PRO measure selection in clinical trials.


Introduction
Pivotal multiple sclerosis (MS) treatment trials frequently include clinical variables best measured using patient reports. Common examples are fatigue, walking ability, and life quality. These variables are measured using self-reported rating scales or questionnaires. The important distinction between the clinical variables for measurement (patient-reported outcomes, PROs), and the methods for their measurement (PRO measures) is often unclear. This article concerns how and why clinical trialists choose PRO measures.
Initially, PROs were exploratory endpoints (Freeman et al., 2001). Now, they are typically secondary (Naismith et al., 2020) or primary (Hobart et al., 2019; Hupperts et al., 2022) endpoints, and subjective methods for interpreting objective measurements (Fisk et al., 2005; Goodman et al., 2009, 2010). Consequently, data generated by PRO measures increasingly influence drug approvals, public expenditure, and personalised clinical decision-making (Butcher et al., 2020). These critical implications mean there can be no justification, scientific or ethical, for weak measurement; or, more precisely, for using weaker-than-is-available PRO measurement, as trial designers typically have a choice of PRO measures from which to select. This emphasises the importance of optimizing PRO measure selection so that measured changes accurately approximate real changes in the most meaningful outcome within the context of the targeted population, study design, and analysis plan.
The chosen PRO measure affects the study results because PRO measures differ in structural and performance aspects. For example, PRO measures purporting to be reliable and valid measures of the same clinical concept/variable (e.g., fatigue) can differ in development quality (Close et al., 2023), item number and content, and item response category number and nature. These factors affect PRO measurement performance (validity, reliability, range, precision, ability to detect change). Consequently, different PRO measures, seemingly measuring the same concepts, could reach substantially different conclusions.
PRO measures also have context-dependent performance characteristics. For example, PRO measures have limited measurement ranges. Therefore, the distribution of scores on a PRO measure may have implications for the potential to measure change, depending on the context of use. For example, in the EXPAND study (Kappos et al., 2018), baseline 12-item MS walking scale (MSWS-12v2) scores were skewed (Hobart et al., 2022). Nearly half of participants (44%) were located in the most disabled quartile of the scale range (MSWS-12v2 total score range 32-42). This observed score distribution is not surprising, as EXPAND participants had secondary progressive MS (SPMS). However, the importance lies in the interaction between the PRO (walking ability), PRO measure (MSWS-12v2), patient sample (walking disabled people progressively worsening), conceptualisation of siponimod's treatment effect (aiming to slow progression), and study design (entry criteria, time to endpoint, analysis plan).
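The range point can be made concrete with a minimal Python sketch. It shows how a sample concentrated in the top quartile of a bounded scale leaves little room to register further worsening. The 0 to 42 scale range is an assumption inferred from the quoted top-quartile bounds of 32-42, and the baseline scores are invented for illustration; they are not EXPAND data.

```python
def top_quartile_floor(scale_min, scale_max):
    """Lower bound of the top (most disabled) quartile of the scale range."""
    return scale_min + 0.75 * (scale_max - scale_min)


def headroom_summary(baseline_scores, scale_min=0, scale_max=42):
    """Summarise how much room a bounded scale leaves to measure worsening.

    scale_min and scale_max are assumed values (0 and 42), not taken from
    the instrument's documentation.
    """
    floor = top_quartile_floor(scale_min, scale_max)
    in_top_quartile = [s for s in baseline_scores if s >= floor]
    # Headroom: the largest worsening the scale can still record per person.
    headroom = [scale_max - s for s in baseline_scores]
    return {
        "top_quartile_floor": floor,
        "share_in_top_quartile": len(in_top_quartile) / len(baseline_scores),
        "mean_headroom": sum(headroom) / len(headroom),
    }


# Hypothetical baseline scores for ten participants (illustrative only).
hypothetical_scores = [10, 20, 30, 33, 35, 38, 40, 41, 42, 25]
summary = headroom_summary(hypothetical_scores)
print(summary)
```

In this invented sample, participants already near the ceiling can worsen by only a few scale points, so a treatment that slows progression has little measurable change to act on.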
There are other context-dependent PRO measurement issues. These include: the nature of the concept of interest (e.g., walking ability), which cannot be assumed to be static across contexts of use; item response dependence, the influence of Time 1 item scores on subsequent timepoint scores (Andrich et al., 2012; Hobart et al., 2022); and differential item functioning (Dib et al., 2017), where PRO measure items perform differently across groups (e.g., treatments, genders and cultures). All these empiric issues underscore the value and importance of carefully selecting 'the best' of existing relevant PRO measures for clinical trials.

Objective
To identify phase III clinical trials of MS disease-modifying treatments (DMTs) that used PRO measures and assess whether rationales and justifications for PRO measure selection were provided.

Literature search
Table 1 shows our literature search terms and criteria. We searched for phase III MS DMT clinical trials published between January 2015 and November 2021. This time window was chosen to reflect recent studies and the increasing focus on PRO measures, and to enable a sample large enough for meaningful inferences. For these trials we attempted to access the full clinical trial protocols, from either ClinicalTrials.gov or directly from trial sponsors. When full trial protocols could not be acquired, we retrieved the study's primary publications.

Analysis of PRO measure selection
Table 1
Search terms and inclusion criteria.
• Screened for phase III trials of MS disease-modifying therapies (DMTs)
• Excluded abstracts not reporting a phase III MS DMT trial
• NCT numbers and/or study names for the retained records obtained
• Reviewed trial publications and clinical trial databases to identify study protocols, and also contacted study sponsors via email to request a copy of trial protocols
• Excluded protocols that did not mention the use of a PRO measure

From each protocol or primary publication meeting our inclusion criteria, we extracted data concerning PRO measure selection. Table 2 shows our six assessment criteria and simple bespoke scoring system (no established scoring exists): 'Good' (2), 'Partial' (1), 'None' (0). Our first assessment criterion asks whether an overall outcome measurement strategy (clear documentation of which outcomes were chosen for measurement and why) was documented a priori. This would include all outcomes measured as endpoints (primary, secondary, exploratory), not simply PROs and their PRO measures, that provide the framework for measurement, and lay the foundation for the more detailed and specific information on each clinical variable.
Our five other assessment criteria focus on PROs. They underpin a logical path for instrument choice when measuring complex clinical variables, where multiple measurement options exist, and different PRO measures have different structures, properties, and context-of-use-specific issues. The five questions are: was it clearly documented which PROs were being measured and why; were the PROs intended for measurement defined (PRO measurement scores need to represent a stated concept); was it clear which PRO measures were considered; was there an explanation why a specific PRO measure was chosen instead of others; were the trade-offs associated with different PRO measures explained?
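The six criteria and the three-level scoring can be written down as a small data structure. The following is a minimal Python sketch, not the authors' tooling: the short criterion keys and the `tally_ratings` helper are hypothetical names introduced here, and treating an unrated criterion as 'None' (0) follows the rule that missing information is categorised as 'None'.

```python
from collections import Counter
from enum import IntEnum


class Rating(IntEnum):
    """Three-level score described in the text."""
    NONE = 0
    PARTIAL = 1
    GOOD = 2


# Hypothetical short keys for the six assessment criteria.
CRITERIA = (
    "outcome_strategy_documented",
    "pros_measured_and_why",
    "pros_defined",
    "pro_measures_considered",
    "reason_for_choice",
    "trade_offs_explained",
)


def tally_ratings(ratings):
    """Count how many criteria received each rating for one PRO measure.

    Criteria missing from `ratings` are treated as NONE (no information).
    """
    unknown = set(ratings) - set(CRITERIA)
    if unknown:
        raise ValueError(f"Unknown criteria: {sorted(unknown)}")
    complete = {c: Rating(ratings.get(c, Rating.NONE)) for c in CRITERIA}
    counts = Counter(r.name for r in complete.values())
    return {name: counts.get(name, 0) for name in Rating.__members__}


# Illustrative input: one criterion rated 'Partial', the rest undocumented.
example = tally_ratings({"reason_for_choice": Rating.PARTIAL})
print(example)
```

Counting ratings per level, rather than summing them into a total, is a choice made for this sketch; the text does not say that scores were summed.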

Literature search
Fig. 1 shows our literature search returned 1705 abstracts. Supplementary Table S1 lists the 61 unique phase III MS DMT clinical trials we identified published between January 2015 and November 2021.

Clinical trial protocols
We retrieved 27/61 protocols (44%): 10 directly from study sponsors, 17 from ClinicalTrials.gov. Six protocols were excluded: four contained no details of PRO measures, two contained redacted sections. ClinicalTrials.gov records for these six trials did not mention PRO measure use (Supplementary Table S1).
Table 3 shows the remaining 21 clinical trials, listed in alphabetic order, mentioning PRO measures in their trial protocols. For the remaining 34 trials, we examined their primary publications and ClinicalTrials.gov records.

Primary publications
We retrieved the primary publications for 31/34 (91%) clinical trials where trial protocols were not available. No primary publications were available for three currently ongoing clinical trials (ENSEMBLE [NCT03085810], HERCULES [NCT04411641] and PERSEUS [NCT04458051]). Only abstracts for these trials were retrieved from the literature. Although none of the abstracts mentioned PRO measure use, ClinicalTrials.gov records indicated all three trials used PRO measures. Supplementary Table S1 details these studies and the PRO measures used.
Of the 31 primary publications available, 15 clinical trials mentioned PRO measures (Table 4). The remaining 16 primary publications, and their ClinicalTrials.gov records, implied no PRO measures were used. Of note, ClinicalTrials.gov records contain considerable variability in the level of clinical trial detail provided. PRO measures may have been used but not documented in primary publications, abstracts, or ClinicalTrials.gov records.
Overall, based on available information, 39/61 (64%) MS clinical trials reported between 2015 and 2021 used PRO measures.
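The trial-flow counts reported above can be checked with a short Python sketch. All numbers come from the text; the one assumption, flagged in the code, is that the 39 trials using PRO measures are the sum of the 21 assessed protocols, the 15 primary publications, and the three ongoing trials whose registry records list PRO measures.

```python
def trial_flow():
    """Recompute the trial-flow counts reported in the Results."""
    trials_identified = 61
    protocols_retrieved = 27
    protocols_excluded = 4 + 2  # no PRO measures mentioned + redacted sections
    protocols_assessed = protocols_retrieved - protocols_excluded

    trials_without_protocol = trials_identified - protocols_retrieved
    publications_retrieved = 31
    publications_with_pro = 15
    ongoing_without_publication = trials_without_protocol - publications_retrieved

    # Assumption: the 39/61 total is reached by adding these three groups.
    trials_using_pro = (
        protocols_assessed + publications_with_pro + ongoing_without_publication
    )
    return {
        "protocols_assessed": protocols_assessed,
        "trials_without_protocol": trials_without_protocol,
        "ongoing_without_publication": ongoing_without_publication,
        "trials_using_pro": trials_using_pro,
        "percent_using_pro": round(100 * trials_using_pro / trials_identified),
    }


flow = trial_flow()
print(flow)
```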

Analysis of PRO measure selection
Below we provide separate results for the clinical trial protocols and primary publications.

Clinical trial protocols
Table 3 summarizes the information derived from the 21 clinical trial protocols. Relevant information on each clinical trial is provided, together with our assessments against the six criteria listed in Table 2.
Overall, we rated nearly all PRO measures as 'None' (score = 0) (82/87; 94%), as the protocols did not provide robust justifications for the selection of individual PRO measures. For the remaining 5/87 (6%) PRO measures, we rated the quality of the documented rationale as 'Partial' (score = 1).
The description of the PRO measures in the protocols assessed was often not provided, or not specific, providing limited or no information to justify their use in the trials. For example, the CASTING protocol (NCT02861014) used the Treatment Satisfaction Questionnaire for Medication (TSQM II). When assessing whether the protocols clarified the variables for measurement, and why the variables were being measured, the protocol states that "TSQM II was used to characterize patient satisfaction with treatment." When considering whether the protocols explained specifically why the chosen PRO measure was selected from those considered, the protocol states "TSQM has been validated using a national panel study of chronic disease." Neither statement adequately answers the questions we have posed.

Was an outcome measurement strategy documented?
Table 3 shows that while 0/21 clinical trial protocols provided an overall outcome measurement strategy or PRO measure selection strategy, some information or reasoning was provided in one study. ARTIOS (NCT04353492) provided the statement: "PRO measures are included in the study to provide an empirical assessment from the subject's perspective of the benefits of treatment that cannot be gained from magnetic resonance imaging (MRI), Expanded Disability Status Scale (EDSS), or relapse measurement." Whilst this is a general justification for using PRO measures, it does not provide the foundations of a PRO measure selection strategy as there is no information as to why specific concepts were chosen for measurement, nor why specific instruments were chosen to measure those concepts. There are many different clinical concepts which cannot be gained from MRI, EDSS, or relapse measurement, and those concepts can be measured using many different methods, some of which are PRO measures of which there are likely many options. In many cases it could be assumed that the PRO measure was added to support future cost-effectiveness, or other economic or healthcare decision-making research. These points underscore the need for, and importance of, clarification of PRO measure selection.

Was it clear which variables were being measured and why?
Table 3 shows that none of the protocols clearly identified which clinical variables were being measured and why they had been chosen. Though it seems logical that the starting point for choosing a measurement instrument is a clarification of what to measure, this was not stated in any of the protocols we reviewed. Table 5 lists the PRO measures used in the trials we assessed, the documented concepts they purport to measure (from the original instrument development publications), and the number of trials in which they were used. In almost all cases, there is a description of the concept/s the instrument is purported to measure, but no clear definition of the concepts measured, or the aspects of the concepts measured. For example, Patient Reported Indices in Multiple Sclerosis (PRIMUS) assesses "MS symptoms, activities and quality of life". It is unclear which of the many MS-related symptoms, activities and aspects of life quality are assessed and why. Without specific information it is difficult, if not impossible, to justify the suitability of the PRO measure and its comparison with others.
Excluding FLOODLIGHT and telephone interviews, which are not individual PRO measures per se, a total of 32 different PRO measures were used in the 21 trial protocols assessed. Columbia Suicide Severity Rating Scale (C-SSRS; 10/21 protocols), Multiple Sclerosis Impact Scale (MSIS-29; 9/21 protocols) and EQ-5D (7/21 protocols) were the most used PRO measures in the assessed trials. It is not surprising that the C-SSRS is the most widely used PRO measure, as the FDA mandates that suicide risk is measured and describes the C-SSRS as the "gold standard". Table 5 shows the different versions of the same PRO measures grouped together; these are C-SSRS/eC-SSRS, EQ-5D-3L/EQ-5D-5L, 36-item short form health survey (SF-36/SF-36v2), TSQM II/TSQM v1.4/TSQM-9, and Work Productivity and Activity Impairment questionnaire (WPAI/WPAI:MS).

Were the clinical variables intended for measurement defined clearly?
Table 3 shows that 0/21 protocols clearly defined the clinical variables intended for measurement. Table 6 provides exemplars of some of the statements provided in protocols. None of these are clear or partial definitions (i.e., 'Good' [2] or 'Partial' [1]) of the variables intended for measurement. Rather, they are simply statements of fact or assumption. For example, the RAM-MS trial protocol (NCT03477500) reports "The Fatigue Severity Scale (FSS) is designed to differentiate fatigue from clinical depression, since both share same symptoms."

Table 3 (excerpt)
Population: RMS. Treatment: ofatumumab. Design: open-label, rater-blind, randomised, multicentre, parallel-arm, active-comparator.
PRO measures: MHI-5, MSIS-29, MSTCQ, Social life and activities impact, TSQM v1.4, Work productivity questionnaire.
Ratings: 'None' (0) for every PRO measure on each criterion shown, including "Was there a clear explanation why specific PRO measures were chosen from available candidates?" and "Were the measurement trade-offs associated with the choice of PRO measure documented?"
*Whether or not there was any basis, logic, reasoning, rationale, justification, motivation, or any other explanation or account, provided in the associated publication as to why the particular PRO measure was selected for use.
The information provided in the trial protocols has been rated as 'None' (0), 'Partial' (1) or 'Good' (2). Where no information was available, this was categorised as 'None' (0). 'Partial' (1) indicates that a minimal level of information was provided, and 'Good' (2) indicates that a comprehensive level of information was provided.

Was it clear which PRO measures were considered for inclusion?
Table 3 shows that 0/21 protocols provided a clear overview of which instruments were considered for measuring the concept (the specific goal of measurement) of interest. Several PRO measure descriptions merely provided a statement that measures were used in accordance with regulatory guidelines, but no rationales were provided based on theoretical/conceptual reasons as to why the chosen instrument was most suitable. Several protocols justified the use of efficacy and safety endpoints, but this was lacking for the PRO endpoints, demonstrating a disparity in the focus placed on the sets of outcomes.

Was there a clear explanation why specific PRO measures were chosen from available candidates?
Table 3 shows that for 81/87 (93%) of assessed PROs, no clear explanation was provided as to why these specific PRO measures were chosen from those available. In several protocols using MSIS-29, no explicit justifications were given as to why this PRO measure was selected above others. The protocols state that MSIS-29 "is considered a reliable, valid and responsive PRO measure that complements other indicators of disease severity used to improve our understanding of the impact of MS." These examples demonstrate that the rationale for PRO measure selection was not included and could be described in more detail to clarify which concepts related to "disease severity" are measured by the MSIS-29, and justify whether or not those concepts are useful and are well measured in the specific context of use. Tables 3, 7 and 8 show six PRO measures (6/87; 7% of the total PROs assessed) from five protocols where we rated the information documented as 'Partial' (1) justifications for PRO measure selection; however, these 'Partial' justifications do not provide clarification as to why these individual PRO measures were chosen over other available PRO measures. Of these, perhaps the best information was given for the selection of the Patient Preference Questionnaire from OPTIMUM (NCT02425644) which aimed to "capture patient preferences for selected treatment outcomes for use as an additional input to healthcare decisions. An increased understanding of individual values and preferences is the basis for shared decision-making, which in turn encourages patient compliance and health outcomes". However, this statement leaves many relevant questions unanswered.
The only document to clarify why a specific PRO measure was preferred over those available is the development publication for the Fatigue Symptoms and Impacts Questionnaire - Relapsing Multiple Sclerosis (FSIQ-RMS), listed in Table 5. In this PRO measure's development publication, the authors state: "Although available PRO instruments have been used to measure fatigue in MS patients, review of their measurement properties suggests shortcomings in terms of current standards for PRO instrument development. For instance, the 9-item Fatigue Severity Scale (FSS) and the 21-item Modified Fatigue Impact Scale (MFIS) do not fit the assumption of unidimensionality, and so studies using their global scores may need to be re-evaluated" (Hudgens et al., 2019). Whilst this justification seems reasonable, only two of many fatigue PRO measures are mentioned, and no steps were initiated or head-to-head comparisons reported to show the new instrument's superiority in their context of use.
Table 7 shows the justifications provided by OPTIMUM (NCT02425644) and POINT (NCT02907177) for using the FSIQ-RMS, which is that the development of the FSIQ-RMS was in accordance with FDA requirements (FDA et al., 2009). However, this is a statement about the development of FSIQ-RMS, from the development paper, rather than an objective critique of the suitability of the FSIQ-RMS. There is no definition of fatigue, no conceptualization of how the active treatment might influence fatigue, and no explanation or empiric evidence of why the FSIQ-RMS was considered preferable or superior to competing fatigue PRO measures which also purport to provide conceptually strong, reliable and valid fatigue measurement.

Were the measurement trade-offs associated with the choice of PRO measures explained?
Only the OPTIMUM (NCT02425644) and POINT (NCT02907177) studies provided information about comparison instruments, via the FSIQ-RMS development publication (Hudgens et al., 2019). This was to justify FSIQ-RMS's development. However, there was no consideration of the trade-offs of using the FSIQ-RMS above other fatigue PRO measures, which were deemed (assumed) to be inappropriate.

Primary publications
When clinical trial protocols were unavailable, we retrieved the primary publications associated with these trials to assess whether a PRO measure selection strategy was described. Fifteen of the 31 primary publications (48%) mention PRO measure use. Table 4 shows these 15 trials, the PRO measures used, and selection information documented. None of the primary publications provided a rationale for why the PRO measures had been selected.
We recognize journal articles have tight word limits and that these details might be sacrificed. Also, primary publications may not be the most appropriate platform for discussing PRO measure selection strategies. However, this should be considered when interpreting data derived from these publications. For these reasons, we also assessed the secondary publications of five randomly chosen trials (ACAPELLA, ASCEND, TEMSO, TOPIC, TOWER). No information additional to that reported below was found.

Was an outcome measurement strategy documented?
Table 4 shows that none of the primary publications documented a PRO measure selection strategy.
The EVOLVE-MS-II primary publication provided the most information. The head-to-head study evaluated the gastrointestinal (GI) tolerability of diroximel fumarate (DRF) versus dimethyl fumarate (DMF) in adult patients with RMS (NCT03093324; Naismith et al., 2020). The authors provide good information for the potential suitability of the Individual GI Symptom and Impact Scale (IGISIS) and Global GI Symptom and Impact Scale (GGISIS). Specifically, the IGISIS assesses "the incidence, intensity, onset, duration, and functional impact of five key individual GI symptoms: nausea, vomiting, upper abdominal pain, lower abdominal pain, and diarrhea… In the DMF pivotal DEFINE/CONFIRM trials, these specific GI symptoms were among the most commonly reported adverse effects (AEs) and were the most common GI AEs leading to treatment discontinuation … The GGISIS is designed to assess the overall intensity of five GI symptoms (nausea, vomiting, upper abdominal pain, lower abdominal pain, and diarrhea) experienced during the previous 24 h, the level of interference and functional impact on work and daily activities, and how bothersome GI symptoms were for patients." Nevertheless, there was no discussion of alternative PRO measures, and it would seem logical to pilot the IGISIS and GGISIS, which were adaptations of existing PRO measures, in a relevant sample of people with MS before using in a phase III clinical trial, particularly one evaluating gastrointestinal tolerability.

Was it clear which variables were being measured and why?
Table 4 shows that none of the primary publications provided clear explanations of which variables were being measured. Several of these publications stated the PRO measure used but did not provide any details of the variable being measured. The EVOLVE-MS-II primary publication provided partial information, as described above.

Were the clinical variables intended for measurement defined clearly?
Table 4 shows that none of the primary publications provided explicit definitions of the variables for measurement. The EVOLVE-MS-II primary publication provided partial information, as described above.

Table 4
Summary of the PRO measures and their selection process documented in the primary publications of clinical trials when no trial protocol was acquired.
Ratings recovered: TSQM, 'None' (0) on each criterion shown.
Primary publications were retrieved where clinical trial protocols were unavailable. PRO measurement strategy as defined by our PRO measure selection analysis set out in Table 2.
*Whether or not there was any basis, logic, reasoning, rationale, justification, motivation, or any other explanation or account, provided in the associated publication as to why the particular PRO measure was selected for use.
**PRO measure mentioned in ClinicalTrials.gov study summary but not in primary publication.
The PRO information provided in the primary publications has been rated as 'None' (0), 'Partial' (1) or 'Good' (2). Where no information was available, this was categorised as 'None' (0). 'Partial' (1) indicates that a minimal level of information was provided, and 'Good' (2) indicates that a comprehensive level of information was provided.
Abbreviations: ABILHAND semi-structured item-response questionnaire that measures manual ability according to an individual's perceived difficulty performing daily bimanual tasks, C-SSRS Columbia Suicide Severity Rating Scale, DMT disease-modifying treatment, EQ-5D EuroQol Group health status measure (3-level version), EQ-5D-5L EuroQol Group health status measure 5-level version, EQ-VAS EuroQol Group health status measure visual analogue scale, FIS Fatigue Impact Scale, FLS-S Flu-Like Symptoms Score, GGISIS Global GI Symptom and Impact Scale.

Table 5 (excerpt)
Telephone interviews*. Concept measured: no specific information for broad description of PRO instrument. Number of trials: 5.
TFQ. Concept measured: "…a questionnaire to provide a structured approach to evaluate patients' experience of clinical trial participation…Assessing the clinical trial experience from the patient perspective using a robust questionnaire may offer potential to improve trial design and ensure subjects stay engaged throughout the trial process." Number of trials: 1.
Note: *FLOODLIGHT consists of multiple instruments and is not a single PRO measure. Telephone interviews are not considered as PRO measures. Both FLOODLIGHT and telephone interviews have been included in the table as patient-reported assessment methods. Quotes for the concepts measured are taken from the original instruments' development publications. The link to each instrument's development publication is beneath each quote.

Was it clear which PRO measures were considered for inclusion?
Table 4 shows that while the primary publications reported the PRO measure used, none mentioned whether other PRO measures were considered. The EVOLVE-MS-II primary publication provided partial information, as described above.

Was there a clear explanation why specific PRO measures were chosen from available candidates?
Table 4 shows that none of the primary publications documented why individual PRO measures were chosen above others.

Table 6
Exemplars of some of the statements provided in protocols.

Were the measurement trade-offs associated with the choice of PRO measures explained?
Table 4 shows that none of the primary publications documented the trade-offs associated with their choice of PRO measures. This is not a surprise as none documented clear definitions of the concepts for measurement, nor justifications why the chosen PRO measures were selected above potential alternatives.

Discussion
None of the pivotal phase III MS clinical trials reviewed provided explicit rationales or justifications underpinning PRO measure selection. Trials providing limited reasoning tended to base PRO measure selection on previous trials. Measured concepts were not clearly defined. Explanations why specific concepts were chosen were limited. Explicit rationales and justifications may have underpinned PRO measure selection in these studies, but the information is not documented. PRO measure selection processes may have been more rigorous than our results imply, but more implicit and tacit than explicit. We suspect this is unlikely; our recent study showed fatigue PRO measure use was not related to PRO measure development quality (Close et al., 2023).
Several reasons may explain this situation. First, there is no obligation to provide these data in clinical trial protocols or publications. Second, there is no specific guidance on PRO measure selection in the many PRO recommendations that exist (Butcher et al., 2020; Calvert et al., 2013a, 2013b, 2021; Close et al., 2023; Patrick et al., 2007; Rothman et al., 2009; US FDA, 2009, 2018, 2022a; https://www.healthmeasures.net/index.php), nor at the common data elements site sponsored by NINDS (www.commondataelements.ninds.nih.gov). Whilst these guidelines and resources have evolved over time and provide useful information and a starting point, they focus on aspects of PRO measure performance, development requirements and reporting. None provide clinical trialists with explicit guidance on PRO measure selection. Supplementary Fig. S1 shows FDA's roadmap to patient-focused outcomes measurement in clinical trials. This rarely cited diagram includes highly relevant information for PRO measure selection, but in our opinion provides neither explicit guidance nor enough detail (FDA, 2022b).
A third reason why limited PRO measure selection guidance exists may be a general under-recognition of measurement issues associated with PRO measures. Specifically, there may be an under-appreciation of the impact of different PRO measure development qualities, structures (items and number of item response categories and content), performance characteristics (range, precision, error) and context-dependent factors (score distributions, response dependence, differential item functioning). Also, for these reasons, there is a misplaced over-interpretation of the statement "reliable and valid measure of…". These under-recognitions and over-interpretations may reflect the limited availability of comprehensive head-to-head comparisons, post hoc examinations in clinical trial data, and appropriately critical PRO appraisals. As such, some clinicians are unfamiliar with PRO measurement issues and see measurement science research as perplexing. Consequently, valuable fundamental research is published in journals that clinicians are less likely to access (Stenner et al., 1983), and written in less clinician-accessible language (Andrich, 2011).
A fourth reason is that some researchers prefer to use the same PRO measures across studies for comparability between treatment options. This is inadvisable given that, as discussed above, measures developed for one context cannot be assumed valid and reliable for another. It seems that this is of little concern to some economic researchers (Brazier et al., 2023; Perfetto et al., 2023). Related to this fourth reason is the mandated use of PRO measures. For example, the FDA mandates that suicide risk is measured. However, it is important to note the FDA does not mandate the specific PRO measure used. It mandates using a method meeting their reporting requirements. The FDA describes the C-SSRS as the 'gold standard'. However, the suitability of the C-SSRS in any context needs to be considered along with any other PRO measure. We find it hard to believe the FDA would not accept a reasoned evidence-based argument as to why another suicide risk PRO measure was chosen. Hence, we recommend a PRO measure selection process.
With this recognition, it is evident that this current situation should be rectified, as the negative ramifications of weak PRO measurement are too important. We recommend clinical trialists employ strategic and formal approaches to PRO measure selection, to maximize the possibility that trial results approximate real clinical effects, and document their approach in trial protocols and pivotal publications, albeit as appendices and supplementary information. Indeed, we strongly recommend these be made regulatory and scientific requirements. A more critical academic appraisal of PRO studies is needed so that clinicians are more familiar with the pitfalls, and the weaknesses of the field are exposed. Without these academic and regulatory efforts, developments in the quality of the health measurement field will continue to be slow and fragmented.

A total of 21 clinical trial protocols were assessed, with 87 PRO measures described. Refer to the scoring system in Table 2 for definitions of 'None' (0), 'Partial' (1) and 'Good' (2). *Protocols documenting a 'Partial' (1) explanation why specific PRO measures were chosen from the available candidates (PRO measures in brackets): ASCLEPIOS I/II (C-SSRS), EXPAND (C-SSRS), OPTIMUM (FSIQ-RMS and Patient Preference Questionnaire), POINT (FSIQ-RMS).

PRO measure selection strategy recommendation
Logically, a PRO measure selection strategy has five stages. First, clarify and justify the specific concepts for measurement, the PROs, within the specific context of use. Second, identify the pool of PRO measures from which to choose. Third, shortlist a set of candidate PRO measures for more detailed examination, based on published information of their development. Fourth, compare the performance characteristics of the short-listed PRO measures, head-to-head, in a suitable sample. Fifth, synthesize the information and make a reasoned decision. This logical, five-stage process provides the evidence required to select the best PRO measure for the objective at hand, and, if necessary, the platform for modification or new measure development.
Stage 1, the clarification of the specific concepts of interest and context of use, heavily underpins the subsequent stages. A meaningful search for PRO measures cannot be conducted until trialists have a clear understanding of the concepts they intend to measure and the context in which measurement will be conducted. Context of use should also include explicitly stated study hypotheses. All too frequently trialists cite ambiguous umbrella terms, such as quality-of-life, health status, wellbeing, disability, and functioning. These terms do not enable an evaluation of the suitability of a PRO measure or its item content. We recommend clinical trialists define their concepts of interest very explicitly, and conduct qualitative research to understand the concepts that are important to patients.
Clinical trialists are very familiar with their contexts of use, especially the expected sample characteristics, disease natural history, and hypothesized treatment effects. However, we suggest trialists are less familiar with the PRO measurement implications associated with their contexts of use. A simple exemplar is the skewed MSWS-12v2 baseline score distribution in the EXPAND study (siponimod versus placebo in people with SPMS), a context in which the sample contains more people with walking disability and progressive disease, and the treatment is hypothesized to have an anti-progressive effect (Hobart et al., 2022). Conceptually, participant walking ability was expected (and shown) to worsen over time. Therefore, the MSWS-12v2 score distribution skewness was expected (and shown) to worsen over time, resulting in increasing proportions of EXPAND participants located in the upper quartile (i.e., worse walking ability) of the MSWS-12v2 score range, where the scale's ability to detect change is constrained by its fixed measurement range (Hobart et al., 2022). Consequently, EXPAND's MSWS-12v2-measured walking ability changes, and treatment group differences, were almost certainly underestimates of 'true' walking ability changes. Had EXPAND used a PRO measure better targeted for decline in walking ability, all other measurement issues being equal, the magnitude of the treatment effect (i.e., a slowing of walking disability progression) would have been better detected. This does not mean the MSWS-12v2 is a bad measure, just that in the EXPAND study context of use the MSWS-12v2 likely underestimated the treatment effects, leading to type II error (i.e., a false negative). This could have been anticipated and mitigated. This exemplar shows that scientifically solid PRO measures have context-dependent limitations with notable implications.
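The targeting problem described above can be illustrated with a small simulation. The sketch below is hypothetical and does not use EXPAND data or the MSWS-12v2 scoring algorithm: it assumes a generic 0-100 scale on which higher scores mean worse walking ability, a baseline distribution skewed toward the upper end of the range, and fixed latent worsening of 10 points on placebo and 6 points on treatment. All names and numbers are assumptions chosen purely for illustration.

```python
import random
from statistics import mean

SCALE_MAX = 100.0  # assumed upper bound of a hypothetical 0-100 scale (higher = worse)


def observed_change(baseline, latent_change, scale_max=SCALE_MAX):
    """Change the scale can register when scores are capped at scale_max."""
    return min(baseline + latent_change, scale_max) - min(baseline, scale_max)


def simulate(n=1000, placebo_change=10.0, treated_change=6.0, seed=0):
    """Return (latent effect, observed effect) for a capped, skewed scale."""
    rng = random.Random(seed)
    # Baseline scores skewed toward the upper (worse) end of the range.
    baselines = [SCALE_MAX * rng.betavariate(5, 2) for _ in range(n)]
    # Both arms share the same baselines, so any gap is due to the ceiling alone.
    placebo = mean(observed_change(b, placebo_change) for b in baselines)
    treated = mean(observed_change(b, treated_change) for b in baselines)
    true_effect = placebo_change - treated_change
    observed_effect = placebo - treated
    return true_effect, observed_effect


true_effect, observed_effect = simulate()
print(f"latent effect: {true_effect:.2f}, observed effect: {observed_effect:.2f}")
```

Under these assumptions the observed between-group difference is smaller than the latent difference, because participants near the top of the range cannot register their full worsening, and this loss is larger in the arm that worsens more.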
In Stage 2 we recommend clinical trialists search for all potential PRO measures purporting to measure their concept of interest. For each PRO measure, trialists should evaluate its development and its item content: two related but different evaluations. PRO development differs in method and quality. It can be evaluated against guidance (Close et al., 2023; FDA et al., 2009). In essence, key issues are the definition of the concepts measured, the strength of the conceptual underpinnings, the method of item generation, the nature and extent of patient involvement, and how the final item set was achieved. Patient involvement in item development through qualitative methods is crucial for content validity. It is difficult to see how a PRO measure developed without patient involvement could be considered content valid, unless strongly supported by post hoc qualitative research in patients. Unfortunately, for many PRO measures this information is not well documented.
Guidelines for PRO measure development provide useful frameworks for evaluation. However, in our opinion, they miss a key step: the articulation of a measurement concept as a set of items. We have coined the term "item set analysis", discussed in detail elsewhere (Close et al., 2023). In brief, there should be an explicit link between the overall concept, domains, subdomains, items and scores generated. This requires clear definitions of concepts, domains and subdomains, and clarity of how and why the items that are combined to form scores adequately represent those subdomains. When this is the case, the link from score through item to subdomains and concept is explicit. This link is rarely clear. Even when there is a conceptual framework, there seems to be a disconnect between the components of the framework and the item sets that generate scores. We recommend that careful and greater consideration is given to the item content of subdomains.
Stage 3 is short-listing. When PRO measures have been extracted, their development papers reviewed and critiqued, their development process evaluated and their item content considered, a short-list of suitable candidates can be identified. One way of formalizing shortlisting is to use COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) scoring, and select the highest scoring candidates. Although useful, we identified limitations in COSMIN's rating process (Close et al., 2023). These concern, specifically, what constitutes adequacy for PRO definitions, conceptualisations and qualitative work. Perhaps the most important limitation is the absence of the item set analysis discussed above: the degree to which a set of items generating a score maps a variable.
Stage 4 is a head-to-head comparison of short-listed PRO measures in a sample representative of the context of use. This enables a comparison of measurement properties. We recommend using modern psychometric methods, Rasch measurement theory (RMT) or item response theory (IRT), rather than traditional methods based on classical test theory (CTT). RMT is our preference. It provides a hypothesis test against which to examine observed PRO measure data, enables the diagnosis of measurement weaknesses, and provides a strong platform for measurement improvement.
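For readers less familiar with RMT, the simplest (dichotomous) Rasch model expresses the probability of affirming an item as a logistic function of the difference between a person's location and the item's location, both in logits. The sketch below is illustrative only: the function names and example values are assumptions, and PRO measures with more than two response categories require polytomous extensions that are not shown here.

```python
import math


def rasch_probability(theta, b):
    """Dichotomous Rasch model: P(X = 1) for person location theta and item location b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Statistical information the item provides at theta: p * (1 - p)."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)


# Hypothetical item located at 0 logits; persons at increasing distances from it.
for theta in (0.0, 1.0, 3.0):
    p = rasch_probability(theta, 0.0)
    info = item_information(theta, 0.0)
    print(f"theta={theta:+.1f}  P={p:.3f}  information={info:.3f}")
```

Information is greatest when person and item locations coincide and falls away as they diverge, which is one way of expressing why a measure that is poorly targeted to a sample measures that sample less precisely.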
Stage 5 synthesises all the information from Stages 1 to 4 to reach a rational, evidence-based decision. We anticipate, in many circumstances, that this will not be straightforward and there will be trade-offs. However, the process will lead to consideration of the problem and opportunities for better measurement.
Our recommendations may seem labor intensive. However, we think this would be a misinterpretation. Several areas of coordinated research could provide the MS community with a body of work to underpin PRO measure selection in clinical trials. One example is publicly available repositories of PRO measures purporting to measure variables, with cataloging of concept definitions, conceptualisations, and items. Another is targeted, empiric, head-to-head comparisons in samples pertinent to MS trials (e.g., RMS, primary progressive MS [PPMS], SPMS, advanced MS), which would provide an evidence base of relative performance and trade-offs.
Currently, PRO measurement issues appear to be secondary considerations. This parallels the history of medical statistics, which used to be an afterthought but is now integral to the initial stages of clinical trial design. The same is required of a measurement strategy. It is also important not to conflate measurement methods, which concern the generation of measurements, with statistical analysis, which involves the analysis of measurements. As such, measurement methodological issues come before statistical ones. Without high quality PRO measurement, type II errors will pervade our trials. The implications for MS care, MS science development and individual patients are far too great for these identified issues not to be considered seriously.

Conclusions
PRO measure selection in multi-million-dollar pivotal MS clinical trials that dictate patient care, drug licensing and label claims currently lacks evidence. We believe this is also a common problem in clinical trials in other therapy areas. Widespread type II error from clinical trials can be avoided in future by adopting a robust PRO measure selection strategy. Widespread recognition of this issue and subsequent evaluation, critique, and documentation are required to optimize PRO measurement and its utility in pivotal clinical trials.

Role of funding source
Medical writing support for the review was funded by Novartis Pharma AG.

Declaration of Competing Interest
Jeremy Hobart has received consulting fees, honoraria, support to attend meetings or research support from Acorda, Asubio, Bayer Schering, Biogen Idec, F. Hoffmann-La Roche, Genzyme, Merck Serono, Novartis, Oxford PharmaGenesis and Teva. Disclosures do not show a conflict with the work being presented. Tanuja Chitnis has received compensation for consulting from Biogen, Novartis Pharmaceuticals, Roche Genentech, and Sanofi Genzyme. She has received research support from Brainstorm Cell Therapeutics, EMD Serono, I-Mab Biopharma, Mallinckrodt ARD, the National Institutes of Health, National MS Society, Novartis Pharmaceuticals, Octave Bioscience, Roche Genentech, Sumaira Foundation, Tiziana Life Sciences, and US Department of Defense. Disclosures do not conflict with the work being presented. Jiwon Oh has received research support from Biogen-Idec, Eli-Lilly, EMD-Serono, and Roche; and fees for consulting or speaking from Biogen-Idec, Bristol Myers Squibb, EMD-Serono, Novartis, Roche, and Sanofi-Genzyme. Laurie Burke has past and ongoing research support and contracts from various non-profit organisations and for-profit companies that do not conflict with this work. Andrew Lloyd works for and holds stock in Acaster Lloyd Consulting Ltd which has received fees from Novartis. Disclosures do not show a conflict with the work being presented. Pamela Vo is an employee of Novartis Pharma AG. Miriam King and Jo Vandercappellen were employees of Novartis Pharma AG during the analysis of this study and manuscript development.

Fig. 1. Flow diagram of literature search process and articles meeting inclusion criteria. *Conference abstracts were retrieved for the unavailable primary papers and the clinical trial protocols with redacted content. None mentioned the use of a PRO measure. DMT disease-modifying treatment, MS multiple sclerosis, PRO patient-reported outcome.

Table 2
Analysis of PRO measure selection process and scoring system used.
'None' (0): No description of the tradeoffs with this instrument is provided.

Table 3
Summary of the PRO measures and their selection process documented in the clinical trial protocols.

Abbreviations: CAREQOL-MS Caregiver Health-Related Quality-of-Life in Multiple Sclerosis, CES-D Center for Epidemiologic Studies Depression Scale, CGI-I Clinical Global Impression of Improvement Scale, C-SSRS Columbia Suicide Severity Rating Scale, DMT disease-modifying treatment, eC-SSRS Electronic self-rated version of the Columbia-Suicide Severity Rating Scale, EQ-5D EuroQol Group health status measure (3-level version), FIS Fatigue Impact Scale, FLOODLIGHT Smartphone-based remote tracking device, FSIQ-RMS Fatigue Symptoms and Impacts Questionnaire - Relapsing Multiple Sclerosis, FSMC Fatigue Scale for Motor and Cognitive Functions, FSS Fatigue Severity Scale, HADS Hospital Anxiety and Depression Scale, MFIS Modified Fatigue Impact Scale, MHI-5 Mental Health Inventory - 5 Item, MSIS-29 Multiple Sclerosis Impact Scale, MSQOL-54 Multiple Sclerosis Quality-of-Life (54-item instrument), MSTCQ Multiple Sclerosis Treatment Concerns Questionnaire, MSWS-12 Multiple Sclerosis Walking Scale 12, Peds-QL Pediatric Quality-of-Life Inventory, PGIC Patient Global Impression of Change, PGI-S Patient's Global Impression of Severity of Fatigue, PMS Progressive multiple sclerosis, PPMS Primary progressive multiple sclerosis, PRIMUS Patient Reported Indices in Multiple Sclerosis, PRO patient-reported outcome, RMS Relapsing multiple sclerosis, RRMS Relapsing-remitting multiple sclerosis, SATMED-Q The Treatment Satisfaction with Medicines Questionnaire, SF-12 12-Question health questionnaire, SF-36 36-item generic health status measure, SF-36v2 36-item generic health status measure version 2, SPMS Secondary progressive multiple sclerosis, TFQ Trial Feedback Questionnaire, TSQM II Treatment Satisfaction Questionnaire for Medication II, TSQM-1.4 Treatment Satisfaction Questionnaire for Medication Version 1.4, TSQM-9 Treatment Satisfaction Questionnaire for Medication (9-items), WPAI:MS Work Productivity and Activity Impairment questionnaire for multiple sclerosis.

Table 4
(continued) Gastrointestinal Symptom and Impact Scale, HADS Hospital Anxiety and Depression Scale, IGISIS Individual Gastrointestinal Symptom and Impact Scale, mFIS modified Fatigue Impact Scale, MSIS-29 Multiple Sclerosis Impact Scale, MSQOL-54 Multiple Sclerosis Quality-of-Life (54-item instrument), MSWS-12 Multiple Sclerosis Walking Scale, PDDS Patient-Determined Disease Steps, PPMS Primary progressive multiple sclerosis, PRIMUS Patient Reported Indices in Multiple Sclerosis, PRO patient-reported outcome, RMS Relapsing multiple sclerosis, RRMS Relapsing-remitting multiple sclerosis, SF-12 12-item short-form health survey, SF-36 36-item generic health status measure, SPMS Secondary progressive multiple sclerosis, TSQM Treatment Satisfaction Questionnaire for Medication, U-FIS Unidimensional Fatigue Impact Scale, WPAI Work Productivity and Activity Impairment questionnaire.

Table 5
Patient-reported assessment methods and the concepts they purport to measure.
"…captures reliable and clinically relevant measures of functional impairment in MS…assessed the functional ability across three key domains affected by MS: cognition, upper extremity function, and gait and balance." Adherence and Satisfaction of Smartphone- and Smartwatch-Based Remote Active Testing and Passive Monitoring in People With Multiple Sclerosis: Nonrandomized Interventional Feasibility Study - PMC (nih.gov); A smartphone sensor-based digital outcome assessment of multiple sclerosis - PubMed (nih.gov) 1
FSIQ-RMS "…assess fatigue symptoms relevant to patients within the spectrum of RMS and the relevant impact of these symptoms on patients' lives, in accordance with the FDA PRO guidance." Development and Validation of the FSIQ-RMS: A New Patient-Reported Questionnaire to Assess Symptoms and Impacts of Fatigue in Relapsing Multiple Sclerosis - PubMed (nih.gov)
"…self-report measure of HRQOL for MS that combines the strengths of generic and disease-targeted approaches to HRQOL measurement…assesses HRQOL for individuals with a chronic neurological condition, multiple sclerosis, using 54 items that define 12 multiple-item scales." A health-related quality of life measure for multiple sclerosis - PubMed (nih.gov) 4

Sclerosis Treatment Concerns Questionnaire (MSTCQ) and pain measures after introduction of the new Rebiject II™ injection system would allow determination of changes perceived by patients…assessed patient perceptions of the multiple domains associated with use of an injection device for IFN-β-1a." Patient satisfaction with an injection device for multiple sclerosis treatment - PubMed (nih.gov)
"…designed to identify and assess symptoms of fatigue with both reliability and validity for use in clinical practice and research." The Clinical Global Impressions Scale - PMC (nih.gov); ECDEU assessment manual for psychopharmacology (1976 edition) | Open Library; Development of a clinical global impression scale for fatigue - PubMed (nih.gov) 1
PRIMUS "…assess MS symptoms, activities, and quality of life…aid the assessment of the impact of MS from the patient's perspective. The opportunity was also taken to generate scales of symptoms (impairment) and activity limitations that could be used as summary measures in clinical studies." The development of patient-reported outcome indices for multiple sclerosis (PRIMUS) - PubMed (nih.

"…a general measure of patients' satisfaction with medication…a psychometrically sound and valid measure of the major dimensions of patients' satisfaction with medication…may also be a good predictor of patients' medication adherence across different types of medication and patient populations…provides a way of evaluating and comparing patients' satisfaction with various types and forms of medications." Validation of a general measure of treatment satisfaction, the Treatment Satisfaction Questionnaire for Medication (TSQM), using a national panel study of chronic disease - PubMed (nih.gov) 6
WPAI and WPAI:MS "…WPAI measures of time missed from work, impairment of work and regular activities due to overall health and symptoms…The Work Productivity and Activity Impairment (WPAI) questionnaire elicited the number of days and hours missed from work, days and hours worked, days during which performing work was difficult and the extent to which the individual was limited at work (work impairment) during the past 7 days. The extent of work loss and impairment, attributable to both poor health and the symptom or problem specific by the respondent, was elicited." The validity and reproducibility of a work productivity and activity impairment instrument - PubMed (nih.gov) "To characterize work productivity in relapsing multiple sclerosis (MS) by using a work productivity scale and to identify associations between work productivity and disability, depression, fatigue, anxiety, cognition, and health-related quality of life." Work Productivity in Relapsing Multiple Sclerosis Associations with Disability, Depression, Fatigue, Anxiety, Cognition, and Health-Related Quality of Life (core.ac.uk) 6

Table 7
PRO measures given a score of 'Partial' for the quality of rationale for their selection in the clinical trial protocols.
"C-SSRS data mapped to Columbia Classification Algorithm for Suicide assessment (C-CASA) as per FDA guidance on suicidality."
EXPAND, C-SSRS*: "A validated version of the C-SSRS is used to capture self-reported C-SSRS data via an interactive voice response telephone system (eC-SSRS). The eC-SSRS uses a detailed branched logic algorithm to perform the C-SSRS patient interview, evaluating each patient's suicidality ideation and behavior in a consistent manner. The use of the eC-SSRS (or equivalent) to detect suicidal ideation or behavior is currently mandated in studies of CNS active drugs."
OPTIMUM, FSIQ-RMS: "The development of FSIQ-RMS is in accordance with the requirements set forth in the Final Guidance to the Industry on Subject Reported Outcomes: Use in Medical Product Development to Support Label Claims [FDA 2009a]."
OPTIMUM, Patient Preference Questionnaire: To "capture patient preferences for selected treatment outcomes for use as an additional input to healthcare decisions. An increased understanding of individual values and preferences is the basis for shared decision-making, which in turn encourages patient compliance and health outcomes."
POINT, FSIQ-RMS: "The development of FSIQ-RMS is in accordance with the requirements set forth in the Final Guidance to the Industry on Subject Reported Outcomes: Use in Medical Product Development to Support Label Claims [FDA 2009a]."
Abbreviations: CNS central nervous system, C-SSRS Columbia Suicide Severity Rating Scale, FSIQ-RMS Fatigue Symptoms and Impacts Questionnaire - Relapsing Multiple Sclerosis.
*The FDA mandates the measurement of suicide risk using a method that meets FDA reporting requirements. The FDA describes C-SSRS as the 'gold standard'. We recommend a PRO measure selection process.

Table 8
Assessment of PRO measure selection reported in clinical trial protocols.