Editorial
March 21, 2023
Toward Personalizing Care: Assessing Heterogeneity of Treatment Effects in Randomized Trials
Issa J. Dahabreh, Dhruv S. Kazi
JAMA. Published online March 21, 2023. doi:10.1001/jama.2023.3576
Clinicians know that individual patients may respond differently to a given treatment and that the overall treatment effect reported in a randomized trial of the treatment may not be directly applicable to all patients in clinical practice.1 Determining the treatment effect for an individual patient involves a comparison of the outcome when that patient is exposed to the treatment vs the outcome of the same patient exposed to a control treatment at the same time, a comparison impossible to make in conventional parallel-group trial designs. A practical alternative is to examine heterogeneity of (variation in) treatment effects across groups of patients, categorized by baseline demographic or clinical characteristics, such as age or risk factors for the outcome.2
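In potential-outcomes notation (an assumption of this sketch; these symbols do not appear in the Editorial), the comparison described above can be written as follows. Here $Y_i(1)$ and $Y_i(0)$ denote the outcome of patient $i$ under treatment and under control, and $X$ denotes baseline characteristics.

```latex
\begin{align*}
\tau_i  &= Y_i(1) - Y_i(0)
        && \text{(individual effect; only one of the two outcomes is ever observed)} \\
\tau(x) &= \mathrm{E}\left[\, Y(1) - Y(0) \mid X = x \,\right]
        && \text{(average effect among patients with } X = x\text{)}
\end{align*}
```

In this notation, heterogeneity of treatment effects means that $\tau(x)$ varies with $x$.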
In this issue of JAMA, Goligher and colleagues3 explore the ability of contemporary statistical techniques to detect heterogeneity of treatment effects using pooled data from 3 randomized platform trials assessing the effect of therapeutic-dose heparin on organ support–free days and all-cause mortality in patients hospitalized for COVID-19 in the early pandemic. They compare 3 approaches for identifying heterogeneity of treatment effects: (1) traditional one-variable-at-a-time subgroup analyses; (2) risk score analyses, in which patients are grouped by predicted risk of trial outcomes; and (3) effect score analyses, in which patients are grouped by predicted treatment effect. The 3 approaches yielded congruent results, suggesting that patients with a body mass index (BMI) of less than 30 and those with moderate-severity COVID-19 at presentation appeared to benefit, whereas those with a BMI of 30 or greater and severe COVID-19 at presentation did not benefit and may have been harmed. These findings highlight the need to evaluate heterogeneity of treatment effects in randomized trials: had the trials evaluated the effect of therapeutic heparin in patients hospitalized with COVID-19 without stratifying by disease severity, the overall treatment effect might have been close to null, obscuring signals of differential benefit and harm across clinically meaningful patient subgroups.
This Editorial will attempt to explain the rationale for Goligher and colleagues’ efforts, place them in broader methodological context, and offer suggestions for future assessments of heterogeneity in randomized trials.
Traditional subgroup analyses to examine heterogeneity of treatment effects are ubiquitous in the medical literature. Investigators group trial participants by clinical variables (eg, disease severity or BMI categories) and assess whether effects are heterogeneous across subgroups, one variable at a time (Goligher and colleagues’ first approach). As common as the approach is, it presents several challenges.4,5 Trials are typically statistically underpowered to detect differences between subgroups, so there is a high risk of false-negative findings. At the same time, performing multiple subgroup analyses increases the risk of false-positive findings. Although these challenges can be addressed by approaches such as rigorous prespecification of comparisons, multiplicity adjustments for hypothesis testing, and hierarchical modeling, a key practical limitation remains: one-variable-at-a-time subgroup analyses are difficult to use for clinical decision-making because each patient belongs to multiple subgroups and each subgroup may have a different magnitude and direction of treatment effect (eg, a patient can have a BMI of less than 30 and severe COVID-19).2 Thus, traditional one-variable-at-a-time subgroup analyses may be useful as exploratory or descriptive analyses, and may produce population-level insights, but multiple variables have to be jointly considered to generate clinically relevant assessments of heterogeneity of treatment effects.
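The false-positive concern can be made concrete with a small calculation. The sketch below assumes independent subgroup tests, each run at the same significance level (a simplification, because subgroup tests within one trial are usually correlated). The function names are illustrative and are not taken from any cited analysis.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false-positive finding across n_tests
    independent tests, each at significance level alpha, when no subgroup
    truly differs."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below
    alpha (Bonferroni adjustment, one simple multiplicity correction)."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(familywise_error_rate(0.05, k), 3), bonferroni_threshold(0.05, k))
```

Under these assumptions, 10 subgroup tests at the 0.05 level carry roughly a 40% chance of at least one spurious finding.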
One way to integrate information from multiple variables is to examine heterogeneity of treatment effects over the predicted risk of a trial outcome (Goligher and colleagues’ second approach).6 A well-calibrated risk model is used to integrate multiple variables into a single “risk score” variable that captures risk of the outcome without treatment, followed by an examination of whether treatment effects vary over the risk score.7 In practice, risk score analyses typically have 3 steps: first, a risk score (internally developed using the trial data or externally developed using independent data) is used to group trial participants by level of predicted risk; next, risk group–specific treatment effects are estimated; and, finally, the treatment effects are examined for heterogeneity. There are several advantages to this approach. Clinicians intuitively incorporate risk into clinical decision-making, validated risk scores are widely used in clinical practice, and risk is often correlated with treatment benefit. To the extent that the risk score captures variation in risk, it should be able to identify groups of patients who are unlikely to benefit from treatment as well as groups that have the potential to benefit. Furthermore, by reducing multiple variables to a single score, risk score approaches avoid the multiplicity issues of one-variable-at-a-time subgroup analyses. These attractive features may explain the increasing popularity of risk score analyses in randomized trials and the emphasis on such approaches in recent methodological recommendations.8
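The 3 steps can be sketched in a few lines. This is a minimal illustration, not the analysis used by Goligher and colleagues: it assumes each participant already has a risk score, a treatment indicator (1 = treatment, 0 = control), and a binary outcome (1 = event), and it forms equal-sized risk groups. The function name and record layout are hypothetical, and a formal test for heterogeneity is omitted.

```python
from statistics import mean


def risk_group_effects(records, n_groups=4):
    """records: iterable of (risk_score, treated, outcome) tuples.

    Step 1: order participants by risk score and cut into n_groups groups.
    Step 2: estimate the risk difference (treated minus control) per group.
    Step 3 (inspection for heterogeneity) is left to the caller.
    Returns one risk difference per group, lowest risk first; None where a
    group lacks treated or control participants.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    effects = []
    for g in range(n_groups):
        group = ordered[g * n // n_groups:(g + 1) * n // n_groups]
        treated = [y for _, t, y in group if t == 1]
        control = [y for _, t, y in group if t == 0]
        if treated and control:
            effects.append(mean(treated) - mean(control))
        else:
            effects.append(None)
    return effects
```

Comparing the returned risk differences across groups is the descriptive version of the final step; in practice the comparison would be accompanied by confidence intervals or a test.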
But risk score analyses may not fully capture heterogeneity of treatment effects because risk of an outcome in the absence of treatment may not strongly correlate with benefit or harm from treatment. For example, among patients hospitalized with COVID-19, a patient with a BMI of less than 30 and severe disease may have the same risk of in-hospital mortality in the absence of treatment as a patient with a BMI of 30 or greater and moderate disease severity, yet the benefits and harms of treatment may differ between the 2 individuals. To more fully capture heterogeneity, it may be preferable to focus heterogeneity analyses in randomized trials on differences in risk under different treatments and study variation in treatment effects not over predicted risk but over predicted treatment effect (as in Goligher and colleagues’ third approach).9 In practice, the approach can be operationalized similarly to risk score analyses as described above, replacing the risk score with an “effect score”: the difference in predicted risk of outcomes with treatment vs control.
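A small numeric illustration of the distinction follows. The predicted risks below are invented solely for illustration (they are not estimates from any trial), and the function and patient labels are hypothetical. The point is only that 2 patients with identical predicted risk under control can have effect scores of opposite sign.

```python
def effect_score(predicted_risk_treated, predicted_risk_control):
    """Difference in predicted risk of the outcome with treatment vs control.
    Negative values indicate predicted benefit when the outcome is harmful."""
    return predicted_risk_treated - predicted_risk_control


# Invented predictions: same risk under control, different risk under treatment.
patients = {
    "patient_a": {"control": 0.30, "treated": 0.24},
    "patient_b": {"control": 0.30, "treated": 0.33},
}

scores = {
    name: effect_score(p["treated"], p["control"])
    for name, p in patients.items()
}

if __name__ == "__main__":
    for name, score in scores.items():
        print(name, round(score, 2))
```

A risk score analysis would place both hypothetical patients in the same group, whereas an effect score analysis would separate them.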
Of course, the estimation of both risk and effect scores can be challenging, particularly when using internal trial data. Indeed, the term scores is used to highlight that these are imperfect proxy predictors of the outcome risk or treatment effect. The underlying risk or effect function can be better approximated using modern statistical methods, including machine-learning methods, that are more “flexible” than traditional regression approaches, in that they can more closely approximate the relation between outcomes and baseline variables. However, the ability of these methods to closely approximate the data increases the risk of “overfitting”—a phenomenon in which a model captures “noisy” aspects of the data that do not reflect true underlying relationships. The impact of overfitting can be controlled by sample splitting approaches (eg, using one part of the trial data to develop the model and another part to estimate treatment effects, possibly followed by reversing the roles of the 2 parts). Sample splitting approaches also support the valid quantification of uncertainty when constructing confidence intervals or conducting statistical tests.10-12
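The sample splitting idea, including the role reversal, can be sketched as follows. This is a toy illustration under strong assumptions, not the method of the cited work: the "model" is simply the per-stratum difference in event proportions for a single categorical baseline variable, whereas a real analysis would use a flexible learner and formal inference. All names are hypothetical. Records are (stratum, treated, outcome) tuples with outcome 1 = event.

```python
import random
from statistics import mean


def risk_difference(rows):
    """Event proportion in treated minus control; None if an arm is empty."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    if not treated or not control:
        return None
    return mean(treated) - mean(control)


def fit_effect_score(train):
    """Toy effect-score model: one estimated risk difference per stratum."""
    scores = {}
    for stratum in {x for x, _, _ in train}:
        rd = risk_difference([r for r in train if r[0] == stratum])
        if rd is not None:
            scores[stratum] = rd
    return scores


def cross_fit(data, seed=0):
    """Develop the score on one half, estimate effects on the other half,
    then reverse the roles of the 2 halves."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    folds = [shuffled[:half], shuffled[half:]]
    results = []
    for i in (0, 1):
        train, test = folds[i], folds[1 - i]
        scores = fit_effect_score(train)
        # Group held-out patients by the score learned on the other half.
        benefit = [r for r in test if scores.get(r[0], 0.0) < 0]
        no_benefit = [r for r in test if scores.get(r[0], 0.0) >= 0]
        results.append({
            "predicted_benefit": risk_difference(benefit),
            "predicted_no_benefit": risk_difference(no_benefit),
        })
    return results
```

Because each patient's group assignment comes from a score fitted without that patient's data, the group-specific effect estimates are not inflated by the score having already "seen" their outcomes.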
Goligher and colleagues3 followed all of these important steps in their effect score analysis, stratifying the study cohort by deciles of an effect score, obtained using machine-learning methods,13 for the effect of therapeutic-dose heparin vs prophylactic-dose heparin on in-hospital death. A key strength of their approach is that it produces valid assessments of heterogeneity of treatment effects even if the effect score is an imperfect estimate of the treatment effect.11 They found that the group in the lowest decile of the effect score may have been harmed by therapeutic-dose heparin (absolute risk reduction in hospital survival of –5.7%; 95% CI, –22.4% to 10.6%); in a post hoc analysis, the effect in this group was statistically significantly different from those in other groups. They also found that patients in this group tended to have high BMI and were more likely to require intensive care unit admission at baseline. Although direct comparison between their risk score and effect score approaches is difficult because of differences in the outcomes and effect measures used, the qualitative agreement across different approaches is reassuring. Their study demonstrates that modern state-of-the-science methods for assessing heterogeneity of treatment effects are feasible with high-quality, large-scale clinical data.
What are the implications of this careful study for future examinations of heterogeneity of treatment effects in randomized trials?
First, it is important to recognize that one-variable-at-a-time subgroup analyses, risk score analyses, and effect score analyses answer different questions. The first 2 approaches may offer some important insights (Do specific subgroups of patients benefit comparably? Can we improve equity based on risk of an adverse outcome?), but effect score analyses have the potential to more fully capture heterogeneity of treatment effects and may be more appropriate for personalizing patient care. In Goligher and colleagues’ analyses, the risk score and effect score analyses produced congruent findings—but this may not be true in every case. For now, it may be useful to conduct both risk score and effect score analyses to empirically compare their strengths and limitations. Over time, however, as investigators gain more practical experience with effect score approaches, we expect they will become the primary mode of exploring heterogeneity in randomized trials, including in prespecified analyses.
Second, heterogeneity assessments often require larger sample sizes than those needed for estimation of average effects;4,14 thus, the most informative analyses will use data from large trials and pooled harmonized data from multiple similar trials. In future large trials and pooled analyses of trials, risk score and effect score analyses deserve consideration as potential prespecified secondary analyses. When heterogeneity of treatment effects is strongly suspected on the basis of prior knowledge, trials could be prospectively powered to assess heterogeneity over risk and effect scores.
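A back-of-the-envelope calculation shows why heterogeneity assessments need more data. The sketch below assumes a continuous outcome with common variance, 1:1 randomization, and 2 equal-sized subgroups. These are simplifying assumptions introduced here, and the function names are illustrative.

```python
def variance_of_difference(n_a, n_b, sigma2=1.0):
    """Variance of a difference in means between 2 independent groups."""
    return sigma2 / n_a + sigma2 / n_b


def interaction_variance_ratio(n_total):
    """Variance of the between-subgroup difference in treatment effects,
    relative to the variance of the overall treatment effect."""
    # Overall effect: 2 arms of n_total / 2 participants each.
    overall = variance_of_difference(n_total / 2, n_total / 2)
    # Each subgroup holds half the trial, so its arms have n_total / 4 each.
    subgroup = variance_of_difference(n_total / 4, n_total / 4)
    # The subgroups are independent, so their variances add.
    interaction = subgroup + subgroup
    return interaction / overall


if __name__ == "__main__":
    print(interaction_variance_ratio(1000))
```

Under these assumptions the ratio is 4 regardless of trial size, so detecting a difference between subgroups as large as the overall effect, with the same power, would take roughly 4 times as many participants.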
Third, the internal development of scores should adopt modern approaches to control overfitting (eg, by sample splitting) and quantify uncertainty.11,12 These state-of-the-science approaches, which combine flexible estimation of the score with modern methods for statistical inference, will require integration of clinical and methodological expertise in research teams undertaking heterogeneity analyses.
Fourth, shared decision-making between clinicians and patients requires information regarding heterogeneity on the absolute scale.15 Thus, studies that investigate heterogeneity of treatment effects should examine absolute effect measures (eg, risk differences), possibly alongside relative effect measures (eg, relative risks or odds ratios).
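The difference between the 2 scales is easy to see numerically. The sketch below computes the effect measures named above from a pair of risks. The function name is illustrative, and the risks in the usage example are invented for demonstration only.

```python
def effect_measures(risk_treated, risk_control):
    """Absolute and relative effect measures for a binary outcome.
    Both risks must lie strictly between 0 and 1."""
    for risk in (risk_treated, risk_control):
        if not 0.0 < risk < 1.0:
            raise ValueError("risks must be strictly between 0 and 1")
    odds_treated = risk_treated / (1.0 - risk_treated)
    odds_control = risk_control / (1.0 - risk_control)
    return {
        "risk_difference": risk_treated - risk_control,  # absolute scale
        "relative_risk": risk_treated / risk_control,    # relative scale
        "odds_ratio": odds_treated / odds_control,       # relative scale
    }


if __name__ == "__main__":
    # Same relative risk (0.5) at 2 invented baseline risks.
    for control_risk in (0.20, 0.02):
        print(control_risk, effect_measures(control_risk * 0.5, control_risk))
```

With a constant relative risk of 0.5, the risk difference is 10 percentage points at a 20% baseline risk but only 1 percentage point at a 2% baseline risk, which is why homogeneity on a relative scale can still mean heterogeneity on the absolute scale.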
In summary, while conventional one-variable-at-a-time subgroup analyses and risk score approaches will continue to have a role in estimating and reporting variation in treatment effects across clinically relevant patient subgroups, randomized trials should increasingly report effect score analyses for detecting heterogeneity of treatment effects on an absolute scale to better inform personalized care decisions in real-world populations.