A new tool to assess Clinical Diversity In Meta‐analyses (CDIM) of interventions

Objective: To develop and validate Clinical Diversity In Meta-analyses (CDIM), a new tool for assessing clinical diversity between trials in meta-analyses of interventions. Study design and setting: The development of CDIM was based on consensus work informed by empirical literature and expertise. We drafted the CDIM tool, refined it, and validated CDIM for interrater scale reliability and agreement in three groups. Results: CDIM measures clinical diversity on a scale that includes four domains with 11 items overall: setting (time of conduct/country development status/units type); population (age, sex, patient inclusion criteria/baseline disease severity, comorbidities); interventions (intervention intensity/strength/duration of intervention, timing, control intervention, cointerventions); and outcome (definition of outcome, timing of outcome assessment). The CDIM is completed in two steps: first, two authors independently assess clinical diversity in the four domains. Second, after agreeing upon scores of individual items, a consensus score is achieved. Interrater scale reliability and agreement ranged from moderate to almost perfect depending on the type of raters. Conclusion: CDIM is the first tool developed for assessing clinical diversity in meta-analyses of interventions. We found CDIM to be a reliable tool for assessing clinical diversity among trials in meta-analysis.


Introduction
A meta-analysis of high-quality randomized clinical trials is considered the best available evidence in health care management and often forms the basis of clinical practice guidelines and of protocols for randomized clinical trials [1]. Still, undetected clinical diversity and methodological and/or statistical heterogeneity may lead to inappropriate conclusions or recommendations. Several potential sources of heterogeneity exist among trials included in systematic reviews and meta-analyses. Clinical diversity can be characterized by variability in settings, participants, characteristics of interventions and comparators, use of cointerventions, and the types and timing of outcome assessments. Methodological heterogeneity, or difference in risk of bias, is characterized by variability in trial design and quality in distinct domains. Statistical heterogeneity is characterized by variability in treatment effects between or among trials [2]. The presence and magnitude of statistical heterogeneity are associated with risk of bias and may be associated with clinical sources of diversity [3, 4], arise from other unknown or unrecorded trial characteristics, or arise from random errors ('play of chance') due to sparse data and repetitive testing [3-7]. In the context of systematic reviews, clinical diversity can be defined as differences in the clinical characteristics of trials, which may or may not lead to variations in the pooled treatment effect estimates across trials that are not explained by bias of the included trials [3, 4, 8].
In contrast to methodological and statistical heterogeneity [9], assessment of clinical diversity in meta-analyses is usually not conducted transparently and systematically [5, 7]. Although subgroup analyses and meta-regression analyses may detect differences in treatment effect sizes associated with trial characteristics, the overall clinical diversity is usually not assessed and mapped. We are not aware of any tool designed to assess and map clinical diversity in meta-analyses.
One of the main reasons to explore clinical diversity is to inform treatment decisions, eg, by identifying specific aspects of the intervention or population that might make an intervention more or less effective. It is therefore important to improve the interpretation of systematic reviews, and possibly their external validity, by increasing our understanding of clinical diversity. Furthermore, as methodological heterogeneity does not include clinical differences between trials of the included interventions, such as dosage or length of follow-up, there is a further call for a tool to assess, map, and screen clinical diversity between trials in a meta-analysis.
Accordingly, we aimed to develop a tool for assessing and mapping clinical diversity in meta-analyses of interventions, and to test the reliability of the tool. In a supplementary exploratory analysis, we estimated the association, if any, between a summary clinical diversity in meta-analyses score and statistical heterogeneity.

Methods
The development of the Clinical Diversity In Meta-analyses (CDIM) tool and the assessments of its interrater scale reliability and agreement were conducted following our prepublished protocol and reported following the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [10, 11].

Development of CDIM
We constructed CDIM during a pilot phase based on consensus work informed by empirical literature and expertise by Gagnier and colleagues [12, 13] (Fig. 1a): a methodological review of the literature offering guidance on clinical diversity in systematic reviews, and their consensus-based recommendations for investigating clinical diversity in systematic reviews (based on a modified Delphi technique with three phases: 1. pre-meeting item generation; 2. face-to-face consensus meeting; and 3. post-meeting feedback).
One author drafted the CDIM tool, which was reviewed by the author/project group, revised according to comments, and circulated three times (Fig. 1a). Initially, a complete list of Cochrane reviews within the field of intensive care medicine was created [14, 15]. Two authors scored the first three meta-analyses, subsequently adjusted the CDIM tool, and wrote a draft manual providing guidance on the use of CDIM. The manual was circulated between the authors and revised. Thereafter, two authors scored the next five meta-analyses from the same list, and the overall summary CDIM score (CDIMS) was categorized into low, moderate, or high clinical diversity. A final version of the CDIM tool was produced to be evaluated for reliability.
In the following, CDIM refers to the mapping tool of Clinical Diversity In Meta-analyses, while CDIMS refers to the sum of item scores in the CDIM tool.

Testing of CDIM
A sample of 60 meta-analyses was deemed sufficient to evaluate CDIM, as 10-20 evaluations per category are considered sufficient to accurately estimate the coefficients of a regression model [16], and two times the squared number of categories (2 × categories²) is considered sufficient to approximate a normal distribution for the analysis of quadratic weighted kappa [17].
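The arithmetic behind these two rules of thumb can be sketched as follows. This is an illustration only: it assumes that "category" refers to the three CDIMS categories (low, moderate, high) and that the per-category rule is multiplied by the number of categories; the function name is hypothetical.

```python
# Illustrative sketch of the sample-size rules of thumb cited in the text.
# Assumption: "category" means the three CDIMS categories (low, moderate, high).

def kappa_min_sample(n_categories: int) -> int:
    """Rule of thumb for quadratic weighted kappa: 2 x (number of categories)^2."""
    return 2 * n_categories ** 2


n_categories = 3          # low, moderate, high clinical diversity
per_category = (10, 20)   # evaluations per category cited for regression models

# Assumed reading: total evaluations = evaluations per category x number of categories
regression_range = tuple(n * n_categories for n in per_category)

print(kappa_min_sample(n_categories))  # 18
print(regression_range)                # (30, 60)
```

Under these assumptions, the sample of 60 meta-analyses meets the upper bound of the regression rule and exceeds the kappa rule.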
We applied CDIM to 60 meta-analyses with a dichotomous primary outcome and at least three randomized clinical trials included (Fig. 1b). We selected, in consecutive order, 20 titles (which had not already been used in the development of CDIM) from the list of Cochrane reviews within the intensive care setting. Another 20 Cochrane reviews of interventions focusing on clinical scenarios outside the intensive care setting were selected to cover a wide range of non-intensive care interventions. These were picked by browsing The Cochrane Database of Systematic Reviews by topic. Finally, a convenience sample of 20 mainly non-Cochrane reviews with meta-analyses, of which around half were within the field of intensive care, was selected.
We evaluated CDIMS for interrater scale reliability by scoring the 60 meta-analyses with CDIMS [11]. Two independent raters involved in the development of CDIMS (co-developers) and two independent raters involved neither in the development of CDIMS nor in the meta-analyses (non-developers) scored the same 40 meta-analyses. Finally, each of the 20 mainly non-Cochrane reviews with meta-analyses was scored with CDIMS by two of the review's original authors.
The two non-developers of CDIM and the 20 pairs of original review authors were instructed only by reading the guidance document; no additional guidance was given.
After individual and independent scoring of clinical diversity using CDIMS, the raters agreed pairwise upon each item score, thereby achieving a summarized consensus CDIMS.
According to our protocol [10], we calculated the interrater agreement of CDIMS assessed by two independent raters in 60 meta-analyses. Agreement was analyzed by linear regression and weighted kappa for agreement between two raters, with 95% confidence intervals, using CDIMS with unweighted item scores and clinical diversity classified as low (score 0-10), moderate or unclear (score 11-18), or high (score 19-22):
1) We analyzed the interrater agreement of CDIMS, assessed by two independent raters involved in the development of the CDIM and the CDIM manual, in 40 meta-analyses (20 ICU meta-analyses and 20 non-ICU meta-analyses), estimating weighted kappa and intraclass correlation coefficient (ICC) by linear regression. We investigated the interaction between the interrater agreement or ICC and whether the meta-analyses were ICU or non-ICU meta-analyses.
2) We analyzed the interrater agreement of CDIMS, assessed by two independent raters not involved in the development of the CDIM and the CDIM manual, in 40 meta-analyses (20 ICU meta-analyses and 20 non-ICU meta-analyses), estimating weighted kappa and ICC by linear regression. We investigated the interaction between the interrater agreement or ICC and whether the meta-analyses were ICU or non-ICU meta-analyses.
3) We analyzed the interrater agreement in 20 systematic reviews, within 20 pairs of review authors each scoring a meta-analysis of the primary, dichotomous outcome of their systematic review, estimating weighted kappa and ICC by linear regression. We investigated the interaction between the interrater agreement or ICC and whether the meta-analyses were ICU or non-ICU meta-analyses.
For pairs of original review authors, we also analyzed interrater reliability within the specific domains of CDIM.
We stratified the analyses of interrater scale reliability between co-developers of CDIM and non-developers of CDIM according to meta-analyses of intensive care unit (ICU) interventions or non-ICU interventions. We analyzed the possible difference between the distributions of consensus CDIMS in ICU and non-ICU meta-analyses using the Mann-Whitney test, presenting box-and-whisker plots with medians, interquartile ranges, and full ranges.
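As a minimal sketch of the statistic underlying the Mann-Whitney test, the U values for two groups of consensus CDIMS could be computed as below. The function names and the example scores are hypothetical and are not data from this study; the P-value calculation is omitted.

```python
from typing import List, Sequence, Tuple


def midranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U for x, U for y) from the rank sum of the pooled sample."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical consensus scores, for illustration only
icu_scores = [18, 17, 20, 9]
non_icu_scores = [12, 7, 18, 11]
print(mann_whitney_u(icu_scores, non_icu_scores))
```

A U value close to zero or to the product of the two group sizes indicates that one group's scores tend to lie above the other's.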
The interrater reliabilities of the overall summarized CDIMS were analyzed with ICC using one-way random reliability analysis of exact agreement on average CDIMS and for single measures (single meta-analysis) for co-developers and non-developers of CDIM. A two-way random reliability analysis of exact agreement was used for pairs of original review authors. For pairs of original review authors, we also analyzed interrater reliability within the domains of CDIM.
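For orientation, a one-way random ICC for single and average measures can be sketched from the between-target and within-target mean squares. This is an illustration under the standard one-way ANOVA formulation, not the software routine used in the study; the function name is hypothetical and confidence intervals are not computed.

```python
from typing import Sequence, Tuple


def icc_one_way(ratings: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (single-measure ICC, average-measure ICC), one-way random model.

    ratings[i] holds the k scores given to meta-analysis i by k raters.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 targets, each with the same k >= 2 ratings")

    row_means = [sum(row) / k for row in ratings]
    grand_mean = sum(row_means) / n

    # Between-target and within-target mean squares from one-way ANOVA
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    if ms_between == 0:
        raise ValueError("no variation between targets; ICC is undefined")

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average
```

The average-measure value is never lower than the single-measure value when agreement is positive, which matches the pattern of the ICCs reported for each rater group.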
Quadratic weighted kappa values were calculated for the agreement on the protocolized categorical classification of CDIMS (low: 0-11; moderate: 12-18; high: 19-22), defined after a pilot scoring. Moreover, quadratic weighted kappa values were calculated for the agreement between the protocolized categorical classification of CDIMS and the categorical classification of I² in the meta-analyses (low I² ≤ 30%; moderate I² > 30% to ≤ 60%; high I² > 60%), modified from Higgins et al. [18]. Imputed relative distances between ordinal categories in the calculation of the quadratic weighted kappa were set to one.
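A minimal sketch of a quadratic weighted kappa for two raters is given below, assuming equally spaced ordinal categories coded 0 (low), 1 (moderate), and 2 (high). The function names are hypothetical; the I² cut-offs follow those stated above. Confidence intervals are not computed.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater1: Sequence[int], rater2: Sequence[int], n_categories: int
) -> float:
    """Quadratic weighted kappa for categories coded 0 .. n_categories - 1."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must score the same non-empty set")
    k = n_categories
    n = len(rater1)

    # Observed cross-classification of the two raters
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1 - weighted_observed / weighted_expected


def i2_category(i2_percent: float) -> int:
    """Classify I² as 0 (low, <= 30%), 1 (moderate, > 30% to <= 60%), or 2 (high)."""
    if i2_percent <= 30:
        return 0
    if i2_percent <= 60:
        return 1
    return 2
```

A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance; negative values indicate observed concordance below chance concordance.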
Additionally, linear regression analyses were performed for any associations between the raters' summarized total CDIMS. Finally, we analyzed the possible association between the consensus CDIMS and I² in 60 meta-analyses using linear regression. Pearson's correlation coefficients, R², and P-values for the linear regression coefficients being equal to zero were calculated. We plotted regression lines and regression standardized residuals, including P-P plots, to investigate whether residuals were normally distributed, as required for a linear regression model to be adequate.
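The regression of one rater's CDIMS on another's can be sketched with ordinary least squares as below. The function name is hypothetical; P-values and residual diagnostics are not included in this illustration.

```python
from math import sqrt
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (slope, intercept, Pearson r) for a least-squares line y = slope*x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# R² for a simple linear regression is the square of Pearson's r
slope, intercept, r = fit_line([0, 1, 2], [1, 3, 5])
r_squared = r ** 2
```

For two raters in close agreement, the slope would be expected near one and the intercept near zero.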

Results
CDIM measures clinical diversity on an ordinal scale that includes four domains with 11 items overall, covering essential domains describing clinical diversity [12, 13] (Table 1).
• The first domain aims to detect setting diversity by assessing differences between trials in: time of conduct; type of country development status; localization within the health care system.
The 11 items are each scored as low diversity (0 points), moderate (or unknown/undescribed) diversity (1 point), or high diversity (2 points), giving a total range of 0-22 with equal weight assigned to each item. The thresholds should be used as guidance only, and assessors may use other thresholds if that better suits the clinical field under investigation. When in doubt, assessors may choose the score 1, which corresponds to "unknown/undescribed/not applicable". Guidance on how to score each item is provided in the CDIM manual (Supplementary Appendix A).
2. Sum the item scores into summary CDIMS.
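The scoring and summation can be sketched as follows. This is an illustration rather than part of the CDIM manual: the function names are hypothetical, and the categories use the cut-offs given for the protocolized analysis (low 0-10, moderate 11-18, high 19-22), which the text presents as guidance only.

```python
from typing import Sequence

N_ITEMS = 11               # number of items in the CDIM tool
VALID_SCORES = {0, 1, 2}   # 0 = low, 1 = moderate/unknown, 2 = high diversity


def cdims(item_scores: Sequence[int]) -> int:
    """Sum 11 equally weighted item scores into the summary CDIMS (0-22)."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores, got {len(item_scores)}")
    if any(score not in VALID_SCORES for score in item_scores):
        raise ValueError("each item must be scored 0, 1, or 2")
    return sum(item_scores)


def cdims_category(score: int) -> str:
    """Classify a summary CDIMS using the assumed cut-offs 0-10, 11-18, 19-22."""
    if not 0 <= score <= 2 * N_ITEMS:
        raise ValueError("CDIMS must lie between 0 and 22")
    if score <= 10:
        return "low"
    if score <= 18:
        return "moderate"
    return "high"


# Hypothetical consensus item scores for one meta-analysis
example_scores = [2, 1, 0, 2, 1, 1, 2, 0, 1, 2, 1]
print(cdims(example_scores), cdims_category(cdims(example_scores)))  # 13 moderate
```

Keeping the item scores alongside the sum preserves the mapping of which characteristics drive the overall score.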

Interrater scale reliability and agreement of CDIMS
Four raters independently applied CDIM to 20 meta-analyses of ICU interventions and 20 meta-analyses of non-ICU interventions, for a total of 160 evaluations. Twenty pairs (of 35 different raters) of original review authors applied CDIM to 20 meta-analyses, for a total of 40 evaluations (Supplementary Appendix B). In total, 721 tri-
CDIMS varied between 0 and 21 points in the 60 meta-analyses. Average CDIMS for all raters varied between (mean ± SD) 11.5 ± 5.4 and 14.2 ± 3.9, and the difference between average CDIMS for pairs of raters ranged between 0.3 and 2.4 (Table 2).

Co-developers of CDIM
Interrater scale reliability of CDIMS was almost perfect for two co-developers of CDIM, with an ICC of 0.85 (95% confidence interval 0.72-0.92) for average measures, and substantial, with an ICC of 0.74 (0.56-0.85), for single measures. Pearson's correlation coefficient was 0.76 (0.53-0.98). The quadratic weighted kappa for the agreement between categorical CDIMS for two co-developers was substantial, with a kappa of 0.61 (0.18-1.00). Consensus CDIMS between developers of CDIM stratified for ICU and non-ICU meta-analyses were median 18 (range 9-20) and median 12 (range 7-18), respectively (P = 0.001, Mann-Whitney test for different distributions of CDIMS; Supplementary Appendix C). The interrater scale reliability between two developers of CDIM in ICU meta-analyses and non-ICU meta-analyses was almost perfect as well (Table 3).

Non-developers of CDIM
Interrater scale reliability for two non-developers of CDIM was substantial, with an ICC of 0.74 (0.51-0.86) for average measures, and moderate for single measures, with an ICC of 0.59 (0.34-0.76). Pearson's correlation coefficient was 0.72 (0.56-0.88). The quadratic weighted kappa for the agreement between categorical CDIMS for two non-developers was moderate, with a kappa of 0.41 (0.14-0.69). Consensus CDIMS between non-developers of CDIM stratified for ICU and non-ICU meta-analyses were median 17 (range 7-21) and median 12 (range 5-19), respectively (P = 0.016, Mann-Whitney test for different distributions of CDIMS; Supplementary Appendix C). The interrater scale reliability between two non-developers of CDIM in ICU meta-analyses and non-ICU meta-analyses was substantial and moderate, respectively, for average measures, and moderate and fair, respectively, for single measures (Table 3).

Pairs of original review authors
Interrater scale reliability of CDIMS for two original review authors was almost perfect with an ICC of 0.94 (0.85-0.98) for average measures and 0.89 (0.75-0.96) for single measures. Pearson's correlation coefficient was 0.90
b One-way random reliability analysis of exact agreement of 40 meta-analyses rated with CDIMS.
c Two-way random reliability analysis of exact agreement of 20 pairs of raters of 20 meta-analyses not involved in the development of CDIMS. CI is confidence interval.
Non-ICU meta-analyses: Interrater agreement a between two non-developers of CDIM: 0.55 (-0.13 to 0.82); 0.38 (-0.06 to 0.69); 0.63 (0.17-0.70); 1. Rater: 12.4 ± 3.4; 2. Rater: 9.0 ± 4.9
a One-way random reliability analysis of exact agreement in 20 meta-analyses rated with CDIMS. CI is confidence interval. SPSS version 17 was used.
Interrater scale reliability of CDIMS for two original review authors on the four CDIM domains was consistent, with a scale reliability ranging from substantial to almost perfect across domains. The domain summary scale reliability ranged from 0.68-0.93 for average measures and from 0.51-0.87 for single meta-analyses (Supplementary Appendix 3).

Consensus scores between developers and non-developers of CDIM
Interrater scale reliability of consensus CDIMS between developers and non-developers of CDIMS was almost perfect, with an ICC of 0.91 (0.83-0.95) for average measures and 0.84 (0.72-0.91) for single measures. Pearson's correlation coefficient was 0.85 (0.81-1.22) (Supplementary Appendix C). The quadratic weighted kappa for the agreement between the categorical consensus CDIMS was substantial, with a kappa of 0.68 (0.38-0.98).
Linear regression showed that a linear model explained 52%-82% of the covariation in CDIMS between raters, regardless of whether the meta-analyses were ICU or non-ICU meta-analyses (Table 2). Analyses of model fit justified a linear regression model, as standardized residuals were normally distributed.

Association between clinical diversity expressed as consensus CDIMS and statistical heterogeneity expressed as I²
Consensus CDIMS from both developers and non-developers of CDIMS, supplemented with consensus CDIMS for pairs of original review authors, indicated an absence of association with I², with regression coefficients close to zero with narrow CIs: -0.02 (-1.6 to 1.4) and -0.13 (-2.0 to 0.7), respectively (Table 4 and Supplementary Appendix C). In fact, a linear model seems unjustified, as analyses of standardized residuals indicated the absence of a normal distribution. The quadratic weighted kappa for the agreement between categorical consensus CDIMS and categorical statistical heterogeneity was not calculable because the observed concordance was smaller than mean chance concordance (Table 4).

Discussion
We aimed to create a systematic approach to assess and map differences in clinical characteristics (diversity) of included trials in a meta-analysis. By clinical characteristic differences, we mean factual clinical differences between or among the included trials in a meta-analysis. To this end, we developed CDIM and its summary score CDIMS, a mapping and screening tool that characterizes the factual clinical diversity present in a meta-analysis, which may or may not have been explored for effect-modifying properties in the systematic review.
We evaluated CDIM in three groups of assessors. The highest interrater scale reliability and agreement on both average and single summarized measures of CDIMS and categorical classification of CDIMS (low, moderate, high) were achieved in groups of original review authors. Co-developers achieved lower interrater scale reliability and agreement compared to original review authors. Non-developers of CDIM who were not involved in the rated meta-analyses achieved the lowest interrater reliability and agreement. Although interrater scale reliability and agreement between non-developers of CDIM were only moderate to substantial for average measures, single measures, and categorical classifications of CDIMS, respectively, the reliability and agreement increased to substantial and almost perfect, respectively, when either scores from two co-developers of CDIM or two original review authors were compared. Even for individual domains, the reliability and agreement were substantial to almost perfect when ratings of two original review authors were compared. The external reliability (or generalizability) tested by assessing consensus scores from the group of co-developers and
'lumping' review that includes all participants regardless of eg, age, and thus may lead to high clinical diversity between the included trials for the items of age, but also for items such as participant inclusion criteria, baseline disease severity, and comorbidities, consequently leading to possible double counts.
In our sample, clinical diversity in meta-analyses of interventions in the field of intensive care appears to be higher than in the group of meta-analyses in other medical fields. This difference may indicate higher clinical diversity in meta-analyses in the field of intensive care, but it may also be a chance finding. Nevertheless, the domains and items included in the CDIM tool have been selected as key categories/topics specifically for the purpose of investigating clinical diversity in meta-analyses regardless of the medical field [13].
The imperfect agreement between the categories low, moderate, and high CDIMS may be attributable to the somewhat arbitrary cut-offs between these categories, which may be reflected in the analyses of the quadratic weighted kappa values.

Implications
The CDIM tool is designed to be applicable in all medical fields and intended to be used by multiple users such as researchers and guideline panels conducting or critically appraising meta-analyses.
The CDIM tool is to be used as a mapping and screening tool that should help authors of systematic reviews compare, in a structured and transparent way, the PICO characteristics across trials and with the review PICO, as well as other clinical characteristics of the trials included in the meta-analysis. Thus, the tool is only intended to be used as a mapping and screening tool that may be used to point out clinical characteristics that ought to be explored further. The scorings cannot tell how and why the meta-analytic result is affected by clinical diversity. The scoring will enable authors, post hoc and in future updates of their review, to explore factors that might modify intervention effects importantly, and in that way identify potential effect modifiers. Further, systematic reviewers may easily incorporate a plan in their protocol for a new systematic review to use CDIM to check whether possible clinical diversity has been explored sufficiently. CDIM can then be used to define and select subgroup analyses, as we suggest that items with a score of 2 possibly ought to be explored in subgroup analyses; in this way, CDIM can be used prospectively to select subgroup analyses (Fig. 3).
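The suggestion that items scored 2 are candidates for subgroup analysis can be sketched as a simple filter. The function name and the example scores are hypothetical, and the item labels are shorthand rather than the wording of the CDIM tool.

```python
from typing import Dict, List


def items_to_explore(item_scores: Dict[str, int], threshold: int = 2) -> List[str]:
    """Return the items whose consensus score reaches the threshold (default 2)."""
    return [name for name, score in item_scores.items() if score >= threshold]


# Hypothetical consensus scores for a few items, for illustration only
consensus = {
    "age": 2,
    "timing of outcome assessment": 0,
    "cointerventions": 2,
    "comorbidities": 1,
}
print(items_to_explore(consensus))  # ['age', 'cointerventions']
```

The resulting list is a starting point for planning subgroup analyses, not evidence that the flagged characteristics modify the intervention effect.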
The summary clinical diversity measure, CDIMS, is intended to inform systematic reviewers about the degree of clinical diversity between the trials; the higher the overall score, the more this should be explored, guided by the mapping of the clinical differences, eg, by subgroup analyses, and highlighted in the systematic review or in future updates of the review. Furthermore, if CDIMS is zero or low and the items contributing to it have been explored in subgroup analyses, then clinical diversity may have been addressed or may not be a problem, whether or not effect modifiers have been detected. On the other hand, if CDIMS is moderate or high and items contributing to a high CDIMS have not been explored in the systematic review by subgroup analysis or meta-regression, then as yet unknown effect modifiers (confounders) remain a possibility and should be explored in the meta-analysis and in future updates of the meta-analysis. Even though a relevant PICO has been phrased in the protocol of a systematic review, the trials included may vary a great deal on several (other) clinical characteristics not covered by the inclusion or exclusion criteria.
The panel of items chosen in CDIM is based on the consensus panel of possible clinically important characteristics that ought to be explored in all meta-analyses; CDIM is not a panel of universally known effect modifiers. However, this does not rule out the possibility of combining it with knowledge from, eg, large observational studies pointing to relevant risk factors or effect modifiers. If such knowledge exists and CDIM reveals clinical diversity among the included trials due to a diversity of suspected risk factors, there is even more reason for exploring whether effect modification by these factors is present. If risk factors derived from large observational studies have been identified but not detected by CDIM, then we suggest that these factors should be assessed in subgroup analyses as well.
Our analyses illustrate that CDIM is a reliable mapping and screening tool for assessing clinical diversity in meta-analyses. We suggest using CDIM in the systematic review process to quantify overall clinical diversity and to highlight clinical diversity within specific domains; it may also be practical when assessing indirectness and inconsistency in GRADE [5]. Other implications include the possibility of comparing CDIMS across meta-analyses and with measures of statistical heterogeneity such as I² or D² [23]. However, our finding of a lack of association between clinical diversity and statistical heterogeneity should be considered hypothesis-generating due to the limited number of investigated meta-analyses and scenarios. In any case, we recommend that these results be explored further. We encourage investigators to provide feedback and report experiences to the corresponding author.
In conclusion, CDIM is the first tool developed to assess clinical diversity in meta-analyses. Interrater scale reliability for overall CDIMS in various scenarios varied from moderate to almost perfect. Reliability was almost perfect between original review authors and between consensus scores of non-developers and co-developers of the CDIM tool. We consider CDIM a reliable tool and recommend using CDIM for the assessment and mapping of the overall clinical diversity in meta-analyses.

Fig. 1. (a) Process of the development of Clinical Diversity In Meta-analyses (CDIM). (b) Interrater scale reliability and agreement testing of Clinical Diversity In Meta-analyses Score (CDIMS).

Fig. 2. Fitted regression line (Y = 0.90 × X + 0.13) of Clinical Diversity In Meta-analyses Score (CDIMS) from second original review author on CDIMS from first original review author in 20 meta-analyses from mainly non-Cochrane reviews. Hyperbolic lines around the fitted line represent the 95% CI for the regression line. R² = 0.82.

Fig. 3. Interpretation of Clinical Diversity In Meta-analyses Score (CDIMS) and how to conclude on the assessment of clinical diversity between included trials in a meta-analysis.

Table 1. The Clinical Diversity In Meta-analyses (CDIM) tool

Table 2. Interrater agreements of Clinical Diversity In Meta-analyses Score (CDIMS) stratified for types of raters as developers, original review authors, and non-developers of CDIM
a Low CDIMS: 0 to 10; moderate CDIMS: 11 to 18; high CDIMS: 19 to 22.

Table 3. Interrater agreements of Clinical Diversity In Meta-analyses Score (CDIMS) stratified for ICU and non-ICU meta-analyses