
Release Notes

Latest version: v2024.09.20-cancer

BoostDM v2024.09.20-cancer

This boostDM release has been built on the output produced by Intogen v2024.09.20. It incorporates several improvements and new features, including changes in the model implementation, updated reference datasets and bug fixes.
Herein we describe the main technical changes, summarize the new output of the pipeline and provide a comparison with the previous release.


Model Implementation

Uniform cluster features across cohorts of same tumor type category

Although mutational cluster features (OncodriveCLUSTL, HotMAPS and smRegions) are generated on a cohort-by-cohort basis, boostDM models are specified by tumor type, meaning that an aggregation or consensus step has to be taken to annotate the mutations in the training and test sets.

In the previous release, mutations with the same genomic identity (same genomic coordinates and alternate allele) would be annotated with different mutational cluster feature values depending on which cohort they were originally reported in, thereby yielding instances of mutations with the same identity but different cluster feature values across samples of the same tumor type.

In the current version we have changed this criterion: if a mutation is mapped by a clustering feature in any cohort belonging to a given tumor type, all instances of that mutation across the tumor type receive the same cluster feature value in the training and prediction corresponding to this tumor type.
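
For illustration, a minimal sketch of this aggregation, assuming cohort-level annotations are collected in a pandas DataFrame with hypothetical column names:

# A minimal sketch: if any cohort of a tumor type maps a mutation to a cluster,
# all instances of that mutation in the tumor type receive the same (maximum) flag.
import pandas as pd

def unify_cluster_feature(df: pd.DataFrame, feature="OncodriveCLUSTL_cat") -> pd.Series:
    """Propagate the maximum cluster flag per mutation identity within each tumor type."""
    return df.groupby(["chr", "pos", "alt", "tumor_type"])[feature].transform("max")

# Usage: df[feature] = unify_cluster_feature(df)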

Ordinal encoding of cluster features

Each cluster method (OncodriveCLUSTL, HotMAPS and smRegions) is now associated with a single ordinal feature with three levels:

  • 2 - the mutation is associated with a cluster in the same tumor type
  • 1 - the mutation is only associated with a cluster in another tumor type
  • 0 - the mutation is not associated with a cluster in any tumor type

Therefore we no longer keep the distinction between cat_1 and cat_2 binary-encoded cluster features. Moreover, we no longer use the OncodriveCLUSTL_SCORE numerical feature, given its small gain in predictive ability and explainability and the added complexity when interpreting feature contributions at prediction time.
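
A minimal sketch of this ordinal encoding, assuming two boolean flags summarizing the cluster mappings are available (names are illustrative):

def encode_cluster_feature(in_same_ttype, in_other_ttype):
    # 2: cluster in the same tumor type; 1: cluster only in another tumor type; 0: no cluster
    if in_same_ttype:
        return 2
    if in_other_ttype:
        return 1
    return 0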

Unweighted consensus of base classifiers

The forecast aggregator combining the 50 base classifiers per model is now unweighted, meaning that the base classifiers are no longer favored or penalized based on their cross-validation performance. Particularly in those cases where the cross-validation test set remaining after removing repeated mutation instances is very small, the performance score of a single base classifier can be biased and thus misleading when deciding whether that base classifier should be promoted in the forecast aggregator.
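
As an illustration, one simple form of unweighted aggregation is a plain average of the base classifier scores; the sketch below is an assumption about the aggregation function, not the pipeline's exact implementation:

import numpy as np

def aggregate_unweighted(base_scores):
    """base_scores: array-like with the 50 per-base-classifier scores for one mutation."""
    return float(np.mean(base_scores))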

Model selection

Our model selection strategy is now conducted in two steps:

  1. Models are only trained if there are at least 30 mutations on average across the training splits of the 50 base classifiers. The training splits are label-balanced sets comprising 70% of the total mutation training set; the size of the mutation set available for training in turn depends on the dN/dS excess per consequence type inferred in each cohort.
  2. We apply a composite rule that requires:
    • Mean CV performance (F-score50) across base classifiers >= 0.8
    • A minimum number of observed mutations for each tier defined by the discovery index

We have updated step 2 by changing the discovery index tiers and the minimum number of observed mutations for each tier. The new step 2 reads in Python code as follows:

DISCOVERY_TIERS = (0, 0.5, 0.75)
MUTATION_TIERS = (50, 30, 0)
FSCORE_THRESHOLD = 0.8

def meet_condition(fscore, discovery, n_muts):
    # First require a minimum mean CV performance (F-score50) across base classifiers
    if fscore >= FSCORE_THRESHOLD:
        # Then require enough observed mutations for at least one discovery index tier;
        # higher discovery index values demand fewer observed mutations
        for discovery_thresh, n_muts_thresh in zip(DISCOVERY_TIERS, MUTATION_TIERS):
            if (discovery >= discovery_thresh) and (n_muts >= n_muts_thresh):
                return True
    return False
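
For illustration, a hypothetical model with a mean F-score50 of 0.85, a discovery index of 0.6 and 35 observed mutations would be retained under this rule:

meet_condition(0.85, 0.6, 35)   # True: the 0.5 discovery tier only requires 30 observed mutations
meet_condition(0.85, 0.2, 35)   # False: the lowest discovery tier requires at least 50 observed mutations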

The rationale behind this change is that it is very difficult to assert qualitative differences between discovery index values within the (0, 0.5) range, making a case subdivision in this range hard to justify. Broader discovery index tiers therefore yield more qualitatively distinct discovery classes.

Pan-gene models are no longer trained

We no longer train meta-gene models based on the dichotomous mode-of-action classification of genes into oncogenes (Act) and tumor suppressors (LoF). This is in part because such models end up being dominated by the few genes contributing most mutations to each Act or LoF class. The pipeline now only renders gene-specific models. Consequently, the oncogenic mode-of-action or “role” of the gene is no longer used as a feature, and the column “selected_model_gene” is no longer used in the in silico saturation mutagenesis output tables.

Prediction output format

The in silico saturation mutagenesis output is displayed with some formatting differences, with filenames following the pattern [gene].model.[ttype_model].features.[ttype_features].prediction.tsv.gz. The term ttype_model represents the tumor type used as training context: the set of cohorts where the signals of positive selection were calculated and from which the training mutations were taken for the specific model used to cast the prediction, i.e., the tumor type context in which the model was learned. The term ttype_features is the tumor type corresponding to the features used to encode the input mutation as a vector of feature values for prediction, i.e., the tumor type of the query mutation. In this release, we provide in silico saturation mutagenesis outputs where the predictive model has been trained in the same tumor type context as the query mutation. As a consequence of making the training and query contexts explicit, the column selected_model_ttype is no longer part of the output tables.
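
For illustration, a minimal sketch (not an official parser) of how these filenames can be decomposed into their gene and tumor type components:

import re

FILENAME_PATTERN = re.compile(
    r"(?P<gene>[^.]+)\.model\.(?P<ttype_model>[^.]+)\.features\.(?P<ttype_features>[^.]+)\.prediction\.tsv\.gz"
)

def parse_prediction_filename(filename):
    """Return gene, training tumor type and feature tumor type from a prediction filename."""
    match = FILENAME_PATTERN.fullmatch(filename)
    return match.groupdict() if match else None

# Hypothetical example filename
parse_prediction_filename("TP53.model.BRCA.features.BRCA.prediction.tsv.gz")
# {'gene': 'TP53', 'ttype_model': 'BRCA', 'ttype_features': 'BRCA'}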

Data Updates

IntOGen data

The current version uses the outputs of Intogen v2024.09.20.

Ensembl Variant Effect Predictor

The new release uses ENSEMBL Variant Effect Predictor (VEP) v.111 annotations instead of ENSEMBL VEP v.92.

Transcripts

Ensembl canonical transcripts are no longer used as the reference transcripts to interpret relevant sites and the consequence type of mutations. The current release is built on MANE (Matched Annotation from NCBI and EMBL-EBI) transcripts. We have switched to MANE Select transcripts to align with the standards commonly used in the clinic. This redefinition of the reference transcript is in accordance with the current Intogen release v2024.09.20, where it affects the regions used by methods like OncodriveFML, OncodriveCLUSTL and smRegions, as well as the computation of mutational profiles and the construction of coding sequence regions for dNdScv.

Genomic regions of interest

In the current release, the genomic regions considered for training and prediction include the entire coding sequence according to MANE Select, including stop codons (as in the previous release), as well as the 5 noncoding base pairs flanking each exonic coding sequence segment. The inclusion of the noncoding base pairs is intended to exploit the signals of positive selection cast by noncoding splicing-affecting mutations. The mapping is done according to the MANE Select transcripts retrieved from VEP v.111.
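
As an illustration, a minimal sketch of how such a region of interest could be assembled, assuming per-exon coding segment coordinates are available (function and variable names are hypothetical):

FLANK = 5  # noncoding base pairs added on each side of every coding exon segment

def region_of_interest(cds_segments, flank=FLANK):
    """cds_segments: list of (start, end) genomic coordinates of coding exon segments."""
    return [(start - flank, end + flank) for start, end in cds_segments]

# Hypothetical exon segments of a MANE Select transcript
region_of_interest([(1000, 1200), (1500, 1800)])
# [(995, 1205), (1495, 1805)]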

Updated consequence type definitions

We have extended the set of Sequence Ontology (SO) terms that map to the consequence classification we employ in the pipeline, which consists of four consequence types: “missense”, “nonsense”, “splicing” and “synonymous”. A sketch of the mapping is shown after the list below.

  • The term nonsense now includes the SO terms:
    • stop_gained
    • stop_lost
    • start_lost
  • The term splicing now includes the SO terms:
    • splice_donor_variant
    • splice_acceptor_variant
    • splice_region_variant
    • splice_donor_5th_base_variant
    • splice_donor_region_variant
    • splice_polypyrimidine_tract_variant
    • intron_variant
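
A minimal sketch of this mapping as a Python dictionary; the entries for the missense and synonymous classes are assumptions added for completeness:

SO_TO_CONSEQUENCE = {
    # nonsense
    "stop_gained": "nonsense",
    "stop_lost": "nonsense",
    "start_lost": "nonsense",
    # splicing
    "splice_donor_variant": "splicing",
    "splice_acceptor_variant": "splicing",
    "splice_region_variant": "splicing",
    "splice_donor_5th_base_variant": "splicing",
    "splice_donor_region_variant": "splicing",
    "splice_polypyrimidine_tract_variant": "splicing",
    "intron_variant": "splicing",
    # assumed mappings for the remaining two classes
    "missense_variant": "missense",
    "synonymous_variant": "synonymous",
}

def consequence_type(so_term):
    """Map a VEP Sequence Ontology term to one of the four pipeline consequence types."""
    return SO_TO_CONSEQUENCE.get(so_term)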

Pfam domains

We have updated the version of Pfam to v35.0.

Oncotree

As the tumor type ontology for hierarchical model implementation across tumor types, we now use the oncotree version boostdm2023 (see Downloads), which is based on the 2021 version of the MSKCC oncotree.

Bug Fixing and technical updates

Fixed reference genome incompatibility

Recently we found a misspecification in the codebase that prevented the randomization of passenger mutations from working as intended. As described in the main documentation, our method resorts to trinucleotide context probabilities derived from the observed frequencies in the cohort mutational data. With these probabilities we can then draw random mutations to create the set of passenger mutations used as a negative set in the supervised learning step.

The step that converted genomic coordinates into reference triplets prior to randomization unintentionally used a reference genome inconsistent with the rest of the pipeline, resulting in a mismatch between sites and reference triplets. Consequently, the nondriver training set was effectively drawn in a trinucleotide-agnostic way.

In the current version the hg38 genome build is consistently used throughout the entire pipeline.
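
For illustration, a minimal sketch of how passenger mutations could be drawn following a trinucleotide-context profile; the data structures and names are illustrative, not the pipeline's own:

import random

def draw_passengers(candidate_sites, profile, n, seed=42):
    """candidate_sites: list of dicts with 'pos', 'triplet' and 'alt' keys (hg38 coordinates).
    profile: dict mapping (triplet, alt) to its probability under neutral mutagenesis."""
    rng = random.Random(seed)
    weights = [profile.get((site["triplet"], site["alt"]), 0.0) for site in candidate_sites]
    return rng.choices(candidate_sites, weights=weights, k=n)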

Intogen integration

Intogen now generates a unified data environment so that the boostDM pipeline can be run from the new Intogen outputs with minimal preprocessing. For more details, check out the Intogen website.

New pipeline steps

The current version of the pipeline adds two supplemental steps to streamline post-hoc analysis, namely model visualization and benchmarking.

Plotting

All the plots used for basic model visualization in the boostDM website are automatically generated as part of a supplemental step in the pipeline. There are three types of plots:

  • Blueprints. Stacked representation mapping the protein coordinates of the gene to different annotation layers, from top to bottom: observed mutation frequencies, boostDM score, missense/nonsense driver mutations with high and low confidence tiers, Pfam domains, and feature annotations.
  • Clustered blueprints. Hierarchical clustering representation of several in silico saturation mutagenesis predictions across the collection of tumor types for which we can cast a prediction with the same model tumor type as features tumor type, i.e., there is a high-quality model for the gene in that tumor type context and this model is used to cast predictions feeding on the feature annotations calculated from the same tumor type context.
  • Discovery index. Scatter plot and best inverse exponential fitting curve representing the subsampling experiment used in the calculation of the discovery index. The gray dots represent the number of unique mutations (e.g., KRAS mutations in the example shown) found in random subsets of tumors taken from the whole tumor type cohort. The bending of the inverse exponential fit indicates how close to saturation the mutation catalog is: the more the curve is bent, the less likely new mutations are to be discovered as more tumors are sequenced (a sketch of such a fit is shown below).
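
For illustration, a minimal sketch of an inverse exponential saturation fit of the assumed form y = a * (1 - exp(-b * x)) on hypothetical subsampling data:

import numpy as np
from scipy.optimize import curve_fit

def inverse_exponential(x, a, b):
    return a * (1.0 - np.exp(-b * x))

# Hypothetical subsampling data: number of tumors vs number of unique mutations found
n_tumors = np.array([10, 25, 50, 100, 200, 400])
n_unique_mutations = np.array([12, 25, 41, 60, 75, 83])

params, _ = curve_fit(inverse_exponential, n_tumors, n_unique_mutations, p0=(100.0, 0.01))
# fitted (a, b); a more bent curve indicates a catalog closer to saturation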

Benchmarks

Given its importance for post-hoc evaluation, the boostDM pipeline now includes an expanded version of the benchmarks accompanying the first release of the method (https://doi.org/10.1038/s41586-021-03771-1). We have included both experimental saturation mutagenesis datasets and bioinformatic scores retrieved through dbNSFP. Please have a look at the full docs.

Summary of the run

The current release of BoostDM feeds on mutations observed in a total of 4,478 driver gene and cohort combinations, comprising 633 driver genes and 260 cohorts covering a total of 87 tumor types according to our oncotree ontology. Of all observed mutations mapping to the region of interest, we only consider as positive examples those matching a consequence type with a dN/dS excess in the cohort higher than 85%. There is a strong dichotomy between cohorts with high and low excess in the nonsense and splicing consequence type contexts, whereas the distribution of missense excess is more gradual.

dN/dS excess per consequence type across driver genes and cohorts. Among the driver gene and cohort combinations, 53% have a high excess (higher than 85%) in the missense context, 51% in nonsense and 23% in splicing.

For each driver gene and tumor type combination we computed the discovery index, a score that measures how unlikely new mutations are to be found if new samples of the same tumor type were to be sequenced. The number of observed mutations in a gene and tumor type context, alongside the discovery index, gives us an indication of how representative our catalog of mutations is for deriving conclusions about recurrence and driverness.


Number of observed mutations and discovery index across all possible gene and tumor type combinations.

Using the catalog of observed mutations matching a high dN/dS excess per consequence type, together with a catalog of synthetic mutations generated with the 96-trinucleotide-context mutational profiles reflecting neutral mutagenesis, we then conduct our supervised learning approach to train boostDM models.

For this release, we trained 1,451 models for distinct driver gene and tumor type combinations, of which 643 are models at the finest level of tumor type specificity that we can afford with the cancer type annotation of our input cohorts of tumors. Applying our model selection criteria based on cross-validation (CV), the discovery index and the number of mutations observed, only 736 of the models trained meet the minimum qualification in terms of inductive bias and representativeness of the training set (hereinafter “high-quality models”). There are 359 high-quality models with the highest level of tumor type specificity (hereinafter “specific models”).
In terms of driver gene coverage, the high-quality models span 115 genes, with 94 represented in specific models and 21 additional genes represented by more general models (Figure 3B). In terms of tumor type coverage, the high-quality models span 89 tumor types, of which 58 are tumor types at the highest level of specificity (Figure 3C).

A. Count of boostDM models, specific models and high-quality models.
B. Count of genes covered across all models, specific models and high-quality models.
C. Count of tumor types across all models, specific models and high-quality models.

Models across tumor types. Graphical representation of the oncotree hierarchy: nodes are labeled with the tumor type acronyms and colored according to the number of genes with a high-quality model for that particular tumor type context.

Cross-validation (CV) performance plays an important role in our model selection criteria. As in the previous release, the CV performance was reported with the so-called F-score50 (F-beta score with beta=0.5, thus giving more weight to precision). CV performances showed a remarkable concordance with the discovery index.
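
For reference, a minimal sketch of how the F-score50 can be computed with scikit-learn on hypothetical labels and predictions:

from sklearn.metrics import fbeta_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # hypothetical driver/nondriver labels
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]   # hypothetical binary predictions
fscore50 = fbeta_score(y_true, y_pred, beta=0.5)  # beta=0.5 gives more weight to precision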

Discovery index vs CV performance across specific models.

A few high-quality models are an exception to this general trend, whereby low values of the discovery index match a high CV performance. The analysis of SHAP feature explanations shows that these models are predominantly driven by the consequence type of the mutations: these are very simple models that accurately discriminate between positive and negative examples for which the set of observed mutations is far from saturation. These models have learned very simple although effective classification rules (e.g. any truncating mutation being a driver) in a context where we still expect to find many more new mutations as more tumors get sequenced, whence the low discovery index.

Discovery index vs CV performance across specific models, highlighting a group of models with high CV performance despite the low discovery index value.

Local feature explanations to the driver predictions corresponding to the catalog of unique variants observed in the regions of interest in ARID1A (BRCA), KMT2C (CESC), and NSD1 (HNSC) according to the Intogen 2024 mutational catalog. Each row in the heatmap represents a variant with colors representing SHAP values across the features (grouped in four types: clustering, conservation across vertebrates, consequence

Only 30 specific models above the minimum CV performance threshold are rejected because of their low number of observed mutations given their moderate/low discovery index value.

Low-quality models. Left: All specific models being rejected by our model selection criteria, where only 30 of them are above the CV performance threshold. Right: Projection of the 30 low-quality models with a CV performance F-score50 higher than 0.8. Red dashed lines represent the rectangular rejection boundary of our model selection criteria.

Comparison with previous run

In this section we denote the previous release of boostDM as v2021 and the current release as v2024.

New gene models

The current release v2024 includes models for 39 genes that previously did not have any high-quality models. The complete list of new genes is the following:

ARHGAP35, ATF7IP, BAP1, BCL7A, BCL9L, BCOR, BRCA1, BRCA2, BRD7, CCND3, CDKN1A, CNOT9, FGFR1, FGFR2, FOXO1, FUBP1, GNAS, ID3, MAPK1, MAX, MGA, MYCN, NF2, NFKBIA, NOTCH2, POU2F2, PTCH1, RASA1, RHPN2, RNF43, RXRA, SMO, SOX9, SPEN, STAT6, TSC2, USP6, WT1, XPO1

The user may note the inclusion of H3-3A among the genes with models. This gene was named H3F3A in previous releases; thus the gene H3F3A has not been dropped, nor is H3-3A a new inclusion among the high-quality models.

Snapshots corresponding to gene models for genes that were not previously covered by high-quality models (panels A-F below). Local feature explanations of the driver predictions corresponding to the catalog of unique variants observed in the MANE Select regions of interest.

A. BRCA1 (CANCER) is a model driven by nonsense consequence type.

B. BCL7A (NHL) is a model driven by the enrichment of variants in the Pfam domain BCL_N.

C. BAP1 (CCRCC) casts a more nuanced signal explained by a mix of consequence type, PhyloP and enrichment in the Pfam domains Peptidase_C12 and UCH_C.

D. FOXO1 (NHL) is a model driven by PhyloP.

E. NOTCH2 (CANCER) is dominated by NMD skipping nonsense mutations, but there are also a few driver mutations explained by linear clustering and splicing consequence type.

F. MAPK1 (CANCER) is driven by 3D clustering.

Downgraded gene models

Nine genes that had at least one high-quality model in boostDM v2021 do not have a high-quality model anymore. The complete list of downgraded genes is the following:

ATRX, CHD4, CMTR2, DDX3X, IRF4, PIK3R1, PPP6C, RPS6KA3, RUNX1

All the gene models except RUNX1 dropped because their CV performance fell below the required minimum mean F-score50 of 0.8. In the case of RUNX1 (Acute Myeloid Leukemia), the model did not fulfill the criteria regarding the number of observed mutations and the discovery index.

Downgraded genes. The nine genes that had at least one high-quality model in boostDM v2021 but do not have a high-quality model anymore in v2024.

Comparison set

Because the oncotree tumor type ontology that boostDM relies on has changed since v2021, a proper comparison requires defining a subset of gene and tumor type combinations for which we can find equivalences across releases and for which high-quality models were implemented in the previous release. Concretely, the pool of gene and tumor type combinations for this analysis is defined as follows:

  • We selected gene and tumor type combinations for which there was a high-quality model and prediction in release v2021.
  • From those, we selected the ones with an equivalent tumor type in the current version of the oncotree.
  • Finally, we discarded all models that are not specific and high-quality in the current version.
Applying these criteria, we ended up with a collection of 178 gene and tumor type combinations, which we refer to as our comparison set, spanning 68 genes and 38 tumor types.

The genes comprised are the following in alphabetical order: AKT1, ALK, AMER1, APC, AR, ARID1A, ARID2, AXIN1, BCLAF1, BRAF, CD79B, CDH1, CDK12, CDKN2A, CIC, CREBBP, CTCF, CTNNB1, EGFR, EP300, ERBB3, ERCC2, ESR1, EZH2, FAT1, FBXW7, FGFR3, FLT3, FOXA1, GNA11, GNAQ, GTF2I, HRAS, IDH1, IDH2, KDM6A, KMT2C, KMT2D, KRAS, MAP3K1, MEF2B, MYD88, NCOR1, NF1, NFE2L2, NOTCH1, NRAS, NSD1, PBRM1, PCBP1, PIK3CA, PPP2R1A, PTEN, PTPN11, RAC1, RB1, RBM10, SETD2, SF3B1, SMAD4, SMARCA4, SPOP, STAG2, STK11, TP53, TSC1, U2AF1, VHL.

Key metrics
Discovery index

The discovery index turns out to be stable between releases, with some instances where it has increased significantly, most prominently in the LNM cohort, with genes like TP53, CREBBP and HLA-B showing a sharp increase. Although the discovery index is not guaranteed to always increase (the cohorts of tumors recruited may not be homogeneous despite matching the same tumor type, e.g. due to differences in tumorigenic selection across ancestral populations or due to differential environmental exposures), the fact that the discovery index does not drop significantly in any case attests to the robustness of the methodology.

Discovery index across the comparison set for v2021 and v2024. Marked are the gene tumor type models that have a discrepancy between releases higher than 0.1 in the discovery index scale.

Observed mutation count

The comparison of observed mutations reflects the addition of new mutational cohorts in Intogen v2024, which comprises more than 4,000 more samples than the v2021 release, leading to a general increase across all models. However, owing to changes in how mutations are filtered and mapped to the genic region of interest based on MANE Select, there are a few cases where the number of mutations has slightly decreased.

Observed mutation counts across the comparison set for v2021 and v2024. Marked are the gene tumor type models that have a 50% or higher relative increment.

Cross-validation (CV) performance

We conducted a comparison of the CV performance between both releases: while the average performance is quite stable across models, there are a few outliers showcasing differences in the modeling outcome, including a few cases for which the performance decreased. As in our original study, we report the CV performance with the F-score50, which gives precedence to precision over recall.

CV performance for v2021 and v2024.

To understand this outcome we must take into consideration that the CV performance can reflect the new models learning more nuanced patterns at training as the mutational catalog gets richer. In particular, larger mutation catalogs come with an increased diversity of passenger mutations forming part of the positive set, which constitutes an unavoidable form of data contamination in our methodology. In other words, because our supervised learning strategy uses fuzzy labels based on high dN/dS excess, even perfect models are not expected to reach perfect performance.

For example, in the cases of SMARCA4 (LUNG) and EP300 (BLCA) we saw that the new models learned more nuanced patterns in the current release. Even though we can see that in these two cases the forecast of the model has not changed, the differences in terms of SHAP values underscore differences in the training outcome across base models. This can be best illustrated by comparing the blueprints and the SHAP feature explanations corresponding to the predictions of observed mutations.

Comparison of SMARCA4 (LUNG) models from v2024 and v2021. Top: Needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in v2024 (top) and v2021 (bottom). The internal plots represent (top) the distribution of driver mutation along the protein sequence and (bottom) the values of mutational features used to train both versions of models. The tracks are colored pink for v2024, blue for 2021, and purple if they represent the overlap of the values of the features in both models. Bottom: comparison of SHAP explanations cast by models v2024 vs v2021 across the set of unique variants observed in v2024. Top horizontal bar represents v2024 (red) and v2021 (pink).
Comparison of EP300 (BLCA) models from v2024 and v2021. Top: Needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in v2024 (top) and v2021 (bottom). The internal plots represent (top) the distribution of driver mutation along the protein sequence and (bottom) the values of mutational features used to train both versions of models. The tracks are colored pink for v2024, blue for 2021, and purple if they represent the overlap of the values of the features in both models. Bottom: Comparison of SHAP explanations cast by models v2024 vs v2021 across the set of unique variants observed in v2024. Top horizontal bar represents v2024 (red) and v2021 (pink).

With all this in mind, we also wanted to understand the extent to which models differ in terms of concordance across comparable genomic sites, regardless of whether there are observed mutations mapping to those sites (in silico saturation mutagenesis). To do this, we first mapped the results of the in silico saturation mutagenesis from both releases onto the overlap of the regions of interest and computed the concordance as the Matthews Correlation Coefficient (MCC) between the vectors of binary predictions (blueprints), considering a mutation a likely driver if its boostDM score is higher than 0.5. We also computed the proportion of sites where drivers thus defined were gained or lost.
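
For illustration, a minimal sketch of this concordance computation, assuming the scores of both releases have already been joined on the overlapping sites:

import numpy as np
from sklearn.metrics import matthews_corrcoef

def blueprint_concordance(scores_v2021, scores_v2024, cutoff=0.5):
    drivers_v2021 = np.asarray(scores_v2021) > cutoff
    drivers_v2024 = np.asarray(scores_v2024) > cutoff
    mcc = matthews_corrcoef(drivers_v2021, drivers_v2024)
    gained = np.mean(~drivers_v2021 & drivers_v2024)  # driver in v2024 only
    lost = np.mean(drivers_v2021 & ~drivers_v2024)    # driver in v2021 only
    return mcc, gained, lost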

Concordance of blueprints across the gene and tumor type combinations of the comparison set (MCC score) versus the proportion of drivers gained.
Concordance of blueprints across the gene and tumor type combinations of the comparison set (MCC score) versus the proportion of drivers lost.

We observe an inverse relationship between the amount of drivers gained or lost and the concordance, which is expected. However, the proportion of drivers gained does not seem to influence the concordance as much as the proportion of drivers lost.

We also asked about the relationship between the concordance (or lack thereof) and the CV performance of the models in the current release. Interestingly, some models show very little concordance despite the CV performance in v2024 being very high.

Concordance of blueprints across the gene and tumor type combinations of the comparison set (MCC score) versus the CV performance of the v2024 models.

An interesting example is the ALK (NBL) model, where we can see that the former model deemed any non-synonymous mutation in the intracellular kinase domain a driver, whereas the current model casts a much more specific prediction based on the clustering signals of positive selection (see figure in the examples below).

Examples

For illustration, we provide a few more examples of gene and tumor type models that have changed appreciably, portraying interesting differences in terms of observed mutations, mutational features, blueprints, and feature explanations.

A. Comparison of VHL (RCC) models from v2024 and v2021. Left: Needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in v2024 (top) and v2021 (bottom). The internal plots represent (top) the distribution of driver mutation along the protein sequence and (bottom) the values of mutational features used to train both versions of models. The tracks are colored pink for v2024, blue for 2021, and purple if they represent the overlap of the values of the features in both models. Right: comparison of SHAP explanations cast by models v2024 vs v2021 across the set of unique variants observed in v2024. Top horizontal bar represents v2024 (red) and v2021 (pink)
B. Comparison of FOXA1 (PRAD) models from v2024 and v2021. Left: Needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in v2024 (top) and v2021 (bottom). The internal plots represent (top) the distribution of driver mutation along the protein sequence and (bottom) the values of mutational features used to train both versions of models. The tracks are colored pink for v2024, blue for 2021, and purple if they represent the overlap of the values of the features in both models. Right: Comparison of SHAP explanations cast by models v2024 vs v2021 across the set of unique variants observed in v2024. Top horizontal bar represents v2024 (red) and v2021 (pink).
C. Comparison of ALK (NBL) models from v2024 and v2021. Left: Needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in v2024 (top) and v2021 (bottom). The internal plots represent (top) the distribution of driver mutation along the protein sequence and (bottom) the values of mutational features used to train both versions of models. The tracks are colored pink for v2024, blue for 2021, and purple if they represent the overlap of the values of the features in both models. Right: comparison of SHAP explanations cast by models v2024 vs v2021 across the set of unique variants observed in v2024. Top horizontal bar represents v2024 (red) and v2021 (pink).
D. Comparison of GNAQ (UM) models from v2024 and v2021. Left: Needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in v2024 (top) and v2021 (bottom). The internal plots represent (top) the distribution of driver mutation along the protein sequence and (bottom) the values of mutational features used to train both versions of models. The tracks are colored pink for v2024, blue for 2021, and purple if they represent the overlap of the values of the features in both models. Right: Comparison of SHAP explanations cast by models v2024 vs v2021 across the set of unique variants observed in v2024. Top horizontal bar represents v2024 (red) and v2021 (pink).
E. Comparison of EZH2 (LNM) models from v2024 and v2021. Left: Needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in v2024 (top) and v2021 (bottom). The internal plots represent (top) the distribution of driver mutation along the protein sequence and (bottom) the values of mutational features used to train both versions of models. The tracks are colored pink for v2024, blue for 2021, and purple if they represent the overlap of the values of the features in both models. Right: comparison of SHAP explanations cast by models v2024 vs v2021 across the set of unique variants observed in v2024. Top horizontal bar represents v2024 (red) and v2021 (pink).

Coverage of variants

Considering all the models and in silico saturation mutagenesis predictions produced in the current release, we asked what proportion of observed single-nucleotide variants has an annotation as either pathogenic/oncogenic or non-pathogenic/non-oncogenic in well-known databases of clinical annotations, and compared this with the proportion attained by boostDM models. For this exercise we used ClinVar (version from 2024-03-31) and OncoKB (obtained using API version v1.4.1 and data version v4.19) as reference datasets of clinical annotations. We conducted this analysis using observed mutations from Intogen (v2024) and from the GENIE cohort (v15.1).

Comparison of the proportion of variants annotated by ClinVar, OncoKB and boostDM models: proportion of observed mutations with a clinical annotation in ClinVar and OncoKB for the Intogen and GENIE cohorts. Left: we included 97,942 variants from Intogen v2024, only SNVs in driver genes defined by Intogen. Right: we included 262,010 variants from the GENIE cohort, only SNVs in driver genes defined by Intogen.

Benchmark

In view of the significant fixes and updates in the making of boostDM models, we conducted benchmarking analyses similar to those accompanying the first release of the method (doi.org/10.1038/s41586-021-03771-1). In our analyses we included additional experimental saturation mutagenesis datasets and new bioinformatic scores retrieved through dbNSFP. Some of these datasets and scores have only recently been made available, adding value to this benchmarking exercise.

Datasets

Experimental saturation mutagenesis datasets

We collected experimental saturation mutagenesis assays in which the fitness and/or functional impact of mutations in certain human genes has been put to the test in highly multiplexed experimental setups where many variants are assayed at once.

Bioinformatic scores

We have also taken advantage of a collection of bioinformatic scores that aim to measure the pathogenicity or functional impact of variants, for comparison against boostDM. Although most of these tools were not designed specifically to measure the likelihood of a variant being under positive selection in the context of somatic evolution, all these scores are relevant and have been used in cancer genomics-related studies and methods.

Methodology

Whenever a new classification score of variants is proposed, it is convenient to evaluate how the new method compares with other scoring methods on an evaluation set that reflects the ground truth and guarantees that none of the methods has an artificially inflated performance. Typically one would like to compare against methods intended to solve the same problem; however, most available scores do not directly address the identification of mutations under positive selection in somatic evolution, and even the experimental saturation mutagenesis assays that assess the effects of mutations in experimental systems (cell lines or mice) do not necessarily test their oncogenic effect in the human organism. This is why we compare the performance of boostDM models, experimental saturation mutagenesis, and bioinformatic tools against ground truth labels built upon observed and synthetic mutations.

Here we take as ground truth driver mutations those observed in cancer genes across human tumors that have a consequence type with high enrichment in driver mutations, as asserted from dN/dS analysis; note that this set might still include some non-driver mutations. On the other hand, synthetic mutations generated following the mutational profile of neutral mutagenesis across samples of the same tumor type constitute a good approximation of the set of passengers. To build a good evaluation set, we considered mutations in the test sets held out from training across boostDM base classifiers as drivers (observed in a high dN/dS consequence type context) and nondrivers (synthetic mutations generated in accordance with the background mutational profile).

The evaluation pool is built by taking all the mutations from the testing splits of each base classifier, each annotated with its mutational features, the boostDM score cast by the base classifier and the driver/nondriver ground truth label. By using held-out testing mutations and base classifiers as predictors, we reduce leakage at prediction time for the sake of an unbiased evaluation.

On the one hand, this evaluation strategy is still expected to bear some information leakage, because we do not completely rule out the possibility that a mutation used at training is also present in the corresponding testing split; we do enforce, however, that if this happens the mutation has a unique instance in the test set, i.e. we drop duplicates in the test set. On the other hand, our evaluation strategy makes use of the suboptimal predictions cast at the base classifier level (before aggregation of the base classifiers into a consensus model), which plays against performance inflation.

We mapped all the scores to be tested onto the evaluation set and assessed the performance of each against the ground truth labels by computing precision-recall curves, since none of the scores admits a clear-cut dichotomous interpretation, as they are given as continuous numerical values.
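
For illustration, a minimal sketch of this evaluation, assuming the evaluation pool is a pandas DataFrame with a binary label column and one column per continuous score (column names are hypothetical):

import pandas as pd
from sklearn.metrics import precision_recall_curve, auc

def pr_auc_per_score(evaluation_pool: pd.DataFrame, label_col="is_driver", score_cols=("boostDM_score",)):
    results = {}
    for col in score_cols:
        subset = evaluation_pool.dropna(subset=[col])  # some scores are not defined at every site
        precision, recall, _ = precision_recall_curve(subset[label_col], subset[col])
        results[col] = auc(recall, precision)
    return results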

Results

Experimental saturation mutagenesis

The comparison of boostDM against experimental saturation mutagenesis assays in TP53, PTEN and Ras genes was robust in general, with only a few instances where the number of mutations in the comparison set was deemed insufficient to draw any conclusions. We conducted the benchmarking exercise across the high-quality models of TP53, PTEN and Ras genes (HRAS, KRAS and NRAS). Note that some experimental saturation mutagenesis datasets cover only a subset of the mutations considered in the region of interest of boostDM (e.g. restricted to a domain or to missense mutations), hence we carried out the evaluation only on the intersection with the evaluation set, making sure to keep the driver and nondriver labels balanced by randomly pruning the evaluation set if necessary.

In general, boostDM performs similarly or better than the experimental saturation mutagenesis scores at the task of discriminating between driver and nondriver mutations, which is in accordance with the results obtained with the previous release v2021.

Because our methodology uses different testing sets for each gene and tumor type combination, the evaluation of experimental saturation mutagenesis assays might change depending on the tumor type. Even if the score yielded by the assay and used throughout the testing is the same across models for a given gene, the fact that the observed and randomized mutations in the test set are not the same implies that there will be some differences. Whether these differences are biologically meaningful is something we will need to look into in the future.

Precision-recall curves. First: boostDM score for TP53 (BRCA) and TP53 experimental saturation mutagenesis assays. Second: boostDM score for PTEN (UCEC) and PTEN experimental saturation mutagenesis assays. Third: boostDM score for KRAS (COADREAD) and Ras experimental saturation mutagenesis assays.

Heatmap representation of the area under the precision-recall curves of boostDM models and experimental saturation mutagenesis assays, across assays and tumor types. First: TP53. Second: PTEN. Third: Ras genes HRAS, KRAS and NRAS. Blank squares mean that the intersection between the evaluation set and the sites where the score is defined did not yield sufficiently many mutations to draw a robust metric.

Bioinformatic scores

The evaluation of bioinformatic scores allowed us to compare the boostDM performance across a rich set of gene tumor type models, potentially the 736 gene tumor type combinations corresponding to high-quality models according to our model selection criteria. For a high proportion of gene and tumor type models and scores the comparisons were robust enough, with some exceptions where the number of mutations was insufficient to draw any conclusions.

Precision-recall curves. Precision-recall curves corresponding to TP53 (BRCA) boostDM score (base classifiers) and bioinformatic scores of mutations in TP53. In the panels from left to right: classical pathogenicity scores and metascores; recently proposed pathogenicity scores derived from generative deep learning approaches; CHASMplus cancer-specific pathogenicity score (grouped by amino acid change with mean and max functions, respectively).

We divided the set of bioinformatic scores into three broad classes: classical pathogenicity scores and meta-scores thereof (CADD, FATHMM, MetaLR, MetaRNN, MutationAssessor, PROVEAN, Polyphen2, REVEL, SIFT, VEST4); recently proposed pathogenicity scores derived from deep generative models (EVE, AlphaMissense, ESM1b); and CHASMplus, a cancer-specific pathogenicity score intended to address the same classification task as boostDM.

In general, boostDM performs similarly or better than the rest of bioinformatic scores, with few exceptions on a gene and tumor type specific basis. Among the bioinformatic scores analyzed, CHASMplus was the one that performed closest to boostDM’s performance for a number of gene and tumor type combinations.

Heatmaps representing the area under the precision-recall curves corresponding to an array of bioinformatic scores for TP53, PTEN, PIK3CA and EGFR across tumor types at the highest level of specificity where boostDM has produced high-quality models according to our model selection criteria.

Head to head comparisons of the area under the precision-recall curve of CHASMplus (top), MetaRNN (middle) and AlphaMissense (bottom). Each point represents a tumor type. Points being below (resp. above) the diagonal implies that the performance of boostDM is better (resp. worse) in the test set.