We have developed a novel information-theory-based multi-scale clustering algorithm for identifying variable-length
length mutational hotspots within cancer genes. We ran our algorithm on the combined mutation data from 23 tumor
types from The Cancer Genome Atlas (TCGA). We found a diverse set of clusters with wide variability in size and
mutation count. Additionally, we integrated our mutation clusters with gene expression data from TCGA to
associate clusters with global changes in gene expression and specific molecular pathways. Our findings allow us
to identify mutation clusters that are associated with changes in gene expression phenotype. Additionally, we have
used our pathway association analysis to identify multiple clusters within individual genes that have
differential associations, specifically PTEN, FUBP1, and BRCA. These cases may be indicative of differential
functional consequences of genetic mutations within different regions of the same gene.
Original data tables are available here.

Our algorithm identifies mutation clusters at multiple scales. Each scale represents genetic features of a different size. First, our algorithm converts TCGA mutation calls from all 23 cancers into multiple continuous probability density functions (A). This smoothing is done using a kernel density estimate (KDE) with a Gaussian kernel at 28 different bandwidths between 2 and 450 (in amino acid units). Each bandwidth represents a different length scale of amino acid features, ranging from single amino acids to entire protein domains (B). Each KDE is used to seed a multivariate mixture model consisting of n Gaussians and one uniform (noise) distribution, where n is the number of local maxima in the given KDE. The weight of the noise component is initially estimated by the fraction of silent mutations in the gene. The mean of each Gaussian is initially estimated by the location of a local maximum of the KDE. The standard deviation of each Gaussian is estimated by the distance between the two adjacent local minima around a given maximum. Finally, the weight of each Gaussian in the mixture model is estimated by the density at the local maximum minus one-nth of the noise weight. An expectation maximization algorithm then optimizes the mixture model (C, blue bars). This process results in a set of clusters for each scale.
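
The seeding and optimization steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production implementation: the function names (`gaussian_kde`, `seed_mixture`, `fit_em`, `clusters_by_scale`), the toy mutation positions, the two example bandwidths, and the noise weight of 0.05 are all hypothetical. Clipping negative seed weights, rescaling the seed weights so that the mixture sums to one, flooring the standard deviation, and running a fixed number of EM iterations are simplifications introduced for the sketch.

```python
import math

def gaussian_kde(positions, bandwidth, length):
    """Gaussian KDE evaluated at residues 1..length (returned as a 0-based list)."""
    norm = 1.0 / (len(positions) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)
            for x in range(1, length + 1)]

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def seed_mixture(positions, bandwidth, length, noise_weight):
    """Initial (means, sds, weights) taken from the extrema of one KDE."""
    d = gaussian_kde(positions, bandwidth, length)
    pad = [-math.inf] + d + [-math.inf]          # lets sequence ends count as maxima
    maxima = [i for i in range(length) if pad[i] < pad[i + 1] >= pad[i + 2]]
    minima = [i for i in range(1, length - 1) if d[i - 1] > d[i] <= d[i + 1]]
    means, sds, raw = [], [], []
    for m in maxima:
        # Nearest minimum on each side; fall back to the sequence ends.
        left = max((i for i in minima if i < m), default=0)
        right = min((i for i in minima if i > m), default=length - 1)
        means.append(float(m + 1))               # back to 1-based residue numbering
        sds.append(float(max(right - left, 1)))  # distance between adjacent minima
        # Peak density minus one-nth of the noise weight; clipping is an assumption.
        raw.append(max(d[m] - noise_weight / len(maxima), 1e-6))
    scale = (1.0 - noise_weight) / sum(raw)      # rescaling is an assumption
    return means, sds, [r * scale for r in raw]

def fit_em(positions, length, means, sds, weights, noise_weight,
           n_iter=200, min_sd=0.5):
    """EM for n Gaussians plus one uniform noise component over 1..length."""
    means, sds, weights = list(means), list(sds), list(weights)
    n = len(positions)
    for _ in range(n_iter):
        # E-step: responsibility of every component (noise last) for every mutation.
        resp = []
        for x in positions:
            p = [w * normal_pdf(x, mu, sd) for w, mu, sd in zip(weights, means, sds)]
            p.append(noise_weight / length)
            total = sum(p) or 1e-300             # guard against numerical underflow
            resp.append([v / total for v in p])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(len(means)):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            if nk > 1e-12:
                means[k] = sum(r[k] * x for r, x in zip(resp, positions)) / nk
                var = sum(r[k] * (x - means[k]) ** 2
                          for r, x in zip(resp, positions)) / nk
                sds[k] = max(math.sqrt(var), min_sd)  # floor avoids collapse
        noise_weight = sum(r[-1] for r in resp) / n
    return means, sds, weights, noise_weight

def clusters_by_scale(positions, length, bandwidths, noise_weight):
    """Seed and fit one mixture model per bandwidth (one cluster set per scale)."""
    out = {}
    for bw in bandwidths:
        seeds = seed_mixture(positions, bw, length, noise_weight)
        out[bw] = fit_em(positions, length, *seeds, noise_weight)
    return out

# Toy gene of 300 residues: two hotspots (near 50 and 200) plus three stray mutations.
positions = [48, 49, 50, 50, 50, 51, 52,
             190, 195, 198, 200, 200, 202, 205, 210,
             10, 120, 280]
result = clusters_by_scale(positions, length=300, bandwidths=[5, 50], noise_weight=0.05)
```

In this sketch, `result[bw]` holds the fitted means, standard deviations, Gaussian weights, and noise weight for bandwidth `bw`. A narrow bandwidth seeds one Gaussian per small peak, while a wide bandwidth merges nearby peaks into fewer, broader seeds, which is the intent behind fitting one model per scale.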
