Search
Close this search box.

An improved epigenetic counter to track mitotic age in normal and precancerous tissues – Nature Communications

All data analyzed in this manuscript has already been published in the respective publications, as described below and in the “Data availability” section. As such, this research complies with all ethical regulations.

Illumina DNAm datasets used in the construction of stemTOC

Fetal tissue DNAm sets

We obtained and normalized Illumina 450k data from the Stem-Cell Matrix Compendium-2 (SCM2)87, as described by us previously16. There were a total of 37 fetal tissue samples encompassing 10 tissue-types (stomach, heart, tongue, kidney, liver, brain, thymus, spleen, lung, adrenal gland). We also obtained Illumina 450k data of 15 cord-blood samples88 and 34 fetal-tissue samples from Slieker et al.89 encompassing amnion, muscle, adrenal, and pancreas. Both sets were normalized like the SCM2 data, i.e. by processing idat files with minfi90 followed by type-2 probe bias adjustment with BMIQ91.

Cell-line data from Endicott et al.24

The Illumina EPICv1 cell-line data was downloaded from GEO under accession number GSE197512. Raw idat files were processed with minfi and BMIQ normalized, resulting in a normalized DNAm data matrix defined over 843,298 probes and 182 “baseline-profiling” samples. A total of 6 human cell lines were used, including AG06561 (skin fibroblast), AG11182 (veil endothelial), AG11546 (iliac vein smooth muscle), AG16146 (skin fibroblast), AG21839 (neonatal foreskin fibroblast), and AG21859 (foreskin fibroblast). One cell line (AG21837, skin keratinocyte), which displayed global non-monotonic DNAm patterns with population doublings (PDs) was removed.

Whole-blood DNAm datasets

We used 3 Illumina whole-blood datasets. One EPIC dataset encompassing 710 samples from Han Chinese was processed and normalized as described by us previously92. Briefly, idat files were processed with minfi, followed by BMIQ type-2 probe adjustment, and due to the presence of beadchip effects, data was further normalized with ComBat93. Another dataset is an Illumina 450k set of 656 samples from Hannum et al.94. The normalization of this dataset is as described by us previously92. Finally, we also analyzed a 450k dataset from Johansson et al.95. This dataset was obtained from NCBI GEO website under accession number GSE87571. The file “GSE87571_RAW.tar” containing the IDAT files was downloaded and processed with minfi R-package. Probes with P-values <0.05 across all samples were kept. Filtered data was subsequently normalized with BMIQ, resulting in a normalized data matrix for 475,069 probes across 732 samples.

Independent buccal swab and cord blood Illumina DNAm datasets

We analyzed two independent EPIC DNAm datasets to assess the robustness and stability of the 371 stemTOC CpGs. One dataset consists of 44 buccal swabs from infants96 and the other of 128 cord blood samples from neonates97. Briefly, idat files were downloaded from GEO under accession numbers GSE229463 and GSE195595, respectively, and subsequently processed with minfi90 and BMIQ91 as described for the other datasets.

Illumina DNAm cancer datasets

The SeSAMe98 processed Illumina 450k beta value matrices were downloaded from Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/) with TCGAbiolinks99. We analyzed 32 cancer types (ACC, BLCA, BRCA, CESC, CHOL, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, PAAD, PCGP, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS, UVM). We did further processing of the downloaded matrices as follows: For each cancer type, probes with missing values in more than 30% samples were removed. The missing values were then imputed with impute.knn (k = 5)100. Then, the type-2 probe bias was adjusted with BMIQ91. Technical replicates were removed by retaining the sample with the highest CpG coverage. Clinical information on TCGA samples was downloaded from Liu et al.101. In addition, we analyzed an Illumina 450k DNAm dataset from Pipinikas et al.102, consisting of 45 primary pancreatic neuroendocrine tumors (PNETs). Processing of the idat files and QC was performed with minfi90, impute, and BMIQ as described for the TCGA datasets.

Illumina DNAm dataset from eGTEX

The ChAMP103 processed Illumina EPIC beta-valued data matrix was downloaded from GEO (GSE213478)45. The dataset includes 754,119 probes and 987 samples from 9 normal tissue-types: breast mammary tissue (n = 52), colon transverse (n = 224), kidney cortex (n = 50), lung (n = 223), skeletal muscle (n = 47), ovary (n = 164), prostate (n = 123), testis (n = 50), and whole blood (n = 54) derived from 424 GTEX subjects.

Illumina DNAm datasets of sorted immune-cell types

We analyzed Illumina 450k DNAm data from BLUEPRINT42, encompassing 139 monocyte, 139 CD4+ T-cell, and 139 neutrophil samples from 139 subjects. This dataset was processed as described by us previously104. In addition, we analyzed Illumina 450k DNAm data from GEO (GSE56046) encompassing 1202 monocyte and 214 naïve CD4+ T-cell samples from Reynolds et al.105, and GEO (GSE56581) encompassing 98 CD8+ T-cell samples from Tserel et al.106. Data was normalized as described by us previously107. We also analyzed Illumina 450k DNAm data from EGA (EGA: EGAS00001001598) encompassing 100 B cell, 98 CD4+ T cell, and 104 monocyte samples from Paul et al.43. Data was normalized as described by us previously43.

Other Illumina whole-blood DNAm datasets

We analyzed a large collection of 18 whole-blood cohorts encompassing a total of 14,515 samples. This is a subset of the over 20,000 samples we previously analyzed41, consisting mostly of healthy samples. Illumina DNAm data was normalized as described by us previously41. Briefly, the 18 whole-blood cohorts chosen here were: LiuMS(n = 279): The 450k dataset from Kular et al.108 was obtained from the NCBI GEO website under the accession number GSE106648.

Song (n = 2052): The EPIC dataset from Song et al.109 was obtained from the NCBI GEO website under the accession number GSE169156.

HPT-EPIC (n = 1394) & HPT-450k (n = 418): These datasets110 were obtained from the NCBI GEO websites under the accession numbers GSE210255 and GSE210254.

Barturen (n = 5740): The EPIC dataset from Barturen et al.111 was obtained from GEO under accession number GSE179325.

Airwave (n = 1032): The EPIC dataset from the Airwave study112 was obtained from GEO under accession number GSE147740.

VACS (n = 529): The 450k dataset from Zhang et al.113 was obtained from GEO under accession number GSE117860.

Ventham (n = 380): The 450k dataset from Ventham et al.114 was obtained from NCBI GEO website under accession number GSE87648.

Hannon−1 and 2 (n = 636 and 665): The 450k datasets from Hannon et al.115,116 were obtained from NCBI GEO websites under accession numbers GSE80417 and GSE84727.

Zannas (n = 422): This 450k dataset117 was obtained from GEO under accession number GSE72680

Flanagan/FBS (n = 184): The 450k dataset Flanagan et al.118 was obtained from NCBI GEO under the accession number GSE61151.

Johansson (n = 729): The 450k dataset from Johansson et al.95 was obtained from GEO under accession number GSE87571.

Lehne (n = 2707): This 450k DNAm dataset consists of peripheral blood samples119, and we used the already QC-processed and normalized version previously described by Voisin et al.120.

TZH(n = 705), Hannum (n = 656), LiuRA (n = 689), Tsaprouni (n = 464): The TZH (EPIC)92, Hannum (450k)94, LiuRA (450k)121, Tsaprouni (450k)122 were downloaded and normalized as described by us previously92.

Illumina DNAm atlas from Moss et al.

We downloaded Illumina 450k & EPIC DNAm data (idat files)44 from GEO under accession number GSE122126. We processed the 450k and EPIC data separately with minfi, removing low-quality controls (detection P-value <0.05) and further normalized the data with BMIQ, resulting in a merged dataset defined over 449,156 probes and 28 samples. These 28 samples included sorted pancreatic beta cells (n = 4), pancreatic ductal (n = 3), pancreatic acinar (n = 3), adipocytes (n = 3), hepatocytes (n = 3), cortical neurons (n = 3), leukocyte (n = 1), lung epithelial (n = 3), colon-epithelial (n = 3), and vascular endothelial (n = 2).

Illumina DNAm datasets of sorted neurons

We analyzed a total of 4 Illumina DNAm datasets of sorted neurons (neuronal nuclei NeuN+), in all cases only using normal control samples. In all cases, raw Illumina EPIC/450k DNAm files were downloaded from GEO under accession numbers GSE112179 (n = 28, post-mortem frontal cortex)123, GSE41826 (n = 29, post-mortem frontal cortex)124, GSE66351 (n = 16, post-mortem human brains)125 and GSE98203 (n = 29, post-mortem human brains)126. In all cases, probes not detected above the background were assigned NA. CpGs with coverage 0.99 were kept (in the 2nd and 4th sets, this threshold was relaxed slightly to 0.98). The remaining NAs were imputed with impute.knn (k = 5). Finally, the beta-valued data matrix was normalized with BMIQ.

Illumina DNAm datasets from normal healthy and normal “at cancer-risk” tissue

  1. 1.

    Lung preinvasive dataset : This is an Illumina 450k DNAm dataset of lung-tissue samples that we have previously published76. We used the normalized dataset from Teschendorff et al.76 encompassing 21 normal lungs and 35 age-matched lung-carcinoma in situ (LCIS) samples, and 462,912 probes after QC. Of these 35 LCIS samples, 22 progressed to an invasive lung cancer (ILC).

  2. 2.

    Breast preinvasive dataset : This is an Illumina 450k dataset of breast tissue samples from Johnson et al.127. Raw idat files were downloaded from GEO under accession number GSE66313, and processed with minfi. Probes with sample coverage <0.95 (defined as a fraction of samples) with detected (P < 0.05) P-values were discarded. The rest of the unreliable values were assigned NA and imputed with knn (k = 5)100. After BMIQ normalization, we were left with 448,296 probes and 55 samples, encompassing 15 normal-adjacent breast tissue and 40 age-matched ductal carcinoma in situ (DCIS) samples, of which 13 were from women who later developed an invasive breast cancer (BC).

  3. 3.

    Gastric metaplasia dataset : Raw idat files were downloaded from GEO (GSE103186)128 and processed with minfi. Probes with over 99% coverage were kept and missing values imputed using impute R-package using impute.knn (k = 5). Subsequently, data was intra-array normalized with BMIQ, resulting in a final normalized data matrix over 482,975 CpGs and 191 samples, encompassing 61 normal gastric mucosas, 22 mild intestinal metaplasias, and 108 metaplasias. Although age information was not provided, we used Horvath’s clock129 to confirm that normal and mild intestinal metaplasias were age-matched. This is justified because Horvath’s clock is not a mitotic clock30 and displays a median absolute error of ±3 years129.

  4. 4.

    Barrett’s Esophagus and adenocarcinoma dataset : This Illumina 450k dataset49 is freely available from GEO under accession number GSE104707. Data was normalized as described by us previously130. The BMIQ-normalized dataset is defined over 384,999 probes and 157 samples, encompassing 52 normal squamous epithelial samples from the esophagus, 81 age-matched Barrett’s Esophagus specimens, and 24 esophageal adenocarcinomas.

  5. 5.

    Colon adenoma dataset131: Illumina 450k raw idat files were downloaded from ArrayExpress E-MTAB-6450 and processed with minfi. Only probes with 100% coverage were kept. Subsequent data was intra-array normalized with BMIQ, resulting in a normalized data matrix over 483,422 CpGs and 47 samples, encompassing 8 normal colon specimens and 39 age-matched colon adenomas. Although age information was not made publicly available, we imputed them using Horvath’s clock, confirming that normals and adenomas are age-matched.

  6. 6.

    Cholangiocarcinoma (CCA) dataset132: Raw idat files were downloaded from GEO (GSE156299). This is an EPIC dataset and was processed with minfi. Probes with >99% coverage (fraction of samples with P < 0.05) were kept. NAs were imputed with impute.knn (k = 5). Subsequent data was intra-array normalized with BMIQ, resulting in a normalized data matrix over 854,026 probes and 137 samples, encompassing 50 normal bile duct specimens, 60 premalignant, and 27 cholangiocarcinomas. Normal bile duct, premalignant, and CCA specimens were derived from the same patient. Thus, normal and premalignant samples are mostly age-matched.

  7. 7.

    Prostate cancer progression dataset : Illumina 450k raw idat files were downloaded from GEO (GSE116338)133 and processed with minfi. CpGs with coverage >0.95 were kept. NAs were imputed with impute.knn (k = 5). Subsequent data matrix was intra-array normalized with BMIQ. Samples included benign lesions (n = 10), neoplasia (n = 6), primary tumors (n = 14) and metastatic prostate cancer (n = 6). In this dataset, benign lesions were significantly older compared to neoplasia and primary tumors, thus this dataset provides a particularly good test that mitotic clocks are measuring mitotic age as opposed to chronological age.

  8. 8.

    Oral squamous cell carcinoma (OSCC) progression dataset : The processed beta-valued Illumina 450k DNAm data matrix was downloaded from GEO (GSE123781)134. Probes not detected above background had been assigned NA. CpGs with coverage >0.95 were kept. The remaining NAs were imputed with impute.knn (k = 5). Then the beta matrix was BMIQ normalized. Samples included 18 healthy controls, 8 lichen planus (putative premalignant condition), and 15 OSCCs. This is the only dataset where premalignant samples older than the healthy controls.

  9. 9.

    Colon adenoma dataset 2. Processed beta-valued Illumina 450k DNAm data matrix was downloaded from GEO (GSE48684)135, encompassing 17 normal-healthy samples, 42 age-matched colon adenoma samples, and 64 colon cancer samples. Probes not detected above background were assigned NA. CpGs with coverage >0.95 were kept. The remaining NAs were imputed with impute.knn (k = 5). Then the beta matrix was BMIQ normalized. Although age information was not made publicly available, ages were imputed with Horvath’s clock confirming that normals and adenomas are age-matched.

  10. 10.

    Normal breast Erlangen dataset : This Illumina 450k dataset is freely available from GEO under accession number GSE69914. Data was normalized as described by us previously32. The BMIQ-normalized dataset is defined over 485,512 probes and 397 samples, encompassing 50 normal-breast samples from healthy women, 42 age-matched normal-adjacent samples, and 305 invasive breast cancers. We note that this dataset was excluded from the formal comparison of stemTOC to all other clocks, because this dataset was used to select the upper-quantile threshold in the definition of stemTOC (see later).

Illumina DNAm datasets of normal tissues exposed to cancer-risk factors

  1. 1.

    Buccal swabs+smoking (n = 790): This Illumina 450k DNAm dataset was generated and analyzed previously by us76. We used the same normalized DNAm dataset as in this previous publication. Briefly, the samples derive from women all aged 53 years at sample draw and belong to the MRC1946 birth cohort. This cohort has well-annotated epidemiological information, including smoking-status information. Among the 790 women, 258 were never-smokers, 365 ex-smokers, and 167 current smokers at sample draw.

  2. 2.

    Lung-tissue+smoking (n = 204): This Illumina EPIC DNAm dataset of normal lung-tissue samples derives from eGTEX45 and was processed as the other tissue datasets from eGTEX. Age information was only provided in age-groups. Smoking status distribution was 89 current smokers, 54 ex-smokers, and 61 never-smokers.

  3. 3.

    Liver-tissue+obesity (n = 325): This Illumina EPIC DNAm dataset of liver tissue is derived from GEO: GSE180474, and encompasses liver tissue samples from obese individuals (minimum BMI = 32.6), all diagnosed with non-alcoholic fatty liver disease (NAFLD)78. Of the 325 samples, 210 had no evidence of fibrosis (grade-0), 55 had grade-3 fibrosis, 36 intermediate grade 3-4 fibrosis, and 24 grade-4 fibrosis. We downloaded the processed beta and P-values and only kept probes with 100% coverage across all samples. Data was further adjusted for type-2 probe bias using BMIQ.

Construction of stemTOC: identification of mitotic CpGs

The construction of stemTOC initially involves a careful selection of CpGs that track mitotic age. Initially, we follow the procedure as described for the epiTOC+epiTOC2 mitotic clocks, identifying CpGs mapping to within 200 bp of transcription start sites and that are constitutively unmethylated across 86 fetal tissue samples encompassing 15 tissue-types. Here, by constitutive unmethylation, we mean a DNAm beta value <0.2 in each of 86 fetal tissue samples encompassing 13 tissue-types. Of note, this means that all these CpGs occupy the same methylation state in the fetal ground state, i.e. these CpGs are not cell-type or tissue-specific in this ground state. Next, we use the cell-line data from Endicott et al.24 to further identify the subset of CpGs that display significant hypermethylation with population doublings, but which do not display such hypermethylation when the cell lines are treated with a cell-cycle inhibitor (mitomycin-C) or when the cell-culture is deprived of growth-promoting serum. In detail, for each of 6 cell lines (AG06561, AG11182, AG11546, AG16146, AG21837, AD21859) representing fibroblasts (3), smooth muscle (1), keratinocyte (1) and endothelial cell types, we ran linear regressions of DNAm against population doublings (PD) (a total of 182 samples), identifying for each cell-line CpGs where DNAm increases with PDs (q-value (FDR) < 0.05). Out of a total of 843,98 CpGs, 14,255 CpGs displayed significant hypermethylation with PDs in each of the 6 cell lines. Of these 14,225 CpGs, we next removed those still displaying hypermethylation (unadjusted P < 0.05) with days in culture in cell lines treated with mitomycin-C or in cell lines deprived of serum. This resulted in a much-reduced set of 629 “vitro-mitotic” CpG candidates. In the next step, we asked how many of these “vitro-mitotic” CpGs display age-associated hypermethylation in-vivo, using 3 separate large whole-blood cohorts. The rationale for using whole blood is that this is a high-turnover-rate tissue, and so chronological age should be correlated with mitotic age. Moreover, large whole-blood cohorts guarantee adequate power to detect age-associated DNAm changes. And thirdly, by intersecting CpGs undergoing hypermethylation with PDs in-vitro with those undergoing hypermethylation with age in-vivo, we are more likely to be identifying CpGs that track mitotic age in-vivo. We used the 3 large whole-blood datasets as described in the previous section, each containing approximately 700 samples, whilst adjusting for all potential confounders including cell fractions for 12 immune-cell subtypes40. Specifically, we used a DNAm reference matrix defined over 12 immune cell types, which we have recently validated across over 23,000 whole-blood samples from 22 cohorts41, to estimate cell-type proportions using our EpiDISH procedure136. These proportions were then included as covariates when identifying age-DMCs in each cohort separately. To arrive at a final set of age-DMCs, we used the directional Stouffer method over the 3 large cohorts to compute an overall Stouffer z-statistic and P-value, selecting CpGs with z > 0 and P < 0.05. Of the 629 “vitro-mitotic” CpGs, 371 were significant in the Stouffer meta-analysis of whole-blood cohorts. This set of 371 CpGs defined our “vivo-mitotic” CpGs making up stemTOC.

Estimating relative mitotic age with stemTOC

Given an arbitrary sample with a DNAm profile defined over these 371 CpGs, we next define the mitotic age of the sample as the 95% upper quantile of DNAm values over the 371 vivo-mitotic CpGs. The justification for taking an upper quantile value, as opposed to taking an average is as follows: extensive DNAm data from previous studies indicate that DNAm changes that mark cancer risk (and hence also mitotic age) are characterized by an underlying inter-CpG and inter-subject stochasticity19,31,32,84. Specifically, relevant DNAm changes in normal tissues at cancer risk constitute mild outliers in the DNAm distribution representing subclonal expansions, that occur only infrequently across independent subjects, with the specific CpG outliers displaying little overlap between subjects. Thus, for a given pool of mitotic CpG candidates (i.e. the pool of 371 vivo-mitotic CpGs derived earlier), only a small subset of these may be tracking mitotic age in any given subject at-risk of tumor development. Thus, taking an average DNAm over the 371 CpGs is not optimized to capturing the effect of subclonal expansions that track the mitotic age. To understand this, we ran a simple simulation model, with parameter choices inspired by real DNAm data19,31,32,84, for a reduced set of 20 CpGs and 40 samples representing 4 disease stages (normal, normal at-risk, preneoplastic, cancer) with 10 samples in each stage. In the initial stage, all CpGs are unmethylated with DNAm beta-values drawn from a Beta distribution Beta(a = 10,b = 90), where Beta(a,b) is defined by the probability distribution

$$pleft(X=x , | , a,bright) sim {x}^{a-1}{(1-x)}^{b-1}$$

(1)

which has a mean (Eleft[Xright]=a/(a+b)) and variance ({Var}left[Xright]={ab}/((a+b+1){(a+b)}^{2})). For each “normal at-risk” sample, we randomly selected 1–3 CpGs (precise number was drawn from a uniform distribution), simulating DNAm gains for these CpGs by drawing DNAm values from Beta(a = 3,b = 7). For each preneoplastic sample, we randomly selected 5–10 CpGs with DNAm values drawn from Beta(a = 5,b = 5). Finally, for invasive cancer, we randomly selected 11–17 CpGs with DNAm gains drawn from Beta(a = 8,b = 2). Under this model, discrimination of normal-at-risk samples from normal healthy is difficult if one were to average DNAm over 20 CpGs. However, taking a 95% upper quantile (UQ) of the DNAm distribution over the 20 CpGs, one can discriminate the normal at-risk samples. In this instance, a 95% UQ over 20 CpGs corresponds to taking the maximum value over these 20 CpGs. Another way to understand this is by first identifying CpGs that display DNAm outliers in the normal at-risk group compared to normal. This can be done using differential variance statistics19,137 to identify hypervariable CpGs, or alternatively, by finding CpGs for which a suitable upper quantile over the normal at-risk samples is much greater than the corresponding UQ value over the normal samples. To determine what UQ-threshold may be suitable, we analyzed our Illumina 450k DNAm dataset encompassing 50 normal-breast tissues from healthy women and 42 age-matched normal-adjacent samples (“at-risk” samples) from women with breast cancer32. For each of the 371 CpGs, we computed the mean and UQ over the 50 normal-breast samples, and separately over the 42 normal-adjacent “at-risk” samples. We considered a range of UQ thresholds from 0.75 to 0.99. We then compared the difference (i.e. effect size) in the obtained values between the normal-adjacent and normal-healthy, averaging over the 371 CpGs. The effect size increases with higher UQ values. We selected a 95% UQ because at this threshold we maximized effect size without compromising variability (choosing higher UQs leads to increased random variation). Although in this analysis, the UQ is taking across samples, it is reasonable, given the underlying stochasticity of the patterns, to apply the same UQ-threshold across CpGs. Thus, for any independent sample, we define the relative mitotic age of stemTOC as the 95% upper quantile of the DNAm distribution defined over the 371 CpGs. We note however that results in this manuscript are not strongly dependent on the particular UQ value, i.e. results are generally very robust to UQ values in the range 75–95%.

Estimating cell-type fractions in solid tissues with EpiSCORE and HEpiDISH

In this work we estimate the proportions of all main cell types within tissues from the TCGA using our validated EpiSCORE algorithm37 and its associated DNAm atlas of tissue-specific DNAm reference matrices38. This atlas comprises DNAm reference matrices for lung (7–9 cell types: alveolar epithelial, basal, other epithelial, endothelial, granulocyte, lymphocyte, macrophage, monocyte and stromal), pancreas (6 cell types: acinar, ductal, endocrine, endothelial, immune, stellate), kidney (4 cell types: epithelial, endothelial, fibroblasts, immune), prostate (6 cell types: basal, luminal, endothelial, fibroblast, leukocytes, smooth muscle), breast (7 cell types: luminal, basal, fat, endothelial, fibroblast, macrophage, lymphocyte), olfactory epithelium (9 cell types: mature neurons, immature neurons, basal, fibroblast, gland, macrophages, pericytes, plasma, T-cells), liver (5 cell types: hepatocytes, cholangiocytes, endothelial, Kupffer, lymphocytes), skin (7 cell types: differentiated and undifferentiated keratinocytes, melanocytes, endothelial, fibroblast, macrophages, T-cells), brain (6 cell types: astrocytes, neurons, microglia, oligos, OPCs and endothelial), bladder (4 cell types: endothelial, epithelial, fibroblast, immune), colon (5 cell types: endothelial, epithelial, lymphocytes, myeloid and stromal) and esophagus (8 cell types: endothelial, basal, stratified, suprabasal and upper epithelium, fibroblasts, glandular, immune). Hence, when estimating cell-type fractions in the TCGA tissue samples, we were restricted to those tissues with an available DNAm reference matrix. EpiSCORE was run on the BMIQ-normalized DNAm data from the TCGA with default parameters and 500 iterations. EpiSCORE was also used to estimate cell-type fractions in solid tissues from non-TCGA datasets. For instance, we used EpiSCORE to estimate 7 lung cell-type fractions (endothelial, epithelial, neutrophil, lymphocyte, macrophage, monocyte, and stromal) in normal lung-tissue samples from eGTEX45, and 5 liver cell-type fractions (cholangiocyte, hepatocyte, Kupffer, endothelial and lymphocyte) in liver tissue78. For the buccal swab dataset, because buccal swabs only contain squamous epithelial and immune-cells, we used the validated HEpiDISH algorithm and associated DNAm reference matrix to estimate these fractions in this tissue77.

Validation of EpiSCORE fractions using a WGBS DNAm atlas

Whilst EpiSCORE has already undergone substantial validation38, here we decided to further validate it against the recent human WGBS DNAm atlas from Loyfer et al.58, which comprises 207 sorted samples from various tissue-types. Briefly, we downloaded the beta-valued data and bigwig files (hg38) from the publication website. We required CpGs to be covered by at least 10 reads. We then averaged the DNAm values of CpGs mapping to within 200 bp of the transcription start sites of coding genes, resulting in a DNAm data matrix defined over 25,206 gene promoters and 207 sorted samples. The 207 sorted samples represent 37 cell types from 42 distinct tissues (anatomical regions). We validate EpiSCORE in this WGBS-atlas by asking if it could predict the corresponding cell type. This was only done for tissues for which there is an EpiSCORE DNAm reference matrix. For instance, for breast tissue, the WGBS DNAm atlas profiled 3 luminal and 4 basal epithelial samples. Hence, for breast tissue, we applied EpiSCORE with our breast DNAm reference matrix, derived via imputation from a breast-specific scRNA-Seq atlas, to these 7 sorted samples to assess if it can predict the basal/luminal subtypes using a maximum “probability” criterion (using the estimated fraction as a probability). This analysis was done for WGBS-sorted cells from 15 anatomical sites (brain, skin, colon, breast, bladder, liver, esophagus, pancreas, prostate, kidney tubular, kidney glomerular, lung alveolar, lung pleural, lung bronchus, and lung interstitial) using our EpiSCORE DNAm reference matrices for brain, skin, colon, breast, bladder, liver, esophagus, pancreas, prostate, kidney and lung. An overall accuracy was estimated as the number of sorted WGBS samples where the corresponding cell type was correctly predicted.

Benchmarking against other DNAm-based mitotic clocks

We benchmarked stemTOC against 6 other epigenetic mitotic clocks: epiTOC16, epiTOC230, HypoClock18, RepliTali24, epiCMIT-hyper and epiCMIT-hypo33. Briefly, epiTOC gives a mitotic score called pcgtAge, which is an average DNAm over 385 epiTOC-CpGs. In the case of epiTOC2, we used the estimated total cumulative number of stem-cell divisions (tnsc). In the case of HypoClock, the score is given by the average DNAm over 678 solo-WCGWs with representation on Illumina 450k arrays. Of note, for Hypoclock, smaller values mean a larger deviation from the methylated ground state. In the case of RepliTali, which was trained on EPIC data, the score is calculated with 87 RepliTali CpGs and their linear regression coefficients as provided by Endicott et al.24. Of the 87 RepliTali CpGs, only 30 are present on the Illumina 450k array. As far as epiCMIT33 is concerned, this clock is based on two separate lists of 184 hypermethylated and 1164 hypomethylated CpGs. Because the biological mechanism by which CpGs gain or lose DNAm during cell division is distinct, the strategy recommended by Ferrer-Durante to compute an average DNAm over the two lists to then select the one displaying the biggest deviation from the ground state as a measure of mitotic age, is in our opinion not justified. Their strategy could conceivably result in the hypermethylated CpGs being used for one sample, and the hypomethylated CpGs being used for another. Instead, in this work we separately report the average DNAm of the hypermethylated and hypomethylated components. Thus, for the hypermethylated CpGs, the average DNAm over these sites defines the “epiCMIT-hyper” clock’s mitotic age, whereas for the hypomethylated CpGs we take 1-average(hypomethylated CpGs) as a measure of mitotic age, thus defining the “epiCMIT-hypo” clock.

We compare all mitotic clocks using the following evaluation strategies. In the analysis correlating mitotic ages to the fractions of the tumor cell-of-origin in the various TCGA cancer types, we compare the inferred Pearson Correlation Coefficients (PCCs) between each pair of clocks across the TCGA cancer types using a one-tailed Wilcoxon rank sum test. Likewise, in the analysis where we correlate mitotic age to chronological age in the normal-adjacent samples from the TCGA, in the normal samples from eGTEX, and the sorted immune-cell subsets, we compare the inferred PCCs between each pair of clocks across the independent datasets using a one-tailed Wilcoxon rank sum test. Finally, when assessing the clocks for predicting cancer risk, for each of the 9 datasets with normal-healthy and age-matched “normal at-risk” samples and for each clock, we first computed an AUC-metric, quantifying the clock’s discriminatory accuracy. For each pair of mitotic clocks, we then compare their AUC values across the 9 independent datasets using a one-tailed Wilcoxon rank sum test.

Comparison to stem-cell division rates and somatic mutational signatures

We obtained estimated tissue-specific intrinsic stem-cell division rates of 13 tissue-types (bladder, breast, colon, esophagus, oral, kidney, liver, lung, pancreas, prostate, rectum, thyroid, and stomach) from Vogelstein & Tomasetti3,8, supplemented with estimates for skin and blood30, corresponding to 17 TCGA cancer types (BLCA, BRCA, COAD, ESCA, HNSC, KIRP & KIRC, LIHC, LSCC & LUAD, PAAD, PRAD, READ, THCA, LAML, STAD, and SKCM). Thus, for each normal-adjacent sample of the TCGA, we can estimate the total number of stem-cell divisions (TNSC) by multiplying the intrinsic rate (IR) of the tissue with the chronological age of the sample. Somatic mutational clock signature−1 (MS1) and signature-5 (MS5) were derived from Alexandrov et al.11. These loads represent the number of mutations per Mbp. Of note, because these mutational loads and stemTOC’s mitotic age both correlate with chronological age, in specific analyses we divide these values by the chronological age of the sample to arrive at age-adjusted values.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.