A comprehensive benchmarking with interpretation and operational guidance for the hierarchy of topologically associating domains – Nature Communications

Compendium of hierarchical TAD callers

Increasingly more researches are exploring subTAD or the hierarchical structure of TAD, starting from HMM-DI2 to GRiNCH41, with over 20 algorithms developed for TAD hierarchy detection (Table 1). There are five principal strategies underlying these tools: linear score, clustering, network features, structural entropy and statistical model (Fig. 1a).

Table 1 Computational methods for prediction of TAD hierarchy
Fig. 1: Compendium of TAD hierarchy and callers.
figure 1

a Categories of TAD hierarchy callers. b Definition and calculation of TAD hierarchy. Top: Structure of TAD at chromosome (the left panel) and A/B compartment scales (the right panel), red area indicates A compartment, blue area indicates B compartment. Bead-like structure refers to TAD on the genome. Middle: spatial morphology of TAD and subTAD in the nucleus, light purple represents TAD, dark purple represents subTAD, dark yellow represents TAD boundary, light yellow represents subTAD boundary. Lower: TAD hierarchical structure in Hi-C heatmap, the color is consistent with the middle figure. The darker purple color represents higher TAD layers. Darker yellow represents higher TAD boundary hierarchy.

Each strategy represents reasonable interpretation of TAD structures from unique perspective. Linear score shows perspectives for TAD distribution of different sizes by tuning one single parameter, such as Arrowhead by corner score7, CaTCH by reciprocal insulation (RI)42, HiTAD25 by window sizes, OnTAD by size of sliding diamond window and average contact frequency within it38, Multi-CD37 by a free parameter λ related to domain solution, Armatus23 and matryoshka43 by a resolution parameter γ. The size of TADs is almost positively correlated with the value of the single parameter, and small TADs from low value are usually positioned in large TADs from high value. Thinking TADs as a series of contiguous blocks on a chromosome, clustering iteratively merges TADs neighbors based on similarity of interactions between contact domains to a larger TAD until reaching a chromosome-arm size, and regards the layer-by-layer clustering relationship as the TAD hierarchies, including BHi-Cect44, SpectralTAD45, IC-Finder46, and TADpole47. This continuity is also manifested in graph theory, and its two big branches: network features (3DNetMod48, HBM49, GRiNCH41, and spectral50) and structural entropy (deDoc51, and SuperTAD40). Network features assume TAD-like structure as the best structural separation on the chromosome, making one TAD as a node and the relation between TADs as an edge. By calculating the edge weight, nodes in the network are divided into vintage clusters, and each cluster covers a large TAD and the nested subTADs within it. The structural entropy is defined over the coding tree of a graph by fixing and decoding the graph in a way that minimizes the uncertainty occurring in random walks. The essence of structural entropy algorithm is to fix the genomic loci at which the uncertainty of the structure is maximized. The less information there is, the more possibility that contact domains are in the same TAD. As for the statistical model, it characterizes TAD hierarchy and biological properties by certain statistical distribution, for example: Gaussian mixture distribution (GMAP26), general mixed distribution combined generalized likelihood ratio test (HiCKey52), and probability distribution model with dynamic programming (TADtree53 and PSYCHIC54).

Given these methods define the level of TAD hierarchy by different start (from level 0 or level 1) and various directions (from the inside out or the opposite), we make uniform provisions for the level of TAD and boundary (Fig. 1a). TAD is a chromatin structure at the sub megabase scale, which is shown as an isosceles right triangle significantly above the background in Hi-C thermograms. the TAD hierarchical structure is shown as a nested isosceles right triangle (Fig. 1b). To assess how many layers are nested in the TAD structure, we define level 1 for the TADs that don’t belong to any larger TAD outside, and the level increases by 1 as smaller TADs position an inner layer. For boundaries, the rule follows that of OnTAD: the maximum of TADs it belonging to in a single direction, suggesting one boundary may belong to different numbers of TADs by left and right (Fig. 1b).

Evaluation among callers and within each caller

Among the algorithms above, 3D-NetMod requires redundant parameters to be set and tested. For high-resolution data (especially over 40 Kb), there is no precise parameter range, resulting in low confidence for TAD hierarchy prediction; CaTCH seldom gives so proper pair of RIs determining TAD and subTAD that subTADs are not always well located in TADs; results from BHi-Cect are recorded in a complicated form, making it not easy to extract the location of TADs; descriptions for parameters of HMM-DI, HBM, spectral, IC-Finder, PSYCHIC and Multi-CD in each step are not clear enough to perform, even without parameter options. Therefore, these methods do not participate in this evaluation.

After the above screening, we intend to use the Hi-C data available under the Gene Expression Omnibus (GEO)55 accession number GSE63525 to compare performance of the following 13 methods: Arrowhead, Armatus, TADtree, HiTAD, GMAP, deDoc, matryoshka, OnTAD, TADpole, SpectralTAD, HiCKey, SuperTAD and GRiNCH. According to length range of TADs/subTADs, we pick data resolutions including 5 Kb, 10 Kb, 25 Kb, 50 Kb and 100 Kb on chromosome 7 from 7 human cell lines (GM12878, IMR90, HMEC, HUVEC, NHEK, K562, KBM7), and chose MAPQ > 30 data normalized with the iterative correction and eigenvector decomposition (ICE)56 (see the Methods), under certain condition. Chromosome 7 contains moderate genetic messages that ensure the reliability of the analysis, and covers vital genes like the HOXA gene family, which are involved with genome architecture, limb development, and multiple types of cancer28,57,58,59.

Variation of hierarchical TAD structures from different algorithms

Short of the metric to compare such domain-within-domain structure, we developed hierarchy structural similarity (Hier_SSIM) (see the Methods) using structural similarity to judge the similarity of the output heatmap (Fig. 2a).

Fig. 2: Hier_SSIM and evaluation across all callers (ICE-normalized Hi-C data at 10Kb resolution).
figure 2

a Diagram of Hier_SSIM process. b Clustering of TAD hierarchy callers. Source data are provided as a Source Data file. c Hierarchical structures of representative region by callers. The green image in the first row represents the Hi-C heatmap. The remaining blue images show the distribution of TAD levels from each method. d Peak signals for structural protein CTCF around the boundary.

We obtain the output results from all the tools on 10 Kb data of GM12878 cell line (Fig. 2b, c). We categorize these methods by average linkage (see the “Methods” section), a kind of hierarchical clustering, and obtain four main clusters: the first includes HiCKey and SuperTAD; the second involves only GRiNCH; the third contains matryoshka, SpectralTAD, Arrowhead and deDoc; the last covers Armatus, OnTAD, GMAP and HiTAD (Fig. 2b). Overall, there is a high degree of similarity between all clusters. Though such clusters are not significantly consistent with methods groups, the diagram of TAD hierarchy directly exhibits two possible reasons. For example, TADs located at 1–1.5 Mb have the following conditions: (1) belong to the upstream or the downstream larger TAD; (2) consist of scattered monolayers or hierarchical construction (Fig. 2c). Disagreement of the four clusters above partly consults from the division at a larger scale (larger TADs), confirming that TAD confines the nested subTADs within it whatever their reconstruction36. That explains the difference within the cluster. While the hierarchy in the same genome region has greater power to distinct clusters. Hence, TAD hierarchy could prove a decisive factor in chromosome topology and deserves further concern.

Since lack of a conclusive gold standard to evaluate the accuracy of TAD prediction, it’s a feasible option to judge the correlation with biological features. Given that TAD boundaries are always anchored by architectural proteins such as CTCF and cohesion2,7,8,9, we compared the enrichment of CTCF and SMC3 (subunit of cohesin protein complex) at TAD boundaries from all tools. Both markers show sharp signal peak on OnTAD, HiTAD, Armatus, deDoc, matryoshka, Arrowhead, and SpectralTAD. GMAP and GRiNCH show good correlation with CTCF signal peak around boundaries (Fig. 2d, Supplementary Fig. 1a), reflecting perfect accuracy and efficiency in detecting TAD boundaries and best recognition of TAD segregation. To verify the robustness of our results across species, we applied TAD hierarchy recognition methods to Hi-C data of mouse CH12-LX cell line and drosophila S2 cell line. Notably, we consistently found that CTCF enrichment occurred at TAD boundaries in different species and achieved the most enrichment in the OnTAD boundaries (Supplementary Fig. 1b, c). Apart from CTCF and cohesin, TAD boundaries tend to enrich active regulators, like H3K4me3 (active promoter), H3K27ac (active enhancer and promoter), POLR2A (subunit of transcriptional factor) and H3K9ac (active promoter) but don’t collect repressed elements such as H3K27me3 (repressed promoter) and H3K9me3 (heterochromatin)2,35. We conduct the same analysis on these markers: DNase-seq, H3K4me3, H3K27ac, POLR2A, and H3K9ac. These markers permanently show significant signal peak around boundaries from OnTAD and Armatus and occasionally show it on HiTAD, matryoshka, Arrowhead and SpectralTAD (Supplementary Figs. 1d, e, and 2). And HiCKey stably shows little but clear peak of all the above markers (Fig. 2d, Supplementary Figs. 1 and 2). While H3K27me3 is apparent on Armatus and matryoshka, H3K9me3 peak is clear on GRiNCH, deDoc, and GMAP (Supplementary Fig. 3a and b). Moreover, we search for overlap of boundaries with protein-coding genes, including promoter and coding sequence (CDS), and consistently find boundaries from SuperTAD cover the greatest number of genes, followed by HiCKey, matryoshka and SpectralTAD (Supplementary Fig. 3c and d). Together, we conclude that OnTAD, HiTAD, SpectralTAD, and Arrowhead are perfect for detecting architectural proteins and active epigenomic indicators. GRiNCH, deDoc, and GMAP perform well in capturing inactive regulators and architectural proteins. Armatus and matryoshka do good in all three aspects.

Hierarchical TAD identification across data resolution, normalization, sequencing depth and biological replicates

For some basic features, we first evaluate these methods with ICE Hi-C data in TAD segment length, number/percentage of TAD/boundary at all levels, and the genomic coverage of TADs. Researches generally consider the length of TAD and subTAD roughly in the range of 30 kb~2 Mb, so we confine the size of TAD segments within this limit for subsequent analysis and comparison (see the Methods), except for comparison of TAD segment length.

We first explore the length distribution of TAD segments on 7 cell types and all five resolutions without size filtration on the ICE Hi-C data (Fig. 3a, Supplementary Fig. 4). Except Arrowhead and TADpole, the length of TAD fragments is generally within 2 Mb, of which is within 1 Mb in deDoc, matryoshka and SuperTAD. There is a trend of length shortening with the lifting of resolution in most of the methods, while that of GRiNCH fluctuates in a stable range (Fig. 3a, Supplementary Fig. 4).

Fig. 3: Identification of hierarchical TADs in chromosome 7 from ICE-normalized Hi-C data.
figure 3

a Length range of hierarchical TADs called by different methods and resolutions. The line that divides the box into 2 parts represents the median of the data. The ends of the box show the upper (Q3) and lower (Q1) quartiles. The difference between Quartiles 1 and 3 is called the interquartile range (IQR). The extreme line shows Q3 + 1.5xIQR to Q1-1.5xIQR (the highest and lowest value excluding outliers). b, c Numbers of TADs (b) and boundaries (c) at various levels of the GM12878 cell line on 10 Kb. d Genomic coverage of hierarchical TADs of various callers and resolutions. Source data are provided as a Source Data file.

Then we compare the number and percentage respectively of TADs and their boundaries of all levels on the 50 Kb and 10 Kb ICE Hi-C data of GM12878 and K562 (Fig. 3b, c, Supplementary Fig. 5), and we find a relatively consistent distribution in most approaches (except SuperTAD and HiCKey). The number and percentage of TAD segments decrease as the level arises, with the same tendency seen in TAD boundaries (Fig. 3b, Supplementary Fig. 5a, c, d). With the increase of resolution, the number of TAD rises at all levels, especially the higher level in Armatus, OnTAD, HiTAD, and GMAP, as well the proportion (Fig. 3b, Supplementary Fig. 5a, c, d). As for TAD boundaries, the increase in resolution makes the number of boundaries measured by all methods increases in each level, but the ratio between the levels remains permanent (Fig. 3c, Supplementary Fig. 5b, e, f).

Next, we define the percentage of TAD segment coverage on the whole chromosome as the genomic coverage and count it on chromosome 7 (see the “Methods” section). The genomic coverage of most methods is over 90% (Fig. 3d, Supplementary Data 1). For OnTAD and TADtree, the values are stable at 80% and 70%, respectively, while the ratios of SuperTAD, HiCKey, and TADpole are almost up to 100%. In general, the genomic coverages of the majority of tools are slightly affected by resolution, while those of matryoshka and Arrowhead are largely affected by resolution.

Robustness is an important metric to evaluate performance of TAD hierarchy callers. We set a series of testing conditions by changing resolution (5 Kb, 10 Kb, 25 Kb, 50 Kb, and 100 Kb), matrices normalization (raw matrix and ICE-normalized matrix), and sequencing depth (20%, 50%, and 100%) (Fig. 4). Here, two metrics are used to measure the similarity of TAD hierarchy: overlap ratio (OR, derived from SuperTAD40 to evaluate the similarity between two coding trees) and Hier_SSIM (see the Methods).

Fig. 4: Evaluation of each caller across data resolution, normalization, and sequencing depth.
figure 4

a Hier_SSIM between TAD hierarchy obtained at different resolutions was assessed in a pairwise manner (e.g., 5 Kb vs. 10 Kb, 5 Kb vs. 25 Kb, etc.; results for the ICE data only are shown here). Hier_SSIM varies from 0 (no similarity, white) to 1 (full similarity, dark blue), showed in the Heatmap simulation diagram (the left panel). TAD hierarchy callers are ranked based on the average values of the Hier_SSIM across all resolutions (from highest to lowest, the right panel). b, c Concordance between TAD hierarchy obtained with each caller from raw and ICE-normalized matrices at different resolutions (5, 10, 25, 50, 100 Kb) using the Overlap ratio (b) and the Hier_SSIM (c). Overlap ratio and Hier_SSIM vary from 0 (no similarity, white) to 1 (full similarity, dark red). TAD hierarchy callers are ranked based on the average values of the overlap ratio and the Hier_SSIM (from highest to lowest). Samples are beyond the resolution range of callers (backslash) and results of certain resolutions can not computed (gray). d Ratio of Hier_SSIM of TAD hierarchy between 20% and 100% versus that between 50% and 100% from GM12878 cell line 50 Kb ICE data. The dashed line indicates the linear fit. e Overlap ratio between TAD hierarchy obtained with 7 cell lines on 50 Kb ICE data by OnTAD. Source data are provided as a Source Data file.

To objectively assess the performance, we set similarity of 0.7 as a standard, which conveys a neutral correlation. Our Hier_SSIM results reveal that matryoshka, HiTAD, TADpole, deDoc, Armatus, SuperTAD, OnTAD, Arrowhead, TADtree, GMAP and SpectralTAD are less affected by resolution (Fig. 4a), while OR shows good robustness of deDoc, GRiNCH, OnTAD, SuperTAD, and TADpole (Supplementary Table 1). As for raw data and ICE-normalized data, SpectralTAD, Arrowhead, SuperTAD, deDoc, and TADpole show high similarity, indicating that they are superior in processing raw data (Fig. 4b, c). Finally, after downsampling the ICE-normalized data of 50 Kb by 50% and 20%, we find that SpectralTAD, matryoshka, deDoc, TADpole, OnTAD, GMAP, TADtree, GRiNCH, and SuperTAD are seldom affected by the sequencing depth of input data (Fig. 4d, Supplementary Table 2). Owing to the diversity in sequencing depths of 7 cell types (GM12878, HMEC, HUVEC, K562, KBM7, and NHEK) in GSE63525 dataset, we apply OnTAD on 50 Kb and 10 Kb ICE-normalized data and calculate OR between every two samples to seek the leading factors of TAD hierarchy variation. Results are mainly divided into two distinct clusters by resolution, with OR ranges from 0.7 to 0.8 within the same cluster and that around 0.5 between distinct clusters (Fig. 4e, Supplementary Fig. 6a). Sequencing depth and cell-specificity may give rise to the fluctuation within the cluster. Together, data resolution serves as the principal component of discrepancy among results from one single algorithm. Lastly, to measure the reproducibility of TAD hierarchy calling results on biological replicates of Hi-C data, we applied TAD hierarchy recognition methods to GM12878 Hi-C data, and found that matryoshka, SpectralTAD, OnTAD, Arrowhead, GMAP, and HiCKey can achieve great reproducible results (Supplementary Fig. 6b). And SpectralTAD can achieve the most reproducible results with 0.976539 similarity. This suggested that these methods are robust in identifying TAD in biological replicates.

In summary, we find that TAD hierarchy is greatly affected by resolution and sequencing depth, but seldom varies with normalization and biological replicates. While there are still plenty of hierarchical TAD callers that show stability whatever the resolution, with similarity ranging around 0.7. As for traditional TAD callers, the similarity between multi-resolution is about 0.524. That means TAD hierarchy is less influenced by resolution, compared to single TAD structures. Thus, focusing on TAD hierarchy is promising to overcome the shortage or limitation on resolution and convey potential information of chromatin, which would be a wiser choice than single TAD.

Comprehensive performance and guidance of hierarchical TAD callers

Based on the testing above, we summarize a comprehensive evaluation of all methods (Fig. 5a), including biological correlation, robustness, and actual user experience (software installation, code instructions, input processing, parameters setting, downstream procession, time consumed, resolution self-identification and built-in visualization). We also provide details of running time and memory cost for each of the 13 methods (Supplementary Table 3). From the perspective of biological correlation, OnTAD and Armatus perform extraordinarily in architecture proteins, chromatin accessibility, and active histone modifications, while Armatus, matryoshka, GRiNCH, deDoc, and GMAP are suitable for inactive molecular markers. As for input, spectral requires matrix from Matlab; CaTCH needs catch file produced by hicpro2catch; the cool matrix in HiTAD could be obtained by HiCExplorer; deDoc deals with sparse matrix with max bin order in the first line; TADtree, HiCKey, HiTAD and 3DNetMod conduct configure files containing parameters, the path of input and software. Such variety can be seen in output: OnTAD, HiTAD, matryoshka, and GMAP directly show coordinate and level of each TAD; deDoc, SpectralTAD, and GRiNCH just send TADs at different levels in separate files; the rest of the methods require the calculation of TAD level additionally. This evaluation is not totally equal to the real performance of tools, affected by the quality of library construction, sequencing depth, and background noise. The input of single-cell Hi-C data provides a more realistic three-dimensional structure of chromatin, but at the same time poses the problem of high noise and data sparsity. In addition, disagreement of the results partly originates from various strategies or principles of TAD prediction, showing diverse understandings of TAD structure and hierarchy. Together, we provide a commendation of TAD hierarchy callers with proper input format for colleagues interested in it from fields of chromatin 3D structure, life science, and medicine (Fig. 5).

Fig. 5: Comprehensive evaluation of TAD hierarchy callers.
figure 5

a Summary performance of methods. The grey circles, colored circles, and colored squares represent normal, good, and excellent respectively. b Input format required by methods.

Besides, it’s important to flexibly select the appropriate method according to the resolution of sample data, sequencing depth or data sparsity, platform sources, and sequencing technologies. First, what should be identified include the sequencing depth (or data sparseness) of data and the optimal range of resolution. When dealing with higher-resolution data, OnTAD, HiTAD, matryoshka are ideal options, and matryoshka also performs well on low-resolution data (~500 kb)43. In terms of sequencing depth, SpectralTAD, deDoc, and GRiNCH can handle ultra-sparse data. It is worth mentioning that IC-Finder and SpectralTAD work well in high-noise inputs. Second, for samples from various sequencing technologies and platforms, the following methods have their special focus. All methods are capable of processing bulk Hi-C data, while Armatus and chromoHBM-3C additionally deal perfectly with 3 C data; deDoc is friendly to ultra-sparse data and even pooled single-cell Hi-C data; TADpole is suitable for Capture Hi-C data after combing DiffT scores; PSYCHIC shows compatibility for data from multiple platforms, such as SPRITE, HiChIP, and Hi-C.

Since performance related to the biological significance of all methods shows no huge difference, we summarize a guidance more focused on usability (Supplementary Fig. 7). The guidance takes three main points into consideration: (1) complexity of format for input/output files; (2) computational memory and resource consumption required by software; (3) abundance of the TAD hierarchy obtained. OnTAD shows predominant smooth in format resolvent of files, abundant levels of TAD hierarchy, and little consumption in running. GMAP is user-friendly while presenting steady two layers. If the computing platform is strong, SpectralTAD, Armatus, matryoshka, and deDoc are wise choices. Researchers who are skilled in file format conversion can try deDoc, Arrowhead, matryoshka, and HiTAD.

Applications of TAD hierarchy in biomedicine

Studies to date using these methods mainly focus on the formation of chromatin construction, gene regulation, embryonic development, and disease. First, the mechanism of structural formation has been explored throughout the field. OnTAD is used to describe the topological structure in normal and H1 depletion T cells, finding the impact of histone H1 in chromatin compaction60, and capturing chromosome conformation of human sister chromatids61. It also shows distinct TAD hierarchy in normal colon and colorectal carcinoma39. HMM-DI is applied to identify a subset of TADs exhibiting strong core-periphery mesoscale network in T cell62. Armatus provides evidence that TAD cliques are general phenomena across kinds of cell types and reveals genomic differences between large TAD cliques and small ones63. deDoc, HiTAD, and GMAP respectively detect TAD construction in pig embryos64, maize65, and mouse erythroid cell populations36. Moreover, the spatial scale of chromatin architecture is vital for tool selection: Multi-CD, GRiNCH, and CaTCH can deal with TADs, subTADs, and compartments, while chromoHBM (one of the algorithms of HBM) and PSYCHIC perform well in loop interactions, thus especially suitable for studying interactions between regulatory elements and target genes.

Second, the relationship between chromatin topology and genome function is still contentious. And some methods have made an attempt at gene dysregulation. Two clear TADs detected by SpectralTAD are mapped with numerous gene-gene and gene-enhancer interactions around the Regulators of Complement Activation (RCA) gene cluster in B cells, revealing extensive co-regulation66. deDoc is used in heat shock (HS) and calls TADs in the K562 cell line under normal HS (NHS), short-term HS (SHS), and long-term HS (LHS). And it helps researchers find strong stability of chromatin in response to SHS, compared to little alteration of chromatin accessibility67.

Third, a few methods have generated progression in embryonic development. HiCKey, Arrowhead, OnTAD, SpectralTAD, and TADpole are used to confirm insulation ranking of germ cells at various stages and lineages of differentiation, facilitating the principle for the nucleus programming that creates gametogenic progenitors in both sexes68. 3DNetMod, together with allelic expression states and chromatin marks, is used to explore the alteration of chromosome topology among oocytes, sperm, and early preimplantation embryos, exploring the complex dynamics of 3D-genome organization during early development69.

Last but not least, researches on chromosome spatial structure have been centered around disease progression. HMM-DI reveals that the linkage disequilibrium (LD) blocks encompassing schizophrenia disease-associated single nucleotide variants (daSNVs) are significantly enriched at core nodes, whereas obsessive-compulsive disorder (OCD) daSNVs are enriched at periphery nodes and autism spectrum disorder (ASD) daSNVs are equally distributed across core or periphery nodes, indicating link between 3D-genome’s core-periphery network structure and neuropsychiatric daSNVs62. OnTAD can distinguish normal colon from colorectal carcinoma both with hierarchical level of TAD and involved genes39. Chromosome spatial architecture, identified by SpectralTAD, shows the downstream efforts of loss of NSD1 in head and neck squamous cell carcinoma (HNSCC), together with RNA-seq results. That presents genome structure’s proficiency in targeted treatments70. 3DNetMod identifies subTADs in ESCs to help discover the relationship between high-scale genome topology alteration and disease-associated short tandem repeats (STRs), which is well known to contribute to over 25 inherited disorders71. In addition, TADpole performs perfectly in insertion points of Inv1 mutations47. Thus, it is necessary to say that some methods are suitable for specific biological problems.

The impact of cell heterogeneity on TAD hierarchy

As for bulk Hi-C data, TAD levels are consistent with the extent of gene expression, but the formation and existence of TAD hierarchy require further studies. Currently, the hypothesis for the formation of TAD hierarchy is divided into three categories: one is that the TAD hierarchy exists in a single-cell adjusting the gene expression (the pink box); the second regards it as just single superimpose of TAD layers derived from millions of cells (the blue box); the third supports concurrence of the previous two points (the orange box) (Fig. 6a). In clinical samples, the authenticity of TAD hierarchy greatly affects the study on triggers of disease. Our recent work proposed that the heterogeneity of cancer cells may be one of the reasons for the formation of TAD multi-scale layers in complicated TAD hierarchy in colorectal carcinoma39.

Fig. 6: Exploration for existence of TAD hierarchy.
figure 6

a Schematic diagram of three main hypotheses for the formation of TAD hierarchy. One is only the superposition of heterogeneous TADs from bulk data (the upper panel, pink), the second shows that exact TAD hierarchy exists in individual cell (the lower panel, blue), and the third supports the coexistence of the two (orange). Twisted light purple chromatin filaments form the TAD, highlighted by light blue circular shading. Various colored points or short curve located at convergence points indicate TAD boundaries (the left panel, pink and blue box). Hierarchical TAD pattern diagrams (the middle panel, pink and blue box). Actual hic heatmap (the right panel, pink and blue box). b Number of TAD boundaries at separate levels and various resolutions in mixed samples. Source data are provided as a Source Data file. c TAD hierarchy in 40.8-42.8 Mb on chr7 of all mixed samples.

To explore the impact of cell heterogeneity on TAD hierarchy, we perform a simulation of cellular heterogeneity by mixing GM12878 and K562 in 11 different ratios (Supplementary Fig. 8a, see the Methods). Given the evaluation above, we observe that OnTAD exhibits better performance in identifying TAD hierarchy, thus it enables us to provide convincing results for subsequent studies. OnTAD is used on the mixed data, and to search changes in TAD/boundary levels. If TAD hierarchy is the result of image superposition, the number of TAD boundaries with high levels will get obviously higher at a certain mixing ratio than that of a pure cell line. Interestingly, we find that the number doesn’t significantly excess that of K562 regardless of the mixing ratio (Fig. 6b). Further, we add IMR90 to be mixed with GM12878 and K562 in equal proportion, compare them with cases of their pure cell lines, and obtain consistent results (Supplementary Fig. 8c). As mixing ratio changes, the number of TADs tends to be approximate to that of the pure cell line which has more TADs. A similar phenomenon can be seen for boundaries (Supplementary Fig. 8b, c).

Next, to better simulate the impact of cell heterogeneity on TAD hierarchy, we collected single-cell Hi-C data of GM12878 and IMR90 from Kim et al.72. Then, we generated pseudo-bulk Hi-C matrices based on different cell mixing ratios and calculated the distribution of TAD numbers at different levels (Supplementary Fig. 8d). The results were consistent with bulk Hi-C mixing. We noticed the outputs never excess greatly to one pure cell line whatever the mixing ratio. In this way, the distribution of TAD hierarchy is seldom affected by cellular heterogeneity. Hence, we infer that TAD hierarchy would never just be a superimpose of millions of Hi-C heatmaps. This suggested that TAD hierarchy could be a real architecture in single cells.

Our discovery is consistent with recent findings. Xiaowei Zhuang group conducted a super-resolution chromatin tracing method based on multiplexed stochastic optical reconstruction microscopy (STORM) imaging of chromatin architecture and discovered TAD-like domains with spatially segregated globular conformations in single cells, of which the boundary shows cell-to-cell variety73. Besides, Jian Ma Lab and Zhihua Zhang group separately developed Higashi74 and DeTOKI75 to identify TAD-like domains with scHi-C data. Now that the existence of TAD-like domains has been confirmed with imaging technology, the discovery of TAD hierarchy at the single-cell level is just around the corner and has great research potential.

Hierarchical TADs act as air conditioner

Previous studies have pointed out that TAD boundaries are formed by cohesin protein complex sliding on chromatin fibers until encountering convergent CTCF10, nominated as ‘loop extrusion’ model76. Thus, TAD and subTAD can be described as ‘loop domains’77. Beyond this, the subTAD boundary generates insulating neighbors featured by significantly distinct genome function, and the ‘asymmetric ring extrusion’ model is proposed to explain the mechanism of TAD hierarchy, among which the boundary with levels 5+ obsesses significantly high transcriptional activity38.

In the section of the evaluation, OnTAD performs well in the majority of aspects, thus we use it to conduct deep analysis on ICE-normalized data of GM12878 (10 Kb). We found that the enrichment of H3K4me1 and H3K4me3 (related to the active promoter) increased as the TAD and boundary levels went up (the yellow box), while that of H3K27me3 (associated with the repressed promoter) was the opposite (Fig. 7a, Supplementary Fig. 9a). This is consistent with previous findings of TAD hierarchy38. In addition, we predict the hierarchical TAD structure of GM12878 and K562 cell lines, and compare the gene expression along with epigenetic features (Fig. 7b, Supplementary Fig. 9b). It is obvious that in the 86–88 Mb region of chromosome 7, the TAD hierarchy is rich in GM12878, while barren in K562, which coincides with the epigenetic landscape and expression abundance. It also confirmed the previous finding that hierarchical TADs are more active than single TAD or TAD gaps38. Moreover, both cell lines have the TAD encompassing the cyclin D binding myb-like transcription factor 1 gene (DMTF1) (86.79–86.87 Mb, the yellow shading). What the two cell lines have in common is that the TAD border is level 3 at 87.86 Mb (the pink shading). But the TAD border in GM12878 shows the interaction with the DMTF1 TAD, while failing to link with the DMTF1 TAD in K562. Moreover, the DMTF1 TAD is encompassed by the larger TAD (86.88–87.85 Mb) in GM12878 with more plentiful hierarchy, but that of K562 is distinct. These two points may result in the contrast expression of DMTF1. Thus, higher-level boundaries of the DMTF1 TAD will affect the expression of DMTF1.

Fig. 7: Further analysis of TAD hierarchy with OnTAD.
figure 7

a Enrichment of histone modifications within TADs and around boundaries of all levels. Level 1 (dark blue), level 2 (blue), level 3 (green) and level 4 plus (orange) are overlaid. The yellow shaded area indicates where the TAD boundary is located. b Representative example of TAD hierarchy and multi-omics landscape in GM12878 cell (left panel) and K562 cell (right panel). The upper green images show Hi-C heatmaps in both sides, and the middle blue images show the distribution of TAD hierarchy. Areas with yellow and pink shading are selected to depict inter-cellular variation. c Schematic representation of the air conditioner model for TAD hierarchy. TAD boundaries are marked according to the level.

Based on this, we propose an air conditioner model for the mechanism of TAD hierarchy (Fig. 7c, Supplementary Fig. 9c): The air conditioners represent enhancers. Then various levels of TADs and TAD boundaries are represented by shadows and curves of different colors. As shown in Fig. 7c, the high-level TAD boundary mediates the formation of three TADs. As a result, high-level TAD boundary has high interaction frequency with other TAD boundaries. Since active modifications and gene transcription were enriched at the TAD boundary, we inferred that there were active enhancers and transcriptional-activated genes on the TAD boundaries. A high-level TAD boundary can gather a number of enhancers by interacting with other TAD boundaries. Enhancers play the role in activating transcription by near-space interaction, so there is an analogy between enhancer and air conditioner. The more concentrated the distribution of enhancers, the stronger the activation of genes. In different cells types, states or conditions, TAD hierarchy might change. The reduction of TAD boundary level leads to reduction of enhancer concentration (Supplementary Fig 9c). And the reduction of enhancer concentration limits the activation of gene expression. This is consistent with our model that the concentration of air conditioner affects the ability of controlling the temperature of the aggregation area.

Hence, in GM12878, the air conditioner concentrated in the high-level TAD boundary, indicating DMTF1 TAD boundary has highly active modification (Fig. 7b, Supplementary Fig. 9b, yellow shadowing). Thus, the expression of DMTF1 is high with the help of highly concentrated air conditioners. While in K562, the reduction of the DMTF1 TAD boundary level may lead to lower concentration of enhancers, which limits DMTF1 gene expression. Same result can be also seen for ADAM metallopeptidase domain 22 gene (ADAM22) (the pink shading). DMTF1 protein helps to suppress cell growth or induce apoptosis, and ADAM22 encodes protein without metalloprotease activity. Lower expression of these genes in K562 may be propitious to tumorigenesis. To further exploration, we analyzed the TAD hierarchy in colorectal carcinoma (BRD3187) and paracancerous tissue (BRD3187N), together with gene expression (Supplementary Fig. 10a). Interestingly, the boundary of TAD containing the acylglycerol kinase (AGK) gene in colon cancer tissues was more spatially folded, resulting in an increased density of active regulatory elements in the surrounding region. Transcription level of AGK was also elevated. While in paracancerous tissues, the reduction of AGK TAD boundary level made the transcriptional active elements sparser in the adjacent space, contributing to lower AGK expression. The AGK gene encodes a mitochondrial membrane protein involved in lipid/glycerolipid metabolism and oncogenic MAPK signaling, and its higher expression in colorectal cancer tissues also promotes cancer development. Generally, TAD hierarchy shows more complexity in genome regions with higher transcriptional activity.

The air conditioner phenomenon can also be observed in TAD hierarchy between K562 and IMR90 in previous study52. The spatial folding degree of chromatin in region anchored by CTCF caused the difference in density of transcription active elements, resulting in different transcription levels of genes at the same location between different cell types, states, or developmental stages. The degree of folding resulted in the spatial proximity of genomic fragments, and this spatial proximity does not have the same high insulation fraction as the CTCF anchor, allowing greater flexibility in the arrangement of subTADs within a large TAD. This is consistent with the inter-cellular heterogeneity of subTADs. In sum, the high-level TAD boundary indicates the convergence of numerous powerful regulatory factors like super-enhancers and is functionally similar to ‘hub-boundaries’38. Together, the air conditioner model will provide a broader and more flexible perspective to explore genes, transcription factors, enhancers, histone modifications, and their interaction.