Methodology overview and dataset description
A workflow was designed to combine multiple unsupervised machine learning methods for the analysis of the whole transcriptome of samples obtained during hiPSC cardiac differentiation experiments (Fig. 1A). The final goal was to visualize expression patterns of the complete gene set and to identify, in an unbiased way, the most relevant genes for the analysis of the differentiation process. The complete description of the methodology can be found in the detailed methods section.
After pre-processing the raw data, HC was used on the samples’ dimension. Hierarchical Agglomerative Clustering algorithms organize data into a hierarchical structure, known as a dendrogram. In the beginning, each gene is considered a cluster and, progressively, the algorithms join the most similar clusters until a single cluster, containing all the genes, is obtained. Distinct clustering methods are available, based on the dissimilarity measure between samples and on the definition of dissimilarity between clusters (eg. single-link, average-link or complete-link)13. A data partition is obtained by cutting the dendrogram at a specific threshold. In this workflow 9 combinations of metrics (Euclidean, Cosine and Pearson correlation-based distance) and methods (single-link, average-link and complete-link) were tested.
In parallel, SOM, a model-based algorithm, was applied to the genes to portray gene expression landscapes for every sample. SOM starts by randomly choosing a set of (k) centroids which are linked in a grid structure and are called nodes14. At each iteration, one gene is selected and assigned to the most similar node. This node, and its neighbours, will be updated to become more similar to the input gene. As the number of iterations increases, the neighbourhood decreases, and eventually, the nodes become stable. This process of node and neighbourhood updating will cause near nodes in the grid structure to be more similar than those in far positions of the grid. At this point, all genes are mapped to the most similar node and all nodes are, for the last time, updated to the average of all genes mapped to it forming an entity here called metagene. This final step is what allows for the creation of a topology representation of the input genes’ expression patterns and directly comparing different samples. Among the SOM implementations used to study transcriptomic data, the OposSOM package15 has been widely used and has proved to be a versatile tool in different contexts16,17,18 and therefore was the one selected.
The next step consisted of using the K-means algorithm for gene selection and to form biologically relevant gene clusters through the differentiation and between culture methods. The K-means algorithm is a partitional clustering method that subdivides the genes into a predefined number of (k) clusters13. Starting with (k) centroids randomly chosen from the gene set, the algorithm iterates between a partitioning step, which consists in assigning genes to the cluster corresponding to the nearest centroid, and a centroid update step, which is the computation of the average of the genes in each cluster, until a stable partition is obtained.
After this step, other biological analyses were made, such as Gene Set Enrichment analysis and Gene Ontology over-representation. Lastly, Hierarchical Bi-Clustering (HBC) was performed simultaneously on the genes and samples’ dimensions. HBC is a method based on HC which is widely used for gene expression data. This method simultaneously performs clustering over the samples and the genes, providing a final representation, usually as a heatmap with the samples’ and genes’ dendrograms on the top and side, allowing the visualisation of gene expression patterns for different samples.
As previously mentioned, the RNA-seq data from Branco et al.11, which is available through Gene Expression Omnibus19 (GEO) Accession Number GSE116574, was reanalysed in this work. This data set is composed of samples from cardiac differentiation using the temporal modulation of the Wnt signalling pathway under a standard 2D condition and as 3D aggregates. Briefly, after a short hiPSC expansion phase (2–3 days), differentiation is induced at day 0 through the activation of the Wnt signalling pathway, with the small molecule CHIR99021. The cells were kept in an RPMI medium supplemented with B-27 without insulin. From day 3 to day 5 the Wnt pathway was inhibited by supplementing the medium with the IWP-4 small molecule. From day 7 onwards the culture medium was changed to RPMI medium with insulin.
For both conditions, samples were collected at ten different time points, as represented in Fig. 1B. Each condition has 3 replicates (except day 12 of the 2D differentiation that was performed in duplicate) which together with the hiPSCs seeding samples from which each both 2D and 3D differentiation started, accounts for a total of 62 samples.
Hierarchical agglomerative clustering to establish a whole transcriptome sample hierarchy, replicate variability and information loss
In the previous work by Branco et al.11, differential expression analysis between the 2D and 3D differentiation protocols and culture days was made using a set of 249 cardiac-related genes. Considering that the relationship between samples, based on the entire transcriptome, is not yet known and that differential expression analysis of such high number of genes is too laborious, hierarchical clustering (HC) was used to organize and distinguish groups of similar samples as well as to establish the overall hierarchy created with all the samples.
We started by using HC to generate dendrograms considering the average of the replicates for each time point and the whole gene set, as shown in Fig. 2A. The dendrogram reveals a clear separation of data into two main clusters, one containing samples until day 5 (orange cluster) and another gathering samples from day 7 (green cluster).
By analysing the dendrogram in more detail it was possible to see that the hiPSCs sample is more similar to the 2D day 0 sample than to 3D day 0, which was already noted before11, and that day 1 samples are quite similar between culture conditions. Days 3 and 5 group first according to the differentiation protocol (2D or 3D) and then cluster with the previous samples (days 0 and 1). The green cluster is organized temporally, but with complete segregation between 2D and 3D samples. This segregation between 2D and 3D samples is an indication that the culture formats are diverging in gene expression patterns. For both differentiation protocols, there is a cluster shift on day 7 possibly associated with IWP4 supplementation, which promoted cardiac specification and, consequently, a major change in gene expression patterns. Moreover, day 7 samples are isolated from the rest of the cluster which can also be hypothesised to happen due to the change to an insulin-containing medium afterwards.
The same HC methodology was followed considering only the subset of cardiac-related genes selected in the previous work by Branco et al.11. As shown in Fig. 2B, two clusters were also generated, divided by day 7, but, in contrast to the results obtained with the whole gene set, in the cluster with the early samples (orange), days 3 and 5 are not arranged according to the differentiation protocol but temporarily, which is an indication that considering the subset of genes here used, the samples from the same day are more similar to each other regardless of the culture format. Similarly, in the green cluster, day 7 samples are grouped which did not occur considering the entire gene set. Moreover, the remaining 3D samples form a subcluster, just then merged with the 2D samples. This hierarchy may suggest that 3D samples are at a later stage of the differentiation, as far as the cardiac-related genes are concerned. These differences confirm that there is information loss when considering such a small amount of genes.
In the previous examples, the average of the time point replicates was used to generate the dendrograms. However, since variability between replicates is commonly observed in hiPSC differentiation experiments, a dendrogram was generated with each replicate individually plotted (Fig. 2C). Despite the increased visual complexity of the dendrogram, two clusters divided by day 7 are obtained, increasing evidence that between days 5 and 7 major shifts in the transcriptome are seen and the impact of the Wnt signalling pathway inhibition can be advanced as the major driver for the transcriptional changes occurring during this differentiation.
Analysing the orange cluster, it is possible to observe that the replicate samples from the initial time points are similar, but from day 3 onwards, different time points and protocols are interspersed, probably representing the gradual increase in variability as the differentiation progresses. On the green cluster, there is still a separation between 2D and 3D samples but not directly into two sub-clusters, as in Fig. 2A, being no longer possible to create clusters with all samples from each differentiation method. With a closer look, it is possible to notice that some samples start to be grouped not according to the differentiation stage, but by the replicate, which is unexpected and suggests that different gene expression patterns may be observed in the replicates.
To complement the dendrogram results, a PCA analysis was also performed (Fig. 2D). By analysing PC1 vs. PC2 it is possible to understand that the samples are displayed in a temporal fashion according to PC1 and divided between 2D and 3D protocols by PC2. There is some grouping by replicate (example 2D replicate 2 marked in orange), but it is not possible to retrieve the grouping by replicate observed in the dendrogram. Overall, the first 2 principal components do not allow to infer the hierarchy observed with hierarchical clustering, most probably because just around 50% of all variation is observed with these components while the dendrogram C represents the calculated distances between each sample considering all genes.
Self-organising map portraits reveal different gene expression patterns throughout differentiation and between protocols
Hierarchical clustering on the dimension of the sample can provide information on the relationship between samples and even replicates. However, this approach does not identify which genes are responsible for such structure and does not extract biological information regarding the genes involved in the differentiation process. To analyse in more detail the changes in the transcriptome and to visualise this high dimensional data, a SOM algorithm was used. As performed in the HC analysis, initially, SOM portraits of the average of the time point replicates were plotted (Fig. 3A) and then individual sample portraits were created (supplementary figure 1). All SOM portraits are plotted on a relative scale in which for each sample, dark blue and red represent the lowest and highest expression values, respectively, in log Fold Change (logFC).
From the analysis of the average SOM portraits (Fig. 3A), it was possible to confirm the high similarity between the hiPSCs seeding and day 0 samples. However, this data representation highlights the differences between 2D_D0 and 3D_D0 samples, visually attributed to a group of metagenes in the lower-left corner of the portrait (black rectangle on days 0 portraits), which are not expressed on the 3D day 0 sample. Surprisingly, the portraits for samples from days 1 until 5, which form the first cluster observed previously (Fig. 2A), are considerably different between culture formats. The major difference between these samples consists of the lack of expression of the genes in the lower-left corner of the grid (black rectangle) during the 3D protocol. Furthermore, the 3D protocol displays a higher expression of the genes in the upper right corner of the grid (black rectangle on portrait 3D_5) when compared with the 2D one. Overall these portraits raise the hypothesis that the culture format strongly influences the patterns of gene expression during the initial days of differentiation and the impact of such variation may be further studied. In both differentiation protocols, the transcriptomes converge, in the final time points studied, to similar states with an area of over-expression in the upper-left corner of the portraits (black rectangle on days 12 portraits). However, an over-expressed spot in the middle of the left edge of the 2D differentiation portraits (black arrow on portrait 2D_12) is also observed. This subset of genes represents a major difference between the end-stage of the two protocols. In the 2D protocol, the expression of this subset of genes starts around day 9. Although the 2D portraits seem more similar to the 3D ones at this stage, this may be causing the complete segregation between protocols later observed in the HC dendrograms (Fig. 2A, green cluster).
In order to better understand the variability among the individual time point replicates, SOM portraits for each replicate were generated (supplementary figure 1). It is possible to observe that until day 1 all replicates are essentially identical to each other, but, from day 3 onwards, much more variability is noticeable, as previously described for the dendrogram representation (Fig. 2C). In particular, on day 9 in the 2D protocol, the second replicate does not present the over-expression of the metagenes at the middle of the left edge of the grid (highlighted with a black arrow Fig. 3C). Moreover, for the subsequent days, this area is never over-expressed in opposition to the increase in expression for the other replicates. As for the 3D replicates, on day 15 of replicate 1, the same previously mentioned spot is expressed but with less intensity (highlighted with a black arrow on Fig. 3C).
To better understand the differences between replicates of the same protocol, the final percentage of cTNT(^+) cells for each replicate, which is a measure of the final efficiency of CM differentiation, was also considered (Fig. 3B). It is possible to notice that the replicates with lower differentiation efficiency are the ones with higher expression of the metagenes previously highlighted with a black arrow on the left edge of the portrait (Fig. 3A). Also, the second replicate of the 2D differentiation, which does not show expression of this group of metagenes, had a comparable differentiation efficiency, for instance to the first replicate from the 3D differentiation. However, due to the low number of samples, it cannot be stated that the expression of these genes is correlated with the lower efficiency of the process.
Self-organising map portraits accurately display commonly known genes for different stages of hipsc cardiac differentiation
To validate the SOM portraits, the location of commonly known genes that play a crucial role at different stages of hiPSC cardiac differentiation2 were plotted onto the SOM grid (Fig. 4). The Pluripotency marker genes, such as Oct4, Sox2 or Nanog, are located in the area of maximum expression for the samples containing cells in the pluripotent state (hiPSCs, 2D day 0 and 3D day 0).
Regarding the Mesoderm markers, it is interesting to see that some are located in the upper right corner of the grid and others in the lower part, near the “pluripotency area”. The distribution of the Mesoderm markers between the areas of over-expression for day 3 reveals that, although these genes are characteristic of the mesoderm stage, their expression profile is substantially different. TBXT is located in the area of over-expression shared with the pluripotency samples, indicating that the expression of this gene is rapidly decreasing afterwards, and may have started even before day 3. ROR2, which is located in the upper right corner of the grid on the contrary is not expressed before day 3 and sustains its expression for longer periods of time. This pattern was previously experimentally identified for a similar differentiation protocol20 and is a good indicator that the SOM is indeed able to accurately represent the subtle differences in expression patterns. Additionally, genes located in the same node and/or area of the grid could be further tested as alternative marker genes since they have the same expression pattern throughout the experiment. As for the Cardiac Mesoderm markers, they are located on three different edges of the grid, coincident with the over-expressed metagenes for day 5.
Likewise, the Cardiac Progenitor and Cardiomyocyte markers were also found in the area of over-expression for the final samples for both culture formats. However, the markers of CMs are spread in the area just expressed in the 3D samples at the end of the differentiation, corroborating the previous analysis by Branco et al.11 which reveals a higher degree of maturity of the 3D-generated CMs.
Biological interpretation and single gene expression dynamics
To study in more detail the over-expressed areas of the grid, the K-means algorithm was used to divide the metagenes into clusters. The resulting partition is presented in Fig. 5A where clusters considered to have a non-significant expression in any sample (low variability over time) are marked with white circles. HC dendrograms were recomputed to assess whether the genes contained in these clusters are sufficient to reconstruct the hierarchy observed using the whole geneset (Fig. 2A). In fact, as shown in supplementary figure 2 using the average of the replicates for each time point, it was possible to obtain the same clusters as the ones previously obtained for the whole geneset.
Gene Set Enrichment (GSE) analysis and Gene Ontology (GO) over-representation were done for the 13 significantly expressed clusters and a complete set of results can be found in supplementary tables 1 and 2 and a summary of the most relevant findings is plotted in Fig. 5B.
Clusters M and N, which are over-expressed at the end of the differentiation, presented an over-representation of genes from several ontologies related to cardiac development as well as an enrichment in gene sets as the hallmarks of myogenesis.
Cluster S, which is the main cluster expressed in pluripotent hiPSC samples, is enriched in gene sets over-expressed in stem cells, however, from the GO no such clear over-representation was seen. Clusters B and R, which are expressed in hiPSCs samples and at the beginning of the 2D differentiation but not in the 3D protocol, are correlated with genes regulating the translation process and ribosomal assembly. It was recently proposed that both the translational process and ribosome biogenesis and homeostasis are key elements in the regulation of stem cell characteristics21. The fact that the day 0 samples from the 3D protocol do not express this cluster also suggests that the 3D culture format is priming the cells to a more committed state.
For cluster G, which corresponds to the area of over-expression in the replicates which yielded a lower percentage of cTNT(^+) cells, it was found two genesets with genes over-expressed in the liver and also GO correlated with liver and kidney functions. These results allow hypothesising that endoderm differentiation may be occurring concurrently, leading to a reduction in CM differentiation efficiency. From the developmental perspective, this result is not surprising as both mesoderm and endoderm are derived from the mesendoderm which, in mammals, is formed by the activation of the Wnt/b-catenin pathway together with Nodal and Activin22. Furthermore, recent results from single-cell RNA-seq also revealed the presence of endoderm-like cells arising around day 5 of the differentiation23,24.
Since expression of the genes in this cluster was found to be correlated with the final efficiency of CM differentiation, the 10 genes with higher expression values (AFP, SERPINA1, AHSG, TTR, AMBP, TF, TM4SF4, FGB, APOC3, SPINK1) were retrieved and their expression through time plotted for each experiment (Fig. 5C and supplementary figure 3). As these genes were found to have an expression of up to 7 logFC at differentiation day 9, they may be further investigated as markers for early prediction of low cardiac differentiation efficiency.
The SOM portraits generated allowed to have a visualization of differences in the transcriptome throughout hiPSC differentiation using two different protocols. However, using only this representation, it is not straightforward to assign these differences to specific genes. HBC was then used to further investigate and visualize the transcriptomic data. Initially, a bi-cluster heatmap with the entire gene set and all the replicates was performed, as well as with the samples’ average (supplementary figure 4A). However, due to the high number of genes, it was not possible to extract meaningful information from these representations. A simplified bi-cluster heatmap was then generated, using only the genes from the K-means clusters previously considered relevant (supplementary figure 4B). Using this subset of genes, different expression patterns became more evident and it was possible to clearly identify a set of genes which are highly expressed only in the 2D differentiation samples, replicates 1 and 3, from day 9 to 20.
Nonetheless, this representation is still difficult to interpret and therefore we decided to further investigate the SOM results by generating a heatmap (supplementary figure 5A) considering only the SOM clusters G, M, N and S, which have the higher expression values and their biological significance already clarified. Also, to facilitate the interpretation, in supplementary figure 5 the samples were not clustered. Using this simplified heatmap, a 5-cluster cut was done, providing a cluster where the highly expressed genes for 2D replicates 1 and 3 can be isolated. Supplementary figure 5B presents a magnification of this cluster and, remarkably, 9 out of the 10 genes previously proposed as potential early markers of differentiation efficiency are here depicted and marked in light green. With this comparative analysis between SOM and HBC, it can be concluded that in contrast to SOM, HBC is not suitable for a whole-genome representation, but the use of both can be complementary since they provide visualizations with different resolutions of the gene expression patterns. While SOM portrays global gene expression patterns, in the form of metagenes, HBC can be used to generate heatmaps that allow the visualisation of single gene expression dynamics.