Complex regulatory networks influence pluripotent cell state transitions in human iPSCs – Nature Communications

Subject information

We used hiPSC lines from 219 individuals (Supplementary Data 1) recruited as part of the iPSCORE project18,20,68. There were 140 individuals belonging to 40 families composed of two or more subjects (range: 2–14 subjects) and 134 genetically unrelated individuals (some individuals were in the same family but only related by marriage). The iPSCORE_ID (i.e, iPSCORE_4_1) indicates family (4) and individual number (1). Each subject was assigned a Universal Unique Identifier (UUID). Sex, age, and self-reported race/ethnicity were reported at the time of enrollment of each subject. We previously estimated the ancestry of each participant by comparing their genomes to those of individuals in the 1000 Genomes Project (KGP)69. Recruitment of these individuals was approved by the Institutional Review Boards of the University of California, San Diego, and The Salk Institute (project no. 110776ZF).

Molecular data sources

We used the following datasets from the iPSCORE resource:

  • Previously published18,20 50X WGS (Illumina; 150-bp paired-end) generated from the blood or skin fibroblasts of the 219 individuals in this study. Of the 219 individuals, 127 had both RNA-seq and ATAC-seq libraries, 86 only had RNA-seq and 6 only had ATAC-seq.

  • 150 ATAC-seq libraries generated from 143 hiPSC lines (collected after expanding in mTeSR1 medium with ROCK inhibitor on Matrigel) from 133 individuals, Supplemental Data 6;

  • Previously published18,20,68 RNA-seq data from 213 hiPSC lines (collected after culturing in mTeSR1 medium on Matrigel) from 213 individuals, Supplementary Data 2.

  • Previously published18 RNA-seq data from 3 hiPSC lines (collected after expanding in mTeSR1 medium containing ROCK inhibitor on Matrigel) from 3 individuals. Technical triplicates were collected for each hiPSC line.

Of note, both the RNA-seq and ATAC-seq data were generated from the same hiPSCs in the iPSCORE collection. However, the RNA-seq data and ATAC-seq data were generated from different passages of the hiPSC lines cultured under different experimental conditions. The RNA-seq data was generated from earlier passage ROCK inhibitor-naïve hiPSCs and the ATAC-seq data was generated from later passage hiPSCs that had been cultured with ROCK inhibitor. All iPSCORE resource molecular data is publicly available: RNA-seq at dbGaP under the accession code “phs000924 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000924.v4.p1]”; ATAC-seq at GEO under the accession code “GSE203377 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE203377]”; gVCF files at dbGaP under the accession code “phs001325 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/ study.cgi?study_id=phs001325.v5.p1]”. The TOBIAS transcription factor binding predictions, the TMM-normalized counts, the kinship matrix, and the reference narrow peak file have been deposited on “Figshare [https://figshare.com/projects/iPSC_coaccessibility/136585]”.

We used the following publicly available datasets:

  • 28 RNA-seq samples corresponding to either the formative state (early postimplantation epiblast-like; EPE) or the general population1.

  • 6 ATAC-seq samples corresponding to either the formative state (GCTM-2highCD9highEPCAMhigh) or the primed state (GCTM-2mid-CD9mid)1.

  • ChIP-seq data for 18 Transcription Factors from ENCODE37,41,70 (Supplementary Data 9);

  • Seven hiPSC cell state gene sets obtained and curated from 3 published studies:

    • Two gene sets for the primed and formative states1

    • Four gene sets for epiblast, conventional PSCs, naïve PSCs, primitive endoderm29

    • One gene set for 8-cell like cells3

  • Fifty-four sets of fetal tissue-specific ATAC-seq peaks from the Descartes; Human Chromatin Accessibility During Development Atlas53

  • ChromHMM chromatin states in the ROADMAP Epigenomics hiPSC-18 line33

Human iPSC (hiPSC) generation

Generation of the 219 hiPSC lines has previously been described in detail18. Briefly, cultures of primary dermal fibroblast cells were generated from a punch biopsy tissue, expanded for approximately 3 passages and cryopreserved. In batch, the fibroblasts were thawed and plated at a density of 250 K cells/well of 6-well plate and infected with the Cytotune Sendai virus (Life Technologies) per manufacturer’s protocol to initiate reprogramming. The Sendai infected cells were maintained with 10% FBS/DMEM (Invitrogen) for Days 4–7 until the cells recovered and repopulated the well. These cells were then enzymatically dissociated using TrypLE (Life Technologies) and seeded onto a 10-cm dish pre-coated with mitotically inactive-mouse embryonic fibroblasts (MEFs) at a density of 500 K/dish and maintained with hESC medium, as previously described71. Emerging hiPSC colonies were manually picked after Day 21 and maintained on Matrigel (BD Corning) with mTeSR1 medium (Stem Cell Technologies). Multiple independently established hiPSC clones (i.e. referred to as lines) were derived from each individual, which were cultured typically to passage 12, and at least ten stock vials were frozen from each cell line. hiPSC pellets collected from each cell line were frozen in RTL plus buffer (Qiagen) and used for total RNA isolation. Sendai virus clearance typically occurred at or before P9 and was not detected in the hiPSC lines at the P12 stage of cryopreservation. A subset of the hiPSC lines was evaluated by flow cytometry for the expression of two pluripotent markers: Tra-1-81 (Alexa Fluor 488 anti-human, Biolegend) and SSEA-4 (PE anti-human, Biolegend). Pluripotency was also examined using PluriTest-RNAseq18. This iPSCORE resource was established as part of the Next Generation Consortium of the National Heart, Lung, and Blood Institute and is available to researchers through the biorepository at WiCell Research Institute (www.wicell.org; NHLBI Next Gen Collection). For-profit organizations can contact the corresponding author directly to discuss line availability.

iPSCORE data generation and processing

Generation of RNA-seq data for 213 hiPSC lines

We analyzed previously published20 RNA-seq data from 213 hiPSC lines from 213 iPSCORE individuals (Supplementary Data 2). Briefly, using the AllPrep DNA/RNA Mini Kit (QIAGEN) we extracted total RNA from lysed pellets frozen in RLT Plus buffer (collected after culturing hiPSC lines in mTeSR1 medium on Matrigel), assessed quality to make sure RNA integrity number (RIN) was 7.5 or greater, prepared libraries using the Illumina TruSeq stranded mRNA kit and sequenced with 100 bp paired end reads on HiSeq2500 (~22 M reads/per sample).

Generation of RNA-seq for 3 hiPSC lines expanded with ROCK inhibitor

While the RNA-seq and ATAC-seq data were generated from same hiPSC lines, there were batch effects because of differences in culture conditions (ie, the ATAC-seq data were generated from hiPSCs that had been cultured with ROCK inhibitor). Therefore, the following data were used to examine the correlation between ATAC-seq peak co-accessibility network modules and gene co-expression network modules (see below: Identifying Associations Between Gene and Regulatory Networks). As previously described18, for three individuals (iPSCORE_2_2, iPSCORE_2_3, and iPSCORE_2_9) one vial of a frozen hiPSC line was thawed into mTeSR1 medium containing 10 μM Y27632 (ROCK Inhibitor), plated onto one well of a Matrigel-coated six-well plate and incubated overnight. The media was replaced daily with mTeSR1 until the hiPSCs were visually estimated to be 80% confluent. The hiPSCs were then expanded by passaging from one well onto three wells of a six-well plate using Versene (Lonza) in mTeSR1 medium containing 5 μM ROCK inhibitor. When cells reached 80% confluency, the three wells of hiPSC line were individually dissociated using Accutase (Innovative Cell Technologies Inc.) in mTeSR1 medium containing 5 μM ROCK inhibitor, collected, counted, and 1 × 106 cells frozen in RLT plus buffer (three technical replicates per hiPSC line). Total RNA was extracted using AllPrep DNA/RNA Mini Kit (QIAGEN), RNA quality was assessed based on RNA integrity number (RIN) and nine libraries (three per hiPSC line) were prepared using the Illumina TruSeq stranded mRNA kit and sequenced with 100-bp paired end reads on a HiSeq2500.

RNA-seq data processing

We obtained transcript per million bp (TPM) as previously described20. Briefly, FASTQ files were aligned to the hg19 reference genome using STAR 2.5.0a72 and Gencode V.34lift3773 with parameters: outFilterMultimapNmax 20, –outFilterMismatchNmax 999, –alignIntronMin 20, –alignIntronMax 1000000, –alignMates-GapMax 1000000. We sorted and indexed the BAM files using Sambamba 0.6.774 and marked duplicates using biobambam2 (2.0.95) bammarkduplicates75 and then re-indexed. To quantify gene expression (TPM), we used RSEM v1.2.2076 with the following parameters: —rsem-calculate-expression, –bam, –num-threads 16, –no-bam-output, –seed 3272015, –estimate-rspd, –paired-end, –forward-prob 0. We identified 16,110 expressed genes on autosomes (TPM ≥ 1 in at least 20% of the 213 hiPSC lines) using rowQuantiles from the matrixStats R package (version 0.52.2).

RNA-seq gene expression transformation

To normalize gene expression and usage across samples, we performed inverse normal transformation using normalize.quantiles (preprocessCore package) and qnorm functions in R, in order to obtain mean expression = 0 and standard deviation = 1, as we previously described19,77.

Generation of ATAC-seq data for 143 hiPSC lines

150 ATAC-seq libraries were prepared from 143 hiPSC lines (for 7 lines we prepared two libraries) from 133 individuals (5 individuals each had two or three independent clones). The hiPSCs were harvested on Day 0 of our previously published large-scale iPSC-derived cardiovascular progenitor cells study prior to the initiation of WNT activation78. In brief, for each of the 150 ATAC-seq libraries, one vial of a frozen hiPSC line was thawed into mTeSR1 medium containing 10 μM ROCK Inhibitor (Sigma), plated on one well of a six-well plate coated with Matrigel, and incubated overnight. When hiPSCs were visually estimated to be 80% confluent they were passaged 1–2 times using Dispase II (2 mg/ml; Gibco/Life technologies) in mTeSR1 and plated onto Matrigel coated plates. The hiPSCs were then expanded by passaging using Versene (Lonza) in mTeSR1 medium containing 5 μM ROCK inhibitor. Finally, the hiPSCs were dissociated using Accutase (Innovative Cell Technologies Inc.) in mTeSR1 medium containing 5 μM ROCK inhibitor, collected, counted, and frozen as nuclear pellets of 2.5 × 104 cells for the ATAC-seq assay.

We performed ATAC-seq on the 150 hiPSC samples using a modified version of the Buenrostro et al.79 protocol to: (1) selectively permeabilize the nuclear envelope of the cells without disrupting mitochondrial membrane, and (2) increase the number of reads that do not contain sequences spanning multiple nucleosomes. Frozen nuclear pellets were thawed on ice and tagmented in total volume of 25 μl in permeabilization buffer containing digitonin (10 mM Tris-HCl pH 7.5, 10 mM NaCl, 3 mM MgCl2, 0.01% digitonin) and 2.5 μl of Tn5 from Nextera DNA Library Preparation Kit (Illumina) for 45–75 min at 37 °C in a thermomixer (500 RPM shaking). Inclusion of digitonin, which is a mild detergent capable of selective permeabilization of cholesterol-rich bilayers80, permeabilized the nuclear membranes (containing 20–35% cholesterol and cholesterol esters80,81 without disrupting the mitochondrial membranes (0.5–3% cholesterol)82. To enrich for reads that span a single nucleosome, we included a double size selection step during purification using AMPure XP DNA beads (Beckman Coulter). This step enriched for reads containing inserts under 140 bp in length (shorter than the 146 bp wrapper around a single nucleosome). To eliminate confounding effects due to index hopping, all libraries within a pool were indexed with unique i7 and i5 barcodes. Libraries were amplified for 12 cycles using NEBNext® High-Fidelity 2X PCR Master Mix (NEB) in total volume of 25 µl in the presence of 800 nM of barcoded primers (400 nM each) custom synthesized by Integrated DNA Technologies (IDT).

Sequencing of ATAC-seq

The 150 ATAC-seq libraries were batched and sequenced with 100-bp paired-end reads on a HiSeq4000. To improve overall read depth 113 ATAC-seq libraries were sequenced twice (the second time with 150-bp paired end reads) resulting in 263 FASTQ files (Supplementary Data 6). The 263 FASTQ files were aligned to the hg19 reference genome with STAR72 with the following flags: —outFilterMultimapNmax 20, —outFilterMismatchNmax 999, –outFilterMismatchNoverLmax 0.04, —seedSearchStartLmax 20, —outFilterScoreMinOverLread 0.1, outFilterMatchNminOverLread 0.1. We then filled in mate coordinates using samtools fixmate, marked duplicates using samtools markdup83. We used samtools merge to combine BAM files from the same library and indexed the merged BAM files with samtools index. We then filtered poorly mapped reads (MAPQ < 20%), duplicates and reads less than 38 bp and greater than 2000 bp with samtools view to obtain reads passing filters (PF). We re-indexed the filtered merged BAM file with samtools index (version v6.7)83. After quality control, all BAM files from the same library were combined (1–2 paired files per hiPSC line) resulting in 150 BAM files (Supplementary Data 6).

ATAC-seq peak calling and quality control

MACS2 v.2.2.784 was used to call broad peaks for all 150 ATAC-seq libraries individually with settings: –nomodel –shift −100 –extsize 200 -f BAMPE -g hs –broad. To obtain one single set of high-quality peaks, 34 reference libraries were selected from unrelated individuals with 20–35 million reads passing filters, 60,000–90,000 broad peaks, and 100–225 bp mean fragment size to establish discrete regions of accessible chromatin (i.e., a reference set of ATAC-seq peaks) (Supplementary Fig. 5B, Supplementary Data 6). MACS2 v.2.2.7 was used to call narrow peaks on the 34 reference libraries jointly with settings: -f BAMPE -g hs -t –nomodel –shift −100 –extsize 200 –narrow. Narrow peaks were filtered by MACS2 score (<100), resulting in 136,333 peaks (including 132,225 autosomal peaks and 4108 peaks on sex chromosomes). For each of the 150 individual ATAC-seq samples coverage on the 136,333 peaks were obtained using featureCounts in Subread package v1.5.085. Next, counts were trimmed mean of M value (TMM)-normalized for each peak across all 150 individual samples using the cpm function from edgeR v3.30.386.

ATAC-seq peak transformation

TMM values for all 136,333 ATAC-seq peaks were inverse normal transformed using the normalize.quantiles from the preprocessCore package and qnorm functions in R, in order to obtain mean expression = 0 and standard deviation = 1 for each peak across all 150 ATAC-seq samples.

Identification of 56,978 autosomal reference peaks

To reduce the computational burden in downstream analyses, we selected a subset of 56,978 reference peaks based on their MACS2 peak score and enrichment in active chromatin. We filtered ATAC-seq peaks that overlapped blacklisted regions87, peaks in ZNF genes & repeats, heterochromatin, quiescent chromatin states from the ChromHMM model for iPSC-1833, and peaks on sex chromosomes. We binned the remaining 87,659 ATAC-seq peaks into 20 MACS2 score quantiles each containing 4383 peaks, where quantile 20 consisted of the peaks with the highest scores. We then examined the enrichment of the 87,659 ATAC-seq peaks in iPSC-18 ChromHMM chromatin states by MACS2 score quantile (Supplementary Fig. 5C). We created 20 bed files containing the coordinates of the peaks in each quantile and calculated their enrichment (Odds Ratio) in each of the 12 chromatin states for iPSC-18, using bedtools fisher. The higher quantiles tended to be enriched for active (TssA, TssAFlnk, TxFlnk), bivalent (TssBiv, BivFlnk, EnhBiv), and repressed polycomb (ReprPC, ReprPCWk) chromatin. Based on these observations, we calculated pairwise co-accessibility for the 56,978 ATAC-seq peaks in MACS2 quantiles 8–20 (see Identifying co-accessible peaks section below), which alleviated the computational burden by eliminating 7 × 109 pairwise tests between peaks with low accessibility and peaks in inactive regions of the genome. For downstream analyses, we resolved a single chromatin state for each peak by intersecting the maximum summits with the iPS-18 chromatin states (described below in Collapsing into 5 hiPSC chromatin states) (Supplementary Data 8).

Cellular deconvolution of pluripotency states using gene expression signatures

From GEO (GSE119324)1 we downloaded 28 paired RNA-seq FASTQ files, including, fourteen from five unique human embryonic stem cells (hESCs) sorted (GCTM-2highCD9highEPCAMhigh) for the formative (EPE) population and fourteen files for five unique samples generated from the primed population (unsorted). We aligned and filtered the 28 FASTQ files, with STAR 2.5.0a72 and RSEM v1.2.2076 as described above in RNA-seq data processing section. We performed differential expression analysis between the formative (EPE) and general population RNA-seq samples using DESeq2 (v.1.34) R package88 and created a signature matrix containing 300 of the most differentially expressed genes (150 upregulated in the GCTM-2highCD9highEPCAMhigh cells and 150 downregulated relative to the unfractionated cells; Supplementary Data 3). We deconvoluted the 213 hiPSC RNA-seq samples by supplying the signature matrix and TPM expression matrix as input to the CIBERSORT.R script21.

Covariates for co-expression and co-accessibility analyses

To perform the gene co-expression analysis, we used the following covariates: (1) sex; (2) age; (3) hiPSC passage number; (4) number of properly paired reads; (5) 20 genotype principal components to account for global ancestry; and (6) kinship matrix. All covariates are available in Supplementary Data 12.

To perform the ATAC-seq peak co-accessibility analysis, we used the following covariates: (1) sex; (2) age; (3) hiPSC passage number; (4) number of reads passing filters; (5) the mean fragment size; (6) the number of broad peaks; (7) the ratio of 100 bp reads to 150 bp reads; (8) 20 genotype principal components to account for global ancestry; and (9) kinship matrix. All covariates are available in Supplementary Data 1 and Supplementary Data 6.

Sex

To account for sex-dependent chromatin, we assigned binary values to each sex and included it as a covariate.

Biological and technical covariates

We scaled age, hiPSC passage number, number of reads passing filters, and mean fragment size for each library to the mean across libraries and included these normalized values as covariates.

Broad peaks

As described above in the ATAC-seq Peak Calling and Quality Control section, we used MACS2 v.2.2.7 to call peaks on all 150 ATAC-seq libraries independently to assess sample quality. We normalized the number of broad ATAC-seq peaks for each sample to mean across all 150 samples.

Read length ratio

As described above in the Sequencing of ATAC-seq section, several ATAC-seq libraries were sequenced multiple times with 100 or 150 bp paired end reads and merged. To account for the different proportions of read lengths in merged libraries, we included the ratio of reads passing filters from 100 and 150 bp sequencing runs for each library as a covariate.

Genotype principal component analysis

We previously performed principal component analysis (PCA) on WGS variants to determine the global ancestry of each individual in this study19. Briefly, we used the genotypes of 1,634,010 SNPs that had allele frequencies between 30% and 60% in the 1000 Genomes Phase 3 Project and genotyped in both iPSCORE and GTEx. We merged the VCF files from 1000 Genomes, iPSCORE, and GTEx, and performed PCA using the pca function in plink 1.90b3x89. The top 20 genotype principal components used as covariates to account for global ancestry for all 219 individuals in this study can be found in Supplementary Data 1.

Kinship matrix

We included a kinship matrix generated for a previous iPSCORE study19 as a random effects term to account for the genetic relatedness between individuals. The matrix was constructed using the kinship function in plink 1.90b3x89 and the same set of 1,634,010 SNPs employed in the genotype PCA.

Gene co-expression analyses

We leveraged the previously published RNA-seq dataset of 213 hiPSC lines from iPSCORE individuals18,20,68 to identify gene networks that are active in stem cells. We integrated and curated gene sets of defined pluripotency cell states from external sources1,3,18,29 to annotate gene networks with stem cell states.

Identification and characterization of gene co-expression

To identify pairwise correlations between the 16,110 expressed autosomal genes across the 213 hiPSC lines, we performed a gene co-expression analysis using a Linear Mixed-effects Model (LMM) with the lmekin function in the coxme R package v.2.2-1724, which incorporates a kinship matrix to control for random effects from genetic relatedness. We included normalized values for age, hiPSC passage number, sex, number of properly paired reads, and the top 20 PCs from ancestry as fixed effect covariates (see above: Covariates for co-expression and co-accessibility analyses).

$${Y}_{{ij}}={beta }_{k}{Y}_{{ik}}+mathop{sum }limits_{m=1}^{M}{gamma }_{m}{{{{rm{PC}}}}}_{{im}}+mathop{sum }limits_{p=1}^{P}{gamma }_{p}{C}_{{ip}}+{u}_{i}+{epsilon }_{{ik}}$$

(1)

Where ({Y}_{{ij}}) is the normalized expression value for gene j in sample i, ({Y}_{{ik}}) is the normalized expression value for gene k in sample i, ({beta }_{k}) is the effect size (fixed effect) of gene k, ({{{{rm{PC}}}}}_{{im}}) is the value of the mth genotype principal component for the individual associated with sample i, ({gamma }_{m}) is the effect size of the mth genotype principal component, M is the number of genotype principal components used (M = 20), ({C}_{{ip}}) is the covariate of the pth covariate for sample i, ({gamma }_{p}) the effect size of the pth covariate, P is the number of covariates used (P = 5), ({u}_{i}) is a vector of random effects for the individual associated with sample i defined from the kinship matrix, and ({epsilon }_{{ik}}) is the error term for individual i at gene k.

Construction of the genome-wide gene co-expression network (GN)

We created the genome-wide co-expression network (GN) using the graph_from_data_frame function from the igraph R package26 by assigning the 16,110 expressed genes as nodes, the 3,533,609 co-expressed genes (Bonferroni-corrected p-value < 0.05 and Effect Size > 0) as edges. We extracted the genome-wide degree of each gene using the degree function on the GN.

To identify gene co-expression network modules (GNMs), we applied the unsupervised Leiden community detection algorithm (from the igraph R package26) to the GN. We optimized module detection by analyzing 1,700 combinations of three parameters: resolution (range: 0–5), beta (range: 0–0.1), and n_iterations (range 5–50). For each combination of parameters, we permuted the nodes to confirm the Leiden algorithm was clustering co-expressed genes better than NULL background. We calculated modularity, fractions of genes in major GNMs (membership >100) and the number of modules for each combination of parameters. We selected the modules obtained by resolution = 2.25, beta = 0.05, n_iterations = 45, which exhibited a decent modularity (Q = 0.41) and a high fraction of genes captured by major GNMs (91.8%). Under these parameters, genes within the same GNM were significantly more co-expressed than genes randomly connected through permutation (P-value = 0). As there is no consensus on modularity thresholds or network validation methods90, we used downstream analyses, such as co-expression and gene set enrichment for validation. For each gene, we calculated its intramodular degree (the number of co-expressed genes within the module), using the same functions described above, for each GNM independently, excluding inter-module edges (Supplementary Data 4).

Gene module identification using weighted gene correlation network analysis (WGCNA)

To benchmark our gene module detection approach (Supplementary Note 1), we processed the 213 RNA-seq samples using the standard workflow (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) for the WGCNA R package6, which cannot account for covariates and kinship. We used the WGCNA corPvalueStudent function to calculate the associations with the 24 covariates used in the co-accessibility LMM, the estimated formative fraction (Supplementary Fig. 2A), and binary annotations for samples from 6 families with 5 or more individuals (Supplementary Fig. 3). We then compared our 13 GNMs with the 17 WGCNA modules by calculating the enrichment of the 7 cell state gene sets (Fig. 1J, Supplementary Fig. 3).

PCA and UMAP analyses of GNMs

To assess whether genes within a GNM had similar expression profiles across hiPSCs, we performed PCA and UMAP analyses on the 213 RNA-seq samples. We first identified the most interconnected genes within each GNM by ranking the intramodular degree for each GNM independently. Biological networks are scale-free27 and follow the Pareto Principle28, which states that 20% of the nodes are responsible for 80% of a network’s connectivity, therefore we defined intramodular Pareto genes as the top 20% interconnected genes within each GNM. We performed a PC analysis on the inverse normal transformed TPM expression of the intramodular Pareto genes in the 13 major GNMs (membership >100 genes) with the prcomp function in base R. We performed UMAP dimensionality reduction using the umap function from the umap R package.

Calculation of GNM score

For each of the 213 RNA-seq samples we calculated a GNM score for each of the 13 GNMs by summing the inverse normal transformed TPM expression of the corresponding GNM-specific Pareto genes (i.e., each RNA-seq sample had 13 GNM scores).

Functional characterization: gene co-expression network modules

Calculation of GNM co-expression enrichment

To assess whether genes within each GNM were more co-expressed with each other across the 213 hiPSC lines than with genes in different GNMs, we enumerated the number of co-expressed genes shared between all pairwise combinations of GNMs. We then performed a Fisher’s Exact test to calculate the enrichment of genes showing co-expression within each GNM, using the genes co-expressed between each set of paired GNMs as background.

GNM enrichment in stem cell states

We determined if the GNMs were enriched for expressing marker genes from three published studies describing 7 hiPSC subpopulations representing different pluripotency cell states including: (1) 199 8-cell like cell (8CLC)-associated genes from Mazid et al.3; (2) 123 primitive endoderm (PrE)-associated genes, 248 epiblast-associated genes, 175 trophectoderm genes; 96 Naïve-associated PSCs genes from Stirparo et al.29; and (3) 266 EPE-associated genes and 452 general population-associated genes that we identified by reprocessing data from Lau et al.1 (See Cellular deconvolution of pluripotency states using gene expression signatures). We performed a Fisher’s Exact test to calculate the enrichment of hiPSC subpopulation or cell state associated genes in each GNM, using the genes in the remaining 12 major GNMs as background.

Cellular deconvolution of pluripotency states using ATAC-seq peak accessibility signatures

We downloaded 6 ATAC-seq samples that were performed on human embryonic stem cells (hESCs) that had been sorted for the formative (GCTM-2highCD9highEPCAMhigh) and the primed (GCTM-2mid-CD9mid) cell states from GEO (GSE147338)1. We aligned and filtered the FASTQ files, with STAR 2.5.0a72 as described in the Sequencing of ATAC-seq section. MACS2 v.2.2.7 was used to establish a reference set of narrow peaks on the 6 ATAC-seq samples simultaneously, using the parameters; -f BAMPE -g hs -t –nomodel –shift −100 –extsize 200 –narrow. For each of the 6 individual ATAC-seq samples, read counts in the 193,147 reference peaks were obtained using featureCounts in Subread package v1.5.085. We performed differential accessibility analysis on the counts using DESeq2 (v.1.34) R package88. We used bedtools intersect -r 0.25 to identify 938 formative-associated peaks that overlapped iPSCORE ATAC-seq peaks with a 25% reciprocal overlap (Supplementary Data 8). We used the ATAC-seq peaks that were differentially accessible in the GCTM-2mid-CD9mid hiPSC subpopulation1 to annotate 2,981 non-formative associated peaks using the same approach. We then deconvoluted the 150 ATAC-seq samples using the CIBERSORT.R script21 with a signature matrix containing 200 of the most differentially accessible formative and primed peaks (Supplementary Data 7).

ATAC-seq co-accessibility analyses

We leveraged our newly generated 150 ATAC-seq samples to profile co-accessibility of open chromatin across the hiPSC epigenome and discovery regulatory networks that are active in different pluripotency cell states.

Identifying co-accessible peaks

Since pairwise calculations for all 132,225 autosomal ATAC-seq peaks would have been computationally intensive, requiring ~8.74 × 109 tests, we focused our analyses on the 56,978 reference peaks (see above: Identification of 56,978 autosomal reference peaks) which reduced the number of tests to 1.62 × 109. To identify pairwise correlations of accessibility between the 56,978 peaks across the 150 ATAC-seq samples, we performed a genome-wide analysis using a Linear Mixed-effects Model (LMM) with the lmekin function in the coxme R package v.2.2-1724, which incorporates a kinship matrix to control for random effects from genetic relatedness. The following covariates were included: (1) sex; (2) age; (3) hiPSC passage number; (4) number of reads passing filters; (5) the mean fragment size; (6) the number of broad peaks; (7) the ratio of 100 bp reads to 150 bp reads; (8) 20 genotype principal components to account for global ancestry (see above: Covariates for co-expression and co-accessibility analyses).

Formula #2:

$${X}_{{ij}}={beta }_{k}{X}_{{ik}}+{sum }_{m=1}^{M}{gamma }_{m}{{{{rm{PC}}}}}_{{im}}+{sum }_{p=1}^{P}{gamma }_{p}{C}_{{ip}}+{u}_{i}+{epsilon }_{{ik}}$$

(2)

Specifically, we utilized inverse normal transformed TMMs across the 150 ATAC-seq samples for each of the 56,978 peaks. In Formula #2 co-accessibility, ({X}_{{ij}}) is the normalized accessibility value for peak j in sample i, ({X}_{{ik}}) is the normalized accessibility value for peak k in sample j, ({beta }_{k}) is the effect size (fixed effect) of peak k and the remaining terms were consistent with co-expression variable. Of the 56,978 reference peaks, 47,761 were present in the 13 major RNMs. In total, we identified 8,696,814 pairs of co-accessible peaks (P-value < 5 × 10−8, Effect Size > 0).

Construction of the genome-wide regulatory co-accessibility network (RN)

We created the genome-wide co-accessibility network (RN) using the graph_from_data_frame function from the igraph R package26 by assigning the 56,978 accessible peaks as nodes, the 8,696,814 co-accessible peaks (P-value < 5 × 10−8, Effect Size > 0) as edges. We extracted the genome-wide degree of each peak using the degree function on the RN. To find the optimal Leiden community detection model, we followed the same approach as described in the Construction of the Genome-wide Gene Co-expression Network section. We selected modules obtained from the model using the following parameters; resolution = 2.5, beta = 0.09, n_iterations = 30, which had a modularity of 0.36 and 47,761 peaks (83.8%) in 13 major RNMs (membership ≥ 500 ATAC-seq peaks). For each combination of parameters, we permuted the nodes to confirm the Leiden algorithm was clustering co-expressed genes better than the null background. For each peak, we calculated its intramodular degree (the number of co-accessible peaks within the module), using the same functions described above, excluding intermodule edges and considering each RNM independently (Supplementary Data 8).

PCA and UMAP analyses of RNMs

To assess whether ATAC-seq peaks within an RNM had similar accessibility profiles across hiPSCs, we performed PCA and UMAP analyses on the 150 ATAC-seq samples. We first identified the most interconnected ATAC-seq peaks within each RNM by ranking the intramodular degree for each RNM independently. We performed a PC analysis on the inverse normal transformed TMM of the peaks with the top 10% intramodular degree (top 10%) from the 13 major RNMs (membership ≥ 500 peaks) with the prcomp function in base R. We performed UMAP dimensionality reduction using the umap function from the umap R package.

Calculation of ATAC-seq peak co-accessibility enrichment

To assess whether peaks within each RNM were more co-accessible with each other than peaks in different RNMs, we enumerated the number of co-accessible peaks shared between all pairwise RNMs. We then performed a Fisher’s Exact test to calculate the enrichment of co-accessibility between RNM pairs.

Calculation of RNM score

For each of the 150 ATAC-seq samples we calculated an RNM score for each of the 13 RNMs by summing the inverse normal transformed TPM matrix of the corresponding RNM-specific Pareto (top 20% intramodular degree) peaks (i.e., each ATAC-seq sample had 13 RNM scores).

Correlation of GNMs and RNMs

We examined the correlation between ATAC-seq peak co-accessibility network modules and gene co-expression network modules.

Identifying associations between gene and regulatory networks

The GNMs and RNMs were identified using overlapping hiPSC lines from the iPSCORE collection. However, the RNA-seq data and ATAC-seq data were generated from different passages of the hiPSCs cultured under different experimental conditions. The RNA-seq data was generated from earlier passage ROCK inhibitor-naïve hiPSCs and the ATAC-seq data was generated from later passage hiPSCs after culturing with ROCK inhibitor. Therefore, to annotate peaks with putative gene targets we: (1) identified genes expressed after culturing with 3 hiPSC lines with ROCK inhibitor (see above: Generation of RNA-seq for 3 hiPSC lines expanded with ROCK inhibitor), and then (2) annotated each peak with a single gene (distance <100 kb and highest expressed gene). Specifically, to identify candidate target genes for the 47,761 ATAC-seq peaks in the 13 major RNMs, we generated a bed file of the TSSs for the 16,110 autosomal genes expressed (TPM > 1 in at least one of the nine samples (triplicates of each hiPSC line) cultured with 5 μM ROCK inhibitor and performed bedtools closest to identify the closest TSS within 100 kb of each ATAC-seq peak. For ATAC-seq peaks that overlapped the TSSs of multiple genes, we calculated the maximum TPM expression across all 3 hiPSC lines cultured with ROCK inhibitor and annotated the ATAC-peak with the gene with the maximum expression. Finally, to identify associations between the GNMs and RNMs we only used genes: (1) expressed in both ROCK inhibitor-naïve hiPSCs and ROCK inhibitor-exposed hiPSCs, (2) in one of the 13 major GNMs, and (3) annotated as a putative target in one of the 13 major RNMs. In total, 12,078 unique genes corresponding to 32,327 peaks were used for the association test (Supplementary Data 8). We calculated the number of genes in common between all GNM-RNM pairwise combinations and performed Fisher’s Exact tests to calculate enrichments.

Functional annotation of hiPSC ATAC-seq peaks

To functionally characterize the hiPSC epigenome, we annotated the 47,761 ATAC-seq peaks in the 13 major RNMs with three epigenetic annotations: (1) hiPSC-specific chromatin states, (2) TF binding, (3) pluripotency cell state.

Collapsing into 5 hiPSC chromatin states

Using the single chromatin state annotation (described above in Identification of 56,978 autosomal reference peaks), we binned the 12 chromatin states into 5 collapsed states by molecular similarities (Supplementary Fig. 6B). “Active promoters” (TssA) were not collapsed, the collapsed “Enhancer” annotation consisted of peaks in enhancers (Enh), genic enhancers (EnhG), and flanking active promoters (TssAFlnk), “Bivalent Chromatin” consisted of bivalent promoters (TssBiv), bivalent enhancers (EnhBiv), and regions flanking bivalent chromatin (BivFlnk), “Transcription” consists of strong (Tx) and weak (TxWk) transcription, and flanking transcription (TxFlnk), and “Repressed Polycomb” consists of both polycomb states (ReprPC, ReprPCWk, Supplementary Data 8).

Prediction of transcription factor binding with TOBIAS

The TOBIAS algorithm35 leverages distribution of reads across the genome for a given sample, therefore to profile TF occupancy, we ran TOBIAS to predict binding at 187 motifs across all 136,333 ATAC-seq peaks (GEO: GSE203377, Supplementary Data 11). First, we identified 187 transcription factors with experimentally validated high-confidence motifs (Quality A and B) included in the HOCOMOCO V 11 collection36 that were expressed (TPM > 1 in >20% samples) in the 213 hiPSC lines. We used samtools 1.9 to merge, sort and index the 34 BAM files of the reference hiPSC samples (see above: ATAC-seq Peak Calling and Quality Control). We then ran TOBIAS ATACorrect on the merged reference BAM file to correct for cut site biases introduced by the Tn5 transposase within the 136,333 ATAC-seq peaks, using the following parameters: –genome hg19 fasta (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/) and –blacklist hg19-blacklist.v2.bed (http://github.com/Boyle-Lab/Blacklist/tree/hg19-blacklist.v2.bed.gz). Next, we calculated footprints scores with TOBIAS ScoreBigwig, using the narrowPeak file with the 136,333 ATAC-seq peaks (MACS2 score >100) for –regions. Finally, to identify the predicted transcription factor binding sites, we ran TOBIAS BINDetect with all 187 expressed TFs, using hg19 fasta file and narrowPeak file as the genome and regions, respectively. TOBIAS predicted a total of 2,349,030 TFBSs across all 187 motifs on 49,070 ATAC-seq peaks which represented 37.1% of the 132,225 peaks on autosomes. We annotated the 14,208 ATAC-seq peaks without a TFBS association as “Not Bound”.

Validation of predicted TF binding sites using experimental TF ChIP-seq data

To validate the TOBIAS TF binding predictions, we evaluated the concordance with binding profiles that have been experimentally validated via TF ChIP-seq37,41,70 (Supplementary Data 9). We obtained peaks from 19 sets of transcription factor ChIP-seq peak files for 18 TFs in H1 embryonic stem cells from ENCODE37,41,70 (Supplementary Data 9). The REST TF had two independent ChIP-seq experiments, which we merged using bedtools merge. We intersected the TOBIAS predicted TF binding sites for the 18 TFs with corresponding TF ChIP-seq peaks using bedtools intersect -loj. We performed a Fisher’s Exact test to calculate whether TOBIAS predicted TF binding sites were enriched in corresponding TF ChIP-seq peaks (Supplementary Fig. 7).

Collapsing into 92 transcription factor groups

Many different transcription factors have very similar binding motifs, and TOBIAS often predicted that TFs with similar motifs bound at the same site. For example, NANOG.0.A and PO5F1.0.A are the same length and have nearly identical position weight matrices (PWMs, Supplementary Fig. 8B); hence, the identity of the exact TF bound could not be resolved. To reduce the effects of motif sequence similarity, we collapsed the 187 motifs into TF groups by hierarchically clustering the Euclidean distances based on the overlap of bound sites generated by TOBIAS thresholds using cutree (Supplementary Data 10). We selected a 0.75 threshold and obtained 92 TF groups, of which 49 consisted of a single motif, 24 were composed of motifs from the same TF family (n = 73 TFs) and 19 consisted of multiple TFs from different families (n = 65 TFs), hereby referred to as complexes (Supplementary Fig. 8, Supplementary Data 10). The 187 TF motifs and their corresponding collapsed TF groups are in Supplementary Data 10.

Functional characterization of RNMs

Epigenetic feature enrichment within hiPSC regulatory networks

To molecularly characterize hiPSC regulatory networks, for each of the 13 RNMs we calculated enrichment of the formative and primed cell states, 5 collapsed chromatin states and 93 TF groups (including “Not Bound”). We performed a Fisher’s Exact test to calculate enrichment for each epigenetic annotation in the Pareto peaks for each major RNM by using the Pareto peaks for the 12 remaining major RNMs as background (Supplementary Data 12). We considered enrichments with a Bonferroni-corrected p-value < 0.05 significant.

hiPSC regulatory networks conserved in fetal cell types

Annotating hiPSC ATAC-seq peaks with REs with fetal cell type-specific peaks

To identify hiPSC ATAC-seq peaks that correspond to active chromatin in fetal tissue, we integrated single cell ATAC-seq peaks from 54 fetal cell types in the Descartes; Human Chromatin Accessibility During Development Atlas53. To obtain cell type-specific peaks, we used the Z-score corrected single cell ATAC-seq peaks for the 54 fetal cell types (n = ~9500 peaks on autosomes per cell type) (https://atlas.brotmanbaty.org/bbi/human-chromatin-during-development/). We performed bedtools intersect -f -r 0.25 on the 47,761 reference ATAC-seq peaks in the 13 major RNMs and identified 11,830 hiPSC ATAC-seq peaks with a 25% reciprocal overlap with at least one fetal cell type-specific peak. To calculate the enrichment of each RNM for each of the 54 fetal cell types, we performed Fisher’s Exact tests of the overlap with these 11,830 hiPSC ATAC-seq peaks, using the remaining 12 RNMs as background (Supplementary Data 13). To calculate the enrichment of hiPSC TFBSs in the 54 fetal cell type-specific ATAC-seq peaks, we performed Fisher’s Exact tests with bedtools fisher (Supplementary Data 14). For background, we merged the bed files for all 54 fetal cell type-specific peaks and calculated the number of base pairs in the merged peaks for each chromosome.

Allele-specific chromatin accessibility (ASCA) analyses

To identify regulatory variants that impact transcription factor binding we performed allele-specific chromatin accessibility (ASCA) using the 150 ATAC-seq samples from the 133 iPSCORE individuals.

Calculation of allele-specific chromatin accessibility (ASCA)

A VCF file from WGS data of 273 iPSCORE individuals68 was obtained from dbGaP (phs001325). We extracted SNPs in the 47,761 ATAC-seq peaks in the 13 major RNMs: (1) with minor allele frequency > 0.05 in all 273 iPSCORE individuals68 using bcftools view with parameters: –types snps, –f PASS, -q 0.05:minor; and (2) in Hardy-Weinberg equilibrium (p > 1 × 10−6) in the 133 iPSCORE individuals with ATAC-seq data using vcftools –hwe 0.000001. To increase the power to detect allele-specific chromatin accessibility, we performed phasing on these variants using the Michigan Imputation Server with the 1000 Genomes Phase 3 as a reference panel and converted them into the hdf5 format using snp2hd5 in WASP (version 0.2.2), as suggested by the original developers. We realigned the BAM files after WASP correction and applied the same filters as described above in the Sequencing of ATAC-seq section, except for removing duplicates. Specifically, we excluded poorly mapped reads (MAPQ < 20%), and reads less than 38 bp and greater than 2000 bp with samtools view. We then identified allele mapping bias at heterozygous sites in each sample using the WASP mapping pipeline with default parameters91 and duplicates were removed in a non-biased manner using the rmdup_pe.py script in WASP. Coverage of bi-allelic heterozygous variants was calculated using GATK ASEReadCounter (version 3.4-46) with parameters: -overlap COUNT_FRAGMENTS_REQUIRE_SAME_BASE, -U ALLOW_N_CIGAR_READ92. To maximize detection of ASCA, we aggregated allele read counts for each SNP across all heterozygous individuals and required: (1) a minimum of 5 heterozygous individuals were tested, and (2) at least 50 total reads mapped to the position, and (3) at least 10 reads mapped to each of the reference and alternate alleles. This resulted in 105,055 bi-allelic SNPs in 35,614 accessible ATAC-seq peaks used for ASCA analysis. We annotated 104,938 SNPs with their corresponding rsid from the gnomAD database (v2)93, and used the chromosome, position, reference, and alternate allele to annotate the 117 SNPs that were not in gnomAD. ASCA was determined using a two-sided binomial test with equal probability (probability = 0.5) for each allele being accessible. P-values were corrected using Benjamini-Hochberg. SNPs with an adjusted p-value < 0.05 were considered to display allele-specific chromatin accessibility (ASCA-SNPs) (Supplementary Data 15). We annotated the 6,323 ASCA SNPs with the RNMs associated with the ATAC-seq peak and the overlapping TF group(s). To identify RNMs and TF groups with enriched allelic imbalance fractions (AIFs), we performed a one-sided Mann-Whitney U test, using the 12 other RNMs or 22 significant TF groups (Fig. 5C) as background. To calculate the enrichment of ASCA SNPs in predicted TFBSs, we performed a Fisher’s Exact test using bedtools fisher.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.