Evolution of translational control and the emergence of genes and open reading frames in human and non-human primate hearts – Nature Cardiovascular Research

Ethics statement

The tissue bank of the Biomedical Primate Research Centre (BPRC) complies with Article 4 (Principle of replacement, reduction, and refinement) and Article 47 (Alternative approaches) of Directive 2010/63/EU on the Protection of Animals Used for Scientific Purposes. The BPRC has been accredited by the Association for Assessment and Accreditation of Laboratory Animal Care since 2012. The generation of the iPSC line was accredited by the Ethikkommission der Universität Ulm (approval number: Antrag-Nr. 19/12).

Collection of NHP tissues

Snap-frozen samples of heart LVs of four rhesus macaques (Macaca mulatta, three male and one female individuals) and five chimpanzees (Pan troglodytes, three male and two female individuals) were provided from the NHP tissue bank at the BPRC in Rijswijk, The Netherlands (Supplementary Table 1). The tissues from rhesus macaques were obtained from BPRC animals that were euthanized for welfare and ethical reasons. The chimpanzees’ tissues were derived from animals residing at Safari Park Beekse Bergen and on which necropsies were done at the BPRC in The Netherlands.

Tissue processing

Snap-frozen NHP cardiac tissues were powdered using a pre-cooled mortar and pestle while constantly introducing liquid nitrogen15. This procedure was performed on days when humidity levels were below 30%. For each sample, we ideally reserved approximately 100 mg of powdered tissue (with a minimum of 50 mg) for ribosome profiling. Additionally, a smaller amount of 5–10 mg was gathered for the purpose of total RNA extraction and subsequent RNA-seq analysis.

Human and NHP iPSC reprogramming

Dermal fibroblasts from human and NHP individuals with no known cardiovascular disorder or phenotype were reprogrammed to iPSCs using CytoTune 2.0 iPS-sendai virus (Thermo Fisher Scientific). A female human subject was recruited as part of a study at the Charité, Berlin, with broad consent given for human iPSC line generation and use for research purposes. A skin punch biopsy was taken, from which dermal fibroblasts were outgrown and cultured for several passages. Similarly, dermal fibroblasts were outgrown from skin punch biopsies taken from a female macaque (courtesy of Göttingen DPZ), a male gorilla (Rostock Zoo) and a male chimpanzee (Magdeburg Zoo). After colony outgrowth and isolation, clonal iPSC lines with stable uniform morphology were established under feeder-free conditions on Geltrex (Thermo Fisher Scientific) in E8 (Thermo Fisher Scientific) or mTeSR1 (STEMCELL Technologies) pluripotency media (Supplementary Table 1). This step could be performed directly with human (Homo sapiens), chimpanzee (P. troglodytes) and gorilla (Gorilla gorilla) iPSCs; however, rhesus (M. mulatta) iPSCs required an intermediate culture step on mitomycin C-inactivated mouse embryonic fibroblasts in an adapted pluripotency maintenance medium, based on StemMACS iPS-Brew (Miltenyi Biotec) supplemented with 1 µM IWR-1 (Selleck Chemicals) and 0.5 µM CHIR99021 (CHIR, Bio-Techne), referred to as ‘StemMACS UPPS’ from Stauske61. Several passages after this, the rhesus iPSC colonies were non-enzymatically dissociated and transferred to feeder-free conditions on Geltrex-coated plates in StemMACS UPPS for further culture, displaying a stable epithelial cell morphology and growth rate.

Consistent expression of pluripotency factors was confirmed in all human and NHP iPSC lines (see ‘Immunocytochemistry’ subsection; Extended Data Fig. 2a).

IPSC-CM differentiation

Between three and five targeted iPSC-CM differentiation experiments were carried out for each species’ iPSC line. For human, gorilla and chimpanzee iPSCs, an adapted version of established protocols modulating WNT signaling using small-molecule inhibitors could be employed62,63, with minor adjustments in GSK3β-inhibitor CHIR99021 (CHIR, Bio-Techne) concentration and starting cell density needed between lines to achieve efficient differentiation (Fig. 2a and Supplementary Table 1).

In brief, iPSCs were enzymatically dissociated using TrypLE (Thermo Fisher Scientific) or non-enzymatically (gorilla iPSCs) and replated at 2–5 × 105 cells per cm2 (or 1:6 split ratio for gorilla) on Geltrex-coated plates in pluripotency medium containing 10 μM Rho-kinase inhibitor Y-27632 (Selleck Chemicals). After 3–5 d, differentiation to mesoderm was initiated by exchanging for RPMI-based medium containing B-27 minus insulin (Thermo Fisher Scientific) and CHIR (differentiation ‘day 0’). After 3 d, CHIR was replaced with 5 μM WNT-inhibitor IWR-1 (Selleck Chemicals) to drive cardiac specification. After 7 d, CM maintenance medium comprising RPMI plus B-27 was used for all CMs, except during selection; after approximately 10 d of differentiation, iPSC-CMs underwent enrichment via metabolic selection, replacing glucose with sodium lactate (Sigma-Aldrich) for 2–3 d64.

For rhesus iPSCs, we found the initial stages of the differentiation protocol insufficient for generating iPSC-CMs with suitably high efficiency (data not shown), so we employed a strategy from Stauske61. This adapted approach included 5 ng ml−1 BMP4 and 9 ng ml−1 Activin A (both PeproTech) together with CHIR during mesodermal differentiation (Supplementary Table 1).

After approximately 2 weeks of differentiation, iPSC-CMs were dissociated using 10× TrypLE (Thermo Fisher Scientific) and replated at 2.5–3 × 105 cells per cm2 onto fresh Geltrex-coated plates. After approximately 4 weeks of differentiation, iPSC-CMs were collected for quality control (see ‘Immunocytochemistry’ and ‘Flow cytometry’ subsections) or snap frozen as cell pellets.

After 4 weeks, human and NHP iPSC-CMs expressed classic cardiac markers, such as the transcription factor NKX2-5 and TNNT2 (cardiac troponin T), and variable expression of ventricular-specific MYL2 (myosin light chain 2), higher in gorilla iPSC-CMs and lower in rhesus (Fig. 2a and Extended Data Fig. 2b).

Immunocytochemistry

Cells were plated onto chamber slides (ibidi) pre-coated with Geltrex. After several days, cells were fixed with 4% paraformaldehyde (Science Services), washed with PBS and incubated with blocking solution containing 10% normal donkey or goat serum (Abcam) and 0.1% Triton X-100 (Sigma-Aldrich) in PBS with 0.05% Tween 20 (Sigma-Aldrich). Primary and secondary antibodies were used for labeling (Supplementary Table 8), followed by DAPI for nuclear counterstain. Microscopy was performed using a DMi8 microscope fitted with a K5 camera and processed using LASX software (all from Leica).

Flow cytometry

Cells were harvested using 10× TrypLE (Thermo Fisher Scientific), stained for viability using VioBility Blue (Miltenyi Biotec), fixed and permeabilized using FoxP3 staining buffer kit (Miltenyi Biotec) and stained with conjugated antibodies (Supplementary Table 8). Population gates were set for whole single live cells, with positive gates for expression targets set based on isotype antibody staining. Expression was analyzed using a MACSQuant VYB flow cytometer (Miltenyi Biotec) with gating and plots visualized using FlowJo 10 software. A figure exemplifying the gate strategy is included in Supplementary Data 3.

Ribosome profiling

In this study, we performed ribosome profiling on primate iPSC-CM cells and LVs, following established protocols described by our group and others15,65,66,67. In brief, a total of 9–15 million cells were lysed on ice for 10 min in 1 ml of lysis buffer. The lysis buffer composition was as follows: 20 mM Tris-Cl (pH 7.4), 150 mM NaCl, 5 mM MgCl2, 1% Triton X-100, 0.1% NP-40, 1 mM dithiothreitol, 10 U ml−1 DNase I, 0.1 mg ml−1 cycloheximide and nuclease-free water. To ensure complete lysis of the cells and to dissociate cell clumps, the lysate was homogenized through repeated pipetting and multiple passes using a syringe equipped with a 21-gauge needle. After homogenization, the samples were centrifuged at 20,000g for 10 min at 4 °C to pellet cellular debris. For each sample, 200 µl (cells) or 400 μl (tissue) of the lysate was digested with RNase 1 (Biozym, N6904K) and purified using MicroSpin Sephacryl S-400 HR columns (Sigma-Aldrich, GE27-5140–01), and 2 μg of footprints was obtained. Ribosomal RNA removal was performed using the Ribo-Zero Gold rRNA Removal Kit (Human/Mouse/Rat) (Illumina, MRZG12324). First, the footprints were excised from a Novex 15% TBE Urea PAGE gel (Thermo Fisher Scientific, EC68852BOX), and the 3′ ends were treated with T4 PNK (Biozym, P0503K) to enable ligation to a pre-adenylated linker. Subsequently, the RNA was reverse transcribed using Reverse Transcriptase (Biozym, ERT12925K), and the resulting cDNA was purified using a Novex 10% TBE Urea PAGE gel (Thermo Fisher Scientific, EC68752BOX). To circularize the fragments, Circligase I (Biozym, CL4115K) was employed, followed by PCR amplification and size selection using a Novex 8% TBE PAGE gel (Thermo Fisher Scientific, EC62152BOX). To ensure quality control, the size distributions of the ribosome profiling libraries were assessed using a Bioanalyzer 2100 (Agilent, 5067-1511) with a high-sensitivity DNA assay (Agilent, 5067–4626). The libraries were then multiplexed and sequenced on a NovaSeq 6000 Illumina platform, generating single-end 1 × 51-nucleotide reads.

Stranded mRNA sequencing

To isolate polyA+ RNA, 3 million cells per iPSC and iPSC-CM sample, or 5–10 mg of snap-frozen and powdered tissue per adult heart sample, matching the identical samples previously processed for ribosome profiling, were treated with TRIzol reagent (Invitrogen, 15596018). The quality of the isolated RNA was assessed by measuring the RNA integrity number (RIN) using the RNA 6000 Nano assay on a Bioanalyzer 2100. High-quality RNA samples with an average RIN of 7.4 (tissue) and 9.7 (cells), respectively, were subjected to polyA purification to generate mRNA sequencing (mRNA-seq) libraries. The library preparation followed the TruSeq Stranded mRNA Reference Guide, using 1,000 ng of total RNA as input. Subsequently, the libraries were multiplexed and sequenced on an Illumina NovaSeq 6000 platform, generating paired-end 2 × 101-nucleotide reads.

Generation of de novo transcript assemblies

First, we clipped Ribo-seq and RNA-seq datasets for residual adaptor sequences using cutadapt68. Ribo-seq datasets were additionally filtered for common contaminants: mitochondrial, rRNA and tRNA sequences.

To build a complete cardiac transcriptome of human, NHP and rodent species, we mapped the generated iPSC-CM and LV paired-end mRNA-seq datasets to the respective Ensembl genomes: GRCh38/hg38, Pan_tro_3.0/panTro5, gorGor4, Mmul_10/rheMac10, GRCm38/mm38 and Rnor6.0/rn6. Due to the absence of iPSC-CM datasets for mouse and rat, we additionally included the cardiac datasets from Cardoso-Moreira et al.9 for transcriptome assembly. We used the standard STAR (version 2.7.3a)9 settings with the following modified parameters: –outSAMtype BAM SortedByCoordinate, –outFilterMismatchNmax 4, –outFilterMultimapNmax 20, –alignSJDBoverhangMin 6, –alignSJoverhangMin 500, –outFilterType BySJout, –limitOutSJcollapsed 10000000, –limitIObufferSize = 300000000and –outFilterIntronMotifs RemoveNoncanonical. We next included the XS tags with strand information to the generated BAM files, and we ran StringTie (version 2.1.1)69 with parameters -M 0.5 -j 3 -f 0.2 to reconstruct de novo transcriptome for each sample, guided by a reference transcriptome annotation (Ensembl release 98 (ref. 70)). Genes were defined as independent loci composed of either annotated or assembled overlapping transcripts, regardless of the protein-coding status of the loci. We classified genes as ‘protein coding’, ‘lncRNA’ and ‘pseudogene’ based on defined gene biotypes in Ensembl, whereas the rest of genes not mapped to any Ensembl annotations were defined as ‘unannotated’. Later, all novel annotations were merged into species-specific consensus GTFs using StringTie merge, with parameters -m 200 -F 0.5 -f 0.2 and cuffcompare. Transcripts assigned to codes ‘p’, ‘s’ and ‘c’ were excluded, and any unnanotated transcript with full overlap to an annotated CDS was re-annotated as protein coding.

Dataset alignment to de novo transcript assemblies

We next remapped all Ribo-seq and RNA-seq datasets to the generated de novo transcript assemblies. The first read pair of each paired-end RNA-seq dataset was trimmed to 29 base pairs (bp) to process and map RNA-seq reads with the exact same parameters as the Ribo-seq data. We then ran STAR using similar settings as indicated before (see ‘Generation of de novo transcript assemblies’ subsection) but readjusting the maximum number of mismatches: -outFilterMismatchNmax 2. For subsequent analysis, we filtered out transcripts and genes with low RNA-seq expression values (fragments per kilobase of transcript per million mapped reads (FPKM) < 0.5 using pooled RNA-seq for iPSC-CMs and LVs).

Detection of ORFs predicted by Ribo-seq

For each Ribo-seq sample, we searched for ORFs in each species transcriptome running ORFquant71 (version 1.00, cognate ORFs) and PRICE72 (version 1.0.3b, cognate and non-AUG ORFs), with standard settings and considering only uniquely mapped reads. Significant ORFs by sample in any of the two software (adjusted P < 0.05) were pooled into unique lists per species, collapsing in-frame ORFs where multiple instances of ORF isoforms shared ≥90% of the sequence, prioritizing cognate ORFs starting with AUG and selecting the longest ORF as representative. Later, each ORF was assigned to the most highly expressed transcript isoform based on the pooled RNA-seq FPKM levels calculated by StringTie. Any gene harboring at least one translated ORF or CDS was defined as translated.

The total numbers of pooled ORFs are displayed in Extended Data Fig. 4b. A high fraction of identified ORFs outside of CDS sequences (~61%) started with non-AUG initiation codons (Extended Data Fig. 4e) and exhibited similar characteristics compared to AUG cognate ORFs (Extended Data Fig. 4f,g).

Defining sets of actively transcribed genes and replicated ORFs

To maximize the robustness of each inter-species comparison, we defined a set of ‘actively transcribed’ genes for each species based on their mean RNA-seq expression levels in iPSC-CMs and/or LV, measured in FPKM. Genes were included if they had a mean RNA-seq expression level of at least 1 FPKM and were expressed (FPKM ≥ 0.5) in at least three independent replicates. For pseudogenes and unannotated genes, only unique reads were considered for quantification. The total number of actively transcribed genes and their respective biotypes is represented in Fig. 1b.

In addition, we established a set of ‘replicated’ ORFs for each species by selecting ORFs that were partially or totally predicted in at least three samples and were encoded by actively transcribed genes. We then defined CDSs by selecting ORFs that overlapped in-frame annotated CDSs, considered canonical and assigned then to individual annotated genes when several were fused (Supplementary Data 1). The rest of cases were termed small ORFs (sORFs) because they are unannotated, and the vast majority (~91%) are shorter than 150 amino acids. The total numbers of replicated sORFs are illustrated in Fig. 5a. ORF biotypes were defined following a slightly modified version of the Phase I GENCODE nomenclature18 (Supplementary Data 2). Lastly, we used RiboseQC to extract P-site counts, which were employed for quantifying in-frame P-site counts associated with each ORF.

Homology searches and identification of robust orthologs

We used the ‘pyliftover’ package (version 0.4.1) to identify counterpart sequences of exons from actively transcribed genes and replicated ORFs (sORFs and canonical CDSs) in the primate and rodent genomes considered in this study. Homologous regions were identified when a gene or ORF was fully mapped to unique counterpart regions on the same strand and chromosome in another species. Among the total CDSs as well as sORFs with young translation (see ‘Evolutionary classification of sORFs’ subsection), only one from humans (a young sORF), four from chimpanzees (two CDSs and two young sORFs) and 25 from rhesus (21 CDSs and four young sORFs) could not be aligned to the genomes of any other primate species (Extended Data Figs. 1e and 4c). Additionally, we assessed the mappability of 29-bp-long reads to identify ORF counterpart regions that could not be quantified due to the short length of Ribo-seq reads. Reassuringly, most ORFs exhibit fully mappable sequences (Extended Data Figs. 1e and 4c), with only 16 instances in humans (four CDSs and 12 young sORFs), 11 in chimpanzees (four CDSs and seven young sORFs) and 24 in rhesus (12 CDSs and 12 young sORFs) being unmappable in other analyzed primate species. These cases are described in Supplementary Table 9.

When the region was aligned to other species, we defined homologous genes and ORFs as those with the longest overlap to the identified homologous region, including all transcribed genes and all predicted ORFs in the other species to maximize homolog detection. In addition, we ran BLASTn73 (version 2.14.0+) against the same sequences of transcripts and ORFs per each species, identifying sequence homologs with a minimum E-value cutoff of 10−4. Genes or ORFs showing significant alignment to homologous sequences, as identified by pyliftover and/or BLAST, were considered as preserved in the respective species.

Next, we established two sets of robust orthologs in primates and mammals. This selection was based on identifying genes and ORFs with consistent homolog matches using both pyliftover and BLAST across all species considered in the lineage. Specifically, we included all genes and ORFs for which the same homolog match could be found as actively transcribed or with replicated translation in both pyliftover and BLAST analyses.

Evolutionary classification of genes

We defined three evolutionary gene ages based on their presence or absence across species hearts. Genes identified uniquely in human, chimpanzee or rhesus macaque genomes that lacked detected homologs in other species were categorized as young and species specific. Additionally, genes exhibiting transcription unique to both humans and chimpanzees were defined as young and hominini specific. The transcription of the remaining genes was classified as preserved, forming a heterogeneous group with varying levels of conservation. It is important to note that we classified genes only for the three primate species in which we generated both iPSC-CM and LV data.

It should be noted that, due to the search for homologs being conducted by examining all transcribed genes in other species, certain hominini-specific genes may meet the expression thresholds in only one of the two species (human or chimpanzee). This discrepancy explains the difference in the number of identified hominini-specific genes between humans and chimpanzees.

Evolutionary classification of sORFs

As done with the genes, we established three evolutionary sORF ages based on their presence or absence across species. The unique translation of sORFs identified in human, chimpanzee or rhesus macaque genomes that lacked detected homologs in other species and without evidence of translation in their counterpart sequences (fewer than 10 Ribo-seq reads in all pooled samples) was categorized as young and species specific. sORFs with translation unique to both humans and chimpanzees were defined as young and hominini specific. The remaining sORFs were defined as preserved at the level of translation.

Later, we employed a modified version of our previously published methodology21 to identify young sORFs with recent structures. For this, we reanalyzed the pairwise liftover alignments generated to trace the evolutionary origins of young sORFs. If the sequence of a sORF was truncated by at least 70% of the in-frame sequence in the counterpart regions of the other species analyzed, we classified the structure of the sORF as ‘de novo. Otherwise, the structure of the sORF was defined as ‘intact’. For the cases where the region could not be aligned in other species, the mode of evolution could not be accurately assessed due to limited available data, because the ancestral region evolved at the same time as the ORF sequence. Consequently, these cases were classified as ‘orphan’.

Analysis of differential gene and ORF expression

We defined sets of differentially transcribed and translated (based on RNA-seq counts or in-frame P-sites) genes and ORFs using DESeq2 (version 1.26.0)74.

For the calculation of differential TE, we evaluated the differences in ribosome occupancies (in-frame P-sites) between each primate species and the two other primates with DESeq2 and used RNA-seq counts as an interaction term into the statistical model of DESeq2.

Genes and sORFs that were uniquely differentially transcribed or translated in iPSC-CMs or prenatal LVs were defined as ‘prenatal’, whereas genes and sORFs that were uniquely differentially transcribed or translated in adult or postnatal LVs were defined as ‘postnatal’.

Calculation of TEs and their variances

TE was calculated as the ratio of DESeq2-normalized in-frame Ribo-seq P-sites and total RNA-seq counts in LV samples. Specifically, these ratios were computed for the primary CDS of each gene. We focused on cases where the CDS and its corresponding host gene were identified as robust orthologs within the mammalian (4,524 genes) or primate (8,238 genes) lineage. Most of the aligned CDS homologs displayed similar lengths across primate species (Extended Data Fig. 1f). However, in instances where multiple CDS sequences were encoded by a single gene, we selected the CDS region with the lowest variance in nucleotide length across species. Raw counts were based on in-frame P-sites and RNA-seq counts and adjusted by the CDS length to ensure comparability across different species prior to performing normalization across all samples and species. Genes that exhibited median values of 0 or 3 or more 0 count values across replicates within a specific species were excluded from global TE calculations.

We also determined the TEvar among five species (human, chimpanzee, rhesus, mouse and rat) for the aforementioned gene sets. To achieve this, we employed a multi-step approach. Initially, we computed the inter-species variances of the estimated individual TE for each species. This involved calculating the variance of TE estimates across all individuals in each species and subsequently dividing them by the number of samples. By averaging these normalized inter-species variances across the five species, we derived the average inter-species variance. Later, we evaluated the intra-species variance of average TE values for the five species. Subsequently, we subtracted inter-species variances from each calculated intra-species variance value. This resulted in a TEvar score for each gene under consideration. Notably, a higher gene TEvar score implies higher level of inter-species variability in TE corrected by each intra-species variance.

To assess potential effects resulting from species alterations in gene structure within TEvar, we computed variances in nucleotide length for both the coding sequence (CDSvar) and its associated untranslated region (UTRvar) across the five species. We later divided the genes in four groups corresponding to the four quantiles of CDSvar and UTRvar. To evaluate the correlation between TEvar and the levels of natural selection acting on the CDSs, we downloaded pre-computed ratios of non-synonymous to synonymous substitutions (dN/dS) from Ensembl70.

Gene markers and sORFs evaluated in CRISPR–Cas9 screens

We curated different lists of genes and sORFs in our study:

  • A list of cardiac gene markers based on mining several published publications (Supplementary Table 10). These genes represent cardiac maturity markers, sarcomere components and cardiac-specific genes.

  • OXPHOS components were extracted based on Gene Ontology annotations75.

  • Cardiac cell type markers from the Human Heart Cell Atlas, retrieving all genes that showed more than 90% of total cardiac expression restricted to one cell type76,77.

  • Essential genes were retrieved from a list of proteins that were shown to be essential for survival of human cells in a previous study27. By doing mutagenesis in haploid human cells, the authors identified genes required for optimal fitness. We only included genes that affected cell viability in both tested cell lines (KMB7 and HAP1).

  • Four sORF lists from CRISPR–Cas9 screens evaluating cell survival in human cell lines37,38,39,40.

Calculation of tissue enrichments

Tissue enrichment indexes were based on the tau (τ) metric78:

$$tau =mathop{sum }limits_{i=1}^{n}frac{(1-{{xt}}_{i})}{n-1}{{;}}quad{{xt}}_{i}=frac{{x}_{i}}{{max }_{1 < i < n({x}_{i})^{prime} }}$$

where n is the number of tissues and xi is the expression of a gene/ORF in a tissue.

For the calculation of the tissue enrichment, we used mean expression values across each tissue for (1) the developmental resource of Cardoso-Moreira et al.9 and (2) the GTEx project34. Owing to the known pervasive expression of a high number of evolutionarily young genes in the testis8,25, we excluded this tissue from tissue enrichment calculations. Any gene and ORF with τ above 0.75 were considered as tissue enriched.

Finally, we defined a list of genes with recent cardiac-enriched expression (unique to humans) by selecting cases with τheart above 0.75 uniquely in humans.

Genome-wide gene–gene correlations

Spearman correlations were computed to assess co-regulation among all transcribed genes in both human and rhesus, as we did previously15,79. DESeq2-normalized counts were used, focusing on pairwise complete observations across cardiac developmental RNA-seq datasets9. We required a minimum of 20 (human) or 10 (rhesus) samples with at least one raw count to be included in the calculation. Subsequently, for each young gene, we aggregated all co-regulated protein-coding genes (with an absolute Spearman’s rho > 0.5) and searched for enriched functionalities (see ‘Functional enrichment analysis’ subsection).

Analysis of ClinVar genetic variants

We evaluated genetic variants that can be important for disease in ClinVar45 (database March 2024) by quantifying how many of the genes and sORFs of interest overlapped at least one germline variant classified as pathogenic or likely pathogenic. Variants involved in cardiomyopathy, atrial fibrillation, ventricular tachycardia or cardiovascular phenotype were considered as cardiac disease variants.

Prediction of motifs and stable structures in microproteins

We predicted motifs and stable structures of microprotein sequences encoded by sORFs using InterProScan80 (version 5.69-101.0) and ESMFold81 (version 1). In parallel, we retrieved motifs and structures of shuffled amino acid sequences of all microproteins encoded by sORFs. Any structure with an average pLDDT higher than 80 was defined as potentially well folded. Due to limitations in ESMFold, only sORFs shorter than 400 amino acids were predicted.

CRISPRi-mediated knockdown with single-cell RNA-seq output

We focused on two candidate genes: LINC01405 and SRP14-AS1. These are two examples of cardiac genes where both the gene and the encoded sORF emerged recently in primates; their expression is upregulated in iPSC-CMs compared to adult LVs; and they are dysregulated in DCM. See Supplementary Data 4 for a full description of the method.

Functional enrichment analysis

Functional Gene Ontology75, KEGG82 and CORUM83 enrichment on specific groups of genes was done with gProfiler2 (ref. 84)(version 0.2.0), with default parameters and selecting full sets of transcribed or translated genes as custom backgrounds based on selected cutoffs for the different analyses, such as expression level, replicability or robust conservation.

Statistics and reproducibility

Statistical analysis was performed using R (http://r-project.org).

Differentially expressed and translated genes were identified with DESeq2. Cases with an absolute fold change equal to or higher than 1.5 and an adjusted P value lower than 0.05 were defined as differentially expressed/translated. For differential gene expression analysis in single-cell RNA-seq libraries, we used Wilcoxon rank-sum test considering differentially expressed genes with false discovery rate (FDR) < 0.05 and log2 fold change > 0.5. Comparisons between proportions of genes were analyzed with two-sided Fisherʼs test, unless otherwise stated. P < 0.05 was considered significant, and P values were adjusted using FDR corrections for several tests. Comparison between groups of ORFs in Extended Data Fig. 4k was evaluated by Wilcoxon rank-sum test; P < 0.05 was considered significant. Comparisons between multiple groups in Fig. 1c were made by ANOVA; P < 0.05 was considered significant. For Fig. 5l and Extended Data Fig. 4m, we compared whether microproteins encoded by sORFs presented a significantly higher proportion of motifs and stable structures compared to shuffled sequences by running one-sided Fisher’s exact tests; P < 0.05 was considered significant, and P values were adjusted using FDR corrections for several tests.

We evaluated the significance of the calculated TEvar observations using two approaches displayed in Extended Data Fig. 1g: (1) resampling, randomly assigning each of the analyzed samples to each species and recalculating the TEvar in each iteration (10,000 iterations) and (2) data downsampling, by selecting three (n = 10,000 iterations) and four (this is the lowest number of samples in a considered species, n = 10,000 iterations) samples per species to recalculate TEvar and evaluate the robustness of the calculated TEvar in our original data after modifying the number of samples per species. Both approaches were applied without sample replacement. One-sided P values obtained using both approaches represent the fraction of resampled or downsampled observations (out of 10,000) with medians higher than the real observed values and were adjusted using FDR corrections for the five groups of tested genes (essential, cardiac-enriched, sarcomere, OXPHOS complex IV and OXPHOS complex V). Resampling analyses indicated that the observed high TEvar of OXPHOS complex IV subunits was borderline significant (P = 0.047), whereas the TEvar of OXPHOS complex V subunits showed higher significance (P = 0.004). Further downsampling of the dataset over 10,000 iterations, including the same number of individuals per species, confirmed that the observed high TEvar for both complexes was significantly robust to differences in the number of biological replicates per species, although these values exhibited high variability when the number of samples was lower (complex IV downsampled to three samples: 95% confidence interval (CI95%) (0.51, 1.56); complex IV downsampled to four samples: CI95% (0.64, 1.13); complex V downsampled to three samples: CI95% (0.93, 1.83); and complex V downsampled to four samples: CI95% (1.04, 1.52)).

Sample numbers are indicated in the figure panels or legends. Reproducibility was ensured by having a minimum of three biological replicates per species and tissue. For iPSC-CMs, we generated three differentiation replicates (five for human) of the same individual. No statistical method was used to pre-determine sample size, and no data were excluded from analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.