Cell-type-specific consequences of mosaic structural variants in hematopoietic stem and progenitor cells – Nature Genetics

Single-cell-resolved mSV landscapes in HSPCs

To study mSV formation in HSPCs with cell-type-specific resolution, we analyzed cells from 19 healthy donors—ranging from newborn to 92 years of age—composed of n = 3 umbilical cord blood (UCB) and n = 16 bone marrow samples (Fig. 1a). We isolated viable CD34+ HSPCs (Supplementary Fig. 1) and cultured them for one cell division to enable Strand-seq (Methods). We obtained 1,133 high-quality single-cell libraries, with a mean of 432,282 uniquely mapped fragments per cell (Supplementary Fig. 2 and Supplementary Table 1). We used scTRIP14 to discover mSVs and whole chromosome aneuploidies (herein, collectively called ‘mosaicisms’), both in single cells and in subclones. Altogether, we identify 51 independently arisen mosaicisms, occurring in 16 of 19 (84%) donors (mean per donor = 2.7; range 0–8), including: 22 deletions, 12 duplications, 3 complex mSVs involving three or more breakpoints, 1 balanced inversion and 13 chromosomal losses (Fig. 1b and Supplementary Table 2). These mosaicisms affect 17 of 24 chromosomes and exhibit no chromosomal enrichment except for the Y chromosome, which was independently lost once or multiple times (leading to mosaic loss of Y (LOY)) in 8 of 12 (67%) male donors.

Fig. 1: HSPCs acquire a wide diversity of mSVs with age, without increased chromosomal instability.
figure 1

a, Cohort and experimental workflow used. For visualization purposes, here and below, strand- and haplotype-specific DNA reads are colored as follows: Watson (−) reads, orange; Crick (+) reads, blue; SNPs phased to haplotype 1 (H1), red circles; SNPs phased to haplotype 2 (H2), blue circles. b, Genome-wide karyogram of mSVs identified. Bars indicate the size of identified mSVs, color indicates the class and the relative size of the bubble linked to the middle of each mSV depicts its cell fraction (CF). Filled circles denote subclonal mSVs, while unfilled ones are singleton mSVs. Stars indicate bins significantly enriched for SCEs. c, Examples of singleton complex mSVs identified in the cohort. Copy-number estimates in affected regions are shown next to the respective segments. Black dotted lines represent mSV breakpoints. DNA reads are colored as described in panel a. IntDel, interstitial deletion; InvAmp, inverted amplification; terDel, terminal deletion; hetInv, heterozygous inversion. d, Singleton mSVs (n = 67 examined over 10 independent donors) are significantly larger, when comparing mean total affected base pairs, than subclonal mSVs (two-sided Wilcoxon rank-sum test; n = 10 examined over 6 independent donors; boxplots were defined by: minima, 25th percentile − 1.5× interquartile range (IQR); maxima, 75th percentile + 1.5× IQR; center, median; and bounds of box, 25th and 75th percentiles). e,g, Jitter plots depicting trends in the number of subclonal and singleton mSVs (e), and SCEs (g), across age (R, correlation coefficient calculated from the number of mSVs/SCEs given the donor age; P value is based on the two-sided significance test for the Pearson correlation coefficient, testing the hypothesis that it is 0.). f, Barplot of the incidence of singleton mSVs (y axis) in cells with or without subclonal mosaicism. Padj computed using two-sided Fisher’s exact tests. h, Results of the one-sided permutation test shuffling singleton mSV breakpoints (100-kb confidence interval) and SCE hotspots (200-kb bin) genome-wide for 10,000 permutations. The P value shows the significance of the difference between the permuted (black line) and actual (green) number of overlaps. i, Local Z-score of enrichment of overlaps between singleton mSV breakpoints and SCE hotspots. mSV breakpoints are shifted in windows of 100 kb to 10 Mb ±the bin in which an SCE hotspot is located, and the enrichment Z-score plotted each time. Additional permutations are plotted in Extended Data Fig. 1. j, Strand-seq data showing recurrent SCE and mSV co-occurrence at the SCE hotspot and FRA3B CFS in donor BM762. Haplotype-specific DNA reads and SNPs phased to H1 and H2 are colored as described in panel a. CN, copy number; Evob, observed overlaps; Evperm, expected overlaps; nPerm, number of permutations.

Investigating the subclonal composition of each mosaicism (Supplementary Table 2), we find 32 that are detected in only 1 cell (‘singleton mosaicism’), while the remaining mSVs constitute subclones with a cell fraction (CF) of 1.6–56.1% (‘subclonal mosaicism’). While subclones with sex chromosome losses (n = 12 LOY; n = 1 loss of X) reach CFs up to 46.4%, we do not observe whole autosomal aneuploidies. Focusing our further investigation on the 38 autosomal mSVs, we find notable differences between singleton and subclonal mosaicisms. First, 21 of 31 singleton mSVs (68%) exhibit terminal gains or losses, whereas all seven subclonal mSVs comprise interstitial alterations. Second, all complex mSVs are singletons. These include a breakage fusion bridge cycle-mediated14 mSV on chromosome 20p, and a terminal amplification of 1q (Fig. 1c). Third, singleton mSVs are nearly 18 times larger on average than subclonal mSVs (36.9 versus 2.1 megabase pairs (Mb), respectively; P = 0.0009, Wilcoxon rank-sum test; Fig. 1d). These data indicate that singleton mSVs, detected in 1 of every 43 HSPCs, bear the characteristics of de novo rearrangements (Supplementary Notes), suggesting that not all mSVs have the same potential to form appreciable subclones.

Analyzing these data with respect to donor age shows subclonal mSV expansions (Pearson’s correlation; R = 0.16; P = 1.1 × 10−7) and sex chromosome losses (R = 0.087; P = 0.0034) are associated with increased age (Fig. 1e), consistent with previous studies of mosaic CNAs2,11,18,19. Conversely, singleton mSVs are uncorrelated with age (R = 0.008; P = 0.79; Fig. 1e), suggesting continuous acquisition throughout life. Instead, we observe elevated numbers of de novo mSVs in cells already containing a subclonal mosaicism versus unmutated cells (Fisher’s exact test; 4.76% versus 1.96%; P = 0.038; Fig. 1f), suggesting that mSV-harboring cells may be ‘predisposed’ to accumulate further rearrangements.

Hotspots of mSV formation

Since DNA double-strand breaks (DSBs) can trigger structural rearrangements7,21, we examined the correlation between DSB acquisition and donor age. Strand-seq enables the detection of sister chromatid exchanges (SCEs) to allow systematic mapping of DSBs following homologous repair16. We identified 4,528 SCEs in our dataset (~4 SCEs per cell, consistent with previous reports22; Extended Data Fig. 1a). SCE abundance is inversely correlated with age (R = −0.089; P = 0.0027; Fig. 1g and Supplementary Fig. 3), with on average 4.6 SCEs per cell in individuals <60, compared with 3.9 SCEs per cell in donors >60 (Extended Data Fig. 1a). With HSPCs exhibiting largely stable acquisition of mSVs and SCEs regardless of age, these data suggest mSV formation occurs consistently throughout life.

Since structural rearrangements can be influenced by local sequence context7, we analyzed the genomic locations of SCEs and mSVs. The skewed distribution of SCEs along chromosomes is even more pronounced than that of mSVs (Fig. 1b and Supplementary Fig. 4): 6.67% (302 of 4,528) cluster into 20 SCE ‘hotspots’ (Methods, Extended Data Fig. 1b and Supplementary Table 3), of which five (25%) coincide with common fragile sites23 (CFSs) (Supplementary Table 4). Notably, SCEs overlap significantly with mSV breakpoints, with 3% (133 of 4,528) of all SCEs intersecting an mSV breakpoint (P < 0.0001, derived from 10,000 permutations; Fig. 1h,i, Extended Data Fig. 1c–f and Supplementary Table 3). While CFSs are enriched for both SCEs (P < 0.0002) and mSV breakpoints (Extended Data Fig. 1g,h), we identify additional SCE hotspots with similar enrichments not previously identified as CFSs (Fig. 1b,j, Supplementary Fig. 5 and Supplementary Tables 24). These loci may therefore represent mSV hotspots in HSPCs.

High-precision cell-typing using nucleosome occupancy profiles

To investigate the cell-type-specific impact mSVs exert on HSPCs, we utilized a two-pronged approach by coupling single-cell mSV analysis with nucleosome occupancy-based functional profiling20. First, to develop nucleosome occupancy-based cell-type classifiers20, we constructed single-cell nucleosome occupancy reference profiles for HSPCs derived from both UCB and bone marrow, covering eight distinct cell types: hematopoietic stem cells (HSCs), multipotent progenitors (MPPs), lymphoid-primed multipotent progenitors (LMPPs), common lymphoid progenitors (CLPs), plasmacytoid dendritic cells, common myeloid progenitors (CMPs), granulocyte–macrophage progenitors and megakaryocyte–erythroid progenitors (MEPs) (Fig. 2a and Supplementary Fig. 6). Using well-defined immunophenotypes (Supplementary Table 5 and Supplementary Fig. 6) we index-sorted HPSCs, and devised a preamplification-free single-cell MNase sequencing (scMNase-seq) protocol (Methods) to characterize the single-cell nucleosome occupancy profile for each cell type.

Fig. 2: scMNase-seq atlases for eight distinct HPSCs enable cell-type-aware single-cell multiomic profiling.
figure 2

a, Single-cell multiomic analysis workflow used to investigate mSVs in HSPCs with Strand-seq, which involves single-cell mSV discovery (scTRIP14), single-cell nucleosome occupancy (NO) analysis to infer mSV functional effects (scNOVA20) and cell-typing. GMP, granulocyte–macrophage progenitor; pDC, plasmacytoid dendritic cell. b, Construction of bone marrow and UCB-specific NO reference datasets to allow for cell-typing, based on subjecting HSPC cell types to index sorting, and scMNase-seq. Heatmap of single-cell NO of gene bodies of 305 single bone marrow HSPCs (UCB-based reference shown in Extended Data Fig. 2). The 819 signature genes depicted (rows) allow for discrimination between eight cell types (columns). Cells are grouped and color-coded by immunophenotype, determined by FACS. Example marker genes for each cell type are shown to the right of the heatmap, color-coded by the cell type. Differential NO of marker genes is represented by Z-scores. c, Comparison of inferred gene activity20 (act), based on inverse NO and publicly available gene expression (RNA sequencing) data24 for the representative classifier genes from the bone marrow scMNase-seq reference. Gene activity at gene bodies was inferred using the NO Z-score multiplied by (−1). Color and the dot sizes reflect the Z-score of inferred gene activity and RNA expression, respectively. d, Receiver operating characteristic (ROC) curve showing leave-one-out cross-validation of the bone marrow cell-type classifier’s performance using single-cell NO patterns. e, Unsupervised UMAP dimensionality reduction of the bone marrow HSPC scMNase-seq data. f, Supervised UMAP dimensionality reduction of the data in e, using the bone marrow cell-type classifier. AUC, area under the curve.

We obtained 480 high-quality scMNase-seq libraries (Supplementary Table 6): 305 from bone marrow-derived HSPCs (1 donor) and 175 from UCB-derived HSPCs (5 donors) (Supplementary Table 1). Using scNOVA, we identify 899 and 819 genes exhibiting cell-type-specific nucleosome occupancy in the UCB- and bone marrow-derived datasets, respectively (Fig. 2b and Extended Data Fig. 2a). The cell-type-specific gene activities inferred from nucleosome occupancy20 are broadly consistent with published transcriptomic datasets24 (Fig. 2c). For example, from the bone marrow-derived nucleosome occupancy dataset, we infer increased activity of the canonical marker MME (CD10) only in CLPs25, while HDC (involved in myeloid-lineage priming26) exhibits increased activity in CMPs. We also observe differential nucleosome occupancy at genes not previously reported as HSPC markers, such as SH2D4B and FAT3 (Supplementary Table 7a,b).

Harnessing these gene sets, we utilized nucleosome occupancy measurements as features for developing supervised cell-type classification models using partial linear square discriminant analysis (PLS-DA) (Fig. 2d–f, Extended Data Fig. 2, Supplementary Table 7a,b and Methods). These classifiers provide excellent accuracy, with an average area under the curve of 0.97 for bone marrow and 1.00 for UCB, as estimated by leave-one-out cross-validation (Fig. 2d and Extended Data Fig. 2). Uniform manifold approximation and projection (UMAP) of the latent variables corroborate the discriminatory power of these classifiers compared with unsupervised classification (Fig. 2e,f and Extended Data Fig. 2).

Subclonal mSVs commonly exhibit a lineage bias

Having constructed nucleosome occupancy references for HSPCs, we next performed cell-typing of each Strand-seq library (Fig. 3a and Supplementary Table 8). Tissue-level cell abundances detected based on nucleosome occupancy show high consistency with previous studies24,27,28,29, including an expanded HSC frequency in older bone marrow donors (from 8.1% to 80%; false discovery rate (FDR)-adjusted P (Padj) = 0.013; mixed linear model analysis), and a greater abundance of MPPs in UCB versus bone marrow24,27 (37% versus 0.1%; Padj = 2.45 × 10−33; Fisher’s exact test; Extended Data Fig. 2). Furthermore, the cell-type compositions seen in Strand-seq closely resemble estimates from orthogonal single-cell RNA sequencing (scRNA-seq) data generated from two donors (BM65, BM712), independently verifying our nucleosome occupancy-based classifiers (Fig. 3b and Supplementary Fig. 8).

Fig. 3: mSVs in HSPCs frequently exhibit a cell-type bias.
figure 3

a, Inferred cell-type composition (based on Strand-seq) per donor (ordered by age). b, Upper, stacked bar graph depicting the HSPC cell-type composition in BM65, estimated through SingleR cell-type annotation72 of scRNA-seq, utilizing previously published immune cell-type annotations as a reference profile51,57, compared with cell-typing of Strand-seq data. Lower, cell-type compositions are highly correlated between Strand-seq (x axis) and scRNA-seq (y axis) in BM65. The error band indicates the confidence interval controlling the 95% confidence region. (R, correlation coefficient calculated from the x and y axes; P value is based on the two-sided significance test for the Pearson correlation coefficient, testing the hypothesis that it is 0.) c, Dotplot of results of the cell-type enrichment analysis for each mSV identified, showing the CF, enrichment and significance of each cell type per mSV subclone versus an idealized control. The number in brackets indicates the number of single cells of a given cell type, in a given subclone, used to calculate enrichment. Data here show enrichment for single genotypes; for combined enrichments see Supplementary Fig. 10. d, Circle-packing plot summarizing the mSVs and inferred cell-type composition of each subclone for each of the 19 donors. Transparent circles with a solid outline represent distinct samples. Transparent inner circles with dashed outlines represent mosaicism-bearing subclones within a sample, while colored circles denote the cell types contributing to the subclone. Each circle is proportional to the total number of single cells composing that cell type/subclone. A gray background identifies mosaicism-bearing subclones showing a significant (FDR 10%) cell-type enrichment with respect to the control group of karyotypically normal cells. e, Summary of lineage biases observed across all subclonal mSVs (that is, excluding LOY/loss of X) across the cohort. f, Enrichment analysis of pathways grouped by Jaccard similarity, for subclonal mSVs across the cohort. Only groups of pathways enriched in two or more mSVs are shown. For all individual pathways, see Supplementary Fig. 12. For all groups of pathways and details on Jaccard similarity-based grouping, see Supplementary Fig. 35.

We next explored the cellular context of mSVs. Of the 19 subclonal mosaicisms found, 8 (42%) show significant cell-type enrichments (FDR 10%; Fig. 3c,d and Supplementary Figs. 9 and 10); and, when considering only subclonal mSVs (that is, removing sex chromosome losses), 5 of 7 (71%) show significant biases. Here, we find predominantly myeloid skewing, with 5 of 5 (100%) of the cell-biased subclonal mSVs enriched in either myeloid or myelo-primitive cell types (Fig. 3e). These lineage-biased events include: a 10-Mb inversion on chromosome Xq12-Xq21.1 enriched in MEPs (BM65); a 1-Mb duplication at 13q enriched in MEPs (BM70); a 300-kilobase (kb) duplication at 19q enriched in CMPs (BM63); and two sequentially arisen deletions at 17p (1.2 Mb) and 17q (500 kb) enriched in both CMPs and HSCs (BM712).

By comparison, sex chromosome losses exhibit more variability, with cell-type enrichments seen in only 3 of 12 (25%; all LOYs) and each of these exhibiting bias for a different cell type: MEP, LMPP and HSC, respectively (Supplementary Figs. 10 and 11). This suggests that the functional impact of LOY is less pronounced or more context-specific30. Furthermore, singleton mSVs do not show cell-type enrichment (Supplementary Fig. 10), suggesting that lineage biases seen in subclonal mSVs are due to their impact on cellular function, rather than biased acquisition in a specific cell type.

Remarkably, despite the diverse genomic loci affected by subclonal mSVs, there is a notable convergence on certain molecular phenotypes. Specifically, the Ras and JAK/STAT signaling pathways, as well as lipid metabolism—previously associated with clonal hematopoiesis (CH) and leukemia31,32—are recurrently altered (Fig. 3d–f, Supplementary Figs. 12 and 13 and Supplementary Tables 9, 10 and 17). These data link mSVs to common changes in aging-related pathways.

Cell-type-specific impact of an inversion

The molecular consequences of mosaic inversions are underexplored, since most studies are biased towards CNAs7,11. We therefore investigated the Xq12-Xq21.1 inversion (‘Xq-Inv’), seen in 22.6% (19 of 84) of cells from a 65-year-old female donor (BM65; Fig. 4a). Nucleosome occupancy analysis20 confirms the inversion lies on the active X-homolog (Supplementary Fig. 14), supporting its potential for mediating functional effects. We refined the inversion breakpoints33 (Methods) to chrX:66753519–76960327, with confidence intervals of ~10 kb and ~18 kb, respectively. While neither breakpoint directly overlaps a gene, the inversion is predicted to fuse two topologically associating domains (TADs) by disrupting their annotated boundaries (Fig. 4b), putatively altering the respective gene regulatory environments34.

Fig. 4: Mosaic inversion contributes to MEP-biased cell fate and subclonal expansion of HSPCs through cis and trans effects.
figure 4

a, Strand-seq data of X chromosomal homologs from BM65 depicting the unaffected haplotype 2 (also denoted ‘WT’; top) and the Xq-Inv (somatic mosaic inversion on chromosome Xq) on haplotype 1 (bottom) in single cells. For visualization purposes, here and below, strand- and haplotype-specific DNA reads are colored as follows: Watson (−) reads, orange; Crick (+) reads, blue. b, Genome browser track showing the confidence interval of inversion breakpoints and annotated TAD boundaries73 around them. Below, NO differences at CREs between Xq-Inv and WT cells are shown as log2-fold changes (permutation-adjusted P values computed using a sliding window approach20). The most significant signal out of 13 peaks representing patterns of haplotype-specific NO is a region with inferred increased chromatin accessibility, which overlaps with annotated AR enhancers74 residing 386 kb apart from the AR gene. Three annotated AR enhancers intersecting with the most significant peak are highlighted in red. The black vertical dotted lines indicate the breakpoint positions of mSVs, and the red horizontal dotted lines show the significance level of haplotype-specific NO (FDR 5%, and 10%). c, Heatmap of differential nucleosome occupancy (diffNO) genes identified in Xq-Inv cells compared with WT cells, generated after regressing out the contribution of individual cell types. The y axis represents single cells analyzed, and diffNO genes are plotted on the x axis. Changes in inferred gene activity are colored from red (increased gene activity) to blue (decreased gene activity). d, Pathways over-represented by the genes with diffNO (FDR 10% based on the hypergeometric test; Act_U, activity up; Act_D, activity down). e, Circle-packing plot depicting cell-type-resolved mSVs (terDup, terminal duplication; terDel, terminal deletion). Dotted lines denote mSVs; gray-colored background denotes measured cell-type enrichment. f, Violin plot of NO of known AR target genes, which exhibit an AR-binding motif in their promoter based on MsigDB75, in Xq-Inv (n = 18 cells) and WT cells (n = 66 cells), all cell types (left), HSCs only (n = 18 cells, upper-right) and MEPs (n = 28 cells, lower-right). Boxplots were defined by minima, 25th percentile − 1.5× IQR; maxima, 75th percentile + 1.5× IQR; center, median; and bounds of box, 25th and 75th percentiles. P values are based on the two-sided likelihood ratio test followed by Benjamini–Hochberg multiple correction. The gray and yellow shading of violin plots show the genotype of cells (gray, WT; yellow, Xq-Inv). g, Cell-type-specific analysis of NO differences at CREs between the mSV subclone and WT cells. Padj values of significant peak regions (FDR < 10%) are highlighted. A red arrow indicates the HSC-specific significant peak region containing two AR enhancers, in which we infer increased chromatin accessibility (these two enhancers are highlighted in red in Supplementary Fig. 18). The red dotted lines indicate the significance level of haplotype-specific NO (FDR 10%).

To investigate the potential impact of the inversion, we interrogated haplotype-resolved nucleosome occupancy profiles at cis-regulatory elements (CREs) to infer chromatin accessibility for each homolog20. Using a haplotype-aware sliding window analysis (Methods), we normalized nucleosome occupancy between the active and inactive X, and compared Xq-Inv cells with unmutated cells from the same donor. We identify 13 peak regions with significantly altered nucleosome occupancy (10% FDR; Fig. 4b), with 4 (31%) located within one of the affected TADs. The strongest peak fell into an intergenic region and showed decreased nucleosome occupancy on the inverted haplotype, indicating increased chromatin accessibility20. This peak is located adjacent to the androgen receptor gene (AR). Closer analysis shows three annotated AR enhancers fall within this peak (Supplementary Table 11), all residing in the fused TAD (Fig. 4b and Supplementary Fig. 14). These data suggest AR as a potential target of gene dysregulation and contributor to subclonal expansion. Indeed, androgens are used to treat bone marrow failure syndromes by inducing HSPC proliferation, albeit with an incompletely understood mode of action35.

To study the downstream effects of the Xq-Inv, we performed a genome-wide search for differential gene activity20, comparing the nucleosome occupancy of gene bodies between Xq-Inv and unmutated cells (Methods). We find 123 genes displaying differential nucleosome occupancy (Fig. 4c and Supplementary Table 10)—all of which reside outside the inversion locus—suggesting strong trans effects of Xq-Inv. Gene set over-representation analysis reveals dysregulation of several AR-related pathways, including Ras signaling and erythropoietin signaling (10% FDR; Fig. 4d and Supplementary Table 12). Erythropoietin signaling, for example, contributes to an erythroid-bias of HSCs in association with elevated AR activity36,37. Finally, TF-target enrichment analysis20 reveals three TFs with differential activity in Xq-Inv cells: EGR1, RUNX1 and IKZF1—all of which are linked to AR signaling (Supplementary Fig. 15). These data independently suggest AR activation as a result of Xq-Inv.

Notably, all three TFs have previously been reported to play critical roles in MEPs38,39,40, hinting that AR activation could be a key factor in the enrichment of MEPs within the Xq-Inv subclone (Fig. 4e). To explore this, we performed a cell-type-aware nucleosome occupancy analysis in the AR gene-body, revealing elevated AR activity from the rearranged homolog in HSCs, but not in MEPs (10% FDR; Supplementary Fig. 16). Likewise, upon testing AR target genes (Supplementary Table 13) we infer increased activity in HSCs, but not MEPs, with Xq-Inv (10% FDR; Fig. 4f and Supplementary Fig. 15), indicating HSC-specific AR overactivation in Xq-Inv cells. Consistent with this, Xq-Inv HSCs contain unique differential nucleosome occupancy peaks (10% FDR), including at two AR enhancers (Fig. 4g and Supplementary Fig. 17). These enhancers, which contain binding sites for EGR1, RUNX1 and IKZF1, are more accessible in HSCs, suggesting cell-type-specific enhancer activities (Supplementary Fig. 18). Finally, where these HSCs show regulatory changes consistent with elevated AR signaling (with 3 of 4 differential nucleosome occupancy genes representing annotated AR targets), Xq-Inv myeloid cells (CMPs and MEPs) show a more diffuse signal (with 23 of 105 and 12 of 55 differential nucleosome occupancy genes being AR targets, respectively) (Supplementary Table 9 and Supplementary Fig. 15). Among the MEP-specific genes, we infer high activity of RIT1 (Padj = 0.0057), a gene whose overexpression has been implicated in CH with MEP expansion41. Comparing the scRNA-seq data from BM65 with HSPCs from the Human Cell Atlas bone marrow cohort42 shows significant enrichment for AR activity in BM65 versus the Human Cell Atlas cohort in HSCs and MEPs, but not LMPPs (Supplementary Fig. 19). These findings are in line with androgen-mediating erythropoiesis through AR-dependent pathways43. They further imply HSC-specific AR overactivity, with a ‘priming’ role of Xq-Inv biasing cells towards megakaryocyte–erythroid lineages.

Stepwise accumulation of mSVs in HSPCs

While our data indicate that mSVs impact molecular phenotypes, how subclonal expansions are facilitated in cells harboring more than one co-existing mSV is unclear. We explored subclone dynamics in a 71-year-old male donor (BM712) exhibiting five distinct subclones, three of which demonstrate cell-type bias (FDR 10%; Fig. 5a). Of the 123 cells sequenced, 103 (84%) harbor at least one subclonal mosaicism, including two interstitial deletions and three LOYs (Fig. 5a,b). We tracked the subclonal evolution of BM712 using shared mSVs. One subclone (26% CF) shows LOY as the only mSV event and is enriched for HSCs. The four other subclones trace back to a ~1.2-Mb deletion at 17p11.2 (17p-Del), seen in 56% of cells, followed by the progressive acquisition of a ~500-kb deletion at 17q11.2 (17q-Del) and two independent LOYs (Fig. 5c–e). Bulk WGS of CD34 cells verified the subclonal 17q-Del and 17p-Del events (Fig. 5e and Supplementary Fig. 20), and revealed both mSVs are carried into mature blood cells.

Fig. 5: mSV accumulation in a single donor associates with clonal expansion.
figure 5

a, Circle-packing plot of mSVs found in BM712. b, Strand-seq karyograms of unmutated (WT; upper), 17p-Del (somatic mosaic heterozygous deletion on chromosome 17p) only (middle) and 17p-Del and 17q-Del (somatic mosaic heterozygous deletion on chromosome 17q) (bottom) somatic genotypes in single cells. c, Bubble hierarchy plot of mSVs identified in BM712. Bubbles are colored by somatic genotype, and scaled proportionally to each subclone’s frequency within the donor. CF is noted beside each bubble, and the distinguishing mosaicism acquired by each subclone indicated on the adjoining arm from the parent population. d,e, UCSC genome browser tracks for the 17p-Del (d) and 17q-Del (e) genomic segments. Tracks for both panels include composite read data and BreakpointR33-based breakpoint assignments, and highlight relevant genes. In e, the high-confidence deletion call from bulk WGS is also displayed (VAF inferred by Delly2 (ref. 76) is 28.5%). f, Heatmap of genes showing differential NO between WT, 17p-Del and 17pq-Del cells. g, Pathway over-representation analysis using ConsensusPathDB77 for the genes identified in the pairwise comparison of 17p-Del and 17pq-Del subclones with WT cells (FDR 10% based on the hypergeometric test). On the x axis, Act_U and Act_D depict increased and decreased activity, respectively. h, UMAP plot of scRNA-seq of CD34+ cells, with inferred cell type from reference data51 overlaid. i, Cell-type composition and enrichment analysis for 17p-Del and 17pq-Del subclones in scRNA-seq data of CD34+ cells. Asterisks indicate cell types with significant enrichment in a given subclone, based on Benjamini–Hochberg-adjusted Fisher’s exact test. j, UMAP plot of scRNA-seq of CD34 cells, with cell type inferred from single-cell reference datasets51,57 overlaid. k, Cell-type composition and enrichment analysis for the 17p-Del subclone in scRNA-seq of CD34 cells. ‘Unresolved’, the 17q-Del subclone could not be resolved in these scRNA-seq data owing to the low numbers of expressed genes covered. Significant cell-type enrichment with **Padj < 0.001 or ***Padj < 0.0001, respectively, based on two-sided Fisher’s exact test followed by Benjamini–Hochberg multiple testing correction. In CD34+ cells, CMPs and LMPPs are enriched in the 17p-Del subclone (Padj = 1.99 × 10−11 and Padj = 6.48 × 10−3, respectively) and HSCs are enriched in the 17q-Del subclone (Padj = 2.07 × 10−5). In the case of CD34 cells, monocytes are enriched in the 17p-Del subclone (Padj = 9.6 × 10−29). dups, duplications; NK, natural killer.

To explore the functional impact of the initiating mSV (17p-Del), we compared the gene-body nucleosome occupancy of 17p-Del cells with unmutated cells from BM712 using scNOVA, identifying 76 dysregulated genes (10% FDR; Fig. 5f). TF-target over-representation analysis20 shows enrichment for the targets of seven TFs, with the most significant being SREBF1 (Padj = 0.0047) (Supplementary Fig. 21). This gene is hemizygously deleted by 17p-Del, while the other six TFs fall outside the deletion, suggesting a potential important role for SREBF1 loss in the molecular phenotype of 17p-Del cells (Fig. 5d). Protein–protein interaction mapping of all seven dysregulated TFs using STRING44 (Supplementary Methods) reveals a significant protein–protein interaction network connecting all TFs (P = 3.57 × 10−8; Supplementary Fig. 21), highlighting their functional relationship (Supplementary Notes). Pathway enrichment analysis shows this network is enriched for MAPK signaling components (Padj = 0.0028), previously linked to cell-cycle activation in aging HSCs45. Finally, gene set over-representation analysis of all 76 dysregulated genes supports MAPK activation (Fig. 5g), along with dysregulation of lipid homeostasis, a contributor to increased myelopoiesis46. Taken together, this suggests that 17p-Del triggers increased MAPK activity, potentially driving myeloid-biased clonal expansion through hemizygous SREBF1 loss.

We next investigated the consequences of 17q-Del, seen in a subclone with 43.1% CF. This deletion disrupts the NF1 tumor suppressor via hemizygous loss of protein-coding exon 1 (Fig. 5e and Supplementary Figs. 22 and 23). In addition to its well-understood roles in cancer47, NF1 has been nominated as a CH driver by single nucleotide variant (SNV) analysis48 (Supplementary Notes), suggesting that the 17q-Del may fuel HSPC clonal expansion. Using scNOVA, we find 112 dysregulated genes in 17q-Del cells. Pathway over-representation analysis also shows altered metabolism and upregulated mTOR signaling in the subclone (Supplementary Fig. 24). Given the known critical role of NF1 in mTOR signaling49, and the role of mTOR signaling in cell proliferation and HSPC differentiation50, these findings suggest that the 17q-Del induces mTOR dysregulation, potentially fostering subclonal expansion.

To further characterize these subclones, we generated 4,114 scRNA-seq libraries from CD34+ cells isolated from BM712 (Supplementary Fig. 25), and assigned HSPC cell types to the data using a transcriptome reference of human blood51 (Fig. 5h). To molecularly phenotype the deletion subclones, we capitalized on the fact that copy-number-imbalanced mSV classes can be utilized for targeted re-calling of CNAs in scRNA-seq data20 (Methods), allowing characterization of mSV-bearing cells across a widened dynamic expression range. Using this approach, we infer that 2,571 (63%) scRNA-seq cells bear the 17p-Del, 1,841 (45%) contain the 17q-Del and 995 (24%) exhibit LOY (Supplementary Table 14)—CFs similar to the Strand-seq analyses. Co-occurrence analyses of these mosaicisms corroborate the subclonal structure identified using Strand-seq (Supplementary Fig. 26). Finally, the scRNA-seq data also verify the inferred lineage biases, with 17p-Del cells enriched for CMPs and LMPPs (Padj = 2.0 × 10−11, Padj = 0.0064; Fisher’s exact test), and both 17q-Del and LOY cells enriched for HSCs (Padj = 2.6 × 10−14, Padj = 1.0 × 10−56; Fisher’s exact test; Fig. 5i and Supplementary Fig. 25).

Having located the mosaic subclones in the scRNA-seq data, we more deeply characterized their molecular phenotypes controlled by cell type. First, gene ontology analysis of the differentially expressed genes between HSCs with and without LOY identifies pathways linked to HSC quiescence52,53 (10% FDR; Supplementary Tables 15 and 16), potentially explaining the observed HSC enrichment of LOY in BM712. Next, we confirm a distinct transcriptional profile for 17q-Del cells, with differential activity seen for 16 pathways (Molecular Signatures Database (MSigDB) Hallmark; Supplementary Tables 15 and 16 and Supplementary Fig. 27) including those related to HSPC proliferation, differentiation and metabolism. These pathways include MYC and mTOR signaling through mTORC1—two known downstream effectors of somatic NF1 inactivation49,54—which can be linked to HSC expansion and inhibition of differentiation55,56. Indeed, we find 17q-Del cells are significantly enriched for HSCs compared with 17p-Del cells (Padj = 2.1 × 10−5; Fig. 5g), potentially mediated through MYC and/or mTORC1 upregulation55,56. Finally, 17q-Del cells show an altered DNA damage response, with decreased expression of BRCA1, BRCA2, FANCI and BLM—implying these cells might be prone to acquire further alterations. Together, this suggests that BM712 underwent a stepwise acquisition of a potentially ‘higher-risk’ molecular phenotype; first, HSCs were enabled to exit quiescence and bias their differentiation (17p-Del); and, second, cells became more proliferative and HSC-like, and potentially more permissive to acquiring further mutations.

Finally, we explored the presence and functional impact of these mSVs in scRNA-seq data generated from terminally differentiated CD34 blood cells. We annotated 2,965 cells into eight cell types using published reference data57 (Fig. 5j and Supplementary Fig. 28), and performed targeted CNA re-calling58. Notably, we find a significant enrichment for monocytes in 17p-Del cells (Fig. 5k), a circulating downstream progeny of CMPs. These data underscore that these mSVs, identified in HPSCs, could impact peripheral blood cells. In contrast, our efforts to re-detect CNAs within the smaller 17q-Del region were unsuccessful due to its limited number of expressed genes, underscoring the superior capability of Strand-seq in functionally characterizing mSVs relative to scRNA-seq.

Functional effects of mSVs in blood samples

To extrapolate these findings to a larger cohort of blood samples, we interrogated the UK Biobank cohort59. The phenotypic data paired with whole-exome sequencing (WES) data from 469,792 donors59 provide the opportunity to study somatic mutations in relation to blood counts. Focusing on our top hits—NF1, SREBF1 and AR—we extracted rare (minor allele frequency (MAF) < 1%) SNVs and small (<50 bp) insertion and deletion variants (INDELs) from UK Biobank samples, and classified these based on their predicted impact (Supplementary Table 18). Since CNA losses affecting both the 17p-Del and 17q-Del regions were previously documented2,60, we additionally made use of WES-based CNA calls60 which we analyzed by burden testing (Methods). We first concentrated on the 17p-Del and 17q-Del regions, analyzing gene-disrupting SNVs. We find a bimodal VAF distribution for NF1 and SREBF1 predicted loss-of-function (pLoF) SNVs, but not for rare synonymous and rare missense variants (Fig. 6a). These data indicate that gene-disrupting pLoF SNVs represent a common source of mosaicism at these loci. Furthermore, they emphasize the link between gene-disrupting mSVs affecting SREBF1 and NF1, and clonal expansions in normal blood.

Fig. 6: Functional effects of mSVs are supported by re-analysis of UK Biobank data.
figure 6

a, VAF plot for SNVs in SREBF1 and NF1, separated by mutation type, in the UK Biobank. b,c, Volcano plots showing burden test results for genes in the 17p-Del (b) and 17q-Del (c) (somatic mosaic deletions on chromosomes 17p and 17q) candidate regions, respectively. Genes with Padj < 0.05 are labeled. A subset of blood count traits is depicted (see Supplementary Fig. 29 for all blood count traits). d, VAF plot for SNVs in AR, separated by mutation type, in females (see Supplementary Fig. 34 for males). e, Volcano plot showing association test results of single rare missense SNVs at the Xq-Inv (somatic mosaic inversion on chromosome Xq) locus for all 11 blood count traits (generated from female donors). The full respective list of missense variants analyzed is included in Supplementary Table 18. Variants with Padj < 0.05 are colored by gene and labeled by trait: NRBC, nucleated red blood cell count; basophil, basophil count; RBC, red blood cell count. Variants with Padj ≥ 0.05 are colored in gray. The y axes in b, c and e depict nominal P values. For b, c and e, P values were obtained using the two-sided Wald test followed by the Benjamini–Hochberg multiple correction.

Furthermore, at the SREBF1 locus, we find CNA losses and pLOF SNVs are independently associated with altered blood counts (n = 2 losses and n = 74 pLOF SNVs; Supplementary Table 18 and Supplementary Figs. 29 and 30), with the SREBF1 gene being among the strongest hits within the 17p-Del region for several categories, including elevated total leukocytes (Padj = 0.00012; loss) and elevated monocytes (Padj = 0.0012; loss) (Fig. 6b and Supplementary Fig. 29). These findings independently support that SREBF1 loss may contribute to a cell-type bias in leukocytes, specifically towards monocytes. When repeating the same analysis for all genes in the 17q-Del region, we find losses at 5 of 6 genes are associated with elevated total leukocytes—yet, only for NF1 do we observe that both loss and pLoF SNVs are significant (Padj = 0.042 for both; Fig. 6c, Supplementary Table 18 and Supplementary Figs. 29 and 30). This supports the contributions of both 17p-Del and 17q-Del to cell-type skewing and potentially clonal expansion in blood. Interestingly, pLOF SNVs in NF1 are associated with a marked increase in neutrophil counts (Padj = 0.00019), strongly implicating this gene in myeloid-skewed hematopoiesis.

Lastly, we analyzed rare missense SNVs at the Xq-Inv locus (n = 5 genes), motivated by earlier reports of activating somatic missense mutations in AR61, which we reasoned could potentially mirror the AR activation molecular phenotype seen in BM65. In females, we observe a bimodal VAF for missense SNVs, but neither for pLoF nor for rare synonymous SNVs, suggesting that AR missense SNVs, but not other SNVs, exhibit somatic mosaicism (Fig. 6d and Supplementary Notes). Furthermore, five rare AR missense SNVs, but no AR pLoF SNVs, are associated with altered blood cell counts (Padj < 0.05, for all five SNVs; Fig. 6e). These fall into exon 1 (n = 3), exon 2 (n = 1) and exon 4 (n = 1), all of which also harbor missense SNVs in cancer that impinge on AR function61. We observe association with increased nucleated red blood cell count for n = 4 missense SNVs (Padj < 0.05, for all four), and decreased basophil count for the remaining SNV (Padj = 0.043). These findings independently support a link between AR activation and altered cell counts in UK Biobank samples.