Search
Close this search box.

A sequence-aware merger of genomic structural variations at population scale – Nature Communications

Overview of PanPop pipeline

The central component of PanPop is a crucial process called PART (PAnpop Realign and Thin), which employs a sequence-aware SV local realignment method to resolve overlapping SVs, particularly reducing multi-allelic SVs into biallelic forms. PART consists of five steps: realign group, rebuild consensus sequences, alignment, SV integration, and SV thinning (Fig. 1a). During the realign group step, SVs that overlap or are in close proximity (with a default adjacency threshold of 200 bps) are grouped based on their positions. Subsequently, the rebuild sequences step retrieves the sequences of each realign group based on the mutation information in Variant Call Format (VCF) files and an SV parse process. During this process, if an individual contains multiple SVs within a realign group, each SV is treated as unique. The alignment process employs sequence aligners (MUSCLE18, FAMSA19, or stmsa20) to generate multiple sequence alignment results. MUSCLE is selected for short sequences (<1000 bp), while FAMSA or stmsa is used for longer sequences. The SV integration step involves dividing the alignment into distinct blocks along the reference sequence, where each block represents the unaltered status across different samples. Blocks displaying consistent alignment (without SVs) are eliminated, and only the remaining blocks are classified as new SVs. It is important to note that despite insertions of varying lengths altering the alignment status in different samples, they are classified as a single SV since we divide the alignment based on the reference sequence. Finally, the SV thinning process reduces multi-allelic SVs to biallelic ones based on their similarity. If two allele sequences share at least 60% of their bases and have a difference of less than 20 bp, they are considered the same allele. The remaining alleles are clustered into distinct groups using the MCL algorithm21, with each group treated as a unique allele, and the alleles within each MCL group are merged together.

Fig. 1: Schematic diagram and individual level SV merging performance of PanPop.
figure 1

a Detailed workflow of PanPop’s key process, PART. Different colors represent different sequences or SVs (Del for deletion, Rep for repeat, Ins for insertion). In the upper three partitions of the diagram, thick lines with different colors indicated different sequences, with Rep, Ins1, and Ins2 initially being collinear. The thin gray line indicated a gap after alignment. The last two parts of the diagram illustrate the allelic state of SVs after SV integration; For example, the SV2 initially has three different alleles. However, during the thin process, it is merged since the allelic sequence for sample S2 is very similar to that in sample S3. b Performance evaluation of 12 SV mergers in calling and merging simulated SVs, assessed by recall (x-axis), precision (y-axis), and F1 score (F1, dashed line). Circles, squares, and triangles represent SVs supported by at least one (unfiltered), two, and three (filtered) SV callers for SV merges, respectively, while SV callers were marked as asterisks. SV mergers and callers are represented using different colors, line types, and shapes. c Distribution of sequence similarity among merged SVs from two SV callers and four SV mergers with F1-score higher than 0.9. Note that for SV mergers, only filtered SVs (SVs supported by at least two SV callers) were used and excluded SVIM-asm, which had an F1-score of 1. Source data are provided as a Source Data file.

It is worth noting that PART is a relatively independent module specifically designed to handle the merging of different SV datasets in VCF format. All parameters can be easily and flexibly configured through the config file. In addition to utilizing PART for SV merging, the PanPop pipeline incorporates SV calling and filtering steps to integrate the entire process, starting from reads or assemblies and leading to the final individual or population SVs. For long sequencing reads, Minimap222 (or alternatively, NGMLR23) is employed to map them onto a reference genome. This is followed by four commonly used SV callers (Sniffles23, cuteSV5, svim24, and pbsv) for SV identification. If an assembly is present, it is aligned using Minimap2, and SVs are called using SVIM-asm25 (Fig. 2a). For short sequencing reads commonly used in population-level analyses, the VG toolkit26 is employed for alignment to the graph-genome (pan-genome) reference and SV calling (Fig. 2b).

Fig. 2: Overview of the pipeline utilized by PanPop.
figure 2

Pipelines for Long-reads-based and short-reads-based approaches were illustrated by a, b, respectively. The middle partition of the figure was shared by both pipelines.

The PART method is applied twice to merge SVs from different callers, either for individual analysis or for multiple individuals within large populations. In the case of individual SV filtering, only SVs supported by two or more SV callers are retained when multiple callers are utilized. For the population-level SV filtering, referred as “fill-depth information”, depth information is extracted from the mapping file for each SV in each sample and combined into the merged VCF file. During this process, SVs with abnormal depth are soft-masked to ensure result accuracy15,27. Additionally, although optional, it is recommended to filter SVs based on the maximum missing rate to obtain a final set of refined population-scale SVs. We also provide tools for aggressively thinning SVs by removing low-frequency alleles, resulting in a set of predominantly bi-allelic SVs.

Evaluation of the accuracy of single individual among multiple approaches

To assess the performance of the PanPop pipeline, we conducted an evaluation that involved analyzing its ability to merge individual-level SVs using both simulated and real long-sequencing data. The software versions and detailed parameter used were available in Supplementary texts. Initially, we employed five SV callers to identify SVs, and then compared the results obtained using PanPop with those from nine other SV mergers. For the simulated Arabidopsis thaliana data, the F1-scores of the five SV callers ranged from 0.81 to 1.00, with variable recall and precision rates (Figs. 1b and 3d, Supplementary Data 1). The SVIM-asm achieved an almost perfect score in this simulation dataset. However, after merging without any filtering, the 12 mergers exhibited relatively low F1-scores, primarily due to low precision (Fig. 1b, Supplementary Figs. 1a and 2). Consequently, filtering was necessary. We tested the SVs that were supported by at least one or more callers and observed that the precision rate increased with the number of supporting callers, while the recall rate decreased significantly (Figs. 1, 3, 4). As a result, the strategy of requiring support from at least two SV callers yielded the best F1-score and we adopted this as the default parameter for subsequent analyses.

Fig. 3: Performance comparison of different SV mergers using a simulated SV dataset.
figure 3

Sequence similarity a, length similarity b, and F1-score d comparison among 12 SV mergers with filtered SVs (SVs supported by at least two SV callers) and 5 SV callers. SV mergers and callers represented using different colors and line types. For d the gray dashed line was the simulated SVs count, and minimal supported SV caller count of 1 to 3 were represented as circle, square and triangle, respectively. c SV length distribution of simulated data (Lower) and PanPop result (Upper, filtered SVs). Red, ultramarine, and blue bars represent SV types of insertion, deletion, and other, respectively. Source data are provided as a Source Data file.

Fig. 4: Performance evaluation of PanPop using HG002 dataset.
figure 4

a SV length and type for PanPop (upper, filtered SVs) and high-confidence dataset (lower). Red, ultramarine, and blue bars represent SV types of insertion, deletion, and other, respectively. b, c comparison between 12 SV mergers and 5 SV callers. For each merger, minimal supported SV caller count of 1 to 3 was represented as circle, square, and triangle, respectively. SV mergers and callers are represented using different colors and line types. The gray dashed line in b was the count of high-confident SVs, and the gray dashed line in c was F1 score of 0.9, 0.8, and 0.7. Source data are provided as a Source Data file.

After applying filtering, both software programs demonstrated a significant increase in F1-score and precision rate, particularly with PanPop, Truvari and SVanalyzer achieving the highest F1-score, exceeding 0.93 (Fig. 1b, Supplementary Fig. 1a, Supplementary Fig. 2). To further assess accuracy, we calculated the similarity between the simulated SVs and the SVs identified by the 12 mergers and five SV callers (Figs. 1c and 3a). We found that PanPop identified a high proportion (83.3%) of high-similarity (≥98%). In comparison, the best two SV callers and three other mergers only showed proportions ranging from 25.4% to 79.6% of high-similarity SVs (excluding SVIM-asm, which had an F1-score of 1) (Fig. 1c). We also assessed the distribution of SV lengths, and found that PanPop exhibited a high degree of similarity (Fig. 3b, c). Collectively, these findings demonstrate that merging SVs can greatly improve accuracy, with PanPop delivering the best performance. Additionally, we evaluated the performance of PanPop using true data of high-confidence SVs in human HG002, and the results were mostly consistent with those from the simulated dataset (Fig. 4, Supplementary Fig. 1b). The F1-score of SVanalyzer was slightly higher than PanPop (0.958 vs 0.954), which may be attributed to the SV split strategy of PanPop, making benchmarking more challenging. The recall rate of SVIM-asm also decreased to 70% as the complexity of genomics increased. Moreover, the genotype accuracy of PanPop was significantly higher than that of SVanalyzer and Truvari, with values of 0.979, 0.463, and 0.920, respectively (Supplementary Data 2).

Evaluation of the performance in the population-scale SV merging

We proceeded to assess the merging performance of population-scale SVs using two datasets of A. thaliana: 86 samples with long sequencing read and 1092 samples with short sequencing read. For population merging, it was necessary for the merger to generate genotype information for each individual to facilitate subsequent analyses. Therefore, out of the 12 mergers evaluated, only PanPop, SURVIVOR, bcftools, Truvari, and sv-merger were utilized. In the case of the long-sequenced dataset, SVs were identified by 5 SV callers and filtered by requiring support from at least 2 callers for each individual before merging all individual information into the population-scale. We initially evaluated the results based on the proportion of missing genotypes, as high missing rates are deemed meaningless in population analyses. The proportion of mutations with a missing rate exceeding 30% was 8.2% for PanPop, 62.2% for SURVIVOR, 78.1% for bcftools, and 99.7% for Truvari (Fig. 5a, Supplementary Fig. 3a) when merging 86 samples using the dataset filled with information of non-mutation samples by PanPop. When using the raw dataset generated by SV callers, the missing rate was exceedingly highly, exceeding 97% for each of these four SV mergers. The high missing rate of the raw dataset made it unsuitable for subsequent analyses, especially for bcftools and Truvari, primarily because their merging strategy involved nearly lossless bulk combinations of all mutations without processing (Figs. 5f and 6f). This distinction became more pronounced when merging 1092 samples (Fig. 6a). The lower missing rate of PanPop can be attributed to the PART algorithm and the “fill-depth information” process. However, when applying a data source after ‘fill-depth information’ process by PanPop, the missing rate of SURVIVOR and bcftools greatly dropped but was still considerably higher than that of PanPop (Fig. 5a, Supplementary Fig. 3a). Alongside the missing rate, the proportion of biallelic SVs is also crucial, since multi-allelic SVs are challenging to utilize in subsequent population genetics analyses. We discovered that approximately 400k SVs (89.0%) were biallelic when merged by PanPop, with only a slight change after filtering out sites with more than 30% missing data (Fig. 5f). This count and percentage were significantly higher than those of SURVIVOR and bcftools after filtering. This distinction was also more prominent when merging 1092 samples (Fig. 6f).

Fig. 5: Performance evaluation of the population-scale SV merging by four SV mergers using 86 A. thaliana long-sequencing datasets.
figure 5

a shows the missing proportion of SVs. SV merging using the original results of SV callers or processed by PanPop’s “fill_depth_information” method, represented by the solid or dashed lines, respectively. b SV length and allele count statistic for PanPop, with allele counts distinguished using different colors. c, f show the histogram of evaluation scores (recall, precision, and F1) and biallelic proportion of the results from four SV mergers. Bar and error bars of c are means ± SE (n = 86 for each bar). The upper and lower sections of f represent the datasets after a maximum missing rate filter of 30% and the raw datasets, respectively. The solid and shadow bars of f represented the count of biallelic and total SVs, respectively. The black dot of f represented the proportion of biallelic SV. d the precision and recall rate of four SV mergers. The bright and dark points represented the result of raw SVs and SVs with a filter of max missing of 30%, respectively. e Magnified view of the filtered PanPop results. For e, the color of each dot represents the genotype concordance (GT concordance), with the color gradient from red to blue indicating increasing numbers. For a, c, d, and f, colors of red, green, blue, and gray represented different software of PanPop, SURVIVOR, bcftools, and Truvari, respectively. Source data are provided as a Source Data file.

Fig. 6: Performance of four SV merger in large-scale SV merging using 1092 A. thaliana NGS datasets.
figure 6

a shows missing proportion of SVs. SV merging using the original results of SV callers. Line of bcftools and Truvari were mostly overlapped. b SV length and allele count statistic for PanPop, with allele counts distinguished using different colors. c, f shows the histogram of evaluation scores (recall, precision, and F1) and biallelic proportion of the results from four SV mergers. Bar and error bars of c are means ± SE (n = 1092 for each bar). The upper and lower sections of f represent the datasets after a maximum missing rate filter of 30% and the raw datasets, respectively. The solid and shadow bars of f represented the count of biallelic and total SVs, respectively. The black dot of f represented the proportion of biallelic SV. d the precision and recall rate of four SV merger. The bright and dark points represented the result of raw SVs and SVs with a filter of max missing of 30%, respectively. e Magnified view of the filtered PanPop results. For e, the color of each dot represents the genotype concordance (GT concordance), with the color gradient from red to blue indicating increasing numbers. For a, c, d and f, the color of red, green, blue, and gray represented different software of PanPop, SURVIVOR, bcftools, and Truvari, respectively. Source data are provided as a Source Data file.

We proceeded to further benchmark SVs by comparing them before and after population merging. Since PanPop splits complex SVs into several smaller SVs, we reassembled those small SVs before evaluation. Regardless of the missing rate, bcftools achieved the highest F1-score (nearly 1) because it simply combined all SVs. PanPop also performed well, with an average F1-score of 97.5% and a recall rate of 99.0%. However, SURVIVOR only had a recall rate of 69.4% and an F1-score of 73.0% (Fig. 5d). After filtering SVs with a maximum missing rate of 30% (where a maximum of 30% of samples had missing genotypes for each SV), the performance of bcftools, Truvari and SURVIVOR underwent significant changes. The F1-score of bcftools and Truvari nearly dropped to zero due to the high missing rate, while SURVIVOR’s F1-score was greatly reduced to an average of 19.7%. However, due to its low missing rate, PanPop still achieved an F1-score of 95.4% (Fig. 5c). We also compared PanPop with the other two software programs capable of performing population-scale merging, namely sv-merger (Supplementary Note 1) and Sniffles2 (Supplementary Note 2), and found that their performance was not as good as PanPop (Supplementary Figs. 4 and 5). Considering the anticipated increase in sequencing data volume in the future, a large number of samples will be analyzed. Therefore, we used 1092 short-sequenced samples as a replacement to evaluate large-scale population SV merging. The VG toolkit was selected to align reads and call SVs, using the graph-genome generated by the 86 long-sequenced samples as the reference. The benchmark results of the four SV mergers were similar to those of the 86 samples, with PanPop demonstrating the best performance (Fig. 6, Supplementary Fig. 3b).