Overview of PanPop pipeline
The central component of PanPop is a crucial process called PART (PAnpop Realign and Thin), which employs a sequence-aware SV local realignment method to resolve overlapping SVs, particularly reducing multi-allelic SVs into biallelic forms. PART consists of five steps: realign group, rebuild consensus sequences, alignment, SV integration, and SV thinning (Fig. 1a). During the realign group step, SVs that overlap or are in close proximity (with a default adjacency threshold of 200 bps) are grouped based on their positions. Subsequently, the rebuild sequences step retrieves the sequences of each realign group based on the mutation information in Variant Call Format (VCF) files and an SV parse process. During this process, if an individual contains multiple SVs within a realign group, each SV is treated as unique. The alignment process employs sequence aligners (MUSCLE18, FAMSA19, or stmsa20) to generate multiple sequence alignment results. MUSCLE is selected for short sequences (<1000 bp), while FAMSA or stmsa is used for longer sequences. The SV integration step involves dividing the alignment into distinct blocks along the reference sequence, where each block represents the unaltered status across different samples. Blocks displaying consistent alignment (without SVs) are eliminated, and only the remaining blocks are classified as new SVs. It is important to note that despite insertions of varying lengths altering the alignment status in different samples, they are classified as a single SV since we divide the alignment based on the reference sequence. Finally, the SV thinning process reduces multi-allelic SVs to biallelic ones based on their similarity. If two allele sequences share at least 60% of their bases and have a difference of less than 20 bp, they are considered the same allele. The remaining alleles are clustered into distinct groups using the MCL algorithm21, with each group treated as a unique allele, and the alleles within each MCL group are merged together.
It is worth noting that PART is a relatively independent module specifically designed to handle the merging of different SV datasets in VCF format. All parameters can be easily and flexibly configured through the config file. In addition to utilizing PART for SV merging, the PanPop pipeline incorporates SV calling and filtering steps to integrate the entire process, starting from reads or assemblies and leading to the final individual or population SVs. For long sequencing reads, Minimap222 (or alternatively, NGMLR23) is employed to map them onto a reference genome. This is followed by four commonly used SV callers (Sniffles23, cuteSV5, svim24, and pbsv) for SV identification. If an assembly is present, it is aligned using Minimap2, and SVs are called using SVIM-asm25 (Fig. 2a). For short sequencing reads commonly used in population-level analyses, the VG toolkit26 is employed for alignment to the graph-genome (pan-genome) reference and SV calling (Fig. 2b).
The PART method is applied twice to merge SVs from different callers, either for individual analysis or for multiple individuals within large populations. In the case of individual SV filtering, only SVs supported by two or more SV callers are retained when multiple callers are utilized. For the population-level SV filtering, referred as “fill-depth information”, depth information is extracted from the mapping file for each SV in each sample and combined into the merged VCF file. During this process, SVs with abnormal depth are soft-masked to ensure result accuracy15,27. Additionally, although optional, it is recommended to filter SVs based on the maximum missing rate to obtain a final set of refined population-scale SVs. We also provide tools for aggressively thinning SVs by removing low-frequency alleles, resulting in a set of predominantly bi-allelic SVs.
Evaluation of the accuracy of single individual among multiple approaches
To assess the performance of the PanPop pipeline, we conducted an evaluation that involved analyzing its ability to merge individual-level SVs using both simulated and real long-sequencing data. The software versions and detailed parameter used were available in Supplementary texts. Initially, we employed five SV callers to identify SVs, and then compared the results obtained using PanPop with those from nine other SV mergers. For the simulated Arabidopsis thaliana data, the F1-scores of the five SV callers ranged from 0.81 to 1.00, with variable recall and precision rates (Figs. 1b and 3d, Supplementary Data 1). The SVIM-asm achieved an almost perfect score in this simulation dataset. However, after merging without any filtering, the 12 mergers exhibited relatively low F1-scores, primarily due to low precision (Fig. 1b, Supplementary Figs. 1a and 2). Consequently, filtering was necessary. We tested the SVs that were supported by at least one or more callers and observed that the precision rate increased with the number of supporting callers, while the recall rate decreased significantly (Figs. 1, 3, 4). As a result, the strategy of requiring support from at least two SV callers yielded the best F1-score and we adopted this as the default parameter for subsequent analyses.
After applying filtering, both software programs demonstrated a significant increase in F1-score and precision rate, particularly with PanPop, Truvari and SVanalyzer achieving the highest F1-score, exceeding 0.93 (Fig. 1b, Supplementary Fig. 1a, Supplementary Fig. 2). To further assess accuracy, we calculated the similarity between the simulated SVs and the SVs identified by the 12 mergers and five SV callers (Figs. 1c and 3a). We found that PanPop identified a high proportion (83.3%) of high-similarity (≥98%). In comparison, the best two SV callers and three other mergers only showed proportions ranging from 25.4% to 79.6% of high-similarity SVs (excluding SVIM-asm, which had an F1-score of 1) (Fig. 1c). We also assessed the distribution of SV lengths, and found that PanPop exhibited a high degree of similarity (Fig. 3b, c). Collectively, these findings demonstrate that merging SVs can greatly improve accuracy, with PanPop delivering the best performance. Additionally, we evaluated the performance of PanPop using true data of high-confidence SVs in human HG002, and the results were mostly consistent with those from the simulated dataset (Fig. 4, Supplementary Fig. 1b). The F1-score of SVanalyzer was slightly higher than PanPop (0.958 vs 0.954), which may be attributed to the SV split strategy of PanPop, making benchmarking more challenging. The recall rate of SVIM-asm also decreased to 70% as the complexity of genomics increased. Moreover, the genotype accuracy of PanPop was significantly higher than that of SVanalyzer and Truvari, with values of 0.979, 0.463, and 0.920, respectively (Supplementary Data 2).
Evaluation of the performance in the population-scale SV merging
We proceeded to assess the merging performance of population-scale SVs using two datasets of A. thaliana: 86 samples with long sequencing read and 1092 samples with short sequencing read. For population merging, it was necessary for the merger to generate genotype information for each individual to facilitate subsequent analyses. Therefore, out of the 12 mergers evaluated, only PanPop, SURVIVOR, bcftools, Truvari, and sv-merger were utilized. In the case of the long-sequenced dataset, SVs were identified by 5 SV callers and filtered by requiring support from at least 2 callers for each individual before merging all individual information into the population-scale. We initially evaluated the results based on the proportion of missing genotypes, as high missing rates are deemed meaningless in population analyses. The proportion of mutations with a missing rate exceeding 30% was 8.2% for PanPop, 62.2% for SURVIVOR, 78.1% for bcftools, and 99.7% for Truvari (Fig. 5a, Supplementary Fig. 3a) when merging 86 samples using the dataset filled with information of non-mutation samples by PanPop. When using the raw dataset generated by SV callers, the missing rate was exceedingly highly, exceeding 97% for each of these four SV mergers. The high missing rate of the raw dataset made it unsuitable for subsequent analyses, especially for bcftools and Truvari, primarily because their merging strategy involved nearly lossless bulk combinations of all mutations without processing (Figs. 5f and 6f). This distinction became more pronounced when merging 1092 samples (Fig. 6a). The lower missing rate of PanPop can be attributed to the PART algorithm and the “fill-depth information” process. However, when applying a data source after ‘fill-depth information’ process by PanPop, the missing rate of SURVIVOR and bcftools greatly dropped but was still considerably higher than that of PanPop (Fig. 5a, Supplementary Fig. 3a). Alongside the missing rate, the proportion of biallelic SVs is also crucial, since multi-allelic SVs are challenging to utilize in subsequent population genetics analyses. We discovered that approximately 400k SVs (89.0%) were biallelic when merged by PanPop, with only a slight change after filtering out sites with more than 30% missing data (Fig. 5f). This count and percentage were significantly higher than those of SURVIVOR and bcftools after filtering. This distinction was also more prominent when merging 1092 samples (Fig. 6f).
We proceeded to further benchmark SVs by comparing them before and after population merging. Since PanPop splits complex SVs into several smaller SVs, we reassembled those small SVs before evaluation. Regardless of the missing rate, bcftools achieved the highest F1-score (nearly 1) because it simply combined all SVs. PanPop also performed well, with an average F1-score of 97.5% and a recall rate of 99.0%. However, SURVIVOR only had a recall rate of 69.4% and an F1-score of 73.0% (Fig. 5d). After filtering SVs with a maximum missing rate of 30% (where a maximum of 30% of samples had missing genotypes for each SV), the performance of bcftools, Truvari and SURVIVOR underwent significant changes. The F1-score of bcftools and Truvari nearly dropped to zero due to the high missing rate, while SURVIVOR’s F1-score was greatly reduced to an average of 19.7%. However, due to its low missing rate, PanPop still achieved an F1-score of 95.4% (Fig. 5c). We also compared PanPop with the other two software programs capable of performing population-scale merging, namely sv-merger (Supplementary Note 1) and Sniffles2 (Supplementary Note 2), and found that their performance was not as good as PanPop (Supplementary Figs. 4 and 5). Considering the anticipated increase in sequencing data volume in the future, a large number of samples will be analyzed. Therefore, we used 1092 short-sequenced samples as a replacement to evaluate large-scale population SV merging. The VG toolkit was selected to align reads and call SVs, using the graph-genome generated by the 86 long-sequenced samples as the reference. The benchmark results of the four SV mergers were similar to those of the 86 samples, with PanPop demonstrating the best performance (Fig. 6, Supplementary Fig. 3b).
- SEO Powered Content & PR Distribution. Get Amplified Today.
- PlatoData.Network Vertical Generative Ai. Empower Yourself. Access Here.
- PlatoAiStream. Web3 Intelligence. Knowledge Amplified. Access Here.
- PlatoESG. Carbon, CleanTech, Energy, Environment, Solar, Waste Management. Access Here.
- PlatoHealth. Biotech and Clinical Trials Intelligence. Access Here.
- Source: https://www.nature.com/articles/s41467-024-45244-9