Chromosome-level genome assembly of ridgetail white shrimp Exopalaemon carinicauda – Scientific Data

Animal materials and genome sequencing

A female shrimp was collected from Rizhao Haichen Aquatic Co., Ltd. Muscle tissue was sampled for DNA extraction and library construction. Total genomic DNA was extracted using the cetyltrimethylammonium bromide (CTAB) method. For the genome survey, a 350 bp paired-end library was constructed according to the manufacturer’s instructions (Illumina, San Diego, CA, USA) and sequenced on an Illumina NovaSeq 6000 platform. A total of 276.18 Gb of raw data were obtained, corresponding to approximately 54 × coverage of the estimated genome (Table 1).

Table 1 Genome assembly statistics of E. carinicauda.

For PacBio sequencing, a 15 kb library was constructed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA) and sequenced in circular consensus sequencing (CCS) mode using a single 8 M SMRT Cell on the PacBio Sequel II platform (Pacific Biosciences). After filtering out low-quality reads and adapter sequences, 3636.91 Gb of PacBio subreads were obtained, representing approximately 708 × sequence coverage based on the estimated genome size (Table 1). Finally, 203.27 Gb of CCS reads were generated using SMRTLink 9.0, corresponding to approximately 40 × coverage of the estimated genome.

For the construction of the Hi-C library, DNA was fixed with 4% formaldehyde solution and digested with the 4-cutter restriction enzyme MboI. The digested fragments were labeled with biotin-14-dCTP, then the cross-linked fragments were subjected to blunt-end ligation. The library was sequenced on the Illumina NovaSeq 6000 platform, and approximately 552.65 Gb of Hi-C clean reads were generated, covering approximately 108 × of the estimated genome (Table 1).
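The coverage figures quoted for each library follow from dividing the data yield by the estimated genome size. The sketch below illustrates that arithmetic only; the function name `sequencing_depth` is hypothetical, and the genome size of 5.12 Gb is the rounded k-mer estimate reported in the genome survey, so small deviations from the reported depths (for example for the subreads) are expected.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Return the fold coverage implied by a data yield and a genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Rounded genome size from the 17-mer survey (assumption: used as-is here).
GENOME_SIZE_GB = 5.12

# Data yields in Gb, taken from the text above.
LIBRARIES = {
    "Illumina survey": 276.18,
    "PacBio subreads": 3636.91,
    "PacBio CCS": 203.27,
    "Hi-C": 552.65,
}

for name, yield_gb in LIBRARIES.items():
    depth = sequencing_depth(yield_gb, GENOME_SIZE_GB)
    print(f"{name}: {depth:.1f}x")
```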

Genome survey

The genome size and heterozygosity were estimated using the k-mer method before genome assembly17. The k-mer frequency distribution was calculated from the Illumina short reads using Jellyfish with k = 1718. The heterozygosity ratio was estimated using GenomeScope19 (https://github.com/schatzlab/genomescope). The genome size of E. carinicauda was estimated to be approximately 5.12 Gb, with 84.74% repetitive sequences and a heterozygosity of 2.62% based on the 17-mer analysis (Fig. 2), indicating that E. carinicauda has a complex genome.
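The principle behind a k-mer based size estimate can be sketched in a few lines: the total number of k-mers divided by the depth at the main peak of the k-mer frequency histogram approximates the genome size. The code below is a simplified illustration, not the Jellyfish or GenomeScope procedure, which fit a statistical model that also accounts for heterozygosity and repeats. The function names and the `min_depth` cut-off used to discard low-frequency (likely erroneous) k-mers are assumptions made for this example.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Count k-mers in the reads and return {depth: number of distinct k-mers}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(Counter(counts.values()))


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total k-mers) / (peak depth).

    k-mers seen fewer than `min_depth` times are treated as sequencing
    errors and ignored (a simplifying assumption).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram: an error peak at depth 1-2 and a main peak at depth 20.
toy = {1: 1000, 2: 100, 19: 50, 20: 100, 21: 50}
print(estimate_genome_size(toy))
```

In a highly heterozygous genome such as this one the histogram typically shows more than one peak, which is why a model-based tool is preferable to this single-peak heuristic.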

Fig. 2 The 17-mer analysis of the genome.

Chromosome-level genome assembly

The initial genome was assembled from the HiFi reads using Peregrine (v0.1.6.1) (https://github.com/cschin/peregrine), which applies a modified “best overlap graph” strategy to build contigs from the overlap graph. Contig overlaps were then removed from the assembled contig sequences using Purge_dups (https://github.com/dfguan/purge_dups). De novo assembly of the PacBio sequences yielded a preliminary assembly of 5.86 Gb, comprising 47,421 contigs with a contig N50 of 235.28 kb, a maximum contig length of 3,038,493 bp and a GC content of 34.79% (Table 1).
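For readers unfamiliar with the N50 statistic quoted above, it is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch follows; the function name `n50` and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total length 32, so N50 is reached once 16 bp are covered.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```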

Chromosome-level assembly of E. carinicauda was conducted using Hi-C technology with Juicer (v1.6.2)20 and 3D-DNA (v180922)21. The filtered Hi-C reads were aligned to the initial draft genome using Juicer, and only uniquely mapped, valid paired-end reads were passed to 3D-DNA for scaffolding. Juicebox (v1.9.8) was then used to manually order the scaffolds according to the chromosomal interaction heatmap, yielding a more precise chromosome-level genome of E. carinicauda22. Contact maps were visualized using HiCExplorer (v3.3)23. The chromosome number of E. carinicauda is 90 (2n), as determined by karyological observations in our previous study15. Accordingly, the contigs were clustered into 45 pseudochromosomes, with a scaffold N50 of 138.24 Mb. The total length of the 45 pseudochromosomes was 5.58 Gb (95.29% of the assembly) (Fig. 3a,b), with individual lengths ranging from 46.25 Mb to 338.48 Mb. The total length of the unplaced scaffolds was 275.86 Mb (Table 2).

Fig. 3 Genome assembly of E. carinicauda. (a) Hi-C chromosome interaction heatmap. A deeper colour represents a stronger interaction between contigs. (b) Characterization of the assembled genome. a, Physical map of the E. carinicauda pseudochromosomes (Mb scale); different colours represent different chromosomes. b, Proportion of repeated sequences in 1 Mb windows. c, Gene density, represented by the number of genes in 1 Mb windows. d, GC content, represented by the percentage of G/C bases in 1 Mb windows.

Table 2 Statistics of cluster number and length of single chromosome.

The quality of the final chromosome-level genome assembly was assessed using three methods. First, the Illumina DNA short reads obtained in our previous study were aligned to the assembled genome using BWA (v0.7.15)24, and approximately 99.00% of the reads could be mapped. Second, read depth and GC content were calculated in 10 kb windows to check for significant GC bias or sample contamination; the results indicated that the assembled genome was free of contamination (Fig. 4). Finally, assembly completeness was evaluated using benchmarking universal single-copy orthologs (BUSCO, v5.2.2) with the arthropoda_odb10 database25. Of the 1,013 single-copy orthologs, 92.89% were detected in the assembly (88.75% complete and 4.15% fragmented), and 7.11% were missing (Table 3).

Fig. 4 GC content and depth distribution. The horizontal axis represents the percentage of GC content, and the vertical axis represents the average sequencing depth.
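The windowed GC calculation that underlies a GC-depth plot of this kind can be sketched as follows. This is an illustration only: the function name `gc_by_window` is hypothetical, the handling of ambiguous bases (excluded from the denominator) is an assumption, and the depth axis would additionally require per-window read depth from the alignments, which is not shown.

```python
def gc_by_window(sequence, window=10_000):
    """Return a list of (window_start, gc_fraction) for fixed-size windows.

    The GC fraction is computed over A/C/G/T only; a window with no
    unambiguous bases (e.g. an assembly gap of Ns) yields None.
    """
    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        results.append((start, gc / acgt if acgt else None))
    return results


# Toy sequence with a window size of 4 instead of 10 kb.
for start, gc in gc_by_window("GGCCAATTNNNN", window=4):
    print(start, gc)
```

Windows whose GC fraction falls far from the bulk of the distribution, especially when they also differ in depth, are the ones that would suggest contamination in such a plot.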

Table 3 Universal single copy ortholog (BUSCO) assessment of E. carinicauda.

Compared with the previously published genome of E. carinicauda11, our assembly shows substantially improved quality and integrity. The contig N50 increased from 263 bp to 235,277 bp, a nearly 900-fold increase, and the scaffold N50 increased from 816 bp to 138,242,434 bp. In addition, the proportion of complete orthologues in the BUSCO assessment increased from 43.44% to 88.75%.

Repetitive and non-coding gene prediction

To detect repeat elements in the E. carinicauda genome, de novo and homology-based strategies were combined. For de novo annotation, miniature inverted-repeat transposable elements (MITEs) were identified using MITE-Hunter (v1.0)26. Long terminal repeat retrotransposons (LTRs) were detected using LTRharvest27 and LTR_Finder (v1.07)28, and the predictions of these two programs were integrated using LTR_retriever (v2.8.2)29. For homology-based annotation, RepeatMasker (v4.1.0)30 was used to search the E. carinicauda genome sequence against the RepBase database (http://www.girinst.org/repbase). RepeatMasker was then used to mask the repetitive sequences identified by the above methods, and RepeatModeler (v2.0)31 was used for de novo identification of additional repetitive sequences in the repeat-masked genome. In total, we identified approximately 4.19 Gb of repetitive sequences, accounting for approximately 71.49% of the assembled genome, of which 9.97% were tandem repeats. Among the repetitive sequences, LTRs (42.52%) accounted for the highest proportion of the assembly, followed by DNA transposons (10.81%) and LINEs (3.33%) (Table 4).
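A repeat proportion of this kind can be checked directly on a masked genome. The sketch below assumes a soft-masked assembly, in which repeat-annotated bases are written in lowercase; the source does not state which masking mode was used, so this is an assumption, and the function name `masked_fraction` is hypothetical.

```python
def masked_fraction(sequences):
    """Return the fraction of bases that are soft-masked (lowercase).

    `sequences` is any iterable of sequence strings, for example one
    string per scaffold read from a soft-masked FASTA file.
    """
    masked = 0
    total = 0
    for seq in sequences:
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy input: 4 masked bases out of 10.
print(masked_fraction(["AAAA", "cccc", "GG"]))
```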

Table 4 Repeat components in E. carinicauda genome.

Five types of non-coding RNA (ncRNA) were identified in the genome of E. carinicauda: microRNAs (miRNAs), transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs) and small nucleolar RNAs (snoRNAs). The tRNAs were predicted using tRNAscan-SE (v2.0)32. The other types of ncRNA were detected by alignment to the Rfam database33 using Infernal (v1.1.3)34. In total, 10,249 ncRNAs were annotated, including 3,702 rRNAs, 386 miRNAs, 5,811 tRNAs, 269 snRNAs, and 81 snoRNAs (Table 5).

Table 5 Classification of ncRNAs in E. carinicauda genome.

Gene prediction and annotation

We predicted the protein-coding genes in the E. carinicauda genome assembly using a comprehensive strategy that combined ab initio prediction, protein-based homology searches, and RNA sequencing evidence. For ab initio prediction, AUGUSTUS (v3.2.2)35, SNAP (v6.0)36, GlimmerHMM (v3.0.4)37 and GeneMark-ET38 were used to predict gene structures in the repeat-masked genome. For protein-based homology prediction, the protein sequences of related species, including Daphnia pulex (GCA_021134715.1), Procambarus virginalis (GCA_020271785.1), Fenneropenaeus chinensis (GCA_019202785.2), Penaeus japonicus (GCA_017312705.1), Penaeus monodon (GCA_015228065.1), Litopenaeus vannamei (GCA_003789085.1), Portunus trituberculatus (GCA_017591435.1) and Macrobrachium nipponense (GCA_015104395.1), were downloaded from the NCBI database and aligned against the E. carinicauda genome using GeMoMa (v1.7.1)39. Furthermore, RNA-seq data from different tissues and embryonic development stages (PRJNA594425, PRJNA746617, PRJNA756619, PRJNA881755, and PRJNA881756) were mapped to the genome using HISAT2 (v2.1.0)40. The full-length transcripts (PRJNA594425) from our previous study41 were assembled using Cufflinks (v2.1.1)42, and open reading frames were then predicted using PASA (v20140417)43. EVidenceModeler44 was used to integrate the gene models from these three lines of evidence into a consensus gene set. Finally, 44,288 high-quality protein-coding genes were predicted, with an average gene length of 28,448 bp, an average coding sequence length of 1,424 bp and an average of 6.09 coding exons per gene.
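Summary statistics such as average gene length, coding length and coding exons per gene are usually derived from the final annotation file. The sketch below shows one way to compute them from GFF3 records. It assumes a simple gene > mRNA > CDS hierarchy with one transcript per gene; the function names and the toy records are hypothetical and do not come from the published annotation.

```python
from collections import defaultdict


def parse_attributes(field):
    """Convert a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)


def gene_structure_stats(gff3_lines):
    """Return mean gene length, CDS length and CDS exon count per gene."""
    gene_lengths = {}   # gene ID -> genomic span (bp)
    mrna_to_gene = {}   # mRNA ID -> parent gene ID
    cds_records = []    # (parent mRNA ID, CDS segment length)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        length = end - start + 1  # GFF3 coordinates are 1-based, inclusive
        if feature == "gene":
            gene_lengths[attrs["ID"]] = length
        elif feature == "mRNA":
            mrna_to_gene[attrs["ID"]] = attrs["Parent"]
        elif feature == "CDS":
            cds_records.append((attrs["Parent"], length))
    if not gene_lengths:
        raise ValueError("no gene features found")
    cds_by_gene = defaultdict(list)
    for mrna_id, length in cds_records:
        cds_by_gene[mrna_to_gene[mrna_id]].append(length)
    n = len(gene_lengths)
    return {
        "genes": n,
        "mean_gene_length": sum(gene_lengths.values()) / n,
        "mean_cds_length": sum(sum(v) for v in cds_by_gene.values()) / n,
        "mean_cds_exons": sum(len(v) for v in cds_by_gene.values()) / n,
    }


def record(feature, start, end, attributes):
    """Build one tab-separated GFF3 line for the toy example."""
    return "\t".join(["chr1", "toy", feature, str(start), str(end), ".", "+", ".", attributes])


TOY_GFF3 = [
    "##gff-version 3",
    record("gene", 1, 1000, "ID=g1"),
    record("mRNA", 1, 1000, "ID=m1;Parent=g1"),
    record("CDS", 1, 100, "ID=c1;Parent=m1"),
    record("CDS", 201, 300, "ID=c2;Parent=m1"),
    record("CDS", 901, 1000, "ID=c3;Parent=m1"),
    record("gene", 2001, 2500, "ID=g2"),
    record("mRNA", 2001, 2500, "ID=m2;Parent=g2"),
    record("CDS", 2001, 2150, "ID=c4;Parent=m2"),
]

print(gene_structure_stats(TOY_GFF3))
```

Genes with several transcripts would need a rule for choosing a representative isoform (for example the longest CDS) before averaging; that choice is left out here.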

These genes were functionally annotated using BLAST searches against the NR, SwissProt, eggNOG, InterPro, GO and KEGG databases45, and the annotation results from these databases were merged. In total, 70.53% of the predicted genes were assigned at least one functional annotation (Table 6).

Table 6 Statistical results of gene function annotation.