Genome sequencing
To achieve a chromosome-level assembly of the Chowghat Green Dwarf (CGD) cultivar, data from four sequencing technologies were integrated: Oxford Nanopore Technologies (ONT) long reads, paired-end Illumina short reads, and Hi-C proximity ligation data, along with the PacBio long-read data generated in our previous study9. The Illumina library yielded 86.46 Gb of raw data with 82.40% high-quality (HQ) reads after quality control (Supplementary Table 1). The ONT library produced 23.80 Gb of raw data with an N50 read length of 16,245 bp (Supplementary Table 2). The Hi-C library yielded 123.09 Gb of data with 97.41% high-quality reads (Supplementary Table 3).
Genome assembly and scaffolding
We used a hybrid assembly approach to generate the contig-level genome of CGD by integrating raw data generated on ONT, PacBio, and Illumina sequencers. Using the MaSuRCA assembly pipeline, we first error-corrected the 23.80 Gb of ONT long reads, which had an average length of 4,255 bp and an N50 of 16.2 Kb. We then combined these with error-corrected PacBio long reads and integrated ~ 471.8 million Illumina short reads to further refine the accuracy of the assembly. This multi-layered strategy produced the primary CGD genome assembly of ~ 2.68 Gb, comprising 44,376 contigs with an N50 of 122 Kb and an N90 of 34 Kb.
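The N50 and N90 statistics quoted above can be illustrated with a minimal sketch. It assumes the usual definition (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches 50% or 90% of the total assembly size). The contig lengths below are toy values, not CGD data, and the function name `nx` is hypothetical.

```python
def nx(lengths, x=50):
    """Return the Nx value: the contig length at which the cumulative length
    of contigs (sorted longest first) first reaches x% of the assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy contig lengths in bp (hypothetical, not taken from the CGD assembly)
toy_contigs = [100, 80, 60, 40, 20]
n50 = nx(toy_contigs, 50)
n90 = nx(toy_contigs, 90)
print("N50:", n50, "N90:", n90)
```

For real data, the lengths would typically be read from the assembly FASTA or its index file before calling the function.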
We then employed a reference-based assembly approach using a published high-quality coconut dwarf genome (Accession no. GWHBEBU00000000)6 to lay the groundwork for our genome assembly. Following this, we integrated Hi-C sequencing data, which allowed us to anchor ~ 2.62 Gb (~ 97%) of the genome to 16 pseudomolecules, leaving ~ 65 Mb of unplaced sequence (1,914 scaffolds) (Fig. 2A, B; Table 1). Scaffolds shorter than 1,000 bp were removed. The pseudomolecules were named based on their sizes. The chromosomes varied in length from ~ 241 Mb (longest) to ~ 90 Mb (shortest), with an N50 of ~ 174 Mb (Supplementary Table 4).
Assembly of Chowghat Green Dwarf genome indicating different genomic features. (A) The circles from outside to inside represent: (A) physical map; (B) z-score normalized GC content; (C) gene density; (D) TE density; (E) SSR density; (F) Gibberellic acid biosynthesis genes; (G) oil biosynthesis genes; and (H) NLR loci. The pseudochromosomes are plotted in a unit of 1 Mb. (B) Hi-C image at 5 Mb resolution of the final CGD Genome after manual curation.
The current assembly of the CGD cultivar is a significant improvement over the earlier published CGD genome9, which was assembled from short-read Illumina and long-read PacBio data. The current high-quality assembly comprises 16 pseudomolecules representing ~ 97% of the CGD genome. Whereas our earlier study9 reported 26,855 scaffolds with an N50 of ~ 128 Kb and a final genome size of ~ 1.93 Gb, the current assembly has a final size of ~ 2.68 Gb (including unplaced scaffolds) and an N50 of ~ 174 Mb.
Assessment of genome assembly
To assess the quality and completeness of our genome assembly, we performed BUSCO analysis using the embryophyta_odb10 database. As presented in Supplementary Fig. 2, 96.80% of BUSCOs were complete and 1.80% were fragmented in the CGD assembly. BUSCO completeness improved from ~ 84% (previous assembly; Rajesh et al., 2020) to ~ 96.80% (present assembly).
Additionally, we aligned the clean genomic reads from the Illumina libraries to our assembled CGD genome. Approximately 93.7% of the short Illumina reads could be aligned to the CGD assembly. These results support the reliability of the assembly and its suitability for subsequent genome characterization and annotation. The contiguity, completeness, and accuracy of the genome were further evaluated using the k-mer completeness score (87.35%; Supplementary Fig. 3), the consensus per-base accuracy (QV 32.30; Supplementary Table 6), and the LTR assembly index (8.85; Table 1). These indices highlight the enhanced quality of the current CGD genome assembly.
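To put the consensus quality value in perspective, the sketch below converts a QV into a per-base error probability. It assumes the QV follows the usual Phred convention, QV = -10 log10(error probability), which is not stated explicitly here. The function names are hypothetical.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Inverse conversion: Phred-scaled QV for a given per-base error rate."""
    return -10 * math.log10(error_rate)


# QV reported for the CGD assembly (Supplementary Table 6)
err = qv_to_error_rate(32.30)
bases_per_error = 1 / err
print(f"error rate ~ {err:.2e}, about one error per {bases_per_error:.0f} bases")
```

Under this assumption, a QV of 32.30 corresponds to an error rate of roughly 6 in 10,000 bases.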
Identification of telomeres
Telomeric repeat sequences were identified at both ends of five chromosomes; nine chromosomes showed telomere signals at one end only, and two chromosomes showed no telomeric signal, which suggests the potential for further improvement of the assembly (Supplementary Fig. 4).
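The both-ends, one-end, and no-signal classification can be sketched as a simple motif count in the terminal windows of each chromosome. This is an illustration only, not the method used in the study: the motif (the Arabidopsis-type plant repeat TTTAGGG and its reverse complement), the window size, and the copy threshold are all assumptions, and the function names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def telomere_ends(seq, motif="TTTAGGG", window=200, min_copies=3):
    """Return (start_has_signal, end_has_signal) for one chromosome sequence.

    The motif, window, and min_copies defaults are illustrative assumptions.
    """
    seq = seq.upper()
    pattern = re.compile("%s|%s" % (motif, reverse_complement(motif)))
    head_hits = len(pattern.findall(seq[:window]))
    tail_hits = len(pattern.findall(seq[-window:]))
    return head_hits >= min_copies, tail_hits >= min_copies


def classify(seq, **kwargs):
    """Label a chromosome by the number of ends carrying a telomere signal."""
    n_ends = sum(telomere_ends(seq, **kwargs))
    return {2: "both ends", 1: "one end", 0: "none"}[n_ends]


# Toy sequences (hypothetical, not CGD chromosomes)
body = "ACGT" * 100
print(classify("CCCTAAA" * 5 + body + "TTTAGGG" * 5))  # both ends
print(classify("CCCTAAA" * 5 + body))                  # one end
print(classify(body))                                  # none
```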
Repeat identification and annotation
We characterized repetitive elements in the CGD genome using a combination of de novo and homology-based approaches, the latter drawing on the Repbase database. Repetitive DNA sequences accounted for about 2.19 Gb (81.64%) of the CGD genome (Supplementary Table 5), which is reflected in the enlarged genome size of coconut in comparison to the date palm (671 Mb)56 and oil palm (1.8 Gb)57 genomes. Among the repetitive elements, LTR elements were the most prevalent, constituting 53.76% of the genome. Within the LTR class, copia and gypsy elements were the most abundant, comprising 40.57% and 12.98% of the genome, respectively. These observations corroborate reports of an abundance of copia and gypsy elements in palm genomes58. The abundance of these repetitive elements, particularly LTR elements, also underscores their significant influence in shaping the genomic landscape of coconut palms (Fig. 3).
Functional annotation of the CGD v2 protein-coding genes. The upset plot details the unique and overlapping annotations contributed by RefSeq, Interproscan, GO, and KEGG.
Screening of the genome assembly identified 592,389 putative microsatellite loci across 1,928 sequences, of which 433,941 were mononucleotide repeats. Among the remaining simple sequence repeat (SSR) loci, dinucleotide repeats were the most prevalent, with 92,555 instances, accounting for approximately 15.6% of the total. Additionally, 90,238 complex SSRs were discovered (results available in Figshare).
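As an illustration of how perfect microsatellites can be located, the sketch below scans a sequence with regular expressions for mono-, di-, and trinucleotide repeats. It is not the tool used in the study: the minimum repeat counts and the function names are assumptions, and complex (compound) SSRs are not handled.

```python
import re

# Minimum number of repeat units per unit length (illustrative assumptions)
MIN_REPEATS = {1: 10, 2: 6, 3: 5}


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter motif (e.g. 'AA')."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, unit, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A unit of length k followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Toy sequence (hypothetical): poly-A, (AT)n and (CAG)n tracts with short spacers
toy = "GGC" + "A" * 12 + "CGC" + "AT" * 7 + "GCC" + "CAG" * 5 + "T"
for start, unit, copies in find_ssrs(toy):
    print(start, unit, copies)
```

The `is_primitive` check keeps a poly-A tract from being reported a second time as an (AA)n dinucleotide repeat.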
EDTA analysis identified 9,142 intact LTR retrotransposons within the CGD genome. Genome-wide insertion time analysis of these intact LTRs showed that copia and gypsy elements proliferated in the CGD genome during distinct periods. The copia elements exhibited recent activity, with most proliferating within the last 2 million years. In contrast, gypsy elements showed evidence of more ancient activity, with divergence occurring between 2 and 6 million years ago (Fig. 4A); these results are consistent with previous studies59.
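Insertion times of intact LTR retrotransposons are commonly estimated from the divergence between the two LTRs of each element, using T = K / (2r), where K is the corrected distance and r the substitution rate. The sketch below shows that calculation under stated assumptions: the Jukes-Cantor correction and the rate of 1.3e-8 substitutions per site per year are placeholders, not values taken from this study, and the function names are hypothetical.

```python
import math


def jc69_distance(ltr5, ltr3):
    """Jukes-Cantor distance between two aligned LTRs (gap columns ignored)."""
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time(distance, rate=1.3e-8):
    """Years since insertion, T = K / (2r); the default rate is an assumed placeholder."""
    return distance / (2 * rate)


# Toy aligned LTR pair (hypothetical): 26 mismatches over 1,000 aligned sites
ltr5 = "A" * 1000
ltr3 = "A" * 974 + "C" * 26
age = insertion_time(jc69_distance(ltr5, ltr3))
print(f"estimated insertion time: {age / 1e6:.2f} million years")
```

The factor of 2 reflects that both LTRs accumulate substitutions independently after insertion.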
Genome-wide analysis of intact LTRs and RGAs in the CGD genome. (A) Insertion time analysis of intact LTRs within the CGD genome. (B) Phylogenetic tree generated using the maximum likelihood method in IQ-TREE from the sequence alignment of all predicted RGAs found in the CGD genome assembly.
Gene prediction and functional annotation
The gene prediction pipeline, MAKER, identified 34,483 protein-coding genes in the CGD genome, with an average gene length of 6,541 bp and 5.9 exons per mRNA (shown in Figshare). Through homology searches against the RefSeq database, we successfully annotated 29,147 genes, representing approximately 84.5% of the predicted genes (Fig. 3). Meanwhile, InterProScan provided domain annotations for 22,556 CGD genes (data available in Figshare).
Identification of non-coding RNAs
Our examination of the CGD genome identified a diverse range of non-coding RNAs (ncRNAs). Specifically, we detected 1,676 tRNAs and 65 copies of 18S rRNA, 70 copies of 28S rRNA, 74 copies of 5.8S rRNA, and 1,174 copies of 5S rRNA. Additionally, a substantial number of other ncRNAs were found, totaling 20,433 (data available in Figshare). This comprehensive identification offers a detailed view of the ncRNA landscape within the coconut genome, underscoring its complexity and functional diversity.
Prediction of resistance gene analogs (RGAs)
To identify resistance genes in the CGD genome, we conducted a genome-wide prediction of RGAs, which identified 1,368 RGAs. The predicted RGAs were grouped into 24 primary categories, determined by their internal domain structures and motif arrangements (Table 2). The survey revealed that KIN domain-containing gene families dominated the CGD genome, with 133 single kinase domain genes and 505 kinase-transmembrane domain (KIN-TM) genes. The RLK gene family was represented by around 237 members, while the RLP family had 112 members. Other categories included NBS-transmembrane (N-TM) with 58 genes, lectin-receptor kinase (LecRK) with 35, and cytoplasmic kinase (CK) with 83. Minor categories such as coiled-coil lysM domain receptor (CLys), coiled-coil Toll/interleukin-1 receptor (CT), and coiled-coil lectin-receptor kinase (CLecRK) had comparatively low counts, indicating their limited presence in the coconut genome. In addition, CN and CNL genes were represented by 20 and 22 genes, respectively, while NL genes had 23 occurrences.
This distribution highlights the diversity and relative abundance of different R gene classes in the CGD genome, with KIN and RLK-related genes playing a predominant role. Phylogenetic analysis of resistance gene analogs (RGAs) in the CGD genome revealed three major clades based on domain structures and evolutionary relationships11,59. The first clade consists of NBS-containing genes, including coiled-coil nucleotide-binding site (CN), coiled-coil nucleotide-binding site leucine-rich repeat (CNL), nucleotide-binding site leucine-rich repeat (NL), nucleotide-binding site (N), and nucleotide-binding site-transmembrane (NTM), which clustered together, indicating their shared role in pathogen recognition and defense mechanisms. The second clade encompasses the kinase-related families, including KIN, KINTM, RLK, and RLP, highlighting their collaborative functions in signal transduction and stress responses. The third clade includes other R gene classes with lower counts, such as C, CK, CL, CLECRK, CLK, CLYS, CT, CTM, L, LEC, LECRK, LYK, LYS, T, and TRAN. These findings reveal the diversity and functional specialization of RGAs, contributing to the coconut palm’s resilience against biotic challenges. Compared to the Catigan Green Dwarf genome59, a significantly higher number of RGAs were identified in the present study (Table 2). In the phylogenetic tree, the three major clades correspond to the NBS-containing genes, the RLK and RLP gene families (which cluster together), and KIN-TM as the third major clade, with all other categories falling within or between these three clades (Fig. 4B). The availability of whole genome sequences of diverse coconut accessions will enable a deeper understanding of the evolution of RGAs and support breeding for disease-resistant varieties.
Prediction of genes involved in gibberellic acid (GA) and oil biosynthesis
Gibberellins represent a critical class of phytohormones that play a pivotal role in enhancing plant growth, developmental processes, and extension of post-harvest longevity in horticultural crops60. These hormones contribute to increasing the resilience of crops to a repertoire of stresses by altering the expression of genes associated with antioxidant systems (both enzymatic and non-enzymatic), osmoprotectants, as well as various proteins and enzymes60. In addition, gibberellins function synergistically with other plant growth regulators, thereby facilitating augmented growth dynamics and physiological activities in plants60,61. Seventy-seven genes related to GA biosynthetic pathways (Fig. 2 layer F) were identified in the CGD genome (data available in Figshare).
Oil biosynthesis is a multifaceted physio-biochemical process. In the past few years, significant advances have been made in elucidating the biochemical pathways that mediate oil synthesis. As a result of these developments, numerous pivotal enzyme genes implicated in oil biosynthesis have been successfully isolated and characterized; these achievements have paved the way for modest advancements in leveraging gene function technologies to augment oil content within seeds or to refine the profile of fatty acids62. Our understanding remains limited regarding transcriptional regulatory factors influencing genes associated with lipid biosynthesis in coconut. Furthermore, knowledge of how key enzymes might synergistically coordinate and regulate synthetic and metabolic routes is still being determined. Investigating these elusive mechanisms promises to reveal further regulatory insights and strategies to enhance lipid accumulation in coconut. In total, 444 genes related to oil biosynthetic pathways were mined from the current CGD genome (Fig. 2 layer G). Further studies on oil metabolism gene expression profiles in coconut might shed light on the co-expression networks pertaining to acyl metabolism-related genes.
Whole genome duplication (WGD) analysis
WGD events were analyzed by calculating the Ks values between paralogous gene pairs within C. nucifera, E. guineensis, A. catechu, and C. simplicifolius. The Ks distribution analysis revealed that all Arecaceae members analyzed experienced two rounds of WGD, τ and ω (Fig. 5; Supplementary Fig. 5). Evolutionary dating based on Ks values indicated occurrence times similar to those previously described6. Two rounds of WGD have occurred in all palms since the divergence of monocots and eudicots56,63. The first event, τ-WGD, is shared by nearly all monocots, except Alismatales and Acorales, and occurred at ~ 150 Mya63. The second event, ω-WGD, unique to palm species, occurred at ~ 75 Mya, resulting in palaeotetraploidy. In the current study, we observed two distinct peaks in the Ks distribution, which suggests that C. nucifera has experienced two whole genome duplication events during its evolution, an observation reported earlier for the Chinese coconut genome5.
The density distribution of synonymous nucleotide substitutions (Ks) in whole genome duplication analysis.
Comparative genomics
The gene synteny analysis results revealed that the CGD genome exhibited significant gene structure and order conservation, sharing 447 syntenic blocks with the oil palm genome and 489 syntenic blocks with the date palm genome (Supplementary Fig. 6). This provides insights into the evolutionary relationships and genomic synteny among these palm species.
Whole-genome alignment using the Dgenies tool revealed a one-to-one syntenic relationship between the CGD genome from this study and the previously published Chinese Tall (CNGB accession – GWHBEBT00000000) and Chinese Dwarf (CNGB accession – GWHBEBU00000000) coconut genome assemblies (Fig. 6). This high-quality alignment, with no evident chromosomal rearrangements, underscores the robustness and quality of the CGD genome assembly generated in the current study.
Syntenic dot plot between Chowghat Green Dwarf (CGD) coconut cultivar and (A) Chinese Tall and (B) Chinese Dwarf coconut cultivars. The dot plot axis matrix is expressed in nucleotides, with the dot plot axes exhibiting a square relationship.
‘Kalpa Genome Resource’: modules and interface
Figure 7 gives an overview of the architecture of the Kalpa Genome Resource web server. The information in this customized database has been grouped into five basic functional web pages: ‘Home,’ ‘Annotation,’ ‘Genome Browser,’ ‘Transcriptome,’ and ‘Download’ (Fig. 8).
(i) The ‘Home’ page holds descriptions of the database, genome statistics, and news updates (Fig. 8A).
(ii) The ‘Annotation’ page has multiple search functions: Locus Search, KEGG Search, GO Slim Search, Text Search, and BLAST Search, the last of which performs sequence homology searches against genome and transcriptome sequences. BLAST Search is implemented using the NCBI-BLAST API. The ‘Annotation’ page also has two subpages, ‘Simple Sequence Repeats’ (SSR) and ‘Repeat Elements’, which provide details of the SSRs and repeat elements identified in the KGR database (Fig. 8B and C).
(iii) The ‘Genome Browser’ page allows users to navigate interactively through genomic sequences and their genomic elements. The page also supports downloading sequences for targeted regions or selected genomic elements; however, access to this feature is restricted and controlled by the administrator. The architecture of this page also allows additional genome versions to be integrated in the future if needed. The ‘Genome Browser’ page uses JBrowse for its functionality (Fig. 8D).
(iv) The ‘Transcriptome’ page contains an overview of the transcripts predicted from the genome and their comprehensive annotation. The database architecture allows users to update multiple genome or transcriptome datasets and their metadata as required without changing the existing database structure or functionality.
Different pages of the ‘Kalpa Genome Resource’ web server. (A) Home page; (B) Different search functions; (C) BLAST search option; (D) JBrowse gene visualization; (E) BLASTp search results; and (F) Download options.
Search modules of the ‘Kalpa Genome Resource’ browser
This section describes some common queries that can be addressed using the ‘Kalpa Genome Resource’ browser. Detailed instructions for carrying out these and other queries can be found at Kalpa Genome Resource.
The simplest query search methods available are:
- Locus search: Find an object with a known KGR locus number.
- GOSlim search: Find objects associated with a known plant GOSLIM ID or Locus ID.
- KEGG search: Find objects by KEGG Pathway IDs.
- Text search: Find objects that contain one or more keywords anywhere in their text.
- BLAST search: Sequence homology search against the genome and transcriptome databases.
Locus search
The Locus search function allows the KGR database to be searched with one locus ID at a time. The search returns the functional details of the locus, including (i) genome version, (ii) gene product name, and (iii) comprehensive Gene Ontology (GO) annotation categorized as Biological Process, Molecular Function, and Cellular Component. It also lists the KEGG pathways in which the predicted gene plays a role. Furthermore, details of sequence homology are provided based on the matching protein ID from the NCBI RefSeq protein database.
GOSLIM search
The GOSLIM search function provides the option to search the KGR database with plant GOSLIM IDs and their annotation. These plant-specific GOSLIM IDs are associated with Locus ID during their sequence homology-based annotation process. The search page provides two options to users. They can search using the ‘Locus ID’ or the ‘GOSLIM ID’ of their interest. The locus-based search function returns the list of GOSLIM IDs and their details associated with the Locus ID. However, the GOSLIM ID-based search returns the list of all the Locus IDs in the KGR database, which are reported to be associated with the GOSLIM ID of interest with their annotation. The Locus IDs are hyperlinked to their detailed functional page in both search results. The search result page also includes a sub-search box to filter the results.
KEGG search
The KEGG search function allows the KGR database to be searched with KEGG Pathway IDs. These KEGG pathway IDs were associated with Locus IDs during the sequence homology-based annotation process. The search page lets users identify all the Locus IDs in the KGR database that were annotated with the KEGG pathway ID of interest. The search returns the list of Locus IDs, their KEGG Gene Ortholog IDs, and all the KEGG pathways associated with each Locus ID. The search result page also hyperlinks to the publicly available KEGG Gene Orthology page for further detailed information and includes a sub-search box to filter the results.
TEXT search
Text Search is the most powerful feature of the KGR database annotation search. It allows users to query the database with any keyword of interest. For example, to identify all the Locus / Gene IDs annotated as “Receptor” proteins, a user types the keyword into the text search box and initiates the search. The search result page lists all the Locus IDs annotated with the word “Receptor”, each hyperlinked to its detailed functional annotation page so users can get more information about the specific Locus ID. The result page also provides a sub-search box to narrow down the results with another keyword.
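The keyword search followed by a sub-search filter can be pictured with the sketch below. It is only a conceptual illustration and does not reflect how the KGR server is implemented; the records, locus IDs, and function name are all hypothetical.

```python
# Hypothetical annotation records; locus IDs and descriptions are invented
RECORDS = [
    {"locus": "LOC_0001", "description": "Receptor-like protein kinase"},
    {"locus": "LOC_0002", "description": "Gibberellin 20-oxidase"},
    {"locus": "LOC_0003", "description": "LysM domain receptor-like kinase"},
    {"locus": "LOC_0004", "description": "Glutamate receptor 3.4"},
]


def text_search(records, keyword):
    """Case-insensitive substring match on the description field."""
    needle = keyword.lower()
    return [r for r in records if needle in r["description"].lower()]


# Primary search, then the sub-search box narrows the first result set
hits = text_search(RECORDS, "Receptor")
refined = text_search(hits, "kinase")
print([r["locus"] for r in hits])
print([r["locus"] for r in refined])
```

Applying the same function to the first result set mirrors the way the sub-search box narrows an existing list rather than querying the whole database again.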
BLAST search
The BLAST search page contains options for entering a query sequence (genome or transcriptome sequence) by typing or uploading a FASTA file. The user can also select different BLAST parameters and algorithms. To achieve this, we have implemented the available NCBI-BLAST API. However, downloading database sequences that give hits with the query sequence is restricted, while viewing sequence alignment in a graphical format is free (Fig. 8E).
Querying by region of interest
A region of interest can be specified using a pair of flanking markers: genes, genomic coordinates, transcripts, amplifiers, or any other mapped object. Given a region of interest, the comprehensive map is searched to find all loci within it. These loci can be displayed in a table or graphically as slices through the comprehensive map or as slices through a chosen set of primary maps. The comprehensive map slice shows all loci in the region, including genes, repeats, SSRs, etc.
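A region query of this kind reduces to an interval-overlap test once the flanking markers have been resolved to coordinates. The sketch below illustrates the idea with an in-memory feature list; it is not the KGR implementation, and the feature names, coordinates, and function names are hypothetical.

```python
from collections import namedtuple

Feature = namedtuple("Feature", "name kind chrom start end")

# Hypothetical mapped objects (names and coordinates invented for illustration)
FEATURES = [
    Feature("geneA", "gene", "chr1", 1000, 2000),
    Feature("ssr1", "SSR", "chr1", 2500, 2520),
    Feature("geneB", "gene", "chr1", 3000, 4000),
    Feature("rep1", "repeat", "chr1", 5000, 5600),
    Feature("geneC", "gene", "chr2", 1500, 2500),
]


def region_from_flanks(features, left_name, right_name):
    """Resolve a pair of flanking markers to (chrom, start, end)."""
    by_name = {f.name: f for f in features}
    left, right = by_name[left_name], by_name[right_name]
    if left.chrom != right.chrom:
        raise ValueError("flanking markers lie on different chromosomes")
    return left.chrom, min(left.start, right.start), max(left.end, right.end)


def features_in_region(features, chrom, start, end):
    """Return features on chrom overlapping the closed interval [start, end]."""
    return [f for f in features
            if f.chrom == chrom and f.start <= end and f.end >= start]


chrom, start, end = region_from_flanks(FEATURES, "geneA", "geneB")
found = features_in_region(FEATURES, chrom, start, end)
print(chrom, start, end, [f.name for f in found])
```

For a genome-scale feature set, an indexed structure such as an interval tree or a tabix-indexed file would usually replace the linear scan.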
Transcriptome datasets
The KGR database also stores and displays functional genomics and transcriptomic datasets, e.g., Gene Ontologies (GO), pathways, expression data, etc. This makes it easier to understand biological processes under normal or treated conditions. Researchers trying to identify similarities and differences between molecular conditions can upload tissue expression data. Expression profiles of the transcriptome datasets can also be visualized in graphical format and retrieved through the KGR database.
Retrieving a graphical view of the locus position
The results of queries for genes, primers, ESTs, etc., can be displayed on the KGR comprehensive map. Users can retrieve various default track information and customized track information like SNP, InDels, SSRs, repeats, and transcripts from the core database. If the results are spread across several chromosomes, multiple chromosomes will be displayed. Further, double-clicking on any of these genes reveals detailed information for the selected gene.
JBrowse module of KGR
JBrowse is a popular interactive browser for visualizing genome data, encompassing genome sequences, gene structures, protein-coding gene annotations, single nucleotide polymorphisms (SNPs), site information, and expression profiles. In this study, the JBrowse2 tool was incorporated into the KGR, and the CGD genome sequence generated in the current study, RNA-Seq data, and annotation data (in ‘general feature format’ files) were imported into it (Fig. 8D).
Downloads
The portal provides a download option for accessing the CGD genome and transcriptome data in various file formats. All the generated files and images can be retrieved from the ‘Home’ page using the download button (Fig. 8F).
In summary, we have assembled a high-quality genome of the C. nucifera CGD cultivar by incorporating ONT long-read sequencing, highly accurate short-read sequencing, and Hi-C technologies. The genome assembly and its accompanying resources are a significant enhancement for the palm genomics community in general and the coconut genomics community in particular. They will provide valuable tools to aid coconut breeding and deepen our understanding of the biology and evolution of this important tropical palm. As new and diverse omics data (viz., genomics, transcriptomics, proteomics, metabolomics) become available in coconut, e.g., from re-sequencing, pan-genome, and epigenome experiments, the ‘Kalpa Genome Resource’ will continue to evolve, supporting the coconut research community as well as researchers and breeders of other palms. Future development will prioritize updating, integrating, and centralizing information and incorporating advanced functionalities to facilitate easier and more comprehensive access to this genomic resource.
- Source: https://www.nature.com/articles/s41598-024-79768-3






