Metadata formats and schemas
NCBI Datasets delivers metadata as data report files in JSON and JSON Lines (https://jsonlines.org) formats. Both formats are simple, have strong tooling support, and balance human readability with machine readability. Compared to tabular formats, they maintain the hierarchical organization of data, preserve relationships between fields, and are extensible: new fields or sets of fields can be added without introducing breaking changes.
Although all metadata is provided in JSON or JSON Lines format by default, many users prefer to browse and analyze data in tabular form. To address this, a companion tool called dataformat (see Command-line tools section below) converts JSON Lines reports to tab-separated values (.tsv) or Excel® (.xlsx) formats, and table downloads are offered directly from our web interfaces.
Detailed schemas for each data report, including field descriptions and example values, are available on our documentation pages under Data report schemas (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/data-reports/). All fields in the data reports are included in the schema pages.
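To make the hierarchy and extensibility points concrete, here is a minimal Python sketch of reading a JSON Lines report. The field names ("organism", "name", "notes") and the read_json_lines helper are assumptions chosen for illustration, not the official data report schema; the accessions are taken from the example output later in this article.

```python
import json

# Two illustrative JSON Lines records: one JSON object per line.
# Field names are assumptions for this sketch, not the official schema.
report = "\n".join([
    '{"accession": "GCA_030265425.1", "organism": {"name": "Mus musculus"}}',
    '{"accession": "GCA_949316305.1", "organism": {"name": "Mus musculus"}, '
    '"notes": {"curated": true}}',
])


def read_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = read_json_lines(report)

# Nested fields keep their grouping, and the extra "notes" field in the
# second record does not disturb code that reads only the older fields.
for record in records:
    print(record["accession"], record["organism"]["name"])
```

Because each line is a complete JSON object, a large report can also be processed one line at a time rather than loaded whole.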
How does NCBI Datasets support data exploration and downloads?
To provide maximum flexibility in data access for biologists and bioinformaticians, NCBI Datasets offers a variety of interfaces to browse and download data, including web pages, command-line tools, and OpenAPI (https://www.openapis.org/). Most users will find that the web pages, the command-line tools, or a combination of both best serve their data exploration and download goals. The NCBI Datasets API also serves as the primary data source for both the web pages and command-line tools, ensuring data consistency regardless of the interface used. Together, these interfaces support the FAIR principles of findability and accessibility through easy discoverability, scalable download options, extensive metadata, and interoperable formats.
Web interface
The NCBI Datasets web interface (https://www.ncbi.nlm.nih.gov/datasets) offers a user-friendly organism-centric experience for searching, browsing, and downloading data from across the NCBI sequence databases (Fig. 3). Users can enter a species name from the NCBI Datasets homepage to navigate to a taxonomy page representing that species. The taxonomy pages act as a gateway to NCBI data available for each node of the taxonomic tree, connecting to gene and genome pages relevant to that taxonomic node, as well as to related taxa.
NCBI Datasets taxonomy pages replace the legacy Entrez Genome pages, providing basic taxonomic information and a high-level look at the gene and genome data available for that taxon. The taxonomy pages conveniently link to the NCBI Datasets taxonomy browser, genome table, genome pages, and gene table for annotated genomes. Additionally, they provide links to data stored in other NCBI databases not yet incorporated into NCBI Datasets.
NCBI Datasets gene and genome tables allow users to browse large numbers of records simultaneously, with links to individual records and options to download data. The gene table allows browsing of genes for a species, with each row representing a single gene and columns representing various metadata fields. Its “Actions” column links to the Genome Data Viewer (https://www.ncbi.nlm.nih.gov/gdv/) and NCBI ortholog pages (https://www.ncbi.nlm.nih.gov/gene/59272/ortholog/?scope=7742). The genome table allows users to browse assembled genomes, with each row representing a genome and columns representing key metadata. It offers options to download data packages and tables, and its “Actions” column provides links to the NCBI Datasets genome pages, BLAST, and the Genome Data Viewer (GDV).
NCBI Datasets genome pages represent individual assembled genomes, replacing the legacy Entrez Assembly pages. Like the legacy pages, they describe single genomes and provide options for downloading data. Unlike the legacy pages, they consolidate information from multiple NCBI databases (such as Assembly, BioProject, and BioSample) on a single page. In addition, command-line and curl commands are provided at the top of each page, so that users who browse on the web can easily retrieve the same data in a terminal environment. This also offers an accessible introduction for those unfamiliar with the command line.
Finally, the taxonomy browser allows the easy exploration of organisms within their taxonomic context, visualizing available assembled genomes for different ranks. As NCBI Datasets grows, we will continue to add a taxonomic view to additional data types. All data visible on NCBI Datasets web pages can be downloaded via the blue download buttons.
Command-line tools
The command-line tools offer programmatic access to all data available on the NCBI Datasets web pages. With the increasing complexity of metadata and the growing volume of sequence data, robust programmatic access is crucial for biologists to fully utilize NCBI data.
We offer two command-line tools: datasets and dataformat. The datasets tool allows users to download genome, gene, and virus data packages. The companion tool, dataformat, converts content from the metadata reports that use the JSON Lines format into more user-friendly tabular formats (Fig. 4). The command-line tools are available on Mac, Linux, and Windows platforms via NCBI or may be installed using conda (https://anaconda.org/conda-forge/ncbi-datasets-cli). Our tools are accessible even to users new to the command-line environment as they are intuitive, well-organized, and flexible.
NCBI Datasets command-line tool syntax is characterized by commands and nested subcommands, followed by query terms such as taxonomic names or genome, gene, or BioProject accessions. The subcommands are named using easily recognizable plain-language terms such as “gene” or “genome.” This is a distinct departure from NCBI E-utilities (ref. 12), which require knowledge of NCBI’s many databases.
The datasets command-line tool features two major top-level commands: “download” and “summary” (Fig. 4). The download command returns the NCBI Datasets data package as a zip archive, while the summary command prints metadata to the terminal screen. Subcommands are used to specify the type of data requested. At each subcommand level, the datasets tool provides context-specific help, including a brief overview of subcommands and descriptions of the corresponding available flags. For example, to get more information about how to obtain genome metadata for a particular taxon of interest, the command datasets summary genome taxon --help returns a list of available flags, including options to restrict to only annotated or reference genomes, or genomes released during a specified date range (Fig. 5).
Query terms and identifiers that can be used to find data in the web interface can also be used with the command-line tool. For example, genome data can be queried using common or scientific taxonomic names, such as “human” or “Mus musculus,” or using assembly (e.g., GCF_000001405.40) or BioProject (e.g., PRJNA705675) accessions. Similarly, gene data can be queried by common and scientific species names, gene symbols or aliases, and transcript or protein accessions.
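The distinction between these kinds of genome query terms can be sketched in a few lines of Python. The regular expressions below are assumptions inferred from the example accessions in the preceding paragraph, not NCBI's own validation rules, and classify_genome_query is a hypothetical helper rather than part of any NCBI tool.

```python
import re

# Assumed patterns, based only on the example identifiers in the text:
# assembly accessions look like GCF_000001405.40 or GCA_030265425.1,
# BioProject accessions look like PRJNA705675.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")
BIOPROJECT_RE = re.compile(r"^PRJ[A-Z]{2}\d+$")


def classify_genome_query(term):
    """Guess whether a genome query term is an accession or a taxon name."""
    term = term.strip()
    if ASSEMBLY_RE.match(term):
        return "assembly accession"
    if BIOPROJECT_RE.match(term):
        return "BioProject accession"
    # Anything else is treated as a common or scientific taxonomic name.
    return "taxon name"


for query in ["human", "Mus musculus", "GCF_000001405.40", "PRJNA705675"]:
    print(query, "->", classify_genome_query(query))
```

A script that accepts mixed user input could use such a check to decide how to pass each term to the command-line tool.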
For example, the following command prints genome metadata to the screen describing the human reference genome, GRCh38: datasets summary genome taxon human --reference. In contrast, the command datasets download genome taxon siluriformes --annotated --include protein downloads a genome data package containing protein sequences for annotated genomes from the order Siluriformes (catfish) (Fig. 6).
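For users who script these calls, the command structure (top-level command, nested subcommands, query term, then flags) can be assembled programmatically. The following is a minimal Python sketch; build_datasets_command is a hypothetical helper, and it only constructs the argument list for the two example commands described here without running the tool.

```python
import shlex


def build_datasets_command(action, data_type, query_kind, query, flags=()):
    """Assemble argv for the datasets CLI: action, subcommands, query, flags."""
    if action not in ("summary", "download"):
        raise ValueError("action must be 'summary' or 'download'")
    return ["datasets", action, data_type, query_kind, query, *flags]


summary_cmd = build_datasets_command(
    "summary", "genome", "taxon", "human", ["--reference"]
)
download_cmd = build_datasets_command(
    "download", "genome", "taxon", "siluriformes",
    ["--annotated", "--include", "protein"],
)

# shlex.join quotes arguments so each line can be pasted into a shell.
print(shlex.join(summary_cmd))
print(shlex.join(download_cmd))
```

If the datasets tool is installed, either list could be passed to subprocess.run; keeping the arguments as a list avoids shell-quoting problems with multi-word taxon names.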
The dataformat command-line tool converts JSON Lines metadata to either a tabular (.tsv) text format or an Excel spreadsheet (.xlsx) with a set of user-specified fields. The layout of the JSON Lines metadata report is illustrated in Fig. 7. Each ‘box’, or line, in the JSON Lines format, encapsulates metadata pertinent to a genomic sequence, with the capability to nest further detailed ‘boxes’ of metadata within each line. Metadata can be piped directly from datasets to dataformat or, alternatively, the path to the relevant metadata can be passed as input. For instance, a simple two-column table describing mouse genomes released during a three-month period in 2023 can be created with the following command:
datasets summary genome taxon 'mus musculus' \
  --released-after 4/1/2023 --released-before 7/1/2023 \
  --as-json-lines |
  dataformat tsv genome --fields organism-name,accession
This returns the following output:
Organism Name    Assembly Accession
Mus musculus     GCA_030265425.1
Mus musculus     GCA_949316305.1
Mus musculus     GCA_949316315.1
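The kind of conversion involved, from nested JSON Lines records to a flat table of user-selected fields, can be illustrated with a short Python sketch. This is not how dataformat is implemented: the JSON field names, the dotted-path notation, and the helpers get_path and json_lines_to_tsv are assumptions made for illustration only.

```python
import json


def get_path(record, dotted_path):
    """Follow a dotted path such as 'organism.name' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return ""  # missing fields become empty cells
        value = value[key]
    return str(value)


def json_lines_to_tsv(text, columns):
    """Build TSV text; columns is a list of (header, dotted_path) pairs."""
    rows = ["\t".join(header for header, _ in columns)]
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append("\t".join(get_path(record, path) for _, path in columns))
    return "\n".join(rows)


# Illustrative records; field names are assumed, not the official schema.
sample = "\n".join([
    '{"accession": "GCA_030265425.1", "organism": {"name": "Mus musculus"}}',
    '{"accession": "GCA_949316305.1", "organism": {"name": "Mus musculus"}}',
])

table = json_lines_to_tsv(
    sample,
    [("Organism Name", "organism.name"), ("Assembly Accession", "accession")],
)
print(table)
```

Selecting columns by path keeps the table narrow while the full hierarchy remains available in the original report.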
A list of available fields can be accessed through the CLI help menu or web documentation. For example, to find a list of fields available for generating tables describing genome data, refer to the dataformat (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_genome/) reference page.
- Source: https://www.nature.com/articles/s41597-024-03571-y