Search
Close this search box.

Genie: the first open-source ISO/IEC encoder for genomic data – Communications Biology

Data

We used all available datasets from the MPEG-G genomic information database (https://mpeg.chiariglione.org/standards/MPEG-G/genomic-information-representation/MPEG-G-genomic-information-database-4) for our simulations. We refer the reader to Supplementary Table 3 for more information.

Genie, in its current version, does not implement all features of the MPEG-G standard needed to represent all alignments possible in BAM format. To allow a fair comparison to other compression approaches, we filtered or transformed aligned items containing the following features that Genie does not support:

  1. 1.

    The alignment information of half-aligned paired records (one nucleotide sequence aligned, the other unaligned) was removed, i.e. the records were converted to completely unaligned records.

  2. 2.

    Supplementary and secondary alignments were removed.

  3. 3.

    Optional tags were removed from the records. Optional tags in SAM files provide additional annotations on a per-record level, such as details from base calling, alignment processes, or information about mate records in paired sequencing. The usefulness of these tags largely depends on the specific downstream tasks and their requirements.

Genie includes a script in its repository to perform that filtering. All relevant information left in the resulting files is compressible in a lossless way by all tools used in the experiments.

Transcoding

The Genie encoding/decoding processes use MPEG-G records as uncompressed input and output formats. MPEG-G records (referred to by their file extension mgrec) are a flexible binary data structure that can represent unaligned as well as aligned genomic data. Other uncompressed formats like FASTQ or BAM must be transcoded into MPEG-G records before any encoding with Genie can be executed. In the case of aligned data, the Genie encoder expects the records to be ordered by the alignment position of their first nucleotide sequence. Transcoders that can convert BAM and FASTQ files into MPEG-G records are included in the Genie application. Records in BAM files must be sorted by record identifiers before the BAM transcoder is started (for example, by using samtools). The transcoder output is automatically sorted by alignment position and thereby directly suitable as input for Genie compression.

Regrouping of records

As a first step of encoding, the records are regrouped into access units. An access unit is a set of jointly encoded records, and the smallest decodable unit defined in MPEG-G. Records in an access unit must share specific properties. Those properties are:

  1. 1.

    Record class: the class of an MPEG-G record describes the type of alignment. Available classes are P (fully aligned to a reference sequence without mismatches), N (fully aligned to reference and the only mismatches are substitutions with N, referring to an unknown nucleotide), M (fully aligned to reference with arbitrary substitutions), I (fully aligned to reference with insertions, deletions, or clipping), U (completely unaligned) and HM (record with two paired nucleotide sequences, one aligned and one unaligned).

  2. 2.

    The number of nucleotide sequences in the records, i.e., if the records consist of two paired nucleotide sequences or single nucleotide sequence.

  3. 3.

    The specific reference sequence that the nucleotide sequences in the records are aligned to, if any.

Records with the same set of those properties are put in a buffer together. With 5 classes of records (P, N, M, I, HM), 2 pairing configurations (paired/unpaired), and n reference sequences, there are at most 5  2 n buffers for aligned records. Additionally, with one class (U), 2 pairing configurations (paired/unpaired), and no references, there are at most 2 buffers for unaligned sequences. However, in most datasets, not all of these record types occur, thus some of these buffers typically remain unused. Once the buffer reaches a certain threshold size or no more input data is available, the buffer is flushed, and all records in it are encoded as one access unit. Multiple threads can encode multiple access units in parallel.

Encoding of records into descriptors

Once an access unit is assembled, it is encoded into descriptor subsequences. An overview of descriptors used in MPEG-G is contained in Supplementary Table 1. Nucleotide sequences are encoded by comparing them to their reference sequence and storing only the information necessary to reconstruct the records from the reference sequence. In the case of no available external reference sequence, a reference can be computed from the genomic records themselves (local assembly encoding, global assembly encoding based on SPRING9, or the nucleotide sequence can be encoded verbatim (low latency encoding) without exploiting the redundancy between overlapping records and their reference.

Record identifiers in MPEG-G are represented as delta-encoded, tokenized strings. In the first step, each record identifier is converted into a sequence of tokens independently. We refer the reader to the Supplementary Table 2 for a list of all tokens available. In the next step, the record ID of each record is compared token by token with the ID of the previous record. If possible, tokens are replaced with the delta token (indicating two different numbers at the same place in both IDs) or the match token (tokens are completely identical). Assuming that record identifiers in a dataset often follow a certain naming pattern (e.g., containing incrementing numbers or constant strings), this exploits redundancy between the record IDs. For each token position, a descriptor subsequence encoding the token types and further descriptor subsequences for parameters associated with each token type are created.

Quality scores in Genie can be completely discarded or encoded verbatim. It is also possible to employ a lossy compression scheme using CALQ22, which uses genotype uncertainty to quantize the quality scores adaptively, such that subsequent analysis of the genomic data is minimally impacted.

Sequence transformations and entropy encoding

The descriptor subsequences generated by the previously mentioned encoding schemes are finally entropy encoded. This is performed by GABAC21, an entropy encoding solution consisting of various transformations that can be optionally carried out, followed by CABAC (Context Adaptive Arithmetic Coding)23 encoding.

One of the following transformations is applied at symbol level for each descriptor subsequence, yielding one or multiple transformed descriptor subsequences:

  1. 1.

    No transformation: No transformation is applied to the descriptor subsequence.

  2. 2.

    Equality encoding: The descriptor subsequence is demultiplexed into two transformed descriptor subsequences. The first transformed subsequence contains a binary flag Fi for each symbol Si.

    Fi = 1 indicates that Si = Si−1.

    Fi = 0 indicates Si ≠ Si−1 and that the next value in the descriptor subsequence shall be reconstructed from the second transformed subsequence. The second transformed subsequence contains a reconstruction value Rj for the jth flag F with value 0. Rj is computed such that:

    $${R}_{j}=left{begin{array}{ll}{S}_{i},quad &{{{{{{{rm{if}}}}}}}},{S}_{i} , < ,{S}_{i-1} {S}_{i}-1,quad &{{{{{{{rm{if}}}}}}}},{S}_{i} , > ,{S}_{i-1}end{array}right.$$

  3. 3.

    Match encoding: The descriptor subsequence is demultiplexed into three transformed descriptor subsequences. A buffer of the last B encoded symbols is maintained. It is determined if for an l as large as possible, with 1 < l ≤ B, the string of the next l symbols matches with a substring of length l in the buffer. If a match is found, the starting index of that substring in the buffer is appended to the first transformed subsequence and l to the second transformed subsequence. If no match is found, 0 is added as length to the second subsequence and the current symbol Si to the third subsequence.

  4. 4.

    Run length encoding: The descriptor subsequence is demultiplexed into two transformed descriptor subsequences. For a guard value G, the descriptor subsequence is partitioned into the longest possible runs (Lj, Sj) of consecutively equal symbols, with the length Lj and the symbol Sj of the jth run. Each length Lj is then decomposed into the form Lj = njG + rj with 0 < rj ≤ G. Finally, for every run, nj symbols with value G followed by one symbol with value rj − 1 are appended to the first transformed subsequence, and one symbol with value Sj is appended to the second transformed subsequence.

After the sequence transformation is applied, each symbol can be optionally split into smaller subsymbols. The size in bits of the subsymbols must be a factor of the symbol size in the transformed descriptor subsequence. For example, a sequence of 6-bit symbols can be split into a 3-bit subsymbol sequence with twice the amount of elements. After subsymbol splitting, for each of the subsymbols, one of the following transformations is applied:

  1. 1.

    No transformation: The sequence of subsymbols is copied to an identical sequence of transformed subsymbols.

  2. 2.

    Look-Up-Table transformation: A two (coding order 1) or three (coding order 2) dimensional table is generated in which the frequency of each subsymbol is sampled depending on the context of the previous one (coding order 1) or two (coding order 2) subsymbols in the sequence. For each context, the symbols are then ordered by their frequency, such that the most frequent symbol is at index 0 and the least frequent symbol at the highest index. Every subsymbol in the sequence is then substituted with its index in the table given the current context. The look-up tables needed to inverse this transformation are included at the beginning of the transformed subsymbol sequence.

In the next step, the transformed subsymbol sequences are converted into a stream of bits with one of the following binarizations:

  1. 1.

    Binary: The subsymbol is encoded as a binary number with n bits

  2. 2.

    Truncated unary: The subsymbol S is encoded as a run of S one bits followed by a single zero bit. If S equals the maximum in the possible range of values, the trailing zero is truncated.

  3. 3.

    Exponential Golomb: If N is the number of bits necessary to encode S + 1 as binary number, the subsymbol S is encoded as N zero bits, followed by the binary representation of S.

Finally, the binarized values are entropy encoded using the context-adaptive arithmetic coding algorithm (CABAC). The coding order of CABAC can be adjusted between 0 and 2.

Determination of encoding parameters

A parameter set in MPEG-G contains one full configuration per descriptor subsequence for the previously described transformations. As the MPEG-G standard permits only 256 parameter sets per dataset, and datasets usually contain many more access units than 256, it is impossible to find an optimized parameter set individually for each access unit. There are many advanced strategies imaginable to use the available slots for parameter sets as efficiently as possible, and those leave room for further research. For this paper, however, we decided to generate baseline results with a simple approach. That approach is to use just one constant, globally optimized parameter set for every access unit and every dataset. This implies that there is exactly one configuration per descriptor subsequence to optimize. As an additional experiment, we also compared this approach of using only one, globally optimized parameter set to using an individually optimized parameter set per access unit under the aspect of the sequence transformation. Our results indicate that the loss in compression ratio is insignificant (<1%) when choosing the simpler, global approach. The numeric results for this experiment are presented in Supplementary Table 5.

To find a well-working, global set of parameters, we encoded datasets 01-1, 32, 02, and 37 in the MPEG-G genomic database with all encoding processes in Genie (Reference-Based encoding, Local Assembly encoding, Global Assembly encoding, Low Latency encoding). We extracted the descriptor subsequences of the first 10 access units for each combination of dataset and encoding process (in the following referred to as file) before any transformation or entropy coding was applied (resulting in 80 access units in total). Ten access units were chosen as a sample for the full file, since we found that this sample size provides a balanced approach, offering a manageable computational complexity while also already encompassing most types of access units that appear in the file in significant amounts.

For every descriptor subsequence, we ran an exhaustive search and explored all possible combinations of supported parameters. For each combination of parameters, we recorded the final compressed size. The explored parameters consisted of the sequence transformation choice (none, equality encoding, match encoding, run length encoding) and the following parameters for each resulting transformed descriptor subsequence:

  1. 1.

    CABAC coding order (0, 1, 2)

  2. 2.

    Number of subsymbols per symbol for subsymbol splitting (1, 2, 4, 8)

  3. 3.

    Subsymbol transformation (No transformation, LUT transformation)

  4. 4.

    Binarization (binary, truncated unary, exponential Golomb)

Choice of sequence transformations

We first evaluated the choice of the sequence transformation, as it alters the semantics and statistics of the generated transformed descriptor subsequences heavily. Because it is possible to compute the optimal sequence transformation for each descriptor subsequence directly, we chose an analytical approach over standard optimization algorithms (like genetic algorithms) that only approximate the optimal solution. We determined for each possible combination of access unit, file (combination of dataset and encoding strategy), descriptor subsequence, and sequence transformation the combination of parameters resulting in the smallest possible compressed size. With that experimental data, we performed the following calculation:

Let F be the set of all input files, T the set of all sequence transformations and S the set of all descriptor subsequences. Let

$${{{{{{{rm{TrDeSu}}}}}}}}(s,t,f,i),| ,sin S,tin T,fin F,iin {{{{{{{bf{N}}}}}}}},i , < ,10$$

refer to the tuple of demultiplexed transformed descriptor subsequences obtained by applying sequence transformation t to descriptor subsequence s in access unit i in file f. Let minSize(TrDeSu(s, t, f, i)) refer to the smallest possible compressed size (as determined in the experiments) for that tuple with arbitrary compression parameters, apart from the already fixed choice of sequence transformation.

The smallest total compressed size for that tuple in all 10 access units available for a file f is then:

$${{{{{{{rm{totalMinSizeTrDeSu}}}}}}}}(s,t,f)=sumlimits_{jin {{{{{{{bf{N}}}}}}}},j < 10}{{{{{{{rm{minSize}}}}}}}},({{{{{{{rm{TrDeSu}}}}}}}}(s,t,f,j)).$$

Now for each combination of descriptor subsequence s and file f, it is possible to compare the compressed sizes depending on the chosen sequence transformation t. The best sequence transformation tbest for s and f is the one resulting in the smallest compressed size ({{{{{{{rm{totalMinSizeTrDeSu}}}}}}}}(s,t,f)left.right)). Every other sequence transformation will result in a worse compressed size, i.e. a loss of compression ratio with respect to the optimum occurs. That loss factor can be calculated as:

$${{{{{{{rm{loss}}}}}}}}(s,t,f)=frac{{{{{{{{rm{totalMinSizeTrDeSu}}}}}}}}(s,t,f)}{mi{n}_{jin T}({{{{{{{rm{totalMinSizeTrDeSu}}}}}}}}(s,j,f))}.$$

We then computed the average loss factor for a specific descriptor subsequence and sequence transformation over all files:

$$overline{{{{{{{{rm{loss}}}}}}}}}(s,t)=frac{{sum }_{jin F}{{{{{{{rm{loss}}}}}}}}(s,t,j)}{parallel Fparallel }.$$

The best sequence transformation for each descriptor subsequence globally can then be obtained by minimizing that average loss in compression ratio:

$${t}_{best}(s)=arg {min }_{jin T}(overline{{{{{{{{rm{loss}}}}}}}}}(s,j)).$$

Choice of remaining parameters

We observed that the set of remaining parameters remained strongly consistent among the best solutions with the smallest compressed size, given a fixed sequence transformation and descriptor subsequence. Therefore, we chose the remaining parameters by counting how often they appeared in the best solutions and chose the most frequently occurring ones. We refer the reader to Supplementary Table 4 for the results. The resulting parameters were hard-coded into Genie to be used for the corresponding descriptor subsequence.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.