Binary Sequence Formats

The Xdna, SnapGene, and Gck sequence formats

2019/08/20

Abstract

This document describes binary sequence formats that are used by some molecular biology programs and that are, as far as I know, not documented anywhere else.


Table of Contents

1. Introduction
1.1. History and Rationale for this Document
1.2. Conventions Used in this Document
1.3. Legal Notice
2. The Xdna Format
2.1. Base Structure
2.2. Extended Xdna Files
3. The SnapGene Format
3.1. General Structure
3.2. The Cookie Packet
3.3. The DNA Packet
3.4. Other Packets
4. The Gck Format
4.1. The Header and the Sequence Packet
4.2. The Features Packet and Associated Strings
4.3. The Sites Packet and Associated Strings
4.4. The Versions Section
4.5. The Name and Flags Section

When I first started to work in a molecular biology lab, most of my then colleagues were using Apple computers and a particular, Mac OS-specific plasmid editor called DNA Strider. I was regularly given some sequence files in the native binary format of that program, which none of the applications available on my GNU/Linux desktop were able to read.

Between two experiments, I reverse-engineered the DNA Strider format and wrote a couple of small C programs to read a DNA Strider file and convert it into a FASTA or GenBank file, or conversely to read a FASTA file and convert it into a DNA Strider file. This was my Xdna2 project.

A few years later, in another lab, I received from external collaborators some sequence files in the native format of another sequence editor called SnapGene. I considered for a while adding support for that format to my Xdna2 project, but quickly decided against it, mostly on the grounds that, given the complexity of the SnapGene format, implementing a parser in C would have been too time-consuming.

Instead, I chose to develop the parser in Python, and since I was doing that, I also rewrote my original DNA Strider parser in Python. I wrote both parsers in such a way that they integrate seamlessly with the SeqIO framework of Biopython. This was my BinSeqs project.

Later again, I came across files in another proprietary format, this time generated by a program called Gene Construction Kit. I again reverse-engineered the format and added a parser for it to the BinSeqs project. At around the same time, I submitted my parsers to the Biopython project. At the time of this writing (mid-August 2019), they have been merged into the master branch and will be part of the upcoming Biopython 1.75 release.

The code of the parsers may be read by anyone wishing to understand the formats; I tried to include enough explanatory comments for that purpose. However, I felt that it could be useful to have a standalone description of the formats in plain English, which would not require the reader to be fluent in Python. That is the purpose of this document.

I document here the DNA Strider, SnapGene, and Gene Construction Kit native formats, as I understood them. This is not an exhaustive documentation, but it should allow anyone wishing to read files in those formats to do so and to extract most of the information the files contain.

The Xdna format is the native format used by Christian Marck’s DNA Strider program for Mac OS. It is also used by Serial Cloner.

A basic Xdna file comprises three sections in a fixed order: a header, the sequence, and a comment (the comment being optional). The general structure is depicted in Figure 1.


The header has a fixed size of 112 bytes. It starts with a byte giving the version number of the format, which seems to always be zero (I have never seen an Xdna file with a different version). It ends with a byte which seems to always be 0xFF and whose meaning is unknown (it may simply mark the end of the header).

After the version byte come two bytes (bytes 2 and 3) indicating, respectively, the type of sequence stored in the file (1 denotes a DNA sequence, 2 a degenerate DNA sequence, 3 an RNA sequence, and 4 a protein sequence) and the topology of the sequence (0 denotes a linear sequence and 1 a circular sequence).

The header contains three lengths which are all stored as big-endian long integers (4 bytes). There is the length of the sequence (bytes 29–32), the length of the comment (bytes 97–100), and the negative length (bytes 33–36). The “negative length” is the length of the part of the sequence before the base considered as the “origin” (base number 1, which in DNA Strider is not always the first base).

Note

Serial Cloner has no such concept of a sequence origin and always generates files with a negative length of zero.
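As an illustration, the header layout described above could be decoded in Python along the following lines. This is only a sketch: the function name and the dictionary keys are invented for this example, and treating the three lengths as unsigned integers is an assumption.

```python
import struct

HEADER_SIZE = 112
SEQUENCE_TYPES = {1: "DNA", 2: "degenerate DNA", 3: "RNA", 4: "protein"}


def parse_xdna_header(header: bytes) -> dict:
    """Decode the header fields described above (byte N is at offset N - 1)."""
    if len(header) != HEADER_SIZE:
        raise ValueError("an Xdna header is exactly 112 bytes long")
    # Bytes 1-3: version, sequence type, topology.
    version, seq_type, topology = struct.unpack_from(">BBB", header, 0)
    # Bytes 29-32 and 33-36: sequence length, then negative length
    # (read as unsigned big-endian integers, which is an assumption).
    seq_length, neg_length = struct.unpack_from(">II", header, 28)
    # Bytes 97-100: comment length.
    (comment_length,) = struct.unpack_from(">I", header, 96)
    return {
        "version": version,
        "type": SEQUENCE_TYPES.get(seq_type, "unknown"),
        "circular": topology == 1,
        "sequence_length": seq_length,
        "negative_length": neg_length,
        "comment_length": comment_length,
        "end_marker": header[111],  # byte 112, always 0xFF in files seen so far
    }
```

The sketch deliberately does not reject a version other than zero or an end marker other than 0xFF, since the meaning of those bytes is not fully known.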

The sequence itself starts immediately at the first byte after the header (byte 113), and runs for as many bytes as indicated in the header’s sequence length field. The sequence is followed by the optional comment, without any separating byte(s)—the first byte after the sequence is the first character of the comment. The comment may be empty, in which case bytes 97–100 contain zero and the file ends after the last byte of the sequence (unless it is an “extended” Xdna file with an annotation section, see below).
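The way the sequence and comment follow the header can be sketched as below. The function name is hypothetical, and the choice of Latin-1 to decode the comment is an assumption made only so that any byte value can be decoded; the actual character encoding of comments is not established here.

```python
import struct

HEADER_SIZE = 112


def split_xdna(data: bytes):
    """Return (sequence, comment, remainder) from the bytes of an Xdna file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file is shorter than the 112-byte header")
    (seq_length,) = struct.unpack_from(">I", data, 28)      # bytes 29-32
    (comment_length,) = struct.unpack_from(">I", data, 96)  # bytes 97-100
    seq_end = HEADER_SIZE + seq_length
    comment_end = seq_end + comment_length
    if comment_end > len(data):
        raise ValueError("lengths in the header exceed the file size")
    # The sequence starts at byte 113, i.e. offset 112.
    sequence = data[HEADER_SIZE:seq_end].decode("ascii")
    # The comment follows with no separator (encoding is an assumption).
    comment = data[seq_end:comment_end].decode("latin-1")
    # Anything left belongs to the annotation section of an extended file.
    remainder = data[comment_end:]
    return sequence, comment, remainder
```

An empty remainder indicates a basic Xdna file; a non-empty one indicates that an annotation section is present.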

Some Xdna files contain an additional annotation section after the comment (or immediately after the sequence if the comment is empty). This is typically the case for files generated by Serial Cloner. I actually believe this additional section might be an extension of the format created by Serial Cloner, as I have never seen a DNA Strider-generated Xdna file containing such a section.


The annotation section (Figure 2) starts with a single byte whose meaning is unknown (it might be there solely to indicate the presence of the annotation section), then contains two variable-length fields describing optional right-side and left-side overhangs (described in the following section). Then, a single byte gives the number of sequence features, which are then stored in as many feature structures (described in a later section) until the end of the file.

After the left overhang specification comes a single byte indicating the number of features (an Xdna file therefore cannot contain more than 255 features). If there are no features, that byte is zero and the file ends here.

If there are features, they are stored one after the other after that last byte. Each feature contains 6 fields and 4 flags (Figure 3).


All fields are stored as Pascal strings. They are, from the first to the last:

  • the displayed name of the feature;

  • a description of the feature;

  • the type of the feature, which may be any of the feature types supported by GenBank (e.g., misc_feature, CDS, exon, etc.);

  • the start position of the feature (counting from 1, not 0), stored as text;

  • the end position of the feature, stored as text;

  • the text representation of an RGB triplet (3 comma-separated numbers from 0 to 255, e.g., 127,127,127) indicating the color used to paint the feature on a sequence or plasmid map.

Note

Since the start and end positions of the feature are stored as text, the format could theoretically support fuzzy locations (but not compound locations), by using a notation similar to the one used in GenBank flat files (e.g., <5 to denote a start position located anywhere before the fifth base). However, neither DNA Strider nor Serial Cloner supports such notation, so only exact locations may be used.

The description field may contain a simple free-form comment on the feature, but may also contain GenBank-like qualifiers. In that case, the field is structured in lines (separated by carriage returns, \r), the first line being a free-form comment, and the following lines containing key-value pairs. Here is an example of a formatted description field (line feeds inserted for clarity):

Free-form comment on the first line\r
gene="bla"\r
product="beta-lactamase TEM"\r
function="ampicillin resistance"
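A description field structured as in the example above could be split into its free-form comment and its qualifiers as follows. This is a minimal sketch; the function name is made up, and stripping the surrounding double quotes from values is an assumption based on the example only.

```python
def parse_description(field: str):
    """Split a description field into (comment, qualifiers)."""
    lines = field.split("\r")
    comment = lines[0]  # the first line is always the free-form comment
    qualifiers = {}
    for line in lines[1:]:
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key-value pair, ignored in this sketch
        qualifiers[key] = value.strip('"')
    return comment, qualifiers
```

A dictionary keeps only the last value when the same key appears twice; a list of pairs would be needed to preserve repeated qualifiers.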
        

The four flags are stored between the fifth and sixth fields, each flag being a single byte. The first flag indicates the strand the feature is on, for DNA sequences (reverse strand if the flag is not set, forward strand if it is); the second flag indicates whether the feature should be displayed on a sequence or plasmid map; and the fourth flag indicates whether the feature should be decorated with an arrow. The meaning of the third flag is unknown.
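Putting the six fields and four flags together, a single feature could be read as sketched below. The helper names are hypothetical. The sketch assumes that a Pascal string is prefixed by a single length byte, that a flag counts as set when its byte is non-zero, and that text can be decoded as Latin-1; none of these points is confirmed by the description above.

```python
import io


def read_pstring(stream) -> str:
    """Read a string prefixed by one length byte (assumed layout)."""
    prefix = stream.read(1)
    if len(prefix) != 1:
        raise EOFError("missing length byte")
    data = stream.read(prefix[0])
    if len(data) != prefix[0]:
        raise EOFError("truncated string")
    return data.decode("latin-1")  # encoding is an assumption


def read_xdna_feature(stream) -> dict:
    """Read one feature: five fields, four flag bytes, then the color field."""
    name = read_pstring(stream)
    description = read_pstring(stream)
    feature_type = read_pstring(stream)
    start = read_pstring(stream)
    end = read_pstring(stream)
    flags = stream.read(4)
    if len(flags) != 4:
        raise EOFError("truncated flags")
    color = read_pstring(stream)
    return {
        "name": name,
        "description": description,
        "type": feature_type,
        "start": int(start),  # positions are text, counted from 1
        "end": int(end),
        "forward": flags[0] != 0,    # first flag: strand
        "displayed": flags[1] != 0,  # second flag: shown on the map
        "arrow": flags[3] != 0,      # fourth flag: drawn with an arrow
        "color": tuple(int(part) for part in color.split(",") if part),
    }
```

Calling `read_xdna_feature` as many times as indicated by the feature count byte, on a stream such as `io.BytesIO`, would then walk through all the features of the annotation section.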

The SnapGene format (typical file extension: .dna) is the native format of the SnapGene program from GSL Biotech.


The Cookie Packet (Figure 5) is identified by the tag 0x09. Its data bytes start with 8 bytes encoding the string SnapGene, which acts as a magic cookie identifying the file as a SnapGene file.

The cookie is followed by three big-endian short integers, the first of those indicating the type of sequence stored in the file (which is always 1 for a DNA sequence).

Given its fixed structure, the Cookie Packet always contains 14 data bytes and therefore a SnapGene file always starts with the following sequence of bytes:

0x09 0x00 0x00 0x00 0x0E 0x53 0x6E 0x61 0x70 0x47 0x65 0x6E 0x65
  |  \_________________/   |    |    |    |    |    |    |    |
  |   Length (14 bytes)   'S'  'n'  'a'  'p'  'G'  'e'  'n'  'e'
  +- Packet tag
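The byte sequence shown above suggests a simple way to recognize a SnapGene file: read one packet made of a tag byte, a 4-byte big-endian length and the data bytes, then check the cookie. The following sketch does that; the function names are invented for this example, and the second and third short integers are returned as-is since their meaning is not covered here.

```python
import io
import struct


def read_packet(stream):
    """Read one packet and return (tag, data), or None at end of stream."""
    header = stream.read(5)
    if not header:
        return None
    if len(header) != 5:
        raise ValueError("truncated packet header")
    tag, length = struct.unpack(">BI", header)  # 1-byte tag, 4-byte length
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated packet data")
    return tag, data


def parse_cookie(stream):
    """Check that the first packet is a Cookie Packet; return its 3 shorts."""
    packet = read_packet(stream)
    if packet is None:
        raise ValueError("empty file")
    tag, data = packet
    if tag != 0x09 or len(data) != 14 or data[:8] != b"SnapGene":
        raise ValueError("not a SnapGene file")
    # The first short is the sequence type (1 for DNA).
    return struct.unpack(">HHH", data[8:])
```

After a successful call to `parse_cookie`, repeated calls to `read_packet` on the same stream would return the remaining packets one at a time.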
        

The Features Packet has the tag 0x0A. It contains the text representation of an XML tree starting with a root node named Features, which lists the features found in the sequence.

Each feature is represented by an XML node named Feature. That node may contain the following attributes:

A Feature node contains one or several Segment child node(s) giving the sequence coordinates of the feature. Each Segment node has a range attribute whose value is of the form XXX-YYY, where XXX is a 1-based start coordinate and YYY is the end coordinate.

After the Segment node(s), the Feature node may also contain Q nodes representing feature qualifiers. Each Q node has a name attribute giving the name of the qualifier, and a V child node for the value of said qualifier. The value itself is stored in a text or int attribute depending on the type of the qualifier. Textual values contain XML-escaped HTML tags.
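The Segment and Q nodes described above can be extracted with Python's standard XML parser, as in the sketch below. The function name and the returned structure are invented for this example, attributes of the Feature node itself are ignored, and the sample document used afterwards is a made-up fragment, not taken from a real SnapGene file.

```python
import xml.etree.ElementTree as ET


def parse_features_xml(xml_text):
    """Return one dict per Feature node, with its segments and qualifiers."""
    root = ET.fromstring(xml_text)
    if root.tag != "Features":
        raise ValueError("unexpected root node: " + root.tag)
    features = []
    for feature in root.findall("Feature"):
        segments = []
        for segment in feature.findall("Segment"):
            # The range attribute has the form XXX-YYY (1-based start).
            start, end = segment.get("range").split("-")
            segments.append((int(start), int(end)))
        qualifiers = {}
        for q in feature.findall("Q"):
            values = []
            for v in q.findall("V"):
                if "text" in v.attrib:
                    values.append(v.get("text"))
                elif "int" in v.attrib:
                    values.append(int(v.get("int")))
            qualifiers[q.get("name")] = values
        features.append({"segments": segments, "qualifiers": qualifiers})
    return features


# Hypothetical input, built only from the node and attribute names given above.
SAMPLE = (
    '<Features><Feature>'
    '<Segment range="10-90"/>'
    '<Q name="label"><V text="&lt;b&gt;bla&lt;/b&gt;"/></Q>'
    '<Q name="codon_start"><V int="1"/></Q>'
    '</Feature></Features>'
)
```

Note that the XML parser undoes the XML escaping, so textual values come back with their HTML tags intact and still need to be stripped or interpreted by the caller.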

Here is a (simplified) example of the XML that may be found in a Features packet:

The Gck format is the native format of the Gene Construction Kit program from Textco Biosoftware.


The overall structure of a Gck file is depicted in Figure 7. This is a somewhat weird mix of fixed-size blocks (the header, the unknown 706-byte block near the end, and the 17-byte flags section), length-prefixed packets, ad-hoc structures (the versions), and lists of strings.

All the types of packets start with a big-endian long integer giving the size of the packet (the number of following data bytes). However, and contrary to SnapGene packets as described above, there is no type tag indicating the type of the packet and the nature of its contents. The type of a packet is solely indicated by the packet’s position in the overall structure of the Gck file.

All multi-byte numerical values use big-endian order.
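Reading a length-prefixed Gck packet therefore amounts to reading a 4-byte big-endian size followed by that many data bytes. A minimal sketch, with a hypothetical function name and the size assumed to be unsigned:

```python
import io
import struct


def read_gck_packet(stream) -> bytes:
    """Read one length-prefixed packet and return its data bytes."""
    prefix = stream.read(4)
    if len(prefix) != 4:
        raise ValueError("truncated packet size")
    (size,) = struct.unpack(">I", prefix)  # big-endian long integer
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated packet data")
    return data
```

Since packets carry no type tag, the caller has to know from the position in the file which kind of packet each call is expected to return.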

The following sections describe the different components of a Gck file, in the order in which they appear.

Sequence features are described in two consecutive parts of a Gck file: first a Features Packet (Figure 9), then a list of strings associated with the features (hereafter referred to as the Feature Strings List).


The data bytes of the packet start with a long integer giving the length of the sequence, followed by a short integer giving the number of features. The sequence length is the same as the one indicated at the beginning of the Sequence Packet; I don’t know why that information is repeated here.

Each feature is then described by a 92-byte block (Figure 10).


Two long integers in bytes 1–4 and 5–8 give the 1-based coordinates of the feature. The strand flag in byte 31 can be 0 (no strand specified), 1 (feature on the reverse strand), 2 (feature on the forward strand), or 3 (feature on both strands).

A short integer in bytes 15–16 supposedly indicates the feature’s type. This field can take a large range of values, but in all the Gck files I have seen there were actually only two possibilities: a value of zero is a misc_feature (which can be anything, really), and any non-zero value denotes a CDS.

The last byte of the structure (byte 92) is a version number. It indicates that the feature belongs to the specified version of the file. Versions are numbered in reverse order: the most recent version has the number zero, then the previous version has the number 1, the version before that has the number 2, and so on. (So if you are only interested in the current version of the file, you may skip all features with a version number greater than zero.)

Two long integers, in bytes 48–51 and 52–55, indicate, if they contain a non-zero value, that this feature has respectively a name (stored as an 8-bit Pascal string) and a comment (stored as a 32-bit Pascal string) in the Feature Strings List after the Features Packet.
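The data bytes of a Features Packet could be decoded as sketched below. The names are invented for this example, and all offsets are derived from the byte numbers given above, read as 1-based (byte N at offset N - 1). That reading is an assumption, in particular for the name and comment pointers, and is worth checking against a real file before relying on it.

```python
import struct

BLOCK_SIZE = 92
NAME_POINTER_OFFSET = 47  # "bytes 48-51" read as 1-based (assumption)
STRANDS = {0: None, 1: "reverse", 2: "forward", 3: "both"}


def parse_features_packet(data: bytes):
    """Decode the data bytes of a Features Packet (size prefix removed)."""
    # A long integer (sequence length), then a short (number of features).
    seq_length, count = struct.unpack_from(">IH", data, 0)
    features = []
    for index in range(count):
        begin = 6 + index * BLOCK_SIZE
        block = data[begin:begin + BLOCK_SIZE]
        if len(block) != BLOCK_SIZE:
            raise ValueError("truncated feature block")
        start, end = struct.unpack_from(">II", block, 0)    # bytes 1-8
        (type_code,) = struct.unpack_from(">H", block, 14)  # bytes 15-16
        has_name, has_comment = struct.unpack_from(
            ">II", block, NAME_POINTER_OFFSET
        )
        features.append({
            "start": start,
            "end": end,
            "type": "misc_feature" if type_code == 0 else "CDS",
            "strand": STRANDS.get(block[30]),  # byte 31
            "has_name": has_name != 0,
            "has_comment": has_comment != 0,
            "version": block[91],              # byte 92
        })
    return seq_length, features
```

Keeping only the features whose `version` is zero gives the features of the current version of the file.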

The Feature Strings List has no header, no length indicator, or any marker delimiting the beginning or end of the section. To parse it, or even just to skip it, one needs to know how many strings to expect (and whether they are 8-bit P-strings or 32-bit P-strings), by first parsing the Features Packet and looking at which features have non-zero name or comment pointers. The strings are stored in the same order as the features in the Features Packet, meaning there is first the name of the first feature (if any), then the comment of the first feature (if any), then the name of the second feature, and so on.
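The reading order described above can be sketched as follows. The function names are hypothetical; the sketch assumes that an 8-bit P-string is prefixed by a single length byte and a 32-bit P-string by a 4-byte big-endian length, and that Latin-1 is an acceptable decoding, which is not established here.

```python
import io
import struct


def read_pstring8(stream) -> str:
    """Read a string prefixed by a 1-byte length (assumed layout)."""
    prefix = stream.read(1)
    if len(prefix) != 1:
        raise ValueError("truncated string length")
    return _read_text(stream, prefix[0])


def read_pstring32(stream) -> str:
    """Read a string prefixed by a 4-byte big-endian length (assumed layout)."""
    prefix = stream.read(4)
    if len(prefix) != 4:
        raise ValueError("truncated string length")
    return _read_text(stream, struct.unpack(">I", prefix)[0])


def _read_text(stream, length: int) -> str:
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated string")
    return data.decode("latin-1")  # encoding is an assumption


def read_feature_strings(stream, pointers):
    """Read (name, comment) pairs; pointers holds one pair of booleans
    (has_name, has_comment) per feature, in Features Packet order."""
    strings = []
    for has_name, has_comment in pointers:
        name = read_pstring8(stream) if has_name else None
        comment = read_pstring32(stream) if has_comment else None
        strings.append((name, comment))
    return strings
```

Because the list has no delimiter, the stream is left positioned at the start of the next section only if the pointers passed in are complete and correct.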