OrthoXML & SeqXML

The SeqXML schema (XSD) defines the skeletal structure of the sequence files and allows one to set constraints for each type of data it contains: for example, one can limit a DNA sequence to consist only of {A,G,C,T,N}. If one then tries to import a DNA sequence containing a 'Z', this error will be detected automatically by any XML validator.
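The kind of constraint described above can be illustrated outside XML. The following is a minimal sketch that mimics a schema pattern restriction with a Python regular expression. It is not a real XSD validator, the alphabet is taken from the example in the text, and the name `is_valid_dna` is hypothetical.

```python
import re

# Assumed alphabet, taken from the example in the text: A, G, C, T, N.
DNA_PATTERN = re.compile(r"[AGCTN]+")


def is_valid_dna(sequence: str) -> bool:
    """Return True if the whole sequence uses only the allowed letters."""
    return DNA_PATTERN.fullmatch(sequence) is not None


print(is_valid_dna("ACGTNNAC"))  # True
print(is_valid_dna("ACGZT"))     # False: 'Z' is not in the allowed set
```

In a schema-based workflow the same rule lives in the XSD, so every validating tool enforces it without any custom code.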

As with FASTA, a SeqXML file includes the gene or protein ID, a description and the sequence itself, but it also provides the option to add alternative identifiers. Because the content can be validated and the position of each item is well defined, the file is easy to parse and process.
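To show why well-defined positions make parsing easy, here is a small sketch using Python's standard `xml.etree.ElementTree`. The sample document and its element names (`entry`, `description`, `AAseq`, `DBRef`) are assumptions modeled on SeqXML, and all identifiers are made up; the schema documentation remains the authority on the real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical SeqXML-like document; element names are assumptions
# and should be checked against the actual schema.
SEQXML = """<seqXML seqXMLversion="0.4">
  <entry id="example_1">
    <description>Example protein</description>
    <AAseq>MKTAYIAKQR</AAseq>
    <DBRef type="gene" source="ExampleDB" id="alt_1"/>
  </entry>
</seqXML>"""


def read_entries(xml_text):
    """Yield (id, description, sequence, alternative ids) for each entry."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("entry"):
        # Alternative identifiers are read from the assumed DBRef elements.
        alt_ids = [ref.get("id") for ref in entry.findall("DBRef")]
        yield (
            entry.get("id"),
            entry.findtext("description", default=""),
            entry.findtext("AAseq", default=""),
            alt_ids,
        )


for record in read_entries(SEQXML):
    print(record)
```

Each field is looked up by name, so no header-line conventions have to be guessed as they would be with FASTA.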

For documentation, click here.


OrthoXML

OrthoXML is designed broadly to allow the storage and comparison of orthology data from any ortholog database. It establishes a structure for describing orthology relationships while still allowing flexibility for database-specific information to be encapsulated in the same format.
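As a rough illustration of such a structure, the sketch below reads ortholog groups from a small OrthoXML-like document with Python's standard library. The sample is simplified and hypothetical: the element names (`species`, `gene`, `orthologGroup`, `geneRef`) are assumptions modeled on OrthoXML, the XML namespace and most attributes are omitted for brevity, and all names and identifiers are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified OrthoXML-like document (no namespace).
ORTHOXML = """<orthoXML origin="exampledb" version="0.3">
  <species name="Species A">
    <database name="ExampleDB" version="1">
      <genes>
        <gene id="1" geneId="geneA1"/>
      </genes>
    </database>
  </species>
  <species name="Species B">
    <database name="ExampleDB" version="1">
      <genes>
        <gene id="2" geneId="geneB1"/>
      </genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="group1">
      <geneRef id="1"/>
      <geneRef id="2"/>
    </orthologGroup>
  </groups>
</orthoXML>"""


def ortholog_groups(xml_text):
    """Map each group id to a list of (species name, gene identifier)."""
    root = ET.fromstring(xml_text)
    # Resolve internal gene ids to the species and gene they stand for.
    genes = {}
    for species in root.findall("species"):
        for gene in species.iter("gene"):
            genes[gene.get("id")] = (species.get("name"), gene.get("geneId"))
    groups = {}
    for group in root.iter("orthologGroup"):
        # Only direct geneRef children are read; nested groups are ignored here.
        members = [genes[ref.get("id")] for ref in group.findall("geneRef")]
        groups[group.get("id")] = members
    return groups


print(ortholog_groups(ORTHOXML))
```

Genes are declared once per species and then referenced by id inside the groups, which keeps the relationship data compact.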

For documentation, click here.

Introduction

Two common data management issues for orthology databases are:

-organizing and standardizing the proteome datasets that are used as input
-representing the orthology relationships that are generated as output

To make these tasks easier, SeqXML and OrthoXML are designed as standardized data exchange formats for input and output, respectively.

News

June 30, 2011

-SeqXML version 0.4 released

June 15, 2011

-Preprint of the SeqXML and OrthoXML article available

Apr 4, 2011

-Our article on SeqXML and OrthoXML has been accepted for publication. We'll link to the online version as soon as it's available.

Mar 25, 2011

-OrthoXML Java library available

Feb 24, 2011

-SeqXML BioJava and Biopython implementations available

Feb 23, 2011

-OrthoXML version 0.3 released

Dec 10, 2010

-SeqXML version 0.3 released
-Version 5 of reference proteomes in SeqXML format released

Why XML?

Or to put it another way: do we really need new file formats? Take FASTA, for example. Isn’t that good enough?

FASTA is a relatively simple format: it is human readable and has only one header line per entry. This very simplicity, however, causes data integrity problems, because the format is not standardized. There is no generally accepted way to define the content of the header line. Furthermore, one file can contain multiple entries for the same gene or protein, and the sequence can contain invalid characters.

A parser or a person has to safeguard against these issues; otherwise, downstream analyses can produce erroneous results, often silently. Converting these files to a markup language such as XML makes those issues much easier to avoid.
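To make the safeguards concrete, here is a minimal sketch of the checks a FASTA reader would have to perform by hand: detecting duplicate entries and invalid sequence characters. The function name `check_fasta`, the allowed alphabet and the rule that the ID is the first token of the header line are all assumptions, precisely because FASTA itself does not define them.

```python
from collections import Counter

# Assumed DNA alphabet for this example.
ALLOWED = set("ACGTN")


def check_fasta(text, allowed=ALLOWED):
    """Return (sorted duplicate ids, {id: set of invalid characters})."""
    ids = []
    invalid = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the id is the first whitespace-delimited token.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            ids.append(current)
        elif current is not None:
            bad = set(line.upper()) - allowed
            if bad:
                invalid.setdefault(current, set()).update(bad)
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    return duplicates, invalid


FASTA = """>seq1 first copy
ACGT
>seq2
ACGZT
>seq1 second copy
NNAC
"""

duplicates, invalid = check_fasta(FASTA)
print(duplicates)
print(invalid)
```

With a schema-validated XML file, both checks can be expressed once in the schema instead of being reimplemented in every parser.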

Databases that have signed up to use OrthoXML:

*OrthoXML files available for download.