this is where jason keeps a rough draft of some documentation and some features that could improve the specification.

draft doc:

The SMART Genomics API defines how to package, transfer, and process genomics data for use in clinical applications. Genomics uses many technologies, each with different outputs, and provides multiple lines of evidence on which to base a conclusion. Each request made by an application will likely call for many different pieces of data, which makes the data complex. To accommodate complex data, this API uses a system of containers. A container holds a small piece of information. The response received by the application is a collection of containers holding the requested information.

An alignment refers to the result of aligning one sequence to another. An alignment resource contains the alignment metadata along with the alignment itself. The metadata describe the alignment and include the sequences aligned along with the type of alignment. The alignment may be in any format, but the resource is designed with BLAST and NGS aligner output in mind. In text form, the results in the alignment resource consist of one or more SAM records or BLAST hits. Alignment results may also come from a binary file, such as a BAM formatted file.

The alignment resource has the following required fields: patient, query, reference, and the text results of the alignment. The query field holds the sequence resource of the sequence that was aligned, whereas the reference field holds the sequence resource that the query was aligned to. The optional fields are fileType, type, and score.

The alignment resource may contain a single alignment, but more likely it will contain several matches, which is allowed. The text results of an alignment may take one of two forms: the alignments may be embedded in the resource itself, or the resource may reference a separate flat file of alignments using a URI to its location.
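The two forms of alignment results described above can be sketched in Python. This is only an illustration, not part of the specification: the draft does not name the element that holds the text results or the URI, so "results" and "results_uri" are hypothetical names, and the rule that exactly one of them must be present is an assumption drawn from the wording "one of two forms".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    # Required fields named in the draft.
    patient: str
    query: str        # identifier of the sequence resource that was aligned
    reference: str    # identifier of the sequence resource aligned against
    # Hypothetical names: the draft does not name these two elements.
    results: Optional[str] = None       # embedded SAM records or BLAST hits
    results_uri: Optional[str] = None   # location of a separate flat file
    # Optional fields named in the draft.
    fileType: Optional[str] = None
    type: Optional[str] = None
    score: Optional[float] = None

    def __post_init__(self):
        # Assumption: a resource carries embedded text or a URI, never both.
        if (self.results is None) == (self.results_uri is None):
            raise ValueError("provide exactly one of results or results_uri")

    def is_embedded(self) -> bool:
        """Return True when the alignments are stored in the resource itself."""
        return self.results is not None


embedded = Alignment(patient="p1", query="seq1", reference="ref1",
                     results="placeholder for one or more SAM records")
external = Alignment(patient="p1", query="seq1", reference="ref1",
                     results_uri="file:///data/alignments.sam")
```

Keeping the two forms in separate fields lets a client tell at a glance whether it must fetch a file before reading the alignments.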

The sequence resource describes a biological sequence. More specifically, it describes a nucleic acid or amino acid sequence. Any string of letters can be considered a sequence.

The sequence resource is a base resource that provides the foundation for more specific sequence resources. It defines the fields common to all sequences regardless of type. A sequence object must have an identifier for the sequence along with its sequence string and the owner of the resource (patient). The required fields reside in the elements named "identifier", "type", and "read". The identifier provides an id for the sequence. The type indicates what kind of sequence it is and may be one of DNA, RNA, or AminoAcid. The read holds the actual sequence string. The optional fields are coordinate, quality, sourceTissue, and isMapped. The coordinate is the sequence's genomic coordinate, if known. The quality field holds a string of quality scores for the sequence. The sourceTissue field records the origin of the sequence, that is, the tissue from which it was harvested. isMapped is a boolean field indicating whether the sequence has been mapped to the genome.
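As a rough sketch of the field rules above, the following Python function checks a sequence resource represented as a plain dictionary. The dictionary representation and the function name validate_sequence are assumptions made for illustration; the draft does not define a wire format. The patient field is left unchecked because the draft does not yet settle how it is carried.

```python
REQUIRED_FIELDS = ("identifier", "type", "read")
OPTIONAL_FIELDS = ("coordinate", "quality", "sourceTissue", "isMapped")
SEQUENCE_TYPES = ("DNA", "RNA", "AminoAcid")


def validate_sequence(resource):
    """Return a list of problems found in a sequence resource (empty if none)."""
    problems = []
    # Every required element must be present.
    for name in REQUIRED_FIELDS:
        if name not in resource:
            problems.append(f"missing required field: {name}")
    # The type must be one of the three kinds named in the draft.
    seq_type = resource.get("type")
    if seq_type is not None and seq_type not in SEQUENCE_TYPES:
        problems.append(f"unknown sequence type: {seq_type}")
    # isMapped is described as a boolean field.
    if "isMapped" in resource and not isinstance(resource["isMapped"], bool):
        problems.append("isMapped must be a boolean")
    return problems


example = {"identifier": "seq1", "type": "DNA", "read": "ACGT", "isMapped": True}
print(validate_sequence(example))  # prints []
```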

A sequence may be further defined by the specific sequence resources. A specific resource is one that inherits from a base resource. For a DNA sequence, a DNASequence resource should be defined. The DNASequence resource, like the RNASequence and AminoAcidSequence resources, better defines the features of a specific type of sequence. It is highly encouraged that the base sequence resource not be used directly; it exists to support the specific sequence resources. These three sequence resource types have features specific to the kind of sequence they represent, along with all of the fields of the base sequence type.
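The relationship between the base resource and the specific resources can be pictured as class inheritance. The Python sketch below is an assumption about how a client library might model it, not something the draft prescribes. The draft does not list the type-specific features, so the subclasses here only fix the value of type; the to_dict helper is hypothetical.

```python
class Sequence:
    """Base sequence resource: holds the fields common to all sequences."""

    type = None  # set by each specific resource

    def __init__(self, identifier, read, coordinate=None, quality=None,
                 sourceTissue=None, isMapped=None):
        self.identifier = identifier
        self.read = read
        self.coordinate = coordinate
        self.quality = quality
        self.sourceTissue = sourceTissue
        self.isMapped = isMapped

    def to_dict(self):
        """Return the required fields plus any optional fields that are set."""
        data = {"identifier": self.identifier, "type": self.type, "read": self.read}
        for name in ("coordinate", "quality", "sourceTissue", "isMapped"):
            value = getattr(self, name)
            if value is not None:
                data[name] = value
        return data


class DNASequence(Sequence):
    type = "DNA"


class RNASequence(Sequence):
    type = "RNA"


class AminoAcidSequence(Sequence):
    type = "AminoAcid"


dna = DNASequence("seq1", "ACGT", isMapped=False)
print(dna.to_dict())
```

Any feature specific to one kind of sequence would be added to the matching subclass, while code written against Sequence keeps working for all three.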

The DataSet resource houses data for visualization. It provides a means of containing graph and table data. Genomic data is often packaged as a graph to illustrate the relationships in the data. Similarly, experimental data arranged in columns of data points lends itself to easy analysis. These two formats are extremely popular in applications that visualize complex data. Because the resource uses standard formats that much existing software accepts as input, it is easy to integrate existing tools with new applications. The DataSet resource provides an easy way to transfer batches of data in standard formats and pass them directly to the application without further processing.

The DataSet resource has two required fields, structure and content. The structure element describes the type of data the resource contains and can be either "table" or "graph". The content element contains a string of data in either GraphML or delimited text format.
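A minimal Python sketch of the DataSet resource follows, assuming a plain dictionary representation. The helper names make_dataset and table_rows are hypothetical, and the tab delimiter is an assumption, since the draft says only "delimited text". Graph content would be handed unchanged to a GraphML-aware tool and is not parsed here.

```python
import csv
import io

STRUCTURES = ("table", "graph")


def make_dataset(structure, content):
    """Build a DataSet resource after checking the structure value."""
    if structure not in STRUCTURES:
        raise ValueError(f"structure must be one of {STRUCTURES}")
    return {"structure": structure, "content": content}


def table_rows(dataset, delimiter="\t"):
    """Split the content of a table DataSet into rows of string cells."""
    if dataset["structure"] != "table":
        raise ValueError("only a table DataSet can be split into rows")
    reader = csv.reader(io.StringIO(dataset["content"]), delimiter=delimiter)
    return list(reader)


table = make_dataset("table", "gene\tcount\nBRCA1\t12\n")
print(table_rows(table))  # prints [['gene', 'count'], ['BRCA1', '12']]
```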

changes and additions:

score should go away or accommodate multiple values, one for each match in the resource
alignment should allow text values to be embedded in the resource or to reference an external flat file
reference should be called subject if the term query is used
type and fileType should merge into one alignment type

include table data
don't understand what subjectType is
no description field


is patient id in every resource?