Peter (BioPython Dev) | 2 Aug 12:45 2006

Re: Reading sequences: FormatIO, SeqIO, etc

Leighton Pritchard wrote:
> On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> 
>>Question One
>>============
>>Is reading sequence files an important function to you, and if so which 
>>file formats in particular (e.g. Fasta, GenBank, ...)
> 
> Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW
> 

PTT (Protein table files)

http://www.ibt.unam.mx/biocomputo/hom_make_db.html
(Anyone got an NCBI link for the file format?)

GFF (General Feature Format)

http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

GFF and PTT aren't exactly what I would call sequence files, in that
they don't contain any sequence data.  But thinking about it, maybe
those files could be turned into SeqRecords or SeqFeatures (with empty
sequences).
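
As a rough illustration of that idea, here is a minimal Python sketch
that turns one GFF line into a small feature record with no sequence
attached.  It assumes the usual tab-separated layout (eight fixed
columns plus an optional ninth attributes/group column).  The names
GffFeature and parse_gff_line are made up for this example and are not
part of BioPython.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    """Hypothetical container for one GFF line; it holds no sequence data."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line):
    """Return a GffFeature, or None for blank lines and '#' comment lines."""
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) == 8:
        # Assumption: the ninth (attributes/group) column may be absent.
        cols.append("")
    if len(cols) != 9:
        raise ValueError("expected 8 or 9 tab-separated columns, got %d" % len(cols))
    seqname, source, feature, start, end, score, strand, frame, attributes = cols
    return GffFeature(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),  # "." marks a missing score
        strand, frame, attributes,
    )
```

A record like this could later be mapped onto a SeqFeature attached to
a SeqRecord with an empty sequence, but that mapping is left out here.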

> 
>>If you have had to write your own code to read a "common" file format 
>>which BioPython doesn't support, please get in touch.
> 

Leighton Pritchard | 2 Aug 13:23 2006

Re: Reading sequences: FormatIO, SeqIO, etc

On Wed, 2006-08-02 at 11:45 +0100, Peter (BioPython Dev) wrote:
> GFF (General Feature Format)
> 
> http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> 
> GFF and PTT aren't exactly what I would call sequence files, in that
> they don't contain any sequence data.  

Fair point, but GFF3 (see below) can optionally carry sequence data, and
I use them for exactly what you say here:

> those files could be turned into SeqRecords or SeqFeatures (with empty
> sequences).

I was thinking that GFF3 would be more useful than GFF:

http://song.sourceforge.net/gff3.shtml

NCBI have already gone over to this, for bacterial genomes at least
(e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gff),
and it's a much richer format than the original specification.  Andrew
Dalke has already written a GFF3 parser/writer, which is available at

http://www.dalkescientific.com/PyGFF3-0.5.tar.gz

I've not used this in anger, yet...

> It looks like there is enough overlap between the EMBL and Genbank to

Peter (BioPython Dev) | 2 Aug 14:56 2006

Re: Reading sequences: FormatIO, SeqIO, etc

Leighton Pritchard wrote:
> Fair point, but GFF3 (see below) can optionally carry sequence data,
> and I use them for exactly what you say here:
> 
>> maybe those files could be turned into SeqRecords or SeqFeatures 
>> (with empty sequences).
> 
> I was thinking that GFF3 would be more useful than GFF:
> 
> http://song.sourceforge.net/gff3.shtml
> 

Thanks for the links... interesting that GFF3 allows embedding Fasta
sequences.
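
For anyone who has not looked at the spec: the embedded sequences
follow a "##FASTA" directive at the end of the file.  A minimal Python
sketch of separating the two parts might look like the following.  The
function names split_gff3 and read_fasta are invented for this example,
and it is only a sketch, not a full GFF3 parser.

```python
def split_gff3(lines):
    """Split GFF3 lines into (annotation_lines, fasta_lines).

    Everything after the '##FASTA' directive is treated as FASTA text.
    """
    annotation, fasta = [], []
    in_fasta = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            in_fasta = True
            continue  # the directive itself belongs to neither part
        (fasta if in_fasta else annotation).append(line)
    return annotation, fasta


def read_fasta(lines):
    """Parse FASTA lines into a dict mapping record id to sequence string."""
    chunks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""  # id is the first word of the header
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}
```

Files without a "##FASTA" section simply give an empty FASTA part, so
the same code path would cover annotation-only GFF3 files.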

>> Reading your other comments, it looks like you wouldn't miss 
>> FastaRecord or GenBank records if they were phased out.
> 
> Not personally, but others may have strong opinions and breakable 
> code, yet.

There is no need to remove the current modules, just mark them as
deprecated.  Of course, if there is some strong support for these
objects then we might not want to be so harsh...
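
One way to do the marking, sketched with Python's standard warnings
module: the old entry point keeps working but tells the caller it is on
the way out.  The function parse_fasta_record below is a hypothetical
stand-in for a legacy parser, not an existing BioPython function, and
the message text is only an example.

```python
import warnings


def parse_fasta_record(text):
    """Hypothetical legacy entry point, kept so existing code does not break."""
    warnings.warn(
        "parse_fasta_record is deprecated and may be removed in a later release",
        DeprecationWarning,
        stacklevel=2,  # report the caller's line rather than this one
    )
    header, _, body = text.partition("\n")
    # Drop the leading '>' and join the wrapped sequence lines.
    return header.lstrip(">").strip(), body.replace("\n", "").replace("\r", "")
```

Existing scripts would carry on running, and the warning could be
silenced or turned into an error by the user via the warnings filters.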

> It may be a side-issue, but should a Clustal parser return an 
> Alignment object or iterate over SeqRecord objects?  And for that 
> matter, what about other MSA files in FASTA format?  I think we ought
> to allow parsers to return an Alignment where the user requests it, 
> which is a functionality I'm not currently aware of in the FASTA 