Peter Cock | 22 Jul 2012 16:19
Gravatar

Replacing SeqFeature sub_features with compound feature locations

Dear all,

One of the 'warts' in the current SeqRecord/SeqFeature object
model is how non-trivial features are stored - in particular joins
(in the terminology of GenBank/EMBL).

Previous discussions include:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
...
http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html
http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html

Consider a single gene like this from NC_000932 in our test
suite: complement(join(97999..98793,69611..69724))

Currently that becomes three SeqFeature objects, a parent
object present in the SeqRecord's feature list, and two child
objects (one for each exon) within that parent feature's
sub_features list. The parent feature gets a location which
summarises the span, so start 97999-1 (Pythonic counting),
end 69724, and strand -1.

This usage of the sub_features property in this way has
been present in Biopython for a very long time, and prevents
us using it for nesting features based on the parent/child
relationship models used in GFF (e.g. gene and CDS, or
gene, mRNA, CDS, and exon).

As Brad and I had discussed, a new separate mechanism
might be added for explicit parent/child relationships
(Continue reading)

Brad Chapman | 23 Jul 2012 15:05
Gravatar

Re: Replacing SeqFeature sub_features with compound feature locations


Peter;
Thanks for working through the sub_feature issue and coming up with this
proposal. I'm 100% on board with converting over to something more
general and this looks like a great approach.

A couple of quick thoughts:

- Would it be possible to have a back-compatible 'sub_features' that
  reconstituted features based on the compound location? This could help
  us avoid breaking scripts that use sub_features, even if we no longer
  fill those in going forward.

- How do you envision storing GFF feature hierarchies? The location
  object is more lightweight with only position and strand information.
  Nested child GFF features would have key/value pairs associated with
  them as well. Would we want to use sub_features (or some new nested
  structure) for these?

Brad

> Dear all,
>
> One of the 'warts' in the current SeqRecord/SeqFeature object
> model is how non-trivial features are stored - in particular joins
> (in the terminology of GenBank/EMBL).
>
> Previous discussions include:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
> ...
(Continue reading)

Peter Cock | 23 Jul 2012 18:02
Gravatar

Re: Replacing SeqFeature sub_features with compound feature locations

On Mon, Jul 23, 2012 at 2:05 PM, Brad Chapman <chapmanb <at> 50mail.com> wrote:
>
> Peter;
> Thanks for working through the sub_feature issue and coming up with this
> proposal. I'm 100% on board with converting over to something more
> general and this looks like a great approach.
>
> A couple of quick thoughts:
>
> - Would it be possible to have a back-compatible 'sub_features' that
>   reconstituted features based on the compound location? This could help
>   us avoid breaking scripts that use sub_features, even if we no longer
>   fill those in going forward.

When you say 'use' do you mean populate and modify? Use in the
read-only sense is already covered - in that any Biopython code
generating complex SeqFeature objects would (in the short term)
populate both the sub_feature AND the new compound location.

Things get very hairy if we want to support edits to the sub_features
also automatically updating the new compound location (and vice
versa). So I don't want to do that.

> - How do you envision storing GFF feature hierarchies? The location
>   object is more lightweight with only position and strand information.

Only in the simple cases. In addition to single line GFF features, you
have joins expressed by multiple GFF lines with a common ID. Also,
it seems quite possible that GFF3 will add a new tag entry to describe
fuzzy locations in future, see e.g. this thread
(Continue reading)

Lenna Peterson | 24 Jul 2012 18:57
Picon
Gravatar

Re: Replacing SeqFeature sub_features with compound feature locations

On Sun, Jul 22, 2012 at 10:19 AM, Peter Cock <p.j.a.cock <at> googlemail.com> wrote:
> Dear all,
>
> One of the 'warts' in the current SeqRecord/SeqFeature object
> model is how non-trivial features are stored - in particular joins
> (in the terminology of GenBank/EMBL).
>
> Previous discussions include:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
> ...
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html
> http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html
>
> Consider a single gene like this from NC_000932 in our test
> suite: complement(join(97999..98793,69611..69724))
>
> Currently that becomes three SeqFeature objects, a parent
> object present in the SeqRecord's feature list, and two child
> objects (one for each exon) within that parent feature's
> sub_features list. The parent feature gets a location which
> summarises the span, so start 97999-1 (Pythonic counting),
> end 69724, and strand -1.
>
> This usage of the sub_features property in this way has
> been present in Biopython for a very long time, and prevents
> us using it for nesting features based on the parent/child
> relationship models used in GFF (e.g. gene and CDS, or
> gene, mRNA, CDS, and exon).
>
> As Brad and I had discussed, a new separate mechanism
(Continue reading)

Peter Cock | 24 Jul 2012 19:19
Gravatar

Re: Replacing SeqFeature sub_features with compound feature locations

On Tue, Jul 24, 2012 at 5:57 PM, Lenna Peterson <arklenna <at> gmail.com> wrote:
>> This is what I have tried to do on this branch:
>> https://github.com/peterjc/biopython/tree/f_loc4
>>
>> As part of this, adding two FeatureLocations will give a
>> CompoundLocation - similarly you can add a simple
>> FeatureLocation and a CompoundLocation or two
>> CompoundLocation objects. I think this makes creating
>> a SeqFeature describing a Eukaryotic gene model
>> MUCH simpler than with the existing approach.
>>
>> (A potential refinement not implemented yet would be
>> to merge abutting exact locations automatically, so that
>> adding 123..456 and 457..999 would give 123..999
>> instead of join(123..456,457..999), but that might be
>> too much magic?)
>
> Hi Peter,
>
> I have been testing the new CompoundLocation w.r.t. coordinate mapping
> and for the most part, I find it simplifies things.

That's encouraging.

> The documentation suggests using + to combine FeatureLocations, which
> invites the use of sum. However, sum doesn't work properly. I explain
> why in my StackOverflow question:
> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior

Huh, I hadn't anticipated that - but I agree trying to use sum seems
(Continue reading)

Lenna Peterson | 24 Jul 2012 23:08
Picon
Gravatar

Re: Replacing SeqFeature sub_features with compound feature locations

>> The documentation suggests using + to combine FeatureLocations, which
>> invites the use of sum. However, sum doesn't work properly. I explain
>> why in my StackOverflow question:
>> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior
>
> Huh, I hadn't anticipated that - but I agree trying to use sum seems
> natural.
>
>> I have considered a number of workarounds:
>>
>> 1. Implementing __radd__ on FeatureLocation to return self if other ==
>> 0 allows sum() to work in place, but I am uncomfortable with
>> hard-coding such a condition.
>
> Another idea is to define FeatureLocation or CompoundFeature
> addition with an integer to expose the current private method _shift.
> i.e. Apply an offset to the co-ordinates. Something I'd been pondering
> as a (previously unrelated) enhancement. In this interpretation, adding
> zero would have no effect on the co-ordinates and thus as a side
> effect should also make sum(locations) work. We'd need to test this
> to see if that actually works.

Yes, this works fine:

Modifying FeatureLocation.__add__ with the condition:

    if isinstance(other, int):
        return self._shift(other)

and adding FeatureLocation.__radd__:
(Continue reading)

Peter Cock | 24 Jul 2012 23:38
Gravatar

Re: Replacing SeqFeature sub_features with compound feature locations

On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson <arklenna <at> gmail.com> wrote:
>>> The documentation suggests using + to combine FeatureLocations, which
>>> invites the use of sum. However, sum doesn't work properly. I explain
>>> why in my StackOverflow question:
>>> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior
>>
>> Huh, I hadn't anticipated that - but I agree trying to use sum seems
>> natural.
>>
>>> I have considered a number of workarounds:
>>>
>>> 1. Implementing __radd__ on FeatureLocation to return self if other ==
>>> 0 allows sum() to work in place, but I am uncomfortable with
>>> hard-coding such a condition.
>>
>> Another idea is to define FeatureLocation or CompoundFeature
>> addition with an integer to expose the current private method _shift.
>> i.e. Apply an offset to the co-ordinates. Something I'd been pondering
>> as a (previously unrelated) enhancement. In this interpretation, adding
>> zero would have no effect on the co-ordinates and thus as a side
>> effect should also make sum(locations) work. We'd need to test this
>> to see if that actually works.
>
> Yes, this works fine:
>
> Modifying FeatureLocation.__add__ with the condition:
>
>     if isinstance(other, int):
>         return self._shift(other)
>
(Continue reading)

Peter Cock | 6 Sep 2012 02:34
Gravatar

Re: Replacing SeqFeature sub_features with compound feature locations

On Tue, Jul 24, 2012 at 10:38 PM, Peter Cock <p.j.a.cock <at> googlemail.com> wrote:
> On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson <arklenna <at> gmail.com> wrote:
>> I agree that an "upgraded" FeatureLocation could be more
>> elegant.
>
> It could turn out to be simpler having just one location object...
> certainly worth trying out before committing this branch as is.

Such a new  "upgraded" FeatureLocation would need to hold
a list/tuple of its parts (rather like the proposed CompoundLocation),
and those could be simply as tuples of start, end, strand, db_ref
etc (essentially everything currently held in a FeatureLocation).

I'm not sure that that is any better than the new class
CompoundLocation holding a list of existing FeatureLocation
objects.

On the bright side, the branch still works nicely with the
extra BioSQL tests I added.

One of the issues worth a bit more discussion is the start
and end values of the CompoundLocation - which I am
considering making act as the left/minimum and right/
maximum boundary of the region spanned by the parts.

For normal forward strand features this does give the
biological start and end, likewise for reverse strand
features but inverted (location's start gives the biological
end). i.e. for *most* features this means no change to
the current behaviour.
(Continue reading)


Gmane