Discussion:
[Bioperl-l] Polyproteins, ribo slippage, and mat_peptide in viruses?
Chris Larsen
2009-10-27 16:33:01 UTC
Permalink
All,

I am attempting to find some solutions to a DB loading problem we are
encountering in viruses. It is multifold:

Some viruses churn out a polyprotein rather than individual peptides;
further, they also slip the ribosome, so a source nucleotide is used
more than once in translation (the ribosome halts, backs up one
nucleotide, and continues in a new frame); and finally we have
post-translational processing into mature peptides. The main thing is
that each mature peptide is contained as a subset of the whole parent
polyprotein, but is not provided as a separate sequence in the GBK
file for each mat_peptide. We have to get that in order to run
algorithms on the relevant processed proteins. Therefore we cannot
directly load into GUS, but rather have to choose how to get the
mat_peptide sequence. Actually I think the viruses know that, and are
just messing with us out of spite, since we have iPods and they don't.
Anyway, from anyone who has encountered this I seek guidance.

We have as choices:

1. Get the locations of mature peptide children in /Protein/
   - carve the mat_peptide sequence out of the whole polyprotein
     translation
   - check that the mat_peptide is in fact an identical subset of the
     translated protein
   - load that

OR

2. Use the locations of starts and stops in /Nucleotide/
   - translate that, using the slippage information
   - get mature peptides that line up exactly to the parent polyprotein
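
Option 1 above can be sketched in a few lines. This is only a minimal illustration in plain Python, assuming the peptide location is given as 1-based inclusive protein coordinates; the function names carve_peptide and is_identical_subset are hypothetical and not part of BioPerl or Biopython.

```python
def carve_peptide(polyprotein, first, last):
    """Return residues first..last (1-based, inclusive) of the polyprotein."""
    if not (1 <= first <= last <= len(polyprotein)):
        raise ValueError("peptide coordinates fall outside the polyprotein")
    return polyprotein[first - 1:last]


def is_identical_subset(peptide, polyprotein):
    """True if the peptide occurs verbatim somewhere in the polyprotein."""
    return bool(peptide) and peptide in polyprotein


# Toy data, not taken from any real record.
polyprotein = "MKFWPDSTGA"
peptide = carve_peptide(polyprotein, 4, 6)
print(peptide, is_identical_subset(peptide, polyprotein))
```

The subset check only becomes meaningful when the peptide comes from an independent source, for example a translation of the nucleotide coordinates as in option 2.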

If you know of BioPerl sequence handling support for this, I would
love to hear more. Clearly this is a nonstandard thingamabob.

Stupid viruses

Chris
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Peter
2009-10-27 17:29:08 UTC
Permalink
Post by Chris Larsen
All,
I am attempting to find some solutions to a DB loading problem we are
encountering in viruses. It is multifold:
Some viruses churn out a polyprotein rather than individual peptides;
further they also slip the ribosome, so a source nucleotide is used more
than once in translation (ribosome halts, backs up one nucleotide, and
continues in a new frame); and finally we have post translational processing
into mature peptides. The main thing is that the mature peptide is contained
as a subset of the whole parent polyprotein, but is not provided as a single
file in GBK for each mat_peptide CDS. We have to get that in order to run
algorithms on the relevant processed proteins. Therefore we cannot directly
load into GUS, but rather have to choose how to get the mat_peptide
sequence. Actually I think the viruses know that, and are just messing with
us out of spite, since we have iPods and they don't. Anyway.. from anyone who
has encountered this I seek guidance.
1. Get the locations of mature peptide children in /Protein/
carve the mat_peptide sequence out of the whole polyprotein translation
check that the mat_peptide is in fact an identical subset of the translated
protein
load that
OR
2. Use the locations of starts and stops in /Nucleotide/
translate that, using the slippage information
get mature peptides that line up exactly to the parent polyprotein
If you know of BioPerl sequence handling support for this, I would love to
hear more. Clearly this is a nonstandard thingamabob.
Stupid viruses
Chris
Cool viruses :)

Do you have some specific examples from GenBank? I'm starting
to deal with virus annotation in my work, so this is of interest.

Peter

P.S. As you might guess from my email address, I'm actually more
interested in Biopython than BioPerl, but the same algorithmic
approach could be tested in either.
Peter
2009-10-27 19:54:04 UTC
Permalink
Post by Chris Larsen
Hello Peter!
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
...
No mat_peptide sequence is given. We want that...
Looking at the GenBank file displayed, the mat_peptide features
(mature peptides) do not include a translation entry (like the parent
CDS feature does). However, they do have protein IDs - which are
actually links in the HTML version.

This leads me to suggest a third option as an alternative to the two
ideas you outlined. You could parse the GenBank file(s), and for each
mat_peptide feature look up the protein ID via Entrez EFetch (e.g. as
a FASTA file, or a GenPept file). If you only have a relatively small
number of viruses and proteins this is probably going to be pretty
easy. At least, I could do it in Biopython and I am sure the same is
true with the BioPerl GenBank parser and their EFetch interface.

However, for a large dataset, handling it all locally (your options
(1) and (2)) sounds best.
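
For the lookup idea, the request itself is simple to build. The sketch below only constructs an NCBI E-utilities EFetch URL for a protein ID and does not contact the network; the helper name efetch_protein_url is hypothetical, and the ID shown is the first mature peptide of the NC_001959 example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_protein_url(protein_id, rettype="fasta"):
    """Build an EFetch URL for one protein record.

    rettype="fasta" asks for FASTA; rettype="gp" asks for GenPept.
    """
    query = urlencode({
        "db": "protein",
        "id": protein_id,
        "rettype": rettype,
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"


print(efetch_protein_url("NP_786945.1"))
```

Fetching thousands of records this way would need rate limiting, which is one more reason to handle a large dataset locally.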

Peter
Chris Larsen
2009-10-27 20:07:55 UTC
Permalink
Peter,

This is a good strategy when the gi is given. However I failed to
mention that we are finding the example I gave is unusual (15%?)---
most virus 'mature peptides' we will apply this analysis to do not in
fact have a gi number or unique identifier associated with them. There
are thousands of dengue virus files to be processed to give mature
proteins.

Should have mentioned this... Hence the problem: we can't look it up
because only the parent polyprotein has a gi. There's nothing to look
up /by/ in most cases. So we still have to build a set of proteins
that are cleaved out of every polyprotein, by local and high
throughput methods, by building it out of the available information
(sadly, kind of a runaround; it should be in the GenBank entry).

Chris
Peter
2009-10-27 20:17:19 UTC
Permalink
Post by Chris Larsen
Peter,
This is a good strategy when the gi is given. However I failed to mention
that we are finding the example I gave is unusual (15%?)---most virus
'mature peptides' we will apply this analysis to do not in fact have a gi
number or unique identifier associated with them. There are thousands of
dengue virus files to be processed to give mature proteins.
Should have mentioned this...Hence the problem--we can't look it up because
only the parent polyprotein has a gi. There's nothing to look up /by/ in most
cases. So we still have to build a set of proteins that are cleaved out of
every polyprotein, by local and high throughput methods, by building it out
of the available information (sadly, kind of a runaround-- it should be in
the genbank entry).
Chris
Ah. That's a shame. I did just take a few minutes to try out the
EFetch idea (using Biopython) and it does work beautifully for
this "nice" example virus which the NCBI have annotated.

I also note that in the example given, all the mature peptides
have nice and simple locations (in terms of their co-ordinates
for the nucleotides), no ribosomal slippages etc. This means
grabbing the relevant bits of the genome and translating them is
also pretty easy (option 2 in your original email).

Have you got a more typical entry you can point us at?

If there is nothing publicly available, I wouldn't mind you
emailing me one or two to look at off list (and if you don't mind,
they might make good examples for Bio* project unit tests
or examples).

Peter
Chris Fields
2009-10-27 20:46:05 UTC
Permalink
Post by Peter
Post by Chris Larsen
Peter,
This is a good strategy when the gi is given. However I failed to mention
that we are finding the example I gave is unusual (15%?)---most virus
'mature peptides' we will apply this analysis to do not in fact have a gi
number or unique identifier associated with them. There are thousands of
dengue virus files to be processed to give mature proteins.
Should have mentioned this...Hence the problem--we can't look it up because
only the parent polyprotein has a gi. There's nothing to look up /by/ in most
cases. So we still have to build a set of proteins that are cleaved out of
every polyprotein, by local and high throughput methods, by building it out
of the available information (sadly, kind of a runaround-- it should be in
the genbank entry).
Chris
Ah. That's a shame. I did just take a few minutes to try out the
EFetch idea (using Biopython) and it does work beautifully for
this "nice" example virus which the NCBI have annotated.
Interesting thing about that example: if you follow the hyperlinks for
the mat_peptide feature key, they relate back to the full protein
sequence with from/to, not to the protein_id for the feature. Example:

# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts

# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959

This record doesn't appear to contain any mapping information along
those lines, which makes me think this is an autogenerated record
using the Gene record, which does have those mappings:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970
Post by Peter
I also note that in the example given, all the mature peptides
have nice and simple locations (in terms of their co-ordinates
for the nucleotides), no ribosomal slippages etc. This means
grabbing the relevant bits of the genome and translating them is
also pretty easy (option 2 in your original email).
Have you got a more typical entry you can point us at?
If there is nothing publicly available, I wouldn't mind you
emailing me one or two to look at off list (and if you don't mind,
they might make good examples for Bio* project unit tests
or examples).
Peter
I think one could use the full-length protein and run TFASTX (which
allows frameshifts) against the nucleotide sequence. The output will
have the frameshifts designated with '/' or '\', so it would then be a
matter of splitting the sequence based on the midline, then mapping
those protein fragments back to the original sequence coordinates. Is
this along the lines of what you mean?
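
The splitting step could look roughly like the following. It is a sketch only: it assumes the translated sequence has already been pulled out of the TFASTX alignment as a single string, that '/' and '\' are the only frameshift marks, and that '-' marks alignment gaps. The function name split_on_frameshifts is hypothetical, and mapping the fragments back to nucleotide coordinates is left out.

```python
import re


def split_on_frameshifts(translated):
    """Split an aligned translated sequence at '/' and '\\' frameshift marks.

    Returns (start_index, fragment) pairs, where start_index is the 0-based
    position of the fragment within the marker-free protein string.
    """
    fragments = []
    position = 0
    for piece in re.split(r"[/\\]", translated):
        residues = piece.replace("-", "")  # drop alignment gap characters
        if residues:
            fragments.append((position, residues))
            position += len(residues)
    return fragments


# Toy alignment string, not real TFASTX output.
print(split_on_frameshifts("MKF/WPD"))
```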

chris
Peter
2009-10-27 21:31:55 UTC
Permalink
Post by Peter
Ah. That's a shame. I did just take a few minutes to try out the
EFetch idea (using Biopython) and it does work beautifully for
this "nice" example virus which the NCBI have annotated.
Interesting thing about that example: if you follow the hyperlinks for the
mat_peptide feature key, they relate back to the full protein sequence with
from/to, not to the protein_id for the feature. Example:
# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts
# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959
Right - the protein ID link is just based on the GI number, 28416959.
This link (or EFetch) gives you the (short) mature peptide.
This record doesn't appear to contain any mapping information along those
lines, which makes me think this is an autogenerated record using the Gene
record, which does have those mappings:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970
Are you suggesting one option is (if the mat_peptide annotation
is lacking a protein or GI number) to go online to the Gene
database using the gene ID of the precursor (parent) protein
to find the IDs of the mature (child) peptides?

Peter
bill
2009-10-27 21:47:02 UTC
Permalink
These mature proteins do have gi.
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
grep NC_001959 gene2accession
11983 1491970 REVIEWED - - NP_786945.1 28416959 NC_001959.2 106060735 4 5373 + -
11983 1491970 REVIEWED - - NP_786946.1 28416960 NC_001959.2 106060735 4 5373 + -
11983 1491970 REVIEWED - - NP_786947.1 28416961 NC_001959.2 106060735 4 5373 + -
11983 1491970 REVIEWED - - NP_786948.1 28416962 NC_001959.2 106060735 4 5373 + -
11983 1491970 REVIEWED - - NP_786949.1 28416963 NC_001959.2 106060735 4 5373 + -
11983 1491970 REVIEWED - - NP_786950.1 28416964 NC_001959.2 106060735 4 5373 + -
11983 1491971 PROVISIONAL - - NP_056822.1 9630806 NC_001959.2 106060735 6949 7587 + -
11983 1491972 PROVISIONAL - - NP_056821.2 106060736 NC_001959.2 106060735 5357 6949 + -
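
The same lookup can be scripted instead of using grep. The sketch below assumes the column order seen in the rows above (tax_id, GeneID, status, RNA accession, RNA gi, protein accession, protein gi, genomic accession, genomic gi, start, end, orientation, assembly); newer versions of the file may carry extra columns. The function name proteins_for_genome is hypothetical.

```python
def proteins_for_genome(lines, genomic_accession):
    """Collect protein accession/gi pairs listed for one genomic accession.

    genomic_accession is given without its version suffix, e.g. "NC_001959".
    """
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header line of the real file
        fields = line.split()
        if len(fields) < 13:
            continue  # not a complete record
        if fields[7].split(".")[0] == genomic_accession:
            hits.append({
                "gene_id": fields[1],
                "protein_accession": fields[5],
                "protein_gi": fields[6],
                "start": fields[9],
                "end": fields[10],
            })
    return hits


rows = [
    "11983 1491970 REVIEWED - - NP_786945.1 28416959 NC_001959.2 106060735 4 5373 + -",
    "11983 1491971 PROVISIONAL - - NP_056822.1 9630806 NC_001959.2 106060735 6949 7587 + -",
]
for hit in proteins_for_genome(rows, "NC_001959"):
    print(hit["protein_accession"], hit["protein_gi"])
```

Note that the start and end columns describe the parent gene, not the individual mature peptide, so this gives IDs but not peptide boundaries.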

Bill
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Peter
2009-10-27 22:03:12 UTC
Permalink
Post by bill
These mature proteins do have gi.
Yes, they are in the original GenBank file too. The
problem is Chris said he picked a poor example - this
one is well annotated, but in general his GenBank files
lack these cross references. Which is why I asked for a
more realistic example if he can share one ;)

Peter
Chris Larsen
2009-10-27 22:13:22 UTC
Permalink
Peter, Chris,

Thank you muchly for your expert and well presented dialog. Yes here
is an actual and typical problem in generating protein seq from viral
polyproteins, in the absence of mat_peptide Seq and unique ID:

For record:

http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank

this is the coronavirus:

LOCUS       DQ848678   29277 bp   RNA   linear   VRL 12-SEP-2006
DEFINITION  Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION   DQ848678
VERSION     DQ848678.1  GI:112253723

Containing a polyprotein 1ab where the ribosome stalls at 12358,
backs up one nuc, changes frame, and then continues:

     CDS             join(311..12358,12358..20391)
                     /ribosomal_slippage
                     /codon_start=1
                     /product="polyprotein 1ab"

and which has multiple, gigantic polyproteins, the child peptides
towards which we would actually love to focus our comparative
bioinformatic scrutiny, but none of which have mappable IDs or seqs
below the polyprotein level, as can be seen:

     mat_peptide     15118..16914        <===
                     /product="nsp13"    <===
                     /note="helicase"    <== these are all we have to go on

and where we are given no ID below the polyprotein level, no protein
sequence... just positions. They are nuc positions at that, but we are
given the complete polyprotein seq and so have the components to do
this on paper, but no code. In summary we would like to dump GenBank
files into a method, and get out FASTA protein files which have some
IDs and seqs.
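
The "on paper" arithmetic can be written down directly. This is a minimal sketch, assuming plus-strand features, /codon_start=1, and a mat_peptide whose ends fall on codon boundaries; the function names spliced_offsets and aa_range are hypothetical. It maps nucleotide positions onto the spliced CDS (where base 12358 is counted twice) and converts them to 1-based residue numbers in the polyprotein.

```python
def spliced_offsets(cds_segments, pos):
    """Yield each 0-based offset in the spliced CDS that genomic pos maps to.

    A base reused by ribosomal slippage sits in two segments, so it can
    map to two offsets.
    """
    offset = 0
    for start, end in cds_segments:
        if start <= pos <= end:
            yield offset + (pos - start)
        offset += end - start + 1


def aa_range(cds_segments, nuc_start, nuc_end):
    """Return (first, last) 1-based residue numbers, or None if out of frame."""
    starts = [o for o in spliced_offsets(cds_segments, nuc_start) if o % 3 == 0]
    ends = [o for o in spliced_offsets(cds_segments, nuc_end) if o % 3 == 2]
    if not starts or not ends:
        return None
    return starts[0] // 3 + 1, ends[0] // 3 + 1


# Coordinates quoted above: polyprotein 1ab and the nsp13 mat_peptide.
cds = [(311, 12358), (12358, 20391)]
print(aa_range(cds, 15118, 16914))
```

With the coordinates as quoted, nsp13 comes out as residues 4937 to 5535 of the polyprotein (599 residues), which could then be sliced out of the /translation string.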

You guys are forcing me (thank god) to think critically and clearly
about it too, so let me extend the proposed module or method as best I
can:

Input offending virus GenBank file
For each mat_peptide in a CDS
    get nuc coords, e.g. 15118..16914
    translate it to aa
    if slippage then
        stop translating at end of frame 1 and hold for later
        now translate frame 2
        $full_translation = $translation_part_1 . $translation_part_2
    compare $full_translation to CDS translation, e.g.
        /translation="MSSKHFKILVNE..."
    if identical subsequence then admit as the real, valid
        mat_peptide sequence
    annotate with parent unique ID + product name, e.g. 112253723_helicase
    go to next mat_peptide
Concatenate products to a FASTA amino acid file for whole virus
isolate or CDS set for your virus
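
The steps above can be sketched as follows. This is plain Python rather than BioPerl, and it rests on several assumptions: plus-strand features only, the standard genetic code, locations already parsed into lists of 1-based inclusive (start, end) segments, and a slipped base simply appearing in two adjacent segments. All function names are hypothetical, and the genome is a toy string, not a real record.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the usual TCAG ordering.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def extract(genome, segments):
    """Join the 1-based inclusive segments of a location into one string."""
    return "".join(genome[start - 1:end] for start, end in segments)


def translate(nucleotides):
    """Translate whole codons; unknown codons become 'X'."""
    nucleotides = nucleotides.upper().replace("U", "T")
    usable = len(nucleotides) - len(nucleotides) % 3
    return "".join(
        CODON_TABLE.get(nucleotides[i:i + 3], "X") for i in range(0, usable, 3)
    )


def mature_peptide(genome, cds_segments, peptide_segments):
    """Translate a mat_peptide and accept it only if the parent contains it."""
    parent = translate(extract(genome, cds_segments)).rstrip("*")
    candidate = translate(extract(genome, peptide_segments))
    return candidate if candidate and candidate in parent else None


def fasta_record(parent_id, product, sequence, width=60):
    """Format one FASTA entry named <parent_id>_<product>."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{parent_id}_{product}\n" + "\n".join(lines) + "\n"


# Toy genome: CDS join(1..9,9..20) reuses base 9, like a -1 slippage.
genome = "ATGAAATTTGGCCCGATTAA"
cds = [(1, 9), (9, 20)]
peptide = mature_peptide(genome, cds, [(12, 17)])
if peptide is not None:
    print(fasta_record("112253723", "helicase", peptide), end="")
```

A substring test can match in the wrong place for very short peptides, so for real data it may be safer to also compare the expected residue offset.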

PHEW.

Chris L

==========
Post by Chris Fields
I think one could use the full-length protein and run TFASTX (which
allows frameshifts) against the nucleotide sequence. The output
will have the frameshifts designated with '/' or '\', so it would
then be a matter of splitting the sequence based on the midline,
then mapping those protein fragments back to the original sequence
coordinates. Is this along the lines of what you mean?
chris
Let me look into this, thank you CF; I have not used that in the past.
bill
2009-10-27 23:39:56 UTC
Permalink
It appears that acc/gi is not required for mat_peptide. Your solution
should work fine.

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objmgr/util/create_defline.cpp#L870

Bill
bill
2009-10-27 23:39:56 UTC
Permalink
It appears that acc/gi is not required for mat_peptide. Your solution
should work fine.

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objmgr/util/create_defline.cpp#L870

Bill
Post by Chris Larsen
Peter, Chris,
Thank you muchly for your expert and well presented dialog. Yes here
is an actual and typical problem in generating protein seq from viral
http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank
LOCUS DQ848678 29277 bp RNA linear VRL 12-
SEP-2006
DEFINITION Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION DQ848678
VERSION DQ848678.1 GI:112253723
Containing a poly protein 1ab where the ribosome stalls at 12358,
CDS join(311..12358,12358..20391)
/ribosomal_slippage
/codon_start=1
/product="polyprotein 1ab"
and which has multiple, gigantic polyproteins, the child peptides
towards whom we would actually love to focus our comparative
bioinformatic scrutiny, but none of which have mappable IDs or seqs
mat_peptide 15118..16914 <===
/product="nsp13" <===
/note="helicase" <==these are all we have to go
on
and where are given no ID below the polyprotein level, no protein
sequence...just positions. They are nuc positions at that, but we are
given the complete polyprotein seq and have the components to do this
on paper, but no code. In summary we would like to dump in genbank
files to a method, and get out fasta protein files which have some IDs
and seqs.
You guys are forcing me (thank god) to think critically and clearly
about it too, so let me extend the proposed module or method as best I
Input offending Virus Genbank file
For each mat_peptide in a CDS
get nuc coords e.g. 15118..161914
translate it to aa
if slippage then stop translating at end of frame 1 and hold for later
now translate frame 2
$full_translation = $translation_part_1 . $translation_part_2
compare $full_translation to CDS translation e.g "/
translation="MSSKHFKILVNE..."
if identical subsequence then admit as the real, valid mat_peptide
sequence
Annotate with parent Unique ID+product name e.g. 112253723_helicase
go to next mat_peptide
concatenate products to a fasta amino acid file for whole virus
isolate or CDS set for your virus
PHEW.
Chris L
==========
Post by Chris Fields
I think one could use the full-length protein and run TFASTX (which
allows frameshifts) against the nucleotide sequence. The output
will have the frameshifts designated with '/' or '\', so it would
then be a matter of splitting the sequence based on the midline,
then mapping those protein fragments back to the original sequence
coordinates. Is this along the lines of what you mean?
chris
Let me look into this thank you CF, I have not used that in the past.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
bill
2009-10-27 23:39:56 UTC
Permalink
It appears that acc/gi is not required for mat_peptide. Your solution
should work fine.

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objmgr/util/create_defline.cpp#L870

Bill
Post by Chris Larsen
Peter, Chris,
Thank you muchly for your expert and well presented dialog. Yes here
is an actual and typical problem in generating protein seq from viral
polyproteins, in the absence of mat_peptide Seq and unique ID:
http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank
LOCUS DQ848678 29277 bp RNA linear VRL 12-SEP-2006
DEFINITION Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION DQ848678
VERSION DQ848678.1 GI:112253723
Containing a polyprotein 1ab where the ribosome stalls at 12358,
backs up one nuc, changes frame, and then continues:
CDS join(311..12358,12358..20391)
/ribosomal_slippage
/codon_start=1
/product="polyprotein 1ab"
and which has multiple, gigantic polyproteins, the child peptides
on which we would actually love to focus our comparative
bioinformatic scrutiny, but none of which have mappable IDs or seqs
below the polyprotein level as can be seen:
mat_peptide 15118..16914 <===
/product="nsp13" <===
/note="helicase" <== these are all we have to go on
and where we are given no ID below the polyprotein level, no protein
sequence...just positions. They are nuc positions at that, but we are
given the complete polyprotein seq and have the components to do this
on paper, but no code. In summary we would like to dump in genbank
files to a method, and get out fasta protein files which have some IDs
and seqs.
You guys are forcing me (thank god) to think critically and clearly
about it too, so let me extend the proposed module or method as best I
can:
Input offending virus GenBank file
For each mat_peptide in a CDS
    get nuc coords, e.g. 15118..16914
    translate it to aa
    if slippage then stop translating at end of frame 1 and hold for later
        now translate frame 2
        $full_translation = $translation_part_1 . $translation_part_2
    compare $full_translation to CDS translation, e.g. /translation="MSSKHFKILVNE..."
    if identical subsequence then admit as the real, valid mat_peptide sequence
    annotate with parent unique ID + product name, e.g. 112253723_helicase
    go to next mat_peptide
Concatenate products to a FASTA amino acid file for whole virus
isolate or CDS set for your virus
PHEW.
Chris L
==========
Post by Chris Fields
I think one could use the full-length protein and run TFASTX (which
allows frameshifts) against the nucleotide sequence. The output
will have the frameshifts designated with '/' or '\', so it would
then be a matter of splitting the sequence based on the midline,
then mapping those protein fragments back to the original sequence
coordinates. Is this along the lines of what you mean?
chris
Let me look into this thank you CF, I have not used that in the past.
Peter
2009-10-27 21:31:55 UTC
Permalink
Post by Peter
Ah. That's a shame. I did just take a few minutes to try out the
EFetch idea (using Biopython) and it does work beautifully for
this "nice" example virus which the NCBI have annotated.
Interesting thing about that example: if you follow the hyperlinks for the
mat_peptide feature key, they relate back to the full protein sequence with
from/to, not to the protein_id for the feature. Example:
# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts
# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959
Right - the protein ID link is just based on the GI number, 28416959.
This link (or EFetch) gives you the (short) mature peptide.
This record doesn't appear to contain any mapping information along those
lines, which makes me think this is an autogenerated record using the Gene
record, which does have those mappings:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970
Are you suggesting one option is (if the mat_peptide annotation
is lacking a protein or GI number) to go online to the Gene
database using the gene ID of the precursor (parent) protein
to find the IDs of the mature (child) peptides?

Peter
Chris Larsen
2009-10-27 22:13:22 UTC
Permalink
Peter, Chris,

Thank you muchly for your expert and well presented dialog. Yes here
is an actual and typical problem in generating protein seq from viral
polyproteins, in the absence of mat_peptide Seq and unique ID:

For record:

http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank

this is the coronavirus:
LOCUS DQ848678 29277 bp RNA linear VRL 12-SEP-2006
DEFINITION Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION DQ848678
VERSION DQ848678.1 GI:112253723
Containing a polyprotein 1ab where the ribosome stalls at 12358,
backs up one nuc, changes frame, and then continues:
CDS join(311..12358,12358..20391)
/ribosomal_slippage
/codon_start=1
/product="polyprotein 1ab"
and which has multiple, gigantic polyproteins, the child peptides
on which we would actually love to focus our comparative
bioinformatic scrutiny, but none of which have mappable IDs or seqs
below the polyprotein level as can be seen:
mat_peptide 15118..16914 <===
/product="nsp13" <===
/note="helicase" <== these are all we have to go on
and where we are given no ID below the polyprotein level, no protein
sequence...just positions. They are nuc positions at that, but we are
given the complete polyprotein seq and have the components to do this
on paper, but no code. In summary we would like to dump in genbank
files to a method, and get out fasta protein files which have some IDs
and seqs.
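
The "on paper" arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: /codon_start=1, mat_peptide boundaries that fall on codon boundaries of the parent CDS, and location strings limited to plain spans and join(...). The function names (parse_location, spliced_offsets, peptide_aa_range) are made up for this sketch and are not BioPerl or Biopython API.

```python
def parse_location(location):
    """Parse '15118..16914' or 'join(311..12358,12358..20391)' into a
    list of (start, end) tuples, 1-based inclusive as in GenBank."""
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = []
    for part in text.split(","):
        start, end = part.split("..")
        segments.append((int(start), int(end)))
    return segments


def spliced_offsets(segments, pos):
    """Return the 0-based offsets of genomic position `pos` in the spliced CDS.

    A slippage site such as 12358 sits in two segments, so more than one
    offset can come back; the caller picks the in-frame one."""
    offsets = []
    consumed = 0
    for start, end in segments:
        if start <= pos <= end:
            offsets.append(consumed + pos - start)
        consumed += end - start + 1
    return offsets


def peptide_aa_range(cds_location, pep_location):
    """Map a mat_peptide nucleotide span to a 0-based, half-open amino acid
    slice of the parent /translation (assumes /codon_start=1)."""
    cds = parse_location(cds_location)
    pep = parse_location(pep_location)
    pep_start, pep_end = pep[0][0], pep[-1][1]
    # A peptide must start on the first base of a codon and end on the third.
    starts = [o for o in spliced_offsets(cds, pep_start) if o % 3 == 0]
    ends = [o for o in spliced_offsets(cds, pep_end) if o % 3 == 2]
    if len(starts) != 1 or len(ends) != 1:
        raise ValueError("mat_peptide is not codon-aligned within the CDS")
    return starts[0] // 3, ends[0] // 3 + 1


aa_start, aa_end = peptide_aa_range("join(311..12358,12358..20391)", "15118..16914")
print(aa_start, aa_end, aa_end - aa_start)  # 4936 5535 599
```

With the /translation string of the parent CDS in hand, the nsp13 sequence would then be translation[4936:5535], a 599 residue slice.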

You guys are forcing me (thank god) to think critically and clearly
about it too, so let me extend the proposed module or method as best I
can:

Input offending virus GenBank file
For each mat_peptide in a CDS
    get nuc coords, e.g. 15118..16914
    translate it to aa
    if slippage then stop translating at end of frame 1 and hold for later
        now translate frame 2
        $full_translation = $translation_part_1 . $translation_part_2
    compare $full_translation to CDS translation, e.g. /translation="MSSKHFKILVNE..."
    if identical subsequence then admit as the real, valid mat_peptide sequence
    annotate with parent unique ID + product name, e.g. 112253723_helicase
    go to next mat_peptide
Concatenate products to a FASTA amino acid file for whole virus
isolate or CDS set for your virus

PHEW.

Chris L
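
The steps proposed above could look roughly like this in Python. It is a sketch, not a tested loader: it uses the standard genetic code only, assumes the genome is already available as a string and the feature locations are already parsed into 1-based (start, end) tuples, and all names (splice, translate, mature_peptide_fasta) plus the toy genome are invented for illustration. A slippage site is handled simply by reading the shared nucleotide twice, as the join(...) location implies.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def splice(genome, segments):
    """Join 1-based inclusive (start, end) segments; a position listed in
    two segments (the slippage site) is read twice."""
    return "".join(genome[start - 1:end] for start, end in segments)


def translate(nucleotides):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def mature_peptide_fasta(genome, parent_id, parent_translation, product_name, segments):
    """Translate one mat_peptide; return a FASTA record, or None when the
    result is not an exact subsequence of the parent polyprotein."""
    peptide = translate(splice(genome, segments)).rstrip("*")
    if peptide not in parent_translation:
        return None
    return ">%s_%s\n%s\n" % (parent_id, product_name, peptide)


# Toy genome (hypothetical): CDS join(3..8,8..16), position 8 is read twice.
genome = "GGATGAAACCGGGTAACC"
polyprotein = translate(splice(genome, [(3, 8), (8, 16)])).rstrip("*")
print(polyprotein)  # MKTG
print(mature_peptide_fasta(genome, "toy1", polyprotein, "nsp2", [(8, 13)]))
```

The substring test is the same validity check as in the outline; for very short peptides it could match by chance at the wrong place, so comparing against the expected position in the polyprotein would be a stricter variant.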

==========
Post by Chris Fields
I think one could use the full-length protein and run TFASTX (which
allows frameshifts) against the nucleotide sequence. The output
will have the frameshifts designated with '/' or '\', so it would
then be a matter of splitting the sequence based on the midline,
then mapping those protein fragments back to the original sequence
coordinates. Is this along the lines of what you mean?
chris
Let me look into this thank you CF, I have not used that in the past.
Chris Fields
2009-10-27 20:46:05 UTC
Permalink
Post by Peter
Post by Chris Larsen
Peter,
This is a good strategy when the gi is given. However I failed to mention
that we are finding the example I gave is unusual (15%?)---most virus
'mature peptides' we will apply this analysis to do not in fact have a gi
number or unique identifier associated with them. There are thousands of
dengue virus files to be processed to give mature proteins.
Should have mentioned this...Hence the problem--we can't look it up because
only the parent polyprotein has a gi. There's nothing to look up /by/ in most
cases. So we still have to build a set of proteins that are cleaved out of
every polyprotein, by local and high throughput methods, by building it out
of the available information (sadly, kind of a run around-- it should be in
the genbank entry).
Chris
Ah. That's a shame. I did just take a few minutes to try out the
EFetch idea (using Biopython) and it does work beautifully for
this "nice" example virus which the NCBI have annotated.
Interesting thing about that example: if you follow the hyperlinks for
the mat_peptide feature key, they relate back to the full protein
sequence with from/to, not to the protein_id for the feature. Example:

# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts

# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959

This record doesn't appear to contain any mapping information along
those lines, which makes me think this is an autogenerated record
using the Gene record, which does have those mappings:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970
Post by Peter
I also note that in the example given, all the mature peptides
have nice and simple locations (in terms of their co-ordinates
for the nucleotides), no ribosomal slippages etc. This means
grabbing the relevant bits of the genome and translating it is
also pretty easy (option 2 in your original email).
Have you got a more typical entry you can point us at?
If there is nothing publicly available, I wouldn't mind you
emailing me one or two to look at off list (and if you don't mind,
they might make good examples for Bio* project unit tests
or examples).
Peter
I think one could use the full-length protein and run TFASTX (which
allows frameshifts) against the nucleotide sequence. The output will
have the frameshifts designated with '/' or '\', so it would then be a
matter of splitting the sequence based on the midline, then mapping
those protein fragments back to the original sequence coordinates. Is
this along the lines of what you mean?

chris
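
A rough Python sketch of the splitting step, for illustration only. It assumes the translated sequence has already been pulled out of the alignment as one string in which '/' and '\' mark frameshifts and '-' marks gaps; the thread does not spell out the exact output layout, so the input string and the function name split_on_frameshifts are hypothetical.

```python
import re


def split_on_frameshifts(aligned_translation):
    """Split a translated sequence on frameshift markers (forward slash and
    backslash).

    Returns (marker, fragment, aa_offset) tuples: `marker` is the symbol that
    precedes the fragment ('' for the first one) and `aa_offset` is the number
    of residues before it, not counting markers or gap characters."""
    pieces = re.split(r"([/\\])", aligned_translation)
    fragments = []
    marker = ""
    offset = 0
    for piece in pieces:
        if piece in ("/", "\\"):
            marker = piece          # remember which shift precedes the next fragment
            continue
        residues = piece.replace("-", "")
        if residues:
            fragments.append((marker, residues, offset))
            offset += len(residues)
    return fragments


# Made-up aligned string with one marker of each kind.
for marker, fragment, offset in split_on_frameshifts("MKT/GLV\\AA"):
    print(repr(marker), fragment, offset)
```

Mapping each fragment back to nucleotide coordinates would still need the alignment start position and the direction of each shift, which this sketch leaves out.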
Peter
2009-10-27 20:17:19 UTC
Permalink
Post by Chris Larsen
Peter,
This is a good strategy when the gi is given. However I failed to mention
that we are finding the example I gave is unusual (15%?)---most virus
'mature peptides' we will apply this analysis to do not in fact have a gi
number or unique identifier associated with them. There are thousands of
dengue virus files to be processed to give mature proteins.
Should have mentioned this...Hence the problem--we can't look it up because
only the parent polyprotein has a gi. There's nothing to look up /by/ in most
cases. So we still have to build a set of proteins that are cleaved out of
every polyprotein, by local and high throughput methods, by building it out
of the available information (sadly, kind of a run around-- it should be in
the genbank entry).
Chris
Ah. That's a shame. I did just take a few minutes to try out the
EFetch idea (using Biopython) and it does work beautifully for
this "nice" example virus which the NCBI have annotated.
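
For the records that do carry a protein ID, the EFetch lookup amounts to one HTTP request per mat_peptide. A minimal sketch using only the Python standard library follows; it only builds the request URL, the helper name efetch_protein_url is made up, and the ID shown is a protein GI quoted elsewhere in this thread. Biopython wraps the same service as Bio.Entrez.efetch.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_protein_url(protein_id, rettype="fasta"):
    """Build an EFetch URL for one protein record.

    rettype="fasta" asks for FASTA text; rettype="gp" asks for GenPept."""
    query = urlencode({
        "db": "protein",
        "id": protein_id,
        "rettype": rettype,
        "retmode": "text",
    })
    return EFETCH_URL + "?" + query


print(efetch_protein_url("28416959"))
```

Passing the resulting URL to urllib.request.urlopen would fetch the record; for thousands of files the requests need to be throttled, which is one more reason the local options scale better.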

I also note that in the example given, all the mature peptides
have nice and simple locations (in terms of their co-ordinates
for the nucleotides), no ribosomal slippages etc. This means
grabbing the relevant bits of the genome and translating it is
also pretty easy (option 2 in your original email).

Have you got a more typical entry you can point us at?

If there is nothing publicly available, I wouldn't mind you
emailing me one or two to look at off list (and if you don't mind,
they might make good examples for Bio* project unit tests
or examples).

Peter
Chris Larsen
2009-10-27 20:07:55 UTC
Permalink
Peter,

This is a good strategy when the gi is given. However I failed to
mention that we are finding the example I gave is unusual (15%?)---
most virus 'mature peptides' we will apply this analysis to do not in
fact have a gi number or unique identifier associated with them. There
are thousands of dengue virus files to be processed to give mature
proteins.

Should have mentioned this...Hence the problem--we can't look it up
because only the parent polyprotein has a gi. There's nothing to look
up /by/ in most cases. So we still have to build a set of proteins
that are cleaved out of every polyprotein, by local and high
throughput methods, by building it out of the available information
(sadly, kind of a run around-- it should be in the genbank entry).

Chris
Post by Peter
Hello Peter!
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
...
No mat_peptide sequence is given. We want that...
Looking at the GenBank file displayed, the mat_peptide features
(mature peptides) do not include a translation entry (like the parent
CDS feature does). However, they do have protein IDs - which are
actually links in the HTML version.
This leads me to suggest a third option as an alternative to the two
ideas you outlined. You could parse the GenBank file(s), and for each
mat_peptide feature look up the protein ID via Entrez EFetch (e.g. as
a FASTA file, or a GenPept file). If you only have a relatively small
number of viruses and proteins this is probably going to be pretty
easy. At least, I could do it in Biopython and I am sure the same is
true with the BioPerl GenBank parser and their EFetch interface.
However, for a large dataset, handling it all locally (your options
(1) and (2)) sounds best.
Peter
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Chris Larsen
2009-10-27 20:07:55 UTC
Permalink
Peter,

This is a good strategy when the gi is given. However I failed to
mention that we are finding the example I gave is unusual (15%?)---
most virus 'mature peptides' we will apply this analysis to do not in
fact have a gi number or unique identifier associated with them. There
are thousands of dengue virus files to be processed to give mature
proteins.

Should have mentioned this...Hence the problem--we cant look it up
because only the parent polyprotein has a gi. Theres nothing to look
up /by/ in most cases. So we still have to build a set of proteins
that are cleaved out of every polyprotein, by local and high
throughput methods, by building it out of the available information
(sadly, kind of a run around-- it should be in the genbank entry).

Chris
Post by Peter
Hello Peter!
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
...
No mat_peptide sequence is given. We want that...
Looking at the GenBank file displayed, the mat_peptide features
(mature peptides) do not include a translation entry (like the parent
CDS feature does). However, they do have protein IDs - which are
actually links in the HTML version.
This leads me to suggest a third option as an alternative to the two
ideas you outlined. You could parse the GenBank file(s), and for each
mat_peptide feature look up the protein ID via Entrez EFetch (e.g. as
a FASTA file, or a GenPept file). If you only have a relatively small
number of viruses and proteins this is probably going to be pretty
easy. At least, I could do it in Biopython and I am sure the same is
true with the BioPerl GenBank parser and their EFetch interface.
However, for a large dataset, handling it all locally (your options
(1) and (2)) sounds best.
Peter
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Peter
2009-10-27 19:54:04 UTC
Permalink
Hello Peter!
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
...
No mat_peptide sequence is given. We want that...
Looking at the GenBank file displayed, the mat_peptide features
(mature peptides) do not include a translation entry (like the parent
CDS feature does). However, they do have protein IDs - which are
actually links in the HTML version.

This leads me to suggest a third option as an alternative to the two
ideas you outlined. You could parse the GenBank file(s), and for each
mat_peptide feature look up the protein ID via Entrez EFetch (e.g. as
a FASTA file, or a GenPept file). If you only have a relatively small
number of viruses and proteins this is probably going to be pretty
easy. At least, I could do it in Biopython and I am sure the same is
true with the BioPerl GenBank parser and their EFetch interface.

However, for a large dataset, handling it all locally (your options
(1) and (2)) sounds best.

Peter
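
The lookup step in this third option could be sketched roughly as follows. This is plain Python rather than Biopython or BioPerl, and it is only a sketch under stated assumptions: the feature table text, the coordinates, and the `XX_...` accessions are made up for illustration, the helper names are hypothetical, and the parser only handles single-line locations. The EFetch URL is built but never fetched.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def mat_peptide_protein_ids(feature_table):
    """Return [(location, protein_id or None)] for each mat_peptide feature.

    Assumes feature keys are indented 5 spaces (the GenBank flat-file layout)
    and that each location fits on one line.
    """
    found = []
    current = None  # the mat_peptide currently being read, if any
    for line in feature_table.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 5:  # a new feature starts on this line
            key, _, location = stripped.partition(" ")
            current = [location.strip(), None] if key == "mat_peptide" else None
            if current is not None:
                found.append(current)
        elif current is not None and stripped.startswith("/protein_id="):
            current[1] = stripped.split("=", 1)[1].strip('"')
    return [tuple(item) for item in found]


def efetch_fasta_url(protein_id):
    """Build (but do not fetch) an EFetch URL for one protein record."""
    query = {"db": "protein", "id": protein_id,
             "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(query)


# Made-up feature table; coordinates and accessions are placeholders.
SAMPLE = "\n".join([
    "     CDS             1..900",
    '                     /protein_id="XX_000001.1"',
    "     mat_peptide     1..300",
    '                     /protein_id="XX_000002.1"',
    "     mat_peptide     301..600",
    '                     /product="example"',
])

for location, protein_id in mat_peptide_protein_ids(SAMPLE):
    if protein_id is None:
        print(location, "has no protein_id; fall back to local carving")
    else:
        print(location, efetch_fasta_url(protein_id))
```

The second mat_peptide in the sample has no protein ID, which is the case where only the local options (1) and (2) remain.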
Chris Larsen
2009-11-12 17:22:26 UTC
Permalink
All,

This is a short follow-up on the prior thread regarding computing
mature peptide sequences for viruses. The topic has gone underwater
for the time being while we solve some problems with source data. The
Biopython effort and contributors on this board have given good
guidance, and we now have scripts that function (thanks mostly to
pcock). However, the source data on which everything relies is
suspect:

     mat_peptide     15118..16914   <===
                     /product="nsp13"
                     /note="helicase"

I can tell you the virus community does not want to rely heavily on
those position numbers. Furthermore, we have found fewer complete
source genomes for viruses than for bacteria, and more virus-to-virus
variation in the data fields annotated in the GBK file (Gene, CDS,
ORF, Protein, Polyprotein, mat_peptide, db_xref). In fact, the
community will have to come together significantly on how these
molecules are defined in public repositories before a mature scripting
effort becomes reliable, public, and well received. Because of the
variation in viruses, it's not even clear at this point what a 'gene'
is. I will let you know how we proceed when more sequence data has
been fully analyzed, and we can think about making any Perl-based
solution a new viral protein module.
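
Since the position numbers are the weak point, a cheap first screen is to test each location for internal consistency before trusting it. The sketch below is an illustration in plain Python, not the scripts mentioned above: the function name and the genome length of 30000 are made up, and it only understands simple `start..end` locations, so anything written as a `join(...)` is flagged rather than interpreted.

```python
import re

SIMPLE_LOCATION = re.compile(r"(\d+)\.\.(\d+)")


def location_problems(location, genome_length):
    """Return a list of reasons to distrust a simple 'start..end' location."""
    match = SIMPLE_LOCATION.fullmatch(location.strip())
    if match is None:
        return ["not a simple start..end location"]
    start, end = int(match.group(1)), int(match.group(2))
    problems = []
    if start < 1 or start > end:
        problems.append("start/end out of range or order")
    if end > genome_length:
        problems.append("runs past the end of the genome")
    if (end - start + 1) % 3 != 0:
        problems.append("length is not a whole number of codons")
    return problems


# 30000 is a made-up genome length, used only to exercise the checks.
print(location_problems("15118..16914", 30000))    # 1797 nt = 599 codons
print(location_problems("15118..16915", 30000))    # one nucleotide too long
print(location_problems("join(1..5,5..9)", 30000))
```

Passing these checks does not show that a location is biologically right; it only weeds out entries that cannot be right as written.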

Thanks,

Chris
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Chris Larsen
2009-10-27 19:47:27 UTC
Permalink
My bad...thanks everyone.

An example of a virus that does this is in order so we don't have to
do this in our heads. There's no slippage in this isolate per se;
however, the main problem of generating component *.faa from the
polyprotein still exists.

Chris
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
You may recall some cruise where half the ship threw up. YAY
norovirus. Let's take that as a useful test problem for any solutions.
No mat_peptide sequence is given. How to get that?
Chris Larsen
2010-01-18 17:42:13 UTC
Permalink
Bhakti, (and Chris, Mark)--

Yes there is some perl available to parse reciprocal best blast hits.

Mark's referenced / archived post was mine; we were looking to do what
you wanted. Here we proceed with the thread.

We ended up implementing OrthoMCL 1.4 as Chris F pointed to, and then
made a simple perl parser that would take the raw OrthoMCL output, do
splits, and spit out a delimited table of all the orthologs in a
group, for say Mycobacterium Genus, so you could stuff it into DBLoader.

The link to the script, SOP, and method is at:
http://www.biohealthbase.org/brcDocs/documents/BHB_ORTHOLOG_SOP.pdf

Giving e.g.:

Francisella 1 110321310
Francisella 1 110321361
Francisella 1 56707275
Francisella 1 56707366
Francisella 1 56707462

Five members of Ortholog Group 1, with just their gi number. And you
can see the results of that parsing, supported by a database, being
used to load BioHealthbase with all the reciprocal best blast hits
plus other OrthoMCL parsing, for mycobacterial PolA at:

http://www.biohealthbase.org/brc/details.do?locus=MAV_3155&decorator=mycobacterium
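
For readers without the Perl script to hand, the "do splits, spit out a delimited table" step might look like the Python sketch below. It is not the script from the SOP. It assumes the OrthoMCL 1.4 group-line layout `ORTHOMCL<n>(<g> genes,<t> taxa): gene(taxon) gene(taxon) ...`; the taxon codes `FtuA` and `FtuB` in the sample line are invented, and only the gi numbers come from the example table above.

```python
import re

# One group per line: "ORTHOMCL1(5 genes,2 taxa):  gene(taxon) gene(taxon) ..."
GROUP_LINE = re.compile(r"^ORTHOMCL(\d+)\(\d+ genes,\s*\d+ taxa\):\s*(.*)$")
MEMBER = re.compile(r"(\S+?)\(([^)]+)\)")


def orthomcl_to_rows(lines, label):
    """Flatten OrthoMCL group lines into (label, group number, gene id) rows."""
    rows = []
    for line in lines:
        match = GROUP_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a group line
        group = int(match.group(1))
        for gene, _taxon in MEMBER.findall(match.group(2)):
            rows.append((label, group, gene))
    return rows


# Sample line; the taxon codes are invented for the example.
sample = [
    "ORTHOMCL1(5 genes,2 taxa):\t 110321310(FtuA) 110321361(FtuA) "
    "56707275(FtuB) 56707366(FtuB) 56707462(FtuB)",
]

for row in orthomcl_to_rows(sample, "Francisella"):
    print("\t".join(str(field) for field in row))
```

Each printed row is tab-delimited, one gene per line, which is the shape a bulk loader generally wants.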

See? Pretty? We were just interested in making ortholog groups on the
basis of paralog-conscious reciprocal blast stuff. Like you. This
package and doc I've made does what you want, I think, as long as you
stay in prokaryotes. But--careful...garbage in, garbage out. We
started with clean Genuses. (. o O Genii?). You'll get more junky HUGE
and TINY ortholog groups if you put in different Orders of microbes.
It's taxa sensitive. OrthoMCL author David Roos is great at it though
and designed it with higher unicellular euks in mind too...comb the docs
for that; sorry, I was doing bacterial work at the time and can't guide
you if that's what you want. If you end up installing OrthoMCL 1.4, you
can pipe the output to this method and get out usable stuff.

Hope it works for you.

Cheers,

Chris L
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Chris Larsen
2009-10-27 16:33:01 UTC
Permalink
All,

I am attempting to find some solutions to a DB loading problem we are
encountering in viruses. It is multifold:

Some viruses churn out a polyprotein rather than individual peptides;
further, they also slip the ribosome, so a source nucleotide is used
more than once in translation (the ribosome halts, backs up one
nucleotide, and continues in a new frame); and finally we have
post-translational processing into mature peptides. The main thing is
that each mature peptide is contained as a subset of the whole parent
polyprotein, but its sequence is not provided separately in the GBK
file for each mat_peptide. We have to get that in order to run
algorithms on the relevant processed proteins. Therefore we cannot
directly load into GUS, but rather have to choose how to get the
mat_peptide sequence. Actually I think the viruses know that, and are
just messing with us out of spite, since we have iPods and they don't.
Anyway... from anyone who has encountered this I seek guidance.

We have as choices:

1. Get the locations of mature peptide children in /Protein/
carve the mat_peptide sequence out of the whole polyprotein translation
check that the mat_peptide is in fact an identical subset of the
translated protein
load that

OR

2. Use the locations of starts and stops in /Nucleotide/
translate that, using the slippage information
get mature peptides that line up exactly to the parent polyprotein
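
To make the two choices concrete, here is a small plain-Python sketch (not BioPerl, and not an existing library API) that does both and requires them to agree. Everything in it is an assumption for illustration: the toy genome, the coordinates, and the function names. The slippage is modelled the way GenBank writes it, as a `join(a..b,b..c)` in which the nucleotide at `b` is read twice, and the sketch assumes each mature peptide lies within one segment of that join.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate(nt):
    """Translate an in-frame nucleotide string; unknown codons become 'X'."""
    nt = nt.upper()
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def spliced_cds(genome, segments):
    """Concatenate 1-based inclusive (start, end) segments of a join()."""
    return "".join(genome[s - 1:e] for s, e in segments)


def cds_offset(segments, pos):
    """Map a genomic position to a 0-based, codon-aligned CDS offset.

    At a -1 slip site one nucleotide belongs to two segments, so the first
    segment that gives a codon-aligned offset is the one used.
    """
    consumed = 0
    for s, e in segments:
        if s <= pos <= e and (consumed + pos - s) % 3 == 0:
            return consumed + pos - s
        consumed += e - s + 1
    raise ValueError(f"position {pos} is not codon-aligned in this CDS")


def carve_mat_peptide(genome, segments, start, end):
    """Return the mature peptide for genomic coordinates start..end."""
    polyprotein = translate(spliced_cds(genome, segments))
    first_aa = cds_offset(segments, start) // 3
    n_aa = (end - start + 1) // 3
    from_protein = polyprotein[first_aa:first_aa + n_aa]   # choice 1
    from_nucleotide = translate(genome[start - 1:end])     # choice 2
    if from_protein != from_nucleotide:
        raise ValueError("mat_peptide does not match the parent polyprotein")
    return from_protein


# Toy genome: CDS = join(3..11,11..19); the T at position 11 is read twice.
genome = "GGATGAAATTTGGCCCTAACC"
cds = [(3, 11), (11, 19)]
print(carve_mat_peptide(genome, cds, 6, 11))    # peptide before the slip site
print(carve_mat_peptide(genome, cds, 11, 16))   # peptide starting at the slip
```

A peptide that spans the slip site would need its own `join` location and is deliberately left out of this sketch.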

If you know of BioPerl sequence handling support for this, I would
love to hear more. Clearly this is a nonstandard thingamabob.

Stupid viruses

Chris
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Peter
2009-10-27 17:29:08 UTC
Permalink
Post by Chris Larsen
All,
I am attempting to find some solutions to a DB loading problem we are
Some viruses churn out a polyprotein rather than individual peptides;
further they also slip the ribosome, so a source nucleotide is used more
than once in translation (ribosome halts, backs up one nucleotide, and
continues in a new frame); and finally we have post translational processing
into mature peptides. The main thing is that the mature peptide is contained
as a subset of the whole parent polyprotein, but is not provided as a single
file in GBK for each mat_peptide CDS. We have to get that in order to run
algorithms on the relevant processed proteins. Therefore we cannot directly
load into GUS, but rather have to choose how to get the mat_peptide
sequence. Actually I think the viruses know that, and are just messing with
us out of spite, since we have iPods and they don't. Anyway.. from anyone who
has encountered this I seek guidance.
1. Get the locations of mature peptide children in /Protein/
carve the mat_peptide sequence out of the whole polyprotein translation
check that the mat_peptide is in fact an identical subset of the translated
protein
load that
OR
2. Use the locations of starts and stops in /Nucleotide/
translate that, using the slippage information
get mature peptides that line up exactly to the parent polyprotein
If you know of BioPerl sequence handling support for this, I would love to
hear more. Clearly this is a nonstandard thingamabob.
Stupid viruses
Chris
Cool viruses :)

Do you have some specific examples from GenBank? I'm starting
to deal with virus annotation in my work, so this is of interest.

Peter

P.S. As you might guess from my email address, I'm actually more
interested in Biopython than BioPerl, but the same algorithmic
approach could be tested in either.
Chris Larsen
2009-10-27 19:47:27 UTC
Permalink
My bad...thanks everyone.

An example of a virus that does this is in order so we don't have to
do this in our heads. There's no slippage in this isolate per se,
however the main problem of generating component *.faa from
polyprotein still exists.

Chris
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
You may recall some cruise where half the ship threw up. YAY
norovirus. Lets take that as a useful problem to address solutions to.
No mat_peptide sequence is given. How to get that?
Chris Larsen
2009-10-27 16:33:01 UTC
Permalink
All,

I am attempting to find some solutions to a DB loading problem we are
encountering in viruses. It is multifold:

Some viruses churn out a polyprotein rather than individual peptides;
further they also slip the ribosome, so a source nucleotide is used
more than once in translation (ribosome halts, backs up one
nucleotide, and continues in a new frame); and finally we have post
translational processing into mature peptides. The main thing is that
the mature peptide is contained a a subset of the whole parent
polyprotein, but is not provided as a single file in GBK for each
mat_peptide CDS. We have to get that in order to run algorithms on the
relevant processed proteins. Therefore we cannot directly load into
GUS, but rather have to choose how to get the mat_peptide sequence.
Actually I think the viruses know that, and are just messing with us
out of spite, since we have iPods and they dont. Anyway.. from anyone
who has encountered this I seek guidance.

We have as choices:

1. Get the locations of mature peptide children in /Protein/
carve the mat_peptide sequence out of the whole polyprotein translation
check that the mat_peptide is infact an identical subset of the
translated protein
load that

OR

2. Use the locations of starts and stops in /Nucleotide/
translate that, using the slippage information
get mature peptides that line up exactly to the parent polyprotein

If you know of BioPerl sequence handling support for this, I would
love to hear more. Clearly this is a nonstandard thingamabob.

Stupid viruses

Chris
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Peter
2009-10-27 17:29:08 UTC
Permalink
Post by Chris Larsen
All,
I am attempting to find some solutions to a DB loading problem we are
Some viruses churn out a polyprotein rather than individual peptides;
further they also slip the ribosome, so a source nucleotide is used more
than once ?in translation (ribosome halts, backs up one nucleotide, and
continues in a new frame); and finally we have post translational processing
into mature peptides. The main thing is that the mature peptide is contained
a a subset of the whole parent polyprotein, but is not provided as a single
file in GBK for each mat_peptide CDS. We have to get that in order to run
algorithms on the relevant processed proteins. Therefore we cannot directly
load into GUS, but rather have to choose how to get the mat_peptide
sequence. Actually I think the viruses know that, and are just messing with
us out of spite, since we have iPods and they dont. Anyway.. from anyone who
has encountered this I seek guidance.
1. Get the locations of mature peptide children in /Protein/
carve the mat_peptide sequence out of the whole polyprotein translation
check that the mat_peptide is infact an identical subset of the translated
protein
load that
OR
2. Use the locations of starts and stops in /Nucleotide/
translate that, using the slippage information
get mature peptides that line up exactly to the parent polyprotein
If you know of BioPerl sequence handling support for this, I would love to
hear more. Clearly this is a nonstandard thingamabob.
Stupid viruses
Chris
Cool viruses :)

Do you have some specific examples from GenBank? I'm starting
to deal with virus annotation in my work, so this is of interest.

Peter

P.S. As you might guess from my email address, I'm actually more
interested in Biopython than BioPerl, but the same algorithmic
approach could be tested in either.
Chris Larsen
2009-10-27 19:47:27 UTC
Permalink
My bad...thanks everyone.

An example of a virus that does this is in order so we don't have to
do this in our heads. There's no slippage in this isolate per se,
however the main problem of generating component *.faa from
polyprotein still exists.

Chris
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
You may recall some cruise where half the ship threw up. YAY
norovirus. Lets take that as a useful problem to address solutions to.
No mat_peptide sequence is given. How to get that?
Chris Larsen
2009-10-27 16:33:01 UTC
Permalink
All,

I am attempting to find some solutions to a DB loading problem we are
encountering in viruses. It is multifold:

Some viruses churn out a polyprotein rather than individual peptides;
further they also slip the ribosome, so a source nucleotide is used
more than once in translation (ribosome halts, backs up one
nucleotide, and continues in a new frame); and finally we have post
translational processing into mature peptides. The main thing is that
the mature peptide is contained a a subset of the whole parent
polyprotein, but is not provided as a single file in GBK for each
mat_peptide CDS. We have to get that in order to run algorithms on the
relevant processed proteins. Therefore we cannot directly load into
GUS, but rather have to choose how to get the mat_peptide sequence.
Actually I think the viruses know that, and are just messing with us
out of spite, since we have iPods and they dont. Anyway.. from anyone
who has encountered this I seek guidance.

We have as choices:

1. Get the locations of mature peptide children in /Protein/
carve the mat_peptide sequence out of the whole polyprotein translation
check that the mat_peptide is infact an identical subset of the
translated protein
load that

OR

2. Use the locations of starts and stops in /Nucleotide/
translate that, using the slippage information
get mature peptides that line up exactly to the parent polyprotein

If you know of BioPerl sequence handling support for this, I would
love to hear more. Clearly this is a nonstandard thingamabob.

Stupid viruses

Chris
--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
Peter
2009-10-27 17:29:08 UTC
Permalink
Post by Chris Larsen
All,
I am attempting to find some solutions to a DB loading problem we are
Some viruses churn out a polyprotein rather than individual peptides;
further they also slip the ribosome, so a source nucleotide is used more
than once ?in translation (ribosome halts, backs up one nucleotide, and
continues in a new frame); and finally we have post translational processing
into mature peptides. The main thing is that the mature peptide is contained
a a subset of the whole parent polyprotein, but is not provided as a single
file in GBK for each mat_peptide CDS. We have to get that in order to run
algorithms on the relevant processed proteins. Therefore we cannot directly
load into GUS, but rather have to choose how to get the mat_peptide
sequence. Actually I think the viruses know that, and are just messing with
us out of spite, since we have iPods and they dont. Anyway.. from anyone who
has encountered this I seek guidance.
1. Get the locations of mature peptide children in /Protein/
carve the mat_peptide sequence out of the whole polyprotein translation
check that the mat_peptide is infact an identical subset of the translated
protein
load that
OR
2. Use the locations of starts and stops in /Nucleotide/
translate that, using the slippage information
get mature peptides that line up exactly to the parent polyprotein
If you know of BioPerl sequence handling support for this, I would love to
hear more. Clearly this is a nonstandard thingamabob.
Stupid viruses
Chris
Cool viruses :)

Do you have some specific examples from GenBank? I'm starting
to deal with virus annotation in my work, so this is of interest.

Peter

P.S. As you might guess from my email address, I'm actually more
interested in Biopython than BioPerl, but the same algorithmic
approach could be tested in either.
Chris Larsen
2009-10-27 19:47:27 UTC
Permalink
My bad...thanks everyone.

An example of a virus that does this is in order so we don't have to
do this in our heads. There's no slippage in this isolate per se;
however, the main problem of generating component *.faa files from the
polyprotein still exists.

Chris
http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
You may recall some cruise where half the ship threw up. YAY
norovirus. Let's take that as a useful problem to work out solutions for.
No mat_peptide sequence is given. How to get that?
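
One possible way to do option 1 (carving the peptide out of the parent translation) is sketched below in plain Python. It assumes a single, unspliced, forward-strand CDS with no slippage, and that the mat_peptide coordinates are 1-based inclusive nucleotide positions as in a GenBank feature table. The function names, the toy polyprotein, and the coordinates are hypothetical and are not taken from NC_001959.

```python
def carve_mat_peptide(polyprotein, cds_start, pep_start, pep_end):
    """Slice a mature peptide out of its parent polyprotein.

    All positions are 1-based inclusive nucleotide coordinates on the
    forward strand; the CDS is assumed to be one contiguous segment.
    """
    offset = pep_start - cds_start
    length = pep_end - pep_start + 1
    if offset < 0 or offset % 3 or length % 3:
        raise ValueError("mat_peptide is not in frame with the parent CDS")
    first = offset // 3
    peptide = polyprotein[first:first + length // 3]
    if len(peptide) != length // 3:
        raise ValueError("mat_peptide extends past the polyprotein")
    return peptide


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical example: the CDS starts at nucleotide 5 and the peptide spans 11..22.
toy_polyprotein = "MKTAYIAKQR"
toy_peptide = carve_mat_peptide(toy_polyprotein, 5, 11, 22)
print(to_fasta("toy_mat_peptide", toy_peptide), end="")
```

The in-frame check is the cheap part of the "identical subset" test; translating the peptide's own nucleotide range and comparing it with the carved slice would be the stricter version.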
Gowthaman Ramasamy
2009-12-14 19:16:32 UTC
Permalink
Hi All,
I have a list of GO terms and would like to pull the GO accessions for them.
I can easily do the reverse using get_term("GO::00000051").

But can someone tell me how to get the GO accession from a GO term name? For example, how do I retrieve the GO accession for "citrulline metabolic process"?


Thanks very much,
Gowtham
Bernd Web
2009-12-15 08:37:44 UTC
Permalink
Dear Gowthaman,

A non-BioPerl solution: the Ontology Lookup service at EBI. It also
provides a web service interface.

http://www.ebi.ac.uk/ontology-lookup/

citrulline metabolic process has to be selected from the pull-down
list in the interactive page. This will return the ID (GO:0000052) and
additional info:

definition: The chemical reactions and pathways involving citrulline,
  N5-carbamoyl-L-ornithine, an alpha amino acid not found in proteins.
preferred name: citrulline metabolic process
exact synonym: citrulline metabolism
subset: Prokaryotic GO subset
xref_definition: ISBN:209853 "Oxford Dictionary of Biochemistry and
  Molecular Biology"

The webservice is described at
http://www.ebi.ac.uk/ontology-lookup/WSDLDocumentation.do
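
If a local copy of the ontology in OBO format is available, a name-to-accession lookup can also be built offline. The following is a minimal sketch in plain Python, not a BioPerl API. It assumes the usual OBO layout of [Term] stanzas with "id:" and "name:" lines; the function name and the inline sample text are made up for illustration.

```python
def name_to_accession(obo_text):
    """Build a lowercase term-name -> accession index from OBO-formatted text."""
    index = {}
    in_term = False
    term_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are indexed.
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[4:]
        elif in_term and line.startswith("name: ") and term_id:
            index[line[6:].lower()] = term_id
    return index


# Tiny inline sample standing in for a real ontology file.
sample_obo = """format-version: 1.2

[Term]
id: GO:0000052
name: citrulline metabolic process
namespace: biological_process

[Typedef]
id: part_of
name: part of
"""

go_index = name_to_accession(sample_obo)
print(go_index.get("citrulline metabolic process"))
```

Synonyms are ignored here, so a name such as "citrulline metabolism" would need the "synonym:" lines indexed as well.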


Regards,
Bernd


On Mon, Dec 14, 2009 at 8:16 PM, Gowthaman Ramasamy
Post by Gowthaman Ramasamy
Hi All,
I have a list of GO terms and would like to pull the GO accessions for them.
I can easily do the reverse using get_term("GO::00000051").
But can someone tell me how to get the GO accession from a GO term name? For example, how do I retrieve the GO accession for "citrulline metabolic process"?
Thanks very much,
Gowtham
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l