Discussion:
Output a subset of FASTA data from a single large file
(too old to reply)
Michael Oldham
2006-06-09 02:07:34 UTC
Permalink
Dear all,

I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create an
index of all 200,000 probes in the original fasta file using the following
script:

#!/usr/bin/perl -w

# script 1: create the index

use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
$inx->make_index(@ARGV);

I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!

Many thanks,
Mike O.




--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
Sendu Bala
2006-06-09 14:52:59 UTC
Permalink
Post by Michael Oldham
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes
[snip]
Post by Michael Oldham
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
[snip]
Post by Michael Oldham
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
I'd say you're on the right lines. Next, you should continue reading the
rest of the synopsis and description in the docs for Bio::Index::Fasta.

Perhaps it's not clear, but you don't need to say
$inx->make_index(@ARGV); if you've already provided -file to new() and
are only dealing with one file. You also can't supply -file to new() if
you want to change the id_parser (which you do, since you need to tell
it how to detect your probe set ID).

Having indexed your file you can then output the desired sequences, just
like the foreach loop suggested in the synopsis. (You could have that in
the same script.)


One thing I'm not clear on is why it needs -write_flag => 1. Why can't
it index a read-only database? Even when you set -write_flag allowing it
to work, it doesn't write anything...
simon andrews (BI)
2006-06-09 15:01:05 UTC
Permalink
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: 09 June 2006 03:08
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple task. I have a single large fasta file
containing about 200,000 probe sequences (from an Affymetrix
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
Unfortunately that's not Fasta format (which only has a single header
line starting with a '>'. I'd imagine that most programs which deal
with fasta which read that entry would see it as two sequences, the
first of which is empty.
What I would like to do is extract from this file a subset of
~130,800 probes (both the header and the sequence) and output
this subset into a new fasta file. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is the probe set
ID in the header listed above)
If you're only having to do this once then it should be fairly quick to
knock up a one off script to do this. Since you've only got 8000ish
probeset ids then you can probably just read those into a hash to start
with then parse through your big sequence file with something like;


#!perl
use warnings;
use strict;

my %probe_ids;

# Add real code here to populate your hash
$probe_ids{1138_at} = 1;
##########################################


open (IN,'your_affy_file.txt') or die "Can't read affy file: $!";

open (OUT,'>','probe_list.txt') or die "Can't write output: $!";

while (<IN>) {

if (/^>probe/) {
# This assumes there are always 3 lines per probe entry
if (exists $probe_ids{(split(/:/))[2]}) {
print OUT;
print OUT scalar <IN>;
print OUT scalar <IN>;
}
}
}
Cook, Malcolm
2006-06-09 14:58:22 UTC
Permalink
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
M Senthil Kumar
2006-06-09 22:21:11 UTC
Permalink
On Fri, 9 Jun 2006, simon andrews (BI) wrote:
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|> microarray), each of which looks like this:
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|

[snipped]

hi,

I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.

Senthil
Chris Fields
2006-06-09 17:59:18 UTC
Permalink
No; I saw the same thing here. It's not FASTA in the traditional sense:

http://www.bioperl.org/wiki/FASTA_sequence_format

though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.

It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of M Senthil Kumar
Sent: Friday, June 09, 2006 5:21 PM
To: simon andrews (BI)
Cc: bioperl-l at lists.open-bio.org; Michael Oldham
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a single large
file
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Senthil
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Sean Davis
2006-06-09 18:29:53 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.

Sean
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Chris Fields
2006-06-09 19:05:44 UTC
Permalink
There's information in the HOWTOs:

http://www.bioperl.org/wiki/HOWTO:Flat_databases

http://www.bioperl.org/wiki/HOWTO:OBDA

Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen, though I
thought it might pick up the second '>' and the full sequence there. Since
the sequence is tossed you'll have to prescreen your sequence input stream
by either concatenating the two '>' lines together or screening for the
relevant information you want to retain. You can try maybe getting this
info into Bio::Seq objects and writing to a Bio::SeqIO stream (to file or
file handle).

Once you have that set up, the HOWTO tells you how to set up custom or
secondary namespaces, so you can use a regex to parse out the information
for a primary or secondary keys:

http://www.bioperl.org/wiki/HOWTO:Flat_databases#Secondary_or_custom_namespa
ces

then you could select specific sequences this way (per the HOWTO):

$db->secondary_namespaces("GI");
my $acc_seq = $db->get_Seq_by_id("P84139");
my $gi_seq = $db->get_Seq_by_secondary("GI",443893);

or for multiple sequences (judging from the POD):

my $acc_seqio = $db->get_Stream_by_id(@ids);

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a single large
file
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create
an
index of all 200,000 probes in the original fasta file using the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Sendu Bala
2006-06-12 08:21:31 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen
[snip]

As you later discovered, that was an Outlook problem. Just to make this
thread relevant to bioperl, the bioperl solution is:

use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);

my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}

sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}

It works for me on the sample sequence given by the OP.
Sendu Bala
2006-06-12 08:21:31 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen
[snip]

As you later discovered, that was an Outlook problem. Just to make this
thread relevant to bioperl, the bioperl solution is:

use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);

my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}

sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}

It works for me on the sample sequence given by the OP.
Sendu Bala
2006-06-12 08:21:31 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen
[snip]

As you later discovered, that was an Outlook problem. Just to make this
thread relevant to bioperl, the bioperl solution is:

use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);

my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}

sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}

It works for me on the sample sequence given by the OP.
Sendu Bala
2006-06-12 08:21:31 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen
[snip]

As you later discovered, that was an Outlook problem. Just to make this
thread relevant to bioperl, the bioperl solution is:

use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);

my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}

sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}

It works for me on the sample sequence given by the OP.
Chris Fields
2006-06-09 19:49:51 UTC
Permalink
Post by Chris Fields
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success'
in
Post by Chris Fields
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the
database,
Post by Chris Fields
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.
Sean
Sorry. Just checked through another mail client and you're right. That's
what I get for trusting Mr. Gates (stupid Outlook). I have seen a few funky
FASTA derivations, so I thought that's what was going on here. My bad!

My point, though erroneous, was that the fasta format parser may not parse
this data correctly if he did have two description lines, but may not
indicate there are problems by throwing an exception. I demonstrated that
using Bio::SeqIO as an example (you get empty sequences). Bio::Index::Fasta
parses the file itself using this loop to index:

# Main indexing loop
while (<FASTA>) {
if (/^>/) {
# $begin is the position of the first character
after the '>'
my $begin = tell(FASTA) - length( $_ ) + 1;

foreach my $id (&$id_parser($_)) {
$self->add_record($id, $i, $begin);
}
}
}

Which simply looks for '>'. That's fine for a vast majority of sequences.
I thought it would be nice to have something that's a little more strenuous
in verifying the format rather than trusting it implicitly, maybe by using
an eval{} block to make sure the format is FASTA-like and looks like
DNA/RNA/protein.

Chris
Post by Chris Fields
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Michael Oldham
2006-06-10 01:39:45 UTC
Permalink
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.

--Mike O.


#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.dat.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;
my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs and
all values=1.

while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;


-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single large
file



I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date: 6/9/2006
Chris Fields
2006-06-11 04:32:04 UTC
Permalink
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.

If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.

Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat looking like
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Sendu Bala
2006-06-12 08:49:49 UTC
Permalink
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
Not sure why it would print nothing (are the ids in IDFILE the same case
as the ids in the fasta file, do they only contain word characters?),
but even if it did you would only be printing out the fasta headers and
not the sequences. Doing it the bioperl way gives you more flexibility
in the future; you may want to do something with the sequences after
printing them out, in which case do it in bioperl using Seq objects and
skip the intermediate step of printing them.
Chris Fields
2006-06-11 04:32:04 UTC
Permalink
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.

If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.

Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat looking like
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Sendu Bala
2006-06-12 08:49:49 UTC
Permalink
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
Not sure why it would print nothing (are the ids in IDFILE the same case
as the ids in the fasta file, do they only contain word characters?),
but even if it did you would only be printing out the fasta headers and
not the sequences. Doing it the bioperl way gives you more flexibility
in the future; you may want to do something with the sequences after
printing them out, in which case do it in bioperl using Seq objects and
skip the intermediate step of printing them.
Chris Fields
2006-06-11 04:32:04 UTC
Permalink
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.

If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.

Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat looking like
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Sendu Bala
2006-06-12 08:49:49 UTC
Permalink
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
Not sure why it would print nothing (are the ids in IDFILE the same case
as the ids in the fasta file, do they only contain word characters?),
but even if it did you would only be printing out the fasta headers and
not the sequences. Doing it the bioperl way gives you more flexibility
in the future; you may want to do something with the sequences after
printing them out, in which case do it in bioperl using Seq objects and
skip the intermediate step of printing them.
Chris Fields
2006-06-11 04:32:04 UTC
Permalink
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.

If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.

Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat looking like
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Sendu Bala
2006-06-12 08:49:49 UTC
Permalink
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
Not sure why it would print nothing (are the ids in IDFILE the same case
as the ids in the fasta file, do they only contain word characters?),
but even if it did you would only be printing out the fasta headers and
not the sequences. Doing it the bioperl way gives you more flexibility
in the future; you may want to do something with the sequences after
printing them out, in which case do it in bioperl using Seq objects and
skip the intermediate step of printing them.
Cook, Malcolm
2006-06-12 15:28:41 UTC
Permalink
Michael,

I don't think you can call perl's `print` on just a filehandle as you
are doing. This is probably your problem.

If you call `select OUT` after opeining it, print will print $_ to it.
And, every line in the fasta record whose header matches on of the IDS
will get printed, not just the fasta header lines. Read the code again
nothing that $idmatch is only getting reset when a correctly formatted
fasta header line is matched.

--Malcolm
-----Original Message-----
From: Chris Fields [mailto:cjfields at uiuc.edu]
Sent: Saturday, June 10, 2006 11:32 PM
To: Michael Oldham
Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
single large file
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.
If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting
closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do
fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat
looking like
Post by Michael Oldham
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000
probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Chris Fields
2006-06-12 20:35:10 UTC
Permalink
...
Post by Sendu Bala
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
...
Post by Sendu Bala
As you later discovered, that was an Outlook problem. Just to make this
Agreed (stupid Outlook). It might be much faster to use non-Bioperl-ish
ways, but it is easier to further manipulate sequences (convert format,
analyze sequences, etc) using Bioperl directly. I haven't used flat
databases much but it should move very quickly, even in an OO environment.

The one problem with the proposed non-bioperl method is, if you wanted
100,000 sequences (based on ID's) in a FASTA database file containing
200,000 sequences, all ID's would need to be stored (1) in an array (which
gulped the data from the ID file) and then map the ID's to (2) a hash;
that's may be a pretty big memory footprint depending on your system.

Sendu's BioPerl version indexes the FASTA file based on the ID, then (1)
reads the ID's in one at a time from the file, (2) retrieves the data, then
(3) prints it out. The advantage of this approach is that the built index
can be used in other bioperl scripts as well w/o having to rebuild it again,
so if you wanted a different set of ID's later on you can access the
database using the prebuilt index. More can be found in the
Bio::Index::Fasta POD.

You can also use the ideas and code in the HOWTO (Flat Databases) I
mentioned, which focuses on the Bio::DB::Flat system and ODBA. The
advantage of these is that you can use Sleepycat's Berkeley Database through
the Perl BerkeleyDB module (more functionality than DB_File) which is faster
than a standard flat database. In the HOWTO, specifically look under
'Secondary or custom namespaces' for ideas on how to use your ID as a
primary or secondary key.

Chris
Post by Sendu Bala
use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);
my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}
sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}
It works for me on the sample sequence given by the OP.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Lincoln Stein
2006-06-27 22:35:08 UTC
Permalink
Hi All,

This is rather late, but just for future reference on the mailing list, here
is how I would do the task using Bio::DB::Fasta.

Script 1: index the file for future use:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;

my $filename = shift; # name of file to index on command line
Bio::DB::Fasta->new($filename,-makeid=>\&make_my_id)
or die "Indexing failed";
print "Indexing succeeded!\n";
exit 0;

sub make_my_id {
my $description_line = shift;
$description_line =~ /(\d+_at)/ or die "malformed description line";
return $1;
}

Run this script once to create a reusable index of the file. The index will be
stored in the same directory as the FASTA file.

Script 2: extract the sequences using the IDs stored in a second file:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;
use Bio::SeqIO;
use IO::File;

my $indexed_fasta_file = shift;
my $probe_id_file = shift;

# open up the indexed fasta file
my $db = Bio::DB::Fasta->new($indexed_fasta_file) or die;
# open up a FASTA writer
my $out = Bio::SeqIO->new(-format=>'Fasta',-fh=>\*STDOUT) or die;
# open the probe id file
my $in = IO::File->new($probe_id_file) or die;

# do the work
while (my $id = <$in>) {
chomp $id;
my $seq = $db->get_Seq_by_id($id) or die;
$out->write_seq($seq);
}

exit 0;

Bio::Index::Fasta will work in almost exactly the same way. The only
difference is that the Bio::DB::Fasta will allow you to retrieve subsequences
efficiently.

Lincoln
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
Michael Oldham
2006-06-09 02:07:34 UTC
Permalink
Dear all,

I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create an
index of all 200,000 probes in the original fasta file using the following
script:

#!/usr/bin/perl -w

# script 1: create the index

use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
$inx->make_index(@ARGV);

I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!

Many thanks,
Mike O.




--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
Sendu Bala
2006-06-09 14:52:59 UTC
Permalink
Post by Michael Oldham
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes
[snip]
Post by Michael Oldham
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
[snip]
Post by Michael Oldham
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
I'd say you're on the right lines. Next, you should continue reading the
rest of the synopsis and description in the docs for Bio::Index::Fasta.

Perhaps it's not clear, but you don't need to say
$inx->make_index(@ARGV); if you've already provided -file to new() and
are only dealing with one file. You also can't supply -file to new() if
you want to change the id_parser (which you do, since you need to tell
it how to detect your probe set ID).

Having indexed your file you can then output the desired sequences, just
like the foreach loop suggested in the synopsis. (You could have that in
the same script.)


One thing I'm not clear on is why it needs -write_flag => 1. Why can't
it index a read-only database? Even when you set -write_flag allowing it
to work, it doesn't write anything...
simon andrews (BI)
2006-06-09 15:01:05 UTC
Permalink
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: 09 June 2006 03:08
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple task. I have a single large fasta file
containing about 200,000 probe sequences (from an Affymetrix
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
Unfortunately that's not Fasta format (which only has a single header
line starting with a '>'. I'd imagine that most programs which deal
with fasta which read that entry would see it as two sequences, the
first of which is empty.
What I would like to do is extract from this file a subset of
~130,800 probes (both the header and the sequence) and output
this subset into a new fasta file. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is the probe set
ID in the header listed above)
If you're only having to do this once then it should be fairly quick to
knock up a one off script to do this. Since you've only got 8000ish
probeset ids then you can probably just read those into a hash to start
with then parse through your big sequence file with something like;


#!perl
use warnings;
use strict;

my %probe_ids;

# Add real code here to populate your hash
$probe_ids{1138_at} = 1;
##########################################


open (IN,'your_affy_file.txt') or die "Can't read affy file: $!";

open (OUT,'>','probe_list.txt') or die "Can't write output: $!";

while (<IN>) {

if (/^>probe/) {
# This assumes there are always 3 lines per probe entry
if (exists $probe_ids{(split(/:/))[2]}) {
print OUT;
print OUT scalar <IN>;
print OUT scalar <IN>;
}
}
}
Cook, Malcolm
2006-06-09 14:58:22 UTC
Permalink
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
M Senthil Kumar
2006-06-09 22:21:11 UTC
Permalink
On Fri, 9 Jun 2006, simon andrews (BI) wrote:
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|> microarray), each of which looks like this:
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|

[snipped]

hi,

I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.

Senthil
Chris Fields
2006-06-09 17:59:18 UTC
Permalink
No; I saw the same thing here. It's not FASTA in the traditional sense:

http://www.bioperl.org/wiki/FASTA_sequence_format

though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.

It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of M Senthil Kumar
Sent: Friday, June 09, 2006 5:21 PM
To: simon andrews (BI)
Cc: bioperl-l at lists.open-bio.org; Michael Oldham
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a single large
file
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Senthil
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Sean Davis
2006-06-09 18:29:53 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.

Sean
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Chris Fields
2006-06-09 19:05:44 UTC
Permalink
There's information in the HOWTOs:

http://www.bioperl.org/wiki/HOWTO:Flat_databases

http://www.bioperl.org/wiki/HOWTO:OBDA

Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen, though I
thought it might pick up the second '>' and the full sequence there. Since
the sequence is tossed you'll have to prescreen your sequence input stream
by either concatenating the two '>' lines together or screening for the
relevant information you want to retain. You can try maybe getting this
info into Bio::Seq objects and writing to a Bio::SeqIO stream (to file or
file handle).

Once you have that set up, the HOWTO tells you how to set up custom or
secondary namespaces, so you can use a regex to parse out the information
for a primary or secondary keys:

http://www.bioperl.org/wiki/HOWTO:Flat_databases#Secondary_or_custom_namespa
ces

then you could select specific sequences this way (per the HOWTO):

$db->secondary_namespaces("GI");
my $acc_seq = $db->get_Seq_by_id("P84139");
my $gi_seq = $db->get_Seq_by_secondary("GI",443893);

or for multiple sequences (judging from the POD):

my $acc_seqio = $db->get_Stream_by_id(@ids);

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a single large
file
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create
an
index of all 200,000 probes in the original fasta file using the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Chris Fields
2006-06-09 19:49:51 UTC
Permalink
Post by Chris Fields
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success'
in
Post by Chris Fields
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the
database,
Post by Chris Fields
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.
Sean
Sorry. Just checked through another mail client and you're right. That's
what I get for trusting Mr. Gates (stupid Outlook). I have seen a few funky
FASTA derivations, so I thought that's what was going on here. My bad!

My point, though erroneous, was that the fasta format parser may not parse
this data correctly if he did have two description lines, but may not
indicate there are problems by throwing an exception. I demonstrated that
using Bio::SeqIO as an example (you get empty sequences). Bio::Index::Fasta
parses the file itself using this loop to index:

# Main indexing loop
while (<FASTA>) {
if (/^>/) {
# $begin is the position of the first character
after the '>'
my $begin = tell(FASTA) - length( $_ ) + 1;

foreach my $id (&$id_parser($_)) {
$self->add_record($id, $i, $begin);
}
}
}

Which simply looks for '>'. That's fine for a vast majority of sequences.
I thought it would be nice to have something that's a little more strenuous
in verifying the format rather than trusting it implicitly, maybe by using
an eval{} block to make sure the format is FASTA-like and looks like
DNA/RNA/protein.

Chris
Post by Chris Fields
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Michael Oldham
2006-06-10 01:39:45 UTC
Permalink
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.

--Mike O.


#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.dat.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;
my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs and
all values=1.

while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;


-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single large
file



I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date: 6/9/2006
Cook, Malcolm
2006-06-12 15:28:41 UTC
Permalink
Michael,

I don't think you can call perl's `print` on just a filehandle as you
are doing. This is probably your problem.

If you call `select OUT` after opeining it, print will print $_ to it.
And, every line in the fasta record whose header matches on of the IDS
will get printed, not just the fasta header lines. Read the code again
nothing that $idmatch is only getting reset when a correctly formatted
fasta header line is matched.

--Malcolm
-----Original Message-----
From: Chris Fields [mailto:cjfields at uiuc.edu]
Sent: Saturday, June 10, 2006 11:32 PM
To: Michael Oldham
Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
single large file
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.
If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting
closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do
fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat
looking like
Post by Michael Oldham
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000
probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Chris Fields
2006-06-12 20:35:10 UTC
Permalink
...
Post by Sendu Bala
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
...
Post by Sendu Bala
As you later discovered, that was an Outlook problem. Just to make this
Agreed (stupid Outlook). It might be much faster to use non-Bioperl-ish
ways, but it is easier to further manipulate sequences (convert format,
analyze sequences, etc) using Bioperl directly. I haven't used flat
databases much but it should move very quickly, even in an OO environment.

The one problem with the proposed non-bioperl method is, if you wanted
100,000 sequences (based on ID's) in a FASTA database file containing
200,000 sequences, all ID's would need to be stored (1) in an array (which
gulped the data from the ID file) and then map the ID's to (2) a hash;
that's may be a pretty big memory footprint depending on your system.

Sendu's BioPerl version indexes the FASTA file based on the ID, then (1)
reads the ID's in one at a time from the file, (2) retrieves the data, then
(3) prints it out. The advantage of this approach is that the built index
can be used in other bioperl scripts as well w/o having to rebuild it again,
so if you wanted a different set of ID's later on you can access the
database using the prebuilt index. More can be found in the
Bio::Index::Fasta POD.

You can also use the ideas and code in the HOWTO (Flat Databases) I
mentioned, which focuses on the Bio::DB::Flat system and ODBA. The
advantage of these is that you can use Sleepycat's Berkeley Database through
the Perl BerkeleyDB module (more functionality than DB_File) which is faster
than a standard flat database. In the HOWTO, specifically look under
'Secondary or custom namespaces' for ideas on how to use your ID as a
primary or secondary key.

Chris
Post by Sendu Bala
use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);
my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}
sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}
It works for me on the sample sequence given by the OP.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Lincoln Stein
2006-06-27 22:35:08 UTC
Permalink
Hi All,

This is rather late, but just for future reference on the mailing list, here
is how I would do the task using Bio::DB::Fasta.

Script 1: index the file for future use:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;

my $filename = shift; # name of file to index on command line
Bio::DB::Fasta->new($filename,-makeid=>\&make_my_id)
or die "Indexing failed";
print "Indexing succeeded!\n";
exit 0;

sub make_my_id {
my $description_line = shift;
$description_line =~ /(\d+_at)/ or die "malformed description line";
return $1;
}

Run this script once to create a reusable index of the file. The index will be
stored in the same directory as the FASTA file.

Script 2: extract the sequences using the IDs stored in a second file:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;
use Bio::SeqIO;
use IO::File;

my $indexed_fasta_file = shift;
my $probe_id_file = shift;

# open up the indexed fasta file
my $db = Bio::DB::Fasta->new($indexed_fasta_file) or die;
# open up a FASTA writer
my $out = Bio::SeqIO->new(-format=>'Fasta',-fh=>\*STDOUT) or die;
# open the probe id file
my $in = IO::File->new($probe_id_file) or die;

# do the work
while (my $id = <$in>) {
chomp $id;
my $seq = $db->get_Seq_by_id($id) or die;
$out->write_seq($seq);
}

exit 0;

Bio::Index::Fasta will work in almost exactly the same way. The only
difference is that the Bio::DB::Fasta will allow you to retrieve subsequences
efficiently.

Lincoln
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
Michael Oldham
2006-06-09 02:07:34 UTC
Permalink
Dear all,

I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create an
index of all 200,000 probes in the original fasta file using the following
script:

#!/usr/bin/perl -w

# script 1: create the index

use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
$inx->make_index(@ARGV);

I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!

Many thanks,
Mike O.




--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
Sendu Bala
2006-06-09 14:52:59 UTC
Permalink
Post by Michael Oldham
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes
[snip]
Post by Michael Oldham
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
[snip]
Post by Michael Oldham
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
I'd say you're on the right lines. Next, you should continue reading the
rest of the synopsis and description in the docs for Bio::Index::Fasta.

Perhaps it's not clear, but you don't need to say
$inx->make_index(@ARGV); if you've already provided -file to new() and
are only dealing with one file. You also can't supply -file to new() if
you want to change the id_parser (which you do, since you need to tell
it how to detect your probe set ID).

Having indexed your file you can then output the desired sequences, just
like the foreach loop suggested in the synopsis. (You could have that in
the same script.)


One thing I'm not clear on is why it needs -write_flag => 1. Why can't
it index a read-only database? Even when you set -write_flag allowing it
to work, it doesn't write anything...
simon andrews (BI)
2006-06-09 15:01:05 UTC
Permalink
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: 09 June 2006 03:08
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple task. I have a single large fasta file
containing about 200,000 probe sequences (from an Affymetrix
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
Unfortunately that's not Fasta format (which only has a single header
line starting with a '>'. I'd imagine that most programs which deal
with fasta which read that entry would see it as two sequences, the
first of which is empty.
What I would like to do is extract from this file a subset of
~130,800 probes (both the header and the sequence) and output
this subset into a new fasta file. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is the probe set
ID in the header listed above)
If you're only having to do this once then it should be fairly quick to
knock up a one off script to do this. Since you've only got 8000ish
probeset ids then you can probably just read those into a hash to start
with then parse through your big sequence file with something like;


#!perl
use warnings;
use strict;

my %probe_ids;

# Add real code here to populate your hash
$probe_ids{1138_at} = 1;
##########################################


open (IN,'your_affy_file.txt') or die "Can't read affy file: $!";

open (OUT,'>','probe_list.txt') or die "Can't write output: $!";

while (<IN>) {

if (/^>probe/) {
# This assumes there are always 3 lines per probe entry
if (exists $probe_ids{(split(/:/))[2]}) {
print OUT;
print OUT scalar <IN>;
print OUT scalar <IN>;
}
}
}
Cook, Malcolm
2006-06-09 14:58:22 UTC
Permalink
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
M Senthil Kumar
2006-06-09 22:21:11 UTC
Permalink
On Fri, 9 Jun 2006, simon andrews (BI) wrote:
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|> microarray), each of which looks like this:
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|

[snipped]

hi,

I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.

Senthil
Chris Fields
2006-06-09 17:59:18 UTC
Permalink
No; I saw the same thing here. It's not FASTA in the traditional sense:

http://www.bioperl.org/wiki/FASTA_sequence_format

though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.

It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of M Senthil Kumar
Sent: Friday, June 09, 2006 5:21 PM
To: simon andrews (BI)
Cc: bioperl-l at lists.open-bio.org; Michael Oldham
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a single large
file
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Senthil
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Sean Davis
2006-06-09 18:29:53 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.

Sean
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Chris Fields
2006-06-09 19:05:44 UTC
Permalink
There's information in the HOWTOs:

http://www.bioperl.org/wiki/HOWTO:Flat_databases

http://www.bioperl.org/wiki/HOWTO:OBDA

Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen, though I
thought it might pick up the second '>' and the full sequence there. Since
the sequence is tossed you'll have to prescreen your sequence input stream
by either concatenating the two '>' lines together or screening for the
relevant information you want to retain. You can try maybe getting this
info into Bio::Seq objects and writing to a Bio::SeqIO stream (to file or
file handle).

Once you have that set up, the HOWTO tells you how to set up custom or
secondary namespaces, so you can use a regex to parse out the information
for a primary or secondary keys:

http://www.bioperl.org/wiki/HOWTO:Flat_databases#Secondary_or_custom_namespa
ces

then you could select specific sequences this way (per the HOWTO):

$db->secondary_namespaces("GI");
my $acc_seq = $db->get_Seq_by_id("P84139");
my $gi_seq = $db->get_Seq_by_secondary("GI",443893);

or for multiple sequences (judging from the POD):

my $acc_seqio = $db->get_Stream_by_id(@ids);

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a single large
file
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create
an
index of all 200,000 probes in the original fasta file using the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Chris Fields
2006-06-09 19:49:51 UTC
Permalink
Post by Chris Fields
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success'
in
Post by Chris Fields
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the
database,
Post by Chris Fields
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.
Sean
Sorry. Just checked through another mail client and you're right. That's
what I get for trusting Mr. Gates (stupid Outlook). I have seen a few funky
FASTA derivations, so I thought that's what was going on here. My bad!

My point, though erroneous, was that the fasta format parser may not parse
this data correctly if he did have two description lines, but may not
indicate there are problems by throwing an exception. I demonstrated that
using Bio::SeqIO as an example (you get empty sequences). Bio::Index::Fasta
parses the file itself using this loop to index:

# Main indexing loop
while (<FASTA>) {
if (/^>/) {
# $begin is the position of the first character
after the '>'
my $begin = tell(FASTA) - length( $_ ) + 1;

foreach my $id (&$id_parser($_)) {
$self->add_record($id, $i, $begin);
}
}
}

Which simply looks for '>'. That's fine for a vast majority of sequences.
I thought it would be nice to have something that's a little more strenuous
in verifying the format rather than trusting it implicitly, maybe by using
an eval{} block to make sure the format is FASTA-like and looks like
DNA/RNA/protein.

Chris
Post by Chris Fields
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Michael Oldham
2006-06-10 01:39:45 UTC
Permalink
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.

--Mike O.


#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.dat.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;
my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs and
all values=1.

while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;


-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single large
file



I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date: 6/9/2006
Cook, Malcolm
2006-06-12 15:28:41 UTC
Permalink
Michael,

I don't think you can call perl's `print` on just a filehandle as you
are doing. This is probably your problem.

If you call `select OUT` after opeining it, print will print $_ to it.
And, every line in the fasta record whose header matches on of the IDS
will get printed, not just the fasta header lines. Read the code again
nothing that $idmatch is only getting reset when a correctly formatted
fasta header line is matched.

--Malcolm
-----Original Message-----
From: Chris Fields [mailto:cjfields at uiuc.edu]
Sent: Saturday, June 10, 2006 11:32 PM
To: Michael Oldham
Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
single large file
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.
If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting
closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do
fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat
looking like
Post by Michael Oldham
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000
probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Chris Fields
2006-06-12 20:35:10 UTC
Permalink
...
Post by Sendu Bala
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
...
Post by Sendu Bala
As you later discovered, that was an Outlook problem. Just to make this
Agreed (stupid Outlook). It might be much faster to use non-Bioperl-ish
ways, but it is easier to further manipulate sequences (convert format,
analyze sequences, etc) using Bioperl directly. I haven't used flat
databases much but it should move very quickly, even in an OO environment.

The one problem with the proposed non-bioperl method is, if you wanted
100,000 sequences (based on ID's) in a FASTA database file containing
200,000 sequences, all ID's would need to be stored (1) in an array (which
gulped the data from the ID file) and then map the ID's to (2) a hash;
that's may be a pretty big memory footprint depending on your system.

Sendu's BioPerl version indexes the FASTA file based on the ID, then (1)
reads the ID's in one at a time from the file, (2) retrieves the data, then
(3) prints it out. The advantage of this approach is that the built index
can be used in other bioperl scripts as well w/o having to rebuild it again,
so if you wanted a different set of ID's later on you can access the
database using the prebuilt index. More can be found in the
Bio::Index::Fasta POD.

You can also use the ideas and code in the HOWTO (Flat Databases) I
mentioned, which focuses on the Bio::DB::Flat system and ODBA. The
advantage of these is that you can use Sleepycat's Berkeley Database through
the Perl BerkeleyDB module (more functionality than DB_File) which is faster
than a standard flat database. In the HOWTO, specifically look under
'Secondary or custom namespaces' for ideas on how to use your ID as a
primary or secondary key.

Chris
Post by Sendu Bala
use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);
my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}
sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}
It works for me on the sample sequence given by the OP.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Lincoln Stein
2006-06-27 22:35:08 UTC
Permalink
Hi All,

This is rather late, but just for future reference on the mailing list, here
is how I would do the task using Bio::DB::Fasta.

Script 1: index the file for future use:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;

my $filename = shift; # name of file to index on command line
Bio::DB::Fasta->new($filename,-makeid=>\&make_my_id)
or die "Indexing failed";
print "Indexing succeeded!\n";
exit 0;

sub make_my_id {
my $description_line = shift;
$description_line =~ /(\d+_at)/ or die "malformed description line";
return $1;
}

Run this script once to create a reusable index of the file. The index will be
stored in the same directory as the FASTA file.

Script 2: extract the sequences using the IDs stored in a second file:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;
use Bio::SeqIO;
use IO::File;

my $indexed_fasta_file = shift;
my $probe_id_file = shift;

# open up the indexed fasta file
my $db = Bio::DB::Fasta->new($indexed_fasta_file) or die;
# open up a FASTA writer
my $out = Bio::SeqIO->new(-format=>'Fasta',-fh=>\*STDOUT) or die;
# open the probe id file
my $in = IO::File->new($probe_id_file) or die;

# do the work
while (my $id = <$in>) {
chomp $id;
my $seq = $db->get_Seq_by_id($id) or die;
$out->write_seq($seq);
}

exit 0;

Bio::Index::Fasta will work in almost exactly the same way. The only
difference is that the Bio::DB::Fasta will allow you to retrieve subsequences
efficiently.

Lincoln
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
Michael Oldham
2006-06-09 02:07:34 UTC
Permalink
Dear all,

I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create an
index of all 200,000 probes in the original fasta file using the following
script:

#!/usr/bin/perl -w

# script 1: create the index

use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
$inx->make_index(@ARGV);

I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!

Many thanks,
Mike O.




--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
Sendu Bala
2006-06-09 14:52:59 UTC
Permalink
Post by Michael Oldham
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes
[snip]
Post by Michael Oldham
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
[snip]
Post by Michael Oldham
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
I'd say you're on the right lines. Next, you should continue reading the
rest of the synopsis and description in the docs for Bio::Index::Fasta.

Perhaps it's not clear, but you don't need to say
$inx->make_index(@ARGV); if you've already provided -file to new() and
are only dealing with one file. You also can't supply -file to new() if
you want to change the id_parser (which you do, since you need to tell
it how to detect your probe set ID).

Having indexed your file you can then output the desired sequences, just
like the foreach loop suggested in the synopsis. (You could have that in
the same script.)


One thing I'm not clear on is why it needs -write_flag => 1. Why can't
it index a read-only database? Even when you set -write_flag allowing it
to work, it doesn't write anything...
simon andrews (BI)
2006-06-09 15:01:05 UTC
Permalink
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: 09 June 2006 03:08
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple task. I have a single large fasta file
containing about 200,000 probe sequences (from an Affymetrix
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
Unfortunately that's not Fasta format (which only has a single header
line starting with a '>'. I'd imagine that most programs which deal
with fasta which read that entry would see it as two sequences, the
first of which is empty.
What I would like to do is extract from this file a subset of
~130,800 probes (both the header and the sequence) and output
this subset into a new fasta file. These 130,800 probes
correspond to 8,175 probe set IDs ("1138_at" is the probe set
ID in the header listed above)
If you're only having to do this once then it should be fairly quick to
knock up a one off script to do this. Since you've only got 8000ish
probeset ids then you can probably just read those into a hash to start
with then parse through your big sequence file with something like;


#!perl
use warnings;
use strict;

my %probe_ids;

# Add real code here to populate your hash
$probe_ids{1138_at} = 1;
##########################################


open (IN,'your_affy_file.txt') or die "Can't read affy file: $!";

open (OUT,'>','probe_list.txt') or die "Can't write output: $!";

while (<IN>) {

if (/^>probe/) {
# This assumes there are always 3 lines per probe entry
if (exists $probe_ids{(split(/:/))[2]}) {
print OUT;
print OUT scalar <IN>;
print OUT scalar <IN>;
}
}
}
Cook, Malcolm
2006-06-09 14:58:22 UTC
Permalink
I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
M Senthil Kumar
2006-06-09 22:21:11 UTC
Permalink
On Fri, 9 Jun 2006, simon andrews (BI) wrote:
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|> microarray), each of which looks like this:
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|

[snipped]

hi,

I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.

Senthil
Chris Fields
2006-06-09 17:59:18 UTC
Permalink
No; I saw the same thing here. It's not FASTA in the traditional sense:

http://www.bioperl.org/wiki/FASTA_sequence_format

though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.

It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of M Senthil Kumar
Sent: Friday, June 09, 2006 5:21 PM
To: simon andrews (BI)
Cc: bioperl-l at lists.open-bio.org; Michael Oldham
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a single large
file
|
|
|> -----Original Message-----
|> From: bioperl-l-bounces at lists.open-bio.org
|> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
|> Michael Oldham
|> Sent: 09 June 2006 03:08
|> To: bioperl-l at lists.open-bio.org
|> Subject: [Bioperl-l] Output a subset of FASTA data from a
|> single large file
|>
|> Dear all,
|>
|> I am a total Bioperl newbie struggling to accomplish a
|> conceptually simple task. I have a single large fasta file
|> containing about 200,000 probe sequences (from an Affymetrix
|>
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Senthil
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Sean Davis
2006-06-09 18:29:53 UTC
Permalink
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success' in
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the database,
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.

Sean
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
Chris Fields
2006-06-09 19:05:44 UTC
Permalink
There's information in the HOWTOs:

http://www.bioperl.org/wiki/HOWTO:Flat_databases

http://www.bioperl.org/wiki/HOWTO:OBDA

Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
i.e. an empty sequence, which is what I guessed might happen, though I
thought it might pick up the second '>' and the full sequence there. Since
the sequence is tossed you'll have to prescreen your sequence input stream
by either concatenating the two '>' lines together or screening for the
relevant information you want to retain. You can try maybe getting this
info into Bio::Seq objects and writing to a Bio::SeqIO stream (to file or
file handle).

Once you have that set up, the HOWTO tells you how to set up custom or
secondary namespaces, so you can use a regex to parse out the information
for a primary or secondary keys:

http://www.bioperl.org/wiki/HOWTO:Flat_databases#Secondary_or_custom_namespa
ces

then you could select specific sequences this way (per the HOWTO):

$db->secondary_namespaces("GI");
my $acc_seq = $db->get_Seq_by_id("P84139");
my $gi_seq = $db->get_Seq_by_secondary("GI",443893);

or for multiple sequences (judging from the POD):

my $acc_seqio = $db->get_Stream_by_id(@ids);

Chris
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
bounces at lists.open-bio.org] On Behalf Of Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a single large
file
Dear all,
I am a total Bioperl newbie struggling to accomplish a conceptually simple
task. I have a single large fasta file containing about 200,000 probe
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this subset into a
new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file. I *think* that I managed to create
an
index of all 200,000 probes in the original fasta file using the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Chris Fields
2006-06-09 19:49:51 UTC
Permalink
Post by Chris Fields
Post by Chris Fields
http://www.bioperl.org/wiki/FASTA_sequence_format
though he did get it to build a database successfully. Well, 'success'
in
Post by Chris Fields
the sense that no errors were thrown. I've learned the absence of error
messages does not necessarily mean that everything went as planned; it
depends on how much error handling has been added to the module by the
submitting author.
It's possible that the second annotation line was ignored completely. I
suppose it's also possible that two sequences are entered into the
database,
Post by Chris Fields
an empty sequence for the first '>' line and the full sequence for the
second. It's all dependent on how the parser handles this.
I think that Senthil was pointing out that even though >Antisense looks to
be on its own line, it isn't, but is simply a continutation of the FASTA
header. Judging from the context, that is the only interpretation that
makes sense.
Sean
Sorry. Just checked through another mail client and you're right. That's
what I get for trusting Mr. Gates (stupid Outlook). I have seen a few funky
FASTA derivations, so I thought that's what was going on here. My bad!

My point, though erroneous, was that the fasta format parser may not parse
this data correctly if he did have two description lines, but may not
indicate there are problems by throwing an exception. I demonstrated that
using Bio::SeqIO as an example (you get empty sequences). Bio::Index::Fasta
parses the file itself using this loop to index:

# Main indexing loop
while (<FASTA>) {
if (/^>/) {
# $begin is the position of the first character
after the '>'
my $begin = tell(FASTA) - length( $_ ) + 1;

foreach my $id (&$id_parser($_)) {
$self->add_record($id, $i, $begin);
}
}
}

Which simply looks for '>'. That's fine for a vast majority of sequences.
I thought it would be nice to have something that's a little more strenuous
in verifying the format rather than trusting it implicitly, maybe by using
an eval{} block to make sure the format is FASTA-like and looks like
DNA/RNA/protein.

Chris
Post by Chris Fields
Post by Chris Fields
Post by M Senthil Kumar
|> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
|> >Antisense;
|> TGGCTCCTGCTGAGGTCCCCTTTCC
|
|Unfortunately that's not Fasta format (which only has a single header
|line starting with a '>'. I'd imagine that most programs which deal
|with fasta which read that entry would see it as two sequences, the
|first of which is empty.
|
[snipped]
hi,
I think the file is in fasta format and probably you might have seen it
differently because of your mail transport agent.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Michael Oldham
2006-06-10 01:39:45 UTC
Permalink
Thanks to everyone for their helpful advice. I think I am getting closer,
but no cigar quite yet. The script below runs quickly with no errors--but
the output file is empty. It seems that the problem must lie somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more experienced
eye--but not to mine! Any suggestions? Thanks again for your help.

--Mike O.


#!/usr/bin/perl -w

use strict;

my $IDs = 'ID.dat.txt';

unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}

my $probes = 'HG_U95Av2_probe_fasta.txt';

unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}

open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";

my @ID = <IDFILE>;
chomp @ID;
my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs and
all values=1.

while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;


-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single large
file



I wouldn't bioperl for this, or create an index. Perl would do fine and
probably be faster.

Assuming your ids are one per line in a file named id.dat looking like
this

1138_at
1134_at
etc..

this should work:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
<idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch =
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa

good luck

--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of ~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date: 6/8/2006

--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date: 6/9/2006
Cook, Malcolm
2006-06-12 15:28:41 UTC
Permalink
Michael,

I don't think you can call perl's `print` on just a filehandle as you
are doing. This is probably your problem.

If you call `select OUT` after opeining it, print will print $_ to it.
And, every line in the fasta record whose header matches on of the IDS
will get printed, not just the fasta header lines. Read the code again
nothing that $idmatch is only getting reset when a correctly formatted
fasta header line is matched.

--Malcolm
-----Original Message-----
From: Chris Fields [mailto:cjfields at uiuc.edu]
Sent: Saturday, June 10, 2006 11:32 PM
To: Michael Oldham
Cc: Cook, Malcolm; bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
single large file
What happens if you just print $idmatch or $1 (i.e. check to see if
the regex matches anything)? If there is nothing printed then either
the regex isn't working as expected or there is something logically
wrong. The problem may be that the captured string must match the id
exactly, the id being the key to the %ID hash; any extra characters
picked up by the regex outside of your id key and you will not get
anything. Looking at Malcolm's regex it should work just fine, but
we only had one example sequence to try here.
If your while loop is set up like this won't it only print only the
matched description lines to the outfile (no sequence) even if there
is a match? Or is this what you wanted? If you want the sequence
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.
Chris
Post by Michael Oldham
Thanks to everyone for their helpful advice. I think I am getting
closer,
but no cigar quite yet. The script below runs quickly with no
errors--but
the output file is empty. It seems that the problem must lie
somewhere in
the 'while' loop, and I'm sure it's quite obvious to a more
experienced
eye--but not to mine! Any suggestions? Thanks again for your help.
--Mike O.
#!/usr/bin/perl -w
use strict;
my $IDs = 'ID.dat.txt';
unless (open(IDFILE, $IDs)) {
print "Could not open file $IDs!\n";
}
my $probes = 'HG_U95Av2_probe_fasta.txt';
unless (open(PROBES, $probes)) {
print "Could not open file $probes!\n";
}
open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
keys=PSIDs and
all values=1.
while (<PROBES>) {
my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
if ($idmatch){
print OUT;
}
}
exit;
-----Original Message-----
From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
Sent: Friday, June 09, 2006 7:58 AM
To: Michael Oldham; bioperl-l at lists.open-bio.org
Subject: RE: [Bioperl-l] Output a subset of FASTA data from a
single large
file
I wouldn't bioperl for this, or create an index. Perl would do
fine and
probably be faster.
Assuming your ids are one per line in a file named id.dat
looking like
Post by Michael Oldham
this
1138_at
1134_at
etc..
exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
mybigfile.fa
good luck
--Malcolm Cook
-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
Michael Oldham
Sent: Thursday, June 08, 2006 9:08 PM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Output a subset of FASTA data from a
single large file
Dear all,
I am a total Bioperl newbie struggling to accomplish a
conceptually simple
task. I have a single large fasta file containing about 200,000
probe
sequences (from an Affymetrix microarray), each of which looks
probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC
What I would like to do is extract from this file a subset of
~130,800
probes (both the header and the sequence) and output this
subset into a new
fasta file. These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I
have these
8,175 IDs listed in a separate file. I *think* that I managed
to create an
index of all 200,000 probes in the original fasta file using
the following
#!/usr/bin/perl -w
# script 1: create the index
use Bio::Index::Fasta;
use strict;
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
I'm not sure if this is the most sensible approach, and even
if it is, I'm
not sure what to do next. Any help would be greatly appreciated!
Many thanks,
Mike O.
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/8/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
No virus found in this incoming message.
Checked by AVG Free Edition.
6/8/2006
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
6/9/2006
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
Chris Fields
2006-06-12 20:35:10 UTC
Permalink
...
Post by Sendu Bala
Post by Chris Fields
http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
...
Post by Sendu Bala
As you later discovered, that was an Outlook problem. Just to make this
Agreed (stupid Outlook). It might be much faster to use non-Bioperl-ish
ways, but it is easier to further manipulate sequences (convert format,
analyze sequences, etc) using Bioperl directly. I haven't used flat
databases much but it should move very quickly, even in an OO environment.

The one problem with the proposed non-bioperl method is, if you wanted
100,000 sequences (based on ID's) in a FASTA database file containing
200,000 sequences, all ID's would need to be stored (1) in an array (which
gulped the data from the ID file) and then map the ID's to (2) a hash;
that's may be a pretty big memory footprint depending on your system.

Sendu's BioPerl version indexes the FASTA file based on the ID, then (1)
reads the ID's in one at a time from the file, (2) retrieves the data, then
(3) prints it out. The advantage of this approach is that the built index
can be used in other bioperl scripts as well w/o having to rebuild it again,
so if you wanted a different set of ID's later on you can access the
database using the prebuilt index. More can be found in the
Bio::Index::Fasta POD.

You can also use the ideas and code in the HOWTO (Flat Databases) I
mentioned, which focuses on the Bio::DB::Flat system and ODBA. The
advantage of these is that you can use Sleepycat's Berkeley Database through
the Perl BerkeleyDB module (more functionality than DB_File) which is faster
than a standard flat database. In the HOWTO, specifically look under
'Secondary or custom namespaces' for ideas on how to use your ID as a
primary or secondary key.

Chris
Post by Sendu Bala
use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);
my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
chomp;
my $seq = $inx->fetch($_);
$out->write_seq($seq);
}
sub get_id {
my $line = shift;
$line =~ /^>probe:\S+?:(\S+?):/;
$1;
}
It works for me on the sample sequence given by the OP.
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Lincoln Stein
2006-06-27 22:35:08 UTC
Permalink
Hi All,

This is rather late, but just for future reference on the mailing list, here
is how I would do the task using Bio::DB::Fasta.

Script 1: index the file for future use:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;

my $filename = shift; # name of file to index on command line
Bio::DB::Fasta->new($filename,-makeid=>\&make_my_id)
or die "Indexing failed";
print "Indexing succeeded!\n";
exit 0;

sub make_my_id {
my $description_line = shift;
$description_line =~ /(\d+_at)/ or die "malformed description line";
return $1;
}

Run this script once to create a reusable index of the file. The index will be
stored in the same directory as the FASTA file.

Script 2: extract the sequences using the IDs stored in a second file:

#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;
use Bio::SeqIO;
use IO::File;

my $indexed_fasta_file = shift;
my $probe_id_file = shift;

# open up the indexed fasta file
my $db = Bio::DB::Fasta->new($indexed_fasta_file) or die;
# open up a FASTA writer
my $out = Bio::SeqIO->new(-format=>'Fasta',-fh=>\*STDOUT) or die;
# open the probe id file
my $in = IO::File->new($probe_id_file) or die;

# do the work
while (my $id = <$in>) {
chomp $id;
my $seq = $db->get_Seq_by_id($id) or die;
$out->write_seq($seq);
}

exit 0;

Bio::Index::Fasta will work in almost exactly the same way. The only
difference is that the Bio::DB::Fasta will allow you to retrieve subsequences
efficiently.

Lincoln
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
Continue reading on narkive:
Loading...