Saturday, January 30, 2010

Filtering Blast hits by coverage using BioPerl

A couple of days ago I wrote about how I had to throw away a ton of data because I ran out of disk space for a large Blast analysis. One of the reasons I ran out of room was because I opted to use the XML output format over the simpler tabular output. The XML format provides much more information about each hit including the lengths of query and subject genes, which allows easy retrieval of the coverage in BioPerl using the "frac_aligned_query()" and "frac_aligned_hit()" functions. For example:

my $searchio = Bio::SearchIO->new(
-format => 'blastxml',
-file => $blast_file
);
while ( my $result = $searchio->next_result() ) {

#process the Bio::Search::Hit::GenericHit
while ( my $hit = $result->next_hit ) {

my $evalue = $hit->significance();
my $identity = $hit->frac_identical();

######Get amount of coverage for query and hit #######
my $query_coverage = $hit->frac_aligned_query();
my $hit_coverage = $hit->frac_aligned_hit();

###Filter based on evalue and coverage
if ( ( $query_coverage > $coverage_cutoff )
&& ( $hit_coverage > $coverage_cutoff )
&& ( $evalue < $evalue_cutoff ) ) { ##do something ##}
This is fairly simple, but if you have to use the tabular output of Blast (due to say file size limitations) the lengths of the genes are not included in the Blast output. Therefore, you have to retrieve these manually somehow and then either calculate coverage yourself (remembering to tile the HSPs for each hit) or tell BioPerl about the gene lengths so you can call the same functions. This isn't obvious or documented in BioPerl, so I had to hack away until I found out the solution. See comments for explanation.

my $searchio = Bio::SearchIO->new(
-format => 'blasttable',
-file => $blast_file
);

while ( my $result = $searchio->next_result() ) {
my $query_id = $result->query_name();

#get the length of the query sequence
#Note: you need to write this function yourself since the length is not in the blast output file.
my $query_len = get_gene_length($query_id);

#################
#This sets the query length for each hit by directly manipulating
#the objects hash (instead of through a function)
foreach my $hit(@{$result->{_hits}}){
$hit->{-query_len}=$query_len;
}
#################

#process the Bio::Search::Hit::GenericHit
while ( my $hit = $result->next_hit ) {
my $hit_id = $hit->name();
my $hit_len = get_gene_length($hit_id);

###Setting the hit length is much easier!!
$hit->length($hit_len);
#################

my $evalue = $hit->significance();
my $identity = $hit->frac_identical();
my $query_coverage = $hit->frac_aligned_query();
my $hit_coverage = $hit->frac_aligned_hit();

if ( ( $query_coverage > $coverage_cutoff )
&& ( $hit_coverage > $coverage_cutoff )
&& ( $evalue < $evalue_cutoff ) ) { ##do something ##}
Note that if you don't tell BioPerl the lengths your script will die with a "divide by zero" error.

Thursday, January 28, 2010

Large BLAST runs and output formats

I have used BLAST in many different forms and on many scales, from single gene analysis to large "all vs all" comparisons. This is a short story of how I decided to delete 164GB of Blast output.

I will save my reasoning for doing such a large Blast for another post. For now, all you have to know is that I am doing an "all vs all" Blast for 2,178,194 proteins. That is (2,178,194)^2 = 24,744,529,101,636 comparisons. Sure quite a few, but nothing that a large compute cluster can't handle (go big or go home is usually my motto).

I usually use the tabular output format for Blast (-m 8). However, one of the nice functions in BioPerl allows you to calculate the coverage of all hsps with respect to the query or subject sequence. BioPerl handles the tiling of the hsps which is annoying to have to code yourself. I often use this coverage metric to filter Blast hits downstream in my pipeline. So here comes the annoying thing. The tabular output of Blast does not include the start or end positions (or length) of the sequences in the Blast comparison. Therefore, to calculate coverage you need to go back to the original sequence and retrieve the length of the sequence. I know this is not a hard thing to do, but I am a lazy programmer and I like fewer steps whenever possible. Therefore, I decided to try out the Blast XML format (-m 7). A few test runs showed that the files were much larger (5X), but this format includes all information about the Blast run including the sequence coordinates. Therefore, I decided not to worry about space issues and launched my jobs. Bad decision.

Well 3 days later, I find out my quota is 300GB and since I already had 150GB from another experiment the blast output put me over. I can't easily tell which jobs completed normally, so I am faced with the decision to either write a script to figure out which jobs completed normally, or scrap all the data and re-run it the right way. I have opted to delete my 164GB of blast output and re-run it using the tabular format and I might even gzip the data on the fly to ensure this doesn't happen again.

Of course this isn't rocket science, but I thought I would tell my tale in case others are in similar circumstances.
Reblog this post [with Zemanta]

Thursday, January 21, 2010

MCE Remote Stops Working

Quick post in case it happens to me again. My MCE IR remote stopped working suddenly and after some Googling I found out that I needed to change the batteries. The weird part is that it also required the remote to be "reset". One way is to short the battery terminals, but that didn't work for me so I found another solution.

  1. Take the batteries out of the remote.
  2. Hold the power button down and then press every other button on the remote once.
  3. Replace the batteries.
  4. Pray to the Microsoft gods.
Hopefully this helps someone else.