Beta Science: Filtering Blast hits by coverage using BioPerl

Saturday, January 30, 2010

Filtering Blast hits by coverage using BioPerl

A couple of days ago I wrote about how I had to throw away a ton of data because I ran out of disk space for a large Blast analysis. One of the reasons I ran out of room was because I opted to use the XML output format over the simpler tabular output. The XML format provides much more information about each hit including the lengths of query and subject genes, which allows easy retrieval of the coverage in BioPerl using the "frac_aligned_query()" and "frac_aligned_hit()" functions. For example:

my $searchio = Bio::SearchIO->new(
-format => 'blastxml',
-file => $blast_file
);
while ( my $result = $searchio->next_result() ) {

#process the Bio::Search::Hit::GenericHit
while ( my $hit = $result->next_hit ) {

my $evalue = $hit->significance();
my $identity = $hit->frac_identical();

######Get amount of coverage for query and hit #######
my $query_coverage = $hit->frac_aligned_query();
my $hit_coverage = $hit->frac_aligned_hit();

###Filter based on evalue and coverage
if ( ( $query_coverage > $coverage_cutoff )
&& ( $hit_coverage > $coverage_cutoff )
&& ( $evalue < $evalue_cutoff ) ) { ##do something ##}

This is fairly simple, but if you have to use the tabular output of Blast (due to say file size limitations) the lengths of the genes are not included in the Blast output. Therefore, you have to retrieve these manually somehow and then either calculate coverage yourself (remembering to tile the HSPs for each hit) or tell BioPerl about the gene lengths so you can call the same functions. This isn't obvious or documented in BioPerl, so I had to hack away until I found out the solution. See comments for explanation.

my $searchio = Bio::SearchIO->new(
-format => 'blasttable',
-file => $blast_file
);

while ( my $result = $searchio->next_result() ) {
my $query_id = $result->query_name();

#get the length of the query sequence
#Note: you need to write this function yourself since the length is not in the blast output file.
my $query_len = get_gene_length($query_id);

#################
#This sets the query length for each hit by directly manipulating
#the objects hash (instead of through a function)
foreach my $hit(@{$result->{_hits}}){
$hit->{-query_len}=$query_len;
}
#################

#process the Bio::Search::Hit::GenericHit
while ( my $hit = $result->next_hit ) {
my $hit_id = $hit->name();
my $hit_len = get_gene_length($hit_id);

###Setting the hit length is much easier!!
$hit->length($hit_len);
#################

my $evalue = $hit->significance();
my $identity = $hit->frac_identical();
my $query_coverage = $hit->frac_aligned_query();
my $hit_coverage = $hit->frac_aligned_hit();

if ( ( $query_coverage > $coverage_cutoff )
&& ( $hit_coverage > $coverage_cutoff )
&& ( $evalue < $evalue_cutoff ) ) { ##do something ##}

Note that if you don't tell BioPerl the lengths your script will die with a "divide by zero" error.

4 comments:

Anonymous said...: i get this error when i try your script :

Undefined subroutine &main::get_gene_length called at test_xml_blast.pl (name of file) line 24, line 879.

why???
thanks.; March 11, 2010 at 12:27 AM
Morgan Langille said...: As I mentioned above the gene length is not included in the blast output when using tabular format. Therefore, you have to write this function yourself to retrieve the sequence length for a given sequence id. This can be done multiple ways and depends somewhat on how many sequences you have. One suggestion is to read in the original fasta file using bioperl and create a hash with the key being the sequence id and the element being the sequence length.; March 11, 2010 at 10:46 AM
Manju said...: Hey Pls help me..
I am very new in Bioperl..And i want to use blast report in my programming..But i dnt know how to use it...pls tell me how to use HSP,gaps.etc methods??/how to use them to extract valus from blast file..; August 30, 2011 at 4:01 AM
Anonymous said...: I have attempted hsp tilling learning from your blog. When I retrieve $hit->frac_identical('hit') for one of my hits(having 6 hsp), this values > 1. Is this suppose to be correct? I was expecting something <=1

Many thanks; February 11, 2013 at 10:47 AM