Thursday, January 28, 2010

Large BLAST runs and output formats

I have used BLAST in many different forms and on many scales, from single gene analysis to large "all vs all" comparisons. This is a short story of how I decided to delete 164GB of Blast output.

I will save my reasoning for doing such a large Blast for another post. For now, all you have to know is that I am doing an "all vs all" Blast for 2,178,194 proteins. That is (2,178,194)^2 = 24,744,529,101,636 comparisons. Sure quite a few, but nothing that a large compute cluster can't handle (go big or go home is usually my motto).

I usually use the tabular output format for Blast (-m 8). However, one of the nice functions in BioPerl allows you to calculate the coverage of all hsps with respect to the query or subject sequence. BioPerl handles the tiling of the hsps which is annoying to have to code yourself. I often use this coverage metric to filter Blast hits downstream in my pipeline. So here comes the annoying thing. The tabular output of Blast does not include the start or end positions (or length) of the sequences in the Blast comparison. Therefore, to calculate coverage you need to go back to the original sequence and retrieve the length of the sequence. I know this is not a hard thing to do, but I am a lazy programmer and I like fewer steps whenever possible. Therefore, I decided to try out the Blast XML format (-m 7). A few test runs showed that the files were much larger (5X), but this format includes all information about the Blast run including the sequence coordinates. Therefore, I decided not to worry about space issues and launched my jobs. Bad decision.

Well 3 days later, I find out my quota is 300GB and since I already had 150GB from another experiment the blast output put me over. I can't easily tell which jobs completed normally, so I am faced with the decision to either write a script to figure out which jobs completed normally, or scrap all the data and re-run it the right way. I have opted to delete my 164GB of blast output and re-run it using the tabular format and I might even gzip the data on the fly to ensure this doesn't happen again.

Of course this isn't rocket science, but I thought I would tell my tale in case others are in similar circumstances.
Reblog this post [with Zemanta]


Karthik said...

I usually use the -m7, to get the XML output. However, I have never had a quota issue/worried much about disk space :). I find that XML is parsed very easily using BioPython (XML is meant for parsing, right?).

Good luck with your BLAST.. I can't think of a better idea for you - perhaps split the query into bunches, so that you get multiple output files?

Morgan Langille said...

Yup the XML format is optimal for parsing and also because it contains all of the information including the alignments if needed. Just not suitable on very large blast runs.

The output files are split, but it still takes up the same amount of disk space.

Of course, if I had of gzipped the output on the fly I think I would have been alright.

Leon French said...

I heard recently that a big problem with doing analysis on the high throughput sequencing data is disk space. Seems most of those programs will silently run out of disk space and it takes awhile to get back to when it ran out.

Anonymous said...

use AWk or cut or anything to get the GIDS. "paste" the gids to the BLAST output. Now you will one extra field. In the BLAST+ there is blastdbcmd (-entry_batch) that can take list of GIDS. Use that to extract names, length of the source seq etc. And use join the two files.
NOTE: Above is just a "raw" answer as you need to use sort etc to make sure that join works. Over all, I do this to get length of the original seq, names of the genomes reads have hit etc.

Neil said...

What Karthik said about perhaps splitting the query into bunches, so that you get multiple output files sounds like a good idea though you indeed still have to consider the space as Morgan stated.

Anonymous said...

Hi Morgan,

How fast did your blast jobs run? I'm getting use to doing blast these days but I am overwhelmed by how long it takes. I recently have about 15000 sequences taking 9 days against uniprotkb on a machine with over 100GB and 4 threads using cynblastx.