Wednesday, February 24, 2010

Using Aspera instead of FTP to download from NCBI

If you often download large amounts of data from NCBI using their FTP site, you might be interested to know that NCBI has recently started using the commercial software Aspera to improve download transfer speeds. This was announced in their August newsletter and at first applied only to the Short Read Archive (SRA). However, I recently found out that they are now making all of their data available this way.

How to use it (web browser)
  1. Download and install the Aspera browser plugin software.
  2. Browse the Aspera NCBI archives.
  3. Click on the file or folder you want to download and choose a place to save it.
  4. The Aspera download manager should open (see below) and show the download progress.
How to use it (command line)
  1. The browser plugin also includes the command line program ascp (on Linux this is located at ~/.aspera/connect/bin).
  2. There are many options, but the standard invocation is:
ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/source_directory /destination_directory/

e.g.:
ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/genomes/Bacteria/all.faa.tar.gz ~/

Critique
  • A Windows machine with Firefox worked with no problems, and download speeds at my institution were much faster than with FTP (~0.5-4.0 Mbps vs. 50-300 kbps).
  • The browser plugin with Firefox on Linux would not work! The plugin seemed to load properly, but the Aspera download manager would not start. Update: this was caused by my installing the plugin as root, which created a permission error. The plugin is installed in your home directory and must not be installed as root.
  • Downloading with the command line in Linux was unreliable. This was a huge disappointment, as it was the primary method I was hoping to use. Files would start to download correctly with very fast transfer speeds (1-4 Mbps), but the connection would drop with the error: "Session Stop (Error: Connection lost in midst of data session)". Unfortunately, there was no way to resume the download, so each time I had to start over. On about the 8th try it downloaded the file (6889 MB) correctly. Update: see below.
Personal Opinion
Although I was excited to see NCBI trying to improve data transfer speeds, I was not very impressed with the Aspera solution. Hopefully, it will become more reliable in the future.
Of course, my personal solution would be for NCBI to embrace BitTorrent technology and make use of BioTorrents, but I will save that discussion for another day.


Update:
All ascp options are shown below (by typing ascp without arguments). However, I can't find any further documentation on these options. As noted in the comments below, -k2 is supposed to resume a download, but this didn't work for me when I tested it.
usage: ascp [-{ATdpqv}] [-{Q|QQ}] ...
[-l rate-limit[K|M|G|P(%)]] [-m minlimit[K|M|G|P(%)]]
[-M mgmt-port] [-u user-string] [-i private-key-file.ppk]
[-w{f|r} [-K probe-rate]] [-k {0|1|2|3}] [-Z datagram-size]
[-X rexmsg-size] [-g read-block-size[K|M]] [-G write-block-size[K|M]]
[-L log-dir] [-R remote-log-dir] [-S remote-cmd] [-e pre-post-cmd]
[-O udp-port] [-P ssh-port] [-C node-id:num-nodes]
[-o Option1=value1[,Option2=value2...] ]
[-E exclude-pattern1 -E exclude-pattern2...]
[-U priority] [-f config-file.conf] [-W token string]
[[user@]host1:]file1 ... [[user@]host2:]file2

-A: report version; -Q: adapt rate; -T: no encryption
-d: make destination directory; -p: preserve file timestamp
-q: no progress meter; -v: verbose; -L-: log to stderr
-o: SkipSpecialFiles=yes,RemoveAfterTransfer=yes,RemoveEmptyDirectories=yes,
PreCalculateJobSize={yes|no},Overwrite={always|never|diff|older},
FileManifest={none|text},FileManifestPath=filepath,
FileCrypt={encrypt|decrypt},RetryTimeout=secs

HTTP Fallback only options:
[-y 0/1] 1 = Allow HTTP fallback (default = 0)
[-j 0/1] 1 = Encode all HTTP transfers as JPEG files
[-Y filename] HTTPS key file name
[-I filename] HTTPS certificate file name
[-t port number] HTTP fallback server port #
[-x ]]

Update 2:
After spending an afternoon with Aspera Support, I have some answers to my connection and resume issues when using ascp. The problem was that I was not using the -l option to limit the speed at which ascp sends data. I thought this limit would only be relevant if 1) I did not want to use all of my available bandwidth, or 2) my computer hardware could not handle the bandwidth of the file transfer. Surprisingly, the reason for my disconnects was that NCBI was trying to send more data than my bandwidth allowed, which caused my connection to drop. I would have thought that ascp would look after these kinds of bandwidth differences, considering that every other data transfer protocol I know of can control its rate of data flow. If this is the case, it suggests that my connection may still be dropped if my available bandwidth falls for some reason (which happens often due to network fluctuations at a large institution), even if I set the limit appropriately. Hopefully, Aspera can make their data transfer method a little more robust in the future. I don't think I will be replacing FTP with ascp in my download scripts quite yet.
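If I do end up scripting around ascp, it will probably be with a simple retry wrapper along these lines. This is only a sketch: the -Q, -T, -l, -k, and -i flags are taken from the usage output and examples above, while the key path, rate cap, retry count, and file paths are placeholders for my own setup.

#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical retry wrapper around ascp. The flags come from the ascp usage
# output above; the key path, rate cap, retry count, and paths are placeholders.
my $key    = "$ENV{HOME}/.aspera/connect/etc/asperaweb_id_dsa.putty";
my $source = 'anonftp@ftp-private.ncbi.nlm.nih.gov:/genomes/Bacteria/all.faa.tar.gz';
my $dest   = "$ENV{HOME}/";
my $rate   = '10M';    # -l: cap the transfer rate below your real bandwidth
my $tries  = 5;

for my $attempt ( 1 .. $tries ) {

    # -Q adaptive rate, -T no encryption, -k2 ask ascp to resume a partial file
    my $status = system( 'ascp', '-QT', '-l', $rate, '-k2', '-i', $key,
        $source, $dest );

    if ( $status == 0 ) {
        print "Download finished on attempt $attempt\n";
        exit 0;
    }
    warn "ascp exited with status $status; retrying in 30 seconds...\n";
    sleep 30;
}
die "Giving up after $tries attempts\n";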

Update 3:
Michelle from Aspera finally let me know that -Q is the option I should be using by default to allow adaptive rate control. Now I am trying to download an entire directory, but I am still having connection issues. Here is a screenshot of my terminal showing that the directory resume is not working and I am losing my connection:



Saturday, January 30, 2010

Filtering Blast hits by coverage using BioPerl

A couple of days ago I wrote about how I had to throw away a ton of data because I ran out of disk space for a large Blast analysis. One of the reasons I ran out of room was because I opted to use the XML output format over the simpler tabular output. The XML format provides much more information about each hit including the lengths of query and subject genes, which allows easy retrieval of the coverage in BioPerl using the "frac_aligned_query()" and "frac_aligned_hit()" functions. For example:

use strict;
use warnings;
use Bio::SearchIO;

# placeholder values: set these for your own analysis
my $blast_file      = 'blast_report.xml';
my $coverage_cutoff = 0.8;
my $evalue_cutoff   = 1e-5;

my $searchio = Bio::SearchIO->new(
    -format => 'blastxml',
    -file   => $blast_file,
);

while ( my $result = $searchio->next_result() ) {

    # process each Bio::Search::Hit::GenericHit
    while ( my $hit = $result->next_hit() ) {

        my $evalue   = $hit->significance();
        my $identity = $hit->frac_identical();

        ### Get amount of coverage for query and hit ###
        my $query_coverage = $hit->frac_aligned_query();
        my $hit_coverage   = $hit->frac_aligned_hit();

        ### Filter based on evalue and coverage
        if (   $query_coverage > $coverage_cutoff
            && $hit_coverage > $coverage_cutoff
            && $evalue < $evalue_cutoff )
        {
            ## do something ##
        }
    }
}
This is fairly simple, but if you have to use the tabular output of Blast (due to, say, file size limitations), the lengths of the genes are not included in the Blast output. Therefore, you have to retrieve these manually somehow and then either calculate coverage yourself (remembering to tile the HSPs for each hit) or tell BioPerl about the gene lengths so you can call the same functions. This isn't obvious or documented in BioPerl, so I had to hack away until I found the solution. See the comments in the code for an explanation.

use strict;
use warnings;
use Bio::SearchIO;

# placeholder values: set these for your own analysis
my $blast_file      = 'blast_report.tab';
my $coverage_cutoff = 0.8;
my $evalue_cutoff   = 1e-5;

my $searchio = Bio::SearchIO->new(
    -format => 'blasttable',
    -file   => $blast_file,
);

while ( my $result = $searchio->next_result() ) {
    my $query_id = $result->query_name();

    # get the length of the query sequence
    # Note: you need to write this function yourself, since the length is
    # not in the blast output file.
    my $query_len = get_gene_length($query_id);

    #################
    # This sets the query length for each hit by directly manipulating
    # the object's hash (instead of through a function)
    foreach my $hit ( @{ $result->{_hits} } ) {
        $hit->{-query_len} = $query_len;
    }
    #################

    # process each Bio::Search::Hit::GenericHit
    while ( my $hit = $result->next_hit() ) {
        my $hit_id  = $hit->name();
        my $hit_len = get_gene_length($hit_id);

        ### Setting the hit length is much easier!!
        $hit->length($hit_len);

        my $evalue         = $hit->significance();
        my $identity       = $hit->frac_identical();
        my $query_coverage = $hit->frac_aligned_query();
        my $hit_coverage   = $hit->frac_aligned_hit();

        if (   $query_coverage > $coverage_cutoff
            && $hit_coverage > $coverage_cutoff
            && $evalue < $evalue_cutoff )
        {
            ## do something ##
        }
    }
}
Note that if you don't tell BioPerl the lengths, your script will die with a "divide by zero" error.
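For completeness, here is one way the get_gene_length() function used above could be written. This is just a sketch, assuming all of your query and subject sequences live in a single FASTA file (proteins.fasta is a placeholder name) that can be indexed with Bio::DB::Fasta:

use strict;
use warnings;
use Bio::DB::Fasta;

# Index the FASTA file that the Blast databases were built from.
# 'proteins.fasta' is a placeholder for your own sequence file.
my $db = Bio::DB::Fasta->new('proteins.fasta');

sub get_gene_length {
    my ($id) = @_;
    my $len = $db->length($id);
    die "No sequence length found for $id\n" unless defined $len;
    return $len;
}

Building the index once and looking lengths up from it should be much faster than re-reading the FASTA file for every hit.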

Thursday, January 28, 2010

Large BLAST runs and output formats

I have used BLAST in many different forms and on many scales, from single gene analysis to large "all vs all" comparisons. This is a short story of how I decided to delete 164GB of Blast output.

I will save my reasoning for doing such a large Blast for another post. For now, all you have to know is that I am doing an "all vs all" Blast for 2,178,194 proteins. That is (2,178,194)^2 = 4,744,529,101,636 comparisons. Sure, quite a few, but nothing that a large compute cluster can't handle (go big or go home is usually my motto).

I usually use the tabular output format for Blast (-m 8). However, one of the nice functions in BioPerl allows you to calculate the coverage of all HSPs with respect to the query or subject sequence; BioPerl handles the tiling of the HSPs, which is annoying to have to code yourself. I often use this coverage metric to filter Blast hits downstream in my pipeline. So here comes the annoying thing: the tabular output of Blast does not include the lengths of the sequences in the comparison. Therefore, to calculate coverage you need to go back to the original sequences and retrieve their lengths. I know this is not a hard thing to do, but I am a lazy programmer and I like fewer steps whenever possible. So I decided to try out the Blast XML format (-m 7). A few test runs showed that the files were much larger (about 5X), but this format includes all information about the Blast run, including the sequence lengths. Therefore, I decided not to worry about space issues and launched my jobs. Bad decision.

Well, 3 days later I found out that my quota is 300GB, and since I already had 150GB from another experiment, the Blast output put me over. I can't easily tell which jobs completed normally, so I am faced with the decision to either write a script to figure that out or scrap all the data and re-run it the right way. I have opted to delete my 164GB of Blast output and re-run it using the tabular format, and I might even gzip the data on the fly to ensure this doesn't happen again.
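If I do go the gzip route, the compressed tabular output does not even need to be unpacked before parsing, since Bio::SearchIO can read from a filehandle. Here is a minimal sketch (blast_results.tab.gz is a placeholder name, and it assumes a gzip binary is on the PATH):

use strict;
use warnings;
use Bio::SearchIO;

# Stream the compressed tabular Blast output through gzip instead of
# unpacking it on disk. 'blast_results.tab.gz' is a placeholder name.
open( my $fh, '-|', 'gzip', '-dc', 'blast_results.tab.gz' )
    or die "Could not run gzip: $!";

my $searchio = Bio::SearchIO->new(
    -format => 'blasttable',
    -fh     => $fh,
);

while ( my $result = $searchio->next_result() ) {
    while ( my $hit = $result->next_hit() ) {
        print join( "\t",
            $result->query_name(), $hit->name(), $hit->significance() ), "\n";
    }
}

close($fh);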

Of course this isn't rocket science, but I thought I would tell my tale in case others are in similar circumstances.

Thursday, January 21, 2010

MCE Remote Stops Working

Quick post in case it happens to me again. My MCE IR remote stopped working suddenly and after some Googling I found out that I needed to change the batteries. The weird part is that it also required the remote to be "reset". One way is to short the battery terminals, but that didn't work for me so I found another solution.

  1. Take the batteries out of the remote.
  2. Hold the power button down and then press each of the other buttons on the remote once.
  3. Replace the batteries.
  4. Pray to the Microsoft gods.
Hopefully this helps someone else.

Wednesday, October 21, 2009

BioTorrents - a file sharing resource for scientists

Let me ask you a question. If you just wrote a new computer program or produced a large dataset, and you wanted to openly share it with the research community, how would you do that?



My answer to that question is BioTorrents!

This has been a side project that I have been working on lately, and considering that this is the first international Open Access Week, I thought I should finally announce it.

BioTorrents is a website that allows open access sharing of scientific data. It uses the popular BitTorrent peer-to-peer file sharing technology to allow rapid file transferring.

So what is the advantage of using BioTorrents?

  1. Faster file transfer

    • Have you tried to download the entire RefSeq or GEO datasets from NCBI recently? How about all the metagenomic data from CAMERA? Datasets continue to increase in size, and download speed can be improved by allowing multiple computers/institutions to share their bandwidth.


  2. More reliable file transfer

    • BitTorrent technology has file checking built-in, so that you don't have to worry about corrupt downloads.

    • Decentralization of the data ensures that if one server is disabled, the data is still available from another user.


  3. A central repository for software and datasets


    • Rapid and open sharing of scientific findings continues to push for changes in traditional publication methods and has resulted in an increase in the use of pre-print archives, blogs, etc. However, sharing just datasets and software without a manuscript as an index is not as easy. BioTorrents allows anyone to share their data without a restriction on size (since the files are not actually hosted or transferred by BioTorrents).

    • Titles and descriptions of all data on BioTorrents can be browsed by category or searched for keywords (tag cloud coming soon).

    • As long as there is at least one user sharing the data, it will always be available on BioTorrents. Those pieces of software or datasets that are not popular and not seeded by any user will quietly die (and be removed from BioTorrents after 2 weeks).

I am continuing to update BioTorrents, so if you have any suggestions or comments please let me know.


Wednesday, October 7, 2009

Canada gets Google StreetView

Google launched their StreetView program in major Canadian cities today. Of course I don't live there right now, but I did check out my old stomping grounds in Vancouver, BC.


[Embedded Google Street View map of Vancouver, BC]

Also, they happened to do Chester, NS which is where I usually spend most of my summer vacations.



[Embedded Google Street View map of Chester, NS]

Sunday, August 2, 2009

Storable.pm

Most of my programming is what I like to call "biologically driven"; that is, the main end result is not the development of the program itself, but rather the data that comes out of the program. Many times this involves writing a script to input data, do something to that data, and then output it back to a file, which is in turn read into another script... ad infinitum.

The classic tab-delimited file is usually my choice for the intermediate format, but reading and writing these files (although simple) gets repetitive and becomes more complicated for complex data structures. I finally looked into alternatives (something I clearly should have done a while ago) and came across Storable.

Basically, it allows you to save/open any Perl data structure to/from a file.
It is very easy to use:
use Storable;

# $data_ref is a reference to any Perl data structure
my $data_ref;

store( $data_ref, 'my_storage_file' );

# later, in the same or a different script
my $new_data_ref = retrieve('my_storage_file');
Check it out if you have never used it before.
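To make that concrete, here is a small round-trip example with a nested structure (the file name and data are just made up for illustration):

use strict;
use warnings;
use Storable qw(store retrieve);

# A nested structure: gene IDs mapped to lists of hit IDs.
my %hits = (
    gene1 => [ 'hitA', 'hitB' ],
    gene2 => ['hitC'],
);

# Serialize the structure to disk...
store( \%hits, 'hits.storable' );

# ...and read it back later, possibly from a different script.
my $restored = retrieve('hits.storable');
print "gene1 has ", scalar @{ $restored->{gene1} }, " hits\n";    # prints 2

If the files ever need to move between machines with different architectures, Storable's nstore() writes them in network byte order so they stay portable.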