
Wednesday, April 14, 2010

An interview with the creator of BioTorrents

Who better to interview the creator of BioTorrents than the creator himself? :)

Interviewer: So Morgan, your article entitled “BioTorrents: A File Sharing Service for Scientific Data” was published today in PLoS One. BioTorrents uses the popular peer-to-peer file sharing protocol, BitTorrent, to allow scientists to rapidly share their results, datasets, and software. Where did this idea come from?

Morgan: Well, about 6 months ago I was downloading some genome files from NCBI's FTP site, watching the download speed hover between 50-100 Kb/s, and I said to myself (much like this interview) "I wish I could download these with BitTorrent." I have used BitTorrent for downloading other non-scientific data (let's not discuss what that may be), and I know it is a much faster and more reliable way of getting large files. A few minutes later I posted to Twitter asking if anyone had thought about setting up a BitTorrent tracker for scientific data, and the response was overwhelming (well, only 1 response, but I could feel it had a larger impact). About a week later, I brought up the idea again over coffee with some members of my lab and, more importantly, my post-doc supervisor Dr. Jonathan Eisen. He thought it was a good idea and well worth pursuing, which was all I needed to push aside all my other "real" research and focus on this much more "fun" project.

Interviewer: Thanks for that long-winded response. Maybe you could comment more briefly on the benefits of using BioTorrents/BitTorrent for sharing scientific data.

Morgan: I think it is explained fairly well in the manuscript and in my previous blog post, but to reiterate, the major benefits are:
1) Faster, more reliable, and better controlled downloading of data that scales well for very large files.
2) Instant "publishing" of data, results, and software.
3) Very easy for anyone to share their data. No dedicated web server needed.

Interviewer: Who should consider sharing data on BioTorrents?

Morgan: Everyone who has something to share. Large institutions can benefit from reduced bandwidth requirements, while individual users can benefit from the simplicity of sharing with BitTorrent technology. Personally, I really like the idea of open data and of sharing results before publication. How many times has someone done an all-vs-all BLAST of microbial genomes? In theory, this can be done once, and that person can be recognized (referenced, co-authored, etc.) when other researchers use the data.

Interviewer: Are there any challenges/limitations to using BitTorrent with scientific data?

Morgan: BitTorrent excels at transferring very large, popular datasets. Therefore, if only one person is "seeding" a file and only one person is downloading it, most of the advantage of using BitTorrent is lost. However, even in this worst-case scenario, the transfer speed would be roughly equivalent to traditional file transfer methods such as FTP/HTTP, and BitTorrent still provides the benefits of error checking and easy control of the transfer (pause, resume, etc.). Another possible problem is that some institutions try to limit BitTorrent traffic, since it is often assumed to be illegal, non-work-related network traffic. However, I would encourage users at these institutions to explain to their network administrators that much BitTorrent traffic is legitimate and shouldn't be blocked.

Interviewer: Why publish in PLoS One?

Morgan: I have been a big fan of PLoS One ever since I blogged about it last year ("Is PLOS One the future of scientific publishing?"), and I have been wanting to submit a paper there. Also, considering that BioTorrents is aimed at improving open access to data in all fields of science, PLoS One seemed like the most obvious journal choice for our manuscript.


Langille, M., & Eisen, J. (2010). BioTorrents: A File Sharing Service for Scientific Data PLoS ONE, 5 (4) DOI: 10.1371/journal.pone.0010071

Tuesday, March 2, 2010

Please don't use Clustal for tree construction!

[Image: a phylogenetic tree of life, via Wikipedia]

There are reams of books, articles, and websites about the correct way to build a phylogenetic tree. My post is not to argue about which method is best, but rather to point out that most people do not consider Clustal (e.g. ClustalX or ClustalW) to be an optimal solution in almost any circumstance. Countless times I have asked people how they built a particular tree, and they give me the vague answer "Clustal". Of course, this answer is fine if it is the first tree you have ever constructed, but beware: you will be labelled as a phylogenetic newbie.

Clustal is technically a multiple alignment algorithm, but it also includes methods for tree construction in the same interface. Most of these methods are not really considered "good" tree building methods. If you do use Clustal, at least specify which tree building method you used (e.g. "Clustal with neighbor joining"). Most people don't use Clustal even for multiple alignment anymore, because MUSCLE has been shown to be at least as accurate as Clustal and is much faster.

For tree construction, most people would agree that a maximum likelihood or Bayesian method would almost always be a better solution; PhyML and MrBayes seem to be the most popular implementations of these methods. Advanced users might also want to look into BEAST.
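As a rough sketch of what that workflow can look like on the command line (assuming the MUSCLE 3.x and PhyML 3 command-line versions; the file names are made up, and the alignment would need to be converted to PHYLIP format before the PhyML step):

muscle -in proteins.fasta -out proteins.afa
phyml -i proteins.phy -d aa -b 100

Here -d aa tells PhyML the sequences are amino acids, and -b 100 requests 100 bootstrap replicates. MrBayes and BEAST have their own command/XML conventions.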

I usually interact with most of these programs through a command line interface, so I don't have extensive knowledge of the best graphical tool. However, I did come across "Robust Phylogenetic Analysis For The Non-Specialist", which does a good job of allowing easy interaction between various methods for multiple sequence alignment, tree construction, and tree viewing.

Whatever you use to build trees, just make sure it isn't Clustal!

Wednesday, February 24, 2010

Using Aspera instead of FTP to download from NCBI

If you often download large amounts of data from NCBI's FTP site, you might be interested to know that NCBI has recently started using the commercial software Aspera to improve download speeds. This was announced in their August newsletter, and at first it was only for the Short Read Archive (SRA). However, I recently found out that they are now making all of their data available this way.

How to use it (web browser)
  1. Download and install the Aspera browser plugin software.
  2. Browse the Aspera NCBI archives.
  3. Click on the file or folder you want to download and choose a place to save it.
  4. The Aspera download manager should open (see below) and show the download progress.
How to use it (command line)
  1. The browser plugin also includes the command line program ascp (on Linux it is at ~/.aspera/connect/bin)
  2. There are many options, but the standard method is:
ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/source_directory /destination_directory/

e.g.:
ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/genomes/Bacteria/all.faa.tar.gz ~/

Critique
  • A Windows machine with Firefox worked with no problems, and download speeds at my institution were much faster than with FTP (~0.5-4.0 Mbps vs. 50-300 Kbps)
  • The browser plugin with Firefox on Linux would not work! The plugin seemed to load properly, but the Aspera download manager would not start. Update: This was due to me trying to install the plugin as root, which caused a permission error. The plugin is installed in your home directory and must not be installed as root.
  • Downloading with the command line in Linux was unreliable. This was a huge disappointment, as this was the primary method I was hoping to use. Files would start to download correctly with very fast transfer speeds (1-4 Mbps), but the connection would drop with the error: "Session Stop (Error: Connection lost in midst of data session)". Unfortunately, there is no way to resume the download, so each time I had to start over. On about the 8th try it downloaded the file (6,889 MB) correctly. Update: see below
Personal Opinion
Although I was excited to see NCBI trying to improve data transfer speeds, I was not very impressed with the Aspera solution. Hopefully, it will become more reliable in the future.
Of course, my personal solution would be for NCBI to embrace BitTorrent technology and make use of BioTorrents, but I will save that discussion for another day.


Update:
All ascp options are shown below (obtained by typing ascp without arguments). However, I can't find any further documentation on these options. As noted in the comments below, -k2 is supposed to resume a download, but this didn't work for me when I tested it.
usage: ascp [-{ATdpqv}] [-{Q|QQ}] ...
[-l rate-limit[K|M|G|P(%)]] [-m minlimit[K|M|G|P(%)]]
[-M mgmt-port] [-u user-string] [-i private-key-file.ppk]
[-w{f|r} [-K probe-rate]] [-k {0|1|2|3}] [-Z datagram-size]
[-X rexmsg-size] [-g read-block-size[K|M]] [-G write-block-size[K|M]]
[-L log-dir] [-R remote-log-dir] [-S remote-cmd] [-e pre-post-cmd]
[-O udp-port] [-P ssh-port] [-C node-id:num-nodes]
[-o Option1=value1[,Option2=value2...] ]
[-E exclude-pattern1 -E exclude-pattern2...]
[-U priority] [-f config-file.conf] [-W token string]
[[user@]host1:]file1 ... [[user@]host2:]file2

-A: report version; -Q: adapt rate; -T: no encryption
-d: make destination directory; -p: preserve file timestamp
-q: no progress meter; -v: verbose; -L-: log to stderr
-o: SkipSpecialFiles=yes,RemoveAfterTransfer=yes,RemoveEmptyDirectories=yes,
PreCalculateJobSize={yes|no},Overwrite={always|never|diff|older},
FileManifest={none|text},FileManifestPath=filepath,
FileCrypt={encrypt|decrypt},RetryTimeout=secs

HTTP Fallback only options:
[-y 0/1] 1 = Allow HTTP fallback (default = 0)
[-j 0/1] 1 = Encode all HTTP transfers as JPEG files
[-Y filename] HTTPS key file name
[-I filename] HTTPS certificate file name
[-t port number] HTTP fallback server port #
[-x ]]

Update 2:
After spending an afternoon with Aspera Support, I have some answers to my connection and resume issues when using ascp. The problem was that I was not using the -l option to properly limit the speed at which ascp sends data. I thought this limit would only be relevant if 1) I didn't want to use all of my available bandwidth, or 2) my computer hardware could not handle the bandwidth of the file transfer. Surprisingly, the reason for my disconnects was that NCBI was trying to send more data than my bandwidth allowed, causing my connection to drop. I would have thought that ascp would look after this type of bandwidth difference, considering that every other data transfer protocol I know of can control its rate of data flow. If this is the case, it suggests that my connection may still break if my available bandwidth drops for some reason (which happens often due to network fluctuations at a large institution), even if I set the limit appropriately. Hopefully, Aspera can make their data transfer method a little more robust in the future. I don't think I will be replacing ftp with ascp in my download scripts quite yet.
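For reference, this is the sort of invocation I have been testing, adding a rate cap and the resume flag from the usage output above to the original command (the 5M cap is just an example value; pick something below your actual bandwidth):

ascp -QT -l 5M -k2 -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/genomes/Bacteria/all.faa.tar.gz ~/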

Update 3:
Michelle from Aspera finally let me know that -Q is the option I should be using to allow adaptive rate control. Now I am trying to download an entire directory, but I am still having connection issues. Here is a screenshot of my terminal showing that the directory resume is not working and I am losing my connection:



Wednesday, October 21, 2009

BioTorrents - a file sharing resource for scientists

Let me ask you a question. If you just wrote a new computer program or produced a large dataset, and you wanted to openly share it with the research community, how would you do that?

[Image: a diagram of a peer-to-peer network, via Wikipedia]


My answer to that question is BioTorrents!

This has been a side project that I have been working on lately, and considering this is the first international Open Access Week, I thought I should finally announce it.

BioTorrents is a website that allows open access sharing of scientific data. It uses the popular BitTorrent peer-to-peer file sharing technology to allow rapid file transferring.

So what is the advantage of using BioTorrents?

  1. Faster file transfer

    • Have you tried to download the entire RefSeq or GEO datasets from NCBI recently? How about all the metagenomic data from CAMERA? Datasets continue to increase in size, and download speeds can be improved by allowing multiple computers/institutions to share their bandwidth.


  2. More reliable file transfer

    • BitTorrent technology has file checking built-in, so that you don't have to worry about corrupt downloads.

    • Decentralization of the data ensures that if one server is disabled, the data is still available from another user.


  3. A central repository for software and datasets


    • Rapid and open sharing of scientific findings continues to push for changes in traditional publication methods and has resulted in increased use of pre-print archives, blogs, etc. However, sharing datasets and software without a manuscript as an index is not as easy. BioTorrents allows anyone to share their data without any restriction on size (since the files are not actually hosted or transferred by BioTorrents).

    • Titles and descriptions of all data on BioTorrents can be browsed by category or searched for keywords (tag cloud coming soon).

    • As long as there is at least one user sharing the data, it will always be available on BioTorrents. Those pieces of software or datasets that are not popular and not hosted by any user will quietly die (they are removed from BioTorrents after 2 weeks).
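To give a sense of how little is involved, here is a rough sketch of preparing a dataset for sharing using the mktorrent command-line tool (the announce URL below is just a placeholder; the real tracker URL is provided when you upload to BioTorrents):

mktorrent -a http://tracker.example.org/announce -o my_dataset.torrent my_dataset/

Upload the resulting .torrent file to BioTorrents, and leave your BitTorrent client running to seed the data.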

I am continuing to update BioTorrents, so if you have any suggestions or comments please let me know.


Sunday, August 2, 2009

Storable.pm

Most of my programming is what I like to call "biologically driven"; that is, the main end result is not the program itself, but rather the data that comes out of it. Many times this involves writing a script to read in data, do something to that data, and then write it back out to a file, which is in turn read by another script... ad infinitum.

The classic tab-delimited file is usually my choice for the intermediate format, but reading and writing these files (although simple) gets repetitive, and more complicated for complex data structures. I finally looked into alternatives (something I clearly should have done a while ago) and came across Storable.

Basically, it allows you to save/open any Perl data structure to/from a file.
It is very easy to use:
use Storable;

# A reference to any Perl data structure (a hash of arrays, for example)
my $data_ref = { counts => [10, 42], genes => ['rpoB', 'gyrA'] };

# Serialize the structure to a file
store($data_ref, 'my_storage_file');

# Later, in the same or a different script
my $new_data_ref = retrieve('my_storage_file');
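One extra tip: if the file might be read on a machine with a different architecture, Storable also provides nstore, which writes the file in network byte order so it stays portable (a minimal sketch):

use Storable qw(nstore retrieve);

# nstore works like store, but writes in network byte order so the
# file can be retrieved on machines with a different architecture
nstore($data_ref, 'my_portable_file');
my $copy = retrieve('my_portable_file');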
Check it out if you have never used it before.

Wednesday, January 21, 2009

Looking for a bioinformatics expert?

What I have to offer:
  • A balanced background in both biology (BSc) and computer science (BCS)
  • A soon-to-be-completed PhD
  • Extensive research experience in bioinformatics, genomics, phylogenetics/phylogenomics, evolution, and bacterial pathogenesis
  • Some previous research experience in medical imaging, ontology development, and metagenomics
  • An impressive publishing record (7 papers, 3 of them first-author, plus 2 more first-author papers under review)
  • Solid computational skills including Perl programming, database design (MySQL), parallel programming, and web design (PHP & JavaScript)
  • Good communication and social skills
  • More information
What I am looking for:
  • Post-doc or job (academic or industrial)
  • Preferably, a position with some significant managerial or leadership responsibilities
  • Geographically, I am interested in the northeastern part of North America (Ottawa down to New York), but would entertain positions elsewhere in N.A.
I didn't put any limitations on research interests, since I am open to many areas. However, anything having to do with the Human Microbiome Project, human-bacteria interactions, or metagenomics would be of particular interest.

Please email me if you are interested or if you have suggestions on some good openings.

Tuesday, November 18, 2008

IslandViewer

My most recent research has resulted in a website for viewing predictions of genomic islands (GIs), large regions of horizontal gene transfer, in bacterial genomes. IslandViewer integrates three different methods of GI detection: IslandPick, SIGI-HMM, and IslandPath-DIMOB. SIGI-HMM and IslandPath-DIMOB use sequence composition bias to detect GIs and were found to be the most accurate in a recent publication. IslandPick is a method I recently developed that uses comparative genomics to find GIs by identifying regions that are present in one genome but absent from several related genomes.


Predictions from all three tools are pre-computed for all sequenced bacterial genomes (those available from NCBI Microbial Genomes). Also, users can submit their own newly sequenced genome for analysis and receive an email when it is complete (usually within a couple of hours).

Update: IslandViewer has been published in Bioinformatics!

Any feedback or comments on the design or usefulness of the website are appreciated!

Wednesday, October 3, 2007

Least Publishable Unit (LPU)

I have recently been thinking about the Least Publishable Unit (LPU) theory in academia. Considering that I am now a month into the fourth year of my PhD and have just submitted my first first-author research paper on my thesis work, I am starting to panic slightly. I do have a previous first-author research paper from undergrad research, 3 other non-first-author papers, a submitted first-author book chapter, and a Nature Reviews Microbiology paper soon to be submitted. However, I would like to have another couple of first-author papers in the next year and a half, so that I can graduate with a decent PhD career under my belt.

From my previous experience, the life sciences tend to publish more content less often, whereas computer scientists tend to publish very often with smaller amounts of research. Bioinformatics overlaps both of these fields, allowing different publishing rates depending on your research topic. For instance, if you are developing new tools, you will probably publish at a greater rate than if you are using bioinformatics to find some new biologically interesting result (although this is certainly not always the case).

I would like to think that I have been focusing more on biology, and thus my publishing has lagged slightly. However, I now have the skills and knowledge to quickly crank out a couple of useful tools that would probably be publishable (I feel like this would somehow be selling out, but maybe not).
Also, if I did go this route, does it depend on how much effort was involved, or rather on how useful the tool would really be?

Recently, I wrote a script that uses gene synteny to improve ortholog detection between two genomes. It is not overly complicated and uses previously developed tools (genome alignment and local alignment tools), but I think it is incredibly useful and improves upon the basic reciprocal best BLAST hit approach that is primarily in use. Although my research is not focused on ortholog prediction, and the tool was made so that I would not have to manually annotate 5,500 bacterial genes (as part of a bacterial genome project), I have to wonder: "is it publishable?" I guess the only way to find out is by submission.
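For anyone curious what the baseline looks like, here is a minimal sketch of the reciprocal best BLAST hit approach I mentioned (this is not my synteny tool; it assumes tabular BLAST output, -m 8 format, and the file names are hypothetical):

use strict;
use warnings;

# Parse tabular BLAST output (-m 8) and keep the best hit
# (highest bit score) for each query sequence
sub best_hits {
    my ($blast_file) = @_;
    my %best;    # query id => [subject id, bit score]
    open my $fh, '<', $blast_file or die "Can't open $blast_file: $!";
    while (<$fh>) {
        chomp;
        my ($query, $subject, @rest) = split /\t/;
        my $bits = $rest[9];    # bit score is the last of 12 columns
        if (!exists $best{$query} || $bits > $best{$query}[1]) {
            $best{$query} = [$subject, $bits];
        }
    }
    close $fh;
    return \%best;
}

# Best hits in both directions (genome A vs B, and B vs A)
my $a_vs_b = best_hits('A_vs_B.blast');
my $b_vs_a = best_hits('B_vs_A.blast');

# A pair is a putative ortholog if each gene is the other's best hit
for my $gene_a (sort keys %$a_vs_b) {
    my $gene_b = $a_vs_b->{$gene_a}[0];
    if (exists $b_vs_a->{$gene_b} && $b_vs_a->{$gene_b}[0] eq $gene_a) {
        print "$gene_a\t$gene_b\n";
    }
}

A synteny-aware method can then use the genomic neighborhood of these pairs to resolve the ambiguous cases that reciprocal best hits get wrong.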

Tuesday, May 15, 2007

Bioinformatics for biologists

I just read the review "Bioinformatics Software for Biologists in the Genomics Era" by Kumar and Dudley. Basically, the authors outline the need for improved bioinformatics software that can be easily used by biologists. I completely agree with the authors, but by the time I was done reading the article I was somewhat annoyed. I think my overall problem is that they just keep reiterating that the people developing the tools need to make them more user friendly, without giving any real review of possible solutions or roadblocks that need to be overcome.
They state that:
  1. command-line programs are bad, GUIs are good
  2. being able to submit batch processes is good
  3. documentation is good
  4. clear, human-readable results are good
  5. tools that run on all operating systems are good
  6. being able to connect multiple tools in a pipeline (with no programming required) is good

I think even the most amateur programmer is aware of these issues, so I am left wondering who the article's intended audience is. In addition, they don't reference any of the current tools being developed to improve the situation. In particular, they propose:
"Within the context of the user-friendly software, we favor a solution where the existing implementations of computational methods can be incorporated “as is,” without requiring any significant effort from the developer of the program that is being incorporated. We refer to this approach as “Application Linking,” which is similar to “wrapping” (Spitznagel and Garlan, 2003). The aim of Application Linking is to allow existing user-friendly applications to seamlessly host third-party scripts and applications through its graphical interface, such that the user is abstracted from the intricate nuances of the hosted application’s non-visual execution requirements (e.g., process control, system I/O, and control files)."
Ummm.... I guess they haven't heard of Taverna and web services?

As a student who is writing one of these programs, let me elaborate on what I think is the main problem: scientific credit is based on publishing, not on producing good tools. What does this mean to me as a PhD student? It means that the time spent producing a robust web tool would be better spent making additional tools or conducting more biologically relevant analyses that will lead to more publishable papers. Am I proud of this? No. Especially since I have a strong interest in reducing programming redundancy, but for now it seems that I don't have much of a choice.

Friday, May 4, 2007

Canadian Bioinformatics Workshop

Hello Blogosphere! I told myself about a month ago that I would stop lurking around everyone else's blogs and start contributing once I had a free afternoon. Well, I am finally here!

A couple of days ago I was asked to be a TA for the upcoming Canadian Bioinformatics Workshop, and I graciously accepted. I am really excited to get a chance to do some teaching. I have been lucky enough to have funding for all of my graduate career thus far, so I haven't been required to TA any courses before. It looks as if I will be leading a lecture on global genome alignment tools, which is great, since I have been using one such tool, Mauve, in my own research (soon to be submitted). Luckily, I don't have to start from scratch, since I get to use material from last year's presenter, Mike Brudno, the creator of the LAGAN alignment tools. Surprisingly, he didn't include any tools besides his own in his lecture last year. I guess it is natural to talk about what you know best. :)