Beta Science

Antarctica Part 2 - Getting to Mount Erebus

2010-11-27T16:03:00.000-08:00

After arriving in Antarctica, many days were spent preparing for the 2 weeks we would spend out in the field on Mount Erebus. All of the science gear had to be listed, boxed, and weighed before departure. Additionally, gear had to be separated into two shipments: one for the gear going with us to Fang Camp, and one for the gear we would need once we got to Lower Erebus Hut.

I spent quite a bit of my downtime drinking tea and writing in my journal while looking out at the sea ice and pressure ridges from the lounge area. I also got out for a couple of walks along the pressure ridges, which were very scenic and refreshing.

What made the walks even more interesting were the Weddell seals that had chewed their way through the ice and were lounging on the sea ice near the marked trail.

After spending nearly a week at Scott Base preparing, we got confirmation that we would be heading to the field the next morning. After a few last-minute changes, it was determined that it would take 2 helicopter trips to get the 6 of us and our gear up the mountain to Fang Camp. Fang Camp is about two thirds of the way up the mountain and is used as a stop for a couple of days to get used to the altitude. Flying directly to Lower Erebus Hut isn't done much anymore since the altitude is 12,000 ft, which feels more like 16,000 ft since it is so far from the equator, and altitude sickness is a real concern. In fact, most of us took Diamox to help adapt to the altitude, along with drinking at least 4 liters of water per day.

It was my first helicopter ride and it was simply amazing.

Once we landed, everyone helped unload and start to get gear put into our tents. A few helicopter trips later and we were on our own on the side of Mount Erebus.

Antarctica Part 1 - Getting to the ice.

2010-11-24T20:45:00.000-08:00

After about 20 hours of traveling I arrived in Christchurch, New Zealand. It was early morning, and I didn't really know where I was going, so I walked from the airport to Antarctica New Zealand. I tried on all of the clothing they provided, including enough to have 6 layers on top, 4 layers on bottom, 2 pairs of boots, and lots of other accessories. I met with the rest of the research team, who are based out of the University of Waikato:
Craig Cary, PI
Ian McDonald, PI
Craig Herbold, Post-doc
Chelsea Vickers, Master's student

That night we all attended the 2010 New Zealand Research Honours Dinner. This was the beginning of a crash course in NZ culture. Considering that I would be spending most of my time with these people in close quarters without running water, it seemed fitting to start the trip with everyone all dressed up.

After spending a day getting little things done in Christchurch, our flight for Antarctica left the next morning. For our flight we had to be wearing (or at least carrying) all of our ECW (Extreme Cold Weather) gear. We were given a short intro video on Antarctica, then we went through security and boarded a bus. A short bus ride took us out onto the tarmac and we picked up a bagged lunch as we boarded our plane.

Seating was first come, first served, so I grabbed one of the business class seats and settled in for the 5 hour flight. Almost everyone on board was a scientist, so there were lots of interesting projects being discussed on the flight. For example, the person beside me worked for NASA and was studying the soil and microbes in the McMurdo Dry Valleys, because these regions are thought to be very similar to Mars.

Interestingly, the cockpit was open and you could go up at any time to chat with the pilots and to check out their view. The visibility was perfect and the views were fantastic as we started to approach Antarctica.

As we got closer to Ross Island, Mount Erebus was clearly visible and it was unimaginable that I would be living on top of it for 2 weeks.

After landing on the ice runway, it was a short drive through the American McMurdo Station (max. ~1100 people) and into the smaller, cozier, New Zealand Scott Base (max. 86 people). At last I had arrived!

Heading to Antarctica!

2010-11-05T16:57:00.000-07:00

On Nov. 8th I will depart from sunny California and, with a quick one-day stop in Christchurch, New Zealand, I will be in Antarctica. Now this isn't just a trip to Antarctica, this is the trip to Antarctica. Please allow me to gloat a little bit. After arriving at Scott Base on Ross Island I will begin a 5 day field training course to make sure I survive the expedition I will be taking on. After training, 8 of us will be flown halfway up Mount Erebus where we will acclimatise to the altitude by living in unheated tents for a couple of days. Then we will travel to the top of Mount Erebus to reach the "Lower Erebus Hut" at an altitude of 12,000 ft. We will stay there for 2 weeks doing daily trips to "fields"/fumaroles where the snow has melted away due to hot volcanic gases (did I forget to mention that Mount Erebus is an active volcano?). The main goal is to collect soil samples and environmental data to examine the microbes living in this extreme environment. At the research station there is a small heated hut for meal times and work, but sleep will still be in the same cold tents with 24-hour sunshine. Dehydrated food and -20 to -40 C temperatures will ensure I lose some weight, but that is an added bonus.

To say that I am excited is an understatement. Sure, it will be hard to be away from my family for a month and the altitude sickness will make me feel like crap, but the chance to go on such a crazy expedition ruled out any chance of me turning it down.
How many scientists, especially those who do bioinformatics, get a chance to do field work like this!?

Review of Open Science Summit 2010, #OSS2010

2010-07-31T10:11:00.000-07:00

I have been attending Open Science Summit 2010 at Berkeley, CA and although not quite finished yet I feel like I can give an overall review of what I thought of the conference. You can check out my individual comments during the conference on Twitter.

I would like to state that, in general, I am grateful for and respect the work that Joseph Jackson and the organizing committee put in to make this open science conference a reality. It is a tremendous amount of effort, and the following is only meant as constructive criticism for possible open science summit conferences in the future.

Pros

Bringing together a very intelligent diverse group of speakers. Good mix of policy makers, developers, traditional scientists, biotech, young and old, etc.
Great use of technology. Providing a live video stream of conferences is an idea that I wish more conferences implemented. Also, backchan.nl is a nice add-on that couples well with the live video stream.
Willingness to try to adapt (as much as possible) to conference attendees' comments via Twitter, back channel, etc.

Cons

No scheduled breaks. Breaks are needed for numerous practical reasons: people need bathroom breaks, time to get some fresh air, and time for talks to get back on schedule. Even more importantly, it allows people to mingle. People travel to conferences so that they can get a chance to connect with people face to face (otherwise they would just watch the online feed).
No time for Q & A. Taking questions immediately after each speaker is not only informative, but also gives a temporary "mind break" for the audience. It also gives time for IT to get the next presentation queued. Note: this did tend to improve as the conference proceeded.
Too many speakers. Having 25 speakers in a single day (without parallel sessions) is just too much information for people to take in and sit through.

Additional lessons learned

A presentation without slides is not guaranteed to be a good one.
Videos do not always make a presentation better.
Having 2 or more speakers from the same organization or having the exact same opinion is not really beneficial.

Use Mendeley to list your publications on your personal homepage

2010-05-04T15:12:00.000-07:00

I was updating/creating my personal website a little while ago and was looking for a good method to keep my "Publications" page updated without having to edit it manually.
At first I played around with using Exhibit. This was kind of fun and allowed my publications to be sorted in all kinds of ways and exported in lots of formats. However, this method seemed like overkill (it might be more useful if I had hundreds of publications... maybe one day, but not today), and required that I update a .bib file every time I had to add a publication.

Then I noticed that my Mendeley profile has a nicely formatted page of my publications. Unfortunately, Mendeley doesn't yet provide HTML code to embed this on your own page, but there is a slight workaround.

In your Mendeley software client create a new collection and name it "Publications" (you can rename this later if need be).
Add your publications to this new collection. Note!! The publications need to be added from oldest to newest one at a time. This is because Mendeley orders the publications by the date they were added to the collection (and not by pub. date).
Right-click->Edit Settings, then under "Collection Access" choose "Public - visible to everyone". Then click "Apply and Sync".
Go to the settings again for the collection and follow the web link to the collection online.
In the upper right click on "Embed on other websites". You can customize the size and the color if you want. Then copy the html code to your website, blog, etc.
That's it! When you have a publication to add just add it to your new "publications" collection, sync, and now your personal page is updated as well.

The default settings will result in your publications looking like this:

Publications is a group in Biological Sciences on Mendeley.

For my personal website I changed the color and made it a bit larger so it looks like this.

An interview with the creator of BioTorrents

2010-04-14T12:17:00.000-07:00

Who better to interview the creator of BioTorrents than the creator himself? :)

Interviewer: So Morgan, your article entitled “BioTorrents: A File Sharing Service for Scientific Data” was published today in PLoS One. BioTorrents uses the popular peer-to-peer file sharing protocol, BitTorrent, to allow scientists to rapidly share their results, datasets, and software. Where did this idea come from?

Morgan: Well, about 6 months ago I was downloading some genome files from NCBI's FTP site and was watching the download speed hover between 50-100Kb/s and I said to myself (much like this interview), I wish I could download these with BitTorrent. I have used BitTorrent for downloading other non-scientific data (let's not discuss what they may be) and I know it is a much faster and more reliable way of getting large files. A few minutes later I posted to Twitter asking if anyone had thought about setting up a BitTorrent tracker for scientific data and the response was overwhelming (well, only 1 response, but I could feel it had a larger impact). About a week later, I brought up the idea again over coffee with some members of my lab and, more importantly, my post-doc supervisor Dr. Jonathan Eisen. He thought it was a good idea and well worth pursuing, which was all I needed to push aside all my other "real" research and focus on this much more "fun" project.

Interviewer: Thanks for that long-winded response. Maybe you could comment more briefly on the benefits of using BioTorrents/BitTorrent for sharing scientific data.

Morgan: I think it is explained fairly well in the manuscript and in my previous blog post, but to reiterate the major benefits are:
1) Faster, more reliable, and better controlled downloading of data that scales well for very large files.
2) Instant "publishing" of data, results, and software.
3) Very easy for anyone to share their data. No dedicated web server needed.

Interviewer: Who should consider sharing data on BioTorrents?

Morgan: Everyone that has something to share. Large institutions can benefit from reduced bandwidth requirements, while individual users can benefit from the simplicity of sharing with BitTorrent technology. Personally, I really like the idea of open data and the idea of sharing results before publication. How many times has someone done an all vs all blast of microbial genomes? In theory this can be done once, and that person can be recognized (referenced, co-authored, etc.) when other researchers use that data.

Interviewer: Are there any challenges/limitations to using BitTorrent with scientific data?

Morgan: BitTorrent excels at transferring very large popular datasets. Therefore, if only one person is "seeding" a file and only one person is downloading the file most of the advantage to using BitTorrent is lost. However, even in this worst case scenario, the transfer speed would be roughly equivalent to using traditional file transfer methods such as FTP/HTTP and BitTorrent still provides the benefit of error checking and ease of data transfer control (pause, resume, etc.). Another possible problem is that some institutions often try to limit BitTorrent traffic since it is often considered illegal non-work related network traffic. However, I would encourage users at these institutions to explain to their network administrator that many times BitTorrent traffic is legitimate and shouldn't be blocked.

Interviewer: Why publish in PLoS One?

Morgan: I have been a big fan of the PLoS One journal and ever since I blogged about it last year "Is PLOS One the future of scientific publishing?", I have been wanting to submit a paper there. Also, considering that BioTorrents is aimed at improving open access to data in all fields of science, PLoS One seemed like the most obvious journal choice for our manuscript.

Langille, M., & Eisen, J. (2010). BioTorrents: A File Sharing Service for Scientific Data. PLoS ONE, 5(4). DOI: 10.1371/journal.pone.0010071

Please don't use Clustal for tree construction!

2010-03-02T14:33:00.001-08:00

There are reams of books, articles, and websites about the correct way to build a phylogenetic tree. My post is not to argue about what is the best method, but rather point out that most people do not consider Clustal (e.g. ClustalX or ClustalW) to be an optimal solution in almost any circumstance. Countless times I have asked people how they built their particular tree and they give me the vague "Clustal" answer. Of course this answer is fine if this is the first tree you ever constructed, but beware you will be labelled as a phylogenetic newbie.

Clustal is technically a multiple alignment algorithm, but it also includes methods for tree construction in the same interface. Most of these methods are not really considered "good" tree building methods. If you do use Clustal, at least specify what tree building method you used (i.e. "Clustal with neighbor joining"). Most people don't use Clustal even for multiple alignment anymore, because MUSCLE has been shown to be at least as accurate as Clustal and is much faster.

For tree construction, most people would agree that a Maximum Likelihood or Bayesian method would almost always be a better solution; PhyML and MrBayes seem to be the most popular implementations of these methods. Advanced users might also want to look into using BEAST.

I usually interact with most of these programs through a command line interface, so I don't have an expansive knowledge of the best graphical tool. However, I did come across "Robust Phylogenetic Analysis For The Non-Specialist", which does a good job of allowing easy interaction between various methods for multiple sequence alignment, tree construction, and tree viewing.

Whatever you use to build trees, just make sure it isn't Clustal!

Using Aspera instead of FTP to download from NCBI

2010-02-24T14:11:00.000-08:00

If you often download large amounts of data from NCBI using their FTP site you might be interested in knowing that NCBI has recently started using the commercial software Aspera to improve download transfer speeds. This was announced in their August newsletter and at first was only for the Short Read Archive (SRA). However, I recently found out that they are now making all of their data available.

How to use it (web browser)

Download and install the Aspera browser plugin software.
Browse the Aspera NCBI archives.
Click on the file or folder you want to download and choose a place to save it.
The Aspera download manager should (see below) open and show the download progression.

How to use it (command line)

The browser plugin also includes the command line program ascp (in Linux this is at ~/.aspera/connect/bin).
There are many options but the standard method is:

ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/source_directory /destination_directory/

e.g.:
ascp -QT -i ../etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:/genomes/Bacteria/all.faa.tar.gz ~/

Critique

Windows machine with Firefox worked with no problems and download speeds at my institution were much faster than with FTP (~0.5 - 4.0Mbps vs 50-300kbps)

Browser plugin with Firefox on Linux would not work! Plugin seemed to be loaded properly, but Aspera download manager would not start. Update: This was due to me trying to install the plugin as root and causing a permission error. The plugin is installed in your home directory and must not be installed as root.

Download with command line in Linux was unreliable. This was a huge disappointment as this was the primary method I was hoping to use. Files would start to download correctly with very fast transfer speeds (1-4Mbps), but connection would drop with error: "Session Stop (Error: Connection lost in midst of data session)". Unfortunately, there is no way to resume the download so each time I had to start over. On about the 8th try it downloaded the file (6889MB) correctly. Update: see below

Personal Opinion
Although I was excited to see NCBI trying to improve data transfer speeds I was not very impressed with the Aspera solution. Hopefully, it will become more reliable in the future.
Of course, my personal solution would be for NCBI to embrace BitTorrent technology and make use of BioTorrents, but I will save that discussion for another day.

Update:
All ascp options are shown below (by typing ascp without arguments). However, I can't find any further documentation on these options. As noted in the comments below, -k2 is supposed to resume a download, but this didn't work for me when I tested it.

usage: ascp [-{ATdpqv}] [-{Q|QQ}] ...
[-l rate-limit[K|M|G|P(%)]] [-m minlimit[K|M|G|P(%)]]
[-M mgmt-port] [-u user-string] [-i private-key-file.ppk]
[-w{f|r} [-K probe-rate]] [-k {0|1|2|3}] [-Z datagram-size]
[-X rexmsg-size] [-g read-block-size[K|M]] [-G write-block-size[K|M]]
[-L log-dir] [-R remote-log-dir] [-S remote-cmd] [-e pre-post-cmd]
[-O udp-port] [-P ssh-port] [-C node-id:num-nodes]
[-o Option1=value1[,Option2=value2...] ]
[-E exclude-pattern1 -E exclude-pattern2...]
[-U priority] [-f config-file.conf] [-W token string]
[[user@]host1:]file1 ... [[user@]host2:]file2

-A: report version; -Q: adapt rate; -T: no encryption
-d: make destination directory; -p: preserve file timestamp
-q: no progress meter; -v: verbose; -L-: log to stderr
-o: SkipSpecialFiles=yes,RemoveAfterTransfer=yes,RemoveEmptyDirectories=yes,
PreCalculateJobSize={yes|no},Overwrite={always|never|diff|older},
FileManifest={none|text},FileManifestPath=filepath,
FileCrypt={encrypt|decrypt},RetryTimeout=secs

HTTP Fallback only options:
[-y 0/1] 1 = Allow HTTP fallback (default = 0)
[-j 0/1] 1 = Encode all HTTP transfers as JPEG files
[-Y filename] HTTPS key file name
[-I filename] HTTPS certificate file name
[-t port number] HTTP fallback server port #
[-x ]]

Update 2:
After spending an afternoon with Aspera Support, I have some answers to my connection and resume issues when using ascp. The problem has to do with me not using the -l option to properly limit the speed at which ascp sends data. I thought this limit would only be relevant if 1) I wanted to not use all of my available bandwidth or 2) my computer hardware could not handle the bandwidth of the file transfer. Surprisingly, the reason for my disconnects was that NCBI was trying to send more data than my bandwidth allowed, thus causing my connection to drop. I would have thought that ascp would look after these types of bandwidth differences, considering that all other data transfer protocols that I know of can control their rate of data flow. If this is the case, it would suggest that my connection may be broken if for some reason my available bandwidth drops (which would happen often due to network fluctuations at a large institution) even if I set the limit appropriately. Hopefully, Aspera can make their data transfer method a little more robust in the future. I don't think I will be replacing ftp with ascp in my download scripts quite yet.
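For anyone who does want to try ascp from a download script, here is a minimal Python sketch that only assembles the command line and does not run anything. It uses just the flags that appear in the usage text above (-Q, -T, -l, -i, -k); the function name, the "100M" rate limit, and the assumption that "-l 100M" and "-k2" are written this way are illustrative guesses, not documented Aspera behaviour.

```python
import shlex


def build_ascp_command(key_file, remote_path, dest_dir, rate_limit, resume_level=None):
    """Assemble an ascp command as a list of arguments (hypothetical helper).

    rate_limit: value for -l, e.g. "100M" (assumed format, taken from the usage text).
    resume_level: optional value for -k (0-3); the post could not confirm that resume works.
    """
    cmd = ["ascp", "-QT", "-l", rate_limit, "-i", key_file]
    if resume_level is not None:
        cmd.append(f"-k{resume_level}")
    cmd.append(f"anonftp@ftp-private.ncbi.nlm.nih.gov:{remote_path}")
    cmd.append(dest_dir)
    return cmd


# Example values only; the rate limit should be set below your real bandwidth
example = build_ascp_command(
    "../etc/asperaweb_id_dsa.putty",
    "/genomes/Bacteria/all.faa.tar.gz",
    "~/",
    rate_limit="100M",
    resume_level=2,
)
print(shlex.join(example))
```

The resulting list could be handed to subprocess.run() once the limit has been tuned for the local network.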

Update 3:
Michelle from Aspera finally let me know that -Q is the default option I should be using to allow adaptive control. Now, I am trying to get an entire directory to download, but I am still having connection issues. Here is a screenshot of my terminal showing that the directory resume is not working and I am losing my connection:

Filtering Blast hits by coverage using BioPerl

2010-01-30T22:46:00.000-08:00

A couple of days ago I wrote about how I had to throw away a ton of data because I ran out of disk space for a large Blast analysis. One of the reasons I ran out of room was because I opted to use the XML output format over the simpler tabular output. The XML format provides much more information about each hit including the lengths of query and subject genes, which allows easy retrieval of the coverage in BioPerl using the "frac_aligned_query()" and "frac_aligned_hit()" functions. For example:

use strict;
use warnings;
use Bio::SearchIO;

# Example values; set the input file and cutoffs to suit your analysis
my $blast_file      = 'blast_output.xml';
my $coverage_cutoff = 0.8;
my $evalue_cutoff   = 1e-5;

my $searchio = Bio::SearchIO->new(
    -format => 'blastxml',
    -file   => $blast_file
);
while ( my $result = $searchio->next_result() ) {

    # process each Bio::Search::Hit::GenericHit
    while ( my $hit = $result->next_hit ) {

        my $evalue   = $hit->significance();
        my $identity = $hit->frac_identical();

        # get the amount of coverage for the query and the hit
        my $query_coverage = $hit->frac_aligned_query();
        my $hit_coverage   = $hit->frac_aligned_hit();

        # filter based on evalue and coverage
        if (   ( $query_coverage > $coverage_cutoff )
            && ( $hit_coverage > $coverage_cutoff )
            && ( $evalue < $evalue_cutoff ) )
        {
            ## do something ##
        }
    }
}

This is fairly simple, but if you have to use the tabular output of Blast (due to, say, file size limitations) the lengths of the genes are not included in the Blast output. Therefore, you have to retrieve these manually somehow and then either calculate coverage yourself (remembering to tile the HSPs for each hit) or tell BioPerl about the gene lengths so you can call the same functions. This isn't obvious or documented in BioPerl, so I had to hack away until I found the solution. See the comments in the code for an explanation.

use strict;
use warnings;
use Bio::SearchIO;

# Example values; set the input file and cutoffs to suit your analysis
my $blast_file      = 'blast_output.tab';
my $coverage_cutoff = 0.8;
my $evalue_cutoff   = 1e-5;

my $searchio = Bio::SearchIO->new(
    -format => 'blasttable',
    -file   => $blast_file
);

while ( my $result = $searchio->next_result() ) {
    my $query_id = $result->query_name();

    # get the length of the query sequence
    # Note: you need to write get_gene_length() yourself since the length
    # is not in the blast output file.
    my $query_len = get_gene_length($query_id);

    #################
    # This sets the query length for each hit by directly manipulating
    # the object's hash (instead of through a function)
    foreach my $hit ( @{ $result->{_hits} } ) {
        $hit->{-query_len} = $query_len;
    }
    #################

    # process each Bio::Search::Hit::GenericHit
    while ( my $hit = $result->next_hit ) {
        my $hit_id  = $hit->name();
        my $hit_len = get_gene_length($hit_id);

        # Setting the hit length is much easier!!
        $hit->length($hit_len);

        my $evalue         = $hit->significance();
        my $identity       = $hit->frac_identical();
        my $query_coverage = $hit->frac_aligned_query();
        my $hit_coverage   = $hit->frac_aligned_hit();

        if (   ( $query_coverage > $coverage_cutoff )
            && ( $hit_coverage > $coverage_cutoff )
            && ( $evalue < $evalue_cutoff ) )
        {
            ## do something ##
        }
    }
}

Note that if you don't tell BioPerl the lengths your script will die with a "divide by zero" error.
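For anyone who would rather calculate coverage by hand from the tabular output, the HSP tiling step can be sketched as a simple interval merge. The Python below is only an illustration (Python rather than Perl, purely for brevity); the function name is made up, the coordinates are assumed to be 1-based and inclusive as in the tabular start/end columns, and this is not necessarily how BioPerl tiles HSPs internally.

```python
def tiled_coverage(hsps, seq_len):
    """Fraction of a sequence covered by a set of HSPs (hypothetical helper).

    hsps: list of (start, end) pairs, 1-based and inclusive.
    seq_len: full length of the sequence, looked up separately.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # Normalise so start <= end, then sort by start position
    intervals = sorted((min(s, e), max(s, e)) for s, e in hsps)
    covered = 0
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # No overlap with the current block: bank it and start a new one
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping HSP: extend the current block instead of double counting
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / seq_len


# Two overlapping HSPs and one separate HSP on a 400 residue sequence
print(tiled_coverage([(1, 50), (40, 100), (150, 200)], 400))  # 0.3775
```

Without the merge, the overlapping region (positions 40 to 50) would be counted twice and the coverage overestimated.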

Large BLAST runs and output formats

2010-01-28T10:42:00.000-08:00

I have used BLAST in many different forms and on many scales, from single gene analysis to large "all vs all" comparisons. This is a short story of how I decided to delete 164GB of Blast output.

I will save my reasoning for doing such a large Blast for another post. For now, all you have to know is that I am doing an "all vs all" Blast for 2,178,194 proteins. That is (2,178,194)^2 = 4,744,529,101,636 comparisons. Sure, quite a few, but nothing that a large compute cluster can't handle (go big or go home is usually my motto).
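The size of an "all vs all" comparison is just the square of the number of sequences, which is easy to check in a couple of lines of Python (used here only as a calculator; the variable names are arbitrary):

```python
# Number of pairwise comparisons in an "all vs all" Blast of n proteins
n_proteins = 2_178_194
n_comparisons = n_proteins ** 2
print(f"{n_comparisons:,}")  # 4,744,529,101,636
```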

I usually use the tabular output format for Blast (-m 8). However, one of the nice functions in BioPerl allows you to calculate the coverage of all HSPs with respect to the query or subject sequence. BioPerl handles the tiling of the HSPs, which is annoying to have to code yourself. I often use this coverage metric to filter Blast hits downstream in my pipeline. So here comes the annoying thing. The tabular output of Blast includes the start and end positions of each alignment, but not the lengths of the sequences in the Blast comparison. Therefore, to calculate coverage you need to go back to the original sequence and retrieve the length of the sequence. I know this is not a hard thing to do, but I am a lazy programmer and I like fewer steps whenever possible. Therefore, I decided to try out the Blast XML format (-m 7). A few test runs showed that the files were much larger (5X), but this format includes all information about the Blast run, including the sequence lengths. Therefore, I decided not to worry about space issues and launched my jobs. Bad decision.

Well 3 days later, I find out my quota is 300GB and since I already had 150GB from another experiment the blast output put me over. I can't easily tell which jobs completed normally, so I am faced with the decision to either write a script to figure out which jobs completed normally, or scrap all the data and re-run it the right way. I have opted to delete my 164GB of blast output and re-run it using the tabular format and I might even gzip the data on the fly to ensure this doesn't happen again.
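Gzipping on the fly just means compressing the output as it is produced rather than writing the full file first. A minimal Python sketch of the idea follows; the function name is made up, and the in-memory stream with a made-up tabular line stands in for the Blast program's standard output (in a real pipeline the source would be something like sys.stdin.buffer).

```python
import gzip
import io
import os
import shutil
import tempfile


def gzip_stream(src, dest_path):
    """Copy a binary stream into a gzip file chunk by chunk (hypothetical helper)."""
    with gzip.open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out)


# Demo: a made-up tabular line repeated many times stands in for real Blast output
fake_line = b"query1\tsubject1\t98.5\n"
fake_output = io.BytesIO(fake_line * 1000)
dest = os.path.join(tempfile.mkdtemp(), "hits.tab.gz")
gzip_stream(fake_output, dest)

# Reading it back shows nothing was lost
with gzip.open(dest, "rb") as fh:
    restored = fh.read()
print(len(restored), os.path.getsize(dest))
```

Because the data is copied in chunks, the uncompressed output never has to exist on disk, which is the whole point when the quota is the limiting factor.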

Of course this isn't rocket science, but I thought I would tell my tale in case others are in similar circumstances.

MCE Remote Stops Working

2010-01-21T21:22:00.000-08:00

Quick post in case it happens to me again. My MCE IR remote stopped working suddenly and after some Googling I found out that I needed to change the batteries. The weird part is that it also required the remote to be "reset". One way is to short the battery terminals, but that didn't work for me so I found another solution.

Take the batteries out of the remote.
Hold the power button down and then press every other button on the remote once.
Replace the batteries.
Pray to the Microsoft gods.

Hopefully this helps someone else.

BioTorrents - a file sharing resource for scientists

2009-10-21T12:07:00.000-07:00

Let me ask you a question. If you just wrote a new computer program or produced a large dataset, and you wanted to openly share it with the research community, how would you do that?

My answer to that question is BioTorrents!

This has been a side project that I have been working on lately and considering this is the first international Open Access Week I thought I should finally announce it.

BioTorrents is a website that allows open access sharing of scientific data. It uses the popular BitTorrent peer-to-peer file sharing technology to allow rapid file transferring.

So what is the advantage of using BioTorrents?

Faster file transfer

Have you tried to download the entire RefSeq or GEO datasets from NCBI recently? How about all the metagenomic data from CAMERA? Datasets continue to increase in size and downloading speed can be improved by allowing multiple computers/institutions to share their bandwidth.

More reliable file transfer

BitTorrent technology has file checking built-in, so that you don't have to worry about corrupt downloads.

Decentralization of the data ensures that if one server is disabled, the data is still available from another user.
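The built-in file checking works because a torrent records a SHA-1 hash for every fixed-size piece of the data, so a client can tell exactly which pieces arrived damaged and fetch only those again. The Python sketch below illustrates the idea on in-memory bytes; the function names and the 256 KiB piece length are illustrative choices, not taken from BioTorrents or any particular client.

```python
import hashlib

PIECE_LENGTH = 262144  # 256 KiB, an example piece size


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Return the SHA-1 digest of each fixed-size piece of data."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def find_corrupt_pieces(data, expected, piece_length=PIECE_LENGTH):
    """Return the indexes of pieces whose hash differs from the expected hash."""
    actual = piece_hashes(data, piece_length)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Demo: 1 MiB of data (4 pieces) with a single byte damaged in the third piece
original = bytes(range(256)) * 4096
expected = piece_hashes(original)
damaged = bytearray(original)
damaged[2 * PIECE_LENGTH + 10] = 0xFF
print(find_corrupt_pieces(bytes(damaged), expected))  # [2]
```

Only the damaged piece needs to be downloaded again, rather than the whole file.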

A central repository for software and datasets

Rapid and open sharing of scientific findings continues to push for changes in traditional publication methods and has resulted in an increase in the use of pre-print archives, blogs, etc. However, sharing just datasets and software without a manuscript as an index is not as easy. BioTorrents allows anyone to share their data without a restriction on size (since the files are not actually hosted or transferred by BioTorrents).

Titles and descriptions of all data on BioTorrents can be browsed by category or searched for keywords (tag cloud coming soon).

As long as there is at least one user sharing the data it will always be available on BioTorrents. Software or datasets that are not popular and not hosted by a user will quietly die (they are removed from BioTorrents after 2 weeks).

I am continuing to update BioTorrents, so if you have any suggestions or comments please let me know.

Canada gets Google StreetView

2009-10-07T09:55:00.000-07:00

Google launched their StreetView program in major Canadian cities today. Of course I don't live there right now, but I did check out my old stomping grounds in Vancouver, BC.

Also, they happened to do Chester, NS which is where I usually spend most of my summer vacations.

Storable.pm

2009-08-02T14:37:00.000-07:00

Most of my programming is what I like to call "biologically driven"; that is, the main end result is not the development of the program itself, but rather the data that comes out of the program. Many times this involves writing a script to input data, do something to that data, and then output it back to a file which is in turn read into another script... ad infinitum.

The classic tab-delimited file is my usual choice for the intermediate format, but reading and writing these (although simple) gets repetitive, and gets more complicated for more complex data structures. I finally looked into alternatives (something I clearly should have done a while ago) and came across Storable.

Basically, it allows you to save/open any Perl data structure to/from a file.
It is very easy to use:

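use strict;
use warnings;
use Storable;

# Reference to any data structure (a small hash is used here as an example)
my $data_ref = { gene1 => [ 1, 2, 3 ], gene2 => [ 4, 5 ] };

store( $data_ref, 'my_storage_file' );

# later in the same or a different script
my $new_data_ref = retrieve('my_storage_file');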

Check it out if you have never used it before.

Gene ontology tool suggestions

2009-07-15T11:57:00.000-07:00

I have used a few GO tools in the past, but after looking at the massive list of tools on the gene ontology page I'm hoping someone can give me a good suggestion for my problem.

Basically, I have several lists of GO terms (~4-15 terms per list) and I would like to see if at a "higher" branch they share a common molecular function. Ideally, a tool that could be run from the command line and outputs significance scores would be great, but a GUI tool would also work since I have about 70 lists that I would need to run.

Note, that this is slightly different than the usual over-representation analysis which usually takes a list of genes as input. In my problem I am starting with GO terms.

Any suggestions would really be welcome!

Syncing Mendeley and CiteULike

2009-06-03T12:20:00.000-07:00

I have been using CiteULike for quite a while (after switching from Connotea), but more recently started using Mendeley. Overall, I am really impressed! Mendeley is a relatively new software project (still in beta), and I am surprised by how well it works. It has some crucial features that separate it from other bookmarking tools, such as the ability to sync bookmarks and PDF files back and forth between multiple personal computers and their online server, the ability to organize PDF files locally by title, author, journal, etc., a citation plugin for Word (so you can stop paying for EndNote), and client software that is available for Linux! Mendeley has been working so well that I was afraid I might end up abandoning CiteULike, since I most likely won't bookmark something twice.

However, yesterday it was announced that bookmarks from CiteULike can be accessed from within Mendeley. Note that this isn't just the simple ability to import the bookmarks, but that the bookmarks are kept synced and in their own CiteULike folder within Mendeley. Although the syncronization is currently only one way, from CiteULike to Mendeley, further integration of the two tools is suppossedly in the works.

This seems like a great colloboration since CiteULike tends to focus more on the social networking aspect, while Mendeley focuses more on providing a presonal reference manager.

It is nice to see companies colloborating instead of competing.

Automatically downloading emails in Thunderbird when using IMAP

2009-05-20T10:33:00.000-07:00

Lots of applications have an "offline" feature that allows you to access data (email, calendar, documents, etc.) when you don't have an internet connection. These are great, but I can never remember to click the "offline" mode. Bandwidth and storage are rarely concerns, so I would prefer that applications did this by default (or at least had the option). Google Calendar is about the only program I use daily that does this without me needing to click on update/offline.

For those who use Thunderbird as their email client and use IMAP instead of POP, you can set it to have all of your emails stored locally by default without clicking the offline mode. The trick is a couple of settings in the advanced config editor (Options->Advanced->Config Editor):

mail.server.default.autosync_offline_stores to true (you might have to create this value if it doesn't already exist. Right Click->New->Boolean)
use_status_for_biff to false

More information is here.

Hello California!

2009-05-13T13:42:00.001-07:00

Well, UC Davis to be more precise. I accepted a postdoctoral fellowship from Jonathan Eisen to be a part of the iSEEM project working on metagenomics. I have only been here for a few days, and first impressions are great. First, the research field is exactly what I am most interested in; second, my previous PhD research is definitely relevant; and third, I feel like I have lots to learn from the people around me.

Since my previous blog tagline/description is now inaccurate:

"A PhD student's point of view on bioinformatics, evolution, and microbial diversity; with an interest in cutting edge computer tools that make them all a bit easier."

I decided to radically change it to:

"A post-doc's point of view on bioinformatics, evolution, and microbial diversity; with an interest in cutting edge computer tools that make them all a bit easier."

Jonathan's opinion on open-access publishing is quite similar to my own, so in addition to blogging about microbial evolution, expect to see more posts about my views on academic publishing.

Goodbye Vancouver!

2009-04-30T21:33:00.000-07:00

The past 4 months have been a whirlwind. On April 16th I successfully defended my PhD thesis, after some minor revisions submitted it on April 18th, and left the country on April 29th. I wouldn't recommend such a tight timeline, especially if you happen to have a 5 month old baby as well!

My thesis will eventually be accessible (open-access of course) through SFU's library, but those who are just dying to read it now can access it here (+ appendix).

I feel obligated to give some type of advice to future PhD students. Unfortunately, I don't have any huge insight, but I would recommend not worrying too much during your graduate studies. Many times, I thought the whole thing would unravel and I would never finish, especially during years 2-3, but all of a sudden things started to fall in place. Every grad student I have ever talked to has always agreed that productivity increases greatly in the last year or two and so you can't worry about how long it took to do X in time Y. I hope I am not giving the impression that doing a PhD is easy, because it is not. It is hard, and different from all other schooling. If you think of an undergrad degree as sprinting, then a PhD is more like a marathon. I was great at sprinting, but learning to be a good marathon runner was a completely new set of skills.

In between all of the moving steps (I don't want to see another cardboard box for quite a while), I had lots of time to reflect on my past 4.5 years in Vancouver, BC. Although there were some challenging times, I will greatly miss Vancouver and the people that I met during my time there. The first years of my marriage, living far away from family, the completion of my PhD, and becoming a Dad all happened in Vancouver and I will cherish the multitude of memories that accompany each of these milestones.

To end this post, I think I will list a few flashes of memories that are ingrained in my head from the past several years (in no particular order):

Driving across Canada and seeing the Rockies from a distance for the first time.

Looking out my first downtown apartment window for the first time.

Standing on top of the "Chief".

Snorkeling in the ocean with my wife along the "sunshine coast".

Houseboating on a quiet lake in Vancouver Island surrounded by the most beautiful scenery.

White water rafting near Squamish.

Walking the sea wall countless times, and still being impressed by it every time.

The various camping adventures including a jump into a cold lake to escape a never ending swarm of flies.
Standing at the peak of Whistler for the first time.
The various conferences that included travel to destinations such as Maui, Vienna, Cambridge, UK, and California.
The birth of my son, Gavin.
The happiness of reading a short letter stating that I had completed all requirements for my PhD.

Is PLOS One the future of scientific publishing?

2009-03-31T14:06:00.000-07:00

I just read about PLOS One's new features through their relatively new blog, EveryOne. Although the new features are not really groundbreaking, they do provide a much improved layout and a new "Related Content" page. These changes show that One is dedicated to improving connectivity between peer-reviewed papers and commentary from comments, blogs, etc., giving me some hope that publishing may be changing (yet still at a snail's pace).

So back to the question asked in the title of this post, "Is PLOS One the future of scientific publishing?", I am going to have to say a tentative "Yes". I think their approach of not judging papers on novelty, and instead focusing peer review on ensuring that the methods and the conclusions drawn from the results are scientifically sound, opens many doors for how scientists publish their findings. Currently, scientists compete for limited space in a "high-impact" journal. In the majority of cases papers are rejected not because their methods, results, and conclusions are invalid, but because a better paper was submitted at the same time. This competition is justified, but in its current format it has various drawbacks, including:

Importance of research is determined by a very small number of reviewers and usually a single editor has the final decision
Significance or novelty of research is very subjective and can vary widely between reviewers
Significance can change over time as future experiments confirm or depend on the results of the current research (including negative results)
Not making the cut (i.e. rejection) results in a large waste of time as authors have to reformat, resubmit, and respond to new reviewers' comments

Separating the evaluation of competitiveness, novelty, significance, etc. from the evaluation of scientific robustness helps reduce many of these problems. The largest hurdle with this model is moving from a journal impact factor to a per-paper impact measurement, so that "significant" papers are still valued and recognizable in PLOS One and in other journals that will likely follow its publishing methods.

Personally, I have never published in PLOS One and by no means do I think PLOS One in its current form is the pinnacle of publishing. However, I do appreciate that they are trying to change the way science publishing is currently conducted.

Google Calendar Available Offline

2009-03-10T10:31:00.000-07:00

I am just starting to poke my head out of the thesis hole and noticed that Google Calendar is now available offline using Google Gears. By default it only syncs your personal calendar, but shared calendars can also be synced under the offline options.

I'm not offline that often, but it is nice to know that my calendar is always available now.

Considering Gears has been around for quite a while now, I am surprised that it took Google this long to add the offline mode for their calendar.

HubMed Citation Manager

2009-02-19T11:38:00.000-08:00

I just came across HubMed yesterday and I found one of their tools incredibly useful for getting references into EndNote (or other reference manager software). Basically, HubMed Citation Finder will take a bibliography (say from one of your favorite papers), split it into individual references, find each citation in PubMed, and return the list of references in one of several citation formats such as RIS, BibTeX, RDF, etc. This file is then easily imported into your reference manager's library.

It just saved me a couple of hours and would have saved me even more if I had known about it a few weeks ago.

Personal Genomics & The Burden of Knowing

2009-02-03T12:38:00.000-08:00

Like many bioinformaticians, biologists, scientists, and technologists, I am very interested in personal genomics. I have kept track of the start-ups that are doing personal SNP analysis and have been eagerly waiting for sequencing costs to drop to the point where the $1000 genome is possible. I envisage everyone having their personal genome done and programs to analyse the data being so widespread that even a "My Genome Facts" Facebook application would not seem out of place.

Of course I have read lots about ethical worries about how the data could be misused or how the public cannot handle the probabilities of having a certain disease. Personally, I have always thought these were blown a bit out of proportion and that personal genomics will in general be a good thing. More data is better, right?

Well, I just read an article called "The Burden of Knowing" by Catherine Elton in Boston Magazine and it really made me reconsider my previous thoughts. Elton starts out explaining about personal genomics and specifically about Knome, the first company to do complete personal genome sequencing. She then starts to delve into her personal choices regarding her susceptibility to carrying a cancer-risk mutation in the BRCA1 gene. The article is extremely well written and, unless I am becoming a complete softy, quite sobering.

A small excerpt that I really enjoyed was this:

The counselors then mentioned another option: having my ovaries taken out and my breasts removed. Here we were, talking about science's ability to look along a submicroscopic piece of DNA, searching for missing letters on a strip of a gene, and yet if science found that letters were missing—if the gene had the cancer-risk mutation—the best it could do was amputate or sterilize. These options seemed as though they should have been filed away in a medieval remedy book, somewhere between leeches and bloodletting.

So did the story change my view on personal genomics? No, not completely, but I do think that getting my genome sequenced might not be as fun as I first thought. Too bad there are not many positive attributes linked to genes like "gene variant Y will allow you to live a long life despite your lack of physical exercise" or "you have an improved version of the alcohol dehydrogenase gene, so feel free to drink more beer".

Pseudomonas and Langille in the media

2009-01-27T11:15:00.000-08:00

Ok, this is some serious self-promotion, but scientists (well PhD students anyway) don't get a chance to brag about their research being in the media very often. Plus, it is my blog, so why not?!

The actual science:
The research in question centered on the sequencing of the Liverpool Epidemic Strain of Pseudomonas aeruginosa, which was causing increased virulence in cystic fibrosis patients. One of the interesting things in the paper is that we identified several genes related to virulence (using STM) and that several of these genes were within genomic island and prophage regions. Of course virulence factors have been found within these types of regions before, but having actual in vivo experimental evidence (from a chronic rat lung infection model) that these genes are involved in virulence in an epidemic strain really makes this research notable. The research was published in Genome Research and is open access.

The media coverage:
Lancet Infectious Diseases (sorry not OA).

Vancouver Sun

Ok, now for the fun stuff:
SFU News - Notice those sleepy eyes? That is what having a 2 month old will do to you!

The story even made some news on a non-English site:
http://news.sina.com.hk/cgi-bin/nw/show.cgi/32/1/1/962395/1.html
Automatic translation results in me being referred to as "blue Gull", SFU as "West gate Philippines Sand University", and UBC as "Inferior poem University".
http://tinyurl.com/68ge47

Looking for a bioinformatics expert?

2009-01-21T11:42:00.000-08:00

What I have to offer:

A balanced background in both biology (BSc) and computer science (BCS)
Soon to be completed PhD
Extensive research experience in bioinformatics, genomics, phylogenetics/phylogenomics, evolution, and bacterial pathogenesis
Some previous research experience in medical imaging, ontology development, and metagenomics
An impressive publishing record (7 papers, 3 as first author, 2 more as first author under review)
Solid computational skills including Perl programming, database design (MySQL), parallel programming, and web design (PHP & JavaScript)
Good communication and social skills
More information

What I am looking for:

Post-doc or job (academic or industrial)
Preferably, a position where I have some significant management or leadership responsibilities
Geographically interested in northeastern parts of North America (Ottawa down to New York), but would entertain positions elsewhere in N.A.

I didn't put any limitations on research interests, since I am open to many areas. However, anything having to do with the Human Microbiome Project, human-bacteria interactions, or metagenomics would be of particular interest.

Please email me if you are interested or if you have suggestions on some good openings.