Web 2.0 and Semantic Web for Bioinformatics

compute with infinite capacity from within your IDE

Posted in cloud computing by agbiotec on November 13, 2009

Imagine being able to write code to process a large dataset and then, with the click of a button, run that code on a compute cluster, without having to set up anything: no job submission to a grid, no resource allocation, no compilation, and so on.

Now stop imagining, because this is reality today. It works through the NetBeans IDE (Integrated Development Environment) and Amazon’s Elastic MapReduce. Building on this Amazon web service’s ability to provision Hadoop compute clusters of any size on the fly (Hadoop is the open source implementation of Google’s MapReduce parallel programming framework), Karmasphere offers an amazing NetBeans plugin.

After installing the plugin, developers can add their Amazon credentials within the IDE, write their code, and run it in parallel on as big a cluster as they desire (or, to be realistic, as big as their credit card limit allows). The plugin takes care of communicating with the Amazon Elastic MapReduce API and submitting the code for execution, while Amazon sets up the cluster in minutes and pulls the desired data from S3 storage.

The most impressive thing for me in this story is the abstraction of all the layers that, a few years back, required an expert (think MPI and C) in order to perform high performance computing. The second impressive thing is the democratization of access to large computing resources. Big corporations certainly have infrastructure and software in place that let business analysts and researchers write and execute code on big clusters without worrying about the details of cluster setup. But what about those outside those corporations? And what about those who don’t want to code in C or C++?

Through Hadoop’s Streaming option, anyone who can write a Perl script can have it operate on a large dataset on a compute cluster, by following the very abstract MapReduce programming model. With a laptop, an internet connection and a credit card, a bioinformatics researcher anywhere in the world can write code and execute it without worrying about resource constraints, only about his or her application logic.
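
To give a feel for what a Streaming job looks like, here is a minimal sketch. It is written in Python rather than Perl, but the idea is the same in any scripting language: the mapper reads lines from standard input and prints tab-separated key/value pairs, and the reducer receives those pairs sorted by key and aggregates them. The k-mer counting task, the function names, the `map`/`reduce` command-line switch and the value of K are my own assumptions for illustration, not something taken from the Karmasphere plugin or from Amazon.

```python
import sys
from itertools import groupby

K = 3  # hypothetical k-mer length, chosen only for illustration


def map_lines(lines, k=K):
    """Mapper: emit 'kmer<TAB>1' for every k-mer in each sequence line."""
    for line in lines:
        seq = line.strip().upper()
        for i in range(len(seq) - k + 1):
            yield f"{seq[i:i + k]}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per key.

    Assumes the input lines arrive sorted by key, which is what
    Hadoop Streaming provides between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Only touch stdin when explicitly asked to act as mapper or reducer.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Saved as, say, kmer.py (a made-up file name), the same logic can be tried on a laptop before paying for a cluster by piping a small file through the script in map mode, then through sort, then through the script in reduce mode. That pipeline mimics what the cluster does at scale.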

One Response

  1. Marlena said, on November 14, 2009 at 12:36 pm

    Ease and access to resources like this are especially important for data visualization. The longest program I run takes 3 hours to parse and assemble a dataset on my mac. Given the dearth of hours I have to do my school work, the idea of running something like this on an Amazon cluster is quite appealing. Thanks for this post.