This debate, stated so boldly in the title above, is not something that I’m going to solve in this post. But it is something that has had me thinking a lot lately, partly for personal reasons while in the process of wrapping up a PhD and trying to evaluate what I have learned as the years have passed (and what I want to do with my life next)…

So to get right to the point, I see a distinction in the bioinformatics field between scientists, who go after testing hypotheses and bringing out new knowledge, and developers, who are basically engineers that build, break, hack etc. bytes which make computers do interesting stuff. What I want to emphasize, though, is the catch for those who lie in the middle, and there are quite a few such cases in this field.

Many were the times that I fell (and am still falling) into this “middle” trap myself, something I realize as I look at the past and the present of my work. Let me explain with a simple example what this trap is: suppose you want to use a piece of software that visualizes some data. Suppose also that this software was not made by Microshoot (you know which one I mean), but rather is a product of a small lab, presumably a result of a graduate student’s PhD project. Let’s say that this software is written well enough (i.e. a real program, not a collection of scripts), but still, since it’s one grad student’s creation and not Microshoot’s, it’ll take you some time until you get it working (see also related post from this blog).

And where will that time go? Probably you’ll need to do some *nix hacking to get the thing working (even if it comes with make/configure files, you might still need to set up some libraries etc). Then add some time on top of that if you have to bring your data into the format appropriate for the software (often the case in bioinformatics, with its Tower of Babel of data formats). All in all, the setup process will take 50% of the time you decided to allocate for trying your data with that software (and that 50% might be on any scale, from a day to weeks, depending on the scale of your endeavors). The remaining 50% will go to the scientific discovery / hypothesis testing part. On the other hand, a -scientist- (I don’t want to use quotes because it might seem I’m being ironic, but I still want to emphasize the word) would go for something that works right out of the box, so that he/she can spend 100% of the time testing hypotheses with the data (that’s why people use Spotfire for analysis of microarray data).

The problem that originates here, if you take the 50-50 approach, is that you end up dead in the middle. Why is this not good? Well, personally I enjoy building stuff and would opt for spending 100% of my time developing the thing and forgetting the hypothesis testing. Or if I didn’t like development, I’d immerse myself in reading all the papers related to the hypothesis I want to test, curate knowledge in my head, and go for 100% there (and spend some good bucks on buying Spotfire too).

I know many of you will tell me you use open source software developed by small labs and available for free, hack it and make it do your job. But think about how much time you spend on that. The time spent is often pretty significant, but it is mostly invested in setting up the software on your machine, or in doing a small hack that scratches the surface of what lies beneath the years of development that it took (a student, probably) to build the software. Does the investment of time to set it up offer you something in learning? Learning your system? Maybe, but once you have learned how to set up software on a *nix machine, I think it doesn’t offer anything to repeat it with every program you want to experiment with.

I’ll close here, but before that I’ll underline my main point for this post: I believe everything you do must give you something in return, something that makes you know more and be more experienced after you have done it. This may sound like seeing things in just black or white, but wouldn’t it be better to spend your full time either building the software, or transparently using it to make your biological discoveries - so that in the end you become a strong developer or a strong scientist? I see lots of the middle way in bioinformatics, and maybe that is why the bulk of the papers published in the field are of average quality. Do we need segregation of the developers and the scientists? Do we need journals that publish on bioinformatics software and rigorously review its quality, and journals that publish strong theories developed using large bioinformatic data by scientists who know their specialty? Do we need scientists that are software developers? The answer is yours.

I have been experimenting a bit lately, by going “under the hood” of Unix-type operating systems, specifically Linux (openSuse), NetBSD and OpenBSD… Why? Because it is boring doing the same thing the whole time, even if you learn and discover new stuff (see writing a PhD thesis and data-mining relational databases).

So what did I learn? Apart from the gory details, it boils down to this: lots of hardware that still has juice left after the years have passed gets put aside because it becomes slow when used with out-of-the-box operating systems. I’ll give you two examples of that, one being my 5-year-old Powerbook G4 Titanium (800MHz), which with OSX (and to be exact, not the latest version of OSX) was simply crawling. Yes, you would say, it’s a five-year-old computer, time to throw it away; but after installing OpenBSD on it, I got fast web browsing with Firefox, email with Thunderbird, and pdf and document reading. And to give a good example of how successful the resurrection was, when I use it to read blogs in Google Reader, or write posts through WordPress’ web interface (both heavyweight javascript), the Powerbook with OpenBSD just flies.

The second example is what I’ve seen recently on a Windows XP machine (well, that’s the worst example ever!). The machine is a decent 1GB, Core Duo HP, only about 2 years old, and it’s getting slower and slower. Why? My suspicion is that the patches and updates Windows installs automatically through the network just make the Windows kernel a mess, with the accumulating mass of patches hogging the system’s resources. So is that where the “2-year life cycle” of a computer comes from? Without any need to preach to the choir, those of you who have had a taste of Linux know that this machine is more than adequate as a production desktop with Linux or any other efficient Unix-type system.

Now the real part of the dark side, concerning my experimentation with compiling customized kernels for openSuse Linux and NetBSD. Basically, kernel compilation means editing a configuration file, which leads to compiling less code and consequently removing bits from, and shrinking, the core binary (the kernel image) that sits in your computer’s memory and controls the hardware. Since Linux (and Windows) are intended to be installed on many different brands of machines with different hardware, their kernels come pre-compiled with support for all of those. But if you don’t use the infrared port or bluetooth on your laptop, do you want the bits for controlling this hardware just eating up your RAM, since they come as part of the pre-compiled kernel? So what you do is edit the kernel compile configuration file appropriately, and then re-compile your kernel to shave off those bits and make the kernel image smaller.
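To make the “editing a configuration file” step a bit more concrete, here is a minimal sketch in Python. It assumes the usual Linux-style `.config` layout, where an enabled option looks like `CONFIG_FOO=y` (built in) or `CONFIG_FOO=m` (module) and a disabled one looks like `# CONFIG_FOO is not set`. The sample lines and the helper names `disable_options` and `count_enabled` are made up for illustration; this is not the procedure I actually followed, and a real trim is better done through the kernel’s own configuration tools (e.g. `make menuconfig`), which also take care of dependencies between options.

```python
import re

# Matches an enabled option line such as "CONFIG_BT=m" or "CONFIG_EXT3_FS=y".
ENABLED = re.compile(r"^(CONFIG_\w+)=(y|m)$")

def disable_options(config_text, unwanted):
    """Return the config text with every option named in `unwanted` switched off."""
    out = []
    for line in config_text.splitlines():
        match = ENABLED.match(line)
        if match and match.group(1) in unwanted:
            # Linux-style marker for a disabled option.
            out.append(f"# {match.group(1)} is not set")
        else:
            out.append(line)
    return "\n".join(out)

def count_enabled(config_text):
    """Count the options that are built in (y) or built as modules (m)."""
    return sum(1 for line in config_text.splitlines() if ENABLED.match(line))

# Illustrative fragment only, not a real distribution config.
SAMPLE = """CONFIG_BT=m
CONFIG_IRDA=m
CONFIG_EXT3_FS=y
# CONFIG_JOYSTICK is not set"""

trimmed = disable_options(SAMPLE, {"CONFIG_BT", "CONFIG_IRDA"})
print(count_enabled(SAMPLE), "enabled before,", count_enabled(trimmed), "after")
```

The point of the sketch is only that the “shaving” is a matter of flipping lines in a text file before the re-compile; how much RAM each flipped line saves depends on the driver in question.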

My experimentation so far has told me that compiling custom Linux kernels is a little more of a pain compared to NetBSD. That’s because Linux, as a user-centric / supporting-everything (joysticks, radio & tv cards etc etc) type of operating system, has a lot more enabled by default in the configuration file of the generic kernel that needs to be removed in order to tailor it to your specific hardware. The gain is significant though, in my case about 100MB more available RAM with the custom-compiled kernel. On the other hand, NetBSD has a much simpler file for configuring your kernel’s compilation, and the things you need to remove are RAID, PPP etc support (well, you need to leave those in if you’re compiling for a data center and not a laptop).

The other thing I observed when comparing openSuse Linux with NetBSD is that the latter has a much smaller kernel than the former, with pretty much the same hardware support for an ordinary desktop pc. I haven’t drilled down on that yet, but maybe Berkeley BSD’s kernels are better written than the one put together by Linus? To get down to earth again, I am now left with a Core 2 Duo laptop with 2GB of memory, which has 1.7GB of memory free for my applications after 0.3GB is eaten by openSuse (did I mention that Windows Vista on the same machine eats 1.2GB by itself?!!?). NetBSD runs while taking only about 50MB (!) of RAM, and is currently installed on my old PIII with 256MB of memory, making it a very happy computer….

Under Linux, I am running 4 virtual desktops (well, not under the KDE desktop manager, which is a memory hog, but under the Fluxbox window manager), with a couple of Firefox windows - each with a bunch of tabs - open on the first desktop, 4-5 x-terminals on the second, an OpenOffice Impress presentation on the third, and Cytoscape with a gene interaction network (in the thousands of nodes) open on the fourth. The amazing thing is that I still have memory left to perform operations on the network in Cytoscape, and the machine never swaps data to the hard disk. So I see this laptop staying with me for some more years to come, and when apps need more memory (e.g. Firefox 7.0) or my data crunching needs grow bigger, I’ll just thin down the operating system (switch from Linux to NetBSD).
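If you want to check the “never swaps” claim on your own Linux machine, one way is to look at the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`. The sketch below is only an illustration in Python: the function names are made up for this post, and the sample text uses invented numbers rather than readings from my laptop.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # The first field is the number; most lines carry a trailing "kB" unit.
            values[name.strip()] = int(fields[0])
    return values

def swap_used_kb(values):
    """Swap currently in use, in kB (0 means nothing has been swapped out)."""
    return values["SwapTotal"] - values["SwapFree"]

# Invented sample in the /proc/meminfo layout; on a real system you would
# read the text with open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:        2061412 kB
MemFree:          512000 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB"""

info = parse_meminfo(SAMPLE)
print("swap in use (kB):", swap_used_kb(info))
```

A value that stays at zero while all four desktops are busy is what “never swaps” means in practice.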

To conclude, it feels good to hack on the dark side :-) I think I will stay here for a little bit, the next project being to transfer the Postgres database that hosts all my data-mining projects, and which runs on a Quad Xeon server, to the Core 2 Duo laptop. It will be interesting to see whether a $1000 present-day laptop does better than a 4-year-old, $10,000 server…

In the meantime, please don’t throw away your old computer!

So I’ve been reading a bit lately around the web about Open Innovation, an interest spawned by some lounging with the Wikinomics book by Don Tapscott. A prime example of this model of the new economy is Innocentive, with its outsourcing of brain power and its search for diverse expertise (problem solvers participating in Innocentive get cash). In my mind, the concept encompasses everything from open source operating systems (see Redhat and Novell for how this model of economy makes good business), to open source drug development, to the whole break from hierarchical, closed company management systems.

So I was wondering whether this is a new thing, or whether it has always been happening. I got a hint that makes me believe the latter from reading the Free as in Freedom book on Richard M. Stallman (FSF). Without any nervous looks around, and without cold sweat dripping down my forehead for violating any copyrights, here’s an excerpt from the first chapter of the book:

“….why companies like Xerox made it a policy to donate their machines and software programs to places where hackers typically congregated. If hackers improved the software, companies could borrow back the improvements, incorporating them into update versions for the commercial marketplace. In corporate terms, hackers were a leveragable community asset, an auxiliary research-and-development division available at minimal cost.”

It seems to me that it was happening before, when the term open source did not exist, but the difference today is the upscaling of the phenomenon in the internet era, plus cheap computers and network bandwidth for everyone. And I am wondering if we are in a transition phase, where big corporate organizations try to absorb / accept / adapt to the new model of Open Innovation, and share some of their IP, their operations and their search for problem solvers with the commons, reaping the benefits of collective intelligence in return.

My speculation as I sit here writing this post is that more and more companies will go for it, and my secret hope is for pharma, where I strongly believe collective intelligence will drive drug discovery as it drove all the open source operating systems over the past 10 years.

For those of you who would like to know more about Open Innovation and think about what it will be like over the next 10 years, but are bored of reading blog posts about it and want to understand it while doing something more fun, I suggest this sci-fi book by Charles Stross.

(… or how to benefit from other people’s curation of information across the web)

So I have been brainstorming about how to write up a short article for the Biogang’s wikipage, on the topic of online collaborative communities in bioinformatics. I’m thinking of gathering a collection of examples from across the web and presenting a brief description of each, but before going any further I thought I should have a look in my own online back-yard. So here comes the present post, on how you can find information curated through human intelligence and not PageRank, gathered from a variety of sources and aggregated within a single focal website, a.k.a. Friendfeed (FF).

People over at FF capture their daily web activity and stream of consciousness in a variety of ways, by sharing interesting articles from Google Reader, bookmarking at del.icio.us, and posting short messages on Twitter and a variety of other Web 2.0 sites, as the two following pictures show. Some of these people I find interesting, and I subscribe to the records of their daily web activity…

Now, the thing that comes to mind is that all these people have a diverse array of daily jobs, personal interests and expertise in different topics. Most of our opinions, ideas and new knowledge on different topics come from the web nowadays (at least for my FF peers; I find it difficult to imagine them curling up with hardbound volumes of the Annual Proceedings of some scientific Society). As these people move through the linked information space of the world’s largest (un-indexed) encyclopedia, they gather information that answers their questions, opens a new window of interest, or simply amuses them.

Can I easily use their collection of answers to answer questions of my own on similar topics? Yes, by searching through the stream of consciousness of the people I follow over at FF: here’s an example of what was curated across the web by my peers at FF, on a topic that is related to this blog:

Hmmm… Pierre has posted exactly what I need, an RSS link on how Web 2.0 can help with the information explosion. I’ll just click and read it; for sure it’ll save me some Googling and the time spent searching through the search engine’s results…

Anybody not convinced yet about Friendfeed?

P.S. I will follow up with a post discussing the options for online collaboration and discussion over at FF.

I’d like to duplicate here a long comment I made in response to nsaunder’s post on the development of bioinformatics software he did as part of his research. This comment ended up being a long spill of thoughts from personal experience, regarding the approach to software development in academic settings in the bioinformatics field. So here’s how I think this approach affects the quality of the developed software…

Bad software design, due to the rush of getting things ready soon, is evident all over the bioinformatics field. I have experienced it personally, and in addition to your “publish or perish” reason, with which I completely agree, I add one more: many (probably the greater part) of the P.I.s on grants in bioinformatics are biologists. This translates to a boss who knows little or even nothing about implementing software, and has no notion of software usability or of the fact that a clean implementation needs time, testing, user feedback etc. We also have to assign a portion of the responsibility to the funding agencies, which again put mainly biologists on the committees, who again have no idea about software, and which give a timeline by which the funded research has to be out but no standards for how the software should be implemented. This, in my humble opinion, is the whole reason why we see so much of the development in bioinformatics being replicated efforts. A little bit of a corporate approach, where the software is a product and if it’s bad it will not sell, would not hurt, I think. Having interoperable software in bioinformatics is a whole ‘nother story - even commercial software packages cannot work well with each other, because the development is closed behind each company’s walls. But the difference with commercial software is the quality of the code written, primarily because of procedures that test it, but also because there isn’t a single grad student who has to build the whole thing, take classes, write a thesis and, oh, not to forget, please the boss with the biological insights that he / she gets after using the software…
