Open Access, Science Commons, Open Science
What is wrong with the way science works in the current Internet era, with how research findings are disseminated, and with the benefits society at large receives from them? And why do we need things like Open Access, Science Commons and Open Science?
One good way to define the problem is to quote from the Science Commons website:
“There are terabytes of research data being produced in laboratories around the world, but the best web search tools available can’t help us make sense of it. Why? Because more stands between basic research and meaningful discovery than the problem of search. Many scientists today work in relative isolation, left to follow blind alleys and duplicate existing research. Data is balkanized — trapped behind firewalls, locked up by contracts or lost in databases that can’t be accessed or integrated….”
“The consequences in many cases are no less than tragic. The time it takes to go from identifying a gene to developing a drug currently stands at 17 years — forever, for people suffering from disease.”
Three main factors should be considered when trying to understand the statements above:
- Copyrights for published data (Open Access)
- Open distributed collaboration (Open Science)
- Knowledge fragmentation and data standards (Science Commons)
Copyrights for published data
An excellent article by Peter Murray-Rust in Nature Precedings (January 2008) documents, from personal experience and with examples from others, what happens when you “hand” your article to a peer-reviewed journal for publication. The central point of the article is that with many of the traditional, non-open-access (subscriber-only) journals, whenever you publish an article you must surrender copyright over the text, data, figures and supplemental material of your manuscript to the publishing house (by signing a “Copyright Transfer Agreement”). To demonstrate the consequences, Murray-Rust reports the case of a student who posted some of the supplemental material from a manuscript on her blog, in order to comment on conclusions she had reached from the data. What followed her post were notices of legal action from the journal and the publishing house, on the grounds of copyright violation!
Doesn’t the journal’s action run contrary to the fundamental spirit of science, which is to be open with research results and share them in order to give and receive feedback among peers? Journal copyright agreements, as currently written, serve neither the interests of the scientist nor the advancement of knowledge, but the protection of the publishing house’s profits. The first thing those agreements usually state is, in essence, that broadcasting the article’s content on a medium that reaches the public is not allowed. So, to take the simplest case, we all commit an infringement every time we put a graph from a copyrighted paper into one of our presentations!
Imagine yourself in a situation where, after reading an article, you see a pattern in the data which the authors have not observed or written about (or perhaps they noticed it but chose not to write about it, for obvious reasons). You probably cannot publish a new article for your observation alone. The alternative is to send a letter to the journal, but whether it gets published is an open question, and even if it does, the journal will not allow much print space for letters going back and forth. Another option is to contact the authors directly, or to show the article’s graph along with your observations during a seminar presentation. The journal’s lawyers will probably never see the copyright infringement you commit by broadcasting the graph through your PowerPoint slides in the conference room! But how far can you get through these channels, and what medium would spread your word the furthest?
The answer is the World Wide Web, where publishing is free and reaches many, many people’s eyes and ears without barriers. A blog that allows commentary, in particular, lets the discussion go on and on, unlike journal pages where communication between authors and readers is restricted. Now, thinking about the real-world example Murray-Rust reports in his article, would you post chunks of a paper along with your comments on a blog, then sit around and wait for the journal to contact you with legal threats because you have posted copyrighted material on the web?
While it seems logical for the text of an article to be copyrightable (it is, after all, the craft of the author, much as when writing a novel), the same is not true of the data. The data should belong to the public because they are facts measured from nature; similarly, under patent law only artifacts made by man can be patented, not products of nature.
We see, therefore, how the copyright framework can stifle the advancement of knowledge, since it allows discussion of scientific research through only a few limited channels. In contrast to Open Access, copyrighted articles do not allow any user with an Internet connection to link to, read, post, discuss, data-mine and re-analyze their digital content. Furthermore, Open Access secures equal opportunity, since everyone can access the latest research results, not only those nations or institutions that can afford the expensive subscription fees of journals.
Open distributed collaboration
As the corresponding article on Wikipedia puts it, Open Research and Open Science are
“…research, conducted in the spirit of free and open source software…..built around a source code that is made public, the central theme of open research is to make clear accounts of the methodology, along with data and results extracted therefrom, freely available via the internet. This permits a massively distributed collaboration…..”
If for no other reason, open collaboration offers the advantage that a large group of minds can reach a solution faster than an individual or a few closely collaborating peers. This is clearly demonstrated in the principles of Open Innovation, a business model that encourages searching for skills beyond the local talent pool within a company’s walls. In the current era of Web 2.0, communication media such as blogs and social networking websites can serve people with common goals and interests as high-bandwidth channels for the flow of ideas. This rests on the sheer number of people connected through the Internet who can express themselves and contribute their opinions through Web 2.0 sites, something already described as collective intelligence. The practice of open source software development, with examples such as entire operating systems like GNU/Linux built by the self-assembly of individuals into online working groups, demonstrates the wisdom of crowds. Imagine, therefore, how beneficial it could be for a scientist working on a hard problem to harness this collective intelligence in order to reach a quicker and better solution.
The problem is that the current academic system discourages open collaboration, mainly because promotion to tenure uses the number of peer-reviewed publications as its evaluation metric. Researchers safeguard their results to secure the competitive advantage that comes with novelty of publication, for fear that openness might deprive them of that edge. There is currently no metric that provides accreditation to the scientists participating in an online community working toward the solution of a difficult problem. A system that credits open collaboration and accounts for individual contributions would result in a more efficient discovery process, harnessing the untapped potential of large groups of scientists working together.
Knowledge fragmentation and data standards
This part of the problem comes from the fact that data are deposited locally on the computers of each research group, and stored in all sorts of formats. We have probably all seen the “data available upon request from the authors” note in research papers. Most of us have also gone through the process of writing to the authors to request the data from a publication. Upon receipt, we find out how long it takes to figure out the on-the-fly, designed-just-to-get-the-job-done data format most people use (I include myself among those people when I have deadlines), and how frustrating it can be to build upon a published manuscript’s data.
The goal is to have directly accessible and re-usable scientific data, deposited in public repositories and following standardized formats. Ideally this will enable integration of the various published results, though of course it would also require open access to the articles’ contents. Just being able to play around by easily combining data from various sets of published experiments (say, studies on proteins that have been found to be related to cancer) would probably bring forward a completely new set of insights.
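As a minimal sketch of the kind of integration this would enable, suppose two hypothetical studies have each deposited a table of cancer-related proteins in the same standardized CSV format, keyed by a shared gene identifier (the file names and column names below are invented for illustration):

    import csv

    def load_table(path, key="gene_id"):
        """Read a standardized CSV table into a dict keyed by gene identifier."""
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    # Two hypothetical deposited data sets that follow the same standard format.
    study_a = load_table("study_a_expression.csv")  # columns: gene_id, fold_change
    study_b = load_table("study_b_mutations.csv")   # columns: gene_id, mutation_rate

    # Because both tables share the gene_id key, integrating them is a simple join:
    # proteins reported by both studies, with their measurements side by side.
    for gene_id in study_a.keys() & study_b.keys():
        print(gene_id,
              study_a[gene_id]["fold_change"],
              study_b[gene_id]["mutation_rate"])

The point is not the few lines of code, but that the join is this trivial only because both studies used the same identifiers and the same format.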
There is currently a whole range of technologies designed for integrating information drawn from diverse sources, such as the Semantic Web, or the more practical but less powerful XML standard. None of these can work, though, if research results are locked behind database firewalls, preventing the development of applications that automatically aggregate the data and sift through them to discover interesting patterns. In my opinion, this will be the real “boom” in bioinformatics: adding value to existing information by intelligently combining data sets with computers, the “mashup” approach that is already a very profitable industry outside academia (Google Maps mashups, ProgrammableWeb).
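To make the mashup idea concrete, here is a minimal sketch assuming two hypothetical open web services that return JSON (the URLs and field names are invented; any real repository would define its own API):

    import json
    from urllib.request import urlopen

    def fetch_json(url):
        """Download and decode a JSON document from an open data service."""
        with urlopen(url) as response:
            return json.load(response)

    # Hypothetical open endpoints; in a real mashup these might be a protein
    # database and a literature service.
    proteins = fetch_json("https://data.example.org/proteins?disease=cancer")
    papers = fetch_json("https://literature.example.org/papers?topic=cancer")

    # Add value by combining the two sources: count how often each protein
    # is mentioned across the aggregated abstracts.
    mentions = {p["name"]: 0 for p in proteins}
    for paper in papers:
        for name in mentions:
            if name in paper["abstract"]:
                mentions[name] += 1

    for name, count in sorted(mentions.items(), key=lambda kv: -kv[1]):
        print(f"{name}: mentioned in {count} abstracts")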
Besides the experimental data from published articles, it is equally important for the text itself to be in a machine-understandable format. A good example that demonstrates this is the following: suppose you are conducting research on a topic to which 25, 50 or even 150 papers are related. You can probably read them all in a few months and keep referring back to them (you cannot memorize all of their content!) as you advance in your experiments. But is there anything more tedious than going back to a stack of papers, or an e-stack of PDF files, trying to spot the section that you remember contains a clue to help you interpret a result from an experiment?
That is where data standards and Open Access come to the rescue. NCBI’s PubMed Central is an open access repository of peer-reviewed articles whose content is deposited in XML format. In a few words, the XML tags every section of an article with computer-processable identifiers such as <Title>, <Abstract>, the paragraphs of each main-text <Section>, <Graph> and so on. In this way, the tedious task of finding a couple of sections or graphs among 150 papers with results related to your work becomes something your computer can do. Using a simple XML parser, you can have every chunk of every article at your fingertips and easily perform operations like “bring me the graphs from articles that contain such-and-such keywords in their abstract”.
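As a minimal sketch of such a query, assume a local folder of article XML files using the simplified tag names above (<Title>, <Abstract>, <Graph>; the real PubMed Central schema defines its own, richer tag set, and the caption attribute here is an assumption):

    import glob
    import xml.etree.ElementTree as ET

    def graphs_matching(keyword, pattern="articles/*.xml"):
        """Yield (title, graph) pairs from articles whose abstract
        mentions the given keyword."""
        for path in glob.glob(pattern):
            root = ET.parse(path).getroot()
            abstract = root.findtext("Abstract") or ""
            if keyword.lower() in abstract.lower():
                title = root.findtext("Title")
                for graph in root.iter("Graph"):
                    yield title, graph

    # "Bring me the graphs from articles that mention my keyword in the abstract."
    for title, graph in graphs_matching("protein folding"):
        print(title, "->", graph.get("caption"))  # 'caption' is an assumed attribute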
PubMed Central enables all of this because it follows the XML standard and keeps the text of its articles open and free. But this is only a very small fraction of the articles published in peer-reviewed journals. For all the rest, you will have to go through your stack of papers and search for the section of interest. We need open and machine-processable literature, so that new knowledge can be generated by aggregating current findings. It is easy to see how this slow process of manually gathering knowledge can hinder research in mission-critical areas such as drug development.
With the current publishing system it is difficult for the opinions of an article’s readers to reach the authors, and even more difficult to echo them back to the wider audience the article is intended for. The peer review process is slow, and examines the results of scientific inquiry only through the lens of the few minds participating in it. Open collaboration can be the way to tackle difficult scientific problems by harnessing collective intelligence, but a new system is needed that gives proper credit to those participating in online communities. Research can accelerate when data are deposited in openly accessible repositories, in formats that computers can process. Complex scientific problems can be solved, and difficult questions answered, by building upon the increased value of the aggregated information.
The Future of Science is Open
Part 1 (http://3quarksdaily.blogs.com/3quarksdaily/2006/10/the_future_of_s_1.html)
Part 2 (http://3quarksdaily.blogs.com/3quarksdaily/2006/11/the_future_of_s.html)
Part 3 (http://3quarksdaily.blogs.com/3quarksdaily/2007/01/the_future_of_s.html)
ProgrammableWeb: applications using open data from maps, markets and blogs, adding value, some even creating a business