Innovation In Genetics Using Big Data and Open Source
I recently read an interesting interview on O’Reilly about the power of combining big data science with traditional molecular biology. Molecular biology is becoming more and more digital. Its very vocabulary is DNA base pairs, expressed as A’s, T’s, C’s, and G’s, which parallels information technology and its 0s and 1s. Sequencing scans in molecular biology now reach the terabyte scale. In genomics, the term big data is far from a buzzword.
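To make the parallel between bases and bits concrete, here is a minimal sketch of packing a DNA string into an integer at two bits per base. The particular mapping (A=00, C=01, G=10, T=11) and the helper names `pack` and `unpack` are assumptions chosen for illustration, not a standard taken from the interview.

```python
# Hypothetical 2-bit mapping; any one-to-one assignment would work.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(sequence):
    """Pack a string of A/C/G/T into a single integer, two bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value, length):
    """Recover the original string; length is needed to keep leading A's."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


packed = pack("GATTACA")
restored = unpack(packed, 7)
```

The length has to be stored alongside the packed value, because a leading A encodes as zero bits and would otherwise be lost.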
For high-frequency and RNA sequencing scans, the output is on the terabyte scale. Luckily, open source tools are popping up that help us cope efficiently with the data deluge. Alongside the open source movement is a swelling Open Data movement that seeks to create more open dialogue between scientists. If someone is looking at gene X and something interesting pops up with gene Y or Z, researchers can now annotate the data with a short blurb. This allows collaborators to start working on the data earlier in the cycle. The point, after all, is to make ideas, and the data behind them, public.
The NIH has issued a mandate for more people to publish their scientific results back into repositories, so that there is more freely available data for smart data scientists to mash up with their own experimental findings. All this adds up to a goldmine for data scientists, who can quickly draw insight from data visualizations that combine disparate data sources, both public and experimental.
The big challenge in building a scalable data model for scientific discovery is, as always, making sure that the data is curated. Data munging takes up a lot of valuable time that many computational biologists and data scientists would otherwise spend doing analysis and writing algorithms. Standards need to be put in place to validate whether the data is usable or not. One suggested method is to establish a large project consortium where people can share data inter-institutionally. When people get comfortable with this half-step, we can then open it up to a wider audience.
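As a rough idea of what an agreed validation standard could look like in practice, here is a minimal sketch of a usability check for shared sequence records. The field names (`sample_id`, `gene`, `sequence`), the allowed alphabet, and the sample records are all hypothetical; a real consortium would define its own schema.

```python
# Assumed alphabet: the four bases plus N for an unknown base.
VALID_BASES = set("ACGTN")
# Assumed required fields; a real standard would specify its own.
REQUIRED_FIELDS = ("sample_id", "gene", "sequence")


def validate_record(record):
    """Return a list of problems in one record; an empty list means usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing {field}")
    sequence = record.get("sequence") or ""
    unexpected = sorted(set(sequence.upper()) - VALID_BASES)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    return problems


# Made-up records for illustration only.
records = [
    {"sample_id": "S1", "gene": "X", "sequence": "ACGTTGCA"},
    {"sample_id": "S2", "gene": "", "sequence": "ACGU-TT"},
]
usable = [r for r in records if not validate_record(r)]
```

Returning a list of problems instead of a plain true or false lets contributors see why a record was rejected, which cuts down on the munging time described above.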