Scientists love data. It is like flowers to florists, canvas to artists, money to bankers or ingredients to chefs. It answers the questions we have and sets the direction for new ones. From Mendel and his pea plants to Darwin and his finches, through Rosalind Franklin and her X-ray diffraction images of DNA, to CERN and their atom smashing tube thingy, the aim of experiments is to generate data to answer questions. We invest considerable time and effort to work out whether the data we have are true and representative of the whole, or a unique subset caused by chance (statistics) or the way we did the study (experimental design). We often repeat the same experiment multiple times to convince ourselves (and, more importantly, others) of the validity of our data. Without data, we are just messing around in a white coat.
Too much data
So you would think the more data the better. However, you can have too much of a good thing. Whereas before you would ask whether your treatment increases or decreases a single factor, we can now measure thousands of things in a single experiment, generating huge piles of data (datasets). In biology, methods that generate large datasets are described as 'omics. This is named after the genome (all the genes that make up an organism). We now have the transcriptome (all the mRNA – the messages that make proteins – at a certain timepoint), the proteome (all the proteins), the metabolome (all of the metabolites), the microbiome (all of the bacteria on or in the body) and the gnomeome (the number of garden ornaments per square metre). Each technique generates a long list of stuff that goes up or down after a certain treatment. These long lists are where the problems arise, comprising genes with weird short names like IFIT1, LILRB4 and IIGP1, many of which have no known function. All of which leads to a mountain of data languishing in the supplemental tables of half-read papers in obscure journals.
Biologist + computer = ???Xxx!!!
The surfeit of data has led to a whole new discipline, called bioinformatics, devoted to interpreting these lists. But bioinformatics requires special skills: knowledge of the mythical 'R' programming language, access to software tools with laboured jokey names based on forced acronyms like PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States), and time. Faced with these datasets, I get a bit flustered: like many biologists, I type with two fingers, get nervous flushes if someone mentions Linux and can just about use Excel to add two numbers together. This is a problem because it means that there is a wealth of data out there that is inaccessible to me.
Bioinformatics for dummies
I am interested in how the body fights off viral infections in the lungs, particularly a virus called Respiratory Syncytial Virus (RSV). Part of the body's defences is a family of proteins that restrict viruses' ability to hijack our cells to make copies of themselves. There are a lot of these proteins, and for many of them we have no idea how they work. A brief look at some of the 'omics studies reveals long lists of these proteins, with no insight as to what they do. There are probably clever, but inaccessible, AI-based algorithms that can search for all the relevant papers and compile them somehow; but I wouldn't know how to use them or even where to start looking. Instead we used a 'brute force' approach, which meant that I/we/Jaq (first author on the paper) sat down and searched for every paper ever published on RSV that contained a large dataset. Having found the papers, we then harvested the gene lists from them. This was not trivial: some of the papers had to be ignored because their data were locked behind paywalls, or were in Chinese, or were just rubbish papers, or some combination of the three. But we were left with gene lists from 33 papers and stuck all the data in a big pile. At this point we employed the services of Derek, a bona fide bioinformatician, who through some computer wizardry wrote us a piece of software called geneIDs (which is freely available here, if you need such a thing), which handily counts and ranks the genes. This gave us a brand new list built from all the other lists (a meta-analysis of sorts), which can then be used as the basis for further analysis. Which we did and published the results here.
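To give a flavour of what "counts and ranks the genes" means in practice, here is a minimal sketch in Python. This is not the actual geneIDs code – the gene lists and the three-study example below are made up for illustration – but the core idea is the same: pool the gene lists from every study, count how many studies report each gene, and rank by that count.

```python
from collections import Counter

# Hypothetical gene lists harvested from three published RSV studies
# (gene symbols are illustrative; the real analysis pooled lists from 33 papers).
study_gene_lists = [
    ["IFIT1", "LILRB4", "IRF7"],
    ["IFIT1", "IRF7", "IIGP1"],
    ["IRF7", "IFIT1", "OAS1"],
]

# Count how many studies report each gene, then rank by that count.
counts = Counter(gene for gene_list in study_gene_lists for gene in gene_list)
ranked = counts.most_common()

for gene, n_studies in ranked:
    print(f"{gene}: reported in {n_studies} studies")
```

Genes reported by many independent studies float to the top of the ranking, which is exactly the kind of prioritised list a biologist can then take into the lab.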
More data: better tools
First of all, we compared our computer-generated list to some new data from a clinical study. Children with severe RSV had higher levels of 56% of the genes on our list, which supports the approach by demonstrating that these genes are important during infection. Taking a subset of these genes, we then performed experiments showing that they are able to reduce RSV's ability to infect cells and animals. In particular, we demonstrated that a gene called IRF7 was central to the anti-RSV response. So ultimately the answer to the question 'can you have too much data?' is no, but there is a need for tools to interpret it. In the current study we developed one such tool, which we feel is more accessible to biologists with little to no computer skills.
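The comparison with the clinical study boils down to a simple overlap calculation: what fraction of the genes on our ranked list were also elevated in children with severe RSV? A toy sketch, with entirely made-up gene sets (the real analysis found 56%):

```python
# Hypothetical gene sets for illustration only.
ranked_list_genes = {"IFIT1", "IRF7", "LILRB4", "IIGP1", "OAS1"}
elevated_in_severe_rsv = {"IFIT1", "IRF7", "OAS1", "MX1"}

# Genes on our list that were also higher in severe disease.
overlap = ranked_list_genes & elevated_in_severe_rsv
fraction = len(overlap) / len(ranked_list_genes)

print(f"{fraction:.0%} of listed genes elevated in severe RSV")
```

In this toy example 3 of the 5 listed genes overlap (60%); in the published study the figure was 56% of the list.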