Scientists love data. It is like flowers to florists, canvas
to artists, money to bankers or ingredients to chefs. It answers the questions
we have and sets the direction for new ones. From Mendel and his pea plants to
Darwin and his finches through Rosalind Franklin and her X-ray images of DNA crystals
to CERN and their atom-smashing tube thingy, the aim of experiments is to
generate data to answer questions. We invest considerable time and effort to
work out whether the data we have are true and representative of the whole, or an
unrepresentative subset caused by chance (statistics) or by the way we did the study
(experimental design). We often repeat the same experiment multiple times to
convince ourselves (and more importantly others) about the validity of our
data. Without data, we are just messing around in a white coat.
Too much data
So you would think the more data the better. However, you
can have too much of a good thing. Whereas before you would ask whether your
treatment increases or decreases a single factor, we can now measure thousands of
things in a single experiment, generating huge piles of data (datasets). In
biology, methods that generate large datasets are described as ‘omics. This is
named after the genome (all the genes that make up an organism). We now have
the transcriptome (all the mRNA – the messages that make proteins – at a
certain timepoint), the proteome (all the proteins), the metabolome (all of the
metabolites), the microbiome (all of the bacteria on or in the body) and the
gnomeome (the number of garden ornaments per square metre). Each technique
generates a long list of stuff that goes up or down after a certain treatment. These
long lists of data are where the problems arise, being composed of genes with
weird short names like IFIT1, LILRB4 and IIGP1, many of which have no known
function. All of which leads to a mountain of data languishing in supplemental
tables of half-read papers in obscure journals.
Biologist + computer = ???Xxx!!!
The surfeit of data has led to a whole new discipline, called
bioinformatics, dedicated to interpreting these lists. But bioinformatics requires
special skills, knowledge of the mythical ‘R’ programming language, access to
software tools with laborious jokey names based on forced acronyms like PICRUSt
(Phylogenetic Investigation of Communities by Reconstruction of Unobserved
States) and time. Faced with these datasets, I get a bit flustered: like many
biologists, I type with 2 fingers, get nervous flushes if someone mentions
Linux and can just about use Excel to add two numbers together. This is a
problem because it means that there is a wealth of data out there that is
inaccessible to me.
Bioinformatics for dummies
I am interested in how the body fights off viral infections
in the lungs, particularly a virus called Respiratory Syncytial Virus (RSV).
Part of the body’s defences is a family of proteins that restrict viruses’
ability to hijack our cells to make copies of themselves. There are a lot of
these proteins, and for many of them we have no idea how they work. A brief look
at some of the ‘omics studies reveals long lists of these proteins, with no
insight as to what they do. There are probably clever, but inaccessible, AI-based
algorithms that can search for all the relevant papers and compile them
somehow; but I wouldn’t know how to use them or even where to start looking.
Instead we used a ‘brute force’ approach, which meant that I/we/Jaq (first
author on the paper) sat down and searched for every paper ever published on
RSV that contained a big dataset. Having found the papers, we then harvested the
gene lists from them. This was not trivial: some of the papers had to be
ignored because they had data locked behind paywalls, or were in
Chinese, or were just rubbish papers, or a combination of the three. But we were
left with gene lists from 33 papers and stuck all the data in a big pile. At
this point we employed the services of Derek, a bona fide bioinformatician, who
through some computer wizardry wrote us a piece of software called geneIDs (which
is freely available here, if you need such a thing), which handily counts and
ranks the genes. This gave us a brand new list built from all the other lists
(a sort of meta-analysis), which can then be used as the basis for further
analysis. Which we did, publishing the results here.
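For the curious, the counting and ranking step boils down to something like the little Python sketch below. This is my illustration of the general idea, not Derek's actual geneIDs code, and the gene lists are invented: tally how many of the studies each gene appears in, then sort by that tally.

    # A rough sketch of the counting-and-ranking idea (not the real geneIDs tool;
    # the gene lists below are made up for illustration).
    from collections import Counter

    # One list of gene names per published study
    study_gene_lists = [
        ["IFIT1", "LILRB4", "IRF7"],   # study 1
        ["IFIT1", "IRF7", "IIGP1"],    # study 2
        ["IRF7", "OAS1", "IFIT1"],     # study 3
    ]

    # Count each gene once per study it appears in
    counts = Counter()
    for genes in study_gene_lists:
        counts.update(set(genes))

    # Rank genes by how many studies reported them, most frequent first
    for gene, n_studies in counts.most_common():
        print(gene, n_studies)

Genes near the top of such a ranking have been reported again and again across independent studies, which is what makes them interesting candidates for follow-up.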
More data: better tools
First of all we compared our computer-generated list to some
new data from a clinical study. Children with severe RSV had higher levels of 56% of the genes on our list. This supports the approach, demonstrating that the
genes are important during infection. Taking a subset of these genes, we then
performed experiments showing that they are able to reduce RSV’s ability to
infect cells and animals. In particular we demonstrated that a gene called IRF7
was central to the anti-RSV response. So ultimately the answer to the question
‘can you have too much data?’ is no, but there is a need for tools to interpret
it. In the current study we developed one such tool, which we feel is more
accessible to biologists with little to no computer skills.