Whenever someone with a background in bioinformatics asks about my approach to
gene enrichment, the most common question is probably, “What are your cutoffs?”
Specifically, when entering lists of transcripts/proteins/micro-RNAs (whatever…call
them “genes”) into the database, what is the lowest level of significance
required for a gene to make the cut? Or, perhaps, what is the weakest acceptable
fold-change?
My answer: we don’t have strict cutoffs. Most typically, data are sorted
according to some criterion (e.g. significance), and the top 200 up-regulated and
top 200 down-regulated genes are both entered into the database.
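As a rough illustration of this rank-based selection, here is a minimal Python sketch. The record layout (gene, log2 fold-change, P-value), the function name `top_regulated`, and the choice to rank by P-value are assumptions made for the example; this is not the actual database pipeline.

```python
def top_regulated(records, n=200):
    """Return (up, down): the n best-ranked genes in each direction.

    records: iterable of (gene, log2_fold_change, p_value) tuples.
    Ranking is by ascending P-value; no significance cutoff is applied.
    """
    records = list(records)
    up = sorted((r for r in records if r[1] > 0), key=lambda r: r[2])
    down = sorted((r for r in records if r[1] < 0), key=lambda r: r[2])
    return [r[0] for r in up[:n]], [r[0] for r in down[:n]]


# Hypothetical toy data: (gene, log2 fold-change, P-value)
toy = [
    ("A", 2.0, 0.01),
    ("B", 1.5, 0.20),
    ("C", -1.0, 0.05),
    ("D", -3.0, 0.50),
    ("E", 0.5, 0.90),
]
up, down = top_regulated(toy, n=2)
print(up, down)  # ['A', 'B'] ['C', 'D']
```

Note that gene "B" makes the list with a P-value of 0.20: the rank decides, not a threshold.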
Some folks do not
appreciate this approach, preferring, for example, that only data with an
FDR-adjusted P-value < .05 be entered. However, this would exclude many
interesting datasets. If FDR < .05 were strictly enforced, any list that carries
no statistics at all, such as “tyrosine kinases” or “common contaminants in mass
spectrometry”, would be eliminated. So would the ranked (1, 2, 3) gene lists that
journals sometimes offer without fold-change or significance measures.
Most importantly, even
in studies where significance and/or fold-change can be measured, strict
cutoffs can cause the loss of very interesting results. Below, I invoke real
studies to convince skeptics that data that is “technically” insignificant can
be very interesting. There are two forms of evidence. In the first group,
a gene is targeted (e.g. by knockdown), yet fails to meet standard P-value
cutoffs. Nevertheless, this gene is the single most strongly altered entity in
the study when measured by mere fold-change. In the second group, a collection
of insignificantly altered genes under a particular experimental condition matches
up with extreme significance to another gene set in our database generated under a
similar experimental condition. For example, study A might treat cells with a drug vs.
control, with genes being altered at mathematically insignificant levels. Study
B, which is independent of A, uses a similar drug. We then find that despite A’s
lack of significance, studies A and B overlap with extreme significance (as
measured by Fisher’s exact test).
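For readers who want to see what such an overlap test involves, here is a minimal sketch in Python. It computes the one-sided Fisher's exact test P-value as a hypergeometric upper tail. The function name `fisher_overlap_p`, the toy gene names, and the assumption that both lists are drawn from one shared gene universe of known size are mine, for illustration only.

```python
from math import comb


def fisher_overlap_p(list_a, list_b, universe_size):
    """One-sided Fisher's exact test P-value for the overlap of two gene lists.

    This is the hypergeometric upper tail: the probability of seeing at
    least the observed overlap if list_b were drawn at random from a
    universe of universe_size genes that contains all of list_a.
    """
    a, b = set(list_a), set(list_b)
    k = len(a & b)  # observed overlap
    total = comb(universe_size, len(b))
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(k, min(len(a), len(b)) + 1)
    )
    return tail / total


# Hypothetical toy lists in a 10-gene universe
study_a = ["G1", "G2", "G3"]
study_b = ["G1", "G2", "G4"]
p = fisher_overlap_p(study_a, study_b, universe_size=10)
print(p)
```

Nothing in this calculation asks whether the genes in study A were individually significant; only list membership and the size of the universe enter the test.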
I’m not a
mathematician, and won’t delve deeply into the apparent over-conservatism of
standard adjustment methods (particularly Benjamini-Hochberg). Here’s one
argument, however, that is fairly intuitive: Let’s imagine a study with 100
genes up-regulated at an adjusted significance of .75, with 9900 other genes
being up-regulated at an adjusted significance of 1.0. The .75 figure, of
course, is “insignificant.” Nevertheless, an FDR of .75 means that roughly 75 of
those 100 genes are expected to be false positives, so about 25 may have “really”
been upregulated. You construct a gene list using those 100 genes. When applying
Fisher’s exact test against another list of data, an overlap driven by 25 “truly”
upregulated genes, versus the 1 or 2 shared genes you might expect after randomly
pulling 100 genes from the 10,000, can shift the significance enormously.
Basically, a list of weakly altered genes can match up extremely significantly
with other lists; mass effects in action.
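To put rough numbers on this thought experiment, here is a small Python sketch. The 10,000-gene universe and the 100-gene list come from the paragraph above. The size of the other list (200 genes) and the scenario in which all 25 “true” genes land in it are assumptions chosen purely for illustration.

```python
from math import comb, log10


def overlap_tail(k, list_size, ref_size, universe):
    """P(overlap >= k) when list_size genes are drawn at random from
    `universe` genes, of which ref_size belong to the reference list."""
    total = comb(universe, list_size)
    hits = sum(
        comb(ref_size, i) * comb(universe - ref_size, list_size - i)
        for i in range(k, min(list_size, ref_size) + 1)
    )
    return hits / total


universe, list_size = 10_000, 100  # from the thought experiment
ref_size = 200                     # assumed size of the other gene list

# Overlap expected from a purely random 100-gene list
expected = list_size * ref_size / universe
print("overlap expected by chance:", expected)  # 2.0

p_chance = overlap_tail(2, list_size, ref_size, universe)
p_signal = overlap_tail(25, list_size, ref_size, universe)
print("log10 P for an overlap of 2: ", log10(p_chance))
print("log10 P for an overlap of 25:", log10(p_signal))
```

Under these assumptions, an overlap of 2 is unremarkable, while an overlap of 25 falls many orders of magnitude below any conventional cutoff, even though the input list had an adjusted significance of only .75.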
Hopefully, the sum of examples below will be convincing. The examples are far from exhaustive...I merely collected them over the last few weeks upon off-handedly noticing (for the millionth time) that these insignificant "omics" results were actually very interesting.
Group 1
*In a GEO Dataset (GSE81399)
from the study, Targeting of
Mesenchymal Stromal Cells by Cre-Recombinase Transgenes Commonly Used to Target
Osteoblast Lineage Cells,
DMP1 should be overexpressed in “targeted” cells. It is, and in fact has the
greatest fold change (about 10X) of 22,000 transcripts. However, following FDR
adjustment, this alteration is insignificant (P = .67).
*In GSE83388, the gene KSRP is knocked down. After adjustment, the P-value
(vs scrambled siRNA) is .82. Nevertheless, KSRP has the second-greatest fold
change of 31,000 identified transcripts.
*In AKT isoforms modulate Th1-like Treg generation and function in
human autoimmune disease, IFN-G+ Tregs are separated from IFN-G- Tregs.
IFN-G itself fails to reach significance in IFN-G+ cells, though some other
transcripts are indeed significantly altered. Nevertheless, IFN-G has the
single-greatest fold-change of any of 31,742 transcripts.
*In Transcriptomic Analysis Unveils Correlations between Regulative
Apoptotic Caspases and Genes of Cholesterol Homeostasis in Human Brain,
CASP2 is knocked down. After adjustment, this knockdown is insignificant (P
= 1.0). Nevertheless, when more than 30,000 genes in the transcriptomic set are
ranked according to fold-change, CASP2 ranks #1.
*In Genome-Wide Analysis Identifies NURR1-Controlled Network of New
Synapse Formation and Cell Cycle Arrest in Human Neural Stem Cells, NURR1
is overexpressed. After adjustment, this overexpression is not significant
against controls. Nevertheless, NURR1 is the single most upregulated transcript
(of more than 30,000) in the study as measured by fold-change.
*In GSE175853, TAZ is knocked down. Relative to controls, the adjusted
P-value is .492. Nevertheless, of more than 20,000 transcripts, it ranks #2 in
terms of fold-change.
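The Group 1 check amounts to ranking every transcript by fold-change and looking up where the targeted gene lands, alongside its adjusted P-value. A minimal Python sketch follows. The gene names and values are hypothetical and not taken from any of the studies above, and ranking by absolute log2 fold-change is an assumption (one could equally rank within one direction only).

```python
def fold_change_rank(fold_changes, target):
    """1-based rank of `target` by absolute log2 fold-change (1 = largest)."""
    ranked = sorted(fold_changes, key=lambda g: abs(fold_changes[g]), reverse=True)
    return ranked.index(target) + 1


# Hypothetical values for illustration only
log2_fc = {"TARGET": -3.2, "G1": 1.1, "G2": -0.4, "G3": 2.5}
adj_p = {"TARGET": 0.82, "G1": 0.90, "G2": 1.0, "G3": 0.85}

rank = fold_change_rank(log2_fc, "TARGET")
print(f"rank {rank} of {len(log2_fc)}, adjusted P = {adj_p['TARGET']}")
# rank 1 of 4, adjusted P = 0.82
```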
Group 2
*In Work, meaning, and gene regulation: Findings from a Japanese
information technology firm (PMID 27434635, GSE79092), male workers were
scored according to numerous psychological parameters (e.g. hedonia) and blood
transcripts were examined. We chose to look at eudemonia. Despite the fact that
no single transcript was significantly altered in a comparison of high vs. low
eudemonia subjects, sorting according to fold-change and then searching for
datasets that strongly intersected this particular study generated some very
significant and interesting results. For example, transcripts downregulated in
high eudemonia subjects tended to be upregulated in responders to lithium
treatment (log(P) = -77) and in sleep deprivation subjects (-47).
Transcripts upregulated in high eudemonia subjects tended to be downregulated
in mice upon “polytrauma” (-21). It is thus difficult to argue that these
“insignificantly altered” transcripts are no different than randomly sorted
transcripts.
*Comparing grade II vs grade I breast lobular carcinoma (GSE88770), not
a single transcript showed an adjusted P-value less than 1.
Nevertheless, a list of the most downregulated transcripts in this comparison
best intersected with another breast cancer study (GSE49481). Specifically,
these downregulated transcripts intersected with transcripts downregulated in
invasive ductal carcinoma versus invasive lobular cancer at P = 10^-34.
*In GSE112943, the lupus lesional skin transcriptome is examined against
healthy skin. Despite no single gene rising to the level of significance after
adjustment, the database dataset that best matches upregulated transcripts in
this study is another study of lupus lesional skin: GSE72535. The best match to
downregulated transcripts in GSE112943 is found in another study of lesional
skin: GSE136757.
*In GSE79721, a novel BET inhibitor is applied to breast cancer cells.
Comparing these cells to controls, not a single transcript was altered at an
adjusted P-value of <1.0. Nevertheless, the single best overlapping
upregulated and downregulated datasets (with Fisher P-values of 10^-75
and 10^-54) both involved application of BET inhibitors to cells.
*In Knockdown of a novel lincRNA AATBC suppresses proliferation and
induces apoptosis in bladder cancer, bladder cancer tissue is compared to
adjacent tissue. After adjustment, no transcripts are significantly altered,
possibly because only two cancer and two control samples were examined.
Nevertheless, the database study that best overlaps transcripts upregulated in
this study (at P = 10^-18) is yet another comparison of bladder
cancer against adjacent tissue (GSE100926). On the downregulation side, two
other cancer studies (esophageal and colorectal) outcompeted GSE100926 for
significance.
*In GSE58591, a comparison of female vs male ES cells is made. Following
FDR adjustment, only 4 transcripts are significantly altered. Nevertheless,
Y-chromosome transcripts dominate the list of most strongly downregulated
transcripts, as measured by fold-change alone (e.g. DDX3Y, USP9Y, EIF1AY).