Thursday, July 15, 2021

On Cutoffs

Whenever an individual with some background in bioinformatics questions me on my approach to gene-enrichment, the most common question is probably, “what are your cutoffs?” Specifically, when entering lists of transcripts/proteins/micro-RNAs (whatever…call them “genes”) into the database, what is the lowest level of significance required for a gene to make the cut? Or, perhaps, what is the weakest fold-change? My answer: we don’t have strict cutoffs. Most typically, data is sorted according to some criterion (e.g. significance), and the top 200 up- and down-regulated portions are both entered into the database.
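The sort-and-take-the-top selection described above can be sketched roughly as follows. This is only an illustration, not the database's actual code: the `GeneStat` record, the `top_n_lists` function, and the reading of "top 200" as 200 genes per direction are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class GeneStat(NamedTuple):
    """Hypothetical per-gene record; field names are assumptions."""
    gene: str
    log2_fc: float   # positive = up-regulated, negative = down-regulated
    p_value: float   # sorting criterion (significance is only one option)


def top_n_lists(stats: List[GeneStat], n: int = 200) -> Tuple[List[str], List[str]]:
    """Return (up, down) gene lists: the n most significant genes in each direction.

    No significance or fold-change cutoff is applied; genes are only ranked.
    """
    up = sorted((g for g in stats if g.log2_fc > 0), key=lambda g: g.p_value)
    down = sorted((g for g in stats if g.log2_fc < 0), key=lambda g: g.p_value)
    return [g.gene for g in up[:n]], [g.gene for g in down[:n]]
```

Sorting by fold-change instead of significance would only require changing the `key` function.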

Some folks are not appreciative of this approach, preferring, for example, that only data with an FDR-adjusted P-value < .05 be entered. However, this would mean that many interesting datasets would be excluded. If FDR < .05 is strictly employed, all lists that carry no significance measure at all, such as “tyrosine kinases” or “common contaminants in mass spectrometry”, would be eliminated. The same goes for the ranked (1,2,3) gene lists that journals sometimes offer without fold-change or significance measures.

Most importantly, even in studies where significance and/or fold-change can be measured, strict cutoffs can cause the loss of very interesting results. Below, I invoke real studies to convince skeptics that data that is “technically” insignificant can be very interesting. There are two forms of evidence. In the first group, a gene is targeted (e.g. by knockdown), yet fails to meet standard P-value cutoffs. Nevertheless, this gene is the most strongly altered entity in the study, or very nearly so, when measured by mere fold-change. In the second group, a collection of insignificantly altered genes under a particular experimental condition matches up with extreme significance to another gene set in our database under a similar experimental condition. For example, study A might treat cells with a drug vs. control, with genes being altered at mathematically insignificant levels. Study B, which is independent of A, uses a similar drug. We then find that despite A’s lack of significance, studies A and B overlap with extreme significance (as measured by Fisher’s exact test).

I’m not a mathematician, and won’t delve deeply into the apparent over-conservatism of standard adjustment methods (particularly Benjamini-Hochberg). Here’s one argument, however, that is fairly intuitive: Let’s imagine a study with 100 genes up-regulated at an adjusted significance of .75, with 9900 other genes being up-regulated at an adjusted significance of 1.0. The .75 figure, of course, is “insignificant.” Nevertheless, .75 also tells you that 25 of those 100 genes may have “really” been upregulated. You construct a gene list using those 100 genes. When applying Fisher’s exact test against another list of data, 25 “truly” upregulated genes, versus 1 or 2 that you might expect by randomly pulling 100 genes from the 10,000, can result in huge alterations in significance. Basically, a list of weakly altered genes can match up extremely significantly with other lists; mass effects in action.
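To put rough numbers on this thought experiment, here is a minimal sketch of a one-sided Fisher's exact test on list overlap, computed as a hypergeometric upper tail with only the Python standard library. The second list of 100 genes, and the scenario in which all 25 "truly" upregulated genes happen to appear in it, are hypothetical choices for illustration; `overlap_p_value` is not a function from any particular package.

```python
from math import comb


def overlap_p_value(universe: int, size_a: int, size_b: int, overlap: int) -> float:
    """One-sided Fisher's exact test for the overlap of two gene lists.

    Returns P(overlap >= observed) when list B (size_b genes) is drawn at
    random from a universe that contains list A (size_a genes).
    """
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Assumed numbers: 10,000 genes, our list of 100, another list of 100.
p_chance = overlap_p_value(10_000, 100, 100, 2)     # overlap typical of random lists
p_enriched = overlap_p_value(10_000, 100, 100, 25)  # 25 shared genes
print(p_chance, p_enriched)
```

With these assumed sizes, the expected overlap between two random lists is about one gene, so an overlap of 2 is unremarkable while an overlap of 25 yields a vanishingly small P-value, even though no individual gene in the first list was "significant."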

Hopefully, the sum of examples below will be convincing. The examples are far from exhaustive...I merely collected them over the last few weeks upon off-handedly noticing (for the millionth time) that these insignificant "omics" results were actually very interesting.

Group 1

*In a GEO Dataset (GSE81399) from the study, Targeting of Mesenchymal Stromal Cells by Cre-Recombinase Transgenes Commonly Used to Target Osteoblast Lineage Cells, DMP1 should be overexpressed in “targeted” cells. It is, and in fact has the greatest fold change (about 10X) of 22,000 transcripts. However, following FDR adjustment, this alteration is insignificant (P = .67).

*In GSE83388, the gene KSRP is knocked down. After adjustment, the P-value (vs scrambled siRNA) is .82. Nevertheless, KSRP has the second-greatest fold change of 31,000 identified transcripts.

*In AKT isoforms modulate Th1-like Treg generation and function in human autoimmune disease, IFN-G+ Tregs are separated from IFN-G- Tregs. IFN-G itself fails to reach significance in IFN-G+ cells, though some other transcripts are indeed significantly altered. Nevertheless, IFN-G has the single-greatest fold-change of any of 31,742 transcripts.

*In Transcriptomic Analysis Unveils Correlations between Regulative Apoptotic Caspases and Genes of Cholesterol Homeostasis in Human Brain, CASP2 is knocked down. After adjustment, this knockdown is insignificant (P = 1.0). Nevertheless, when more than 30,000 genes in the transcriptomic set are ranked according to fold-change, CASP2 ranks #1.

*In Genome-Wide Analysis Identifies NURR1-Controlled Network of New Synapse Formation and Cell Cycle Arrest in Human Neural Stem Cells, NURR1 is overexpressed. After adjustment, this overexpression is not significant against controls. Nevertheless, NURR1 is the single most upregulated transcript (of more than 30,000) in the study as measured by fold-change.

*In GSE175853, TAZ is knocked down. Relative to controls, the adjusted P-value is .492. Nevertheless, of more than 20,000 transcripts, it ranks #2 in terms of fold-change.
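One way to see how a targeted gene can end up with an adjusted P-value near 1 is to run the Benjamini-Hochberg adjustment on a toy dataset. The sketch below is a plain-Python version of the standard BH step-up adjustment. The data are invented for illustration and come from none of the studies above: one gene with a raw P-value of .001 among 20,000 transcripts whose P-values are spread evenly between 0 and 1, as they would be if nothing else were changing.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted P-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical data: index 0 is the targeted gene; the rest look like pure noise.
m = 20_000
raw = [0.001] + [j / m for j in range(1, m)]
adj = benjamini_hochberg(raw)
print(adj[0])  # adjusted P-value of the targeted gene
```

In this toy setting the targeted gene's raw P-value of .001 is multiplied by the number of tests divided by its rank, which lands it above .9 after adjustment. Its fold-change rank, which BH never looks at, is unaffected.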

Group 2

*In Work, meaning, and gene regulation: Findings from a Japanese information technology firm (PMID 27434635, GSE79092), male workers were scored according to numerous psychological parameters (e.g. hedonia) and blood transcripts were examined. We chose to look at eudemonia. Despite the fact that no single transcript was significantly altered in comparison of high vs. low eudemonia subjects, sorting according to fold-change and then searching for datasets that strongly intersected this particular study generated some very significant and interesting results. For example, transcripts downregulated in high eudemonia subjects tended to be upregulated in responders to lithium treatment (log(P) = -77) and in sleep deprivation subjects (-47). Transcripts upregulated in high eudemonia subjects tended to be downregulated in mice upon “polytrauma” (-21). It is thus difficult to argue that these “insignificantly altered” transcripts are no different than randomly sorted transcripts.

*Comparing grade II vs grade I breast lobular carcinoma (GSE88770), not a single transcript showed an adjusted P-value less than 1. Nevertheless, a list of the most downregulated transcripts in this comparison best intersected with another breast cancer study (GSE49481). Specifically, these downregulated transcripts intersected with transcripts downregulated in invasive ductal carcinoma versus invasive lobular cancer at P = 10^-34.

*In GSE112943, the lupus lesional skin transcriptome is examined against healthy skin. Despite no single gene rising to the level of significance after adjustment, the database dataset that best matches upregulated transcripts in this study is another study of lupus lesional skin: GSE72535. The best match to downregulated transcripts in GSE112943 is found in another study of lesional skin: GSE136757.

*In GSE79721, a novel BET inhibitor is applied to breast cancer cells. Comparing these cells to controls, not a single transcript was altered at an adjusted P-value of <1.0. Nevertheless, the single best overlapping upregulated and downregulated datasets (with Fisher P-values of 10^-75 and 10^-54) both involved application of BET inhibitors to cells.

*In Knockdown of a novel lincRNA AATBC suppresses proliferation and induces apoptosis in bladder cancer, bladder cancer tissue is compared to adjacent tissue. After adjustment, no transcripts are significantly altered, possibly because only two cancer and two control samples were examined. Nevertheless, the database study that best overlaps transcripts upregulated in this study (at P = 10^-18) is yet another comparison of bladder cancer against adjacent tissue (GSE100926). On the downregulation side, two other cancer studies (esophageal and colorectal) outcompeted GSE100926 for significance.

*In GSE58591, a comparison of female vs male ES cells is made. Following FDR adjustment, only 4 transcripts are significantly altered. Nevertheless, Y-chromosome transcripts dominate the list of most strongly downregulated transcripts, as measured by fold-change alone (e.g. DDX3Y, USP9Y, EIF1AY).

 


whatismygene.com 

