Saturday, August 13, 2022

Thinking about ORA and FCS

There are quite a few papers out there on the subject of gene enrichment, comparing various methods. The two most common methods are termed ORA (over-representation analysis) and FCS (functional class scoring). Without quibbling, WhatIsMyGene (WIMG) would be termed ORA: the user inputs a list of genes, which needn't be ranked, and a database is searched for the lists that most significantly match up with the input list. With FCS, the user can input a ranked list without needing to make a decision about where a cutoff must be made. You could even input the entire list of genes identified in your study, as long as you've ranked them according to some criterion.

A lot of papers suggest the FCS approach is superior. The main argument involves the arbitrariness of the ORA input list cutoff. Why toss all genes with P-values that are just a tad greater than 0.05? Even on this ORA-loving blog, we've questioned this practice, and routinely see very interesting results involving genes or gene lists that didn't meet the standard 0.05 cutoff for significance.

Now, let me quibble for a moment, and then continue with the main subject: in WIMG, ranking a user input list can be helpful. That's because 1) most of the lists in our database are ranked and 2) the user can select a desired length for these lists. In other words, you could choose to examine only the top 25 genes within all our ranked lists. See here for more details. If the genes at the top of your own list are indeed more important than those at the bottom, you'll find that significance doesn't improve when you move from the "top 25" option to the "top 100" option; the added genes mostly dilute the signal. Thus, to some extent, WIMG negates complaints about the arbitrariness of cutoffs in ORA. Ultimately, the important issue is whether the use of ORA vs FCS (or vice versa) screws up a biologist's chances of making an interesting observation, and I can't say I've seen compelling arguments either way. That's one reason why WIMG's prime focus is on the database side, not on development of tricky algorithms.*

To continue: I've previously blogged about problems with the usage of "Gene Ontology" (GO) lists (and their brethren, KEGG, Reactome, etc.; let's call them "GO-like"). WIMG attempts to alleviate these issues: see here. Basically, we find that many of these GO-like lists consist of genes that are strongly biased in terms of abundance. Your own gene list may have been drawn from a universe of 20,000 genes, but some GO-like lists appear to be drawn from a much smaller universe. This creates issues when you attempt to use Fisher's exact test (or the hypergeometric test, or whatever), which requires you to have a decent idea of the joint background behind the two lists in question. GO-like lists don't come with backgrounds!**
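To make the background problem concrete, here's a minimal Python sketch (scipy's fisher_exact, with made-up numbers) showing how the very same overlap looks either astonishing or unremarkable depending on the universe you assume:

from scipy.stats import fisher_exact

def overlap_p(overlap, len_l, len_g, universe):
    # Build the 2x2 contingency table behind Fisher's exact test.
    table = [[overlap, len_l - overlap],
             [len_g - overlap, universe - len_l - len_g + overlap]]
    return fisher_exact(table, alternative="greater")[1]

# Hypothetical: a 200-gene list L, a 500-gene list G, 50 genes shared.
print(overlap_p(50, 200, 500, 20000))  # assumed universe of 20,000: astronomical P
print(overlap_p(50, 200, 500, 2000))   # G's true universe of 2,000: an overlap of 50
                                       # is exactly what chance predicts, so P ~ 0.5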

A related question (or, perhaps, precisely the same question...I'm not clear) regarding GO-like lists is this: if you tag abundance figures on all the genes in a GO-like list, do these abundance figures conform to a normal curve?

So, ORA can spit out deceptive P-values when comparing your list against GO-like lists. Could FCS have similar issues?

Sure.

First, a crude understanding of the innards of FCS is useful. The basic algorithm is not horribly difficult, despite intimidating-looking equations in the underlying papers. There's your own ranked list, L, and a GO-like list, G. If the first gene in L is found in G, you write down a positive score (like +1, or +3.7, or whatever). If not, write down a negative score (e.g. -1). Go to the second gene in L. If it's found in G, add to the score you just wrote down; otherwise, subtract from the previous score. Continue like this until you reach the bottom of L. If L intersects nicely with G, you'll reach a peak score value, after which the score begins to decline. If L doesn't intersect with G, you'll just get a zig-zag line or a steady decline, instead of a mountain peak (or valley, if you're looking at downregulated genes in L). At this point, you want to assign a P-value to the result. This is done by randomly re-ranking the genes in your list and rescoring ad nauseam. If there's an interesting match between L and G, there should be few or no occasions where random sorting results in a more significant score than the initial non-randomized score. The P-value comes from comparing the number of times you randomly derive a superior score against the number of times you performed the randomization. If I've got it right, you can also derive a P-value analytically, Kolmogorov-Smirnov style, and bypass the randomization step.
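For the curious, here's my own bare-bones Python rendition of that walk (tracking upregulation only, for simplicity; real implementations like GSEA weight and normalize the steps, but the skeleton is the same):

import random

def running_sum_score(ranked_l, g, hit_step=1.0, miss_step=1.0):
    # Walk down the ranked list: step up on a hit in G, down on a miss.
    # Return the highest point the running sum reaches (the "peak").
    g = set(g)
    score = peak = 0.0
    for gene in ranked_l:
        score += hit_step if gene in g else -miss_step
        peak = max(peak, score)
    return peak

def permutation_p(ranked_l, g, n_perm=1000):
    # Empirical P-value: how often does a random re-ranking match or
    # beat the observed peak?
    observed = running_sum_score(ranked_l, g)
    shuffled, beats = list(ranked_l), 0
    for _ in range(n_perm):
        random.shuffle(shuffled)
        if running_sum_score(shuffled, g) >= observed:
            beats += 1
    return (beats + 1) / (n_perm + 1)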

Now, let's imagine a case where L is drawn from a universe of 20,000 genes, but G is drawn from a universe of 2,000 genes. To my thinking, that means there could be 9 interesting genes in L for every one gene in G. Those 90% of genes, which could be very important (in cancer, for example), will subtract from the running score whenever seen. If, on the other hand, both L and G are drawn from the same universe of 2,000 genes, you'll get a much higher score and a more significant P-value. In other words, you get rewarded for entering lists of genes that are biased toward abundance. You can tinker with this in Excel, using manufactured lists L and G; the "match" function is useful. (Or see the Python sketch after the next paragraph.)

The above issue could be solved, I believe, by properly weighting the negative values. If L's background is 10X larger than G's, you could add 1 for a positive match, but subtract 0.1 for a non-match. In my own Excel tinkering, this seemed to work; you can get the same peak score for the 2000/2000 case as for the 20000/2000 case if you adjust the scoring system correctly. But nobody does this, of course. That's because, again, nobody (except us) assigns backgrounds to GO-like lists. Previously, we showed that it's not difficult to derive an approximate background for these lists.
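Here's that Excel exercise redone in Python, reusing running_sum_score from the sketch above (a toy with made-up gene names; exactly how much of the peak you recover depends on where the extra genes land in the ranking):

# G holds 100 genes, and the top 100 genes of L are all hits. In the
# 2000/2000 case, L is 500 genes from G's own universe. In the 20000/2000
# case, nine "alien" genes that G's universe can't contain follow each one.
g = {f"g{i}" for i in range(100)}
l_small = [f"g{i}" for i in range(500)]
l_big = []
for i in range(500):
    l_big.append(f"g{i}")
    l_big.extend(f"alien{i}_{j}" for j in range(9))

print(running_sum_score(l_small, g))               # peak = 100.0
print(running_sum_score(l_big, g))                 # peak = 1.0: signal crushed
print(running_sum_score(l_big, g, miss_step=0.1))  # peak = 10.9: a weighted miss
                                                   # claws much of the signal back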

One problem I see with FCS is the fact that the P-value is derived from a permutation test, meaning you'll never see extreme P-values (i.e. your computer will probably explode after, say, a quadrillion re-assortments of your gene list). Maybe I'm wrong, but I'm thinking Kolmogorov-Smirnov won't save you either...the test poops out for crazy P-values.*** When comparing your results across 1,000 lists G, you really do want an output where the absolute craziest P's are allowed to bubble to the top. Or, to put it another way, a P of 10^-100 is 10^75 times more extreme than a P of 10^-25. Is that not important?
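The floor is easy to state: with N permutations and zero random wins, the standard empirical estimate is (0 + 1)/(N + 1), so the reported P can never dip below roughly 1/N. A quick illustration:

# Even a quadrillion re-assortments can't report anything past ~1e-15.
for n_perm in (1_000, 1_000_000, 10**15):
    print(f"{n_perm:,} permutations -> P floor = {1 / (n_perm + 1):.1e}")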

Another weakness of GO-like lists (unlike WIMG lists 😄) that we haven't addressed before is the fact that they're generally unranked. I would think that the FCS folks could find a way to make their approach all the more powerful by utilizing ranked G's, not just ranked L's. Again, though, it's a bit difficult to rank genes when you derive them by text-mining. If you ranked them merely by mentions, I'm guessing you'd find abundant genes strongly over-represented at the top of your lists. Nevertheless, even in a "concrete" list of genes like those involved in the Krebs cycle, some genes are more vulnerable to perturbation than others. Some are more dispensable than others. With some thought, ranking could be done.

A final issue (for now) with FCS is this: sometimes two lists don't intersect at all, and that absence of overlap is itself significant. This most often happens when G and L are large relative to the background (e.g. G=1000, L=1000, the background = 10000, and there's no intersection whatsoever between the lists). I don't see FCS tools dealing with this potentially interesting prospect. Any tool that utilizes Fisher's exact test or the hypergeometric test, however, should be able to deal with this situation without a hitch.
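A sketch of that exact scenario, again leaning on scipy (the "less" tail asks how surprising it is to see fewer hits than expected):

from scipy.stats import fisher_exact

# G = 1000, L = 1000, background = 10000, zero overlap. Chance alone
# predicts ~100 shared genes, so seeing none is wildly significant.
table = [[0, 1000],
         [1000, 8000]]
print(fisher_exact(table, alternative="less")[1])  # on the order of 1e-48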

At some point, we'll write a paper, making WIMG entirely "respectable." We'll need to show that we are, in some respects, better than other tools. My approach will not be math-dominant. Instead, it will be biology-dominant. What I mean here is this: if you input a list of genes up-regulated in mouse brains when you apply drug X, will the output involve mice? Brains? Bonus points if the tool in question reflects back to you that you applied drug X. Such results give the user confidence that more "remote" results (e.g. genes upregulated in a human neural cell line on knockdown of gene ABC) that nevertheless intersect the mouse list with high significance are indeed relevant to whatever questions the biologist is asking.****

*Let's assume, for a second, that a paper comes out that convinces me of the superiority of FCS. It's really not a big deal. I can just add the vastly superior algorithm to the site. A good programmer (not me) could probably do this in a matter of hours or days, especially if an R package containing the nitty-gritty code is available. On the other hand, nobody is going to duplicate our underlying database in a short period of time. It's really big.

**Just to be clear, this problem is not solved by tools that ask you to enter the genes that comprise your background, in addition to your list L. The problem to which I refer lies on the side of the lists G, which have unknown backgrounds.

***I do note that GOrilla, a tool that'll use an FCS approach if you plug in a single list L (as opposed to the list L plus a background list B), can spit out P-values on the order of 10^-15 (I got that figure by grabbing a GO list and plugging it into the tool...it damn well better output a very nice P)*****. I'm not sure how that works. On the other hand, I was surprised to find that the online version of GSEA actually invokes a hypergeometric test, allowing for the crazy P-values that you may find. Why would the pioneers of FCS use the hypergeometric approach? It's probably because permutation testing demands a lot of processing. To perform a single L/G comparison, you've got to do a lot of re-assortment; the more the merrier if you're looking for extreme P-values. Then you move on to the next list G2, and repeat, and then G3, etc.

****Some folks might actually complain that the output reflects right back to you what you put in. In the case of WIMG, you can eliminate that complaint. You could, for example, require that only human lists are examined against your mouse list, or that only knockout studies are examined. Or, perhaps, the complainer wishes that only big, broad pathways that are presumably universal to all cell types are output...in that case you can choose the "external lists" within the "Cell Types" box on the left side of the page, and you'll only receive GO-like lists as output. Don't be disappointed by the not-so-significant P-values that may be generated, however...that's what happens when these lists are adjusted to account for backgrounds that are often quite small.

*****When we plug the same list into our own software, we get a P-value of 10^-96. Intuitively, what sort of P-value do you expect when two lists of about 100 elements and a background of 1000 happen to intersect perfectly?
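A rough answer, for anyone who wants a number: under the hypergeometric, a perfect intersection has probability one over the number of ways to choose the list from the background.

from math import comb, log10

# Two lists of 100 drawn from a background of 1,000, intersecting
# perfectly: P = 1 / C(1000, 100).
print(log10(comb(1000, 100)))  # ~139.8, so P is on the order of 10^-140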


whatismygene.com 


Friday, August 12, 2022

Transcripts that are and are not Perturbed in Cancer: Part 2

Previously, we examined transcripts that are and are not perturbed in cancer. This was done by comparing all cases where transcripts were up or down-regulated in cancer tissue against the entirety of non-cancer studies (that involve some sort of perturbation) in our database. Using this large non-cancer dataset, we can derive some very significant P-values, which we like. We did, however, mention some weaknesses with this approach and suggested a different approach for a future post. Here's the future post.

This time, we grabbed 160 studies that involved comparison of cancer tissue to adjacent, healthy tissue and looked for genes that were never/rarely upregulated or downregulated in cancer tissue. There are also weaknesses in this approach. The relatively small dataset means we won't see extreme P-values. Also, since every gene in this dataset is perturbed in one direction or the other, we can't search for genes that are rarely perturbed at all (in either direction), as we did before. Finally, note that we only examine solid tissue cancers here, where a comparison between diseased and healthy tissue can be made.

Jumping into it, what are the genes that are rarely downregulated in cancer? TOP2A leads the list, being upregulated in 52 studies, and downregulated in only 2. The P-value here is about 10^-14. Next up is COL1A1, which was upregulated 41 times, and never downregulated. ASPM, CTHRC1, MMP11, SPP1, CENPF, CDH3, CDKN3, and NEK2 follow.
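For reference, a simple sign test gets you into this neighborhood (my back-of-envelope guess at the calculation, not necessarily the exact test we run):

from scipy.stats import binomtest

# TOP2A: up in 52 studies, down in 2. If direction were a coin flip,
# the one-sided sign test gives ~8e-14, roughly the quoted order.
print(binomtest(52, n=54, p=0.5, alternative="greater").pvalue)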

This list of genes rarely downregulated in cancer (dbase ID 146502203, available on our next database update) corresponds nicely to genes that are most commonly up-regulated in cancer in the broader approach we used in our initial post: P = 10^-230. We also constructed a list of genes that were NEVER seen to be downregulated in the cancer vs adjacent studies we have on hand (dbase ID 146503205). Naturally, this list overlaps quite strongly with genes rarely downregulated in cancer (which includes the "nevers").

We'll examine possible treatment approaches that would downregulate the above genes in a future post. For now, we'll point out that well-known cancer treatments are prominent as downregulators: cdk4/6 inhibitors, bromodomain inhibitors, mek1/2 inhibitors, etc.

How about genes that are rarely upregulated in cancer (dbase ID 146503204)? Here, ADH1B leads the way, downregulated in 52 studies and upregulated in only 2 (which, coincidentally, is the exact opposite of the pattern we see with TOP2A). Next, we see ASPA, DPT, CFD, CXCL12, MT1M, ABCA8, FAM107A, ADH1C, and C7. ASPA is the first gene to never be upregulated in cancer, having 40 examples of downregulation without a single case of upregulation. Again, we also have a strongly overlapping list of genes that are never upregulated in cancer (146504206). And again, these new lists correspond very strongly (P<10^-200) to lists constructed using the broader approach.

Briefly, what are the treatments that might upregulate genes that are rarely upregulated in cancer? MAPK and tyrosine kinase inhibitors are seen, as well as BMP2 treatment, and IL17A antagonists. 

Are there "natural" or "lifestyle" approaches that might tend to downregulate genes upregulated in cancer? We'll look at that in the future. Initially, we're both relieved and disappointed that well-known cancer treatments come to the fore when we try, bioinformatically, to reverse cancer trends. Such a result strongly validates our approach, no? On the other hand, we'd be happy to see some less obvious approaches emerge. As we've mentioned before, treating the primary cancer may promote metastasis, while treating the metastasis may enhance primary cancer growth. Eyeballing the data, abetting metastasis seems to be more of a concern when upregulating genes that are downregulated in cancer (e.g. with MAPK inhibitors), rather than downregulating the upregulated genes.

We also generated a more obscure list (146504207). This time, we wanted to find genes that were seen to be upregulated in some studies, and downregulated in others. We required that no such gene show a significant (P < .05) skew toward either direction, and that the gene be seen in at least 16 studies. AKR1B10 is the winner here, upregulated in 13 studies and downregulated in 14. IGFBP3, ANXA3, NNMT, AZGP1, TACSTD2, CXCL2, DACH1, NEBL and TDO2 follow. When we run this list, consisting of a mere 30 members, against all the lists in our database, nothing of significance emerges. Despite the weak significance, it does appear that studies involving drug resistance percolated to the top, and a quick Google search indeed reveals studies where some of the above genes are implicated in resistance. For example, AKR1B10 induces resistance to daunorubicin in at least one study. In addition to relevance to resistance, one could speculate that cancer types or cancer subsets could hinge on these sorts of genes: if IGFBP3 is upregulated, it might be worthwhile to downregulate it, and vice-versa.
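In code, the filter looks something like this (a hedged sketch of the criteria as described, not our actual database query):

from scipy.stats import binomtest

def is_bidirectional(n_up, n_down, min_studies=16, alpha=0.05):
    # Keep genes seen often enough whose direction of perturbation is
    # not significantly skewed either way (two-sided sign test).
    if n_up + n_down < min_studies:
        return False
    return binomtest(n_up, n=n_up + n_down, p=0.5).pvalue >= alpha

print(is_bidirectional(13, 14))  # AKR1B10: up 13, down 14 -> True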


whatismygene.com 

