There are quite a few papers out there on the subject of gene enrichment, comparing various methods. The two most common approaches are termed ORA (over-representation analysis) and FCS (functional class scoring). Without quibbling, WhatIsMyGene (WIMG) would be termed ORA: the user inputs a list of genes, which needn't be ranked, and a database is searched for the lists that most significantly match up with the input list. With FCS, the user can input a ranked list without needing to decide where a cutoff must be made. You could even input the entire list of genes identified in your study, as long as you've ranked them according to some criterion.
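For the uninitiated, the skeleton of an ORA engine looks something like this (a toy sketch, not WIMG's actual code; the database contents and gene names here are invented for illustration):

```python
from scipy.stats import hypergeom

def ora_pvalue(user_genes, db_genes, universe_size):
    """One-sided P(overlap >= observed) under the hypergeometric null."""
    user, db = set(user_genes), set(db_genes)
    k = len(user & db)
    # sf(k - 1) gives P(X >= k)
    return hypergeom.sf(k - 1, universe_size, len(db), len(user))

# Toy database; a real engine scans thousands of stored lists.
database = {
    "apoptosis-ish": ["TP53", "BAX", "CASP3", "BCL2"],
    "glycolysis-ish": ["HK1", "PFKM", "PKM", "ENO1"],
}
my_genes = ["TP53", "BAX", "CASP3", "MYC"]

results = sorted((ora_pvalue(my_genes, g, 20000), name)
                 for name, g in database.items())
print(results)  # most significant match first
```

Real tools run this against thousands of stored lists and then correct for multiple testing, but the core really is that simple.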
A lot of papers suggest the FCS approach is superior. The main argument involves the arbitrariness of the ORA input list cutoff. Why toss all genes with P-values that are just a tad greater than 0.05? Even on this ORA-loving blog, we've questioned this practice, and routinely see very interesting results involving genes or gene lists that didn't meet the standard 0.05 cutoff for significance.
Now, let me quibble for a moment, and then continue with the main subject: in WIMG, ranking a user input list can be helpful. That's because 1) most of the lists in our database are ranked, and 2) the user can select a desired length for these lists. In other words, you could choose to examine only the top 25 genes within all our ranked lists. See here for more details. If the genes at the top of your own list are indeed more important than those at the bottom, you'll notice that significance doesn't increase when you choose the "top 100" option over "top 25". Thus, to some extent, WIMG negates complaints about the arbitrariness of cutoffs in ORA. Ultimately, the important issue is whether the use of ORA vs FCS (or vice versa) screws up a biologist's chances of making an interesting observation, and I can't say I've seen compelling arguments either way. That's one reason why WIMG's prime focus is on the database side, not on the development of tricky algorithms.*
To continue: I've previously blogged about problems with the usage of "Gene Ontology" (GO) lists (and their brethren, KEGG, Reactome, etc.; let's call them "GO-like"). WIMG attempts to alleviate these issues: see here. Basically, we find that many of these GO-like lists contain genes that are strongly biased in terms of abundance. Your own gene list may have been drawn from a universe of 20,000 genes, but some GO-like lists appear to be drawn from a much smaller universe. This creates issues when you attempt to use Fisher's exact test (or the hypergeometric, or whatever), which requires you to have a decent idea of the joint background behind the two lists in question. GO-like lists don't come with backgrounds!**
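To make the problem concrete, here's the same overlap scored under two different background assumptions (a toy calculation, assuming scipy; the numbers are invented):

```python
from scipy.stats import hypergeom

# Suppose 15 of your 100 genes land in a 200-gene GO-like list.
overlap, L_size, G_size = 15, 100, 200

# Assuming both lists come from a 20,000-gene universe: stunning.
p_naive = hypergeom.sf(overlap - 1, 20000, G_size, L_size)

# If G was really drawn from a 2,000-gene universe: unremarkable.
p_real = hypergeom.sf(overlap - 1, 2000, G_size, L_size)

print(p_naive, p_real)  # orders of magnitude apart
```

Same lists, same overlap; the only thing that changed is the assumed universe. That's the whole complaint in two print statements.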
A related question (or, perhaps, precisely the same question...I'm not clear) regarding GO-like lists is this: if you tag abundance figures on all the genes in a GO-like list, do these abundance figures conform to a normal curve?
So, ORA can spit out deceptive P-values when comparing your list against GO-like lists. Could FCS have similar issues?
Sure.
First, a crude understanding of the innards of FCS is useful. The basic algorithm is not horribly difficult, despite the intimidating-looking equations in the underlying papers. There's your own ranked list, L, and a GO-like list, G. If the first gene in L is found in G, you write down a positive score (like +1, or +3.7, or whatever). If not, write down a negative score (e.g. -1). Go to the second gene in L. If it's found in G, add to the score you just wrote down; otherwise, subtract from the previous score. Continue like this until you reach the bottom of L. If L intersects nicely with G, you'll reach a peak score value, after which the score begins to decline. If L doesn't intersect with G, you'll just get a zig-zag line or a steady decline, instead of a mountain peak (or a valley, if you're looking at downregulated genes in L). At this point, you want to assign a P-value to the result. This is done by randomly re-ranking the genes in your list and rescoring, ad nauseam. If there's an interesting match between L and G, there should be few or no occasions where random sorting produces a better score than the initial non-randomized score. The P-value comes from comparing the number of times a random shuffle yields a superior score against the number of shuffles performed. If I've got it right, you can also bypass the randomization step and get an analytical P-value by treating the running score as a Kolmogorov-Smirnov-type statistic (the original GSEA papers describe it as a weighted KS statistic).
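Here's the gist in code (a bare-bones sketch of the running-sum idea just described, not any published implementation; real tools like GSEA weight hits by the strength of the underlying expression statistic and normalize for set size, whereas the +1/-1 here is the cartoon version):

```python
import random

def running_sum_peak(L, G, hit=1.0, miss=-1.0):
    """Walk down ranked list L: step up on genes found in G,
    step down otherwise. Return the peak of the running score."""
    G = set(G)
    score, peak = 0.0, 0.0
    for gene in L:
        score += hit if gene in G else miss
        peak = max(peak, score)
    return peak

def permutation_pvalue(L, G, n_perm=1000):
    """Fraction of random re-rankings of L whose peak matches or
    beats the real one (with the usual add-one correction)."""
    observed = running_sum_peak(L, G)
    shuffled = list(L)
    beats = 0
    for _ in range(n_perm):
        random.shuffle(shuffled)
        if running_sum_peak(shuffled, G) >= observed:
            beats += 1
    return (beats + 1) / (n_perm + 1)
```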
Now, let's imagine a case where L is drawn from a universe of 20,000 genes, but G is drawn from a universe of 2,000 genes. To my thinking, that means that for every gene in L that could possibly appear in G, there are nine that never will. Those 90% of genes, which could be very important (in cancer, for example), will subtract from the running score whenever they turn up. If, on the other hand, both L and G are drawn from the same universe of 2,000 genes, you'll get a much higher peak score and a more significant P-value. In other words, you get rewarded for entering lists of genes that are biased for abundance. You can tinker with this in Excel, using manufactured lists L and G (the "match" function is useful), or in a few lines of code, as below.
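Here's that tinkering in Python rather than Excel, reusing running_sum_peak from the sketch above (all sizes are invented; the point is the dilution, not the specific numbers):

```python
import random
random.seed(1)

small_univ = list(range(2000))    # G's universe
big_univ = list(range(20000))     # L's universe, 10x larger
G = set(random.sample(small_univ, 200))

# Pretend the biology puts 100 genuine G-genes near the top of L,
# mixed in with other regulated genes. In the big universe, most of
# those other genes lie outside G's universe and can only be misses.
hits = random.sample(sorted(G), 100)
others_small = random.sample([g for g in small_univ if g not in G], 30)
others_big = random.sample([g for g in big_univ if g not in G], 300)

top_small = hits + others_small
top_big = hits + others_big
random.shuffle(top_small)
random.shuffle(top_big)

# We only walk the enriched top block; the tail would just drift down.
print(running_sum_peak(top_small, G))  # healthy peak, around +70
print(running_sum_peak(top_big, G))    # dragged toward zero
```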
The above issue could be solved, I believe, by properly weighting the negative values. If L's background is 10X larger than G's, you could add 1 for a positive match, but subtract 0.1 for a non-match. In my own Excel tinkering, this seemed to work; you can get the same peak score for the 2000/2000 case as for the 20000/2000 case if you adjust the scoring system correctly. But nobody does this, of course. That's because, again, nobody (except us) assigns backgrounds to GO-like lists. Previously, we showed that it's not difficult to derive an approximate background for these lists.
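Here's one way to implement that weighting, continuing the toy above (a sketch under the assumption that G's background is known; zeroing out the penalty for out-of-background genes is the extreme version of the flat -0.1 idea):

```python
def weighted_peak(L, G, G_universe, outsider_penalty=0.0):
    """Running-sum peak where genes outside G's own background get a
    reduced penalty (zero here; -0.1 mimics a flat 10:1 weighting)."""
    G, G_universe = set(G), set(G_universe)
    score, peak = 0.0, 0.0
    for gene in L:
        if gene in G:
            score += 1.0
        elif gene in G_universe:
            score -= 1.0
        else:
            score += outsider_penalty
        peak = max(peak, score)
    return peak

# The diluted list recovers its peak once out-of-background genes
# stop being punished as full misses.
print(weighted_peak(top_big, G, small_univ))  # back to roughly +70
```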
One problem I see with FCS is the fact that the P-value is derived from a permutation test, meaning you'll never see extreme P-values (i.e. your computer will probably explode after, say, a quadrillion re-assortments of your gene list). Maybe I'm wrong, but I'm thinking Kolmogorov-Smirnov won't save you either...the test poops out for crazy P-values.*** When comparing your list against 1,000 lists G, you really do want an output where the absolute craziest P's are allowed to bubble to the top. Or, to put it another way, 10^100 is 10^75 times bigger than 10^25. Is that not important?
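To put a number on that: with N shuffles and the usual add-one correction, the smallest P a permutation test can hand you is 1/(N+1). A billion shuffles bottoms out around 10^-9; getting to 10^-100 by brute force isn't happening in this universe.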
Another weakness of GO-like lists (unlike WIMG lists 😄) that we haven't addressed before is the fact that they're generally unranked. I would think that the FCS folks could find a way to make their approach all the more powerful by utilizing ranked G's, not just ranked L's. Again, though, it's a bit difficult to rank genes when you derive them by text-mining. If you ranked them merely by mentions, I'm guessing you'd find abundant genes strongly over-represented at the top of your lists. Nevertheless, even in a "concrete" list of genes like those involved in the Krebs cycle, some genes are more vulnerable to perturbation than others. Some are more dispensable than others. With some thought, ranking could be done.
A final issue (for now) with FCS is this: sometimes two lists don't intersect at all, and this is actually significant. This most often happens when G and L are large relative to the background (e.g. G=1000, L=1000, the background = 10000, and there's no intersection whatsoever between lists). I don't see FCS tools dealing with this potentially interesting prospect. Any tool that utilizes Fisher's exact test or the hypergeometric test, however, should be able to deal with this situation without a hitch.
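For what it's worth, the Fisher machinery handles the zero-overlap case directly (a quick check, assuming scipy's fisher_exact):

```python
from scipy.stats import fisher_exact

# G = 1,000; L = 1,000; background = 10,000; zero overlap.
# Table cells: [[in both, L only], [G only, in neither]]
table = [[0, 1000], [1000, 8000]]

# One-sided test for depletion: is the overlap smaller than chance?
odds_ratio, p = fisher_exact(table, alternative="less")
print(p)  # tiny: chance alone predicts ~100 shared genes, not zero
```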
At some point, we'll write a paper, making WIMG entirely "respectable." We'll need to show that we are, in some respects, better than other tools. My approach will not be math-dominant. Instead, it will be biology-dominant. What I mean here is this: if you insert a list of genes up-regulated in mouse brains when you apply drug X, will the output involve mice? Brains? Bonus points if the tool in question reflects back to you that you applied drug X. Such results give the user confidence that more "remote" results (e.g. genes upregulated in a human neural cell line on knockdown of gene ABC) that nevertheless intersect the mouse list with high significance are indeed relevant to whatever questions the biologist is asking.****
*Let's assume, for a second, that a paper comes out that convinces me of the superiority of FCS. It's really not a big deal. I can just add the vastly superior algorithm to the site. A good programmer (not me) could probably do this in a matter of hours or days, especially if an R package containing the nitty-gritty code is available. On the other hand, nobody is going to duplicate our underlying database in a short period of time. It's really big.
**Just to be clear, this problem is not solved by tools that ask you to enter the genes that comprise your background, in addition to your list L. The problem to which I refer lies on the side of the lists G, which have unknown backgrounds.
***I do note that GOrilla, a tool that uses an FCS-like approach when you plug in a single ranked list L (as opposed to a list L plus a background list B), can spit out P-values on the order of 10^-15 (I got that figure by grabbing a GO list and plugging it into the tool...it damn well better output a very nice P)*****. I'm not sure how that works. On the other hand, I was surprised to find that the online version of GSEA actually invokes a hypergeometric test, allowing for the crazy P-values that you may find. Why would the pioneers of FCS fall back on the hypergeometric approach? Probably because of the processing burden: to perform a single L/G comparison, you've got to do a lot of re-assortment (the more the merrier if you're looking for extreme P-values), and then you move on to the next list G2, and repeat, and then G3, etc.
****Some folks might actually complain that the output reflects right back at you what you put in. In the case of WIMG, you can eliminate that complaint. You could, for example, require that only human lists are examined against your mouse list, or that only knockout studies are examined. Or perhaps the complainer wishes that only big, broad pathways, presumably universal to all cell types, be output...in that case you can choose the "external lists" option within the "Cell Types" box on the left side of the page, and you'll receive only GO-like lists as output. Don't be disappointed by the not-so-significant P-values that may be generated, however...that's what happens when these lists are adjusted to account for backgrounds that are often quite small.
*****When we plug the same list into our own software, we get a P-value of 10^-96. Intuitively, what sort of P-value do you expect when two lists of about 100 elements and a background of 1000 happen to intersect perfectly?
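If you want the non-intuitive answer to that rhetorical question (a quick check, assuming scipy):

```python
from scipy.stats import hypergeom

# Two 100-gene lists from a 1,000-gene background, overlapping perfectly.
# The overlap can't exceed 100, so P(X >= 100) is just P(X = 100).
p = hypergeom.pmf(100, 1000, 100, 100)
print(p)  # on the order of 1e-140
```

No permutation scheme is going to take you anywhere near that neighborhood.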