We've added more than 1,000 GO lists to our database. There are thousands and thousands more we could add, if we were motivated. The lists we added are "core" lists, i.e. the lists from which other, larger lists are built up. If your input list doesn't match up with at least one of the GO lists in our database, we're guessing it won't really match up with ANY GO lists out there. We say "really" because it's always possible to assume unrealistic background figures for the input set, the GO list, or both, and so generate artificially significant P-values. All of the WIMG GO lists are adjusted for background.
In addition to requiring "coreness", we also required that each list contain at least 40 genes. This is simply because our background-estimation procedure generally becomes less precise as the list gets smaller.
After adding an experiment-based list (e.g. a set of genes upregulated upon IFNA treatment in a specific study), we always perform Fisher's exact test of the new list against all other lists in our database, generating as many as 95,000 P-values. Part of the reason for this testing is to check for possible errors: if two lists overlap at P = 10^-400, perhaps they actually come from the same experiment. That's not necessarily cheating, since the same data can legitimately be used in two or more studies. On the other hand, perhaps some researchers are indeed plagiarizing data. Another reason for testing new data is simple curiosity. In the course of this testing, we've noticed that GO lists rarely appear as the single most significant match to any particular study. Most likely, an IFNA treatment list will match up to another IFN study, or a viral infection study, not a GO list. The best-matching GO list, in fact, may be found below hundreds or even thousands of better-matching experimental lists. It's interesting to note, however, that this is not always the case. For example, when we examined genes upregulated in atrial appendages of old vs young individuals (GSE136928), GO lists emerged as the most significant matches ("GO:0003823 antigen binding" took the top spot, with a P-value of 10^-22, with numerous other GO lists following).
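The list-against-list testing described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the function name `overlap_log10_p` and the single shared `background_size` are assumptions made for the example, and our background-adjustment procedure is not reproduced here. The sketch computes the one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) with exact integer arithmetic and returns log10(P), because a P-value like 10^-400 is too small to store as an ordinary floating-point number.

```python
from math import comb, log10


def overlap_log10_p(list_a, list_b, background_size):
    """Return log10 of the one-sided Fisher's exact P-value for the overlap.

    P is the chance of seeing at least the observed overlap if list_b were
    drawn at random from a background of `background_size` genes.
    `background_size` is an assumed, user-supplied figure (hypothetical here).
    """
    a, b = set(list_a), set(list_b)
    n_a, n_b = len(a), len(b)
    observed = len(a & b)

    # The background must be able to hold every gene in both lists.
    if background_size < len(a | b):
        raise ValueError("background_size is smaller than the union of the lists")

    # Count the draws of n_b genes that hit list A at least `observed` times.
    tail = sum(
        comb(n_a, k) * comb(background_size - n_a, n_b - k)
        for k in range(observed, min(n_a, n_b) + 1)
    )
    total = comb(background_size, n_b)  # all possible draws of n_b genes

    # Python integers are exact, so only the final logs are floating point.
    return log10(tail) - log10(total)


if __name__ == "__main__":
    # Toy gene symbols purely for illustration.
    study = ["IFIT1", "IFIT3", "MX1", "OAS1"]
    go_list = ["MX1", "OAS1", "ACTB"]
    # With a background of 10 genes, P = 40/120, i.e. one in three.
    print(overlap_log10_p(study, go_list, background_size=10))
```

Note how strongly the result depends on `background_size`: inflating the background makes any given overlap look more significant, which is why an unrealistic background produces artificially small P-values.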
In general, we believe that most diseases function by altering gene modules, not the convenient biological pathways that GO lists describe. An alteration in a single pathway isn't the difference between brain cells and heart cells. My own thinking is that organisms are freaky-paranoid about the possibility of being hacked by viruses or bacteria or even cancer; complexity that may seem unnecessary reduces the possibility of hacking. There aren't many absolutes in biology, however; sometimes GO lists do a pretty good job of telling you what's going on.