Sunday, June 20, 2021

A New Feature at WhatIsMyGene.com

Most of the gene lists that compose our database are sorted according to some criterion. We may apply a significance cut-off to the data, and from there sort according to fold-change. Often, we divide log(fold-change) by the significance, combining these two measures in one step. In cases where significance measures are not available, or no genes are significantly altered, we’ll sort according to fold-change alone. You may wonder why we’d burden the database with studies wherein no genes at all were significantly altered; we’ll address that shortly in another post.
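A minimal sketch of this kind of ranking follows. It is illustrative only: the function name `rank_genes`, the `(gene, fold_change, p_value)` tuple layout, the use of a p-value as the "significance", and the base-2 logarithm are all assumptions, and the sketch combines the cut-off with the log(fold-change)/significance score rather than reproducing any one of our actual procedures.

```python
import math

def rank_genes(rows, p_cutoff=None):
    """Rank (gene, fold_change, p_value) rows, most upregulated first.

    Hypothetical helper: p_value may be None when a study reports no
    significance measure at all.
    """
    rows = list(rows)
    have_p = all(p is not None for _, _, p in rows)

    if have_p and p_cutoff is not None:
        passing = [r for r in rows if r[2] <= p_cutoff]
        if passing:
            rows = passing
        else:
            # No gene passes the cut-off: keep everything and
            # fall back to sorting on fold-change alone.
            have_p = False

    def score(row):
        _, fold_change, p = row
        log_fc = math.log2(fold_change)
        if not have_p:
            return log_fc
        # Floor p so that a reported p of 0 does not divide by zero.
        return log_fc / max(p, 1e-300)

    return [gene for gene, _, _ in sorted(rows, key=score, reverse=True)]
```

With this scoring, a modest fold-change backed by a very small p-value can outrank a large fold-change with weak support, which is the point of combining the two measures in one step.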

Regardless of the sorting method, in some cases there may be little difference between the top “most upregulated” gene and the 100th: the decline is gradual. In other cases, there may be a steep drop-off from the first position to the 10th. An extreme case would be a knockdown experiment in which the targeted transcript is very significantly downregulated, but no other transcripts are altered to any great degree.

In cases where, say, only the top 25 genes in a list of 200 are “interestingly” altered, you might consider the remaining 175 genes to be more or less random garbage. This garbage could hide a significant result. When performing Fisher’s exact test against a selection of other studies, input composed of those 25 genes might yield more significant results than input composed of the full 200-gene list.
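To see how trimming can sharpen a result, here is a small sketch of the one-sided (over-representation) Fisher’s exact test, computed as a hypergeometric tail with only the standard library. The function name, the 20,000-gene universe, and the overlap counts are hypothetical numbers chosen for illustration, not values from our database.

```python
from math import comb

def fisher_right_tail(overlap, query_size, study_size, universe):
    """P(overlap >= observed) when query_size genes are drawn at random
    from `universe` genes, of which study_size belong to the other study.
    This is the one-sided Fisher's exact test for enrichment."""
    max_k = min(query_size, study_size)
    total = comb(universe, query_size)
    tail = sum(
        comb(study_size, k) * comb(universe - study_size, query_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

# Hypothetical scenario: the comparison study holds 200 genes.
# The top 25 genes of the query list share 10 genes with it; extending
# the query to all 200 genes adds only 2 more shared genes (12 total).
p_top25 = fisher_right_tail(10, 25, 200, 20000)
p_full = fisher_right_tail(12, 200, 200, 20000)
print(p_top25 < p_full)
```

In this made-up scenario nearly all of the overlap sits in the top 25 genes, so the shorter input gives the smaller p-value; the extra 175 genes mostly dilute the signal.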

Or, perhaps you’re using our “relevant studies” tool. You want to find studies in which your gene of interest is very strongly (vs. moderately) altered. You’d thus like to eliminate the lower-ranked genes in our lists.

Our new feature simply allows you to whittle our database down to the “top 100”, “top 50”, or “top 25” genes of each list. Very roughly, most of the underlying lists are composed of about 200 genes, so these options allow you to eliminate the lowest-ranked 50%, 75%, or 88% of the genes in a list. The feature can be applied when you use the “Relevant Studies”, “Coregulation”, “Fisher”, “Match Studies”, “Regulation”, and “Third Study” tools. You’ll see this “Restrict IDs” option at the bottom left of a page. Using the tool has the additional benefit of speeding up the generation of output, as you’re tossing up to 88% of the database into the garbage.
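The arithmetic behind those percentages can be sketched as follows, assuming a list of exactly 200 genes that is already sorted best-first. Both function names are hypothetical.

```python
def restrict_ids(ranked_genes, top_n):
    """Keep only the top_n entries of a list already sorted best-first."""
    return ranked_genes[:top_n]

def fraction_discarded(list_size, top_n):
    """Share of a list that a 'top N' restriction throws away."""
    return 1 - min(top_n, list_size) / list_size

# Assumed list size of 200; real lists vary around that figure.
for top_n in (100, 50, 25):
    print(f"top {top_n}: {fraction_discarded(200, top_n):.1%} discarded")
```

For a 200-gene list the “top 25” option discards 87.5% of the entries, which is the figure rounded to 88% above. Lists shorter than the chosen cut-off are left whole.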

Practically speaking, if you compare results with and without the “Restrict IDs” option, the most likely outcome is a lowered significance when restricting the database size. This is because a typical gene list in our database shows a gradual, not steep, change in significance (or fold-change, depending on how the list was sorted). Thus we’d advise that you ignore this option when looking for broad trends and insights, and use the option when you seek to refine a result. The above does not apply to the “Relevant Studies” tool, as this tool simply searches for your gene in our database and doesn’t generate any statistics. With “Relevant Studies”, you may wish to begin with a restricted database size; for this tool we’ve also added “Top 10” and “Top 5” options, meaning you could eliminate approximately 97.5% of the database.

In the case of the “Coregulation” tool, let’s say you restrict the database using “top 25.” In this case, your gene of interest must be found in the top 25 genes in a study, and its coregulated partners must also be found in the top 25.
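The rule can be sketched as below. This is a simplified illustration that only counts co-occurrences; the dictionary layout, the function name, and the toy study names are assumptions, and the real tool may compute more than raw counts.

```python
from collections import Counter

def coregulated_partners(studies, gene, top_n=None):
    """Count how often other genes share a (restricted) list with `gene`.

    studies: dict mapping a study name to its ranked gene list
    (hypothetical layout). With top_n set, both the gene of interest
    and its partners must fall within the first top_n positions.
    """
    counts = Counter()
    for ranked in studies.values():
        window = ranked if top_n is None else ranked[:top_n]
        if gene in window:
            counts.update(g for g in window if g != gene)
    return counts

# Toy data, not taken from the database.
studies = {
    "study_1": ["FOS", "EGR1", "JUN", "MYC"],
    "study_2": ["EGR1", "FOS", "ACTB"],
    "study_3": ["MYC", "ACTB", "FOS"],
}
print(coregulated_partners(studies, "FOS", top_n=2))
```

In the toy data, restricting to the top 2 drops `study_3` entirely, because FOS sits in third position there, and only EGR1 survives as a partner.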

At the beginning of this post, we said that most of our lists are sorted. In some cases, a journal simply provides a list of genes without any fold-change data. Another sort of list would be represented by “518 human kinases”; we can’t say that one kinase is better than another, so we simply randomize these identifiers in our database. Also, early in our existence, we did not sort our lists; as time passes, we replace these early lists with improved lists that are sorted, but this is a slow process. When you use the “Restrict IDs” feature, in addition to tossing low-ranking genes, you’re also tossing randomized (i.e. non-sorted) lists. Thus, if you wish to examine the maximum number of studies in our database, do not use this feature.
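Put as a sketch, the restriction does two things at once: it truncates sorted lists and drops unsorted ones. The `"sorted"` flag and the dictionary layout here are hypothetical, as are the toy entries.

```python
def restrict_database(studies, top_n):
    """Return only sorted studies, each truncated to its top_n genes.

    studies: dict mapping a study name to
    {"genes": [...], "sorted": bool} (hypothetical layout).
    Unsorted (randomized) lists are excluded, since a 'top N'
    of a random ordering carries no meaning.
    """
    return {
        name: entry["genes"][:top_n]
        for name, entry in studies.items()
        if entry["sorted"]
    }

# Toy data for illustration only.
db = {
    "knockdown_study": {"genes": ["G1", "G2", "G3", "G4"], "sorted": True},
    "human_kinases": {"genes": ["K7", "K2", "K9"], "sorted": False},
}
print(restrict_database(db, 2))
```

The kinase list disappears from the restricted view, which is why the unrestricted database is the one to use when you want the maximum number of studies.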

***

Given that most of our data is sorted, we thought it might be interesting to find the genes that most frequently occupy the #1 slot. The champion might be a bit unexpected: SPP1, appearing atop our lists in 30 different studies! Following that, we have TTR, EGR1, S100A8, FOS, LYZ, HBA1, and CCL5. If we adjust for rarity of the gene throughout the database, the list looks like this: EGR1, SPP1, TTR, ABL1, RAP1A, HBA1, and S100A8. Genes that are relatively rare over the entire database, yet nevertheless found themselves at the top position on multiple occasions, include UBR4, FOXP3, ESRG, COX1, SNX3, EIF4A3, A1BG, and IGHD.
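A tally like this can be sketched in a few lines. The post does not spell out how the rarity adjustment was done, so the ratio used here (times at #1 divided by the number of lists containing the gene) is purely an assumption, as are the function names and the `min_top` threshold.

```python
from collections import Counter

def top_slot_counts(ranked_lists):
    """Count how often each gene occupies the first position."""
    return Counter(genes[0] for genes in ranked_lists if genes)

def rarity_adjusted(ranked_lists, min_top=2):
    """Order genes by (times at #1) / (lists containing the gene).

    Assumed adjustment, not necessarily the one used for the post.
    Genes with fewer than min_top first-place finishes are ignored.
    """
    top = top_slot_counts(ranked_lists)
    total = Counter(g for genes in ranked_lists for g in set(genes))
    scores = {g: top[g] / total[g] for g in top if top[g] >= min_top}
    return sorted(scores, key=lambda g: (-scores[g], g))

# Toy lists, not real database content.
lists = [
    ["SPP1", "A", "B"],
    ["SPP1", "C"],
    ["EGR1", "SPP1"],
    ["EGR1", "A"],
    ["A", "SPP1", "EGR1"],
]
print(top_slot_counts(lists))
print(rarity_adjusted(lists))
```

In the toy lists, SPP1 and EGR1 each top two lists, but EGR1 appears in fewer lists overall, so it moves ahead once rarity is taken into account.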

whatismygene.com 

