Most of the gene lists that compose our database are sorted according to some criterion. We may apply a significance cut-off to the data, and from there sort according to fold-change. Often, we divide log(fold-change) by the significance, combining these two measures in one step. In cases where significance measures are not available, or no genes are significantly altered, we’ll sort according to fold-change alone. You may wonder why we’d burden the database with studies wherein no genes at all were significantly altered; we’ll address that shortly in another post.
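As a rough illustration of the sorting just described, here is a minimal Python sketch. It rests on assumptions the post does not spell out: that "significance" is a p-value, that the logarithm is base 2, and that the cut-off is 0.05. The function name `rank_upregulated` and the tuple layout are hypothetical, not the database's actual code.

```python
import math

def rank_upregulated(rows, alpha=0.05):
    """Rank genes from most to least upregulated.

    rows: list of (gene, fold_change, p_value) tuples.
    fold_change is a ratio > 0; p_value may be None if unavailable.
    alpha is an assumed significance cut-off.
    """
    have_p = all(p is not None for _, _, p in rows)
    significant = [r for r in rows if r[2] <= alpha] if have_p else []

    if significant:
        # Combined score: log2(fold-change) divided by the p-value
        # (assumed form; the floor avoids division by zero).
        def score(row):
            return math.log2(row[1]) / max(row[2], 1e-300)
        pool = significant
    else:
        # No significance values, or nothing passes the cut-off:
        # fall back to fold-change alone.
        def score(row):
            return row[1]
        pool = rows

    return [gene for gene, _, _ in sorted(pool, key=score, reverse=True)]
```

With this scoring, a modest fold-change backed by a very small p-value can outrank a larger fold-change with a weaker p-value, which is the point of combining the two measures in one step.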
Regardless of the sorting method, in some cases there may be
little difference between the top “most upregulated” gene and the 100th;
a gradual decline. In other cases, there may be a steep dropoff from the first
position to the 10th. An extreme case would be a knockdown
experiment in which the targeted transcript is very significantly
downregulated, but no other transcripts are altered to any great degree.
In cases where, say, only the top 25 genes in a list of 200
are “interestingly” altered, you might consider the remaining 175 genes to be
more or less random garbage. This garbage could hide a significant result. When
performing Fisher’s exact test against a selection of other studies, input
composed of those 25 genes might yield more significant results than input
composed of the larger 200-gene list.
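To make the "garbage hides the signal" point concrete, here is a small sketch using a one-sided Fisher's exact test, computed as a hypergeometric tail with only the standard library. Everything in the example data is made up for illustration: the gene names, the universe of 20,000 genes, and the assumption that all ten shared genes sit in the top 25.

```python
from math import comb

def fisher_overlap_p(list_a, list_b, universe_size):
    """One-sided Fisher's exact p-value for the overlap of two gene lists.

    Probability of seeing at least the observed overlap if list_a were
    drawn at random from a universe of universe_size genes.
    """
    a, b = set(list_a), set(list_b)
    k = len(a & b)                      # observed overlap
    n, K, N = len(a), len(b), universe_size
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)

# Synthetic example: a 200-gene ranked list whose only real overlap
# with another study lies in its top 25 positions.
ranked = [f"gene{i}" for i in range(200)]
other = ranked[:10] + [f"other{i}" for i in range(190)]

p_top25 = fisher_overlap_p(ranked[:25], other, 20000)
p_full = fisher_overlap_p(ranked, other, 20000)
print(p_top25 < p_full)
```

Here the same ten shared genes are far less likely to appear by chance in a 25-gene input than in a 200-gene input, so the restricted list gives the smaller p-value.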
Or, perhaps you’re using our “relevant studies” tool. You
want to find studies in which your gene of interest is very strongly (vs.
moderately) altered. You’d thus like to eliminate the lower-ranked genes in our
lists.
Our new feature simply allows you to whittle our database
down to “top 100”, “top 50”, or “top 25” genes. Very roughly, most of the
underlying lists are composed of about 200 genes, so the above options allow
you to eliminate the 50%, 75%, and 88% lowest ranked genes in a list. The
feature can be applied when you use the “Relevant studies”, “Coregulation”,
“Fisher”, “Match Studies”, “Regulation”, and “Third Study” tools. You’ll see
this “Restrict IDs” option at the bottom left of a page. Using
the tool has the additional benefit of speeding up the generation of output, as
you’re tossing up to 88% of the database into the garbage.
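The percentages above follow directly from the rough 200-gene list length. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
def restrict_ids(ranked_genes, top_n):
    """Keep only the top_n highest-ranked genes of one sorted list."""
    return ranked_genes[:top_n]

def fraction_dropped(list_length, top_n):
    """Fraction of a list removed by a top_n restriction."""
    return max(0.0, 1 - top_n / list_length)

# Assumes a typical list of about 200 genes, as stated above.
for top_n in (100, 50, 25):
    print(top_n, f"{fraction_dropped(200, top_n):.1%}")
# 100 50.0%
# 50 75.0%
# 25 87.5%
```

The 87.5% figure is what the text rounds to 88%. Lists shorter than the chosen cut-off lose nothing.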
Practically speaking, if you compare results with and
without the “Restrict IDs” option, the most likely outcome is a lowered
significance when restricting the database size. This is because a typical gene
list in our database shows a gradual, not steep, change in significance (or
fold-change…whatever). Thus we’d advise that you ignore this option when
looking for broad trends and insights, and use the option when you seek to
refine a result. The above does not apply to the “Relevant Studies” tool, as
this tool simply searches for your gene in our database, and doesn’t generate
any statistics. In the case of “Relevant Studies”, you may wish to begin
with a restricted database size. For this tool, we’ve also added “Top 10” and “Top 5” options, meaning you could eliminate approximately 97.5% of the database.
In the case of the “Coregulation” tool, let’s say you
restrict the database using “top 25.” In this case, your gene of interest must
be found in the top 25 genes in a study, and its coregulated partners
must also be found in the top 25.
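One way to picture this rule is the sketch below. It is only an illustration of the restriction logic, not the actual “Coregulation” tool: the function name, the dictionary layout, and the simple partner count are all assumptions.

```python
from collections import Counter

def coregulated_partners(studies, gene, top_n=None):
    """Count how often each gene shares a (restricted) list with `gene`.

    studies: dict mapping study name -> list of genes, best-ranked first.
    top_n:   if given, only the first top_n genes of each list are used,
             for both the gene of interest and its partners.
    """
    counts = Counter()
    for ranked in studies.values():
        kept = ranked[:top_n] if top_n else ranked
        if gene in kept:
            counts.update(g for g in kept if g != gene)
    return counts

# Toy data with placeholder gene names.
studies = {
    "s1": ["A", "B", "C", "D"],
    "s2": ["B", "A", "E", "C"],
    "s3": ["C", "D", "E", "A"],
}
print(coregulated_partners(studies, "A", top_n=2))  # Counter({'B': 2})
```

In the toy data, study s3 contributes nothing once the restriction is applied, because gene A sits below the cut-off there.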
At the beginning of this post, we said that most of
our lists are sorted, but not all. In some cases, a journal simply provides a list of genes
without any fold change data. Another sort of list would be represented by “518
human kinases”…we can’t say that one kinase is better than another, so we
simply randomize these identifiers in our database. Also, early in our existence,
we did not sort our lists; as time passes, we replace these early lists with
improved lists that are sorted, but this is a slow process. When you use the "Restrict IDs" feature, in addition to tossing low-ranking genes, you're also tossing randomized (i.e. non-sorted) lists. Thus, if you wish to examine the maximum number of studies in our database, do not use this feature.
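The effect on unsorted lists can be sketched as follows. The `is_sorted` flag and the function name are hypothetical; how the database actually marks randomized lists is not described here.

```python
def searchable_lists(studies, top_n=None):
    """Return the gene lists a search would see.

    studies: dict mapping study name -> {"genes": [...], "is_sorted": bool}.
    With no restriction, every list is used in full. With a top_n
    restriction, unsorted (randomized) lists are dropped entirely and
    sorted lists are cut to their first top_n genes.
    """
    if top_n is None:
        return {name: s["genes"] for name, s in studies.items()}
    return {name: s["genes"][:top_n]
            for name, s in studies.items() if s["is_sorted"]}
```

A "top N" cut only has meaning when position in the list reflects rank, which is why a randomized list cannot be partially kept.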
***
Given that most of our data is sorted, we thought it might
be interesting to find the genes that most frequently occupy the #1 slot. The
champion might be a bit unexpected: SPP1, appearing atop our lists in 30
different studies! Following that, we have TTR, EGR1, S100A8, FOS, LYZ, HBA1,
and CCL5. If we adjust for rarity of the gene throughout the database, the list
looks like this: EGR1, SPP1, TTR, ABL1, RAP1A, HBA1, and S100A8. Genes
that are relatively rare over the entire database, yet nevertheless find
themselves at the top position on multiple occasions, include UBR4, FOXP3, ESRG,
COX1, SNX3, EIF4A3, A1BG, and IGHD.
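A tally like this is easy to sketch. The exact rarity adjustment is not stated above, so the version below assumes one plausible form: the number of #1 appearances divided by the number of lists the gene appears in at all. The function name and the placeholder gene names are hypothetical.

```python
from collections import Counter

def top_slot_stats(sorted_lists):
    """Count #1 appearances and an assumed rarity-adjusted score.

    sorted_lists: iterable of gene lists, best-ranked first.
    Returns (top_counts, adjusted), where adjusted[g] is the share of
    g's appearances in which it held the top position.
    """
    lists = [lst for lst in sorted_lists if lst]
    top_counts = Counter(lst[0] for lst in lists)
    totals = Counter(g for lst in lists for g in set(lst))
    adjusted = {g: top_counts[g] / totals[g] for g in top_counts}
    return top_counts, adjusted

# Toy data with placeholder gene names.
toy = [["G1", "G2", "G3"], ["G1", "G3"], ["G2", "G1"], ["G1", "G2", "G3"]]
top_counts, adjusted = top_slot_stats(toy)
print(top_counts.most_common())  # [('G1', 3), ('G2', 1)]
```

A ratio like this favors genes that appear only once and happen to be on top, so in practice a minimum number of #1 appearances would likely be required before ranking by it.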