Friday, February 11, 2022

What's in Our Database?

Recently, we exceeded 40,000 gene lists in our database. At some point in the (not-so-near) future, we'll upgrade the very basic appearance of our site. However, the majority of our labor has been, and will be, focused on the underlying database. More lists mean more opportunities for users to find studies that strongly intersect with their own lists, and more chances to find genes that are significantly co-expressed with their own genes of interest.

40,000 lists means approximately 800,000,000 study/study intersections, each with an associated P-value. When we break the 50,000 mark, we'll have about 1,250,000,000 intersections. Because the number of intersections grows roughly with the square of the number of lists, a 25% increase in database size means about 56% more P-values. For biological truthseekers and hypothesis-generators, it's database size that should be critical, not a pretty interface.
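For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes that every list is compared with every other list exactly once, so the count is simply the number of unordered pairs, n(n-1)/2. The function name `pair_count` is ours, for illustration only.

```python
from math import comb


def pair_count(n_lists: int) -> int:
    """Number of unordered list/list pairs: n * (n - 1) / 2."""
    return comb(n_lists, 2)


current = pair_count(40_000)   # pairs at the current database size
future = pair_count(50_000)    # pairs once the database reaches 50,000 lists
growth = future / current - 1  # fractional increase in the number of pairs

print(f"{current:,}")   # 799,980,000
print(f"{future:,}")    # 1,249,975,000
print(f"{growth:.1%}")  # 56.3%
```

The quadratic growth is the whole point: each new list is compared against everything already in the database.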

We'd also point out that, with few exceptions, our database is not littered with recycled GO lists and the like. Most of our lists will not be found elsewhere. Sometimes I get the feeling that a large portion of gene enrichment tools are generated by folks whose primary interest is in programming and computer science, not biology. The database content is thus an annoyance that must be dealt with. The easy solution to this annoyance is to grab existing GO lists and manufacture some new, tricky algorithms that make the tool worthy of an NAR paper.

So, what's in the February 2022 incarnation of the database? First, let's look at the species breakdown:


Currently, we do not include Drosophila or zebrafish studies in the database. There are plenty of these studies out there...perhaps we'll branch out into flies and fish in the future.

Next, how about tissue types?



Above you'll see the 50 most common tissue terms in the database. In total, there are about 250.

The most common tissue in the database is "Blood", with a big "B". The big B means that any cell types that could be found in blood are included...lymphocytes, granulocytes, red blood cells, etc. You'll also find the term "lymphocyte" in the graph. This explains why, if you were to sum up all the tissue counts above, there'd be well over 40,000 terms. A small b "blood" includes only major blood fractions (e.g. whole blood, plasma, etc.). Look here for more information on the cell types you'll find in the database.

How about the sorts of molecules you'll find in the database? Here, 83% of the database falls under the term "transcript." This 83% will dominate any graph that we make, so below we list the other sorts of molecules in the database.


In the case of the terms "PTM", "methylation", "antigen", "chip", and "epitranscriptome", we list the genes associated with these events. Some would dispute the inclusion of such studies within our database. A hypermethylation event, of course, has a very specific location. A nearby location on the same gene could be hypomethylated, meaning that this gene could be found in both hyper- and hypo-methylation lists from a single study. We justify this approach with the simple observation that two hypermethylation lists from different studies may overlap very significantly (try it: find a list of hypermethylated genes in a particular cancer type and enter it into our Fisher tool...don't bother selecting any options under the "molecule" filter).
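To make "overlap very significantly" concrete, here is a rough sketch of a one-sided Fisher's exact test (equivalently, a hypergeometric tail) for the overlap between two gene lists. This is not the code behind our Fisher tool. The function name `overlap_p_value`, the placeholder gene names, and the background size are all assumptions chosen for illustration.

```python
from math import comb


def overlap_p_value(list_a, list_b, background_size):
    """One-sided Fisher/hypergeometric P-value for the overlap of two gene lists.

    Returns the probability of seeing an overlap at least as large as the
    observed one if list_b were drawn at random from a background of
    `background_size` genes (an assumed parameter).
    """
    a, b = set(list_a), set(list_b)
    n_a, n_b = len(a), len(b)
    if n_a > background_size or n_b > background_size:
        raise ValueError("background_size must be at least as large as each list")
    k = len(a & b)  # observed overlap
    total = comb(background_size, n_b)
    # Sum the probabilities of every overlap size from k up to the maximum possible.
    tail = sum(
        comb(n_a, i) * comb(background_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total


# Toy example with made-up placeholder gene names and a tiny assumed background.
hyper_study_1 = ["GENE1", "GENE2", "GENE3", "GENE4"]
hyper_study_2 = ["GENE3", "GENE4", "GENE5", "GENE6"]
p = overlap_p_value(hyper_study_1, hyper_study_2, background_size=10)
print(round(p, 4))  # 0.5476
```

With a background of only 10 genes, sharing 2 of 4 genes is unremarkable. Against a realistic background of tens of thousands of genes, the same overlap would be far less likely by chance, which is why the choice of background matters.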

We've lost some interest in circRNA studies of late. The vast majority are focused on cancer, to the exclusion of other diseases, knockouts, etc. If we were to find that the circRNAs upregulated in a particular knockout study coincided with circRNAs that are downregulated in pancreatic cancer, for example, that would be quite interesting. But given the trend in the field, such comparisons aren't possible.

Finally, how about study types?

"Treatment" refers primarily to application of large molecules to cells (as opposed to overexpression, where expression occurs inside cells, or "drug", which refers to small molecules). "Environment" is a broad term that covers, for example, dieting studies, lifestyle studies (e.g. exercise vs sedentary), as well as surgeries. "PPI" refers to protein-protein interactions.

Our database will continue to grow. However, it's unlikely that the proportions shown in the above graphs will change much in the future.

 

whatismygene.com 

