Saturday, March 28, 2026

A Better Website

We've got a more professional interface. Needless to say, AI (Gemini) was largely responsible for the improvement. The improvements are not limited to cosmetics...you should receive noticeably faster outputs from several of the tools (particularly "Relevant Studies", which doesn't require a lot of behind-the-scenes calculations). Despite the assistance, it still took about three weeks of work and 6,000 lines of code. The old site actually required fewer lines of code...this is not a testament to my own efficient coding, but rather, Gemini's insistence on bullet-proofing and commenting everything. 

There are also several new features. The coolest is this: The "Third Study" tool outputs a nice Venn diagram that you could use for your paper. Graphics generated from bioinformatics sites are rarely, if ever, publication ready, but it's easy enough to right-click on the image and edit it in Photoshop or Illustrator. The sizes of the circles and intersecting regions correspond to the number of genes within, though the correspondence breaks down to some extent when an intersection contains only a few genes. Click on an output study, go to the "Venn Diagram" tab, and you'll get something like this:




In case you don't know what the tool is supposed to achieve: the user enters two gene sets that already have a significant intersection. The tool finds studies that intersect strongly with the "central" set, but not with the second set. This is apparent in the above diagram, where the central set and the "study" set share 67 genes, but the second set and the study set share only 11. Bearing in mind that the algorithm references both user-entered and database-associated backgrounds, it determined that the central/study intersection was much more significant, in terms of log(P), than the second-set/study intersection. Basically, the tool helps you answer the question, "What ELSE is happening in my gene set?" You may find that your gene set is enriched for, say, cell-cycle genes, but you'd also like to know if a second theme is lurking, possibly overwhelmed by the cell-cycle signal. This tool will help.
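For the curious, here is a minimal Python sketch of the general idea, not the site's actual code. It assumes a plain hypergeometric test against a single shared background size, whereas the real tool references both user-entered and database-associated backgrounds. The names `neg_log10_p`, `third_study_hits`, and `min_gap` are made up for illustration.

```python
from math import comb, log10

def neg_log10_p(overlap, size_a, size_b, background):
    """-log10 of the hypergeometric upper tail, P(X >= overlap)."""
    upper = min(size_a, size_b)
    # Count the ways to draw size_b genes that hit set A at least `overlap` times.
    tail = sum(comb(size_a, k) * comb(background - size_a, size_b - k)
               for k in range(overlap, upper + 1))
    if tail == 0:
        return float("inf")  # an overlap this large is impossible
    # Take logs of exact integers so that tiny P-values do not underflow.
    return max(0.0, log10(comb(background, size_b)) - log10(tail))

def third_study_hits(central, second, studies, background, min_gap=5.0):
    """Studies that match the central set far better than the second set.

    `min_gap` (hypothetical) is the required difference in -log10(P).
    """
    hits = []
    for name, genes in studies.items():
        score_central = neg_log10_p(len(central & genes), len(central),
                                    len(genes), background)
        score_second = neg_log10_p(len(second & genes), len(second),
                                   len(genes), background)
        if score_central - score_second >= min_gap:
            hits.append((name, score_central, score_second))
    # Largest gap first.
    return sorted(hits, key=lambda h: h[1] - h[2], reverse=True)
```

Here `central`, `second`, and each entry of `studies` are Python sets of gene identifiers, and `background` is the number of genes that could have been detected. A study that overlaps both input sets equally well gets a gap of zero and is dropped.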

Here's another new feature: the "Tissue Specificity" box for the "Fisher" tool. Let's say you have a study that compares breast cancer tissue to healthy tissue. You derive a list of DEGs that are upregulated in breast cancer. You could use our Fisher tool to find knockouts, drugs, etc., that tend to downregulate these genes. However, you might suspect that these treatments could cause unpleasant systemic effects. You'd prefer to target genes that are breast-specific. The "Tissue Specificity" choice allows you to do that. Specifically, the tool looks in a table of breast-specific genes and then filters the database for studies in which these genes were specifically targeted. Though not related to tissue-specificity, we've also included a list of genes that can be targeted by existing drugs. More lists are possible.

Another feature is this: "Select Database", seen in the sidebar for several tools. Currently, there are only two choices, our "standard" database and a second database ("Reduced p10"). For the latter, we've simply taken the standard database and removed the top 10% of most commonly perturbed genes. It's computationally expensive to do this on the fly, hence a separate, precomputed database. The revised database is an attempt to address the fact that many input studies converge on relatively few database studies. Commonly perturbed genes are removed in order to allow the small-time talents to shine. The idea is far from optimized...removing a mere 10% of genes was probably too conservative, since outputs don't seem to differ much between the two databases. We've got ideas for other database alterations as well.
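The filtering itself is simple to sketch in Python. This is an illustration under assumptions, not our production code: here "commonly perturbed" is taken to mean "appears in the most gene lists," and the function name `reduce_database` and the rounding choice (`ceil`) are made up for the example.

```python
from collections import Counter
from math import ceil

def reduce_database(database, fraction=0.10):
    """Drop the most commonly perturbed genes from every list.

    `database` maps a list name to a set of genes. Returns the reduced
    database together with the set of genes that were dropped.
    """
    # Count, for each gene, the number of lists in which it appears.
    counts = Counter(g for genes in database.values() for g in set(genes))
    n_drop = ceil(len(counts) * fraction)
    dropped = {gene for gene, _ in counts.most_common(n_drop)}
    reduced = {name: set(genes) - dropped for name, genes in database.items()}
    return reduced, dropped
```

Raising `fraction` above 0.10 would be one way to test whether a more aggressive cut changes the outputs more noticeably.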

whatismygene.com 

Thursday, March 26, 2026

The most common perturbation themes

Let's say we enter a new gene list into our database. We can then perform gene enrichment on it against as many as 208,000 other lists. Even after tossing three perturb-seq studies that each generate thousands of gene lists, we can still test our new list against 123,000 other lists. If our new list is enriched for a common perturbation theme, thousands of gene lists may significantly intersect with our list. The question arises: of the 123,000 lists, which one significantly intersects with the most (other) lists?

The answer is drawn from HSF1 Inhibits Antitumor Immune Activity in Breast Cancer by Suppressing CCL5. Here, the list of genes downregulated in the c4-2 line upon dthib treatment overlaps significantly with about 10,000 other lists. Unlike the case with GO lists and the like, it's not immediately obvious what dthib treatment is expected to do. A few seconds of googling reveals that it's an HSF1 inhibitor. Studies that intersect with extreme significance involve a diverse array of perturbations: hdac5 knockdown, her2 inhibition, CDK expression, fgf1 treatment, and many more. The highest-ranking GO list comes in at position 1595: "GO:0022402 cell cycle process." It seems that the dthib list encapsulates some process far better than the GO list does. If the easily grasped wording of the GO lists is appealing to you, then we could just rename the dthib list something like this: "WIMG:0000001 DTHIB downregulated." 😀

Interestingly, the dthib study intersects very strongly with our list of genes that are rarely downregulated in cancer. The drug is indeed being studied as a cancer treatment.

How about the study that intersects second-best with all other lists in the database? Actually, this is not so easy to determine. That's because the second-best, and even the 500th-best, study intersects with the dthib theme. Therefore, we add a requirement: the list we deem to be second-best cannot overlap with the dthib list with a significance greater than -log(P) = 20.* A cutoff of 20 might seem very liberal, but in the case of the dthib study, for example, there are 5924 lists that overlap with at least this level of significance. Given this requirement, genes upregulated in mouse plantaris muscle one day after synergist ablation (Time course of gene expression during mouse skeletal muscle hypertrophy) wins the silver medal. There's actually quite a drop-off from the dthib study here, with only about 5% of database studies significantly overlapping. Again, it's not so obvious what's going on in this study. For a clue, the highest-ranking GO list (#1301) is "GO:0006955 immune response." Studies involving viral infections, adjuvant treatment, ischemia, various injuries, and radiation treatment match strongly. Again, to our way of thinking, the sheer volume of studies outperforming the GO list suggests a process, however murky or difficult to name, that should be considered "fundamental."

The third best list cannot overlap with the first or second-best list at -log(P)>20. These are genes upregulated in mouse medullary epithelial cells on raver2 knockdown (Aire-dependent transcripts escape Raver2-induced splice-event inclusion in the thymic epithelium). There is an impressive variety of means to recapitulate this result: lncrna over-expression, enhancer repression, various diets, aging, ezh2 over-expression, mettl3 knockout, etc. The best ranking GO list (#339) is "GO:0046649 lymphocyte activation." Do you think this GO list really captures what's happening here?

The fourth-best list involves genes upregulated in the a549 cell line on IRF1 overexpression. Simply knowing that IRF1 is "interferon regulatory factor 1" lets us know that we're talking about the innate immune response. Indeed, studies involving infection and interferon treatment dominate the top-ranked intersecting lists. Finally, a category that looks something like what we were taught in college! Nevertheless, the highest-ranked GO list comes in at position 749: "GO:0140546 defense response to symbiont."

The next three lists are these: 5) genes downregulated in the hair-m line on 12 hours of copanlisib treatment (Copanlisib synergizes with conventional and targeted agents including venetoclax in B- and T-cell lymphoma models), 6) genes upregulated in the hn4 line on ngf treatment (Nerve growth factor (NGF)-TrkA axis in head and neck squamous cell carcinoma triggers EMT and confers resistance to the EGFR inhibitor erlotinib), 7) genes downregulated in rat lumbar dorsal spinal cord on injection with coronavirus p65-derived peptide (A human coronavirus OC43-derived polypeptide causes neuropathic pain). Some quick notes: 1) the best GO match to the copanlisib study comes in at rank #2717, 2) the NGF study matches up nicely to numerous tgfb treatment studies, simplifying conceptualization a bit, and 3) the coronavirus p65-derived peptide study aligns well with numerous studies involving sub-cellular organization.

I ran the above text through a chat bot, hoping that it could return my words in a more succinct, elegant, or insightful form. It often works, but not this time. Thus, to wrap things up, I once again offer this: curated gene lists (CGLs, like GO) suck. It's difficult to imagine the number of experiments that never were performed and the potential insights that have been lost because of misleading and/or un-insightful CGL outputs. More generally, I think biology really suffers from an over-enthusiasm for categorization. On the positive side, there's plenty of room for improved delineation of patterns and processes in biology. 


*Actually, this is a crude (?) form of clustering: Find the single most potent study, toss all other studies that overlap with it above a certain significance, find the new most potent study, etc. The reason we don't use standard clustering here is that a matrix of all study/study P-values would come to about 50 GB. We'd need some serious computing power to generate this matrix and then cluster it.
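The greedy procedure in the footnote can be sketched in a few lines of Python. Treat this as an illustration under assumptions rather than our actual pipeline: `score(a, b)` is a hypothetical callable returning -log(P) for the overlap of two lists, "potency" is taken to be the number of other lists in the whole database that overlap above `count_cutoff`, and both the `count_cutoff` default and the alphabetical tie-break are made up for the example.

```python
def greedy_themes(names, score, count_cutoff=2.0, exclude_cutoff=20.0,
                  n_themes=3):
    """Pick theme-defining lists one at a time, most potent first."""
    # Potency: how many other lists each list significantly intersects.
    # Only one count per list is stored, never the full pairwise matrix.
    potency = {
        a: sum(1 for b in names if b != a and score(a, b) > count_cutoff)
        for a in names
    }
    candidates = set(names)
    themes = []
    while candidates and len(themes) < n_themes:
        # Sorting first makes ties break alphabetically.
        best = max(sorted(candidates), key=lambda name: potency[name])
        themes.append(best)
        # Toss the winner and everything that overlaps it too strongly.
        candidates = {c for c in candidates
                      if c != best and score(c, best) <= exclude_cutoff}
    return themes
```

Because each pick only removes candidates that overlap the winner above `exclude_cutoff`, a near-duplicate of the top list can never take second place, which is exactly the requirement added for the silver medal above.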



whatismygene.com 
