Saturday, March 28, 2026

A Better Website

We've got a more professional interface. Needless to say, AI (Gemini) was largely responsible for the improvement. The improvements are not limited to cosmetics...you should receive noticeably faster outputs from several of the tools (particularly "Relevant Studies", which doesn't require a lot of behind-the-scenes calculations). Despite the assistance, it still took about three weeks of work and 6,000 lines of code. The old site actually required fewer lines of code...this is not a testament to my own efficient coding, but rather, Gemini's insistence on bullet-proofing and commenting everything. 

There are also several new features. The coolest is this: The "Third Study" tool outputs a nice Venn diagram that you could use for your paper. Graphics generated from bioinformatic sites are rarely, if ever, publication ready, but it's easy enough to right-click on the image and edit it in Photoshop or Illustrator. The sizes of the circles and intersecting regions correspond to the number of genes within, failing to some extent when there are only a few genes in an intersection. Click on an output study, go to the "Venn Diagram" tab, and you'll get something like this:




In case you don't know what the tool is supposed to achieve, the user enters genes from two sets that already have a significant intersection. The tool finds studies that intersect strongly with the "central" set, but not the second set. This is apparent in the above diagram, where the central set and the "study" set share 67 genes, but the second set and the study set share only 11. Bearing in mind that the algorithm references both user-entered and database-associated backgrounds, it determined that the log(P) for the central/study intersection was much more significant than that for the second-set/study intersection. Basically, the tool helps you answer the question, "What ELSE is happening in my gene set?" You may find that your gene set is enriched for, say, cell-cycle genes, but you'd also like to know if a second theme is lurking, possibly overwhelmed by the cell cycle signal. This tool will help. 
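The comparison behind the diagram can be sketched with a one-sided Fisher (hypergeometric) test. This is only a rough illustration, not the site's actual code: the overlaps of 67 and 11 come from the example above, while the set sizes (300, 300, 400) and the single 20,000-gene background are made-up assumptions. The real tool works with both user-entered and database-associated backgrounds.

```python
from math import comb, log10

def overlap_log_p(n_a, n_b, n_overlap, n_background):
    """log10 of the one-sided hypergeometric P-value for seeing at least
    n_overlap shared genes between sets of size n_a and n_b."""
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_background - n_a, n_b - k)
               for k in range(n_overlap, upper + 1))
    total = comb(n_background, n_b)
    # math.log10 accepts arbitrarily large ints, so no float underflow here
    return log10(tail) - log10(total)

# Hypothetical sizes: central set 300 genes, second set 300, study 400
BACKGROUND = 20_000
central_vs_study = overlap_log_p(300, 400, 67, BACKGROUND)
second_vs_study = overlap_log_p(300, 400, 11, BACKGROUND)

print(f"central/study log(P): {central_vs_study:.1f}")
print(f"second/study  log(P): {second_vs_study:.1f}")
```

A study is interesting to the "Third Study" tool when the first log(P) is far more negative than the second.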

Here's another new feature: the "Tissue Specificity" box for the "Fisher" tool. Let's say you have a study that compares breast cancer tissue to healthy tissue. You derive a list of DEGs that are upregulated in breast cancer. You could use our Fisher tool to find knockouts, drugs, etc., that tend to downregulate these genes. However, you might suspect that these treatments could cause unpleasant systemic effects. You'd prefer to target genes that are breast-specific. The "Tissue Specificity" choice allows you to do that. Specifically, the tool looks in a table of breast-specific genes and then filters the database for studies in which these genes were specifically targeted. Though not related to tissue-specificity, we've also included a list of genes that can be targeted by existing drugs. More lists are possible.

Another feature is this: "Select Database", seen in the sidebar for several tools. Currently, there are only two choices: our "standard" database and a second database ("Reduced p10"). For the latter, we've simply taken the standard database and removed the top 10% of most commonly perturbed genes. It's computationally expensive to do this on the fly, hence a separate, precomputed database. The revised database is an attempt to address the fact that many input studies converge on relatively few database studies; commonly perturbed genes are removed in order to allow the small-time talents to shine. The idea is far from optimized...removing a mere 10% of genes was probably too conservative, since outputs don't seem to be tremendously different for either choice of database. We've got ideas for other database alterations as well.
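For the curious, here is a minimal sketch of how a "Reduced p10"-style database could be built. It assumes that "most commonly perturbed" means "appears in the largest number of gene lists"; the function name and the toy data are hypothetical, and the real database is obviously far larger.

```python
from collections import Counter

def reduce_database(gene_lists, fraction=0.10):
    """Drop the most frequently occurring genes from every list.

    gene_lists: dict mapping a study name to its ordered list of genes.
    Returns (reduced_lists, dropped_genes).
    """
    # Count each gene once per list it appears in
    counts = Counter(g for genes in gene_lists.values() for g in set(genes))
    n_drop = int(len(counts) * fraction)
    dropped = {gene for gene, _ in counts.most_common(n_drop)}
    reduced = {name: [g for g in genes if g not in dropped]
               for name, genes in gene_lists.items()}
    return reduced, dropped

# Toy database: gene "A" shows up everywhere, the rest only once
toy = {
    "study1": ["A", "B", "C", "D"],
    "study2": ["A", "E", "F", "G"],
    "study3": ["A", "H", "I", "J"],
}
reduced, dropped = reduce_database(toy)
print(dropped)
print(reduced["study1"])
```

Gene order within each list is preserved, which matters if the lists are sorted.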

whatismygene.com 

Thursday, March 26, 2026

The most common perturbation themes

Let's say we enter a new gene list into our database. We can then perform gene enrichment on it against as many as 208,000 other lists. Even after tossing three perturb-seq studies that each generate thousands of gene lists, we can still test our new list against 123,000 other lists. If our new list is enriched for a common perturbation theme, thousands of gene lists may significantly intersect with our list. The question arises: of the 123,000 lists, which one significantly intersects with the most (other) lists?

The answer is drawn from HSF1 Inhibits Antitumor Immune Activity in Breast Cancer by Suppressing CCL5. Here, the list of genes downregulated in the c4-2 line upon dthib treatment overlaps significantly with about 10,000 other lists. Unlike GO lists and the like, it's not immediately obvious what dthib treatment is expected to do. A few seconds of googling reveals that it's an HSF1 inhibitor. Studies that intersect with extreme significance involve a diverse array of perturbations: hdac5 knockdown, her2 inhibition, CDK expression, fgf1 treatment, and many more. The highest ranking GO list comes in at position 1595: "GO:0022402 cell cycle process." It seems that the dthib list encapsulates some process far better than the GO list does. If the easily-grasped wording of the GO lists is appealing to you, then we could just rename the dthib list something like this: "WIMG:0000001 DTHIB downregulated." 😀

Interestingly, the dthib study intersects very strongly with our list of genes that are rarely downregulated in cancer. The drug is indeed being studied as a cancer treatment.

How about the study that intersects second-best with all other lists in the database? Actually, this is not so easy to determine. That's because the second-best, and even the 500th-best, study intersects with the dthib theme. Therefore, we add a requirement: the list we deem to be second-best cannot overlap with the dthib list with a significance greater than -log(P) = 20*. A cutoff of 20 might seem very liberal, but in the case of the dthib study, for example, there are 5924 lists that overlap with at least this level of significance. Given this requirement, the list of genes upregulated in mouse plantaris muscle one day after synergist ablation (Time course of gene expression during mouse skeletal muscle hypertrophy) wins the silver medal. There's actually quite a drop-off from the dthib study here, with only about 5% of database studies significantly overlapping. Again, it's not so obvious what's going on in this study. For a clue, the highest ranking GO list (#1301) is "GO:0006955 immune response." Studies involving viral infections, adjuvant treatment, ischemia, various injuries, and radiation treatment match strongly. Again, to our way of thinking, the sheer volume of studies outperforming the GO list suggests a process, however murky or difficult to name, that should be considered "fundamental."

The third best list cannot overlap with the first or second-best list at -log(P)>20. These are genes upregulated in mouse medullary epithelial cells on raver2 knockdown (Aire-dependent transcripts escape Raver2-induced splice-event inclusion in the thymic epithelium). There is an impressive variety of means to recapitulate this result: lncrna over-expression, enhancer repression, various diets, aging, ezh2 over-expression, mettl3 knockout, etc. The best ranking GO list (#339) is "GO:0046649 lymphocyte activation." Do you think this GO list really captures what's happening here?

The fourth best list involves genes upregulated in the a549 cell line on IRF1 overexpression. Simply knowing that IRF1 is "interferon regulatory factor 1" lets us know that we're talking about the innate immune response. Indeed, studies involving infection and interferon treatment dominate the top-ranked intersecting lists. Finally, a category that looks something like what we were taught in college! Nevertheless, the highest-ranked GO list comes in at position 749: "GO:0140546 defense response to symbiont."

The next three lists are these: 5) genes downregulated in the hair-m line on 12 hours copanlisib treatment (Copanlisib synergizes with conventional and targeted agents including venetoclax in B- and T-cell lymphoma models), 6) genes upregulated in the hn4 line on ngf treatment (Nerve growth factor (NGF)-TrkA axis in head and neck squamous cell carcinoma triggers EMT and confers resistance to the EGFR inhibitor erlotinib), 7) genes downregulated in rat lumbar dorsal spinal cord on injection with coronavirus p65-derived peptide (A human coronavirus OC43-derived polypeptide causes neuropathic pain). Some quick notes: 1) the best GO match to the copanlisib study comes in at rank #2717, 2) the NGF study matches up nicely to numerous tgfb treatment studies, simplifying conceptualization a bit, and 3) the coronavirus p65-derived peptide study aligns well with numerous studies involving sub-cellular organization.

I ran the above text through a chat bot, hoping that it could return my words in a more succinct, elegant, or insightful form. It often works, but not this time. Thus, to wrap things up, I once again offer this: curated gene lists (CGLs, like GO) suck. It's difficult to imagine the number of experiments that never were performed and the potential insights that have been lost because of misleading and/or un-insightful CGL outputs. More generally, I think biology really suffers from an over-enthusiasm for categorization. On the positive side, there's plenty of room for improved delineation of patterns and processes in biology. 


*Actually, this is a crude (?) form of clustering: Find the single most potent study, toss all other studies that overlap with a certain significance, find the new most potent study, etc. The reason we don't use standard clustering here is that a matrix of all study/study P-values would come to about 50 GB. We'd need some serious computing power to generate this matrix and then cluster it.
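The procedure in the footnote can be sketched as follows. This is a toy illustration under stated assumptions, not our production code: the function names, the toy scores, and the cutoff of 3 used for counting "significant" overlaps are all hypothetical. Only the exclusion cutoff of -log(P) = 20 comes from the text above.

```python
def greedy_themes(studies, neg_log_p, count_cutoff, exclude_cutoff=20.0, n_themes=3):
    """Repeatedly pick the study that significantly overlaps the most other
    studies, then toss every study that overlaps the pick too strongly."""
    # "Potency" = number of other lists overlapped at or above count_cutoff
    potency = {s: sum(1 for t in studies
                      if t != s and neg_log_p(s, t) >= count_cutoff)
               for s in studies}
    remaining = set(studies)
    themes = []
    while remaining and len(themes) < n_themes:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=lambda s: potency[s])
        themes.append(best)
        remaining = {s for s in remaining
                     if s != best and neg_log_p(best, s) <= exclude_cutoff}
    return themes

# Toy -log(P) values for study pairs; unlisted pairs score 0
toy_scores = {
    frozenset(("A", "B")): 50.0,
    frozenset(("A", "C")): 30.0,
    frozenset(("A", "D")): 5.0,
    frozenset(("B", "C")): 40.0,
    frozenset(("D", "E")): 25.0,
}

def toy_neg_log_p(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)

themes = greedy_themes(["A", "B", "C", "D", "E"], toy_neg_log_p, count_cutoff=3.0)
print(themes)
```

Note that the toy version still scores every pair once to compute potency. The point is only that each round needs the scores for a single "champion" study against the survivors, so nothing has to be clustered from a full study/study matrix.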




Monday, October 6, 2025

Gene Order in Gene Lists

Whenever possible, WIMG gene lists are sorted. Typically, we divide log(fold-change) by significance and sort from largest to smallest values. If genes are not significantly altered, but nevertheless are associated with fold-changes, we sort by fold-change alone. In cases where more than 33% of all genes are significantly altered, we may choose to create a list via the above "fc/p" method (fold change divided by probability), but also create a second list in which we first eliminate all genes that are not significantly altered (i.e. P>.05) and then sort according to fold-change. Such lists are marked with "p&fc" in their descriptions. 
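Here is a small sketch of the two sorting schemes, written for a list of upregulated genes. The tuple layout, the function names, and the floor on tiny P-values (to avoid dividing by zero) are our assumptions for the example, and the values are invented.

```python
def sort_fc_over_p(rows, min_p=1e-300):
    """'fc/p' scheme: sort by log(fold-change) divided by P, largest first.

    rows: list of (gene, log_fc, p_value) tuples.
    """
    ordered = sorted(rows, key=lambda r: r[1] / max(r[2], min_p), reverse=True)
    return [gene for gene, _, _ in ordered]

def sort_p_and_fc(rows, alpha=0.05):
    """'p&fc' scheme: drop genes with P > alpha, then sort by fold-change."""
    kept = [r for r in rows if r[2] <= alpha]
    ordered = sorted(kept, key=lambda r: r[1], reverse=True)
    return [gene for gene, _, _ in ordered]

# Invented values: (gene, log fold-change, P-value)
rows = [
    ("ABC", 2.0, 0.001),
    ("DEF", 3.0, 0.04),
    ("GHI", 1.0, 0.0001),
    ("JKL", 4.0, 0.5),
]
fc_p_order = sort_fc_over_p(rows)
p_and_fc_order = sort_p_and_fc(rows)
print(fc_p_order)
print(p_and_fc_order)
```

JKL has the largest fold-change but lands last in the first list and is absent from the second, since its P-value is unimpressive.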

Even GO lists are sorted in our scheme. Here, genes that are most commonly perturbed are found at the beginning of GO lists, while housekeeping genes tend to be found at the end.

It seems reasonable that gene order in these sorted lists should observe some repeated patterns. In, say, a cell cycle study, we might see gene ABC followed by DEF, followed by GHI (etc.), while the reverse order might be relatively rare. It's possible to imagine two studies that intersect strongly at the level of genes, but whose genes do not follow a similar order. Conversely, the DEGs in two studies may overlap fairly weakly, but the few genes that are found in the intersection follow precisely the same order. 

The significance of the intersection of two lists and the significance of the similarity of order within the intersection are independent. With this in mind, we added a new feature to our "Fisher" app: 

The default choice is "No"...you don't want to examine gene order. If you select "Yes", the two significances are combined, possibly lowering or increasing the ranks of particular studies in the output list. If you select "Gene Order Only", Fisher's exact test is not applied to your data; instead, Spearman's rank correlation test is used to see if the intersecting genes are found in similar order in both studies. In the odd situation that you'd like to examine cases in which gene order is reversed (one study has ABC DEF GHI and the other has GHI DEF ABC, in order), you could select "Show non-intersecting studies" in the black bar. This causes our terminology to be a bit confusing..."Gene Order Only" doesn't invoke Fisher's exact test at all, and if you select "Gene Order Only", "Show non-intersecting studies" no longer has anything to do with intersections. Never mind. Another nuance that should be pointed out is that the "intersecting genes" column simply shows up to 25 genes that are found in both studies (your input and studies from the database), but doesn't sort the genes according to their contribution to gene order.

Our Spearman's test algorithm will not output unadjusted p-values smaller than 10^-16.
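To make the "gene order" idea concrete, here is a minimal sketch that computes Spearman's rho for the genes shared by two sorted lists. It assumes each list contains no duplicate genes (so there are no tied ranks) and it stops at the correlation coefficient; turning rho into a P-value, as the app does, is left out. The function name and gene lists are hypothetical.

```python
def gene_order_rho(list_a, list_b):
    """Spearman's rho for the order of genes found in both sorted lists.

    Returns None when fewer than 3 genes are shared.
    """
    in_b = set(list_b)
    shared = [g for g in list_a if g in in_b]      # shared genes, in list_a order
    n = len(shared)
    if n < 3:
        return None
    pos_b = {g: i for i, g in enumerate(list_b)}
    rank_a = {g: i for i, g in enumerate(shared)}
    rank_b = {g: i for i, g in enumerate(sorted(shared, key=pos_b.__getitem__))}
    d_squared = sum((rank_a[g] - rank_b[g]) ** 2 for g in shared)
    # Standard no-ties formula: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))
    return 1 - 6 * d_squared / (n * (n * n - 1))

study_1 = ["ABC", "DEF", "GHI", "JKL", "XYZ"]
study_2 = ["QRS", "ABC", "DEF", "TUV", "GHI", "JKL"]
same_order = gene_order_rho(study_1, study_2)
reverse_order = gene_order_rho(study_1, list(reversed(study_2)))
print(same_order, reverse_order)
```

A rho near +1 means the shared genes appear in the same order in both studies; a rho near -1 is the reversed-order case mentioned above.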

***************
Having set up the code for Spearman's test, we can make some inquiries of our database as a whole. One simple question: is there any evidence at all for repeated gene order in gene lists? Absolutely! Restricting ourselves to human rna-seq studies involving perturbations and allowing no more than 400 genes in a gene list, we find several studies whose gene order matches the order found in over 300 other studies at P<=10^-16. The champion is The RNA binding protein RALY suppresses p53 activity and promotes lung tumorigenesis, wherein genes downregulated upon raly knockdown are found in similar order in intersections with 362 other perturbation studies. We found 862 studies that matched the order of at least 10 other studies at this significance (a total of about 70,000 study/study intersections).

What about reverse gene order when comparing study A to study B? It's relatively rare to find cases like this (at P<=10^-16), but they exist. We won't focus on them today.

Do we see cases in which the P-value associated with the intersection of two sets is uninspiring, yet the P-value associated with gene order is very significant? It's a tad unusual, but yes. As an example, the intersection between two studies we've labeled downregulated in dendritic cells from atopic dermatitis patients on R. mucosa vs s. aureus treatment and downregulated in nscs from 22m vs 6m mice is entirely insignificant, yet the gene order of genes found in the intersection is similar at P<10^-16 (1). How about the case where the P associated with the intersection is very significant but the P associated with order is not? This is fairly common. Most typically, however, two studies that match strongly in terms of gene order also match strongly in terms of intersecting genes...this is a bit of a no-brainer, as you can't derive a significant gene-order P if there's little or no intersection between the two sets.

What sorts of studies tend to be associated with gene order? To ask the question, we crossed the 862 studies with each other, generating 370,660 P-values associated with study/study intersections. We can then perform clustering on the resulting P-value matrix. The resulting 7 clusters were fairly clear-cut.

Cluster 1, represented by 248 studies, obviously involves the innate immune response. Gathering together the genes most commonly perturbed in these studies, IFIT3 is the top gene. Examining keywords associated with these studies, "ifn" is over-represented at log(P)= -92. Terms like "cytokine", "infection", and "virus" follow. In the other 6 clusters, the keyword "line" (as in cell line) is quite significant, but not here. A typical gene order looks like this: IFITM1 RSAD2 IFIT1 OASL ISG15 IFIT3 HERC5 IFI44 IFI35 RIPK1 RCAN1 NAPSB SIPA1L1.

Cluster 2 is represented by 472 studies. The gene CCNA2 was found in 465 of them, strongly suggesting that we're talking about the cell cycle. A typical gene order is: MKI67 RRM2 KIF20A ASPM TK1 GTSE1 NUSAP1 KIF23 ZNF367 TCF19 TRIP13 CKS2.

Cluster 3 contains 83 studies, with TRIB3 being found in 78 of them. The keywords are interesting: drug, natural, metabolite, depletion, and more. In other words, the cluster is enriched for drug studies, "natural" treatments (diets, fitness regimes, health foods, etc.), metabolite perturbations, and depletion of various nutrients and metabolites. Gene order looks like this: NIBAN1 TRIB3 DDIT3 GDF15 MTHFD2 HYOU1 NADK2 SKIL AZIN1 ZXDB.

Cluster 4 contains 12 studies, with several genes found 11 times: RPL30 RPS23 RPL14 NACA. "ripseq" and "cell part" (meaning studies in which one organelle or the like is examined against another) are prominent keywords. RPL39 RPL35A RPS17 RPL22 EIF3L NCOA3 PLP2 is a typical gene order.

Cluster 5 contains 14 studies...there are quite a few genes found in all of them. Keywords "drug" and "hypoxia" are prominent. The gene order looks like this: HMGCS1 MSMO1 HMGCR ACSL1 VAT1 MRNIP TSC22D3 MVP MT-TC.

Cluster 6 contains a mere 6 studies, with 65 genes being found in all of them. All of the studies are knockdowns, and the keyword "kd" is indeed the top keyword (log(P)<-55). There's also an association with lncrnas and circrnas. A typical gene order is: TPM4 GANAB CASP7 BICD2 TBC1D10B ZW10 ZSWIM9 LPCAT1 NFKBIB CYP1B1.

Cluster 7 is the garbage can for the remainder of the studies that we selected. There are 24 studies, with PRDX4 and H1-2 being found 9 times. There is again an association with hypoxia.




(1) I don't want to read too much into two studies, but we might be able to explain the result like this: we know that genes involving immunity are often differentially regulated in old vs young subjects. While the two studies, a human infection study in dendritic cells and a mouse aging study in neural stem cells, would not be expected to intersect greatly, the few genes that do intersect follow a very significant gene order pattern involving an immune process. This is kind of cool, I think...without cheating, we're extracting a link between two studies that would ordinarily be hidden.



Wednesday, September 10, 2025

The WIMG view of mouse Alzheimer's studies

Here's a recent review of the state of the field of Alzheimer's research in non-humans. To summarize...these studies, nearly all of which seek to induce amyloid or tau pathology, have a dismal record.

The WIMG database has quite a large compendium of Alzheimer's studies...the term "Alzheimer's" is found in about 1200 lists, drawn primarily from human and mouse studies. Previously, we used the human portion of these lists to construct new lists of genes that are canonically up- and down-regulated in the Alzheimer's disease brain (dbase IDs 123049121 and 123050121). How do mouse studies match up with these two lists?

Knowing that it's hard to get any perturbation to generate a result that looks like our Alzheimer's upregulation list, let's start with transcripts that are canonically down-regulated in Alzheimer's. Not surprisingly, the studies that best match up with this list are the human studies that compose the list. This is followed by other human neural disorders...Creutzfeldt-Jakob, Nasu-Hakola, etc. The first mouse match is ranked 21st in terms of match significance (log(P) = -33). We've labeled it as transcripts "upregulated in mouse cortex 4d vs 2d after skull injury", but you can impose a double negative on that wording to get an equivalent: transcripts downregulated in mouse cortex 2d vs 4d after skull injury. Perhaps that wording makes it more obvious that we're talking about transcripts that are downregulated early in the process of injury recovery. These injury-related studies, in fact, dominate the top of our list of mouse studies that mimic the genes that are downregulated in Alzheimer's...of mouse studies, ranks 1, 3, 8 (spinal tissue!), and 10 (cerebral artery occlusion) match our Alzheimer's list fairly significantly*.

How about rank 2? Here we're talking about a single-cell cluster ("neurons2") of brain stem neurons with and without a SOD1 mutation (Single-cell RNA-seq analysis of the brainstem of mutant SOD1 mice reveals perturbed cell types and pathways of amyotrophic lateral sclerosis). This is another theme of our mouse Alzheimer's-mimic list: clustering and/or cell-type results involving neurons, perhaps suggesting that very specific types of neurons may be more or less involved in Alzheimer's.

Another theme involves studies of embryonic brain cells. This is seen in ranks 5, 16, 18, 19, and 21.

Studies that might seem rather odd in their ability to deliver an Alzheimer's signature involve genes downregulated in the colon (!) upon gavaging with mulberry extract nanoparticles (rank 4, GSE185351), genes upregulated on pyk2 knockout (27, GSE180598), genes upregulated in aorta on rage knockout (28, GSE15729), and genes downregulated in microglia on ehmt1 haploinsufficiency (36, Derepression of inflammation-related genes link to microglia activation and neural maturation defect in a mouse model of Kleefstra syndrome). 

Wait a second...where are the explicit mouse Alzheimer's studies that involve, say, the 3XTG or 5XFAD models? Well, the first hint of such a result is found at rank 13: "genes negatively correlated w/plaque intensity in E4 5XFAD mouse brain". Note, however, that this doesn't quite fit the bill, as both the test and control samples involve a 5XFAD mouse brain. It turns out you have to go down to the 99th mouse study on our list to find such a result ("downregulated in mouse 5XFAD vs wt 8m hippocampus", GSE149243, log(P)=-11). In the process, you pass through studies involving the retina, muscles, adrenal glands, heart, myoblasts, and more. In other words, a myriad of seemingly irrelevant mouse studies do a much better job of mirroring the Alzheimer's signature than studies explicitly designed to generate the signature in a mouse brain.

At this point, if we had to say something positive about mouse Alzheimer's studies, we'd say that the 5XFAD model appears best. The term "APP/PS1" first appears at rank 798. The term "3XTG" first appears at rank 1950 of 149,000 lists, with an unadjusted log(P) of -1.26.

Perhaps the mouse models do a better job of mimicking genes that are upregulated, not downregulated, in Alzheimer's. Let's take a look. Here, the first mouse study is found at rank 45 with log(P)= -8: "up-regulated in mouse cortical culture on ursodiol" (GSE110256). Ursodiol, interestingly, is a bile acid generated by humans, but in higher concentrations in bears and hibernating animals. Perhaps there is some natural justice dealt out to the humans who torture bears for their bile juice.

Eliminating all non-mouse studies, study #2 involves downregulation of hypothalamus genes upon DHA treatment (GSE64807). We've previously noted the possible benefits of DHA. Again, we see studies involving injury: ranks 9, 11, 27 (a heart infarction study), and 53 (a skin-wounding study). Bearing in mind that the p-values aren't impressive, we also see a number of gene perturbation studies that parallel the upregulation signature: lsd1 knockout, hiv-gp120 overexpression, circSCMH1 overexpression, and arx mutation.

Where do we see the first occurrences of "5XFAD" or "3XTG"? Amazingly, the first explicit 5XFAD study is ranked #4273 (unadjusted log(P) = -0.76). The situation is worse for the first 3XTG study in the list: rank #5965; here, genes upregulated in the mouse model match our list of genes downregulated in Alzheimer's better than our list of upregulated genes.

Simply put, mouse Alzheimer's studies suck. Mouse studies that do mirror the Alzheimer's signature weren't conducted with the intention of furthering understanding of Alzheimer's. One could complain that we're judging the mouse studies based on a single perspective (gene set analysis of human vs mouse transcriptomes)...but, as seen in the aforementioned review, the mouse studies have failed in numerous other respects.

*****************

If you're interested in perusing the full list of studies mentioned above, it's easy. Just go to the WIMG website, choose the Fisher tool, enter the database ID for either the Alzheimer's upregulation or downregulation list, and submit. To focus entirely on mouse studies, choose "Mouse" in the species box.


*10/29/2025: Here's an injury study we just added to the database: Transcriptome Profiling of Hippocampus After Cerebral Hypoperfusion in Mice. Here, genes downregulated in the hippocampus upon bilateral carotid artery stenosis match up with our list of genes downregulated in Alzheimer's with a log(P) of -17!




Sunday, August 24, 2025

Still more perturb-seq

Previously, we alluded to yet another perturb-seq dataset. Here it is: Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states. This time, the authors used crispr gene activation to examine the effects of over-expression of a near-comprehensive list of transcription factors in rpe1 and hs27 cell lines.

Before some discussion of the above Southard et al dataset, we should point out yet another "largest" perturb-seq dataset that we won't be adding to the database: the Tahoe100M matrix. As with the Xaira dataset, there's some hype regarding the data:

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

Unlike the Xaira data, there's not a lot of sequencing depth here. As the Xaira paper itself points out, Xaira identified 8.45 times more unique molecular identifiers (UMIs...roughly speaking, we're talking about transcripts) per cell than the Tahoe100M folks did. To exaggerate, the bioinformatician is left trying to utilize a list of ribosomal and mitochondrial counts to infer the effects of 1,100 chemical perturbations on 50 different cell lines. As much as WIMG neurotically loves hoarding data, we'll pass on this one.

Getting back to the Southard paper, we see a respectable 5,000 UMIs per cell. The data is available in a fairly processed, compact form, enabling us to churn out gene lists without a lot of optimization. Given the crispr activation, we'd like to see the targeted gene consistently appear in the list of upregulated genes. Though this can be seen at a frequency far above chance, the majority of our 100-member upregulation lists (90% or so) lack the perturbed TF. We attribute this to the fact that TFs are typically non-abundant entities, falling outside the limits of detection in Southard's setup. 

As with the Xaira lists, we can observe the extent to which various Southard lists match up against "WIMG exemplar" lists. If all Southard lists failed to overlap with these lists, or all Southard lists overlapped equally (i.e. they don't cluster) with these lists, we'd question the quality of the data, or our preparation of the data. That's not the case here. As an example, Southard's LHX4, GATA1, MYC, and HIF1A activations all overlap WIMG exemplar data with very significant p-values, without overlapping with each other to any great extent. Below, note how well the HIF1A activation matches up with hypoxia studies:


The "hif1a chip-seq" result (line 14) is quite nice. It's easy to conclude that hif1a is primarily an activating, rather than repressing, transcription factor...the genes that are upregulated when hif1a is overexpressed are also found in a list of hif1a DNA targets.

Here's another nice example of Southard data quality: the single best Southard match to a myc knockout in mouse t-all cells (GSE222937) is myc activation in rpe1 cells.

We thus stamp a "not junk" label on Southard's data and include it in our database.

****************

Previously, we pointed out some issues that arise when we add these massive datasets to our database. In particular, naively combining these sets with the rest of the WIMG database skews co-expression results to an extreme. Thus we must take steps to minimize these effects. To include these perturb-seq sets in your analysis, you'll need to select the "database" option on our website. 


For Fisher analysis, we've lumped the three major perturb-seq studies (Repogle, Xaira, and Southard) in our database together. It's possible, however, that you only wish to conduct analysis with one of these studies. Using the "keyword search" box, you could choose to examine only Repogle's work by typing "Repogle". You could also choose to exclude "Repogle". Likewise with the terms "Xaira" and "Southard". Let's say you only want to examine Xaira's hct116 results, not the hek293 results: type "Xaira hct116". Likewise for "Xaira hek293", "Repogle k562", "Repogle rpe1", "Southard hs27", and "Southard rpe1". Be sure to spell correctly. In general, when folks choose to perform Fisher analysis on WIMG, they want to quickly scan a database of diverse studies. The default settings, which exclude perturb-seq studies, optimize that.

For co-expression analysis, we've set things up so that you don't mix databases. Thus, you can choose "Only Xaira hct116" from the database box, but you can't combine Xaira's hct116 results with our standard database. We did include a somewhat dubious "Include Perturb-Seq crispr i/a" option, which combines Xaira, Southard, and Repogle results.

Let us know if our current interface prevents you from performing your desired analysis. In the worst case, we can get our hands dirty and do some coding.



Wednesday, August 13, 2025

More Perturb-Seq

Xaira, a recently formed billion-dollar biotech, has released monster perturb-seq datasets involving crispr-inhibition in hek293 and hct116 cell lines. This data now joins the Repogle perturb-seq dataset in our database. For more background on the Repogle set, and on the perturb-seq approach in general, see our relevant post.

In our next post, we will explain how to access the Xaira data on the WIMG website.

Unlike Repogle's data, Xaira's data is not currently available in a form more processed than mere count data. Thus we were faced with the task of dicing up 500 GB of scRNA count data. To be honest, we've never had reason to process this kind of data for ourselves...we scrounge the processed results from others. We initially attempted to follow standard protocols, where adjustments are made for extremely sparse data and large batch effects. Our initial, naive attempts found that control data could be grouped into two very distinct clusters. One cluster was dominated by high abundance ribosomal and mitochondrial transcripts; the other wasn't. Though batches were clearly labeled in the data, the clusters did not conform to batches (i.e. it cannot be definitively said that batch 100 is overloaded with ribosomal transcripts, and batch 127 isn't), and thus standard single-cell batch-control methods did not alleviate the presence of distinct clustering in controls. After adjusting for our own clustering results, we were disappointed. Another issue: various methods did not seem to dramatically improve the frequency with which the knocked-down gene appeared near the top of the list of downregulated genes (1). Without going into the dirty details, we finally settled on a simple procedure...normalize the counts, perform log1p adjustment, grab a random subset of control data, and perform Wilcoxon's test for significance on specifically targeted test samples vs controls. Such an algorithm performed best in drawing targeted genes to the top of their corresponding downregulation lists. Gene lists were sorted according to log(fold-change) divided by significance.
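The simple procedure we settled on can be sketched, for a single gene, roughly as below. This is a stdlib-only toy and not the code we actually ran on the Xaira data: we assume the unpaired Wilcoxon rank-sum test with a normal approximation (no tie correction of the variance), a per-cell scaling target of 10,000 counts, and invented counts. All function names are hypothetical.

```python
import math
import random

def normalize_log1p(cell, target_sum=10_000):
    """Scale one cell's counts to a fixed total, then apply log1p."""
    total = sum(cell.values())
    return {gene: math.log1p(count * target_sum / total)
            for gene, count in cell.items()}

def rank_sum_test(test_vals, ctrl_vals):
    """Wilcoxon rank-sum test; returns (z, two-sided P) by normal approximation."""
    pooled = sorted([(v, 0) for v in test_vals] + [(v, 1) for v in ctrl_vals])
    rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of ranks i+1 .. j (ties share it)
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(test_vals), len(ctrl_vals)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))

def knockdown_test(test_cells, control_cells, gene, n_controls, seed=0):
    """Compare one gene in targeted cells vs a random subset of controls."""
    subset = random.Random(seed).sample(control_cells, n_controls)
    test_vals = [normalize_log1p(c).get(gene, 0.0) for c in test_cells]
    ctrl_vals = [normalize_log1p(c).get(gene, 0.0) for c in subset]
    return rank_sum_test(test_vals, ctrl_vals)

# Invented counts: NRF1 is lower in the cells carrying the NRF1 guide
controls = [{"NRF1": n, "ACTB": 100} for n in (10, 12, 9, 11, 13, 10)]
targeted = [{"NRF1": n, "ACTB": 100} for n in (2, 3, 1, 2)]
z, p = knockdown_test(targeted, controls, "NRF1", n_controls=5)
print(f"z = {z:.2f}, P = {p:.4f}")
```

A negative z means the gene is lower in the targeted cells than in the control subset.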

We can cluster the resulting gene lists by first generating a matrix of study/study Fisher p-values. This can be a matrix that matches Xaira lists against Xaira lists. It can also be a matrix that matches Xaira lists against our entire database. Choosing the latter approach, we were again disappointed...both the elbow and silhouette methods identified an optimal cluster number of 2. Ideally, one would like to see tens or hundreds of clusters, each representing special processes in cells. As with the control data alone, one cluster was dominated by high abundance genes.

If Xaira, or some other entity, can provide better processed data, we'll happily snatch it up and overwrite our own.

There are signs, however, that the Xaira data, excreted by our crude procedure, contains worthwhile biological information. We note, for example, that Xaira knockdowns do align with the same knockdowns/outs from other studies at a frequency that is certainly not random. As just one example, genes downregulated in both of Xaira's NRF1 knockdowns strongly align with a study in which NRF1 was knocked out in the mouse retina (GSE150258); the Xaira hct116 list was the third best match out of 146,950 lists and the hek293 data was the ninth best match (2). Also, while the numerous lists in which ribosomal/mitochondrial genes seemed most strongly perturbed are bothersome, there may be an element of biological reality here: grouping all the genes whose knockdown apparently strongly perturbs ribosomal transcripts, we find very strong (p<10^-20) representation by genes involved in ribosomal RNA processing. These moderate- to low-abundance genes are precisely the genes whose knockdown would be expected to decrease ribosomal RNA levels (3,4). Another positive sign: genes targeted by sgRNA were found in the corresponding 100-member downregulation lists around 50% of the time. Given that roughly 20,000 genes were identified at non-zero levels, one would expect to see the targeted gene appear in the 100-member downregulation list about 0.5% of the time if the lists were composed of random garbage.
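
The arithmetic behind that 0.5% figure, as a few lines of Python. The numbers are the round figures quoted above; the variable names are just for illustration.

```python
list_size = 100          # genes in each downregulation list
genes_detected = 20_000  # genes seen at non-zero levels (round figure)
observed_rate = 0.50     # fraction of lists that contain their own target

# If a list were 100 genes drawn at random, the targeted gene would land
# in it with probability list_size / genes_detected.
chance_rate = list_size / genes_detected
fold_over_chance = observed_rate / chance_rate

print(f"chance: {chance_rate:.1%}, observed: {observed_rate:.0%}")
print(f"fold over chance: {fold_over_chance:.0f}x")
```

In other words, the targeted gene shows up roughly 100 times more often than random lists would allow.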

Assuming the sequencing of a suitable number of cells (say, 1000), any scRNA-seq paper is expected to show results of at least one clustering procedure. The optimal number of clusters, arrived at by any number of methods, can be disappointing, as above. I'm not in a position to critique the underlying math of clustering methods, but I can say that these procedures often seem to ignore rare gene patterns in favor of forcing all gene patterns into a fixed number of sets (5). Examining Xaira data against 74 "WIMG exemplar" lists which constitute largely non-overlapping gene patterns (as measured by Fisher's exact test: see our preprint), we find Xaira gene lists that strongly match 30 of these patterns. For example, Xaira's TMEM131 kd in hct116 cells matches quite nicely (p<10^-38) with genes found in hek293 ER fraction vs cytosol (GSE215768) (6). Genes upregulated on Xaira INTS8 kd in hct116 cells match up very nicely with genes upregulated in hcclm3 cells on BRD4 inhibition (GSE181406). Patterns generated by knockdown of genes such as ZC3H13, DDX27, SRSF1, ZWILCH, NAA25, CMTR1, ELOB, TRMT2A, and many more, match up with high significance against our (again, non-overlapping) exemplar lists.

One of the more interesting and impressive results involved genes upregulated in Xaira's REST knockdown in hek293 cells, which overlapped with great significance with a study in which PRRX1 was overexpressed (p=10^-36: GSE180515) (7); the next closest Xaira match to this result involves knockdown of CDYL and a p-value of a mere 10^-7.4 (8). Another notable result: genes upregulated in both hek293 and hct116 lines on GRPEL1 kd overlapped strongly with a study in which IGF2BP1 was knocked out (GSE115646). And, to jump the gun a bit (our next post): genes downregulated in Xaira's PPARGC1B kd in hek293 cells overlap strongly with genes upregulated in Southard's perturb-seq PPARGC1A CRISPR activation: both results overlap a study in which ANLN was knocked out in mda-mb-231 cells (GSE131120).

Most of the above observations were made by an "eyeball" approach. Taking a more systematic, computerized approach would probably yield reams of potentially interesting results.

1) Perhaps the biggest oddity in the data was this: the presence of a normally ho-hum transcript, PLXDC1, in a very large number of up- and down-regulation lists in both hct116 and hek293 results. WTH?

2) Another example: The single best Xaira match to genes upregulated on eif4a1 ko in mouse b-cells (GSE237426) is the Xaira eif4a1 kd in hct116. Another: our database's (155,000 lists) 3rd best match to genes upregulated in mouse cerebellum on eif2b5 mutation (GSE128092) is Xaira's eif2b3 kd in hek293...Xaira's eif2b5 kd in hek293 ranks 25th. Another...the single best Xaira match to a zeb1 ko in mouse osteoclasts (GSE212302) is the Xaira zeb1 kd in hek293. Another...the second best Xaira match to a mouse sin3a ko in cd4+ t-cells (GSE196615) is Xaira's sin3a kd in hct116. Another...the second best Xaira match to a mouse cdyl ko in embryonic gonads (GSE226049) is Xaira's cdyl kd in hek293. (If you find it odd that all the above studies involve mice it's simply because we've been focusing on increasing the proportion of mouse studies in the database). Another: the second best Xaira match to genes upregulated in rael cells on uhrf1 ko (GSE136596) is Xaira's uhrf1 kd in hek293.

3) I'd guess that these results are, in turn, strongly dependent on exactly how long the knockdown was conducted prior to freezing the cells. Had the average knockdown period been increased by a few hours, allowing recovery of ribosomal genes, or a shift into backup programs, the gene lists could be quite different. In the end, despite the massive funding ($2.00 per cell?) and output behind these studies, they only examine particular cells under particular conditions and timeframes. I'm a bit skeptical of the ability of these monster studies to reveal extraordinary insights into cellular biology on their own, whether via standard statistics or AI approaches (yes, this is a WIMG plug).

4) The best example of a Xaira knockdown that generates a list of genes overloaded with ribosomal and mitochondrial entities involves knockdown of cmtr1. In both hek293 and hct116 lines, cmtr1 kd very significantly downregulates these abundant genes. Remarkably, examining an independent study in which cmtr1 was overexpressed in mefs (GSE200103), the single best Xaira match to this study is...cmtr1 kd in hct116 cells. Cmtr1 kd in hek293 was the third best Xaira match. For reference, there are now 37,310 Xaira lists in the WIMG database.

5) To be a tad more precise...whatever value is being minimized/maximized in these procedures, it seems like it's best done not by placing one or two outlying lists into a separate cluster, but by generating clusters derived from larger numbers of lists. Thus merely increasing the cluster number doesn't automatically highlight rare but interesting gene patterns. Having said that, ChatGPT offers me a list of 8 options to overcome this issue...tinkering with the "resolution parameter" sounds promising.

6) Sure enough, a little googling shows that TMEM131 is involved in ER transport.

7) We've pointed to REST as an interesting gene in previous posts. Here, for example. We've also noted a relevance to Alzheimer's. Yup...of 37,000 Xaira gene lists, the one that best overlaps our list of genes downregulated in Alzheimer's is a list of genes upregulated on REST knockdown in hek293 cells.

8) In WIMG parlance, this is something of a "microcluster"...a result which overlaps with high significance with only one or a few other studies, followed by a dramatic drop-off in significance. We've identified about 950 microclusters scattered throughout the database, which currently contains about 19 billion study/study overlaps. In this particular case, I don't actually make the "microcluster" annotation in the database, since there are non-Xaira studies that overlap with the PRRX1 study quite significantly. But within the context of Xaira-only studies plus the PRRX1 study, the REST knockdown really stands out.

whatismygene.com 

Friday, April 25, 2025

Stuff that might be true

I'll add to the below list as thoughts pop into my brain....

*"Celebrity" genes are over-rated. Last I looked there are something like 10,000 papers primarily devoted to tp53. Every now and then I stumble across a knockout that has not, to my knowledge, been performed before. One might think that such knockouts would be less likely to generate a long list of significantly perturbed transcripts than, say, a tp53 knockout. I just entered a study involving a CHSY3 knockout into the database. That's the first instance of a CHSY3 perturbation in the database, and the knockout had a major effect on transcript abundances in the underlying study. This sort of thing happens again and again...it's not as if the list of genes whose knockout strongly alters cell activity was exhausted a decade ago.

*Our understanding of biology is strongly biased according to the order of discovery. As an example, it seems that folks have a fairly fixed idea of what micro-RNAs do, if they do anything at all. When perusing micro-RNA overexpression and inhibition studies, the studies in which large numbers of transcripts are significantly altered (versus few or none) usually seem to involve the symbol "mir" followed by a small number, not a large number (e.g. mir1 vs mir1234). This may seem odd until you consider that, in the early days of miRNA research, new miRNAs would simply be given the first integer that had not already been taken. In other words, the early miRNAs, which were discovered because they actually did something in cells, may have created the illusion that there are thousands of interesting miRNAs, all of which act according to the principles associated with the earliest miRNAs.

*Also, regarding miRNAs: it's possible that, at "ground truth" level, the typical miRNA only targets one or a few transcripts (1). This is based on an observation I haven't quantified: that miRNA overexpression and inhibition studies, in contrast to these experiments conducted on ordinary transcripts, often seem to strongly alter the expression of one or a handful of transcripts, followed by a clear drop-off in significance and/or fold-change.

*Similarly, it's possible that in a typical lab-generated list of significantly altered genes, relatively few matter. That is, a large portion of these transcripts or proteins are basically junk, possibly generated to maintain the proper concentration of RNA and/or protein. I base this on admittedly flimsy evidence (4): that if you take a large database of perturbations (e.g. WhatIsMyGene's), generate all study/study overlap P-values, and cluster the data, you might be surprised at how few clusters you generate (using standard methods to determine optimal cluster numbers...e.g. the "elbow" method). To put it overly-dramatically, one study looks like the next. How can sophisticated biological decisions be made in that case? Because a lot of the sameness between studies is not interesting, while a handful of truly interesting genes make a big difference.

*Some genes may reach celebrity status because they are exceptions to a rule: that one type of study (e.g. knockouts) usually overlaps weakly, if at all, with other types (e.g. chip-seq). P53, for example, breaks the rule: the genes altered in P53 knockouts often do overlap with P53 chip-seq results. Of course, P53 also has the property that it is often mutated in cancer, so another possible "truth" is that genes that break the rule are precisely the ones that get targeted by viruses or cancers.

*Modern biology is hugely skewed by the results of experiments in which cells are essentially blasted with extreme conditions that are rarely, if ever, experienced in normal living creatures...complete knockouts, targeted alterations of single genes, micro-RNA levels 100X greater than anything ever experienced in nature, etc. These conditions are very often measured over extremely short time scales, largely due to the limitations of cell culture approaches. It is possible that these approaches have skewed biology in massive fashion. Again, consider Alzheimer's, a disease whose seeds may be planted decades before symptoms become obvious...I've yet to see any sort of experiment, either with mice or cell culture, that parallels the set of genes that are typically upregulated in Alzheimer's (2). Perhaps this is simply because of the near-impossibility of conducting experiments over the course of decades.

*OK, maybe more of a critique than a "truth": let's say you do a chip-seq experiment using transcription factor XYZ. You collect the list of most-strongly bound genes and test your results against a parallel XYZ knockout experiment. Let's assume your background figures are good. You perform Fisher's test and get -log(P) = 4. Should you be impressed with this statistic? Should you perform further experiments based on this very significant number? I say no. If you had tested your chip-seq list against 150,000 other lists, you may have found 5,000 lists that out-performed your knockout study. In fact, after correction for multiple testing, your knockout results would be rendered insignificant. Yes, I'm plugging WhatIsMyGene. 
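
To put numbers on that, here is a small sketch. It assumes the -log(P) above is base 10 and uses a plain Bonferroni correction, which is the bluntest of several multiple-testing options and not necessarily what any given tool applies.

```python
import math

def bonferroni(p, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p * n_tests)

n_lists = 150_000
p_raw = 10 ** -4                 # -log10(P) = 4
p_adj = bonferroni(p_raw, n_lists)

# Smallest -log10(P) that would still clear 0.05 after correction
needed = -math.log10(0.05 / n_lists)

print(p_adj)
print(round(needed, 2))
```

The adjusted P-value hits the cap of 1, and a single comparison would need a -log(P) of roughly 6.5 to survive correction against 150,000 lists.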

*Maybe the "replication crisis" is a bit exaggerated. Cells and organisms are simply very sensitive to seemingly minor differences in experimental settings. Perhaps stochastic effects are much more powerful than we generally believe (3). Let's say you knock out gene ABC in mouse kidneys in your lab. Somebody else does the same thing in their lab. The results overlap weakly. Did somebody screw up? Maybe not. Using a tool like WhatIsMyGene, you may find that both studies nevertheless overlap rather nicely with a third study involving gene XYZ. (Also, recall the above point that maybe only a few transcripts really "matter" in a list of significantly altered genes, with the rest being relatively unimportant).

*Alzheimer's has some relationship to stem cells. I say this because time and time again the genes downregulated in Alzheimer's overlap strongly with studies involving stem cells and embryonic cells. The problem is...brains don't have a lot of stem cells, especially if you exclude the SVZ. I don't know how to work around this issue...perhaps some brain cells, neurons in particular, have a stem-like signature but lack the standard markers for stem-cells.

*A bit more speculatively, Alzheimer's may also have some connection to the appendix and appendicitis. In addition to papers suggesting a link (google it), I'd also point out an odd overlap between our own list of transcripts typically up-regulated in the Alzheimer's brain and a list of transcripts up-regulated in the mouse distal colon following appendicitis (GSE23914). The significance is not impressive...but it's difficult to find any studies that overlap strongly with those up-regulated in the human Alzheimer's brain. The study ranks as the 332nd best match to the Alzheimer's list (against 145,000 other studies), competing against studies primarily derived from human cells and brain cells.

*Some in-vivo studies may produce distorted results because of the time of day at which they cut open their subjects. I'm looking at a 12 week study where mice were treated with control vs drug. Out of 145,000 studies examined, the best match (at p=10^-45) would be one that examined mouse livers at zt21 vs zt12. It turns out that the drug in question, minocycline, actually does recalibrate circadian rhythms. But...how often is the possibility of circadian effects totally ignored?

*Methylation of DNA may do more than repress or activate transcription. It may also regulate consistency/variability in expression. I have admittedly shoddy evidence for this notion: a list of genes that do not commonly correlate with batch effects is replete with genes that are often seen in DNA methylation experiments (see "Quantifying batch effects for individual genes in single-cell data").


1. Just as an example, a study in which mir138 was inhibited (GSE173982) results in very significant downregulation of a single transcript, NDUFA9. Another one...mir-222-3p treatment results in the very significant downregulation of a single transcript (Gm10925) in GSE167753. Another one: mir144-3p inhibition in stress susceptible mice results in significant downregulation of a single gene (kcnj8) in GSE209673. Also, in GSE211749, only one transcript is downregulated with strong significance (zgrf1) on a triple miR-322-503-351 ko in white adipose. Also, in GSE216981, mir150 knockout downregulates tnfrsf26 at a significance of 10^-204, while the next most significant alteration comes in at 10^-20.

2. Downregulated genes in Alzheimer's are a different case...these are seen in many kinds of perturbation and clustering experiments involving brain tissue.

3. Another issue is this: if the only way you can replicate a study is via extreme rigor, how generalizable/interesting are your conclusions about your gene of interest? 

4. Here's some more evidence: if you take two mouse strains and compare transcriptomics from a particular organ, you'll get a long list of differentially regulated genes...it wouldn't be surprising to see more than 50% of transcripts significantly altered. Thus, you have very different transcriptomics, but a very similar product...a mouse. One could surmise, therefore, that most of these transcripts aren't doing anything.

whatismygene.com 
