WhatIsMyGene: August 2025

Previously, we alluded to yet another perturb-seq dataset. Here it is: "Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states" (Southard et al.). This time, the authors used crispr activation to overexpress a near-comprehensive list of transcription factors (TFs) in rpe1 and hs27 cell lines.

Before some discussion of the above Southard et al. dataset, we should point out yet another "largest" perturb-seq dataset that we won't be adding to the database: the Tahoe-100M matrix. As with the Xaira dataset, there's some hype regarding the data:

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

Unlike the Xaira data, there's not a lot of sequencing depth here. As the Xaira paper itself points out, Xaira identified 8.45 times more unique molecular identifiers (UMIs...roughly speaking, we're talking about transcripts) per cell than the Tahoe-100M folks did. To exaggerate, the bioinformatician is left trying to use a list of ribosomal and mitochondrial counts to infer the effects of 1,100 chemical perturbations on 50 different cell lines. As much as WIMG neurotically loves hoarding data, we'll pass on this one.

Getting back to the Southard paper, we see a respectable 5,000 UMIs per cell. The data is available in a fairly processed, compact form, enabling us to churn out gene lists without a lot of optimization. Given the crispr activation, we'd like to see the targeted gene consistently appear in the list of upregulated genes. Though this can be seen at a frequency far above chance, the majority of our 100-member upregulation lists (90% or so) lack the perturbed TF. We attribute this to the fact that TFs are typically non-abundant entities, falling outside the limits of detection in Southard's setup.
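The bookkeeping behind that "90% or so" figure is simple enough to sketch. The Python below is a minimal illustration, assuming each perturbation's upregulation list is stored as a ranked list keyed by the targeted gene. The function name, the data layout, and the toy gene lists are invented for illustration and are not the actual WIMG pipeline.

```python
def target_hit_rate(up_lists, top_n=100):
    """Fraction of perturbations whose target appears in its own top-N list.

    up_lists maps a perturbed gene symbol to its ranked list of upregulated
    genes, strongest first (a hypothetical layout for this sketch).
    """
    if not up_lists:
        raise ValueError("no gene lists supplied")
    hits = 0
    for target, ranked in up_lists.items():
        top_genes = {gene.upper() for gene in ranked[:top_n]}
        if target.upper() in top_genes:
            hits += 1
    return hits / len(up_lists)


# Toy lists, invented for illustration only.
toy = {
    "HIF1A": ["VEGFA", "HIF1A", "CA9"],
    "MYC": ["NPM1", "NCL", "LDHA"],
    "GATA1": ["HBG1", "GATA1", "KLF1"],
    "LHX4": ["POU1F1", "PROP1", "SIX3"],
}
rate = target_hit_rate(toy, top_n=100)
print(f"{rate:.0%} of lists contain their own target")
```

Comparing gene symbols case-insensitively is a design choice made here because this blog writes symbols in both cases; a real pipeline would more likely match on stable gene IDs.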

As with the Xaira lists, we can observe the extent to which various Southard lists match up against "WIMG exemplar" lists. If all Southard lists failed to overlap with these lists, or all Southard lists overlapped equally (i.e. they don't cluster) with these lists, we'd question the quality of the data, or our preparation of the data. That's not the case here. As an example, Southard's LHX4, GATA1, MYC, and HIF1A activations all overlap WIMG exemplar data with very significant p-values, without overlapping with each other to any great extent. Below, note how well the HIF1A activation matches up with hypoxia studies:

The "hif1a chip-seq" result (line 14) is quite nice. It's easy to conclude that hif1a is primarily an activating, rather than repressing, transcription factor...the genes that are upregulated when hif1a is overexpressed are also found in a list of hif1a DNA targets.

Here's another nice example of Southard data quality: the single best Southard match to a myc knockout in mouse t-all cells (GSE222937) is myc activation in rpe1 cells.

We thus stamp a "not junk" label on Southard's data and include it in our database.

****************

Previously, we pointed out some issues that arise when we add these massive datasets to our database. In particular, naively combining these sets with the rest of the WIMG database skews co-expression results to an extreme. Thus we must take steps to minimize these effects. To include these perturb-seq sets in your analysis, you'll need to select the "database" option on our website.

For Fisher analysis, we've lumped the three major perturb-seq studies (Repogle, Xaira, and Southard) in our database together. It's possible, however, that you only wish to conduct analysis with one of these studies. Using the "keyword search" box, you could choose to examine only Repogle's work by typing "Repogle". You could also choose to exclude "Repogle". Likewise with the terms "Xaira" and "Southard". Let's say you only want to examine Xaira's hct116 results, not the hek293 results: type "Xaira hct116". Likewise for "Xaira hek293", "Repogle k562", "Repogle rpe1", "Southard hs27", and "Southard rpe1". Be sure to spell correctly. In general, when folks choose to perform Fisher analysis on WIMG, they want to quickly scan a database of diverse studies. The default settings, which exclude perturb-seq studies, optimize that.

For co-expression analysis, we've set things up so that you don't mix databases. Thus, you can choose "Only Xaira hct116" from the database box, but you can't combine Xaira's hct116 results with our standard database. We did include a somewhat dubious "Include Perturb-Seq crispr i/a" option, which combines Xaira, Southard, and Repogle results.

Let us know if our current interface prevents you from performing your desired analysis. In the worst case, we can get our hands dirty and do some coding.

whatismygene.com

Xaira, a recently formed billion-dollar biotech, has released monster perturb-seq datasets involving crispr-inhibition in hek293 and hct116 cell lines. This data now joins the Repogle perturb-seq dataset in our database. For more background on the Repogle set, and on the perturb-seq approach in general, see our relevant post.

In our next post, we will explain how to access the Xaira data on the WIMG website.

Unlike Repogle's data, Xaira's data is not currently available in a form more processed than mere count data. Thus we were faced with the task of dicing up 500 GB of scRNA count data. To be honest, we've never had reason to process this kind of data for ourselves...we scrounge the processed results from others.

We initially attempted to follow standard protocols, where adjustments are made for extremely sparse data and large batch effects. Our initial, naive attempts found that control data could be grouped into two very distinct clusters. One cluster was dominated by high-abundance ribosomal and mitochondrial transcripts; the other wasn't. Though batches were clearly labeled in the data, the clusters did not conform to batches (i.e. it cannot be definitively said that batch 100 is overloaded with ribosomal transcripts, and batch 127 isn't), and thus standard single-cell batch-control methods did not eliminate the distinct clustering in controls. Even after adjusting for our own clustering results, we were disappointed. Another issue: various methods did not seem to dramatically improve the frequency with which the knocked-down gene appeared near the top of the list of downregulated genes¹.

Without going into the dirty details, we finally settled on a simple procedure...normalize the counts, perform log1p adjustment, grab a random subset of control data, and perform a Wilcoxon test for significance on specifically targeted test samples vs controls. This procedure performed best in drawing targeted genes to the top of their corresponding downregulation lists. Gene lists were sorted according to log(fold-change) divided by significance.
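For readers who want to see the shape of that procedure, here is a rough pure-Python sketch of the four steps (normalize, log1p, subsample controls, rank-sum test). Several things here are assumptions rather than our production code: the per-cell scaling target of 10,000, the control subset size, the normal approximation (with tie correction, no continuity correction) used for the rank-sum p-value, and all function names. The difference of mean log1p values stands in for log fold-change, and the output is sorted by p-value alone rather than by the combined score described above.

```python
import math
import random


def normalize_log1p(cell_counts, target_sum=10_000):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0] * len(cell_counts)
    return [math.log1p(c * target_sum / total) for c in cell_counts]


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via a tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    if n < 2:
        return 1.0
    pooled = sorted((value, idx) for idx, value in enumerate(x + y))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        group = j - i + 1
        tie_term += group ** 3 - group
        i = j + 1
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:  # every value tied: no evidence of a shift
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def rank_genes(test_cells, control_cells, genes, n_controls=200, seed=0):
    """Return (gene, mean log1p difference, p-value) tuples, smallest p first."""
    rng = random.Random(seed)
    controls = rng.sample(control_cells, min(n_controls, len(control_cells)))
    test_norm = [normalize_log1p(cell) for cell in test_cells]
    ctrl_norm = [normalize_log1p(cell) for cell in controls]
    results = []
    for g, gene in enumerate(genes):
        t = [cell[g] for cell in test_norm]
        c = [cell[g] for cell in ctrl_norm]
        diff = sum(t) / len(t) - sum(c) / len(c)
        results.append((gene, diff, rank_sum_p(t, c)))
    return sorted(results, key=lambda r: r[2])


# Toy counts, invented for illustration: TARGET is knocked down in test cells.
genes = ["TARGET", "GENE_B", "GENE_C"]
control_cells = [[40 + i % 3, 500, 460] for i in range(20)]
test_cells = [[5 + i % 2, 500, 460] for i in range(10)]
for gene, diff, p in rank_genes(test_cells, control_cells, genes):
    print(f"{gene}\t{diff:+.3f}\t{p:.2e}")
```

Note that in a toy matrix this small the untargeted genes shift too, simply because per-cell normalization is compositional; with thousands of genes that effect is far weaker.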

We can cluster the resulting gene lists by first generating a matrix of study/study Fisher p-values. This can be a matrix that matches Xaira lists against Xaira lists. It can also be a matrix that matches Xaira lists against our entire database. Choosing the latter approach, we were again disappointed...both the elbow and silhouette methods identified an optimal cluster number of 2. Ideally, one would like to see tens or hundreds of clusters, each representing special processes in cells. As with the control data alone, one cluster was dominated by high abundance genes.
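As a concrete picture of that study/study matrix, the sketch below computes a one-sided Fisher exact p-value (the hypergeometric upper tail) for every pair of gene lists. It is only an illustration: the `universe_size` background, the function names, and the toy lists are assumptions, and it says nothing about how WIMG actually chooses a background or handles very large matrices.

```python
from math import comb


def fisher_overlap_p(list_a, list_b, universe_size):
    """One-sided Fisher exact p-value for the overlap of two gene lists.

    Equivalent to the hypergeometric probability of seeing at least the
    observed overlap when both lists are drawn from universe_size genes.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    total = comb(universe_size, len(b))
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(overlap, min(len(a), len(b)) + 1)
    )
    return tail / total


def p_value_matrix(gene_lists, universe_size):
    """Return list names and the square matrix of pairwise overlap p-values."""
    names = list(gene_lists)
    matrix = [
        [fisher_overlap_p(gene_lists[r], gene_lists[c], universe_size) for c in names]
        for r in names
    ]
    return names, matrix


# Toy lists over a hypothetical 20-gene universe.
toy_lists = {
    "kd_A": ["G1", "G2", "G3", "G4"],
    "kd_B": ["G1", "G2", "G3", "G9"],
    "kd_C": ["G15", "G16", "G17", "G18"],
}
names, matrix = p_value_matrix(toy_lists, universe_size=20)
for name, row in zip(names, matrix):
    print(name, ["%.2e" % p for p in row])
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free, at the cost of speed. Rows of such a matrix (typically as -log10 p-values) are what a clustering routine would then consume.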

If Xaira, or some other entity, can provide better processed data, we'll happily snatch it up and overwrite our own.

There are signs, however, that the Xaira data, excreted by our crude procedure, contains worthwhile biological information. We note, for example, that Xaira knockdowns do align with the same knockdowns/outs from other studies at a frequency that is certainly not random. As just one example, genes downregulated in both of Xaira's NRF1 knockdowns strongly align with a study in which NRF1 was knocked out in the mouse retina (GSE150258); the Xaira hct116 list was the third best match out of 146,950 lists and the hek293 data was the ninth best match². Also, while the numerous lists in which ribosomal/mitochondrial genes seemed most strongly perturbed are bothersome, there may be an element of biological reality here: grouping all the genes whose knockdown apparently strongly perturbs ribosomal transcripts, we find very strong (p<10^-20) representation by genes involved in ribosomal RNA processing. These moderate- to low-abundance genes are precisely the genes whose knockdown would be expected to decrease ribosomal RNA levels³,⁴. Another positive sign: genes targeted by sgrna were found in the corresponding 100-member downregulation lists around 50% of the time. Given that roughly 20,000 genes were identified at non-zero levels, one would expect to see the targeted gene appear in the 100-member downregulation list about 0.5% of the time if the lists were composed of random garbage.
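The chance figure is just list size divided by the number of detected genes. As a quick arithmetic check, using only the three numbers quoted above (the variable names are ours):

```python
list_size = 100          # genes per downregulation list (from the post)
detected_genes = 20_000  # genes seen at non-zero levels (from the post)
observed_rate = 0.50     # roughly half of lists contain their target (from the post)

# If lists were random draws, each gene would land in a list with this probability.
chance_rate = list_size / detected_genes
enrichment = observed_rate / chance_rate

print(f"chance: {chance_rate:.1%}, observed/chance: {enrichment:.0f}x")
```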

Assuming the sequencing of a suitable number of cells (say, 1000), any scRNA-seq paper is expected to show results of at least one clustering procedure. The optimal number of clusters, arrived at by any number of methods, can be disappointing, as above. I'm not in a position to critique the underlying math of clustering methods, but I can say that these procedures often seem to ignore rare gene patterns in favor of forcing all gene patterns into a fixed number of sets⁵. Examining Xaira data against 74 "WIMG exemplar" lists which constitute largely non-overlapping gene patterns (as measured by Fisher's exact test: see our preprint), we find Xaira gene lists that strongly match 30 of these patterns. For example, Xaira's TMEM131 kd in hct116 cells matches quite nicely (p<10^-38) with genes found in hek293 ER fraction vs cytosol (GSE215768)⁶. Genes upregulated on Xaira INTS8 kd in hct116 cells match up very nicely with genes upregulated in hcclm3 cells on BRD4 inhibition (GSE181406). Patterns generated by knockdown of genes such as ZC3H13, DDX27, SRSF1, ZWILCH, NAA25, CMTR1, ELOB, TRMT2A, and many more, match up with high significance against our (again, non-overlapping) exemplar lists.

One of the more interesting and impressive results involved genes upregulated in Xaira's REST knockdown in hek293 cells, which overlapped with great significance with a study in which PRRX1 was overexpressed (p=10^-36: GSE180515)⁷; the next closest Xaira match to this result involves knockdown of CDYL and a p-value of a mere 10^-7.4⁸. Another notable result: genes upregulated in both hek293 and hct116 lines on GRPEL1 kd overlapped strongly with a study in which IGF2BP1 was knocked out (GSE115646). And, to jump the gun a bit (our next post): genes downregulated in Xaira's PPARGC1B kd in hek293 cells overlap strongly with genes upregulated in Southard's perturb-seq PPARGC1A crispr activation: both results overlap a study in which ANLN was knocked out in mda-mb-231 cells (GSE131120).

Most of the above observations were made by an "eyeball" approach. Taking a more systematic, computerized approach would probably yield reams of potentially interesting results.

1) Perhaps the biggest oddity in the data was this: the presence of a normally ho-hum transcript, PLXDC1, in a very large number of up- and down-regulation lists in both hct116 and hek293 results. WTH?

2) Another example: The single best Xaira match to genes upregulated on eif4a1 ko in mouse b-cells (GSE237426) is the Xaira eif4a1 kd in hct116. Another: our database's (155,000 lists) 3rd best match to genes upregulated in mouse cerebellum on eif2b5 mutation (GSE128092) is Xaira's eif2b3 kd in hek293...Xaira's eif2b5 kd in hek293 ranks 25th. Another...the single best Xaira match to a zeb1 ko in mouse osteoclasts (GSE212302) is the Xaira zeb1 kd in hek293. Another...the second best Xaira match to a mouse sin3a ko in cd4+ t-cells (GSE196615) is Xaira's sin3a kd in hct116. Another...the second best Xaira match to a mouse cdyl ko in embryonic gonads (GSE226049) is Xaira's cdyl kd in hek293. (If you find it odd that all the above studies involve mice, it's simply because we've been focusing on increasing the proportion of mouse studies in the database.) Another: the second best Xaira match to genes upregulated in rael cells on uhrf1 ko (GSE136596) is Xaira's uhrf1 kd in hek293.

3) I'd guess that these results are, in turn, strongly dependent on exactly how long the knockdown was conducted prior to freezing the cells. Had the average knockdown period been increased by a few hours, allowing recovery of ribosomal genes, or a shift into backup programs, the gene lists could be quite different. In the end, despite the massive funding ($2.00 per cell?) and output behind these studies, they only examine particular cells under particular conditions and timeframes. I'm a bit skeptical of the ability of these monster studies to reveal extraordinary insights into cellular biology on their own, whether via standard statistics or AI approaches (yes, this is a WIMG plug).

4) The best example of a Xaira knockdown that generates a list of genes overloaded with ribosomal and mitochondrial entities involves knockdown of cmtr1. In both hek293 and hct116 lines, cmtr1 kd very significantly downregulates these abundant genes. Remarkably, examining an independent study in which cmtr1 was overexpressed in mefs (GSE200103), the single best Xaira match to this study is...cmtr1 kd in hct116 cells. Cmtr1 kd in hek293 was the third best Xaira match. For reference, there are now 37,310 Xaira lists in the WIMG database.

5) To be a tad more precise...whatever value is being minimized/maximized in these procedures, it seems like it's best done not by placing one or two outlying lists into a separate cluster, but by generating clusters derived from larger numbers of lists. Thus merely increasing the cluster number doesn't automatically highlight rare but interesting gene patterns. Having said that, ChatGPT offers me a list of 8 options to overcome this issue...tinkering with the "resolution parameter" sounds promising.

6) Sure enough, a little googling shows that TMEM131 is involved in ER transport.

7) We've pointed to REST as an interesting gene in previous posts. Here, for example. We've also noted a relevance to Alzheimer's. Yup...of 37,000 Xaira gene lists, the one that best overlaps our list of genes downregulated in Alzheimer's is a list of genes upregulated on REST knockdown in hek293 cells.

8) In WIMG parlance, this is something of a "microcluster"...a result which overlaps with high significance with only one or a few other studies, followed by a dramatic drop-off in significance. We've identified about 950 microclusters scattered throughout the database, which currently contains about 19 billion study/study overlaps. In this particular case, I don't actually make the "microcluster" annotation in the database, since there are non-Xaira studies that overlap with the PRRX1 study quite significantly. But within the context of Xaira-only studies plus the PRRX1 study, the REST knockdown really stands out.
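To make the "dramatic drop-off" idea in note 8 concrete, here is one way such a check could be sketched. It is a guess at the logic rather than the rule we actually apply: the thresholds (`max_members`, `min_top`, `min_gap`) and the function name are hypothetical, and only the 36 and 7.4 values in the first toy list come from the post.

```python
def looks_like_microcluster(neg_log10_p, max_members=3, min_top=20.0, min_gap=10.0):
    """Flag a study whose few best matches are followed by a sharp drop.

    neg_log10_p holds -log10(p) for the study's overlaps with other studies.
    All three thresholds are illustrative assumptions.
    """
    scores = sorted(neg_log10_p, reverse=True)
    # Try treating the top 1, 2, ... max_members matches as the cluster.
    for n in range(1, min(max_members, len(scores) - 1) + 1):
        last_member, next_best = scores[n - 1], scores[n]
        if last_member >= min_top and last_member - next_best >= min_gap:
            return True
    return False


# First two values are from the post (REST kd, then CDYL kd); the rest are invented.
print(looks_like_microcluster([36.0, 7.4, 5.1, 3.0]))
# A broad family of similar matches with no cliff, invented for contrast.
print(looks_like_microcluster([36.0, 34.0, 33.0, 31.0, 30.0]))
```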

whatismygene.com


Sunday, August 24, 2025

Still more perturb-seq

Wednesday, August 13, 2025

More Perturb-Seq

