WhatIsMyGene: Still more perturb-seq

Previously, we alluded to yet another perturb-seq dataset. Here it is: "Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states". This time, the authors used CRISPR gene activation to examine the effects of overexpressing a near-comprehensive list of transcription factors in the rpe1 and hs27 cell lines.

Before discussing the above Southard et al. dataset, we should point out yet another "largest" perturb-seq dataset that we won't be adding to the database: the Tahoe-100M matrix. As with the Xaira dataset, there's some hype regarding the data:

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

Unlike the Xaira data, there's not a lot of sequencing depth here. As the Xaira paper itself points out, Xaira identified 8.45 times more unique molecular identifiers (UMIs...roughly speaking, we're talking about transcripts) per cell than the Tahoe-100M folks did. To exaggerate, the bioinformatician is left trying to use a list of ribosomal and mitochondrial counts to infer the effects of 1,100 chemical perturbations on 50 different cell lines. As much as WIMG neurotically loves hoarding data, we'll pass on this one.

Getting back to the Southard paper, we see a respectable 5,000 UMIs per cell. The data is available in a fairly processed, compact form, enabling us to churn out gene lists without a lot of optimization. Given the CRISPR activation, we'd like to see the targeted gene consistently appear in the list of upregulated genes. Though this happens at a frequency far above chance, the majority of our 100-member upregulation lists (90% or so) lack the perturbed TF. We attribute this to the fact that TFs are typically expressed at low levels, so their transcripts often fall below the limit of detection in Southard's setup.
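To make "far above chance" concrete, here is a minimal sketch of how one might score target recovery. It is not our actual pipeline: the function names, the toy gene lists, and the background size are all made up for illustration. The idea is that a random list of `list_size` genes drawn from `n_background` detectable genes contains one specific gene with probability `list_size / n_background`, and a binomial tail then says how surprising the observed number of hits is. With 100-member lists and a hypothetical background of 20,000 genes, chance would be 0.5% per list, versus the roughly 10% we see.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) when X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def target_recovery(up_lists, list_size, n_background):
    """Count lists that contain their own perturbed TF and compare to chance.

    up_lists:     dict mapping perturbed TF -> list of upregulated genes
    list_size:    number of genes per list (100 in the post)
    n_background: number of detectable genes (an assumption, not from the post)
    """
    n_lists = len(up_lists)
    hits = sum(tf in set(genes) for tf, genes in up_lists.items())
    p_chance = list_size / n_background
    return {
        "hits": hits,
        "observed_rate": hits / n_lists,
        "chance_rate": p_chance,
        "p_value": binomial_tail(hits, n_lists, p_chance),
    }


# Toy data only: three hypothetical activations, 4-gene lists, 20-gene background.
toy_lists = {
    "MYC": ["MYC", "G1", "G2", "G3"],      # target recovered
    "GATA1": ["G4", "G5", "G6", "G7"],     # target missing
    "HIF1A": ["G8", "G9", "G10", "G11"],   # target missing
}
result = target_recovery(toy_lists, list_size=4, n_background=20)
print(result)
```

The binomial model assumes every list has the same size and that lists are independent, which is a simplification; it is meant only to show why a 10% hit rate can still be a strong signal.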

As with the Xaira lists, we can observe the extent to which various Southard lists match up against "WIMG exemplar" lists. If all Southard lists failed to overlap with these lists, or all Southard lists overlapped equally (i.e. they don't cluster) with these lists, we'd question the quality of the data, or our preparation of the data. That's not the case here. As an example, Southard's LHX4, GATA1, MYC, and HIF1A activations all overlap WIMG exemplar data with very significant p-values, without overlapping with each other to any great extent. Below, note how well the HIF1A activation matches up with hypoxia studies:

The "hif1a chip-seq" result (line 14) is quite nice. It's easy to conclude that HIF1A is primarily an activating, rather than repressing, transcription factor...the genes that are upregulated when HIF1A is overexpressed are also found in a list of HIF1A DNA targets.

Here's another nice example of Southard data quality: the single best Southard match to a Myc knockout in mouse T-ALL cells (GSE222937) is MYC activation in rpe1 cells.

We thus stamp a "not junk" label on Southard's data and include it in our database.
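For readers who want to try this kind of list-versus-list check themselves, here is a minimal sketch of a one-sided Fisher exact test on gene-list overlap, written as a hypergeometric tail. The function name, the toy lists, and the background size are assumptions for illustration, not a description of how the WIMG site computes its p-values.

```python
from math import comb


def overlap_p_value(list_a, list_b, n_background):
    """One-sided p-value for seeing at least the observed overlap by chance.

    Treats list_b as a random draw of len(list_b) genes from n_background
    genes, of which len(list_a) are "marked". n_background is an assumed
    count of detectable genes and strongly affects the result.
    """
    a, b = set(list_a), set(list_b)
    k = len(a & b)                      # observed overlap
    n_a, n_b = len(a), len(b)
    total = comb(n_background, n_b)     # all possible draws of size n_b
    tail = sum(
        comb(n_a, i) * comb(n_background - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return k, tail / total


# Toy example: two 3-gene lists sharing 2 genes, 10-gene background.
k, p = overlap_p_value(["A", "B", "C"], ["B", "C", "D"], n_background=10)
print(k, p)
```

In a quality check like the one above, you'd run this for every query list against every exemplar list and look for the pattern described: strong, specific matches rather than uniformly weak or uniformly strong overlaps.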

****************

Previously, we pointed out some issues that arise when we add these massive datasets to our database. In particular, naively combining these sets with the rest of the WIMG database skews co-expression results to an extreme. Thus we must take steps to minimize these effects. To include these perturb-seq sets in your analysis, you'll need to select the "database" option on our website.

For Fisher analysis, we've lumped the three major perturb-seq studies (Repogle, Xaira, and Southard) in our database together. It's possible, however, that you only wish to conduct analysis with one of these studies. Using the "keyword search" box, you could choose to examine only Repogle's work by typing "Repogle". You could also choose to exclude "Repogle". Likewise with the terms "Xaira" and "Southard". Let's say you only want to examine Xaira's hct116 results, not the hek293 results: type "Xaira hct116". Likewise for "Xaira hek293", "Repogle k562", "Repogle rpe1", "Southard hs27", and "Southard rpe1". Be sure to spell correctly. In general, when folks choose to perform Fisher analysis on WIMG, they want to quickly scan a database of diverse studies. The default settings, which exclude perturb-seq studies, optimize that.
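If it helps to picture what the keyword box does, here is a rough sketch. It assumes that every typed term must appear in a study's description (case-insensitive) and that any "exclude" term removes the study; the function name and the study labels are hypothetical, and the site's real matching logic may differ.

```python
def keyword_filter(studies, include="", exclude=""):
    """Keep studies whose description contains all include terms
    and none of the exclude terms (case-insensitive substring match)."""
    include_terms = include.lower().split()
    exclude_terms = exclude.lower().split()
    kept = []
    for description in studies:
        text = description.lower()
        has_all = all(term in text for term in include_terms)
        has_excluded = any(term in text for term in exclude_terms)
        if has_all and not has_excluded:
            kept.append(description)
    return kept


# Hypothetical study labels, using the spellings from the post.
studies = [
    "Repogle k562",
    "Repogle rpe1",
    "Xaira hct116",
    "Xaira hek293",
    "Southard hs27",
    "Southard rpe1",
]

only_xaira_hct116 = keyword_filter(studies, include="Xaira hct116")
without_repogle = keyword_filter(studies, exclude="Repogle")
print(only_xaira_hct116)
print(without_repogle)
```

Note that under this kind of matching a misspelled term simply returns nothing, which is why spelling matters.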

For co-expression analysis, we've set things up so that you don't mix databases. Thus, you can choose "Only Xaira hct116" from the database box, but you can't combine Xaira's hct116 results with our standard database. We did include a somewhat dubious "Include Perturb-Seq crispr i/a" option, which combines Xaira, Southard, and Repogle results.

Let us know if our current interface prevents you from performing your desired analysis. In the worst case, we can get our hands dirty and do some coding.

whatismygene.com

Sunday, August 24, 2025
