Previously, we alluded to yet another perturb-seq dataset. Here it is: Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states. This time, the authors used crispr gene activation to examine the effects of over-expression of a near-comprehensive list of transcription factors in rpe1 and hs27 cell lines.
Before some discussion of the above Southard et al dataset, we should point out yet another "largest" perturb-seq dataset that we won't be adding to the database: the Tahoe100M matrix. As with the Xaira dataset, there's some hype regarding the data:
Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.
Unlike the Xaira data, there's not a lot of sequencing depth here. As the Xaira paper itself points out, Xaira identified 8.45 times more unique molecular identifiers (UMIs...roughly speaking, we're talking about transcripts) per cell than the Tahoe100 folks did. To exaggerate, the bioinformatician is left trying to utilize a list of ribosomal and mitochondrial counts to infer the effects of 1,100 chemical perturbations on 50 different cell lines. As much as WIMG neurotically loves hoarding data, we'll pass on this one.
Getting back to the Southard paper, we see a respectable 5,000 UMIs per cell. The data is available in a fairly processed, compact form, enabling us to churn out gene lists without a lot of optimization. Given the crispr activation, we'd like to see the targeted gene consistently appear in the list of upregulated genes. Though this can be seen at a frequency far above chance, the majority of our 100-member upregulation lists (90% or so) lack the perturbed TF. We attribute this to the fact that TFs are typically non-abundant entities, falling outside the limits of detection in Southard's setup.
As with the Xaira lists, we can observe the extent to which various Southard lists match up against "WIMG exemplar" lists. If all Southard lists failed to overlap with these lists, or all Southard lists overlapped equally (i.e. they don't cluster) with these lists, we'd question the quality of the data, or our preparation of the data. That's not the case here. As an example, Southard's LHX4, GATA1, MYC, and HIF1A activations all overlap WIMG exemplar data with very significant p-values, without overlapping with each other to any great extent. Below, note how well the HIF1A activation matches up with hypoxia studies:
The "hif1a chip-seq" result (line 14) is quite nice. It's easy to conclude that hif1a is primarily an activating, rather than repressing, transcription factor...the genes that are upregulated when hif1a is overexpressed are also found in a list of hif1a DNA targets.
****************
Previously, we pointed out some issues that arise when we add these massive datasets to our database. In particular, naively combining these sets with the rest of the WIMG database skews co-expression results to an extreme. Thus we must take steps to minimize these effects. To include these perturb-seq sets in your analysis, you'll need to select the "database" option on our website.
For Fisher analysis, we've lumped the three major perturb-seq studies (Repogle, Xaira, and Southard) in our database together. It's possible, however, that you only wish to conduct analysis with one of these studies. Using the "keyword search" box, you could choose to examine only Repogle's work by typing "Repogle". You could also choose to exclude "Repogle". Likewise with the terms "Xaira" and "Southard". Let's say you only want to examine Xaira's hct116 results, not the hek293 results: type "Xaira hct116". Likewise for "Xaira hek293", "Repogle k562", "Repogle rpe1", "Southard hs27", and "Southard rpe1". Be sure to spell correctly. In general, when folks choose to perform Fisher analysis on WIMG, they want to quickly scan a database of diverse studies. The default settings, which exclude perturb-seq studies, optimize that.
No comments:
Post a Comment