Sunday, August 24, 2025

Still more perturb-seq

Previously, we alluded to yet another perturb-seq dataset. Here it is: Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states. This time, the authors used crispr gene activation to examine the effects of over-expression of a near-comprehensive list of transcription factors in rpe1 and hs27 cell lines.

Before some discussion of the above Southard et al dataset, we should point out yet another "largest" perturb-seq dataset that we won't be adding to the database: the Tahoe100M matrix. As with the Xaira dataset, there's some hype regarding the data:

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

Unlike the Xaira data, there's not a lot of sequencing depth here. As the Xaira paper itself points out, Xaira identified 8.45 times more unique molecular identifiers (UMIs...roughly speaking, we're talking about transcripts) per cell than the Tahoe100M folks did. To exaggerate, the bioinformatician is left trying to utilize a list of ribosomal and mitochondrial counts to infer the effects of 1,100 chemical perturbations on 50 different cell lines. As much as WIMG neurotically loves hoarding data, we'll pass on this one.

Getting back to the Southard paper, we see a respectable 5,000 UMIs per cell. The data is available in a fairly processed, compact form, enabling us to churn out gene lists without a lot of optimization. Given the crispr activation, we'd like to see the targeted gene consistently appear in the list of upregulated genes. Though this can be seen at a frequency far above chance, the majority of our 100-member upregulation lists (90% or so) lack the perturbed TF. We attribute this to the fact that TFs are typically non-abundant entities, falling outside the limits of detection in Southard's setup. 

As with the Xaira lists, we can observe the extent to which various Southard lists match up against "WIMG exemplar" lists. If all Southard lists failed to overlap with these lists, or all Southard lists overlapped equally (i.e. they don't cluster) with these lists, we'd question the quality of the data, or our preparation of the data. That's not the case here. As an example, Southard's LHX4, GATA1, MYC, and HIF1A activations all overlap WIMG exemplar data with very significant p-values, without overlapping with each other to any great extent. Below, note how well the HIF1A activation matches up with hypoxia studies:


The "hif1a chip-seq" result (line 14) is quite nice. It's easy to conclude that hif1a is primarily an activating, rather than repressing, transcription factor...the genes that are upregulated when hif1a is overexpressed are also found in a list of hif1a DNA targets.

We thus stamp a "not junk" label on Southard's data and include it in our database.

****************

Previously, we pointed out some issues that arise when we add these massive datasets to our database. In particular, naively combining these sets with the rest of the WIMG database skews co-expression results to an extreme. Thus we must take steps to minimize these effects. To include these perturb-seq sets in your analysis, you'll need to select the "database" option on our website. 


For Fisher analysis, we've lumped the three major perturb-seq studies (Repogle, Xaira, and Southard) in our database together. It's possible, however, that you only wish to conduct analysis with one of these studies. Using the "keyword search" box, you could choose to examine only Repogle's work by typing "Repogle". You could also choose to exclude "Repogle". Likewise with the terms "Xaira" and "Southard". Let's say you only want to examine Xaira's hct116 results, not the hek293 results: type "Xaira hct116". Likewise for "Xaira hek293", "Repogle k562", "Repogle rpe1", "Southard hs27", and "Southard rpe1". Be sure to spell correctly. In general, when folks choose to perform Fisher analysis on WIMG, they want to quickly scan a database of diverse studies. The default settings, which exclude perturb-seq studies, optimize that.

For co-expression analysis, we've set things up so that you don't mix databases. Thus, you can choose "Only Xaira hct116" from the database box, but you can't combine Xaira's hct116 results with our standard database. We did include a somewhat dubious "Include Perturb-Seq crispr i/a" option, which combines Xaira, Southard, and Repogle results.

Let us know if our current interface prevents you from performing your desired analysis. In the worst case, we can get our hands dirty and do some coding.


whatismygene.com 

Wednesday, August 13, 2025

More Perturb-Seq

Xaira, a recently formed billion dollar biotech, has released monster perturb-seq datasets involving crispr-inhibition in hek293 and hct116 cell lines. Thus this data joins the Repogle perturb-seq dataset in our database. For more background on the Repogle set, and on the perturb-seq approach in general, see our relevant post.

In our next post, we will explain how to access the Xaira data on the WIMG website.

Unlike Repogle's data, Xaira's data is not currently available in a form more processed than mere count data. Thus we were faced with the task of dicing up 500 Gb of scRNA count data. To be honest, we've never had reason to process this kind of data for ourselves...we scrounge the processed results from others. We initially attempted to follow standard protocols, where adjustments are made for extremely sparse data and large batch effects. Our initial, naive attempts found that control data could be grouped into two very distinct clusters. One cluster was dominated by high abundance ribosomal and mitochondrial transcripts; the other wasn't. Though batches were clearly labeled in the data, the clusters did not conform to batches (i.e. it cannot be definitively said that batch 100 is overloaded with ribosomal transcripts, and batch 127 isn't), and thus standard single-cell batch-control methods did not alleviate the presence of distinct clustering in controls. After adjusting for our own clustering results, we were disappointed. Another issue: various methods did not seem to dramatically improve the frequency with which the knocked-down gene appeared near the top of the list of downregulated genes (1). Without going into the dirty details, we finally settled on a simple procedure...normalize the counts, perform log1p adjustment, grab a random subset of control data, and perform Wilcoxon's test for significance on specifically targeted test samples vs controls. Such an algorithm performed best in drawing targeted genes to the top of their corresponding downregulation lists. Gene lists were sorted according to log(fold-change) divided by significance.
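A rough sketch of a procedure along these lines follows. It is an illustration only, not the code used to build the lists. The function names, the 10,000 scale factor, the default of 500 sampled controls, the normal approximation to the rank-sum test, and the reading of "log(fold-change) divided by significance" as a difference of log1p means divided by the p-value are all assumptions.

```python
import math
import random
from statistics import NormalDist, mean


def normalize_log1p(counts, scale=10_000):
    """Scale one cell's counts to a common total, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c * scale / total) for c in counts]


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation with tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:  # every value identical: no evidence of a shift
        return 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var)
    return 2.0 * NormalDist().cdf(-abs(z))


def downregulation_list(test_cells, control_cells, genes,
                        n_controls=500, top=100, seed=0):
    """Rank genes that are lower in test cells than in a random control subset."""
    rng = random.Random(seed)
    controls = rng.sample(control_cells, min(n_controls, len(control_cells)))
    test_norm = [normalize_log1p(c) for c in test_cells]
    ctrl_norm = [normalize_log1p(c) for c in controls]
    scored = []
    for g, gene in enumerate(genes):
        t = [cell[g] for cell in test_norm]
        c = [cell[g] for cell in ctrl_norm]
        diff = mean(t) - mean(c)  # stand-in for log fold-change (assumption)
        if diff >= 0:
            continue  # only downregulated genes belong in this list
        p = max(rank_sum_p(t, c), 1e-300)  # floor avoids division by zero
        scored.append((diff / p, gene))
    scored.sort()  # most negative score (strong and significant) first
    return [gene for _, gene in scored[:top]]
```

Each cell here is just a list of raw counts in the same order as `genes`. Real data at this scale would need sparse matrices and a vectorized test; the point is only to make the order of the steps concrete.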

We can cluster the resulting gene lists by first generating a matrix of study/study Fisher p-values. This can be a matrix that matches Xaira lists against Xaira lists. It can also be a matrix that matches Xaira lists against our entire database. Choosing the latter approach, we were again disappointed...both the elbow and silhouette methods identified an optimal cluster number of 2. Ideally, one would like to see tens or hundreds of clusters, each representing special processes in cells. As with the control data alone, one cluster was dominated by high abundance genes.
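The first step, building the study/study matrix of Fisher p-values, might look something like the sketch below. It assumes a one-sided test for over-representation and a single shared background size for all lists; both are simplifications, and the function names are made up for illustration. The clustering step itself (elbow, silhouette) is not shown.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_overlap_p(list_a, list_b, background_size):
    """One-sided Fisher (hypergeometric) p-value for an overlap
    at least as large as the one observed between two gene lists."""
    a, b = set(list_a), set(list_b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    log_denom = log_comb(background_size, n_b)
    p = 0.0
    for i in range(k, min(n_a, n_b) + 1):
        if n_b - i > background_size - n_a:
            continue  # impossible table: not enough non-A genes to fill list B
        p += math.exp(
            log_comb(n_a, i)
            + log_comb(background_size - n_a, n_b - i)
            - log_denom
        )
    return min(p, 1.0)


def pvalue_matrix(lists, background_size):
    """Map every ordered pair of distinct list names to its overlap p-value."""
    names = list(lists)
    return {
        (x, y): fisher_overlap_p(lists[x], lists[y], background_size)
        for x in names
        for y in names
        if x != y
    }
```

Feeding `pvalue_matrix` a dictionary of Xaira lists alone gives the Xaira-vs-Xaira matrix; adding other lists to the same dictionary gives the Xaira-vs-everything version.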

If Xaira, or some other entity, can provide better processed data, we'll happily snatch it up and overwrite our own.

There are signs, however, that the Xaira data, excreted by our crude procedure, contains worthwhile biological information. We note, for example, that Xaira knockdowns do align with the same knockdowns/outs from other studies at a frequency that is certainly not random. As just one example, genes downregulated in both of Xaira's NRF1 knockdowns strongly align with a study in which NRF1 was knocked out in the mouse retina (GSE150258); the Xaira hct116 list was the third best match out of 146,950 lists and the hek293 data was the ninth best match (2). Also, while the numerous lists in which ribosomal/mitochondrial genes seemed most strongly perturbed are bothersome, there may be an element of biological reality here: grouping all the genes whose knockdown apparently strongly perturbs ribosomal transcripts, we find very strong (p<10^-20) representation by genes involved in ribosomal RNA processing. These moderate- to low-abundance genes are precisely the genes whose knockdown would be expected to decrease ribosomal RNA levels (3, 4). Another positive sign: genes targeted by sgrna were found in the corresponding 100 member downregulation lists around 50% of the time. Given that roughly 20,000 genes were identified at non-zero levels, one would expect to see the targeted gene appear in the 100 member downregulation list about 0.5% of the time if the lists were composed of random garbage.
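The chance expectation quoted above is simple arithmetic. A small sketch, using only the numbers given in the text (the function name is illustrative):

```python
def chance_hit_rate(list_size, genes_detected):
    """Probability that one specific gene lands in a randomly drawn list."""
    return list_size / genes_detected


expected = chance_hit_rate(100, 20_000)  # 100-member list, ~20,000 detected genes
observed = 0.50                          # "around 50% of the time"
enrichment = observed / expected

print(f"expected by chance: {expected:.1%}")
print(f"observed vs. chance: about {enrichment:.0f}-fold")
```

In other words, the targeted gene shows up in its own downregulation list roughly a hundred times more often than random lists would allow.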

Assuming the sequencing of a suitable number of cells (say, 1000), any scRNA-seq paper is expected to show results of at least one clustering procedure. The optimal number of clusters, arrived at by any number of methods, can be disappointing, as above. I'm not in a position to critique the underlying math of clustering methods, but I can say that these procedures often seem to ignore rare gene patterns in favor of forcing all gene patterns into a fixed number of sets (5). Examining Xaira data against 74 "WIMG exemplar" lists which constitute largely non-overlapping gene patterns (as measured by Fisher's exact test: see our preprint), we find Xaira gene lists that strongly match 30 of these patterns. For example, Xaira's TMEM131 kd in hct116 cells matches quite nicely (p<10^-38) with genes found in hek293 ER fraction vs cytosol (GSE215768) (6). Genes upregulated on Xaira INTS8 kd in hct116 cells match up very nicely with genes upregulated in hcclm3 cells on BRD4 inhibition (GSE181406). Patterns generated by knockdown of genes such as ZC3H13, DDX27, SRSF1, ZWILCH, NAA25, CMTR1, ELOB, TRMT2A, and many more, match up with high significance against our (again, non-overlapping) exemplar lists.

One of the more interesting and impressive results involved genes upregulated in Xaira's REST knockdown in hek293 cells, which overlapped with great significance with a study in which PRRX1 was overexpressed (p=10^-36: GSE180515) (7); the next closest Xaira match to this result involves knockdown of CDYL and a p-value of a mere 10^-7.4 (8). Another notable result: genes upregulated in both hek293 and hct116 lines on GRPEL1 kd overlapped strongly with a study in which IGF2BP1 was knocked out (GSE115646). And, to jump the gun a bit (our next post): genes downregulated in Xaira's PPARGC1B kd in hek293 cells overlap strongly with genes upregulated in Southard's perturb-seq PPARGC1A crispr activation: both results overlap a study in which ANLN was knocked-out in mda-mb-231 cells (GSE131120).

Most of the above observations were made by an "eyeball" approach. Taking a more systematic, computerized approach would probably yield reams of potentially interesting results.

1) Perhaps the biggest oddity in the data was this: the presence of a normally ho-hum transcript, PLXDC1, in a very large number of up- and down-regulation lists in both hct116 and hek293 results. WTH?

2) Another example: The single best Xaira match to genes upregulated on eif4a1 ko in mouse b-cells (GSE237426) is the Xaira eif4a1 kd in hct116. Another: our database's (155,000 lists) 3rd best match to genes upregulated in mouse cerebellum on eif2b5 mutation (GSE128092) is Xaira's eif2b3 kd in hek293...Xaira's eif2b5 kd in hek293 ranks 25th. Another...the single best Xaira match to a zeb1 ko in mouse osteoclasts (GSE212302) is the Xaira zeb1 kd in hek293. 

3) I'd guess that these results are, in turn, strongly dependent on exactly how long the knock down was conducted prior to freezing the cells. Had the average knock down period been increased by a few hours, allowing recovery of ribosomal genes, or a shift into backup programs, the gene lists could be quite different. In the end, despite the massive funding ($2.00 per cell?) and output behind these studies, they only examine particular cells under particular conditions and timeframes. I'm a bit skeptical of the ability of these monster studies to reveal extraordinary insights into cellular biology on their own, whether via standard statistics or AI approaches (yes, this is a WIMG plug).

4) The best example of a Xaira knockdown that generates a list of genes overloaded with ribosomal and mitochondrial entities involves knockdown of cmtr1. In both hek293 and hct116 lines, cmtr1 kd very significantly downregulates these abundant genes. Remarkably, examining an independent study in which cmtr1 was overexpressed in mefs (GSE200103), the single best Xaira match to this study is...cmtr1 kd in hct116 cells. Cmtr1 kd in hek293 was the third best Xaira match. For reference, there are now 37,310 Xaira lists in the WIMG database.

5) To be a tad more precise...whatever value is being minimized/maximized in these procedures, it seems like it's best done not by placing one or two outlying lists into a separate cluster, but by generating clusters derived from larger numbers of lists. Thus merely increasing the cluster number doesn't automatically highlight rare but interesting gene patterns. Having said that, ChatGPT offers me a list of 8 options to overcome this issue...tinkering with the "resolution parameter" sounds promising.

6) Sure enough, a little googling shows that TMEM131 is involved in ER transport.

7) We've pointed to REST as an interesting gene in previous posts. Here, for example. We've also noted a relevance to Alzheimer's. Yup...of 37,000 Xaira gene lists, the one that best overlaps our list of genes downregulated in Alzheimer's is a list of genes upregulated on REST knockdown in hek293 cells.

8) In WIMG parlance, this is something of a "microcluster"...a result which overlaps with high significance with only one or a few other studies, followed by a dramatic drop-off in significance. We've identified about 950 microclusters scattered throughout the database, which currently contains about 19 billion study/study overlaps. In this particular case, I don't actually make the "microcluster" annotation in the database, since there are non-Xaira studies that overlap with the PRRX1 study quite significantly. But within the context of Xaira-only studies plus the PRRX1 study, the REST knockdown really stands out.


Friday, April 25, 2025

Stuff that might be true

I'll add to the below list as thoughts pop into my brain....

*"Celebrity" genes are over-rated. Last I looked there are something like 10,000 papers primarily devoted to tp53. Every now and then I stumble across a knockout that has not, to my knowledge, been performed before. One might think that such knockouts would be less likely to generate a long list of significantly perturbed transcripts than, say, a tp53 knockout. I just entered a study involving a CHSY3 knockout into the database. That's the first instance of a CHSY3 perturbation in the database, and the knockout had a major effect on transcript abundances in the underlying study. This sort of thing happens again and again...it's not as if the list of genes whose knockout strongly alters cell activity was exhausted a decade ago.

*Our understanding of biology is strongly biased according to the order of discovery. As an example, it seems that folks have a fairly fixed idea of what micro-RNAs do, if they do anything at all. When perusing micro-RNA overexpression and inhibition studies, the studies in which large numbers of transcripts are significantly altered (versus few or none) usually seem to involve the symbol "mir" followed by a small number, not a large number (e.g. mir1 vs mir1234). This may seem odd until you consider that, in the early days of miRNA research, new mirnas would simply be given the first integer that had not already been taken. In other words, the early mirnas, which were discovered because they actually did something in cells, may have created the illusion that there may be thousands of interesting mirnas, all of which act according to the principles associated with the earliest mirnas.

*Also, regarding miRNAs: it's possible that, at "ground truth" level, the typical miRNA only targets one or a few transcripts (1). This is based on an observation I haven't quantified: that miRNA overexpression and inhibition studies, in contrast to these experiments conducted on ordinary transcripts, often seem to strongly alter the expression of one or a handful of transcripts, followed by a clear drop-off in significance and/or fold-change.

*Similarly, it's possible that in a typical lab-generated list of significantly altered genes, relatively few matter. That is, a large portion of these transcripts or proteins are basically junk, possibly generated to maintain the proper concentration of RNA and/or protein. I base this on admittedly flimsy evidence (4): that if you take a large database of perturbations (e.g. WhatIsMyGene's), generate all study/study overlap P-values, and cluster the data, you might be surprised at how few clusters you generate (using standard methods to determine optimal cluster numbers...e.g. the "elbow" method). To put it overly-dramatically, one study looks like the next. How can sophisticated biological decisions be made in that case? Because a lot of the sameness between studies is not interesting, while a handful of truly interesting genes make a big difference.

*Some genes may reach celebrity status because they are exceptions to a rule: that one type of study (e.g. knockouts) usually overlaps weakly, if at all, with other types (e.g. chip-seq). P53, for example, breaks the rule: the genes altered in P53 knockouts often do overlap with P53 chip-seq results. Of course, P53 also has the property that it is often mutated in cancer, so another possible "truth" is that genes that break the rule are precisely the ones that get targeted by viruses or cancers.

*Modern biology is hugely skewed by the results of experiments in which cells are essentially blasted with extreme conditions that are rarely, if ever, experienced in normal living creatures...complete knockouts, targeted alterations of single genes, micro-RNA levels 100X greater than anything ever experienced in nature, etc. These conditions are very often measured over extremely short time scales, largely due to the limitations of cell culture approaches. It is possible that these approaches have skewed biology in massive fashion. Again, consider Alzheimer's, a disease whose seeds may be planted decades before symptoms become obvious...I've yet to see any sort of experiment, either with mice or cell culture, that parallels the set of genes that are typically upregulated in Alzheimer's (2). Perhaps this is simply because of the near-impossibility of conducting experiments over the course of decades.

*OK, maybe more of a critique than a "truth": let's say you do a chip-seq experiment using transcription factor XYZ. You collect the list of most-strongly bound genes and test your results against a parallel XYZ knockout experiment. Let's assume your background figures are good. You perform Fisher's test and get -log(P) = 4. Should you be impressed with this statistic? Should you perform further experiments based on this very significant number? I say no. If you had tested your chip-seq list against 150,000 other lists, you may have found 5,000 lists that out-performed your knockout study. In fact, after correction for multiple testing, your knockout results would be rendered insignificant. Yes, I'm plugging WhatIsMyGene. 

*Maybe the "replication crisis" is a bit exaggerated. Cells and organisms are simply very sensitive to seemingly minor differences in experimental settings. Perhaps stochastic effects are much more powerful than we generally believe (3). Let's say you knock out gene ABC in mouse kidneys in your lab. Somebody else does the same thing in their lab. The results overlap weakly. Did somebody screw up? Maybe not. Using a tool like WhatIsMyGene, you may find that both studies nevertheless overlap rather nicely with a third study involving gene XYZ. (Also, recall the above point that maybe only a few transcripts really "matter" in a list of significantly altered genes, with the rest being relatively unimportant).

*Alzheimer's has some relationship to stem cells. I say this because time and time again the genes downregulated in Alzheimer's overlap strongly with studies involving stem cells and embryonic cells. The problem is...brains don't have a lot of stem cells, especially if you exclude the SVZ. I don't know how to work around this issue...perhaps some brain cells, neurons in particular, have a stem-like signature but lack the standard markers for stem-cells.

*A bit more speculatively, Alzheimer's may also have some connection to the appendix and appendicitis. In addition to papers suggesting a link (google it), I'd also point out an odd overlap between our own list of transcripts typically up-regulated in the Alzheimer's brain and a list of transcripts up-regulated in the mouse distal colon following appendicitis (GSE23914). The significance is not impressive...but it's difficult to find any studies that overlap strongly with those up-regulated in the human Alzheimer's brain. The study ranks as the 332nd best match to the Alzheimer's list (against 145,000 other studies), competing against studies primarily derived from human cells and brain cells.

*Some in-vivo studies may produce distorted results because of the time of day at which they cut open their subjects. I'm looking at a 12 week study where mice were treated with control vs drug. Out of 145,000 studies examined, the best match (at p=10^-45) would be one that examined mouse livers at zt21 vs zt12. It turns out that the drug in question, minocycline, actually does recalibrate circadian rhythms. But...how often is the possibility of circadian effects totally ignored?

*Methylation of DNA may do more than repress or activate transcription. It may also regulate consistency/variability in expression. I have admittedly shoddy evidence for this notion: a list of genes that do not commonly correlate with batch effects is replete with genes that are often seen in DNA methylation experiments (see Quantifying batch effects for individual genes in single-cell data).
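The multiple-testing point in the chip-seq item above can be checked with a few lines. This sketch assumes the log is base 10 and uses a Bonferroni correction, which is only one of several possible corrections and is not necessarily the one WhatIsMyGene applies.

```python
import math


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


p_raw = 10 ** -4       # -log10(P) = 4 from the chip-seq vs knockout comparison
n_lists = 150_000      # number of lists the chip-seq list could be tested against

adjusted = bonferroni(p_raw, n_lists)

# Raw -log10(P) needed to stay below 0.05 after correction
threshold = -math.log10(0.05 / n_lists)

print(f"adjusted p-value: {adjusted}")
print(f"-log10(P) needed for significance: {threshold:.2f}")
```

Under these assumptions a raw -log10(P) of 4 adjusts to a p-value of 1, and a single comparison would need a -log10(P) of roughly 6.5 to survive.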


1. Just as an example, a study in which mir138 was inhibited (GSE173982) results in very significant downregulation of a single transcript, NDUFA9. Another one...mir-222-3p treatment results in the very significant downregulation of a single transcript (Gm10925) in GSE167753. Another one: mir144-3p inhibition in stress susceptible mice results in significant downregulation of a single gene (kcnj8) in GSE209673. Also, in GSE211749, only one transcript is downregulated with strong significance (zgrf1) on a triple miR-322-503-351 ko in white adipose. Also, in GSE216981, mir150 knockout downregulates tnfrsf26 at a significance of 10^-204, while the next most significant alteration comes in at 10^-20.

2. Downregulated genes in Alzheimer's are a different case...these are seen in many kinds of perturbation and clustering experiments involving brain tissue.

3. Another issue is this: if the only way you can replicate a study is via extreme rigor, how generalizable/interesting are your conclusions about your gene of interest? 

4. Here's some more evidence: if you take two mouse strains and compare transcriptomics from a particular organ, you'll get a long list of differentially regulated genes...it wouldn't be surprising to see more than 50% of transcripts significantly altered. Thus, you have very different transcriptomics, but a very similar product...a mouse. One could surmise, therefore, that most of these transcripts aren't doing anything.


Thursday, September 12, 2024

T-cell Exhaustion

"T-Cell Exhaustion" is associated with an inability of the immune system to fight off cancer and other diseases. We grabbed 7 markers of exhausted t-cells (pd-1, ctla4, tigit, lag3, tim3, cd244 and cd160) and searched our database for studies in which these markers were strongly perturbed. In only one of 91,000 gene lists were all 7 of these markers perturbed: Hematopoietic Progenitor Kinase1 (HPK1) Mediates T Cell Dysfunction and Is a Druggable Target for T Cell-Based Immunotherapies, wherein knockout of map4k1 downregulated all of these markers.

Grabbing all gene lists in which at least three of the markers were perturbed gave us 307 lists. Retaining the markers, we generated a frequency table of genes most commonly found in these lists. The markers lag3, pd-1, and tim3 topped the list. The fourth most frequent gene in our list was not one of the 7 markers: gzmb. After ctla4 and tigit we have ccl5, cst7, ccl4, gzma, and ccl3. Cd244 and cd160 occupied the 21st and 27th positions on the list. Our final list of genes associated with t-cell exhaustion contains 188 genes, with all genes required to be found at least 60 times over the 307 lists.
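The selection and counting steps described above can be sketched with a `Counter`. The function name is illustrative, the marker names are taken straight from the text, and the sketch assumes gene symbols have already been harmonized across lists (in practice pd-1 and tim3 may appear under other symbols).

```python
from collections import Counter

MARKERS = {"pd-1", "ctla4", "tigit", "lag3", "tim3", "cd244", "cd160"}


def exhaustion_signature(gene_lists, markers=MARKERS, min_markers=3, min_count=60):
    """Keep lists containing at least `min_markers` markers, then return
    the genes that appear in at least `min_count` of the kept lists."""
    selected = [set(g) for g in gene_lists if len(markers & set(g)) >= min_markers]
    # Markers are retained, so they are counted like any other gene
    freq = Counter(gene for g in selected for gene in g)
    signature = [gene for gene, n in freq.most_common() if n >= min_count]
    return selected, signature
```

With the defaults, `selected` corresponds to the 307 lists and `signature` to the final 188-gene list, ordered from most to least frequent.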

Presumably, we'd like to downregulate these genes aggressively in cancer, allowing the immune system and immunotherapies to go to work. Sticking with known drug/treatment regimens (as opposed to, say, knockouts which may be difficult to implement for the time being) in lymphocytes, the single best treatment would be the presence (versus absence) of zinc in mouse drinking water: Interleukin-10 induces interferon-γ-dependent emergency myelopoiesis. Next is a dca (16-didehydro-cortistatin A) regimen: The Cyclin-Dependent Kinase 8 (CDK8) Inhibitor DCA Promotes a Tolerogenic Chemical Immunophenotype in CD4+ T Cells via a Novel CDK8-GATA3-FOXP3 Pathway. This is followed by mouse studies involving leukocyte costimulatory blockade antibody treatment, Short-term Immunosuppression Promotes Engraftment of Embryonic and Induced Pluripotent Stem Cells, and NAC treatment, Impaired mitochondrial oxidative phosphorylation limits the self-renewal of T cells exposed to persistent antigen. A mouse study involving ricolinostat, an hdac6 inhibitor, follows, but we note that this drug also upregulated a significant number of genes in our t-cell exhaustion list. Such is biology.

The first human study wherein a treatment downregulates genes in the t-cell exhaustion list is this: TNFR2 Costimulation Differentially Impacts Regulatory and Conventional CD4+ T-Cell Metabolism. The study involves application of a tnfr2 agonist antibody to cd4 t-cells. The next human study involves treatment with a cd45 fragment: The soluble cytoplasmic tail of CD45 (ct‐CD45) in human plasma contributes to keep T cells in a quiescent state.

Ignoring solutions that might be relatively practical in 2024, we see a study in which a foxp3 k18r mutation results in exhaustion gene downregulation (Foxp3 Reprograms T Cell Metabolism to Function in Low-Glucose, High-Lactate Environments), followed by the aforementioned map4k1 ko, batf3 oe, tbx21 ko, tak1 ko, tfam ko, regnase-1 ko, rbx1 ko, and en2 ko.

In terms of disease-related studies, we see these exhaustion genes downregulated in responding vs non-responding leukemia patients in Reversal of in situ T-cell exhaustion during effective human antileukemia responses to donor lymphocyte infusion. This is not surprising, but it's nice to see validation of the standard dogma regarding t-cell exhaustion. Then again, the next disease study on the list might surprise: In Single-cell landscape of the ecosystem in early-relapse hepatocellular carcinoma, t-cells associated with relapse tended to be depleted of exhaustion genes. Upregulated exhaustion genes were not only seen in cancers: see lymphocytic genes in Metallothioneins as dynamic markers for brain disease in lysosomal disorders and  Hypomethylation and Overexpression of Th17-Associated Genes is a Hallmark of Intestinal CD4+ Lymphocytes in Crohn's Disease. HIV progression vs control is associated with upregulation of exhaustion genes in Transcriptional analysis of HIV-specific CD8+ T cells shows that PD-1 inhibits T cell function by upregulating BATF. In DUSP4-mediated accelerated T-cell senescence in idiopathic CD4 lymphopenia, mouse t-regs show an upregulated exhaustion signature in the diseased state.

Unfortunately, there aren't any "DIY" sorts of treatments that downregulate exhaustion genes with high significance (we set P = 10^-15 as a cutoff). Zinc supplementation is interesting, but we wish the study were conducted in humans. We will upload the exhaustion list to our database in the next week or two and post the database ID just below when we do*. Then you can search for all treatments, diseases, knockouts, etc. that up- or down-regulate the exhaustion signature. It is possible that strong alteration of the exhaustion signature could be accomplished with a cocktail of treatments, each without astounding efficacy alone; to test such hypotheses, be sure to check out our "Third Set" tool.

*The dbase ID is 188419856.





Monday, August 12, 2024

Reversing Disease Signatures

Here, we discuss the use of WIMG tools to search for drugs or treatments or gene perturbations that may reverse various disease signatures. Perhaps I'm jumping the gun a bit here...it would first be nice to show that reversing a disease signature can actually reverse a disease. I may provide concrete examples that both confirm and contradict the possibility in the future. Based on the experience of scouring tens of thousands of studies, however, it is fairly obvious that reversing a disease signature can often, if not always, effectively treat a disease. When examining cancer signatures, for example, MEK inhibitors, commonly used in cancer treatment, often do a fine job of downregulating transcripts that are upregulated in cancer, and upregulating transcripts that are downregulated in cancer. We will ignore complicating factors such as resistance. We also assume that readers are educated/experienced enough to understand that most treatments involve tradeoffs...self-experimentation is not recommended.

We've accumulated a number of gene lists involving "canonical" disease signatures. They are listed at the bottom of this page. Additional details, such as the number of studies examined in accumulating the data, are omitted for simplicity. If your disease of interest is found in the list below, you can perform several actions to search for studies in which the signature is reversed. For example, you could open up the "Fisher" app, enter the DBASE ID for "WIMG up-regulated in bald skin" in the "Enter identifers or database ID" box, and simply hit "submit." If you are only interested in reversing the signature, select "downregulated" in the "Regulation" box. Then again, if you're interested in searching for factors that could encourage balding, you could choose "upregulated." You can also enter both portions of a study (upregulated and downregulated) into the "Match Studies" tool; to reverse the two be sure to select "Inverse Correlations." It is advisable to try both apps, if possible: "Match Studies" will give you individual studies ranked according to their potencies in reversing both the up- and down-regulated portions of a study. It's possible that the most potent treatment for disease reversal would involve separately altering the up- and downregulated portions (i.e. two drugs), in which case you'd want to stick with the Fisher app.

If you're only interested in drug-based treatments, you can choose "drug" in the "Experiment" box. Choose "treatment" for non-small-molecule approaches (antibody-based therapy, etc). "Environment/behavior" might also be worth examining. 

One nice, unique WIMG option is the "natural" option in the "Cell Type" box*. Choose it, and you will only receive "do-it-yourself" types of treatments as output...fitness programs, diets, vitamins, Chinese medicine, and stuff you might find in a "health-food" store. Again, I will assume readers are mature enough to be cautious here.

It is possible that the upregulated (or downregulated) portion of a disease signature would best be reversed by two or more treatments. Here, you might consider using the "Third Set" tool. Enter, say, the upregulated portion of disease transcripts and the downregulated portion of a signature involving a drug that you know to be effective. It's important that "Set1" be the upregulated disease signature. The tool will spit out a list of studies that intersect with the disease signature, but not the known drug signature. Again, you will probably wish to select "downregulated" in the "Regulation" box, and something like "drug" in the "Experiment" box. If you're insane and wish to find three non-overlapping drug treatments, you'll need to know all the transcripts that are considered to be downregulated in the two drug studies above...WIMG doesn't provide you with this, so you could contact us or dig up that data yourself in the studies of interest. Take the union of those two sets, keeping a single copy of any gene that appears in both. Use this new "dual" drug signature in the "Third Set" tool, along with the upregulated disease signature.
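The set logic described above can be sketched in a few lines. This is one plausible reading of what the "Third Set" tool does, not its actual implementation; the function names, study names and gene names are hypothetical.

```python
def third_set_hits(disease_up, known_drug_down, candidate_studies):
    """Rank studies by how much of the *uncovered* disease signature they hit.

    candidate_studies maps a study name to its list of downregulated genes.
    Assumed logic for illustration, not WIMG's code.
    """
    # Part of the disease signature the known drug does not already reverse
    residual = set(disease_up) - set(known_drug_down)
    hits = []
    for name, down_genes in candidate_studies.items():
        overlap = residual & set(down_genes)
        if overlap:
            hits.append((name, sorted(overlap)))
    hits.sort(key=lambda h: len(h[1]), reverse=True)
    return hits


def dual_signature(drug1_down, drug2_down):
    """Union of two drug signatures; a set keeps one copy of shared genes."""
    return sorted(set(drug1_down) | set(drug2_down))


# Placeholder data
hits = third_set_hits(
    disease_up=["G1", "G2", "G3", "G4"],
    known_drug_down=["G1", "G2", "X"],
    candidate_studies={"drugB": ["G3", "G4", "Y"], "drugC": ["G1"], "drugD": ["G4"]},
)
combined = dual_signature(["G1", "G2"], ["G2", "G3"])
```

In this toy case "drugC" drops out because it only touches genes the known drug already covers, which is the point of the exercise.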

Looking below, you will see that we have a fairly limited selection of canonical disease signatures to choose from. That's because we usually create these lists when there's a substantial selection of studies from which we can draw repeatedly perturbed genes. If you wish to reverse a disease signature that doesn't have a "WIMG list", you can create one yourself using whatever studies you can find. In the case of a rare disease, there may be only one study that is relevant. It's possible that no studies exist for a disease of interest, in which case you would have to find a signature for a similar disease. You could ask us to try to dig up the studies...don't worry, we're neurotic about hoarding and analyzing data.
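If you do build your own list, the "repeatedly perturbed genes" idea can be sketched like this. The threshold, function name and gene names are assumptions for illustration; the criteria we actually use for the canonical lists aren't spelled out in this post.

```python
from collections import Counter


def canonical_signature(study_gene_lists, min_studies=2, top_n=None):
    """Keep genes that recur across studies, most frequent first.

    study_gene_lists: one list of (say) upregulated genes per study.
    min_studies and top_n are illustrative knobs, not WIMG settings.
    """
    counts = Counter()
    for genes in study_gene_lists:
        counts.update(set(genes))  # count each gene at most once per study
    recurrent = [(g, c) for g, c in counts.items() if c >= min_studies]
    # Sort by descending study count, then alphabetically for stable output
    recurrent.sort(key=lambda gc: (-gc[1], gc[0]))
    return recurrent[:top_n]


# Placeholder studies
studies = [
    ["A", "B", "C"],
    ["B", "C", "D"],
    ["C", "E", "B", "B"],
]
signature = canonical_signature(studies, min_studies=2)
```

With only one relevant study, as for a rare disease, the threshold has to drop to 1 and the "signature" is simply that study's gene list.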


DBASE ID STUDY

118765101 WIMG canonical up in cancer vs. adjacent 

118766101 WIMG canonical down in cancer vs. adjacent 

118767101 WIMG canonical up-regulated in metastasis vs. primary 

118768101 WIMG canonical down-regulated in metastasis vs. primary 

118771101 WIMG new canonical cytokine storm up 

118772101 WIMG new canonical cytokine storm down 

123049121 WIMG canonical up in human Alzheimer's brain 

123050121 WIMG canonical down in human Alzheimer's brain 

123069121 WIMG canonically upregulated in blood of Alzheimer's patients 

123070121 WIMG canonically downregulated in blood of Alzheimer's patients 

124415121 WIMG canonically up in Parkinson's brain 

124416121 WIMG canonically down in Parkinson's brain 

124416122 WIMG canonically down in alcoholic brain 

124416131 WIMG canonically up in alcoholic brain 

124417121 WIMG canonically up in schizophrenia brain 

124418121 WIMG canonically down in schizophrenia brain 

124419121 WIMG canonically up in depression/bipolar brain 

124419122 WIMG canonically down in depression/bipolar brain 

124420121 WIMG canonically up in autism brain 

124421121 WIMG canonically down in autism brain 

125583121 WIMG canonical up-regulated in aging brain 

125584121 WIMG canonical down-regulated in aging brain 

137716203 WIMG canonically up-regulated in lung squamous cell carcinoma vs lung adenocarcinoma 

137717203 WIMG canonically down-regulated in lung squamous cell carcinoma vs lung adenocarcinoma 

141048203 WIMG canonically up-regulated in lung cancer 

141049203 WIMG canonically down-regulated in lung cancer 

141259203 WIMG up-regulated in liver cancer vs adjacent 

141260203 WIMG down-regulated in liver cancer vs adjacent 

142124203 WIMG canonically up-regulated in colorectal cancer vs adjacent/normal 

142125203 WIMG canonically down-regulated in colorectal cancer vs adjacent/normal 

142928203 WIMG up-regulated in cervical cancer 

142929203 WIMG down-regulated in cervical cancer 

143176203 WIMG transcripts rarely perturbed in human cancer 

143177203 WIMG transcripts most commonly perturbed in human cancer 

143178203 WIMG transcripts most rarely up-regulated in human cancer 

143179203 WIMG transcripts most commonly up-regulated in human cancer 

143180203 WIMG transcripts most rarely down-regulated in human cancer 

143181203 WIMG transcripts most commonly down-regulated in human cancer 

146502203 WIMG genes that are rarely down-regulated in cancer vs adjacent studies 

146503204 WIMG genes that are rarely up-regulated in cancer vs adjacent studies 

146503205 WIMG genes that are never down-regulated in our cancer vs adjacent studies 

146504206 WIMG genes that are never up-regulated in our cancer vs adjacent studies 

160517531 WIMG up-regulated in aging 

160518531 WIMG down-regulated in aging 

160519531 WIMG up-regulated in HUMAN aging 

160520531 WIMG down-regulated in HUMAN aging 

164739532 WIMG up-regulated in bald skin 

164740532 WIMG down-regulated in bald skin 

165959532 WIMG up-regulated on cancer recurrence 

165960532 WIMG down-regulated on cancer recurrence 

165961532 WIMG up-regulated in high vs low-grade cancer 

165962532 WIMG down-regulated in high vs low-grade cancer 

176813532 WIMG up-regulated in blood of systemic sclerosis patients 

176814532 WIMG down-regulated in blood of systemic sclerosis patients 

180119532 WIMG up-regulated in inflammatory disease 

180120532 WIMG down-regulated in inflammatory disease 


*Why is the "natural" option found in the "Cell Type" box? It's unintuitive, but it was easy to program. We can fix that in the future.

whatismygene.com 

Thursday, August 8, 2024

Genes beginning with "LOC"

Our database contains more than 27,000 genes that begin with the "LOC" designation (meaning "locus"). In total, they make 390,000 appearances in the database. Most of these genes are poorly characterized; one indication of that is the fact that all but about 2000 are lacking ENSG identifiers. Nevertheless, a couple of these LOCs appear more than 1,000 times in the database, and 869 appear at least 100 times. Most, but not all of these, are non-coding. 
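Tallies like these are easy to reproduce on your own collection of gene lists. The sketch below assumes that an "appearance" means a gene showing up in one study's list (our database's own bookkeeping may differ), and it uses a toy dataset rather than real counts.

```python
import re
from collections import Counter

LOC_PATTERN = re.compile(r"LOC\d+")  # "LOC" followed only by digits


def loc_appearances(study_gene_lists):
    """Count, per LOC gene, the number of studies it appears in."""
    counts = Counter()
    for genes in study_gene_lists:
        counts.update(g for g in set(genes) if LOC_PATTERN.fullmatch(g))
    return counts


def summarize(counts, threshold):
    """Distinct LOCs, total appearances, and LOCs at or above a threshold."""
    return {
        "distinct_locs": len(counts),
        "total_appearances": sum(counts.values()),
        "at_or_above_threshold": sum(1 for c in counts.values() if c >= threshold),
    }


# Toy data only; the gene IDs are real symbols but the lists are invented
toy_studies = [
    ["LOC102724852", "H19", "MIR675"],
    ["LOC102724852", "LOC112268238", "BRD2"],
    ["LOC112268238", "LOC102724852", "LOC102724852"],
]
counts = loc_appearances(toy_studies)
summary = summarize(counts, threshold=3)
```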

Before proceeding, we should note that "poorly characterized" can also mean "unsure about their existence as separate species." LOC102724852, noted below, is associated with chromosome 11, but apparently hasn't been pinpointed to a location. It is also co-expressed with other chromosome 11 genes, which is a bit odd.

Using the "Cell Type" app on our website, let's plug in some of the most common LOCs in our database and get a feeling for what they do.

LOC102724852: Appears 1013 times in the database. Found significantly more often in female tissue than male, despite being found on chromosome 11. Perhaps amazingly, 15 studies in our database list this gene as the top ranking perturbation. Using the co-expression tool, we also see that it is very commonly found in association with H19 (H19 Imprinted Maternally Expressed Transcript) and, to a lesser extent, mir675, both of which are also found on chromosome 11. Hmmmm.

LOC112268238: Appears 920 times in the database. Again, more common in female studies. Significantly associated with results involving bromodomain targeting (drugs, knockout, etc). Also associated with degron experiments, which seems odd until you realize that degron experiments often target bromodomain proteins. Co-expressed genes are hugely overrepresented by histones. BRD2, a bromodomain gene, is also strongly associated.

LOC112268430: Appears 897 times in the database, but isn't strongly associated with any of our key words.

LOC107986126: 830 times. Slight association with leukemia.

LOC112268313: 810 times. Again, associated with degron experiments.

LOC100044068: 786 times. A mouse gene, oddly associated with knockout experiments (log(p) = -22). Also associated with the brain, particularly the hippocampus.

LOC105374985: 735 times. Associated with prostate studies.

LOC100419583: 717 times. Strongly associated with innate immune response keywords (ifn, cytokine, virus, infection, etc).

LOC112267876: 689 times. Associated with stem cell studies.

LOC107984316: 685 times. No strong associations.

LOC112268267: 668 times. Associated with studies involving ifn-gamma.

LOC101928841: 661 times. Associated with studies involving the HELA cell line.

LOC101929185: 660 times. Strongly associated with HEK293 studies.

LOC112268284: 658 times. Associated with studies involving fungi (i.e. fungal infections).

LOC112268155: 654 times. Commonly found in MCF7 studies (female breast line), but also LNCAP (prostate).

LOC107986762: 643 times. Weak association with macrophage studies.

LOC105369370: 643 times. Associated with carcinoma.

LOC107987206: 636 times. No significant associations.

LOC105378936: 597 times. Slight association with leukemia.

LOC112268109: 535 times. Associated with studies involving huvecs.

LOC112268426: 534 times. Associated with endothelial cells.

LOC112268447: 504 times. Strong association with fibroblast studies.

Just for the fun of it, we lumped together the top 250 of these LOCs and ran the list through our Fisher app. Somehow, it seems that a large number of these LOCs found themselves in a list of genes upregulated in "glioblastoma tissue after g207 inoculation" (unadjusted log(p) = -16). They are also "downregulated in high-grade T1 micropapillary bladder cancer w/micropapillarity = 1 vs 0", and "upregulated in caco2 line on 12h vs 7h SARS-CoV-2 infection." The most common keyword associated with the list was "line", meaning the LOCs are overrepresented in cell line experiments (particularly HEK293) vs in vivo studies. The bias toward female studies is also retained. However, this bias may relate to a disproportionate number of female cell lines, as the bias disappears when cell lines are eliminated from consideration. In fact, when only in vivo tissue is examined, the association with the keyword "disease" is surprisingly significant (log(p) = -44). Other keywords of interest include "resistance" (as in drug resistance) and "virus".



whatismygene.com 

Saturday, June 1, 2024

The Best Gene Names

NIPSNAP3B

COBL (cordon-bleu WH2 repeat protein)

ASS1

Any RING protein (Really Interesting New Gene)

MINDY (I'm not Dead Yet)

SMAD (Mothers Against Decapentaplegic Homolog)

PYGO (pygopus)

MTHFR (Methylenetetrahydrofolate Reductase, but it reminds me of a nasty word)

SVIL (supervillin)


Bonus: Best drug name

Eltrombopag

whatismygene.com 
