WhatIsMyGene

Monday, October 6, 2025

Gene Order in Gene Lists

Whenever possible, WIMG gene lists are sorted. Typically, we divide log(fold-change) by significance and sort from largest to smallest values. If genes are not significantly altered, but nevertheless are associated with fold-changes, we sort by fold-change alone. In cases where more than 33% of all genes are significantly altered, we may choose to create a list via the above "fc/p" method (fold change divided by probability), but also create a second list in which we first eliminate all genes that are not significantly altered (i.e. P>.05) and then sort according to fold-change. Such lists are marked with "p&fc" in their descriptions.
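As a rough illustration of the two sorting schemes described above, here is a minimal Python sketch. The tuple layout, the function names, the toy values, and the tiny floor used to avoid dividing by a P-value of zero are all assumptions made for the example; this is not WIMG's actual code.

```python
# Each gene is (symbol, log_fold_change, p_value). Toy values, purely illustrative.
example = [
    ("ABC", 2.0, 0.001),
    ("DEF", 3.0, 0.04),
    ("GHI", 1.0, 0.0001),
    ("JKL", 4.0, 0.2),
]

def sort_fc_over_p(genes, p_floor=1e-300):
    """'fc/p' ordering: log(fold-change) divided by P, largest first.
    p_floor is an assumed guard against P-values reported as exactly 0."""
    return sorted(genes, key=lambda g: g[1] / max(g[2], p_floor), reverse=True)

def sort_p_and_fc(genes, alpha=0.05):
    """'p&fc' ordering: drop genes with P > alpha, then sort by fold-change."""
    kept = [g for g in genes if g[2] <= alpha]
    return sorted(kept, key=lambda g: g[1], reverse=True)

fc_p_order = [g[0] for g in sort_fc_over_p(example)]
p_fc_order = [g[0] for g in sort_p_and_fc(example)]
print(fc_p_order)
print(p_fc_order)
```

Note how the two schemes disagree: a gene with a modest fold-change but a tiny P-value can top the "fc/p" list, while the "p&fc" list ranks the surviving genes by fold-change alone.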

Even GO lists are sorted in our scheme. Here, genes that are most commonly perturbed are found at the beginning of GO lists, while housekeeping genes tend to be found at the end.

It seems reasonable that gene order in these sorted lists should observe some repeated patterns. In, say, a cell cycle study, we might see gene ABC followed by DEF, followed by GHI (etc.), while the reverse order might be relatively rare. It's possible to imagine two studies that intersect strongly at the level of genes, but whose genes do not follow a similar order. Conversely, the DEGs in two studies may overlap fairly weakly, but the few genes that are found in the intersection follow precisely the same order.

The significance of the intersection of two lists and the significance of the similarity of order within the intersection are independent. With this in mind, we added a new feature to our "Fisher" app:

The default choice is "No"...you don't want to examine gene order. If you select "Yes", the two significances are combined, possibly lowering or increasing the ranks of particular studies in the output list. If you select "Gene Order Only", Fisher's exact test is not applied to your data; instead, Spearman's rank correlation test is used to see whether the intersecting genes are found in similar order in both studies. In the odd situation that you'd like to examine cases in which gene order is reversed (one study has ABC DEF GHI and the other has GHI DEF ABC, in order), you could select "Show non-intersecting studies" in the black bar. This makes our terminology a bit confusing..."Gene Order Only" doesn't invoke Fisher's exact test at all, so once you select it, "Show non-intersecting studies" no longer has anything to do with intersections. Never mind. Another nuance worth pointing out: the "intersecting genes" column simply shows up to 25 genes that are found in both studies (your input and studies from the database), but doesn't sort the genes according to their contribution to gene order.

Our Spearman's test algorithm will not output unadjusted p-values smaller than 10^-16.

***************

Having set up the code for Spearman's test, we can make some inquiries of our database as a whole. One simple question: is there any evidence at all for repeated gene order in gene lists? Absolutely! Restricting ourselves to human RNA-seq studies involving perturbations and allowing no more than 400 genes in a gene list, we find several studies whose gene order matches the order found in over 300 other studies at P<=10^-16. The champion is "The RNA binding protein RALY suppresses p53 activity and promotes lung tumorigenesis", wherein genes downregulated upon raly knockdown are found in similar order in intersections with 362 other perturbation studies. We found 862 studies that matched the order of at least 10 other studies at this significance (a total of about 70,000 study/study intersections).

What about reverse gene order when comparing study A to study B? It's relatively rare to find cases like this (at P<=10^-16), but they exist. We won't focus on them today.

Do we see cases in which the P-value associated with the intersection of two sets is uninspiring, yet the P-value associated with gene order is very significant? It's a tad unusual, but yes. As an example, the intersection between two studies we've labeled downregulated in dendritic cells from atopic dermatitis patients on R. mucosa vs s. aureus treatment and downregulated in nscs from 22m vs 6m mice is entirely insignificant, yet the gene order of genes found in the intersection is similar at P<10^-16 (1). How about the case where the P associated with the intersection is very significant but the P associated with order is not? This is fairly common. Most typically, however, two studies that match strongly in terms of gene order also match strongly in terms of intersecting genes...this is a bit of a no-brainer, as you can't derive a significant gene-order P if there's little or no intersection between the two sets.

What sorts of studies tend to be associated with gene order? To ask the question, we crossed the 862 studies with each other, generating 370,660 P-values associated with study/study intersections. We can then perform clustering on the resulting P-value matrix. The resulting 7 clusters were fairly clear-cut.
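The post doesn't say which clustering algorithm was applied to the P-value matrix, so the following is only a stand-in: a very simple single-linkage-style grouping in which two studies are connected whenever their gene-order P-value passes a threshold, and clusters are the resulting connected components. The function name, the threshold, and the toy study names are all assumptions for illustration.

```python
from collections import defaultdict

def cluster_by_threshold(p_values, threshold=1e-16):
    """p_values maps (study_a, study_b) -> gene-order P-value.
    Studies linked by P <= threshold end up in the same cluster.
    Returns a list of sets, largest cluster first."""
    neighbours = defaultdict(set)
    studies = set()
    for (a, b), p in p_values.items():
        studies.update((a, b))
        if p <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in sorted(studies):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this study.
        component, stack = set(), [start]
        while stack:
            study = stack.pop()
            if study in component:
                continue
            component.add(study)
            stack.extend(neighbours[study] - component)
        seen |= component
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)

toy = {
    ("ifn_1", "ifn_2"): 1e-20,
    ("ifn_2", "ifn_3"): 1e-18,
    ("cycle_1", "cycle_2"): 1e-30,
    ("ifn_1", "cycle_1"): 0.4,
    ("ifn_3", "cycle_2"): 0.9,
}
clusters = cluster_by_threshold(toy)
print(clusters)
```

A real analysis would more likely feed -log(P) values into a standard hierarchical or k-means routine and pick the cluster number with something like the elbow or silhouette method; the thresholding above just shows the shape of the input and output.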

Cluster 1, represented by 248 studies, obviously involves the innate immune response. Gathering together the genes most commonly perturbed in these studies, IFIT3 is the top gene. Examining keywords associated with these studies, "ifn" is over-represented at log(P)= -92. Terms like "cytokine", "infection", and "virus" follow. In the other 6 clusters, the keyword "line" (as in cell line) is quite significant, but not here. A typical gene order looks like this: IFITM1 RSAD2 IFIT1 OASL ISG15 IFIT3 HERC5 IFI44 IFI35 RIPK1 RCAN1 NAPSB SIPA1L1.

Cluster 2 is represented by 472 studies. The gene CCNA2 was found in 465 of them, strongly suggesting that we're talking about the cell cycle. A typical gene order is: MKI67 RRM2 KIF20A ASPM TK1 GTSE1 NUSAP1 KIF23 ZNF367 TCF19 TRIP13 CKS2.

Cluster 3 contains 83 studies, with TRIB3 being found in 78 of them. The keywords are interesting: drug, natural, metabolite, depletion, and more. In other words, the individual studies composing the cluster are over-represented by drug studies, "natural" treatments (diets, fitness regimes, health foods, etc.), metabolite perturbations, and depletion of various nutrients and metabolites. Gene order looks like this: NIBAN1 TRIB3 DDIT3 GDF15 MTHFD2 HYOU1 NADK2 SKIL AZIN1 ZXDB.

Cluster 4 contains 12 studies, with several genes found 11 times: RPL30 RPS23 RPL14 NACA. "ripseq" and "cell part" (meaning studies in which one organelle or the like is examined against another) are prominent keywords. RPL39 RPL35A RPS17 RPL22 EIF3L NCOA3 PLP2 is a typical gene order.

Cluster 5 contains 14 studies...there are quite a few genes found in all of them. Keywords "drug" and "hypoxia" are prominent. The gene order looks like this: HMGCS1 MSMO1 HMGCR ACSL1 VAT1 MRNIP TSC22D3 MVP MT-TC.

Cluster 6 contains a mere 6 studies, with 65 genes being found in all of them. All of the studies are knockdowns, and the keyword "kd" is indeed the top keyword (log(P)<-55). There's also an association with lncrnas and circrnas. TPM4 GANAB CASP7 BICD2 TBC1D10B ZW10 ZSWIM9 LPCAT1 NFKBIB CYP1B1.

Cluster 7 is the garbage can for the remainder of the studies that we selected. There are 24 studies, with PRDX4 and H1-2 being found 9 times. There is again an association with hypoxia.

(1) I don't want to read too much into two studies, but we might be able to explain the result like this: we know that genes involving immunity are often differentially regulated in old vs young subjects. While the two studies, a human infection study in dendritic cells and a mouse aging study in neural stem cells, would not be expected to intersect greatly, the few genes that do intersect follow a very significant gene order pattern involving an immune process. This is kind of cool, I think...without cheating, we're extracting a link between two studies that would ordinarily be hidden.

whatismygene.com

Wednesday, September 10, 2025

The WIMG view of mouse Alzheimer's studies

Here's a recent review of the state of the field of Alzheimer's research in non-humans. To summarize...these studies, nearly all of which seek to induce amyloid or tau pathology, have a dismal record.

The WIMG database has quite a large compendium of Alzheimer's studies...the term "Alzheimer's" is found in about 1200 lists, composed primarily of human and mouse studies. Previously, we used the human portion of these lists to construct new lists of genes that are canonically up- and down-regulated in the Alzheimer's disease brain (dbase IDs 123049121 and 123050121). How do mouse studies match up with these two lists?

Knowing that it's hard to get any perturbation to generate a result that looks like our Alzheimer's upregulation list, let's start with transcripts that are canonically down-regulated in Alzheimer's. Not surprisingly, the studies that best match up with this list are the human studies that compose the list. This is followed by other human neural disorders...Creutzfeldt-Jakob, Nasu-Hakola, etc. The first mouse match is ranked 21st in terms of match significance (log(P) = -33). We've labeled it as transcripts "upregulated in mouse cortex 4d vs 2d after skull injury", but you can impose a double negative on that wording to get an equivalent: transcripts downregulated in mouse cortex 2d vs 4d after skull injury. Perhaps that wording makes it more obvious that we're talking about transcripts that are downregulated early in the process of injury recovery. These injury-related studies, in fact, dominate the top of our list of mouse studies that mimic the genes that are downregulated in Alzheimer's...of mouse studies, ranks 1, 3, 8 (spinal tissue!), and 10 (cerebral artery occlusion) match our Alzheimer's list fairly significantly*.

How about rank 2? Here we're talking about a single-cell cluster ("neurons2") of brain stem neurons with and without a SOD1 mutation (Single-cell RNA-seq analysis of the brainstem of mutant SOD1 mice reveals perturbed cell types and pathways of amyotrophic lateral sclerosis). This is another theme of our mouse Alzheimer's-mimic list: clustering and/or cell-type results involving neurons, perhaps suggesting that very specific types of neurons may be more or less involved in Alzheimer's.

Another theme involves studies of embryonic brain cells. This is seen in ranks 5, 16, 18, 19, and 21.

Studies that might seem rather odd in their ability to deliver an Alzheimer's signature involve genes downregulated in the colon (!) upon gavaging with mulberry extract nanoparticles (rank 4, GSE185351), genes upregulated on pyk2 knockout (27, GSE180598), genes upregulated in aorta on rage knockout (28, GSE15729), and genes downregulated in microglia on ehmt1 haploinsufficiency (36, Derepression of inflammation-related genes link to microglia activation and neural maturation defect in a mouse model of Kleefstra syndrome).

Wait a second...where are the explicit mouse Alzheimer's studies that involve, say, the 3XTG or 5XFAD models? Well, the first hint of such a result is found at rank 13: "genes negatively correlated w/plaque intensity in E4 5XFAD mouse brain". Note, however, that this doesn't quite fit the bill, as both the test and control samples involve a 5XFAD mouse brain. It turns out you have to go down to the 99th mouse study on our list to find such a result ("downregulated in mouse 5XFAD vs wt 8m hippocampus", GSE149243, log(P)=-11). In the process, you pass through studies involving the retina, muscles, adrenal glands, heart, myoblasts, and more. In other words, a myriad of seemingly irrelevant mouse studies do a much better job of mirroring the Alzheimer's signature than studies explicitly designed to generate the signature in a mouse brain.

At this point, if we had to say something positive about mouse Alzheimer's studies, we'd say that the 5XFAD model appears best. The term "APP/PS1" first appears at rank 798. The term "3XTG" first appears at rank 1950 of 149,000 lists, with an unadjusted log(P) of -1.26.

Perhaps the mouse models do a better job of mimicking genes that are upregulated, not downregulated, in Alzheimer's. Let's take a look. Here, the first mouse study is found at rank 45 with log(P)= -8: "up-regulated in mouse cortical culture on ursodiol" (GSE110256). Ursodiol, interestingly, is a bile acid generated by humans, but in higher concentrations in bears and hibernating animals. Perhaps there is some natural justice dealt out to the humans who torture bears for their bile juice.

Eliminating all non-mouse studies, study #2 involves downregulation of hypothalamus genes upon DHA treatment (GSE64807). We've previously noted the possible benefits of DHA. Again, we see studies involving injury: ranks 9, 11, 27 (a heart infarction study), and 53 (a skin-wounding study). Bearing in mind that the p-values aren't impressive, we also see a number of gene perturbation studies that parallel the upregulation signature: lsd1 knockout, hiv-gp120 overexpression, circSCMH1 overexpression, and arx mutation.

Where do we see the first occurrences of "5XFAD" or "3XTG"? Amazingly, the first explicit 5XFAD study is ranked #4273 (unadjusted log(P) = -0.76). The situation is worse for the first 3XTG study in the list: rank #5965; here, genes upregulated in the mouse model match our list of genes downregulated in Alzheimer's better than our list of upregulated genes.

Simply put, mouse Alzheimer's studies suck. Mouse studies that do mirror the Alzheimer's signature weren't conducted with the intention of furthering understanding of Alzheimer's. One could complain that we're judging the mouse studies based on a single perspective (gene set analysis of human vs mouse transcriptomes)...but, as seen in the aforementioned review, the mouse studies have failed in numerous other respects.

*****************

If you're interested in perusing the full list of studies mentioned above, it's easy. Just go to the WIMG website, choose the Fisher tool, enter the database ID for either the Alzheimer's upregulation or downregulation list, and submit. To focus entirely on mouse studies, choose "Mouse" in the species box.

*10/29/2025: Here's an injury study we just added to the database: Transcriptome Profiling of Hippocampus After Cerebral Hypoperfusion in Mice. Here, genes downregulated in the hippocampus upon bilateral carotid artery stenosis match up with our list of genes downregulated in Alzheimer's with a log(P) of -17!

Sunday, August 24, 2025

Still more perturb-seq

Previously, we alluded to yet another perturb-seq dataset. Here it is: Comprehensive transcription factor perturbations recapitulate fibroblast transcriptional states. This time, the authors used crispr gene activation to examine the effects of over-expression of a near-comprehensive list of transcription factors in rpe1 and hs27 cell lines.

Before some discussion of the above Southard et al dataset, we should point out yet another "largest" perturb-seq dataset that we won't be adding to the database: the Tahoe100M matrix. As with the Xaira dataset, there's some hype regarding the data:

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

Unlike the Xaira data, there's not a lot of sequencing depth here. As the Xaira paper itself points out, Xaira identified 8.45 times more unique molecular identifiers (UMIs...roughly speaking, we're talking about transcripts) per cell than the Tahoe100M folks did. To exaggerate, the bioinformatician is left trying to utilize a list of ribosomal and mitochondrial counts to infer the effects of 1,100 chemical perturbations on 50 different cell lines. As much as WIMG neurotically loves hoarding data, we'll pass on this one.

Getting back to the Southard paper, we see a respectable 5,000 UMIs per cell. The data is available in a fairly processed, compact form, enabling us to churn out gene lists without a lot of optimization. Given the crispr activation, we'd like to see the targeted gene consistently appear in the list of upregulated genes. Though this can be seen at a frequency far above chance, the majority of our 100-member upregulation lists (90% or so) lack the perturbed TF. We attribute this to the fact that TFs are typically non-abundant entities, falling outside the limits of detection in Southard's setup.

As with the Xaira lists, we can observe the extent to which various Southard lists match up against "WIMG exemplar" lists. If all Southard lists failed to overlap with these lists, or all Southard lists overlapped equally (i.e. they don't cluster) with these lists, we'd question the quality of the data, or our preparation of the data. That's not the case here. As an example, Southard's LHX4, GATA1, MYC, and HIF1A activations all overlap WIMG exemplar data with very significant p-values, without overlapping with each other to any great extent. Below, note how well the HIF1A activation matches up with hypoxia studies:

The "hif1a chip-seq" result (line 14) is quite nice. It's easy to conclude that hif1a is primarily an activating, rather than repressing, transcription factor...the genes that are upregulated when hif1a is overexpressed are also found in a list of hif1a DNA targets.

Here's another nice example of Southard data quality: the single best Southard match to a myc knockout in mouse t-all cells (GSE222937) is myc activation in rpe1 cells.

We thus stamp a "not junk" label on Southard's data and include it in our database.

****************

Previously, we pointed out some issues that arise when we add these massive datasets to our database. In particular, naively combining these sets with the rest of the WIMG database skews co-expression results to an extreme. Thus we must take steps to minimize these effects. To include these perturb-seq sets in your analysis, you'll need to select the "database" option on our website.

For Fisher analysis, we've lumped the three major perturb-seq studies (Repogle, Xaira, and Southard) in our database together. It's possible, however, that you only wish to conduct analysis with one of these studies. Using the "keyword search" box, you could choose to examine only Repogle's work by typing "Repogle". You could also choose to exclude "Repogle". Likewise with the terms "Xaira" and "Southard". Let's say you only want to examine Xaira's hct116 results, not the hek293 results: type "Xaira hct116". Likewise for "Xaira hek293", "Repogle k562", "Repogle rpe1", "Southard hs27", and "Southard rpe1". Be sure to spell correctly. In general, when folks choose to perform Fisher analysis on WIMG, they want to quickly scan a database of diverse studies. The default settings, which exclude perturb-seq studies, optimize that.

For co-expression analysis, we've set things up so that you don't mix databases. Thus, you can choose "Only Xaira hct116" from the database box, but you can't combine Xaira's hct116 results with our standard database. We did include a somewhat dubious "Include Perturb-Seq crispr i/a" option, which combines Xaira, Southard, and Repogle results.

Let us know if our current interface prevents you from performing your desired analysis. In the worst case, we can get our hands dirty and do some coding.

Wednesday, August 13, 2025

More Perturb-Seq

Xaira, a recently formed billion dollar biotech, has released monster perturb-seq datasets involving crispr-inhibition in hek293 and hct116 cell lines. Thus this data joins the Repogle perturb-seq dataset in our database. For more background on the Repogle set, and on the perturb-seq approach in general, see our relevant post.

In our next post, we will explain how to access the Xaira data on the WIMG website.

Unlike Repogle's data, Xaira's data is not currently available in a form more processed than mere count data. Thus we were faced with the task of dicing up 500 Gb of scRNA count data. To be honest, we've never had reason to process this kind of data for ourselves...we scrounge the processed results from others. We initially attempted to follow standard protocols, where adjustments are made for extremely sparse data and large batch effects. Our initial, naive attempts found that control data could be grouped into two very distinct clusters. One cluster was dominated by high abundance ribosomal and mitochondrial transcripts; the other wasn't. Though batches were clearly labeled in the data, the clusters did not conform to batches (i.e. it cannot be definitively said that batch 100 is overloaded with ribosomal transcripts, and batch 127 isn't), and thus standard single-cell batch-control methods did not alleviate the presence of distinct clustering in controls. After adjusting for our own clustering results, we were disappointed. Another issue: various methods did not seem to dramatically improve the frequency with which the knocked-down gene appeared near the top of the list of downregulated genes¹. Without going into the dirty details, we finally settled on a simple procedure...normalize the counts, perform log1p adjustment, grab a random subset of control data, and perform Wilcoxon's test for significance on specifically targeted test samples vs controls. Such an algorithm performed best in drawing targeted genes to the top of their corresponding downregulation lists. Gene lists were sorted according to log(fold-change) divided by significance.
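For readers who want to see the shape of the "simple procedure" described above (normalize, log1p, random control subset, Wilcoxon test), here is a stdlib-only sketch for a single gene. Everything specific in it is an assumption: the per-cell target total of 10,000, the normal approximation used for the rank-sum P-value, the use of a difference in mean log1p values as a stand-in for log(fold-change), and the function names. In practice one would reach for a library routine such as scipy.stats.mannwhitneyu.

```python
import math
import random
from statistics import NormalDist, mean

def normalize_log1p(counts, target_sum=10_000):
    """Scale one cell's counts to a common total (assumed 10,000), then log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c * target_sum / total) for c in counts]

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum P-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    n = n1 + n2
    pooled = sorted((value, idx) for idx, value in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of this tie group
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def score_gene(gene_idx, target_cells, control_cells, n_controls=1000, seed=0):
    """Compare one gene in sgRNA-targeted cells against a random control subset."""
    rng = random.Random(seed)
    subset = rng.sample(control_cells, min(n_controls, len(control_cells)))
    x = [normalize_log1p(cell)[gene_idx] for cell in target_cells]
    y = [normalize_log1p(cell)[gene_idx] for cell in subset]
    log_fc = mean(x) - mean(y)  # approximate log fold-change on the log1p scale
    return log_fc, rank_sum_p(x, y)

# Toy data: 3 genes per cell; gene 0 is strongly reduced in the targeted cells.
target_cells = [[i, 100, 100] for i in range(8)]
control_cells = [[50 + i, 100, 100] for i in range(8)]
log_fc, p = score_gene(0, target_cells, control_cells)
print(log_fc, p)
```

Running score_gene over every gene and then sorting by log(fold-change) divided by P would give a list in the same style as the rest of the database.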

We can cluster the resulting gene lists by first generating a matrix of study/study Fisher p-values. This can be a matrix that matches Xaira lists against Xaira lists. It can also be a matrix that matches Xaira lists against our entire database. Choosing the latter approach, we were again disappointed...both the elbow and silhouette methods identified an optimal cluster number of 2. Ideally, one would like to see tens or hundreds of clusters, each representing special processes in cells. As with the control data alone, one cluster was dominated by high abundance genes.
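A study/study Fisher matrix of the kind mentioned above can be sketched in a few lines. The version below computes the one-sided (enrichment) P-value for the overlap of two gene lists directly from the hypergeometric distribution; the function names and the choice of a fixed "universe" of detectable genes are assumptions, and a library call such as scipy.stats.fisher_exact would normally do this job.

```python
from math import comb

def overlap_p(list_a, list_b, universe_size):
    """One-sided Fisher's exact test: probability of an overlap at least this
    large if list_b were drawn at random from universe_size genes."""
    a, b = set(list_a), set(list_b)
    k = len(a & b)
    total = comb(universe_size, len(b))
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(k, min(len(a), len(b)) + 1)
    )
    return tail / total

def p_value_matrix(gene_lists, universe_size):
    """gene_lists maps study name -> list of genes. Returns {(s1, s2): P} for s1 < s2."""
    names = sorted(gene_lists)
    return {
        (s1, s2): overlap_p(gene_lists[s1], gene_lists[s2], universe_size)
        for i, s1 in enumerate(names)
        for s2 in names[i + 1:]
    }

toy_lists = {
    "kd_1": ["ABC", "DEF"],
    "kd_2": ["ABC", "DEF"],
    "kd_3": ["GHI", "JKL"],
}
matrix = p_value_matrix(toy_lists, universe_size=4)
print(matrix)
```

The resulting dictionary (or its -log transform) is the sort of input a clustering routine would then consume.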

If Xaira, or some other entity, can provide better processed data, we'll happily snatch it up and overwrite our own.

There are signs, however, that the Xaira data, excreted by our crude procedure, contains worthwhile biological information. We note, for example, that Xaira knockdowns do align with the same knockdowns/outs from other studies at a frequency that is certainly not random. As just one example, genes downregulated in both of Xaira's NRF1 knockdowns strongly align with a study in which NRF1 was knocked out in the mouse retina (GSE150258); the Xaira hct116 list was the third best match out of 146,950 lists and the hek293 data was the ninth best match². Also, while the numerous lists in which ribosomal/mitochondrial genes seemed most strongly perturbed are bothersome, there may be an element of biological reality here: grouping all the genes whose knockdown apparently strongly perturbs ribosomal transcripts, we find very strong (p<10^-20) representation by genes involved in ribosomal RNA processing. These moderate- to low-abundance genes are precisely the genes whose knockdown would be expected to decrease ribosomal RNA levels³,⁴. Another positive sign: genes targeted by sgrna were found in the corresponding 100-member downregulation lists around 50% of the time. Given that roughly 20,000 genes were identified at non-zero levels, one would expect to see the targeted gene appear in the 100-member downregulation list about 0.5% of the time if the lists were composed of random garbage.
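The 0.5% figure is just the list size divided by the number of detected genes, assuming each gene is equally likely to land in a random 100-gene list. A quick check, using the round numbers quoted in the text:

```python
n_detected_genes = 20_000   # genes seen at non-zero levels (approximate, from the text)
list_size = 100             # members per downregulation list
observed_rate = 0.50        # targeted gene found in its own list about half the time

# Chance that a specific gene lands in a random list of this size.
chance_rate = list_size / n_detected_genes
fold_over_chance = observed_rate / chance_rate

print(f"chance rate: {chance_rate:.1%}")
print(f"observed rate is roughly {fold_over_chance:.0f}-fold above chance")
```

In other words, the targeted gene shows up in its own downregulation list about 100 times more often than random lists would allow.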

Assuming the sequencing of a suitable number of cells (say, 1000), any scRNA-seq paper is expected to show results of at least one clustering procedure. The optimal number of clusters, arrived at by any number of methods, can be disappointing, as above. I'm not in a position to critique the underlying math of clustering methods, but I can say that these procedures often seem to ignore rare gene patterns in favor of forcing all gene patterns into a fixed number of sets⁵. Examining Xaira data against 74 "WIMG exemplar" lists which constitute largely non-overlapping gene patterns (as measured by Fisher's exact test: see our preprint), we find Xaira gene lists that strongly match 30 of these patterns. For example, Xaira's TMEM131 kd in hct116 cells matches quite nicely (p<10^-38) with genes found in hek293 ER fraction vs cytosol (GSE215768)⁶. Genes upregulated on Xaira INTS8 kd in hct116 cells match up very nicely with genes upregulated in hcclm3 cells on BRD4 inhibition (GSE181406). Patterns generated by knockdown of genes such as ZC3H13, DDX27, SRSF1, ZWILCH, NAA25, CMTR1, ELOB, TRMT2A, and many more, match up with high significance against our (again, non-overlapping) exemplar lists.

One of the more interesting and impressive results involved genes upregulated in Xaira's REST knockdown in hek293 cells, which overlapped with great significance with a study in which PRRX1 was overexpressed (p=10^-36: GSE180515)⁷; the next closest Xaira match to this result involves knockdown of CDYL and a p-value of a mere 10^-7.4⁸. Another notable result: genes upregulated in both hek293 and hct116 lines on GRPEL1 kd overlapped strongly with a study in which IGF2BP1 was knocked out (GSE115646). And, to jump the gun a bit (our next post): genes downregulated in Xaira's PPARGC1B kd in hek293 cells overlap strongly with genes upregulated in Southard's perturb-seq PPARGC1A crispr activation: both results overlap a study in which ANLN was knocked-out in mda-mb-231 cells (GSE131120).

Most of the above observations were made by an "eyeball" approach. Taking a more systematic, computerized approach would probably yield reams of potentially interesting results.

1) Perhaps the biggest oddity in the data was this: the presence of a normally ho-hum transcript, PLXDC1, in a very large number of up- and down-regulation lists in both hct116 and hek293 results. WTH?

2) Another example: The single best Xaira match to genes upregulated on eif4a1 ko in mouse b-cells (GSE237426) is the Xaira eif4a1 kd in hct116. Another: our database's (155,000 lists) 3rd best match to genes upregulated in mouse cerebellum on eif2b5 mutation (GSE128092) is Xaira's eif2b3 kd in hek293...Xaira's eif2b5 kd in hek293 ranks 25th. Another...the single best Xaira match to a zeb1 ko in mouse osteoclasts (GSE212302) is the Xaira zeb1 kd in hek293. Another...the second best Xaira match to a mouse sin3a ko in cd4+ t-cells (GSE196615) is Xaira's sin3a kd in hct116. Another...the second best Xaira match to a mouse cdyl ko in embryonic gonads (GSE226049) is Xaira's cdyl kd in hek293. (If you find it odd that all the above studies involve mice it's simply because we've been focusing on increasing the proportion of mouse studies in the database). Another: the second best Xaira match to genes upregulated in rael cells on uhrf1 ko (GSE136596) is Xaira's uhrf1 kd in hek293.

3) I'd guess that these results are, in turn, strongly dependent on exactly how long the knock down was conducted prior to freezing the cells. Had the average knock down period been increased by a few hours, allowing recovery of ribosomal genes, or a shift into backup programs, the gene lists could be quite different. In the end, despite the massive funding ($2.00 per cell?) and output behind these studies, they only examine particular cells under particular conditions and timeframes. I'm a bit skeptical of the ability of these monster studies to reveal extraordinary insights into cellular biology on their own, whether via standard statistics or AI approaches (yes, this is a WIMG plug).

4) The best example of a Xaira knockdown that generates a list of genes overloaded with ribosomal and mitochondrial entities involves knockdown of cmtr1. In both hek293 and hct116 lines, cmtr1 kd very significantly downregulates these abundant genes. Remarkably, examining an independent study in which cmtr1 was overexpressed in mefs (GSE200103), the single best Xaira match to this study is...cmtr1 kd in hct116 cells. Cmtr1 kd in hek293 was the third best Xaira match. For reference, there are now 37,310 Xaira lists in the WIMG database.

5) To be a tad more precise...whatever value is being minimized/maximized in these procedures, it seems like it's best done not by placing one or two outlying lists into a separate cluster, but by generating clusters derived from larger numbers of lists. Thus merely increasing the cluster number doesn't automatically highlight rare but interesting gene patterns. Having said that, ChatGPT offers me a list of 8 options to overcome this issue...tinkering with the "resolution parameter" sounds promising.

6) Sure enough, a little googling shows that TMEM131 is involved in ER transport.

7) We've pointed to REST as an interesting gene in previous posts. Here, for example. We've also noted a relevance to Alzheimer's. Yup...of 37,000 Xaira gene lists, the one that best overlaps our list of genes downregulated in Alzheimer's is a list of genes upregulated on REST knockdown in hek293 cells.

8) In WIMG parlance, this is something of a "microcluster"...a result which overlaps with high significance with only one or a few other studies, followed by a dramatic drop-off in significance. We've identified about 950 microclusters scattered throughout the database, which currently contains about 19 billion study/study overlaps. In this particular case, I don't actually make the "microcluster" annotation in the database, since there are non-Xaira studies that overlap with the PRRX1 study quite significantly. But within the context of Xaira-only studies plus the PRRX1 study, the REST knockdown really stands out.

Friday, April 25, 2025

Stuff that might be true

I'll add to the below list as thoughts pop into my brain....

*"Celebrity" genes are over-rated. Last I looked there are something like 10,000 papers primarily devoted to tp53. Every now and then I stumble across a knockout that has not, to my knowledge, been performed before. One might think that such knockouts would be less likely to generate a long list of significantly perturbed transcripts than, say, a tp53 knockout. I just entered a study involving a CHSY3 knockout into the database. That's the first instance of a CHSY3 perturbation in the database, and the knockout had a major effect on transcript abundances in the underlying study. This sort of thing happens again and again...it's not as if the list of genes whose knockout strongly alters cell activity was exhausted a decade ago.

*Our understanding of biology is strongly biased according to the order of discovery. As an example, it seems that folks have a fairly fixed idea of what micro-RNAs do, if they do anything at all. When perusing micro-RNA overexpression and inhibition studies, the studies in which large numbers of transcripts are significantly altered (versus few or none) usually seem to involve the symbol "mir" followed by a small number, not a large number (e.g. mir1 vs mir1234). This may seem odd until you consider that, in the early days of miRNA research, new mirnas would simply be given the first integer that had not already been taken. In other words, the early mirnas, which were discovered because they actually did something in cells, may have created the illusion that there may be thousands of interesting mirnas, all of which act according to the principles associated with the earliest mirnas.

*Also, regarding miRNAs: it's possible that, at "ground truth" level, the typical miRNA only targets one or a few transcripts (1). This is based on an observation I haven't quantified: that miRNA overexpression and inhibition studies, in contrast to these experiments conducted on ordinary transcripts, often seem to strongly alter the expression of one or a handful of transcripts, followed by a clear drop-off in significance and/or fold-change.

*Similarly, it's possible that in a typical lab-generated list of significantly altered genes, relatively few matter. That is, a large portion of these transcripts or proteins are basically junk, possibly generated to maintain the proper concentration of RNA and/or protein. I base this on admittedly flimsy evidence (4): that if you take a large database of perturbations (e.g. WhatIsMyGene's), generate all study/study overlap P-values, and cluster the data, you might be surprised at how few clusters you generate (using standard methods to determine optimal cluster numbers...e.g. the "elbow" method). To put it overly-dramatically, one study looks like the next. How can sophisticated biological decisions be made in that case? Because a lot of the sameness between studies is not interesting, while a handful of truly interesting genes make a big difference.

*Some genes may reach celebrity status because they are exceptions to a rule: that one type of study (e.g. knockouts) usually overlaps weakly, if at all, with other types (e.g. chip-seq). P53, for example, breaks the rule: the genes altered in P53 knockouts often do overlap with P53 chip-seq results. Of course, P53 also has the property that it is often mutated in cancer, so another possible "truth" is that genes that break the rule are precisely the ones that get targeted by viruses or cancers.

*Modern biology is hugely skewed by the results of experiments in which cells are essentially blasted with extreme conditions that are rarely, if ever, experienced in normal living creatures...complete knockouts, targeted alterations of single genes, micro-RNA levels 100X greater than anything ever experienced in nature, etc. These conditions are very often measured over extremely short time scales, largely due to the limitations of cell culture approaches. It is possible that these approaches have skewed biology in massive fashion. Again, consider Alzheimer's, a disease whose seeds may be planted decades before symptoms become obvious...I've yet to see any sort of experiment, either with mice or cell culture, that parallels the set of genes that are typically upregulated in Alzheimer's (2). Perhaps this is simply because of the near-impossibility of conducting experiments over the course of decades.

*OK, maybe more of a critique than a "truth": let's say you do a chip-seq experiment using transcription factor XYZ. You collect the list of most-strongly bound genes and test your results against a parallel XYZ knockout experiment. Let's assume your background figures are good. You perform Fisher's test and get -log(P) = 4. Should you be impressed with this statistic? Should you perform further experiments based on this very significant number? I say no. If you had tested your chip-seq list against 150,000 other lists, you may have found 5,000 lists that out-performed your knockout study. In fact, after correction for multiple testing, your knockout results would be rendered insignificant. Yes, I'm plugging WhatIsMyGene.

*Maybe the "replication crisis" is a bit exaggerated. Cells and organisms are simply very sensitive to seemingly minor differences in experimental settings. Perhaps stochastic effects are much more powerful than we generally believe (3). Let's say you knock out gene ABC in mouse kidneys in your lab. Somebody else does the same thing in their lab. The results overlap weakly. Did somebody screw up? Maybe not. Using a tool like WhatIsMyGene, you may find that both studies nevertheless overlap rather nicely with a third study involving gene XYZ. (Also, recall the above point that maybe only a few transcripts really "matter" in a list of significantly altered genes, with the rest being relatively unimportant.)

*Alzheimer's has some relationship to stem cells. I say this because time and time again the genes downregulated in Alzheimer's overlap strongly with studies involving stem cells and embryonic cells. The problem is...brains don't have a lot of stem cells, especially if you exclude the SVZ. I don't know how to work around this issue...perhaps some brain cells, neurons in particular, have a stem-like signature but lack the standard markers for stem-cells.

*A bit more speculatively, Alzheimer's may also have some connection to the appendix and appendicitis. In addition to papers suggesting a link (google it), I'd also point out an odd overlap between our own list of transcripts typically up-regulated in the Alzheimer's brain and a list of transcripts up-regulated in the mouse distal colon following appendicitis (GSE23914). The significance is not impressive...but it's difficult to find any studies that overlap strongly with those up-regulated in the human Alzheimer's brain. The study ranks as the 332nd best match to the Alzheimer's list (against 145,000 other studies), competing against studies primarily derived from human cells and brain cells.

*Some in-vivo studies may produce distorted results because of the time of day at which they cut open their subjects. I'm looking at a 12 week study where mice were treated with control vs drug. Out of 145,000 studies examined, the best match (at p=10^-45) would be one that examined mouse livers at zt21 vs zt12. It turns out that the drug in question, minocycline, actually does recalibrate circadian rhythms. But...how often is the possibility of circadian effects totally ignored?

*Methylation of DNA may do more than repress or activate transcription. It may also regulate consistency/variability in expression. I have admittedly shoddy evidence for this notion: a list of genes that do not commonly correlate with batch effects is replete with genes that are often seen in DNA methylation experiments (see Quantifying batch effects for individual genes in single-cell data).

1. Just as an example, a study in which mir138 was inhibited (GSE173982) results in very significant downregulation of a single transcript, NDUFA9. Another one...mir-222-3p treatment results in the very significant downregulation of a single transcript (Gm10925) in GSE167753. Another one: mir144-3p inhibition in stress susceptible mice results in significant downregulation of a single gene (kcnj8) in GSE209673. Also, in GSE211749, only one transcript is downregulated with strong significance (zgrf1) on a triple miR-322-503-351 ko in white adipose. Also, in GSE216981, mir150 knockout downregulates tnfrsf26 at a significance of 10^-204, while the next most significant alteration comes in at 10^-20.

2. Downregulated genes in Alzheimer's are a different case...these are seen in many kinds of perturbation and clustering experiments involving brain tissue.

3. Another issue is this: if the only way you can replicate a study is via extreme rigor, how generalizable/interesting are your conclusions about your gene of interest?

4. Here's some more evidence: if you take two mouse strains and compare transcriptomics from a particular organ, you'll get a long list of differentially regulated genes...it wouldn't be surprising to see more than 50% of transcripts significantly altered. Thus, you have very different transcriptomics, but a very similar product...a mouse. One could surmise, therefore, that most of these transcripts aren't doing anything.
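To put rough numbers on the multiple-testing critique above (the chip-seq list tested against an XYZ knockout), here is a minimal sketch. It assumes the -log(P) figure is base 10 and uses a plain Bonferroni correction with a 0.05 threshold; the post doesn't say which correction is intended, so both choices, along with the function name, are our own illustration.

```python
import math

def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P-value, capped at 1 (assumed correction method)."""
    return min(1.0, p_value * n_tests)

raw_neg_log10_p = 4.0   # the -log(P) = 4 result from the example, assumed base 10
n_lists = 150_000       # number of other lists the chip-seq list could be tested against
alpha = 0.05            # conventional threshold, an assumption

adjusted_p = bonferroni_adjust(10 ** -raw_neg_log10_p, n_lists)

# The raw -log10(P) a single comparison would need to survive the correction.
required_neg_log10_p = math.log10(n_lists / alpha)

print(f"adjusted P = {adjusted_p}")
print(f"-log10(P) needed to stay significant: {required_neg_log10_p:.2f}")
```

Under these assumptions the adjusted P-value hits the cap of 1, and a single comparison would need a raw -log10(P) of roughly 6.5 to remain significant.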

whatismygene.com

Thursday, September 12, 2024

T-cell Exhaustion

"T-Cell Exhaustion" is associated with an inability of the immune system to fight off cancer and other diseases. We grabbed 7 markers of exhausted t-cells (pd-1, ctla4, tigit, lag3, tim3, cd244 and cd160) and searched our database for studies in which these markers were strongly perturbed. In only one of 91,000 gene lists were all 7 of these markers perturbed: Hematopoietic Progenitor Kinase1 (HPK1) Mediates T Cell Dysfunction and Is a Druggable Target for T Cell-Based Immunotherapies, wherein knockout of map4k1 downregulated all of these markers.

Grabbing all gene lists in which at least three of the markers were perturbed gave us 307 lists. Retaining the markers, we generated a frequency table of genes most commonly found in these lists. The markers lag3, pd-1, and tim3 topped the list. The fourth most frequent gene in our list was not one of the 7 markers: gzmb. After ctla4 and tigit we have ccl5, cst7, ccl4, gzma, and ccl3. Cd244 and cd160 occupied the 21st and 27th positions on the list. Our final list of genes associated with t-cell exhaustion contains 188 genes, with all genes required to be found at least 60 times over the 307 lists.
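The filtering and counting steps above can be sketched in a few lines. This is only an illustration on made-up toy lists, not the code behind the database; the function name and parameters are hypothetical, and the toy threshold of 2 stands in for the real cutoff of 60 occurrences over 307 lists.

```python
from collections import Counter

# The 7 exhaustion markers named in the post.
MARKERS = {"pd-1", "ctla4", "tigit", "lag3", "tim3", "cd244", "cd160"}

def exhaustion_frequency_table(gene_lists, markers=MARKERS, min_markers=3, min_count=1):
    """Keep lists containing at least `min_markers` markers, then count genes across them."""
    selected = [set(genes) for genes in gene_lists
                if len(markers & set(genes)) >= min_markers]
    counts = Counter(gene for genes in selected for gene in genes)
    # Most frequent first; ties broken alphabetically so the output is stable.
    table = sorted(((g, n) for g, n in counts.items() if n >= min_count),
                   key=lambda item: (-item[1], item[0]))
    return selected, table

# Toy gene lists (invented for illustration only).
toy_lists = [
    ["pd-1", "lag3", "tim3", "gzmb", "ccl5"],
    ["pd-1", "lag3", "ctla4", "gzmb"],
    ["tim3", "lag3", "tigit", "pd-1", "gzmb", "actb"],
    ["pd-1", "gzmb", "actb"],  # only one marker, so this list is dropped
]

selected, table = exhaustion_frequency_table(toy_lists, min_count=2)
print(len(selected), table)
```

Note that the markers themselves are retained in the count, as in the post, which is why a non-marker such as gzmb can end up ranked among them.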

Presumably, we'd like to downregulate these genes aggressively in cancer, allowing the immune system and immunotherapies to go to work. Sticking with known drug/treatment regimens (as opposed to, say, knockouts which may be difficult to implement for the time being) in lymphocytes, the single best treatment would be the presence (versus absence) of zinc in mouse drinking water: Interleukin-10 induces interferon-γ-dependent emergency myelopoiesis. Next is a dca (16-didehydro-cortistatin A) regimen: The Cyclin-Dependent Kinase 8 (CDK8) Inhibitor DCA Promotes a Tolerogenic Chemical Immunophenotype in CD4+ T Cells via a Novel CDK8-GATA3-FOXP3 Pathway. This is followed by mouse studies involving leukocyte costimulatory blockade antibody treatment, Short-term Immunosuppression Promotes Engraftment of Embryonic and Induced Pluripotent Stem Cells, and NAC treatment, Impaired mitochondrial oxidative phosphorylation limits the self-renewal of T cells exposed to persistent antigen. A mouse study involving ricolinostat, an hdac6 inhibitor, follows, but we note that this drug also upregulated a significant number of genes in our t-cell exhaustion list. Such is biology.

The first human study wherein a treatment downregulates genes in the t-cell exhaustion list is this: TNFR2 Costimulation Differentially Impacts Regulatory and Conventional CD4+ T-Cell Metabolism. The study involves application of a tnfr2 agonist antibody to cd4 t-cells. The next human study involves treatment with a cd45 fragment: The soluble cytoplasmic tail of CD45 (ct‐CD45) in human plasma contributes to keep T cells in a quiescent state.

Ignoring solutions that might be relatively practical in 2024, we see a study in which a foxp3 k18r mutation results in exhaustion gene downregulation (Foxp3 Reprograms T Cell Metabolism to Function in Low-Glucose, High-Lactate Environments), followed by the aforementioned map4k1 ko, batf3 oe, tbx21 ko, tak1 ko, tfam ko, regnase-1 ko, rbx1 ko, and en2 ko.

In terms of disease-related studies, we see these exhaustion genes downregulated in responding vs non-responding leukemia patients in Reversal of in situ T-cell exhaustion during effective human antileukemia responses to donor lymphocyte infusion. This is not surprising, but it's nice to see validation of the standard dogma regarding t-cell exhaustion. Then again, the next disease study on the list might surprise: In Single-cell landscape of the ecosystem in early-relapse hepatocellular carcinoma, t-cells associated with relapse tended to be depleted of exhaustion genes. Upregulated exhaustion genes were not only seen in cancers: see lymphocytic genes in Metallothioneins as dynamic markers for brain disease in lysosomal disorders and Hypomethylation and Overexpression of Th17-Associated Genes is a Hallmark of Intestinal CD4+ Lymphocytes in Crohn's Disease. HIV progression vs control is associated with upregulation of exhaustion genes in Transcriptional analysis of HIV-specific CD8+ T cells shows that PD-1 inhibits T cell function by upregulating BATF. In DUSP4-mediated accelerated T-cell senescence in idiopathic CD4 lymphopenia, mouse t-regs show an upregulated exhaustion signature in the diseased state.

Unfortunately, there aren't any "DIY" sorts of treatments that downregulate exhaustion genes with high significance (we set P = 10^-15 as a cutoff). Zinc supplementation is interesting, but we wish the study were conducted in humans. We will upload the exhaustion list to our database in the next week or two and post the database ID just below when we do*. Then you can search for all treatments, diseases, knockouts, etc. that up- or down-regulate the exhaustion signature. It is possible that strong alteration of the exhaustion signature could be accomplished with a cocktail of treatments, each without astounding efficacy alone; to test such hypotheses, be sure to check out our "Third Set" tool.

*The dbase ID is 188419856.


Monday, August 12, 2024

Reversing Disease Signatures

Here, we discuss the use of WIMG tools to search for drugs or treatments or gene perturbations that may reverse various disease signatures. Perhaps I'm jumping the gun a bit here...it would first be nice to show that reversing a disease signature can actually reverse a disease. I may provide concrete examples that both confirm and contradict the possibility in the future. Based on the experience of scouring tens of thousands of studies, however, it is fairly obvious that reversing a disease signature can often, if not always, effectively treat a disease. When examining cancer signatures, for example, MEK inhibitors, commonly used in cancer treatment, often do a fine job of downregulating transcripts that are upregulated in cancer, and upregulating transcripts that are downregulated in cancer. We will ignore complicating factors such as resistance. We also assume that readers are educated/experienced enough to understand that most treatments involve tradeoffs...self-experimentation is not recommended.

We've accumulated a number of gene lists involving "canonical" disease signatures. They are listed at the bottom of this page. Additional details, such as the number of studies examined in accumulating the data, are omitted for simplicity. If your disease of interest is found in the list below, you can perform several actions to search for studies in which the signature is reversed. For example, you could open up the "Fisher" app, enter the DBASE ID for "WIMG up-regulated in bald skin" in the "Enter identifers or database ID" box, and simply hit "submit." If you are only interested in reversing the signature, select "downregulated" in the "Regulation" box. Then again, if you're interested in searching for factors that could encourage balding, you could choose "upregulated." You can also enter both portions of a study (upregulated and downregulated) into the "Match Studies" tool; to reverse the two be sure to select "Inverse Correlations." It is advisable to try both apps, if possible: "Match Studies" will give you individual studies ranked according to their potencies in reversing both the up- and down-regulated portions of a study. It's possible that the most potent treatment for disease reversal would involve separately altering the up- and downregulated portions (i.e. two drugs), in which case you'd want to stick with the Fisher app.
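For readers who want to see what a "reversal" score looks like in principle, here is a minimal sketch: a one-sided Fisher's exact (hypergeometric) P-value for the overlap between a disease's upregulated genes and a treatment's downregulated genes. This is not the Fisher app's actual code. The function name, the toy gene symbols, and the background size of 20 are all assumptions made for illustration.

```python
from math import comb

def overlap_p_value(set_a, set_b, background_size):
    """One-sided Fisher's exact P-value for the overlap of two gene sets.

    Gives the chance of seeing an overlap at least this large if set_b were
    drawn at random from a background of `background_size` genes that
    contains set_a. Both sets are assumed to lie within that background.
    """
    a, b = len(set_a), len(set_b)
    k = len(set_a & set_b)
    total = comb(background_size, b)
    tail = sum(comb(a, i) * comb(background_size - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / total

# Toy example: 4 of the 5 disease-up genes are pushed down by the treatment.
disease_up = {"g1", "g2", "g3", "g4", "g5"}
treatment_down = {"g1", "g2", "g3", "g4", "g9"}

p = overlap_p_value(disease_up, treatment_down, background_size=20)
print(f"P = {p:.6f}")
```

To look for factors that reinforce a signature rather than reverse it, the same calculation would be run against the treatment's upregulated genes instead. In a real analysis the background would be the set of genes measurable in both studies, which is far larger than 20.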

If you're only interested in drug-based treatments, you can choose "drug" in the "Experiment" box. Choose "treatment" for non-small-molecule approaches (antibody-based therapy, etc). "Environment/behavior" might also be worth examining.

One nice, unique WIMG option is the "natural" option in the "Cell Type" box*. Choose it, and you will only receive "do-it-yourself" types of treatments as output...fitness programs, diets, vitamins, Chinese medicine, and stuff you might find in a "health-food" store. Again, I will assume readers are mature enough to be cautious here.

It is possible that the upregulated (or downregulated) portion of a disease signature would best be reversed by two or more treatments. Here, you might consider using the "Third Set" tool. Enter, say, the upregulated portion of disease transcripts and the downregulated portion of a signature involving a drug that you know to be effective. It's important that "Set1" be the upregulated disease signature. The tool will spit out a list of studies that intersect with the disease signature, but not the known drug signature. Again, you will probably wish to select "downregulated" in the "Regulation" box, and something like "drug" in the "Experiment" box. If you're insane and wish to find three non-overlapping drug treatments, you'll need to know all the transcripts that are considered to be downregulated in the two drug studies above...WIMG doesn't provide you with this, so you could contact us or dig up that data yourself in the studies of interest. Find the union of those two sets and discard one copy of any genes that appear twice. Use this new "dual" drug signature in the "Match Studies" tool, along with the upregulated disease signature.
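The set logic described above can be sketched with plain Python sets. This is our reading of the idea, not the "Third Set" tool's implementation: the real tool presumably ranks studies statistically, while this toy version just counts shared genes. All names and gene symbols below are hypothetical.

```python
def third_set_candidates(disease_up, known_drug_down, candidate_studies):
    """Rank studies by disease genes they downregulate that the known drug misses."""
    uncovered = disease_up - known_drug_down  # disease genes the known drug leaves alone
    hits = {}
    for name, down_genes in candidate_studies.items():
        shared = uncovered & down_genes
        if shared:
            hits[name] = shared
    # Largest number of shared genes first; ties broken by study name.
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))

def combined_signature(*drug_signatures):
    """Union of several drug signatures; a set keeps one copy of each gene."""
    return set().union(*drug_signatures)

# Toy data (invented for illustration only).
disease_up = {"g1", "g2", "g3", "g4", "g5"}
drug_a_down = {"g1", "g2", "x"}
studies = {
    "drug_b": {"g3", "g4"},
    "drug_c": {"g1", "g5"},
    "drug_d": {"y"},
}

ranked = third_set_candidates(disease_up, drug_a_down, studies)
dual = combined_signature(drug_a_down, studies["drug_b"])
print(ranked)
print(sorted(dual))
```

Using a set union takes care of the "discard one copy of any genes that appear twice" step automatically, and the resulting "dual" signature can stand in for the known drug signature when hunting for a third treatment.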

Looking below, you will see that we have a fairly limited selection of canonical disease signatures to choose from. That's because we usually create these lists when there's a substantial selection of studies from which we can draw repeatedly perturbed genes. If you wish to reverse a disease signature that doesn't have a "WIMG list", you can create one yourself using whatever studies you can find. In the case of a rare disease, there may be only one study that is relevant. It's possible that no studies exist for a disease of interest, in which case you would have to find a signature for a similar disease. You could ask us to try to dig up the studies...don't worry, we're neurotic about hoarding and analyzing data.

DBASE ID STUDY

118765101 WIMG canonical up in cancer vs. adjacent

118766101 WIMG canonical down in cancer vs. adjacent

118767101 WIMG canonical up-regulated in metastasis vs. primary

118768101 WIMG canonical down-regulated in metastasis vs. primary

118771101 WIMG new canonical cytokine storm up

118772101 WIMG new canonical cytokine storm down

123049121 WIMG canonical up in human Alzheimer's brain

123050121 WIMG canonical down in human Alzheimer's brain

123069121 WIMG canonically upregulated in blood of Alzheimer's patients

123070121 WIMG canonically downregulated in blood of Alzheimer's patients

124415121 WIMG canonically up in Parkinson's brain

124416121 WIMG canonically down in Parkinson's brain

124416122 WIMG canonically down in alcoholic brain

124416131 WIMG canonically up in alcoholic brain

124417121 WIMG canonically up in schizophrenia brain

124418121 WIMG canonically down in schizophrenia brain

124419121 WIMG canonically up in depression/bipolar brain

124419122 WIMG canonically down in depression/bipolar brain

124420121 WIMG canonically up in autism brain

124421121 WIMG canonically down in autism brain

125583121 WIMG canonical up-regulated in aging brain

125584121 WIMG canonical down-regulated in aging brain

137716203 WIMG canonically up-regulated in lung squamous cell carcinoma vs lung adenocarcinoma

137717203 WIMG canonically down-regulated in lung squamous cell carcinoma vs lung adenocarcinoma

141048203 WIMG canonically up-regulated in lung cancer

141049203 WIMG canonically down-regulated in lung cancer

141259203 WIMG up-regulated in liver cancer vs adjacent

141260203 WIMG down-regulated in liver cancer vs adjacent

142124203 WIMG canonically up-regulated in colorectal cancer vs adjacent/normal

142125203 WIMG canonically down-regulated in colorectal cancer vs adjacent/normal

142928203 WIMG up-regulated in cervical cancer

142929203 WIMG down-regulated in cervical cancer

143176203 WIMG transcripts rarely perturbed in human cancer

143177203 WIMG transcripts most commonly perturbed in human cancer

143178203 WIMG transcripts most rarely up-regulated in human cancer

143179203 WIMG transcripts most commonly up-regulated in human cancer

143180203 WIMG transcripts most rarely down-regulated in human cancer

143181203 WIMG transcripts most commonly down-regulated in human cancer

146502203 WIMG genes that are rarely down-regulated in cancer vs adjacent studies

146503204 WIMG genes that are rarely up-regulated in cancer vs adjacent studies

146503205 WIMG genes that are never down-regulated in our cancer vs adjacent studies

146504206 WIMG genes that are never up-regulated in our cancer vs adjacent studies

160517531 WIMG up-regulated in aging

160518531 WIMG down-regulated in aging

160519531 WIMG up-regulated in HUMAN aging

160520531 WIMG down-regulated in HUMAN aging

164739532 WIMG up-regulated in bald skin

164740532 WIMG down-regulated in bald skin

165959532 WIMG up-regulated on cancer recurrence

165960532 WIMG down-regulated on cancer recurrence

165961532 WIMG up-regulated in high vs low-grade cancer

165962532 WIMG down-regulated in high vs low-grade cancer

176813532 WIMG up-regulated in blood of systemic sclerosis patients

176814532 WIMG down-regulated in blood of systemic sclerosis patients

180119532 WIMG up-regulated in inflammatory disease

180120532 WIMG down-regulated in inflammatory disease

*Why is the "natural" option found in the "Cell Type" box? It's unintuitive, but it was easy to program. We can fix that in the future.
