Monday, October 6, 2025

Gene Order in Gene Lists

Whenever possible, WIMG gene lists are sorted. Typically, we divide log(fold-change) by significance (the P-value) and sort from largest to smallest values. If genes are not significantly altered, but are nevertheless associated with fold-changes, we sort by fold-change alone. In cases where more than 33% of all genes are significantly altered, we may create a list via the above "fc/p" method (fold-change divided by probability), and also create a second list in which we first eliminate all genes that are not significantly altered (i.e., P>0.05) and then sort by fold-change. Such lists are marked with "p&fc" in their descriptions.
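The two sorting schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the tuple layout (gene, log fold-change, P), the function names, and the small p_floor guard against division by zero are all invented for the example.

```python
def fc_over_p(rows, p_floor=1e-300):
    """Sort genes by log(fold-change) / P, largest first (the "fc/p" list).

    rows is an iterable of (gene, log_fc, p) tuples. p_floor is an
    assumption that only protects against division by zero.
    """
    scored = [(log_fc / max(p, p_floor), gene) for gene, log_fc, p in rows]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [gene for _, gene in scored]


def p_and_fc(rows, alpha=0.05):
    """Drop genes with P > alpha, then sort by fold-change (the "p&fc" list)."""
    kept = [(log_fc, gene) for gene, log_fc, p in rows if p <= alpha]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [gene for _, gene in kept]


# Toy data with made-up gene names and values.
example = [
    ("GENE_A", 3.0, 0.04),
    ("GENE_B", 2.0, 0.001),
    ("GENE_C", 4.0, 0.20),
    ("GENE_D", 1.0, 0.0001),
]

print(fc_over_p(example))  # ['GENE_D', 'GENE_B', 'GENE_A', 'GENE_C']
print(p_and_fc(example))   # ['GENE_A', 'GENE_B', 'GENE_D']
```

Note how GENE_C, which has the largest fold-change but a poor P-value, sinks to the bottom of the fc/p list and disappears from the p&fc list entirely.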

Even GO lists are sorted in our scheme. Here, genes that are most commonly perturbed are found at the beginning of GO lists, while housekeeping genes tend to be found at the end.

It seems reasonable that gene order in these sorted lists should observe some repeated patterns. In, say, a cell cycle study, we might see gene ABC followed by DEF, followed by GHI (etc.), while the reverse order might be relatively rare. It's possible to imagine two studies that intersect strongly at the level of genes, but whose genes do not follow a similar order. Conversely, the DEGs in two studies may overlap fairly weakly, but the few genes that are found in the intersection follow precisely the same order. 

The significance of the intersection of two lists and the significance of the similarity of order within the intersection are independent. With this in mind, we added a new feature to our "Fisher" app: 

The default choice is "No": gene order is not examined. If you select "Yes", the two significances are combined, possibly lowering or raising the ranks of particular studies in the output list. If you select "Gene Order Only", Fisher's exact test is not applied to your data; instead, Spearman's rank correlation test is used to see whether the intersecting genes are found in similar order in both studies.

In the odd situation that you'd like to examine cases in which gene order is reversed (one study has ABC DEF GHI and the other has GHI DEF ABC), you can select "Show non-intersecting studies" in the black bar. This makes our terminology a bit confusing: "Gene Order Only" doesn't invoke Fisher's exact test at all, so once you select it, "Show non-intersecting studies" no longer has anything to do with intersections. Never mind.

Another nuance that should be pointed out: the "intersecting genes" column simply shows up to 25 genes that are found in both studies (your input and a study from the database), but doesn't sort the genes according to their contribution to gene order.

Our Spearman's test algorithm will not output unadjusted P-values smaller than 10^-16.
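For readers who want to see what a gene-order test looks like, here is a minimal sketch in Python. It is not the code behind the app. It assumes each list contains a gene only once, computes Spearman's rho on the genes shared by two ordered lists, and converts rho to a two-sided P-value with the large-sample normal approximation (rho times sqrt(n-1) treated as a standard normal deviate), which is crude for small intersections. The function name and the minimum of 3 shared genes are assumptions; only the 10^-16 floor comes from the text above.

```python
import math

P_FLOOR = 1e-16  # P-values smaller than this are not reported


def order_similarity(list_a, list_b):
    """Return (n_shared, rho, p) for the genes common to two ordered lists.

    A positive rho means the shared genes appear in similar order;
    a negative rho means the order tends to be reversed.
    """
    in_b = set(list_b)
    shared = [g for g in list_a if g in in_b]  # shared genes, in list_a order
    n = len(shared)
    if n < 3:
        return n, None, None  # too few shared genes to test (assumed cutoff)

    # Rank the shared genes 0..n-1 by their position in list_b.
    pos_b = {g: i for i, g in enumerate(list_b)}
    rank_b = {g: r for r, g in enumerate(sorted(shared, key=pos_b.__getitem__))}

    # No ties are possible, so the classic shortcut formula applies.
    d2 = sum((rank_a - rank_b[g]) ** 2 for rank_a, g in enumerate(shared))
    rho = 1 - 6 * d2 / (n * (n * n - 1))

    # Normal approximation to the null distribution of rho.
    z = rho * math.sqrt(n - 1)
    p = math.erfc(abs(z) / math.sqrt(2))
    return n, rho, max(p, P_FLOOR)


study_a = ["ABC", "DEF", "GHI", "JKL", "MNO"]
study_b = ["XYZ", "ABC", "DEF", "QRS", "GHI", "JKL", "MNO"]

print(order_similarity(study_a, study_b))        # rho = 1.0, same order
print(order_similarity(study_a, study_b[::-1]))  # rho = -1.0, reversed order
```

The sign of rho is what distinguishes the ordinary case from the reversed-order case; the P-value alone does not.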

***************
Having set up the code for Spearman's test, we can make some inquiries of our database as a whole. One simple question: is there any evidence at all for repeated gene order in gene lists? Absolutely! Restricting ourselves to human RNA-seq studies involving perturbations and allowing no more than 400 genes in a gene list, we find several studies whose gene order matches the order found in over 300 other studies at P<=10^-16. The champion is "The RNA binding protein RALY suppresses p53 activity and promotes lung tumorigenesis", wherein genes downregulated upon RALY knockdown are found in similar order in intersections with 362 other perturbation studies. We found 862 studies that matched the order of at least 10 other studies at this significance (a total of about 70,000 study/study intersections).

What about reverse gene order when comparing study A to study B? It's relatively rare to find cases like this (at P<=10^-16), but they exist. We won't focus on them today.

Do we see cases in which the P-value associated with the intersection of two sets is uninspiring, yet the P-value associated with gene order is very significant? It's a tad unusual, but yes. As an example, the intersection between two studies we've labeled "downregulated in dendritic cells from atopic dermatitis patients on R. mucosa vs s. aureus treatment" and "downregulated in nscs from 22m vs 6m mice" is not significant at all, yet the order of the genes found in the intersection is similar at P<10^-16 (1). How about the case where the P associated with the intersection is very significant but the P associated with order is not? This is fairly common. Most typically, however, two studies that match strongly in terms of gene order also match strongly in terms of intersecting genes...this is a bit of a no-brainer, as you can't derive a significant gene-order P if there's little or no intersection between the two sets.
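To make the contrast concrete: the intersection P-value asks only how many genes are shared, not where they sit in each list. A one-sided Fisher's exact test on the overlap is equivalent to a hypergeometric upper tail, sketched below in Python. The function name is invented, and the size of the gene "universe" is an assumption you have to supply; this post does not say what universe the app uses.

```python
from math import comb


def overlap_p(n_a, n_b, n_shared, universe):
    """One-sided P of sharing at least n_shared genes by chance.

    n_a, n_b  : sizes of the two gene lists
    n_shared  : size of their intersection
    universe  : number of genes that could have been listed (an assumption)
    """
    total = comb(universe, n_b)
    tail = sum(
        comb(n_a, k) * comb(universe - n_a, n_b - k)
        for k in range(n_shared, min(n_a, n_b) + 1)
    )
    return tail / total


# Two lists of 400 genes sharing 12 genes, in a hypothetical 20,000-gene universe.
print(overlap_p(400, 400, 12, 20000))
```

Nothing in this calculation looks at gene order, which is why a weak overlap P can coexist with a very strong gene-order P, and vice versa.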

What sorts of studies tend to be associated with gene order? To ask the question, we crossed the 862 studies with each other, generating 370,660 P-values associated with study/study intersections. We can then perform clustering on the resulting P-value matrix. The resulting 7 clusters were fairly clear-cut.
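The clustering algorithm isn't specified here, so the Python sketch below is a deliberately simple stand-in rather than the method that produced the 7 clusters. It links two studies whenever their gene-order P-value is at or below a cutoff and reports the connected groups, which amounts to cutting a single-linkage tree at that cutoff. The function name, the dictionary layout for pairwise P-values, and the cutoff are all assumptions.

```python
def cluster_by_threshold(studies, pair_p, cutoff=1e-16):
    """Group studies that are linked, directly or through intermediates,
    by a pairwise P-value <= cutoff. Returns groups, largest first."""
    parent = {s: s for s in studies}

    def find(s):
        # Follow parent links to the group representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), p in pair_p.items():
        if p <= cutoff:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in studies:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values(), key=len, reverse=True)


# Toy input: hypothetical study names and P-values.
studies = ["s1", "s2", "s3", "s4", "s5"]
pair_p = {
    ("s1", "s2"): 1e-16,
    ("s2", "s3"): 1e-16,
    ("s4", "s5"): 1e-16,
    ("s1", "s4"): 0.3,
}
print(cluster_by_threshold(studies, pair_p))  # [['s1', 's2', 's3'], ['s4', 's5']]
```

Single linkage tends to chain groups together through a handful of bridging studies, so a real analysis would more likely use hierarchical clustering with another linkage, or a graph-based method.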

Cluster 1, represented by 248 studies, obviously involves the innate immune response. Gathering together the genes most commonly perturbed in these studies, IFIT3 is the top gene. Examining keywords associated with these studies, "ifn" is over-represented at log(P)= -92. Terms like "cytokine", "infection", and "virus" follow. In the other 6 clusters, the keyword "line" (as in cell line) is quite significant, but not here. A typical gene order looks like this: IFITM1 RSAD2 IFIT1 OASL ISG15 IFIT3 HERC5 IFI44 IFI35 RIPK1 RCAN1 NAPSB SIPA1L1.

Cluster 2 is represented by 472 studies. The gene CCNA2 was found in 465 of them, strongly suggesting that we're talking about the cell cycle. A typical gene order is: MKI67 RRM2 KIF20A ASPM TK1 GTSE1 NUSAP1 KIF23 ZNF367 TCF19 TRIP13 CKS2.

Cluster 3 contains 83 studies, with TRIB3 being found in 78 of them. The keywords are interesting: drug, natural, metabolite, depletion, and more. In other words, the individual studies composing the cluster are over-represented by drug studies, "natural" treatments (diets, fitness regimes, health foods, etc.), metabolite perturbations, and depletion of various nutrients and metabolites. Gene order looks like this: NIBAN1 TRIB3 DDIT3 GDF15 MTHFD2 HYOU1 NADK2 SKIL AZIN1 ZXDB.

Cluster 4 contains 12 studies, with several genes found 11 times: RPL30 RPS23 RPL14 NACA. "ripseq" and "cell part" (meaning studies in which one organelle or the like is examined against another) are prominent keywords. RPL39 RPL35A RPS17 RPL22 EIF3L NCOA3 PLP2 is a typical gene order.

Cluster 5 contains 14 studies...there are quite a few genes found in all of them. Keywords "drug" and "hypoxia" are prominent. The gene order looks like this: HMGCS1 MSMO1 HMGCR ACSL1 VAT1 MRNIP TSC22D3 MVP MT-TC.

Cluster 6 contains a mere 6 studies, with 65 genes being found in all of them. All of the studies are knockdowns, and the keyword "kd" is indeed the top keyword (log(P)<-55). There's also an association with lncRNAs and circRNAs. A typical gene order: TPM4 GANAB CASP7 BICD2 TBC1D10B ZW10 ZSWIM9 LPCAT1 NFKBIB CYP1B1.

Cluster 7 is the garbage can for the remainder of the studies that we selected. There are 24 studies, with PRDX4 and H1-2 being found 9 times. There is again an association with hypoxia.
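The per-cluster tallies above ("found in 465 of them", "found 11 times") are just counts of how many studies in a cluster contain each gene. A minimal Python sketch, with an invented function name and toy gene lists:

```python
from collections import Counter


def gene_frequencies(cluster_lists):
    """Count, for each gene, how many of the cluster's studies contain it."""
    counts = Counter()
    for genes in cluster_lists:
        counts.update(set(genes))  # count a gene once per study
    return counts.most_common()    # most frequent genes first


# Toy cluster of three studies with made-up gene names.
toy_cluster = [
    ["GENE_A", "GENE_B"],
    ["GENE_A", "GENE_C"],
    ["GENE_A", "GENE_B", "GENE_D"],
]
print(gene_frequencies(toy_cluster)[0])  # ('GENE_A', 3)
```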




(1) I don't want to read too much into two studies, but we might be able to explain the result like this: we know that genes involving immunity are often differentially regulated in old vs young subjects. While the two studies, a human infection study in dendritic cells and a mouse aging study in neural stem cells, would not be expected to intersect greatly, the few genes that do intersect follow a very significant gene order pattern involving an immune process. This is kind of cool, I think...without cheating, we're extracting a link between two studies that would ordinarily be hidden.


whatismygene.com 
