Friday, March 11, 2022

An Odd Result

After adding gene lists to the database, it's important to test these new lists against all other lists in the database via Fisher's exact test. This helps us spot potential errors. For example, it's possible we've already entered the data into the database; the new study is re-using data from another study. Let's say the new study examines interferon effects on a cell line; if the up-regulation results mirror the down-regulation results from myriad other IFN studies, there's likely some error involving +/- signs or labeling of data. These errors could be generated on our side, but they definitely can also be generated by the folks who do the wet lab work and initial data generation.
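As a rough illustration of the test itself, here is a minimal Python sketch of a one-sided (enrichment) Fisher's exact test for the overlap between two gene lists. It is not the code behind the database; the function names and the decision to work in log space with only the standard library are assumptions made for the example.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_enrichment_log10p(overlap, size_a, size_b, universe):
    """log10 of P(X >= overlap) for a hypergeometric X.

    size_a and size_b are the lengths of the two gene lists, universe is the
    number of genes either list could have been drawn from.
    """
    lowest = max(0, size_a + size_b - universe)   # smallest possible overlap
    highest = min(size_a, size_b)                 # largest possible overlap
    if overlap <= lowest:
        return 0.0                                # P = 1
    if overlap > highest:
        return float("-inf")                      # impossible overlap, P = 0
    log_total = log_choose(universe, size_a)
    log_terms = [
        log_choose(size_b, k) + log_choose(universe - size_b, size_a - k) - log_total
        for k in range(overlap, highest + 1)
    ]
    # log-sum-exp: add the tail probabilities without leaving log space
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_p / math.log(10)

# Toy check: two lists of 2 genes drawn from a universe of 4, sharing both genes.
toy_log10p = fisher_enrichment_log10p(2, 2, 2, 4)   # P = 1/6
```

Running the same function with the two lists swapped gives the same answer, since the test is symmetric in the two list sizes.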

Here's another kind of error. We grab a list of genes that are most commonly mutated in cancer (Mutational landscape and significance across 12 major cancer types, table s2). Running the list against all other lists in the database, we see that proteins with high molecular weights intersect with high significance. This signals an error of sorts. Long genes simply have more opportunity to be mutated than short genes. It makes sense, then, to adjust the mutation list by molecular weight. We do that.
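The exact adjustment is not spelled out here, so the following is only a sketch of one simple possibility: divide each gene's mutation count by its protein's molecular weight and re-rank. The gene names and numbers are placeholders invented for the example, not values from the study.

```python
def rank_by_mutations_per_kda(mutation_counts, weights_kda):
    """Return (gene, mutations per kDa) pairs, most heavily mutated first.

    Genes without a known, positive molecular weight are skipped.
    """
    rates = {
        gene: count / weights_kda[gene]
        for gene, count in mutation_counts.items()
        if weights_kda.get(gene, 0) > 0
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Placeholder data (invented): one very large protein, two small ones.
toy_counts = {"BIG_GENE": 1500, "SMALL_GENE_1": 1200, "SMALL_GENE_2": 200}
toy_weights = {"BIG_GENE": 3800.0, "SMALL_GENE_1": 44.0, "SMALL_GENE_2": 22.0}

raw_top = max(toy_counts, key=toy_counts.get)            # ranked by raw counts
adjusted = rank_by_mutations_per_kda(toy_counts, toy_weights)
adjusted_top = adjusted[0][0]                            # ranked per kDa
```

In this toy case the large protein tops the raw ranking simply because of its size, and drops to the bottom once counts are expressed per kDa.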

We then run the adjusted list of commonly mutated cancer genes against our database again. The results, for the most part, make sense. For example, hypomethylated genes in lung cancer match up nicely with the mutation list (log(P)=-44); one can imagine that if a cancer wishes to target a gene, it could alter methylation patterns or mutate it (or, perhaps, the mutation itself alters the methylation pattern). Hypomethylated genes in other cancer types also match up with those in the mutation list. An independent list of genes mutated in ampullary carcinomas intersects nicely; no surprise. The same goes for a study of mutations in glioma. And so on.

The lung cancer hypomethylation list generates the second most significant intersection against the cancer mutation list. What's the most significant gene set? It's a GO list: GOMF_OLFACTORY_RECEPTOR_ACTIVITY, with a log P-value of -74. We'd list the intersecting genes, all 111 of them, here, but it's simply a tedious list of olfactory receptor genes (e.g. OR5F1). Our GOMF_OLFACTORY_RECEPTOR_ACTIVITY list is "background-adjusted" (see here), so there's no concern that the list is strongly biased toward abundant genes. In fact, these receptors, not surprisingly, are fairly rare over most human tissues.

Despite the massive overweighting of olfactory receptors in the adjusted cancer mutation lists, TP53 still reigns supreme as the single most commonly mutated cancer gene. In fact, after adjustment, it's about 3X more commonly mutated than the next gene on the list, KRAS. The most commonly mutated olfactory receptor would be OR2T33, which ranks 21st on the list. For what it's worth, it's a moderately sized protein (32 kDa).

There are indeed scholarly works on the subject of olfactory genes in cancer. Most, if not all, of these papers, however, focus on altered expression of olfactory receptors, not the tendency of these genes to be mutated.

So, let me ask blog readers: What the hell is going on here?

Here's one (flawed) explanation which should be nipped in the bud: there are a huge number of olfactory receptors in the human proteome. Therefore, any random selection of proteins is going to strongly overlap with a devoted list of olfactory receptors. According to Wikipedia, there are about 800 ORs in the human proteome; roughly 3% of the proteome. But of the 499 genes in our adjusted cancer mutation list, 111 are ORs; a whopping 22%. Another test is simply to pull a random selection of proteins and run Fisher's exact test against all lists in our database; the exercise can be repeated, Monte-Carlo style. We did that, and there's actually a slightly negative correlation between these random selections (out of 19,000 proteins) and the olfactory receptor list.
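The Monte Carlo check can be sketched in a few lines of Python. This is an illustration only, not the code we ran: it assumes 800 ORs among 19,000 proteins (the rough figures quoted above), represents genes as integers, and simply counts overlaps rather than running Fisher's exact test against every list.

```python
import random

def simulate_overlaps(universe_size, marked_size, sample_size, trials, seed=0):
    """Draw random gene sets and count how many 'marked' genes each contains."""
    rng = random.Random(seed)
    marked = set(range(marked_size))          # stand-in for the OR list
    overlaps = []
    for _ in range(trials):
        sample = rng.sample(range(universe_size), sample_size)
        overlaps.append(sum(1 for gene in sample if gene in marked))
    return overlaps

overlaps = simulate_overlaps(19_000, 800, 499, trials=1_000)
mean_overlap = sum(overlaps) / len(overlaps)
max_overlap = max(overlaps)
expected_overlap = 499 * 800 / 19_000         # about 21 genes by chance
```

Under these assumptions a random 499-gene list should share about 21 genes with the OR list, far short of the 111 observed.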

**************
Mar 15, 2022: To get some clarity on the question, I examined several TCGA tissue-specific mutation lists, applying the above adjustment for molecular weight. As with the data mentioned above, without the adjustment, TTN (Titin, the largest human protein) appears to be the most mutant entity in cancer; with adjustment, genes like TP53 and KRAS are inevitably found near the top of the list. However, these new lists are not overloaded with olfactory receptors. I still don't know what is going on, but the inability to reproduce the result dims my enthusiasm for the OR-mutation/cancer connection. Despite the absence of ORs in these lists, they still tend to intersect nicely with the above list (with log(P-values) around -20 or better).

For what it's worth, I note that there's a very noticeable tendency to find an arginine mutation in a well-conserved "DRY" sequence around position 122 (or thereabouts...the N-terminal leader sequence varies from OR to OR). 

One interesting tendency in many of the lists above is for the mutant proteins to be found in chromosomal regions that are devoid of other genes. Specifically, no other genes are found within 200,000 bases of these frequently-mutated genes. However, this pattern is not consistent across cancers; colon cancer and urothelial carcinoma intersect strongly with these "nomad" genes (log(P) about -20), while breast cancer and glioblastoma show insignificant intersection.
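For readers who want to build a comparable "nomad" list, here is a minimal sketch of the isolation test, assuming gene coordinates are available as (name, chromosome, start, end) tuples. The function name, the input format, and the toy coordinates are all hypothetical.

```python
from collections import defaultdict

def find_isolated_genes(genes, min_gap=200_000):
    """Return names of genes with no other gene within min_gap bases.

    genes: iterable of (name, chromosome, start, end) tuples.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start, end in genes:
        by_chrom[chrom].append((name, start, end))

    isolated = []
    for chrom_genes in by_chrom.values():
        for name, start, end in chrom_genes:
            # Gap between two intervals; zero or negative when they overlap.
            has_neighbor = any(
                other != name and max(start, o_start) - min(end, o_end) <= min_gap
                for other, o_start, o_end in chrom_genes
            )
            if not has_neighbor:
                isolated.append(name)
    return sorted(isolated)

# Toy coordinates (invented): G1 and G2 sit 40 kb apart, G3 and G4 are alone.
toy_genes = [
    ("G1", "chr1", 1_000_000, 1_010_000),
    ("G2", "chr1", 1_050_000, 1_060_000),
    ("G3", "chr1", 2_000_000, 2_020_000),
    ("G4", "chr2", 1_000_000, 1_005_000),
]
nomads = find_isolated_genes(toy_genes)
```

The pairwise comparison is quadratic per chromosome, which is fine for a sketch; sorting by start position would be the obvious refinement for a whole genome.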

June 1, 2022: Apparently, some programs may simply discard olfactory receptors as having driver roles in cancer. From NCG 4.0: The network of cancer genes in the era of massive mutational screenings of cancer genomes:  

Despite all efforts to refine the identification of driver mutations, current approaches are still prone to false positives, i.e. mutated genes that are erroneously identified as cancer drivers. For example, genes encoding olfactory receptors are often included in the list of candidates, because they tend to mutate although the biological function and expression pattern of these genes strongly dismiss a possible functional role in the disease. Similarly, overly long genes are also probable false positives because their recurrent mutations in several samples are most likely due to their length more than to their function.

Regarding this "tendency to mutate", the assertion is based on a 2013 paper. My own admittedly cursory search for chromosomal regions with a tendency to mutate in healthy tissue, based on table S2 in Whole genome DNA sequencing provides an atlas of somatic mutagenesis in healthy human cells and identifies a tumor-prone cell type, does not reveal any excess tendency for olfactory receptors to mutate. The mutation analysis program from this work, MutSigCV, makes adjustments to candidate cancer drivers based on frequencies of synonymous mutations and non-coding mutations in a sample's chromosomal regions. Further...

Because in most cases these data are too sparse to obtain accurate estimates, we increased accuracy by pooling data from other genes with similar properties (for example, replication time, expression level).

Genes that tend to replicate late in a cycle are apparently more likely to mutate. Is this really true for the hundreds of olfactory receptors in the human genome? There's also a general tendency for genes with low expression to mutate, at least in this particular paper.

In general, isn't it a bit presumptuous to offhandedly dismiss olfactory receptor mutations as potential cancer drivers? There may yet be other reasons why olfactory receptors are mutated in some cancer datasets and not in others.


whatismygene.com 


A Couple Potentially Useful Tweaks

First, a quick note to WIMG users: I'll be disappearing into the Himalayas for 2 months or so. Forgive the absence of new posts, database updates, and responses to your e-mails.

*********

We've made a couple additions to the "Cell Type" filter that can be applied in most of our apps.

First, you'll see a "Dominant Tissue" choice. What does that mean? A number of big science studies (e.g. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues) have attempted to delineate proteomes/transcriptomes across whole organisms. Such studies allow one to ask, "what genes are expressed uniquely in a particular tissue (versus other tissues)?" We've combed through these studies to find these genes. Thus, for example, the gene MAGEE2 is expressed near exclusively in nerve tissue.

Knowledge of a gene's tissue-uniqueness is potentially useful for at least two purposes, we think. First, if you see an abundance of, say, appendix-unique genes in the blood, perhaps there's some leakage from the appendix. We've indeed noticed an enrichment for appendix transcripts in septic blood in some studies. Unexpected levels of particular genes could also indicate sample contamination. Secondly, tissue-unique genes could be excellent drug targets in some cases. You can place current drug treatments at some point between two extremes. At one extreme, a very general sort of treatment would have an equal effect on all cells in the body. At the other end, you have modern personalized medicine approaches that only target very specific cells (e.g. neoantigen vaccines against cancer). In the middle, or perhaps toward the "specialized" end of the spectrum, you could have treatments that only target specific organs or cell types. If a gene is both lung-unique and necessary in lung cancer, one could target that gene without effects on other organs.

The most obvious use of this feature is with the "Fisher" or "Match Studies" apps. Let's say you have a list of blood transcripts. Plug them into the Fisher app and select "Dominant Tissue" in the "Cell Type" filter (left side of the screen, black background). Submit. You'll receive information about the various tissue types found within the blood sample. Of course, if the blood is absolutely "pure", you won't get any interesting output...perhaps you'll find that the blood is enriched with blood-only transcripts, which is not particularly exciting. In any case, you probably won't see extreme P-values in the list; some of the "tissue dominant" lists in our database are fairly short, simply because tissue-unique transcripts/proteins are not common. The shortness of these lists limits the possibility of seeing crazy P-values.

Bear in mind that the output is only as good as the tissue-dominant lists we've constructed. As seen above, one of the studies we draw upon is a deep analysis of 29 human tissues. In this case, we base "uniqueness" on the fact that particular transcripts were seen in only one of the 29 tissues. There are, of course, more than 29 tissues in the human body, so it's possible that a transcript we've labeled as "tissue-unique" could be found in a tissue (say, tissue #30) that was not examined in the study. One can also question whether some transcripts/proteins would be so unique under perturbation (e.g. cancer, infection, drug treatment, etc), as the underlying studies focus primarily on healthy, equilibrium tissues.
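The "seen in only one tissue" criterion is easy to express in code. Below is a minimal sketch, assuming each tissue's detected transcripts are available as a set; the function name and the toy tissue data are hypothetical, and a real pipeline would also need an expression threshold to decide what counts as "detected".

```python
def tissue_unique_genes(detected_by_tissue):
    """Map each gene detected in exactly one tissue to that tissue."""
    tissues_per_gene = {}
    for tissue, genes in detected_by_tissue.items():
        for gene in genes:
            tissues_per_gene.setdefault(gene, set()).add(tissue)
    return {
        gene: next(iter(tissues))
        for gene, tissues in tissues_per_gene.items()
        if len(tissues) == 1
    }

# Toy input (invented): GENE_X is detected in two tissues, so it is not unique.
toy_atlas = {
    "nerve": {"GENE_N", "GENE_X"},
    "lung": {"GENE_L", "GENE_X"},
    "appendix": {"GENE_A"},
}
unique = tissue_unique_genes(toy_atlas)
```

Adding a new tissue to the input can only shrink the set of unique genes, which is the caveat about unexamined tissues in concrete form.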

The second addition to choices under the "Cell Type" filter involves blood. You could select "blood" or "blood plus." Mere "blood" will eliminate all studies not involving whole blood, or large fractions of blood. "Blood plus", however, includes studies involving all the sorts of cells that are expected to be found in blood; macrophages, lymphocytes, mast cells, monocytes, erythrocytes, blood stem cells, as well as whole blood. If you're examining the blood transcriptome, you may find this minor alteration to be of use.





Thursday, March 10, 2022

What's Up and Down in Lung Cancer?

Pulling data from 11 lung cancer studies, we've assembled a list of transcripts that are commonly up- and down-regulated in lung cancer. The PMIDs for these studies are 33801812, 32649874, 30389658, 30177858, 29127420, 27669169, 27354471, 27093186, 26483346, and 25429762 (that's 10...we extracted two sub-studies from 33801812). The database IDs for these up- and down-regulation lists are 141048203 and 141049203.

On the side of up-regulation, ube2t leads the pack, appearing in 8 out of 11 of the upregulation lists. Given the small size of our lists (typically about 200 genes), that's fairly impressive. A bit of googling reveals this gene is indeed implicated in lung cancer. At least one paper describes the development of a ube2t inhibitor, though this drug was injected in stomach tumors in mice. Genes up-regulated 7 times include anln, depdc1b, aspm, nek2, cenpf, stil, top2a, prom2, c16orf59, and melk. In general, the list is heavy with cell-cycle regulators. Wikipedia points out that given melk's abundance in cancers, attempts have been made to inhibit it; however, a CRISPR study casts doubt on melk's necessity in cancers. Perhaps melk is just along for the ride in a swarm of cell-cycle-related transcripts. Plugging melk into our co-expression app, that seems to be the case, with prominent cell cycle regulators like top2a, cdk1, aurkb, and more swarming alongside melk with extreme significance.*

Going one step further, we can plug the list of melk-coexpressed genes into our Fisher app. There, relevance to the cell cycle is obvious. For example, genes downregulated in a myeloma line on cdk4/6 inhibition overlap the melk-coexpression list with a log(P-value) of -133. Genes upregulated in S phase vs G1 phase in fibroblasts overlap with a value of -130. Etc. Thus we see how melk can be prominently upregulated in lung cancer without being a necessity. 

To complicate matters, we plugged melk into the Regulation app. We have three studies in which melk was specifically targeted (one drug study, two knockdown studies). In the drug study, melk inhibition strongly downregulates (log(P) = -25) the genes with which melk swarms, while this effect was not seen in the two knockdown studies. We note that the drug study involved glioma stem cells.

Looking at down-regulation, the presence of ca4 (carbonic anhydrase 4) in the list is quite impressive; 9 appearances in the 11 studies. There are papers on the role of ca4 in cancer, though there's not a high level of enthusiasm for developing ca4-agonists as cancer therapies. Entering the gene in our "Relevant Studies" app, we see two studies in which dexamethasone seems to do a nice job of upregulating ca4. Then again, we see a paper showing a positive link between dexamethasone treatment of lung cancer and metastasis. Studies where primary cancer treatment apparently enhances the transcriptome you'd expect in metastasis are common in our database; perhaps we should do a deep dive on this subject. It's not as if an inverse correlation between primary cancer treatment and metastasis has not been noted; look here.

Genes appearing 8 times in the down-regulation list include fmo2, cav1, tcf21, fam107a, and rage.
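The tallies above (8 of 11, 9 of 11, and so on) come from counting how many study lists each gene appears in. A minimal sketch of that kind of count, with hypothetical names and placeholder gene lists rather than the actual database contents:

```python
from collections import Counter

def count_recurrence(gene_lists):
    """Count, for each gene, the number of lists it appears in.

    Gene symbols are lower-cased and each list is de-duplicated first, so a
    gene is counted at most once per study.
    """
    counts = Counter()
    for genes in gene_lists:
        counts.update({gene.lower() for gene in genes})
    return counts

# Placeholder lists (invented) standing in for per-study up-regulation lists.
toy_lists = [
    ["geneA", "geneB"],
    ["GENEA", "geneC"],
    ["geneA", "geneA", "geneB"],
]
recurrence = count_recurrence(toy_lists)
top_gene, top_count = recurrence.most_common(1)[0]
```

With lists of roughly 200 genes each, a gene that recurs in most of 11 independent studies is unlikely to be there by chance, which is why the raw count is a useful first filter.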

Plugging the lung cancer up/down-regulation lists into the "Match Studies" app and selecting "inverse correlations" to search for means by which the lung cancer transcriptome could be reversed, the most prominent result is a mouse study in which the thymus transcriptome was altered via full body radiation (log(P) = -206).  A study involving bmp2 treatment of MSCs ranks second (-186). Restricting "cell type" to lungs, the best lung-cancer reverser involves a MAPK inhibitor (see here). Not surprisingly, the treatment targets the cell cycle. Studies involving lactoferrin, erlotinib, etoposide, and more figure prominently in the list of lung cancer reversers, at least at the cell-line level.

*In fact, in some cases the significance is so extreme that our app won't output a log(P-value). It seems that P values below about 10^-320 don't get output, resulting in blank cells in our "log10 binomials" column. We're not particularly motivated to find a workaround for this issue, as we'd say that 10^-320 is pretty damn significant.
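For anyone curious, the cutoff is plausibly ordinary double-precision underflow: a 64-bit float cannot represent values much below 10^-323, so a P-value computed directly collapses to zero before its logarithm can be taken. Below is a minimal sketch of the usual workaround, computing the binomial tail entirely in log space. It assumes nothing about how our app is actually implemented, and the function name is hypothetical.

```python
import math

def log10_binomial_tail(successes, trials, prob):
    """log10 of P(X >= successes) for X ~ Binomial(trials, prob), 0 < prob < 1."""
    log_terms = [
        math.lgamma(trials + 1) - math.lgamma(k + 1) - math.lgamma(trials - k + 1)
        + k * math.log(prob) + (trials - k) * math.log1p(-prob)
        for k in range(successes, trials + 1)
    ]
    # log-sum-exp: sum the tail without ever forming the tiny probabilities
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_p / math.log(10)

naive = 0.001 ** 150                                  # true value 1e-450; underflows
log_space = log10_binomial_tail(150, 150, 0.001)      # stays finite in log space
```

The direct calculation underflows to zero, while the log-space version still returns a usable log10 value for the same probability.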



A Preprint

It has been a while since we posted. That's largely because of the effort put into generating a paper. Check it out on bioRxiv. This is...