Friday, March 11, 2022

An Odd Result

After adding gene lists to the database, it's important to test these new lists against all other lists in the database via Fisher's exact test. This helps us spot potential errors. For example, we may have already entered the data into the database because the new study re-uses data from another study. Or let's say the new study examines interferon effects on a cell line; if its up-regulation results mirror the down-regulation results from myriad other IFN studies, there's likely some error involving +/- signs or labeling of data. These errors could be generated on our side, but they can just as easily be generated by the folks who do the wet lab work and initial data generation.
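To make the screening step concrete, here is a minimal sketch of a one-sided Fisher's exact test on the overlap of two gene lists, run against every list in a toy database. The function names, the toy gene IDs, and the 10-gene universe are invented for illustration; this is not the code behind our database, just one way the check could be written with the Python standard library.

```python
from math import comb

def fisher_overlap(list_a, list_b, universe_size):
    """Overlap size and one-sided Fisher's exact P-value for two gene lists.

    The P-value is the hypergeometric upper tail: the chance of seeing an
    overlap at least this large if list_b were drawn at random from a
    universe of `universe_size` genes.
    """
    a, b = set(list_a), set(list_b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return k, tail / comb(universe_size, n_b)

def screen_new_list(new_list, database, universe_size):
    """Test a new list against every stored list; most significant first."""
    hits = []
    for name, genes in database.items():
        k, p = fisher_overlap(new_list, genes, universe_size)
        hits.append((name, k, p))
    return sorted(hits, key=lambda h: h[2])

# Toy data (hypothetical list names and gene IDs).
toy_db = {
    "study_1_up": ["G1", "G2", "G3", "G9"],
    "study_2_down": ["G7", "G8"],
}
results = screen_new_list(["G1", "G2", "G3", "G4"], toy_db, universe_size=10)
for name, k, p in results:
    print(name, k, p)
```

A new "up-regulated" list that lands at the top of this ranking against an older "down-regulated" list is exactly the kind of red flag described above.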

Here's another kind of error. We grab a list of genes that are most commonly mutated in cancer (Mutational landscape and significance across 12 major cancer types, Table S2). Running the list against all other lists in the database, we see that proteins with high molecular weights intersect with high significance. This signals an error of sorts: long genes, which encode large proteins, simply have more opportunity to be mutated than short genes. It makes sense, then, to adjust the mutation list by molecular weight. We do that.
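One simple way to make such an adjustment is to divide each gene's mutation count by its protein's molecular weight and re-rank. The sketch below assumes that normalization; it is not necessarily the exact adjustment we applied. The function name is hypothetical, the mutation counts are invented, and the molecular weights are rounded approximations.

```python
def adjust_by_molecular_weight(mutation_counts, mol_weight_kda):
    """Rank genes by mutations per kDa of protein (an assumed normalization)."""
    rates = {}
    for gene, count in mutation_counts.items():
        mw = mol_weight_kda.get(gene)
        if not mw:  # skip genes with a missing or zero molecular weight
            continue
        rates[gene] = count / mw
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

# Toy numbers: counts are made up, weights are approximate.
toy_counts = {"TTN": 900, "TP53": 400, "GENE_X": 10}
toy_weights = {"TTN": 3800.0, "TP53": 44.0, "GENE_X": 50.0}

ranked = adjust_by_molecular_weight(toy_counts, toy_weights)
for gene, rate in ranked:
    print(gene, round(rate, 3))
```

With these toy inputs, the giant protein leads on raw counts but drops below the much smaller one once counts are scaled by size.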

We then run the adjusted list of commonly mutated cancer genes against our database again. The results, for the most part, make sense. For example, hypomethylated genes in lung cancer match up nicely with the mutation list (log(P)=-44); one can imagine that if a cancer wishes to target a gene, it could alter methylation patterns or mutate it (or, perhaps, the mutation itself alters the methylation pattern). Hypomethylated genes in other cancer types also match up with those in the mutation list. An independent list of genes mutated in ampullary carcinomas intersects nicely; no surprise. The same goes for a study of mutations in glioma. And so on.

The lung cancer hypomethylation list generates the second most significant intersection against the cancer mutation list. What's the most significant gene set? It's a GO list: GOMF_OLFACTORY_RECEPTOR_ACTIVITY, with a log P-value of -74. We'd list the intersecting genes, all 111 of them, here, but it's simply a tedious list of olfactory receptor genes (e.g. OR5F1). Our GOMF_OLFACTORY_RECEPTOR_ACTIVITY list is "background-adjusted" (see here), so there's no concern that the list is strongly biased toward abundant genes. In fact, these receptors, not surprisingly, are fairly rare over most human tissues.

Despite the massive overweighting of olfactory receptors in the adjusted cancer mutation lists, TP53 still reigns supreme as the single most commonly mutated cancer gene. In fact, after adjustment, it's about 3X more commonly mutated than the next gene on the list, KRAS. The most commonly mutated olfactory receptor would be OR2T33, which ranks 21st on the list. For what it's worth, it's a moderately sized protein (32 kDa).

There are indeed scholarly works on the subject of olfactory genes in cancer. Most, if not all, of these papers, however, focus on altered expression of olfactory receptors, not the tendency of these genes to be mutated.

So, let me ask blog readers: What the hell is going on here?

Here's one (flawed) explanation which should be nipped in the bud: there are a huge number of olfactory receptors in the human proteome. Therefore, any random selection of proteins is going to strongly overlap with a devoted list of olfactory receptors. According to Wikipedia, there are about 800 ORs in the human proteome; roughly 3% of the proteome. But of the 499 genes in our adjusted cancer mutation list, 111 are ORs; a whopping 22%. Another test is simply to pull a random selection of proteins and run Fisher's exact test against all lists in our database; the exercise can be repeated, Monte-Carlo style. We did that, and there's actually a slightly negative correlation between these random selections (out of 19,000 proteins) and the olfactory receptor list.
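The two checks in the paragraph above can be sketched in a few lines. The first is plain arithmetic on the figures quoted there (about 800 ORs, a 499-gene list, a 19,000-protein background). The second is a Monte Carlo draw of random 499-gene lists. The function name, the trial count, and the seed are assumptions for illustration, and this counts raw overlaps rather than re-running our full Fisher's exact pipeline.

```python
import random

def random_overlap_counts(universe_size, target_size, sample_size, trials, seed=0):
    """Count how many 'target' genes land in each random sample.

    Genes are represented by integer indices; the first `target_size`
    indices stand in for the olfactory receptors. Which indices are
    chosen does not matter for a uniform random draw.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        sample = rng.sample(range(universe_size), sample_size)
        counts.append(sum(1 for g in sample if g < target_size))
    return counts

# Figures taken from the text above.
UNIVERSE, N_OR, LIST_SIZE, OBSERVED = 19_000, 800, 499, 111

# Expected number of ORs in a random 499-gene list.
expected = LIST_SIZE * N_OR / UNIVERSE

counts = random_overlap_counts(UNIVERSE, N_OR, LIST_SIZE, trials=1000)
exceed = sum(c >= OBSERVED for c in counts)

print("expected by chance:", round(expected, 1))
print("largest random overlap:", max(counts))
print("random lists reaching the observed overlap:", exceed)
```

By these figures, chance alone would put roughly 21 ORs in a 499-gene list, so an overlap of 111 is far beyond anything a random selection should produce.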

**************
Mar 15, 2022: To get some clarity on the question, I examined several TCGA tissue-specific mutation lists, applying the above adjustment for molecular weight. As with the data mentioned above, without the adjustment, TTN (Titin, the largest human protein) appears to be the most frequently mutated gene in cancer; with adjustment, genes like TP53 and KRAS are inevitably found near the top of the list. However, these new lists are not overloaded with olfactory receptors. I still don't know what is going on, but the inability to reproduce the result dims my enthusiasm for the OR-mutation/cancer connection. Despite the absence of ORs in these lists, they still tend to intersect nicely with the above list (with log(P) values around -20 or better).

For what it's worth, I note that there's a very noticeable tendency to find an arginine mutation in a well-conserved "DRY" sequence around position 122 (or thereabouts...the N-terminal leader sequence varies from OR to OR). 

One interesting tendency in many of the lists above is for the mutant proteins to be found in chromosomal regions that are devoid of other genes. Specifically, no other genes are found within 200,000 bases of these frequently-mutated genes. However, this pattern is not consistent across cancers; colon cancer and urothelial carcinoma intersect strongly with these "nomad" genes (log(P) about -20), while breast cancer and glioblastoma show insignificant intersection.
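For readers who want to build a "nomad" list of their own, here is a minimal sketch of the 200,000-base rule. It assumes gene coordinates are available as (name, chromosome, start, end) tuples; the function name and the toy coordinates are invented, and the brute-force comparison is chosen for clarity rather than speed.

```python
from collections import defaultdict

def find_nomad_genes(genes, min_gap=200_000):
    """Return names of genes with no other gene within `min_gap` bases.

    `genes` is an iterable of (name, chrom, start, end) tuples.
    Overlapping genes count as neighbors.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, name))

    nomads = []
    for items in by_chrom.values():
        for start, end, name in items:
            isolated = all(
                other == name
                or o_start - end > min_gap   # neighbor lies far downstream
                or start - o_end > min_gap   # neighbor lies far upstream
                for o_start, o_end, other in items
            )
            if isolated:
                nomads.append(name)
    return sorted(nomads)

# Toy coordinates (hypothetical genes).
toy_genes = [
    ("A", "chr1", 1_000, 5_000),
    ("B", "chr1", 100_000, 120_000),      # 95,000 bases from A
    ("C", "chr1", 1_000_000, 1_010_000),  # 880,000 bases from B
    ("D", "chr2", 500, 900),              # alone on its chromosome
]
nomad_list = find_nomad_genes(toy_genes)
print(nomad_list)
```

The resulting list can then be tested against each cancer's mutation list like any other gene set.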

June 1, 2022: Apparently, some programs may simply dismiss olfactory receptors as potential cancer drivers. From NCG 4.0: The network of cancer genes in the era of massive mutational screenings of cancer genomes:

Despite all efforts to refine the identification of driver mutations, current approaches are still prone to false positives, i.e. mutated genes that are erroneously identified as cancer drivers. For example, genes encoding olfactory receptors are often included in the list of candidates, because they tend to mutate although the biological function and expression pattern of these genes strongly dismiss a possible functional role in the disease. Similarly, overly long genes are also probable false positives because their recurrent mutations in several samples are most likely due to their length more than to their function.

Regarding this "tendency to mutate", the assertion is based on a 2013 paper. My own admittedly cursory search for chromosomal regions with a tendency to mutate in healthy tissue, based on Table S2 in Whole genome DNA sequencing provides an atlas of somatic mutagenesis in healthy human cells and identifies a tumor-prone cell type, does not reveal any excess tendency for olfactory receptors to mutate. The mutation analysis program from the 2013 paper, MutSigCV, makes adjustments to candidate cancer drivers based on frequencies of synonymous mutations and non-coding mutations in a sample's chromosomal regions. Further...

Because in most cases these data are too sparse to obtain accurate estimates, we increased accuracy by pooling data from other genes with similar properties (for example, replication time, expression level).

Genes that tend to replicate late in a cycle are apparently more likely to mutate. Is this really true for the hundreds of olfactory receptors in the human genome? There's also a general tendency for genes with low expression to mutate, at least in this particular paper.

In general, isn't it a bit presumptuous to offhandedly dismiss olfactory receptor mutations as potential cancer drivers? There may yet be other reasons why olfactory receptors are mutated in some cancer datasets and not in others.


whatismygene.com 

