Saturday, August 14, 2021

Abundant Transcripts

As long as we’re in heatmap mode, we thought we’d throw some fairly “conventional” data into the “heatmap.2” function from R’s gplots package. This time, it’s simply the most abundant transcripts found in 61 different tissues. All of the data, of course, is in our database; if your data strongly overlaps with a particular tissue type and you run it through our apps, you’ll certainly know. You can find the underlying data (and a lot more) here: https://www.proteinatlas.org/about/download .

The presence of abundance data in our database is also useful in another respect. If your data is biased toward abundant transcripts/proteins, as opposed to a more typical mix of abundant and rare entities, and you apply our Fisher app to the data, the app’s output will be overloaded with abundance studies. Treat this as a bit of a warning. It’s possible that you simply need to adjust the “background” for your data (e.g. a typical proteomic study contains around 4,000 proteins…our default background of 20,000 is not appropriate in this case, and you’ll need to tweak it). There may be other, more problematic reasons for the bias in your data. Alternatively, there may be a silly mistake…for example, your data was sorted according to abundance. We’ve caught problems in our own data entries this way (and then fixed them!).

We also note issues in external datasets. Our favorite is probably the GO “ESTABLISHMENT OF PROTEIN LOCALIZATION TO ENDOPLASMIC RETICULUM” group, which is a wonderful proxy for the most abundant proteins in human tissue. Of course, a final possibility is this: your data is legitimately skewed toward or against abundance. One phenomenon we note is that cancer tissues often lose, to some extent, the underlying tissue identity; stomach cancer tissue, for example, will become less stomach-y, and you may see this as a cancer-related decrease in the most abundant stomach entities.
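To see why the background setting matters so much, here is a minimal R sketch. The list sizes and overlap are made-up numbers, `overlap_p` is a hypothetical helper, and the one-sided test is an assumption; this is not necessarily how our Fisher app computes its values. The same overlap is simply scored against a background of 20,000 and then 4,000.

```r
# Hypothetical helper: one-sided Fisher's exact test for the overlap
# between two lists drawn from a shared background of entities.
overlap_p <- function(n_a, n_b, n_both, background) {
  # 2x2 table: in/out of list A (rows) versus in/out of list B (columns)
  tab <- matrix(
    c(n_both,       n_a - n_both,
      n_b - n_both, background - n_a - n_b + n_both),
    nrow = 2, byrow = TRUE
  )
  fisher.test(tab, alternative = "greater")$p.value
}

# Made-up example: two lists of 200 entities sharing 40 members
p_default   <- overlap_p(200, 200, 40, background = 20000)
p_proteomic <- overlap_p(200, 200, 40, background = 4000)

# The smaller background makes the same overlap look less surprising
print(c(default = p_default, proteomic = p_proteomic))
```

With the larger background, only about 2 shared entities are expected by chance; with the smaller one, about 10. That is why an oversized background inflates significance.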

One final technical point…if you’re using our co-expression app, these abundance lists will not be examined unless you manually tweak the “regulation” feature to “ANY.” In other words, only lists involving up/down-regulated entities will be examined unless you overrule the “regulation” feature.

Here’s the heatmap, with “values” being –log(P-values), as generated by Fisher’s exact test, applied to all combinations of cell types:


It shouldn’t be surprising that some extreme P-values were generated. The basic components for metabolism, structure, etc. are plentiful and don’t vary much from cell to cell.
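For anybody who wants to build a similar matrix, here is a rough sketch of the all-against-all comparison in R. Everything in it is a stand-in: the three “tissues,” their gene lists, and the background of 2,000 are invented, `neg_log_p` is a hypothetical helper, and the choice of log base 10 and a one-sided test are assumptions rather than a description of our pipeline.

```r
# Invented background and "most abundant" lists, for illustration only
n_background <- 2000
lists <- list(
  tissueA = paste0("gene", 1:100),
  tissueB = paste0("gene", 51:150),   # shares 50 genes with tissueA
  tissueC = paste0("gene", 501:600)   # shares nothing with A or B
)

# Hypothetical helper: -log10(P) from a one-sided Fisher's exact test
neg_log_p <- function(a, b, n_background) {
  both <- length(intersect(a, b))
  tab <- matrix(
    c(both,             length(a) - both,
      length(b) - both, n_background - length(a) - length(b) + both),
    nrow = 2, byrow = TRUE
  )
  -log10(fisher.test(tab, alternative = "greater")$p.value)
}

# Fill a square matrix; the diagonal (a list against itself) is left as NA
tissues <- names(lists)
scores <- matrix(NA_real_, length(tissues), length(tissues),
                 dimnames = list(tissues, tissues))
for (i in tissues) {
  for (j in tissues) {
    if (i != j) scores[i, j] <- neg_log_p(lists[[i]], lists[[j]], n_background)
  }
}
print(round(scores, 1))
```

A matrix like `scores` is the kind of input that can be handed to a heatmap function; pairs with no shared entries score 0, and strongly overlapping pairs get large values.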

The image probably aligns with your own ideas about similarities between various cell types. I was surprised, however, by how cleanly the cell types clustered into groups (see the dendrogram on top). The left-most columns contain cell types that don’t overlap with other cell types with great significance: testis, liver, parathyroid, and placenta, in particular, followed by cerebellum, granulocytes, skeletal muscle, thymus, heart, and intestines. Next, there’s a square of red/orange/yellow color. That’s all brain tissue: basal ganglia, pons/medulla, spinal cord, olfactory gland, hypothalamus, amygdala, midbrain, cerebral cortex, corpus callosum, thalamus, and hippocampus. Interestingly, the cerebellum was not found in this group. Next is a grouping of cell types that don’t overlap any other types with extreme significance, with a few exceptions (total PBMCs/monocytes, duodenum/colon, monocytes/dendritic cells). The next strong red/orange patch consists of gall bladder, vagina, skeletal muscle, cervix, prostate, fallopian tube, endometrium, and bladder. The next red/orange patch contains T-cells, NK cells, B-cells, PBMCs, and the spleen. Finally, the appendix, lymph nodes, and tonsils group together strongly.

Examining the underlying data, the weakest overlap belongs to the cerebellum/liver pairing, with a P-value that doesn't even reach 0.05.

Oddly, the midbrain and the amygdala match up with the rectum fairly significantly!



*Note to self and anybody who 1) doesn’t think my heatmap is utter garbage and 2) would like to do something similar. Here’s the code that generates the colors:

col <- c("navy", "blue", "dodgerblue", "lightskyblue", "palegreen", "yellow", "orange", "red")

breaks <- c(0, 2, 12, 25, 50, 100, 150, 200, 325)

heatmap.2(blah, blah, col = col, breaks = breaks, blah, blah)

I like this approach because it’s easy. Just make sure you’ve got one more break than colors (above there are 9 breaks and 8 colors). Of course, if you must make a gradient, you can’t use this easy method. In any case, here’s a nice color “cheatsheet”: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf . The bottom of the sheet contains names for something like 600 colors that you can plug in as above.
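If you want to check which color a given value will land in before plotting, the binning can be mimicked in base R. This is only a sketch: `color_for` is a hypothetical helper, and it assumes right-closed bins with the lowest break included, which is my understanding of how R’s image-based plotting assigns colors. The last two lines show one possible route to a gradient using `colorRampPalette`, with anchor colors picked arbitrarily.

```r
col <- c("navy", "blue", "dodgerblue", "lightskyblue",
         "palegreen", "yellow", "orange", "red")
breaks <- c(0, 2, 12, 25, 50, 100, 150, 200, 325)

# Sanity check: one more break than colors
stopifnot(length(breaks) == length(col) + 1)

# Hypothetical helper: look up the color bin for each value
color_for <- function(x) {
  col[cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)]
}
print(color_for(c(1, 30, 300)))

# One way to get a smooth gradient instead: interpolate between a few
# anchor colors, producing as many colors as there are bins
gradient_col <- colorRampPalette(c("navy", "yellow", "red"))(length(breaks) - 1)
print(gradient_col)
```

Either `col` or `gradient_col` can then be passed as the `col` argument alongside the same `breaks`.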


whatismygene.com 

