Saturday, August 14, 2021

Abundant Transcripts

 As long as we’re in heatmap mode, we thought we’d throw some fairly “conventional” data into R’s “heatmap.2” function. This time, it’s simply the most abundant transcripts found in 61 different tissues. All the data, of course, is found in our database; if your data strongly overlaps with a particular tissue type and you run it through our apps, you’ll certainly know. You can find the underlying data (and a lot more) here: https://www.proteinatlas.org/about/download .

The presence of abundance data in our database is also useful in another respect. If your data is biased toward abundant transcripts/proteins, as opposed to a more typical mix of abundant and rare entities, and you apply our Fisher app to the data, the app’s output will be overloaded with abundance studies. Treat this as a warning. It’s possible that you simply need to adjust the “background” for your data (e.g. a typical proteomic study contains around 4,000 proteins…our default background of 20,000 is not appropriate in this case, and you’ll need to tweak it). There may be other, more problematic reasons for the bias in your data. Alternatively, there may be a silly mistake…your data was sorted according to abundance. We’ve caught problems in our own data entries this way (and then fixed them!). We also see issues in external datasets. Our favorite is probably the GO “ESTABLISHMENT OF PROTEIN LOCALIZATION TO ENDOPLASMIC RETICULUM” group, which is a wonderful proxy for the most abundant proteins in human tissue. Of course, a final possibility is this: your data is legitimately skewed toward or against abundance. One phenomenon we note is that cancer tissues often lose, to some extent, the underlying tissue identity; stomach cancer tissue, for example, will become less stomach-y, and you may see this as a cancer-related decrease in the most abundant stomach entities.
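
To get a feel for why the background matters, here is a rough R sketch. It is not the Fisher app's actual code: the helper name overlap_p and all of the list sizes and overlap counts are invented for illustration. Only the 20,000 and 4,000 backgrounds come from the text above.

```r
# Hypothetical helper: one-sided Fisher's exact test for the overlap between
# two lists drawn from a shared background. All counts below are invented.
overlap_p <- function(n_a, n_b, n_overlap, background) {
  tab <- matrix(
    c(n_overlap,                              # in both lists
      n_a - n_overlap,                        # in list A only
      n_b - n_overlap,                        # in list B only
      background - n_a - n_b + n_overlap),    # in neither list
    nrow = 2
  )
  fisher.test(tab, alternative = "greater")$p.value
}

# The same two 200-entity lists sharing 30 entities, under two backgrounds.
p_default    <- overlap_p(200, 200, 30, background = 20000)
p_proteomics <- overlap_p(200, 200, 30, background = 4000)

print(c(background_20000 = p_default, background_4000 = p_proteomics))
```

With the smaller background, the chance-level overlap rises from about 2 entities to about 10, so the same 30 shared entities look less remarkable. An oversized background inflates significance across the board.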

One final technical point…if you’re using our co-expression app, these abundance lists will not be examined unless you manually tweak the “regulation” feature to “ANY.” In other words, only lists involving up/down-regulated entities will be examined unless you overrule the “regulation” feature.

Here’s the heatmap, with “values” being –log(P-values), as generated by Fisher’s exact test, applied to all combinations of cell types:
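
For anybody who wants to build this kind of matrix themselves, here is a minimal sketch in R. The three gene lists, the background of 20,000, and the helper name neg_log_p are placeholders, and the choice of base 10 for the log is an assumption (the post only says "–log").

```r
# Toy stand-ins for "most abundant transcript" lists. The gene symbols and the
# background size are placeholders, not values pulled from the database.
lists <- list(
  liver  = c("ALB", "APOA1", "FGB", "ACTB", "GAPDH"),
  heart  = c("MYH7", "TNNT2", "ACTC1", "ACTB", "GAPDH"),
  cortex = c("SNAP25", "GFAP", "MBP", "ACTB", "GAPDH")
)
background <- 20000

# Hypothetical helper: -log10 of the one-sided Fisher P-value for one pair.
neg_log_p <- function(a, b, background) {
  k <- length(intersect(a, b))
  tab <- matrix(c(k, length(a) - k, length(b) - k,
                  background - length(a) - length(b) + k), nrow = 2)
  -log10(fisher.test(tab, alternative = "greater")$p.value)
}

# Fill a symmetric matrix over all pairs; the diagonal is simply left at 0.
n <- length(lists)
m <- matrix(0, n, n, dimnames = list(names(lists), names(lists)))
pairs <- combn(n, 2)
for (j in seq_len(ncol(pairs))) {
  a <- pairs[1, j]
  b <- pairs[2, j]
  m[a, b] <- m[b, a] <- neg_log_p(lists[[a]], lists[[b]], background)
}

print(round(m, 1))
```

A matrix like m is what gets handed to heatmap.2. One thing to watch with real data: extremely small P-values can underflow to zero in double precision, which turns –log(P) into Inf, so you may need to cap the values before plotting.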


It shouldn’t be surprising that some extreme P-values were generated. The basic components for metabolism and structure (etc.) are both plentiful and don’t vary much from cell to cell.

The image probably aligns with your own ideas about similarities between various cell types. I was surprised, however, by how cleanly the cell types clustered into groups (see the dendrogram on top). The left-most columns contain cell types that don’t overlap other cell types with great significance: testis, liver, parathyroid, and placenta, in particular, followed by cerebellum, granulocytes, skeletal muscle, thymus, heart, and intestines. Next, there’s a square of red/orange/yellow color. That’s all brain tissue: basal ganglia, pons/medulla, spinal cord, olfactory gland, hypothalamus, amygdala, midbrain, cerebral cortex, corpus callosum, thalamus, hippocampus. Perhaps it’s interesting that the cerebellum was not found in this group. Next is a grouping of cell types that don’t overlap any other types with extreme significance, with a few exceptions (total PBMCs/monocytes, duodenum/colon, monocytes/dendritic cells). The next strong red/orange patch consists of gall bladder, vagina, skeletal muscle, cervix, prostate, fallopian tube, endometrium, and bladder. The next red/orange patch contains T-cells, NK cells, B-cells, PBMCs, and the spleen. Next, the appendix, lymph nodes, and tonsils group together strongly.

Examining the underlying data, the weakest overlap belongs to the cerebellum/liver pairing, with a P-value that doesn't even reach 0.05.

Oddly, the midbrain and the amygdala match up with the rectum fairly significantly!



*Note to self and anybody who 1) doesn’t think my heatmap is utter garbage and 2) would like to do something similar. Here’s the code that generates the colors:

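library(gplots)  # provides heatmap.2

col <- c("navy", "blue", "dodgerblue", "lightskyblue", "palegreen", "yellow", "orange", "red")

breaks <- c(0, 2, 12, 25, 50, 100, 150, 200, 325)

# "mat" stands in for your numeric matrix; add your other heatmap.2 arguments as needed
heatmap.2(mat, col = col, breaks = breaks)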

I like this approach because it’s easy. Just make sure you’ve got one more break than colors (above there are 9 breaks and 8 colors). Of course, if you must make a gradient, you can’t use this easy method. In any case, here’s a nice color “cheatsheet”: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf . The bottom of the sheet contains names for something like 600 colors that you can plug in as above.
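
If you do want a gradient, one common route is to let colorRampPalette generate the colors. The sketch below is just one way to do it. The three endpoint colors and the choice of 100 bins are arbitrary, and the heatmap.2 call is left as a comment because it needs the gplots package and your own matrix (called mat here as a placeholder).

```r
# Number of color bins in the gradient; 100 is an arbitrary choice.
n_col <- 100

# Evenly spaced breaks over the same 0-325 range used above.
grad_breaks <- seq(0, 325, length.out = n_col + 1)

# colorRampPalette returns a function; calling it with n_col yields n_col colors.
grad_col <- colorRampPalette(c("navy", "yellow", "red"))(n_col)

# The "one more break than colors" rule still has to hold.
stopifnot(length(grad_breaks) == length(grad_col) + 1)

# With gplots loaded and a numeric matrix "mat" (placeholder name):
# heatmap.2(mat, col = grad_col, breaks = grad_breaks)
```

Evenly spaced breaks will wash out the low end of a range like 0 to 325, which is one reason the hand-picked, unevenly spaced breaks above are handy.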


whatismygene.com 


Wednesday, August 11, 2021

The Perturbome

When focusing on cell types, we could make a list of the most abundant transcripts in particular cell types. We could also focus on the proteome. We could even ask, “what are the common transcripts/proteins that are rarely seen in a particular cell type?” We could search for cell type markers that are rarely or never seen in other cell types, even if these markers are not particularly abundant in the cell type of interest. Our database is chock-full of the above sorts of lists.

There’s another sort of list we are able to prepare, largely because the sheer size of our database affords the opportunity. The largest portion of the database falls into the category of “perturbation studies” wherein cells are perturbed via drug, knockout, heat, whatever. We can thus ask the question, “what transcripts/proteins are most commonly perturbed in a particular cell type?” We can also ask which entities are least frequently perturbed in a particular cell type. This is not a question of abundance or of “uniqueness” to a particular cell type. Rather, we’re focusing on the entities that fluctuate when you tweak a particular sort of cell.

Pulling data from about 10,000 studies, we’ve constructed these lists for 20 different cell types: brain, liver, skin, muscle, lymphocytes, stem, kidney, breast, colon, prostate, heart, lung, intestines, glands, pancreas, dendritic, ovaries, adipose, fibroblasts, epithelial. Why not other cell types in our database? That’s primarily because the above 20 designations are the most common in our database; we required at least 100 studies for each cell type. We could have also included “blood” as a category, but we chose to break it down further into two common subtypes: lymphocytes and dendritic cells. Some other choices were somewhat arbitrary (we have a lot of macrophage studies…why didn’t we include them?). Note also that some of these cell types can overlap…skin and liver are different tissues, but skin can contain stem cells, and breast cells can be epithelial. For this initial stab at the “perturbome”, this isn’t a problem.

With the 20 cell types, we generated 40 lists, two per cell type: one of entities that are frequently perturbed, and one of entities that are rarely/never perturbed. Entities that are “rarely perturbed” are most likely just never expressed in the particular cell type, though it is possible that they are indeed expressed but difficult to tweak; we don’t discriminate between these two cases.

If you’re interested, the dirty details are as follows: We first generated a list of all genes found in the above studies. We then simply counted their occurrences in the above 20 cell types. We then used the binomial distribution to calculate how significantly a particular gene may be over/under-represented in a particular cell type. The “probability” input for the binomial distribution (which is .5 if you’re talking about coin flips) is calculated by dividing the total gene perturbations recorded in a tissue (e.g. brain) by the total recorded over all 20 tissues. Liver, for example, accounts for 7% (.07) of all gene perturbations in our database’s perturbome. Thus, if you know that gene ABC is found 100 times in our liver studies, and 500 times over all studies, you’re equipped to perform a probability calculation. In the final step, we simply rank genes according to these probabilities, making sure to discriminate between significance generated from an excess, versus a depletion, of a particular gene.
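
The gene ABC example can be worked through in a few lines of R. This is a sketch of the calculation as described above, not our production code; in particular, the exact tail definitions (whether the observed count is included) are an assumption.

```r
# Numbers from the example above: gene ABC appears 100 times in liver studies
# and 500 times over all studies; liver accounts for 7% of the perturbome.
k <- 100   # occurrences in liver
n <- 500   # occurrences over all 20 tissues
p <- 0.07  # liver's share of all perturbations

# Liver occurrences expected by chance.
expected <- n * p

# Upper tail, P(X >= k): evidence for an excess in liver.
p_excess <- pbinom(k - 1, n, p, lower.tail = FALSE)

# Lower tail, P(X <= k): evidence for a depletion in liver.
p_depletion <- pbinom(k, n, p)

# Keep track of which direction the significance comes from.
direction <- if (k > expected) "excess" else "depletion"

print(list(expected = expected, p_excess = p_excess,
           p_depletion = p_depletion, direction = direction))
```

Here the expectation is 35 liver occurrences, so 100 is a large excess and the upper-tail probability is tiny. A gene with, say, 5 liver occurrences out of 500 would instead be flagged through the lower tail.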

So…what did we find? First, what is the most commonly perturbed gene over all tissue types? The answer is EGR1, perturbed 1581 times, followed by SERPINA3, IFIT1, GDF15, and FOS. What is the most common gene which was never perturbed in a particular tissue? That distinction belongs to LUM (lumican), which was never perturbed in dendritic cells, despite being altered a total of 758 times over the other 19 tissues. Perhaps dendritic cells are adamant that they not be confused with other cell types that express lumican, which is largely an extracellular protein.

Of note, GAPDH, commonly used as a housekeeping control gene, was only perturbed 358 times. Actin-B was seen 610 times. Our gene lists primarily reflect perturbations, not abundance.

Below is one of the more obscure gene tables you’ll ever stumble across. It details the top genes that were never perturbed in particular cell types. The table is ordered by the count over all other tissues; thus, the cell types at the bottom of the table express a large array of transcripts/proteins.

| CELL TYPE | GENE | COUNT OVER OTHER TISSUES |
| --- | --- | --- |
| dendritic | lum | 758 |
| adipose | hopx | 603 |
| intestines | nav2 | 591 |
| ovary | lam1 | 481 |
| heart | jdp1 | 469 |
| prostate | pck1 | 456 |
| gland | hpp1 | 435 |
| pancreas | cd38 | 422 |
| muscle | slc6a14 | 405 |
| colon | kcnk2 | 339 |
| fibroblast | c1orf116 | 329 |
| kidney | scca1 | 312 |
| skin | sizn | 230 |
| brain | ugt2b15 | 205 |
| lung | gpr37l1 | 183 |
| breast | miat | 175 |
| stem | cyp2c9 | 164 |
| liver | blcap | 149 |
| epithelial | ces1g | 121 |

 

Another question: what are the genes that were uniquely perturbed in particular tissues? The champion is probably GM1818, a mouse gene that was perturbed 21 times in the brain, but never elsewhere. For human genes, we have FAM90A7P, which was perturbed 16 times in the brain, and never elsewhere. The brain, in fact, seems to have the largest number of uniquely perturbed genes by a large margin; the first case of a non-brain gene that was uniquely perturbed was the mouse gene AI132709 (liver), which was tweaked in 8 studies in our database…201 brain-unique genes are tweaked at least as frequently. The lncRNA Lnc-CHSY1-3 was uniquely perturbed in lymphocytes, albeit with a mere 4 occurrences.

The special status of the brain is also seen in the heatmap below. We took our 40 perturbation lists and performed Fisher’s exact test on all combinations of lists, for a total of 760 P-values. 
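
A quick note on the count, sketched in R. Forty lists give 780 possible pairs; the figure of 760 is consistent with leaving out the 20 pairs that match a cell type's "high" list against its own "low" list. That reading is an assumption on the part of this sketch, and the HI_/LO_ label scheme below is made up for illustration.

```r
cell_types <- c("brain", "liver", "skin", "muscle", "lymphocytes", "stem",
                "kidney", "breast", "colon", "prostate", "heart", "lung",
                "intestines", "glands", "pancreas", "dendritic", "ovaries",
                "adipose", "fibroblasts", "epithelial")

# Hypothetical labels: one "HI" and one "LO" list per cell type (40 in total).
labels <- c(paste0("HI_", cell_types), paste0("LO_", cell_types))

# Every unordered pair of lists, one pair per row.
pairs <- t(combn(labels, 2))

# Flag pairs that compare a cell type's HI list with its own LO list.
same_type <- sub("^(HI|LO)_", "", pairs[, 1]) == sub("^(HI|LO)_", "", pairs[, 2])

n_all  <- nrow(pairs)       # 780 pairs in total
n_kept <- sum(!same_type)   # 760 pairs after dropping the same-type pairs

print(c(all_pairs = n_all, cross_type_pairs = n_kept))
```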


If the image is too small, you could click on it to get a bigger view. The first row is labeled “LO_COL”, which means “genes that were least frequently perturbed in the colon.” Hopefully the other 39 labels are self-explanatory. The color key shows the –log(P-values). Combinations with very significant P-values tend to make sense…highly perturbed genes in “breast” and “gland” overlap with extreme significance, as do non-perturbed genes in the lymphocyte/dendritic categories, and perturbed genes in the colon/intestine. There are, however, some very interesting overlaps that might not be so intuitively obvious. For example:

1) genes that are rarely perturbed in the brain are rarely perturbed in stem cells.

2) genes highly perturbed in glands are rarely perturbed in the brain.

3) genes highly tweaked in the breast are rarely tweaked in stem cells.

4) genes that are rarely perturbed in the brain are also rarely perturbed in muscle and lymphocytes.

5) looking at the “high_BR” (highly perturbed in the brain) group, the best matching “highly perturbed” cell type would be “stem”, with a –log(P-value) of about 4. This is a bit of a cheat, since stem cells and brain cells are not exclusive (i.e. some brain cells are stem cells). In truth, then, highly perturbed genes in brain cells do not overlap with the highly perturbed genes of “pure” cell types with any significance.

6) unlike the brain, the rarely perturbed genes in some tissues don’t overlap rarely perturbed genes in other tissues with great significance. For example, the rarely perturbed genes in the pancreas don’t overlap with rarely perturbed genes in other tissues with any amazing significance; the best match, in fact, would be to intestines, with a -log(P) of 7.

You can tinker with the data yourself at whatismygene.com. The table below gives you the dbase IDs that allow you to perform operations with our various apps.

| DBASE ID | CELLS |
| --- | --- |
| 132346123 | most frequently perturbed in the brain |
| 132346124 | least frequently perturbed in the brain |
| 132346125 | most frequently perturbed in the liver |
| 132346126 | least frequently perturbed in the liver |
| 132346127 | most frequently perturbed in skin |
| 132346128 | least frequently perturbed in skin |
| 132346129 | most frequently perturbed in muscle |
| 132346130 | least frequently perturbed in muscle |
| 132346131 | most frequently perturbed in lymphocytes |
| 132346132 | least frequently perturbed in lymphocytes |
| 132346133 | most frequently perturbed in stem cells |
| 132346134 | least frequently perturbed in stem cells |
| 132346135 | most frequently perturbed in the kidney |
| 132346136 | least frequently perturbed in the kidney |
| 132346137 | most frequently perturbed in the breast |
| 132346138 | least frequently perturbed in the breast |
| 132346139 | most frequently perturbed in the colon |
| 132346140 | least frequently perturbed in the colon |
| 132346141 | most frequently perturbed in the prostate |
| 132346142 | least frequently perturbed in the prostate |
| 132346143 | most frequently perturbed in the heart |
| 132346144 | least frequently perturbed in the heart |
| 132346145 | most frequently perturbed in the lung |
| 132346146 | least frequently perturbed in the lung |
| 132346147 | most frequently perturbed in the intestines |
| 132346148 | least frequently perturbed in the intestines |
| 132346149 | most frequently perturbed in glands |
| 132346150 | least frequently perturbed in glands |
| 132346151 | most frequently perturbed in the pancreas |
| 132346152 | least frequently perturbed in the pancreas |
| 132346153 | most frequently perturbed in dendritic cells |
| 132346154 | least frequently perturbed in dendritic cells |
| 132346155 | most frequently perturbed in ovaries |
| 132346156 | least frequently perturbed in ovaries |
| 132346157 | most frequently perturbed in adipose tissue |
| 132346158 | least frequently perturbed in adipose tissue |
| 132346159 | most frequently perturbed in fibroblasts |
| 132346160 | least frequently perturbed in fibroblasts |
| 132346161 | most frequently perturbed in epithelial cells |
| 132346162 | least frequently perturbed in epithelial cells |

 

We’re not finished with our dissection of the perturbome. We’ll resume the discussion in a couple of weeks.




A Preprint

It has been a while since we posted. That's largely because of the effort put into generating a paper. Check it out on BioRxiv . This is...