We’ve added a new tool to WhatIsMyGene.com called “Cell Types.” The idea is fairly simple. You enter a gene name, submit, and the output will be a list of keywords associated with your gene. The keywords primarily relate to cell type. There’s a binomial probability calculation being performed in the process, comparing the frequency of those keywords over the complete database versus the frequency of those keywords in data in which your gene appears. A high “binomial” output would represent a high positive correlation, and a strongly negative number would indicate a negative correlation. If you choose filters, both the gene-specific data and the larger database will be filtered.
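For readers who like to see the arithmetic, here is a minimal R sketch of the kind of comparison described above. It is not the site's actual code. The function name `keyword_score`, the made-up counts, and the choice to report a signed log10 P-value are all assumptions (the last one is inferred from the cutoffs quoted later in this post).

```r
# Hypothetical sketch of a signed binomial score for a single keyword.
# k: number of lists containing your gene that carry the keyword
# n: total number of lists containing your gene
# p: fraction of all lists in the (filtered) database that carry the keyword
keyword_score <- function(k, n, p) {
  # Upper tail: chance of seeing k or more keyword-tagged lists
  p_enriched <- pbinom(k - 1, n, p, lower.tail = FALSE)
  # Lower tail: chance of seeing k or fewer keyword-tagged lists
  p_depleted <- pbinom(k, n, p)
  # Positive score when the keyword is over-represented, negative otherwise
  if (k / n >= p) -log10(p_enriched) else log10(p_depleted)
}

# Made-up example: the keyword tags 10% of all lists in the database
score_enriched <- keyword_score(k = 40, n = 200, p = 0.10)
score_depleted <- keyword_score(k = 5,  n = 200, p = 0.10)
print(c(score_enriched, score_depleted))
```

With these toy numbers the first score comes out positive (the keyword shows up more often than expected alongside the gene) and the second comes out negative.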
It was difficult to get this tool up and running. I won’t
bore you with the details of programming. But let me know if it crashes on you.
It’s possible to get zero or minimal output by improper
filter selection. For example, our “cell type” data is largely composed of
genes that are not “upregulated” or “downregulated” on some
perturbation…that’s not the nature of the typical clustering result based on
single-cell sequencing. So if you select “Cell Type” (in the “experiment” box, NOT the new tool we're talking about),
and “upregulated” (as opposed to “Any”), you may not receive any output. It’s
also possible, of course, to enter rare genes and get zero output, or very
un-insightful output.
In keeping with our previous discussion of the “perturbome”,
please note that the output you’ll receive is probably not relevant to
abundance. Most (not all) of the lists in our database are not abundance lists.
Rather, they are tagged as “upregulated” or “downregulated” under particular
conditions. There’s little or no correlation between a list of genes that are
abundant in liver and a list of genes that are commonly perturbed in liver.
When we plug in some well-known cell-type markers, the tool
works quite nicely! Below, we take a common marker for lymphocytes, CD8, run it
through the tool, and use a few lines of R to generate a graph. See below for
the code we used. Bear in mind that the standard 0.05 cutoff for significance
would be found at binomial values of +/- 1.3. We output 100 keywords, so an
adjusted cutoff would be +/- 3.3. In the graph below, we tossed the tissues
ranked 25th through 75th (the not-so-interesting ones).
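The two cutoffs can be reproduced with a couple of lines of R. This sketch assumes that the "binomial" value is a signed log10 P-value and that the adjustment simply divides 0.05 by the number of keywords (a Bonferroni-style correction); the post does not spell out either point. The trimming step uses a made-up table, and `scores` and `trimmed` are illustrative names.

```r
alpha <- 0.05
n_keywords <- 100

# Cutoffs on the -log10 scale
unadjusted_cutoff <- -log10(alpha)               # about 1.3
adjusted_cutoff   <- -log10(alpha / n_keywords)  # about 3.3

# Made-up ranked table of 100 keywords, sorted from highest to lowest score
set.seed(1)
scores <- data.frame(
  keyword   = paste0("kw", 1:100),
  binomials = sort(rnorm(100, sd = 3), decreasing = TRUE)
)

# Drop the rows ranked 25th through 75th, keeping the two tails
trimmed <- scores[-(25:75), ]
print(c(unadjusted_cutoff, adjusted_cutoff))
print(nrow(trimmed))
```

With 100 keywords, rank and percentile coincide, so dropping rows 25 through 75 leaves the 24 top-ranked and 25 bottom-ranked keywords.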
Not surprisingly, CD8 is primarily perturbed in tissues with
keywords like “blood” or “lymphocytes” or “spleen.” Perhaps more interesting is
the fact that one rarely sees this perturbation in tissues labeled “adherent”
or “epithelial.” Stem cells just miss the adjusted cutoff for absence of CD8
perturbations.
We plugged in ACE2, a protein everyone knows to be expressed
in lungs. However, judging from output from the cell type tool, it’s not
commonly perturbed in lungs, which may offer one explanation why lung
tissue is a handy-dandy, dependable target for Covid-19 (see note 1). More
commonly, ACE2 is altered in the colon and intestine (log(P) < -10 and
-7). It’s particularly difficult to tweak in the case of blood and the brain
(both with log(P) < -4). The rarity of tweaked ACE2 in the blood and
brain may be because it’s not there to begin with. However, we know that ACE2
is found in the lungs…it’s simply difficult to alter its expression. To probe
further, one could use filters to see if drugs or knockdowns (or whatever)
alter these probabilities.
Actually, a quick peek at ACE2 expression (genecards.org)
shows that the transcripts are indeed commonly found in blood and brain. Quite
interestingly, however, ACE2 protein is rare in blood and brain, while
ACE2 protein is common in the kidney, as well as heart and ovary (which also
ranked high as tissues in which ACE2 is commonly perturbed). The pattern is
broken with the colon, however, where the protein is rare. Nevertheless, we
wonder if there’s a relationship between perturbability and protein levels that
differs from the perturbability/transcript-levels relationship.
The above ACE2 results have implications for anyone who
wishes to decrease lung ACE2 expression via some treatment. Another practical
implication would be in the choice of cell lines for experiments. If you want
to perform a knockdown of some transcript, you’ll obviously want to choose a
cell type in which the transcript is expressed. However, it might also be prudent
to choose a cell line in which the transcript can be perturbed!
We had a lot of fun entering our favorite genes into the
tool. Guess the tissue in which APP (the Alzheimer’s amyloid gene) is most difficult
to perturb! Compare the perturbability of PD-1 and PD-L1 over tissues. Compare
the HLA-I and HLA-II perturbomes.
One of my favorite genes is DDX6. I’ve often noted how the
genes it regulates overlap with the genes another helicase, DHX9, regulates. It
seemed a bit redundant. But the Cell-Type tool makes it fairly obvious that
this regulation happens in very different cell types. DHX9 loves to do its job
in epithelial cells and DDX6 hates it!
One idiosyncrasy is the following: cell lines are either
male or female. Huh7, for example, is male. Whenever possible, we’ve labeled
cell lines with a “male” or “female” keyword. You may thus find that your gene
is strongly enriched with the “male” designation. You may wish to ignore this,
as it may reflect the fact that the cell lines that represent certain tissues
are overwhelmingly male or female, not a broad tendency for a gene to be
perturbed, for example, only in males.
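If you'd rather drop these rows before plotting, a short R filter will do it. This is a sketch on a made-up table: the column names `stacked_tissues` and `binomials` are borrowed from the plotting code at the end of this post, and the assumption is that the keywords appear in the output exactly as "male" and "female".

```r
# Made-up output table with the same column names as the plotting code
tissue_data <- data.frame(
  stacked_tissues = c("blood", "male", "spleen", "female", "epithelial"),
  binomials       = c(12.1, 6.4, 5.0, -2.2, -7.8)
)

# Keywords to ignore (assumed spelling)
sex_keywords <- c("male", "female")

# Keep only rows whose keyword is not in the ignore list
tissue_data_no_sex <- tissue_data[!(tissue_data$stacked_tissues %in% sex_keywords), ]
print(tissue_data_no_sex)
```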
A few other keywords bear explanation. “3d” refers to
organoids. “Cancer_tissue” refers to in-vivo cancer tissue, not cell culture (after
all, the majority of cell culture lines are generated from cancers). “Resistance”
relates to studies where resistance to a treatment (e.g. cisplatin) was
examined. Such studies can be in-vitro (performing cell culture until resistant
strains emerge) or in-vivo (e.g. from studies of patients who respond, vs don’t
respond, to particular therapies).
If you don’t want to examine cell line data at all, one trick
is to exclude the keyword “ line” (include the space) in all studies. We’re
currently retroactively labeling all cell line studies (there are a lot, of
course) with this keyword…the trick won’t work optimally until we’re finished
with this task. This trick applies to many of our tools, actually. Another way
to de-emphasize cell line data is to examine only mouse data, not human data.
This is because, with the exception of blood, muscle, and cancer, it’s
difficult to access human in-vivo tissue; researchers typically turn to mice for other tissues.
One might imagine a sort of “inverse” cell-type tool. Here,
you’d select from a list of keywords and the output would be the genes that are
most enriched (or depleted) for the keyword. I’m guessing this task would be
computationally expensive…you’d need to “stack” all the genes in the database
into a frequency table, then stack all the keyword-relevant genes into another
frequency table, merge the tables, and then perform something like a hundred
thousand binomial calculations. All this stacking would have to be performed on
the fly (as opposed to using a one-time table that resides on the hard drive),
because the user might apply filters to the database. However, we may embark on
this little exercise in the future on our local machine, and report on the
outcome. For now, the big task is to increase/refine/improve the keywords in
our database.
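To make the "stack, merge, and calculate" idea concrete, here is a toy R sketch of the inverse calculation. Everything in it is hypothetical: the tiny `db` table, the `keyword_lists` vector, and the signed log10 scoring are stand-ins, not the structure of our actual database.

```r
# Toy database: each row is one gene appearing in one list (all values made up)
db <- data.frame(
  list_id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6),
  gene    = c("A", "B", "C", "A", "B", "A", "D", "C", "B", "D", "A", "C", "B", "D")
)
keyword_lists <- c(1, 2, 3)  # IDs of lists tagged with the chosen keyword

# "Stack" all genes into one frequency table, and keyword-relevant genes into another
all_freq <- as.data.frame(table(gene = db$gene),
                          responseName = "n_all", stringsAsFactors = FALSE)
kw_freq  <- as.data.frame(table(gene = db$gene[db$list_id %in% keyword_lists]),
                          responseName = "n_kw", stringsAsFactors = FALSE)

# Merge the tables; genes never seen with the keyword get a count of zero
merged <- merge(all_freq, kw_freq, by = "gene", all.x = TRUE)
merged$n_kw[is.na(merged$n_kw)] <- 0

# Fraction of all lists that carry the keyword
p_kw <- length(keyword_lists) / length(unique(db$list_id))

# One binomial calculation per gene, reported as a signed log10 P-value
p_up   <- pbinom(merged$n_kw - 1, merged$n_all, p_kw, lower.tail = FALSE)
p_down <- pbinom(merged$n_kw, merged$n_all, p_kw)
merged$score <- ifelse(merged$n_kw / merged$n_all >= p_kw, -log10(p_up), log10(p_down))

# Most enriched genes first
merged <- merged[order(-merged$score), ]
print(merged)
```

With four genes this is instantaneous. The expense comes from rebuilding both frequency tables over the whole database each time a user changes a filter.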
***
Initially, we considered outputting results in graphical
format, as opposed to a table. In the end, we decided to stick with tables. You
can generate graphics based on the table output in any way you like, rather
than being stuck with a limited set of color schemes, labels, graphics formats,
etc. If you’re familiar with R, the code below might be useful.
library(ggplot2)
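# Read the table exported from the tool (change the path to point at your file).
tissue_data <- read.csv("D:/your_table.csv")
# The "factor" line prevents the table from being sorted alphabetically.
tissue_data$stacked_tissues <- factor(tissue_data$stacked_tissues, levels = unique(tissue_data$stacked_tissues))
# Red passes the adjusted cutoff (3.3), purple passes the standard cutoff (1.3), gray passes neither.
tissue_data$fill <- ifelse(abs(tissue_data$binomials) > 3.3, "red",
                           ifelse(abs(tissue_data$binomials) > 1.3, "purple", "gray"))
g <- ggplot(tissue_data, aes(x = binomials, y = stacked_tissues, fill = fill))
# scale_fill_identity() uses the color names in the "fill" column as-is.
g + geom_col() + ylab(NULL) + scale_fill_identity()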
1) Gotta be careful with this kind of logic, of course. If
the virus has the capacity to alter the expression of a target (such as ACE2),
the perturbability of the target might work to the benefit of the virus, not to
the detriment.