Nature recently published a study of...biological studies. It asked a number of questions, one of them being, “Which genes are most represented in the literature?” Not surprisingly, TP53 is the champion, with 9,232 publications. It’s a good read.
A question the study does not address is, “What are the most
under-represented genes in the literature?” Of course, it’s trivial to find genes
that have no mentions at all, and such a list would tell us little. What we can do,
however, is use our own database and ask, “Which genes have the largest disparity
between inclusions in our database and inclusions in the literature?” The exercise
is simple on its face, but a number of technicalities make it a bit tricky. If we
were writing an academic paper, we’d have to do 100 times the work we’re putting
into this post.
Basically, though, our procedure works like this:
1. Download a list of genes ranked according to literature mentions.
2. Convert these gene IDs into the format used in the database.
3. Generate a frequency table of all genes in our database.
4. Compare the frequencies in our database against the frequencies in the literature.
The list of genes according to literature mentions is found
here: ftp://ftp.ncbi.nih.gov/gene/GeneRIF/generifs_basic.gz
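For readers who want to try something similar, here is a minimal sketch of the procedure described above. It is not the code we ran. It assumes the GeneRIF file is tab-separated with the tax ID, Gene ID, and a comma-separated PubMed ID list in the first three columns, that human genes carry tax ID 9606, and that "disparity" means database appearances divided by literature mentions plus a pseudocount. The function names, the `id_to_symbol` mapping, and the pseudocount are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_literature_mentions(lines, tax_id="9606"):
    """Count distinct PubMed IDs per Gene ID in GeneRIF-style rows.

    Assumed layout (tab-separated): tax ID, Gene ID, comma-separated
    PubMed IDs, then further columns that are ignored here.
    """
    pmids_by_gene = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header/comment line and malformed rows.
        if line.startswith("#") or len(fields) < 3:
            continue
        if fields[0] != tax_id:
            continue
        pmids = {p for p in fields[2].split(",") if p}
        pmids_by_gene.setdefault(fields[1], set()).update(pmids)
    return Counter({gene: len(p) for gene, p in pmids_by_gene.items()})


def rank_underrepresented(db_counts, lit_counts, id_to_symbol, pseudocount=1):
    """Rank genes by database appearances per literature mention.

    The ratio used here is an assumption for illustration; the
    pseudocount keeps genes with zero mentions from dividing by zero.
    """
    rows = []
    for gene_id, db_n in db_counts.items():
        lit_n = lit_counts.get(gene_id, 0)
        ratio = db_n / (lit_n + pseudocount)
        rows.append((id_to_symbol.get(gene_id, gene_id), db_n, lit_n, ratio))
    # Highest ratio first: heavily represented in the database, rarely cited.
    return sorted(rows, key=lambda r: r[3], reverse=True)
```

To run this on the real file, the lines could come from `gzip.open(path, "rt")`, and `db_counts` would be a frequency table of Gene IDs built from whatever database is being compared.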
With the understanding that there are a number of ways in
which results can be skewed, here’s a list of the most under-rated players in
the genomic universe:
RTP4
VSIG2
CLIC6
FAM198B
HIST1H2BD
MT1L
MOXD1
CENPK
ANKRD37
CMBL
PLBD1
TUBA1C
ARHGAP11A
TMEM154
HIST1H2BI
NMES1
TMEM140
PKIA
ADGRL2
KBTBD11
NT5DC2
C15orf15
RSL24D1
RPL27A
FAM49A
PGM5
RGL1
CLMN
EVI2A
TFEC
RPL18A
RPL21
SRM
CALML4
OLFML2A
RPS8
ENDOD1
KDELR3
RPS11
GNG4
TMEM56
SH3BGRL2
CIART
ENPP5
GBP6
RSRP1
COX6A2
GPRIN3
GPRC5C
TMEM71
NRIP3
MFAP3L
CPNE2
ABLIM3
SMIM14
HIST1H2BM
SLC46A3
EVI2B
PCP4L1
TRNP1
GBP4
SLC16A14
RBP7
SLFN13
FAM84A
RAPGEF5
TM6SF1
NSG2
VAT1L
EPPK1
RPL27
DNAJA4
PGAM2
TTC39C
TRANK1
GBP7
N4BP2L2
MEGF6
CDH19
FIBIN
TINAGL1
CCDC3
LONRF2
DDX60L
MXRA7
GPR137B
CENPV
GNG12
CCDC85A
GRAMD3
FAM105A
STRBP
ZNF608
KIAA1551
LRRC2
UAP1L1
MEGF9
EPB41L4A
PLEKHA4
METTL7B
RTP4 is the champion, with few mentions in the literature but more than 700 appearances in our database. A Google search suggests there’s no dearth of studies on RTP4, but we’re sticking with the above NIH list of literature mentions. Next on the list is VSIG2. A Google search does seem to indicate that nobody cares about this sad gene. It’s hard to even get a clue as to its function.* Nevertheless, it appears 699 times in the database; perturb a cell and there’s a decent chance you’ll alter VSIG2 expression.
We ran a Fisher analysis of an extended, 500-ID list of
undervalued genes against our entire database. As might be expected, there’s no
massive enrichment for any particular group. There does seem to be a tendency
for genes with short transcripts and genes that are depleted in P-bodies to be
represented on the list (unadjusted log(P-values) of -7.5 and -5.8). Eyeballing
the list, a number of ribosomal proteins can be seen. Perhaps folks view the
ribosome as a big unified glob, and don’t care to tinker with its individual
components.
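For the curious, the kind of enrichment test mentioned above can be sketched in a few lines. This is not our Fisher app, just a minimal illustration of a one-sided Fisher exact test, computed as a hypergeometric tail using only the Python standard library. The one-sided form is an assumption, and the function and parameter names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(overlap, list_size, set_size, universe_size):
    """One-sided Fisher exact p-value for enrichment.

    Returns the probability of seeing at least `overlap` members of a
    gene set of size `set_size` in a list of `list_size` genes drawn
    at random from a universe of `universe_size` genes.
    """
    max_overlap = min(list_size, set_size)
    total = comb(universe_size, list_size)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total
```

Here `list_size` would be 500 for the extended list of undervalued genes, `set_size` the size of the group being tested (for example, genes with short transcripts), and `universe_size` the number of genes considered overall. Taking a logarithm of the result gives a log(P-value); testing many groups calls for a multiple-testing adjustment, which is why the values above are labeled unadjusted.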
The opposite task, that of generating a list of “overrated”
genes, is even trickier, and we won’t bother with it here. In the end, genes
like TP53 would dominate the list and, given TP53’s role in cancer, labeling it
“overrated” or “overstudied” would hardly be fair.
*Let’s say you want to know about VSIG2’s function. You can
use our tools. First, enter VSIG2 into our Coregulation app to get a
list of coregulated genes. Then take that list and enter it into the Fisher app. To
spare you this minimal trouble, we did it ourselves: the swarm of genes with which
VSIG2 is coexpressed looks to be heavily involved in the cell cycle, altered by a
large array of common drugs (e.g. glucosamine), and also relevant to viral
infections. Using the Coregulation tool alone, individual genes that are strongly
coexpressed with VSIG2 include TRIB3, CHAC1, ASNS, and many more. You can also see
that CA9, FAM111B, NREP, and others have a fairly strong tendency to be expressed in
the opposite direction to VSIG2 (i.e. when VSIG2 is up, CA9 tends to be down).