If you want to see if your gene set overlaps significantly with some other set, what do you need? First, you’ll need your gene set |a|. Let’s say it’s a list of 100 proteins upregulated in the HepG2 cell line on IFN-A treatment. Next, you need some other set |b|. Let’s say it’s a list of 200 proteins upregulated when Huh7 cells are infected for 24 hours with influenza virus. From there, it’s easy to find the intersection between set |a| and set |b|. There might be 20 genes in common in both sets. Are you ready to perform Fisher’s exact test? No. You still need to get the joint background of those two studies. That’ll be the intersection between ALL proteins identified in HepG2 study |A| and ALL proteins identified in the Huh7 study |B|. Very roughly, the figure might be around 5,000 in a proteomic study. There are issues regarding what exactly constitutes an “identified” entity, particularly in microarray studies (briefly: under what criteria do you say that a probe had zero or questionable signal, and thus its cognate transcript was not identified?), but we’re not going to worry about that here.
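To make the arithmetic concrete, here is a minimal sketch of the test on the example numbers above (100, 200, 20 in common, joint background of 5,000). It assumes the one-sided "enrichment" form of Fisher's exact test, computed from the hypergeometric tail using only the Python standard library. The function name `fisher_enrichment_p` is made up for illustration and is not taken from any particular tool.

```python
from math import comb

def fisher_enrichment_p(a_size, b_size, overlap, background):
    """One-sided Fisher's exact test for over-representation.

    Probability of seeing at least `overlap` shared genes when a list of
    `a_size` genes and a list of `b_size` genes are both drawn from the
    same `background` of identified genes.
    """
    total = comb(background, a_size)
    largest_possible = min(a_size, b_size)
    tail = sum(
        comb(b_size, k) * comb(background - b_size, a_size - k)
        for k in range(overlap, largest_possible + 1)
    )
    # Exact integer arithmetic, converted to a float only at the end.
    return tail / total

# Numbers from the example in the text.
expected = 100 * 200 / 5000  # overlap expected by chance alone
p = fisher_enrichment_p(a_size=100, b_size=200, overlap=20, background=5000)

print(f"expected overlap by chance: {expected:.1f}")
print(f"one-sided P-value: {p:.3g}")
```

With an expected chance overlap of 4 genes, an observed overlap of 20 comes out strongly significant. Note that the background enters the calculation directly, so the test cannot be run without it.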
Just to nip potential confusion in the bud...we can talk about the joint background of two studies, as above. We can also talk about the background of a single study, which is simply all the identified genes in that one study. You'll need two backgrounds from individual studies to derive the joint background.
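In code, the joint background is nothing more than a set intersection. The sketch below uses a handful of made-up identifiers, so the set contents are purely illustrative. It also assumes, as a common convention rather than something stated above, that each hit list is restricted to the joint background before the overlap is counted.

```python
# Hypothetical identified-protein sets for the two studies (illustration only).
identified_hepg2 = {"STAT1", "MX1", "ISG15", "ACTB", "GAPDH", "ALB"}
identified_huh7 = {"STAT1", "MX1", "ACTB", "GAPDH", "IFIT1"}

# Joint background: entities that could have been detected in BOTH studies.
joint_background = identified_hepg2 & identified_huh7

# Hypothetical upregulated lists, restricted to the joint background
# (an assumption of this sketch) so every gene counted was measurable in both.
hits_a = {"STAT1", "MX1", "ISG15"} & joint_background
hits_b = {"STAT1", "MX1", "IFIT1"} & joint_background
overlap = hits_a & hits_b

print(len(joint_background), len(hits_a), len(hits_b), len(overlap))
```

The four counts printed at the end are exactly the four numbers Fisher's exact test needs.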
As we’ve pointed out before, small errors in a background
figure might not be of importance. Large errors can, however, screw up your
results, particularly when you rely on a list of studies, ranked by P-values,
that are supposedly relevant to your own study results. If a background error
causes a study to fall from a high rank to a low rank, you may never pay
attention to a potentially important result.
Often, it’s a simple matter to calculate or estimate the joint background between two studies. However, how does one derive the intersection
between identified entities in your own study |A| and gene ontology lists (GO,
Reactome, KEGG, Panther, WikiPathways, etc.)? In this case, there is no big set |B|
to play with, only little set |b|. Many of these lists are created by humans
who screen multiple papers to derive gene sets that are relevant to particular
processes, or cell types, or whatever. What’s the background behind such lists?
I’m not here to disrespect gene ontology lists. They
represent our best efforts to categorize genes. My problem is with misuse of
these lists. Without naming names, it does seem that numerous tools encourage
misuse. In some cases, a P-value calculation is performed without any
background at all. In other cases, the tool will ask for the complete list of
identified entities in the user’s study |A|…however, there’s no underlying database
of identified entities |B| behind the GO list. How could there be, when these
lists are constructed by text-mining and human screening of multiple papers? To
put it another way, there’s no recorded pool |B| behind a GO list that also holds the entities that were considered but didn’t “make the
cut.” It seems that the assumption is
made that GO lists are “perfect”…if |A| = 20,000, then |B| (all entities behind
the GO list) = 20,000, and if |A| = 3,000, then |B| = 3,000. The size of the
pool |B| from which list |b| is generated is not questioned.
I decided to apply backgrounds to GO lists. The main
assumption is this: if such a list is biased, it’s biased according to
abundance. In other words, at least some of these lists may be overweighted
with abundant entities. They don’t contain the mix of abundant and
not-so-abundant entities that you’d see in list |a| of a typical, empirical,
big-data study.
Here’s a GO list that is heavily loaded with genes that are
quite abundant in mammals: ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM
(call it the “ER list”). Again, there’s nothing inherently wrong with the list.
If you look at its component genes, you’ll find a lot of ribosomal genes. If
you choose a GO analysis tool, and simply load a list of the top 500 (or so)
most abundant proteins in the human proteome, you may find that the above GO
list is ranked #1, or thereabouts, with an insanely significant P-value
(say, 10^-60). So, without going into the dirty algorithmic details,
we simply asked, “what background value do we need to assign to the ER list so
that the P-value becomes insignificant when the ER list is examined
against a list of the most abundant proteins in humans?” In the case of the ER
list, we calculated that it was derived from a universe of about 900 proteins.
If you run Fisher’s exact test with a list |a| of abundant proteins, the ER
list |b|, and a background of 900, the P-value should not be very
significant.
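Here is a rough illustration of how much the background matters. The list size of 500 and the two backgrounds (20,000 and 900) come from the text above. The size of the ER list (100) and the overlap (55) are hypothetical numbers picked only to make the point, and `fisher_enrichment_p` is an illustrative helper, not our production code.

```python
from math import comb

def fisher_enrichment_p(a_size, b_size, overlap, background):
    """One-sided Fisher's exact test (hypergeometric upper tail)."""
    total = comb(background, a_size)
    tail = sum(
        comb(b_size, k) * comb(background - b_size, a_size - k)
        for k in range(overlap, min(a_size, b_size) + 1)
    )
    return tail / total

ABUNDANT_LIST_SIZE = 500  # "top 500 (or so)" most abundant proteins
ER_LIST_SIZE = 100        # hypothetical size of the ER list
OVERLAP = 55              # hypothetical number of shared genes

p_default = fisher_enrichment_p(ABUNDANT_LIST_SIZE, ER_LIST_SIZE, OVERLAP, 20000)
p_adjusted = fisher_enrichment_p(ABUNDANT_LIST_SIZE, ER_LIST_SIZE, OVERLAP, 900)

print(f"background 20,000: P = {p_default:.3g}")
print(f"background    900: P = {p_adjusted:.3g}")
```

Under these assumptions, the same overlap is wildly significant against a background of 20,000 and unremarkable against a background of 900, because the overlap expected by chance rises from about 2.5 genes to about 56.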
Actually, you can quibble about the “becomes insignificant”
bit above. Is that when P>.05? With Fisher’s exact test, there are
actually two ways to get significance…by a strong overlap and by a weak
or zero overlap between lists |a| and |b|. So which P are we talking
about? Rather than spend time contemplating the precise P-value at which
a “correct” background can be derived, we looked at proteomic studies that
already have well-defined backgrounds. At what P-value do we tend to
predict a proteomic study’s background correctly when it is examined against a list of the most
abundant proteins? It turns out that the background that sits right between
the two P=.05 crossings above (one for strong overlap, one for weak overlap) works quite well. More intuitively, it’s
the background that gives the least significant P-value you can get when you hold all other
values constant. If anything, this choice of P-value
seems to be liberal with respect to GO lists. That is, you may still see some
weak significance between the GO list and the abundance list when you use
backgrounds derived this way.
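The "least significant P-value" idea can be sketched as a simple scan over candidate backgrounds. This is not our actual algorithm, just a toy version of it. It assumes a two-sided Fisher's exact test defined in the usual way (sum the probabilities of every possible overlap that is no more likely than the observed one). The list sizes and overlap are the same hypothetical numbers as before, and the candidate grid (600 to 3,000 in steps of 10) is arbitrary.

```python
from math import comb

def fisher_two_sided_p(a_size, b_size, overlap, background):
    """Two-sided Fisher's exact test.

    Sums the probability of every possible overlap that is no more likely
    than the observed one. Integer weights keep the comparison exact.
    """
    weights = [
        comb(b_size, k) * comb(background - b_size, a_size - k)
        for k in range(min(a_size, b_size) + 1)
    ]
    observed = weights[overlap]
    return sum(w for w in weights if w <= observed) / sum(weights)

# Hypothetical inputs: 500 abundant proteins, a 100-gene GO list, 55 shared.
A_SIZE, B_SIZE, OVERLAP = 500, 100, 55

# Arbitrary grid of candidate backgrounds; everything else is held constant.
candidates = range(600, 3001, 10)
best_background = max(
    candidates,
    key=lambda n: fisher_two_sided_p(A_SIZE, B_SIZE, OVERLAP, n),
)
best_p = fisher_two_sided_p(A_SIZE, B_SIZE, OVERLAP, best_background)

# Sanity check: P peaks where the chance overlap equals the observed overlap.
rough_estimate = A_SIZE * B_SIZE / OVERLAP

print(f"least significant background on the grid: {best_background}")
print(f"P at that background: {best_p:.3g}")
print(f"rough estimate (a * b / overlap): {rough_estimate:.0f}")
```

The scan lands near the background at which the observed overlap is exactly what chance would predict, which is why the quick estimate of list size times list size divided by overlap gives a similar figure.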
If the above sounds crude, that’s because it is. We used an
“integrated human proteome” list from pax-db.org for our list of abundant
proteins. What about specific tissue types? What about mice and other critters?
Shouldn’t we test our algorithm against transcriptomic studies, not just
proteomic studies? Is abundance the only dimension on which GO lists can be
biased? Why not write a paper, or build a new tool? I’m really not motivated to
carry this exercise much further. For now, I’m confident that a background of
900 for the ER list is far superior to a background of 20,000 for just about
any application of Fisher’s exact test.
We’ve entered about 120 “adjusted” GO lists into our
database. These don’t include KEGG or Reactome lists…there may be licensing
issues there. We call them “adjusted” because, unseen by the user, there’s a
background figure for each list. If your own study has a background |A| of
20,000 and the GO list has a background |B| of 900, the 900 figure (not the
20,000) will be used by our “Fisher” app.
Note that we’ve added an “external list” term to the
“Experiment” filter that can be used for most of our tools. So, just as an
example, go to the “Relevant Studies” app. Enter any gene you like in the
“identifiers” box. Go to “Experiments” and select “External List”. Submit.
You’ll receive a list of GO lists in which your gene was found. Bear in mind
that there are only 120 such lists in our database, so your gene may not be
found. Or…go to the Fisher tool and enter a gene list. Go to “Experiments” and
select “External List”. You’ll get a list of GO lists that most significantly
overlap with your own list. Of course, these P-values may disappoint, as
their backgrounds may be heavily adjusted. That’s the way it is.
Another objection is this: what if my gene list |a| really is overweighted with abundant entities? In other words, IFN-A treatment actually upregulates abundant entities. Well, then, the P-value will be unfairly insignificant. Realistically, I question whether any treatment could result in a wholesale upregulation of abundant entities versus non-abundant entities. It could be quite a burden on the cell. In any case, think of our spin on GO lists as an alternative or “second opinion” to the standard approach, not the absolute most correct approach. There are plenty of tools that don’t consider a GO list’s background…try our tool also!
What are the appropriate backgrounds for common GO lists? Take a look below. If the adjusted background was higher than 20,000, we set the background at 20,000 (see note 1). There are, of course, thousands upon thousands of these lists that folks have generated. We certainly don’t intend to become yet another all-inclusive depot for them. But if there’s a particular list you’d like us to adjust and add to the database, let us know.
adjusted background | GO list
900 | GOBP_ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM (background-adjusted)
1100 | GOCC_BLOOD_MICROPARTICLE (background-adjusted)
1800 | GOBP_VIRAL_GENE_EXPRESSION (background-adjusted)
2600 | GOBP_AEROBIC_RESPIRATION (background-adjusted)
2600 | GOBP_ACUTE_INFLAMMATORY_RESPONSE (background-adjusted)
2800 | GOBP_ANAPHASE_PROMOTING_COMPLEX_DEPENDENT_CATABOLIC_PROCESS (background-adjusted)
3400 | 50% GO poly-a RNA binding (background-adjusted)
3400 | GOCC_VACUOLAR_LUMEN (background-adjusted)
3900 | GOBP_TELOMERE_ORGANIZATION (background-adjusted)
4200 | 50% GO RNA-binding (background-adjusted)
4400 | GO secretory granule (background-adjusted)
4500 | GOBP_NIK_NF_KAPPAB_SIGNALING (background-adjusted)
4500 | GOBP_REGULATION_OF_LIPASE_ACTIVITY (background-adjusted)
4700 | GO_PROTEASOME_ACCESSORY_COMPLEX (background-adjusted)
4700 | GOMF_INTEGRIN_BINDING (background-adjusted)
4800 | GOBP_GLUTATHIONE_METABOLIC_PROCESS (background-adjusted)
4900 | PID_INTEGRIN1_PATHWAY (background-adjusted)
4900 | WP_ALLOGRAFT_REJECTION (background-adjusted)
4900 | GOCC_MHC_PROTEIN_COMPLEX (background-adjusted)
5000 | GOBP_LAMELLIPODIUM_ORGANIZATION (background-adjusted)
5200 | GOMF_KINASE_INHIBITOR_ACTIVITY (background-adjusted)
5500 | GOBP_CELLULAR_RESPIRATION (background-adjusted)
5700 | GOBP_STEROL_BIOSYNTHETIC_PROCESS (background-adjusted)
5700 | GOCC_BRUSH_BORDER (background-adjusted)
5900 | GOMF_ISOMERASE_ACTIVITY (background-adjusted)
5900 | GOBP_RESPONSE_TO_LEUKEMIA_INHIBITORY_FACTOR (background-adjusted)
6000 | WP_SENESCENCE_AND_AUTOPHAGY_IN_CANCER (background-adjusted)
6000 | GOBP_NEURON_PROJECTION_REGENERATION (background-adjusted)
6100 | GOBP_PROTEIN_TETRAMERIZATION (background-adjusted)
6200 | WP_MYOMETRIAL_RELAXATION_AND_CONTRACTION_PATHWAYS (background-adjusted)
6700 | GO cofactor metabolic process (background-adjusted)
6900 | GOBP_POSITIVE_REGULATION_OF_LIPID_METABOLIC_PROCESS (background-adjusted)
6900 | GOBP_REGULATION_OF_VIRAL_LIFE_CYCLE (background-adjusted)
7000 | GOBP_RESPONSE_TO_ESTRADIOL (background-adjusted)
7100 | GOBP_NEGATIVE_REGULATION_OF_IMMUNE_EFFECTOR_PROCESS (background-adjusted)
7400 | GOCC_SPECIFIC_GRANULE (background-adjusted)
7500 | GOMF_OXIDOREDUCTASE_ACTIVITY_ACTING_ON_NAD_P_H (background-adjusted)
7500 | WP_MECP2_AND_ASSOCIATED_RETT_SYNDROME (background-adjusted)
7600 | GOCC_I_BAND (background-adjusted)
7700 | GOBP_LIPID_OXIDATION (background-adjusted)
7700 | GOBP_ESTABLISHMENT_OF_CELL_POLARITY (background-adjusted)
7700 | GOBP_TRANSCRIPTION_COUPLED_NUCLEOTIDE_EXCISION_REPAIR (background-adjusted)
7800 | GOBP_TRANSITION_METAL_ION_HOMEOSTASIS (background-adjusted)
7900 | GO_NEGATIVE_REGULATION_OF_VIRAL_GENOME_REPLICATION (background-adjusted)
8000 | GOBP_COLLAGEN_METABOLIC_PROCESS (background-adjusted)
8000 | GOBP_REGULATION_OF_ALCOHOL_BIOSYNTHETIC_PROCESS (background-adjusted)
8100 | Regulation Of Interferon-Gamma Production (GO: background-adjusted)
8100 | GOBP_PLASMA_MEMBRANE_ORGANIZATION (background-adjusted)
8200 | WP_SPINAL_CORD_INJURY (background-adjusted)
8200 | WP_GENOTOXICITY_PATHWAY (background-adjusted)
8300 | GOBP_NEGATIVE_REGULATION_OF_MAPK_CASCADE (background-adjusted)
8700 | 45% GO small molecule process (background-adjusted)
8700 | GOBP_MUSCLE_ADAPTATION (background-adjusted)
8700 | GOBP_ORGANOPHOSPHATE_CATABOLIC_PROCESS (background-adjusted)
8800 | GOBP_MRNA_TRANSPORT (background-adjusted)
8900 | GOBP_RESPONSE_TO_KETONE (background-adjusted)
9000 | GOBP_RESPONSE_TO_INTERFERON_GAMMA (background-adjusted)
9000 | GOBP_POSITIVE_REGULATION_OF_LIPID_TRANSPORT (background-adjusted)
9300 | GOBP_RESPONSE_TO_XENOBIOTIC_STIMULUS (background-adjusted)
9600 | GOBP_MUSCLE_CELL_DEVELOPMENT (background-adjusted)
9900 | GOBP_NEURAL_CREST_CELL_DIFFERENTIATION (background-adjusted)
10000 | GOMF_DNA_DEPENDENT_ATPASE_ACTIVITY (background-adjusted)
10000 | GOBP_INOSITOL_PHOSPHATE_MEDIATED_SIGNALING (background-adjusted)
10500 | GOBP_COLLAGEN_FIBRIL_ORGANIZATION (background-adjusted)
10800 | 50% GO Mitochondria (background-adjusted)
10900 | GOBP_REGULATION_OF_CELL_JUNCTION_ASSEMBLY (background-adjusted)
11000 | GOCC_MIDBODY (background-adjusted)
11000 | GOBP_FEMALE_GAMETE_GENERATION (background-adjusted)
11100 | GOBP_THIOESTER_METABOLIC_PROCESS (background-adjusted)
11100 | GO_MITOCHONDRION (35%)(background-adjusted)
11500 | GOBP_REGULATION_OF_SODIUM_ION_TRANSMEMBRANE_TRANSPORT (background-adjusted)
12000 | WP_G1_TO_S_CELL_CYCLE_CONTROL (background-adjusted)
12000 | WP_GASTRIN_SIGNALING_PATHWAY (background-adjusted)
12500 | GOBP_T_CELL_MIGRATION (background-adjusted)
12500 | GOBP_POSITIVE_REGULATION_OF_CELL_SUBSTRATE_ADHESION (background-adjusted)
12700 | GOBP_RESPONSE_TO_ESTRADIOL (background-adjusted)
13000 | GO_NEURON_PROJECTION (50%)(M17462; background-adjusted)
13500 | WP_TGFBETA_SIGNALING_PATHWAY (background-adjusted)
13500 | GOBP_REGULATION_OF_PHOSPHATIDYLINOSITOL_3_KINASE_SIGNALING (background-adjusted)
13700 | GOBP_MEIOTIC_CELL_CYCLE (background-adjusted)
14000 | GOBP_DNA_METHYLATION (background-adjusted)
14100 | WP_P53_TRANSCRIPTIONAL_GENE_NETWORK (background-adjusted)
14500 | GOBP_ENDOTHELIUM_DEVELOPMENT (background-adjusted)
15000 | GOBP_POSITIVE_REGULATION_OF_AXONOGENESIS (background-adjusted)
16000 | GOBP_CELLULAR_CARBOHYDRATE_BIOSYNTHETIC_PROCESS (background-adjusted)
16200 | GOBP_DEMETHYLATION (background-adjusted)
17000 | GOBP_BIOMINERALIZATION (background-adjusted)
17500 | GOCC_EXTRINSIC_COMPONENT_OF_PLASMA_MEMBRANE (background-adjusted)
18000 | WP_B_CELL_RECEPTOR_SIGNALING_PATHWAY (background-adjusted)
19400 | GOBP_NOTCH_SIGNALING_PATHWAY (background-adjusted)
20000 | GO_ANION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY (background-adjusted)
20000 | GO_CATION_CHANNEL_COMPLEX (background-adjusted)
20000 | GO_FOREBRAIN_DEVELOPMENT (background-adjusted)
20000 | GOBP_RESPONSE_TO_STARVATION (background-adjusted)
20000 | GOBP_SISTER_CHROMATID_SEGREGATION (background-adjusted)
20000 | GOBP_SYNAPSE_ASSEMBLY (background-adjusted)
20000 | GOBP_PATHWAY_RESTRICTED_SMAD_PROTEIN_PHOSPHORYLATION (background-adjusted)
20000 | WP_COPPER_HOMEOSTASIS (background-adjusted)
20000 | GOBP_VESICLE_MEDIATED_TRANSPORT_IN_SYNAPSE (background-adjusted)
20000 | GOBP_METANEPHROS_DEVELOPMENT (background-adjusted)
20000 | GOBP_REGULATION_OF_SYNAPTIC_PLASTICITY (background-adjusted)
20000 | WP_EPITHELIAL_TO_MESENCHYMAL_TRANSITION_IN_COLORECTAL_CANCER (background-adjusted)
20000 | GOBP_EMBRYONIC_SKELETAL_SYSTEM_MORPHOGENESIS (background-adjusted)
20000 | GOCC_CULLIN_RING_UBIQUITIN_LIGASE_COMPLEX (background-adjusted)
20000 | GOBP_REGULATION_OF_BMP_SIGNALING_PATHWAY (background-adjusted)
20000 | GOBP_APPENDAGE_MORPHOGENESIS (background-adjusted)
20000 | GOBP_SPHINGOLIPID_METABOLIC_PROCESS (background-adjusted)
20000 | GOBP_DNA_DEPENDENT_DNA_REPLICATION (background-adjusted)
20000 | GOCC_CENTRIOLE (background-adjusted)
20000 | GOBP_BILE_ACID_METABOLIC_PROCESS (background-adjusted)
20000 | GOBP_CARDIAC_CHAMBER_DEVELOPMENT (background-adjusted)
20000 | GOMF_VOLTAGE_GATED_ION_CHANNEL_ACTIVITY (background-adjusted)
20000 | GOBP_AMINE_TRANSPORT (background-adjusted)
20000 | GOBP_ODONTOGENESIS (background-adjusted)
20000 | GOBP_CIRCADIAN_REGULATION_OF_GENE_EXPRESSION (background-adjusted)
1) Again, crude logic. We simply find it difficult to
believe that a gene list generated by scanning papers would have an effective
background greater than 20,000. I guess it’s possible, perhaps in the case of
processes involving cascades that are initiated by entities of low abundance.