WhatIsMyGene: October 2021

If you want to see whether your gene set overlaps significantly with some other set, what do you need? First, you’ll need your gene set |a|. Let’s say it’s a list of 100 proteins upregulated in the HepG2 cell line on IFN-A treatment. Next, you need some other set |b|. Let’s say it’s a list of 200 proteins upregulated when Huh7 cells are infected for 24 hours with influenza virus. From there, it’s easy to find the intersection between set |a| and set |b|. There might be 20 genes common to both sets.

Are you ready to perform Fisher’s exact test? No. You still need the joint background of those two studies. That’ll be the intersection between ALL proteins identified in the HepG2 study |A| and ALL proteins identified in the Huh7 study |B|. Very roughly, the figure might be around 5,000 in a proteomic study. There are issues regarding what exactly constitutes an “identified” entity, particularly in microarray studies (briefly: under what criteria do you say that a probe had zero or questionable signal, and thus its cognate transcript was not identified?), but we’re not going to worry about that here.
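For the curious, here is a minimal sketch of the calculation with the example figures above (100, 200, 20 and a joint background of 5,000). It uses the one-sided form of Fisher's exact test, which is just the upper tail of the hypergeometric distribution. The function name `overlap_p_value` is made up for illustration, and this is not necessarily how any particular tool computes its P-values.

```python
from math import comb

def overlap_p_value(size_a, size_b, overlap, background):
    """One-sided Fisher's exact test: the chance of seeing at least
    `overlap` shared entities when size_a entities are drawn at random
    from a background that contains size_b entities of interest."""
    total = comb(background, size_a)
    largest_possible = min(size_a, size_b)
    tail = sum(
        comb(size_b, k) * comb(background - size_b, size_a - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total

# Figures from the text: |a| = 100, |b| = 200, 20 in common, background 5,000.
p = overlap_p_value(size_a=100, size_b=200, overlap=20, background=5000)
print(f"overlap expected by chance: {100 * 200 / 5000:.1f}")
print(f"one-sided P-value: {p:.3g}")
```

With a background of 5,000 you'd expect about 4 shared entities by chance, so an overlap of 20 comes out as highly significant. Change the background and the same overlap can tell a very different story.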

Just to nip potential confusion in the bud...we can talk about the joint background of two studies, as above. We can also talk about the background of a single study, which is simply all the identified genes in that one study. You'll need two backgrounds from individual studies to derive the joint background.
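In code, the distinction is nothing more than a set intersection. The identifiers below are placeholders chosen for illustration; real backgrounds would hold thousands of entities.

```python
# Hypothetical backgrounds: every entity identified in each study.
background_A = {"ACTB", "GAPDH", "STAT1", "MX1", "ISG15", "ALB"}   # HepG2 study
background_B = {"ACTB", "GAPDH", "STAT1", "MX1", "IFIT1", "AFP"}   # Huh7 study

# The joint background holds only entities identified in BOTH studies.
joint_background = background_A & background_B

# Hypothetical lists of upregulated entities from each study.
set_a = {"STAT1", "MX1", "ISG15"}
set_b = {"STAT1", "MX1", "IFIT1"}
overlap = set_a & set_b

print(f"joint background size: {len(joint_background)}")
print(f"overlap between |a| and |b|: {len(overlap)}")
```

The size of `joint_background` is the figure that goes into Fisher's exact test, not the size of either single-study background.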

As we’ve pointed out before, small errors in a background figure might not be of importance. Large errors can, however, screw up your results, particularly when you rely on a list of studies, ranked by P-values, that are supposedly relevant to your own study results. If a background error causes a study to fall from a high rank to a low rank, you may never pay attention to a potentially important result.

Often, it’s a simple matter to calculate or estimate the joint background between two studies. However, how does one derive the intersection between identified entities in your own study |A| and gene ontology lists (GO, Reactome, KEGG, Panther, WikiPathways, etc)? In this case, there is no big set |B| to play with, only little set |b|. Many of these lists are created by humans who screen multiple papers to derive gene sets that are relevant to particular processes, or cell types, or whatever. What’s the background behind such lists?

I’m not here to disrespect gene ontology lists. They represent our best efforts to categorize genes. My problem is with misuse of these lists. Without naming names, it does seem that numerous tools encourage misuse. In some cases, a P-value calculation is performed without any background at all. In other cases, the tool will ask for the complete list of identified entities in the user’s study |A|…however, there’s no underlying database of identified entities |B| behind the GO list. How could there be, when these lists are constructed by text-mining and human screening of multiple papers? To put it another way, there’s no larger pool |B| that also records the entities that didn’t “make the cut” for the GO list. It seems that the assumption is made that GO lists are “perfect”…if |A| = 20,000, then |B| (all entities behind the GO list) = 20,000, and if |A| = 3,000, then |B| = 3,000. The size of the pool |B| from which list |b| is generated is not questioned.

I decided to apply backgrounds to GO lists. The main assumption is this: if such a list is biased, it’s biased according to abundance. In other words, at least some of these lists may be overweighted with abundant entities. They don’t contain the mix of abundant and not-so-abundant entities that you’d see in list |a| of a typical, empirical, big-data study.

Here’s a GO list that is heavily loaded with genes that are quite abundant in mammals: ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM (call it the “ER list”). Again, there’s nothing inherently wrong with the list. If you look at its component genes, you’ll find a lot of ribosomal genes. If you pick a GO analysis tool and simply load a list of the top 500 (or so) most abundant proteins in the human proteome, you may find that the above GO list is ranked #1, or thereabouts, with an insanely significant P-value (say, 10^-60). So, without going into the dirty algorithmic details, we simply asked, “what background value do we need to assign to the ER list so that the P-value becomes insignificant when the ER list is examined against a list of the most abundant proteins in humans?” In the case of the ER list, we calculated that it was derived from a universe of about 900 proteins. If you run Fisher’s exact test with a list |a| of abundant proteins, the ER list |b|, and a background of 900, the P-value should not be very significant.
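To get a feel for how much the background matters, here is a rough sketch comparing a background of 900 against a background of 20,000. The list size (100) and the overlap (55) are hypothetical numbers picked for illustration, not the real figures for the ER list, and `enrichment_p` is an illustrative helper rather than the code behind our app.

```python
from math import comb

def enrichment_p(size_a, size_b, overlap, background):
    """One-sided Fisher's exact test (hypergeometric upper tail)."""
    total = comb(background, size_a)
    tail = sum(
        comb(size_b, k) * comb(background - size_b, size_a - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return tail / total

# Hypothetical figures: 500 abundant proteins, a 100-gene list, 55 in common.
abundant, go_list, shared = 500, 100, 55

p_adjusted = enrichment_p(abundant, go_list, shared, background=900)
p_default = enrichment_p(abundant, go_list, shared, background=20000)

print(f"background    900: P = {p_adjusted:.3g}")
print(f"background 20,000: P = {p_default:.3g}")
```

Same lists, same overlap. Against 900, the overlap is about what chance would give you (500 x 100 / 900 is roughly 56); against 20,000, chance predicts only 2 or 3 shared entities, and the P-value collapses.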

Actually, you can quibble about the “becomes insignificant” bit above. Is that when P > 0.05? With Fisher’s exact test, there are actually two ways to get significance…by a strong overlap and by a weak or zero overlap between lists |a| and |b|. So which P are we talking about? Rather than spend time contemplating the precise P-value at which a “correct” background can be derived, we looked at proteomic studies that already have well-defined backgrounds. At what P-value do we tend to predict a proteomic study’s background when examined against a list of the most abundant proteins? It turns out that the background lying right between the two points at which P crosses 0.05 (one on the weak-overlap side, one on the strong-overlap side) works quite well. More intuitively, it’s the background that gives the least significant P-value you can get when you hold all relevant values constant except for the background. If anything, this choice of P-value seems to be liberal with respect to GO lists. That is, you may still see some weak significance between the GO list and the abundance list when you use backgrounds derived this way.
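The "least significant P-value" idea can be sketched as a brute-force scan: hold the two list sizes and the overlap fixed, try every candidate background, and keep the one where the two-sided P-value peaks. To be clear, this is an illustration of the idea under my own assumptions, not the actual algorithm behind our adjusted lists; the function names and the toy numbers (50, 40, 10) are made up.

```python
from math import comb

def two_sided_p(size_a, size_b, overlap, background):
    """Two-sided Fisher's exact test: add up the probabilities of every
    possible overlap that is no more likely than the observed one."""
    total = comb(background, size_a)
    smallest = max(0, size_a + size_b - background)
    largest = min(size_a, size_b)
    probs = {
        k: comb(size_b, k) * comb(background - size_b, size_a - k) / total
        for k in range(smallest, largest + 1)
    }
    observed = probs[overlap]
    # Small tolerance so floating-point ties count as "equally likely".
    return min(1.0, sum(p for p in probs.values() if p <= observed * (1 + 1e-9)))

def least_significant_background(size_a, size_b, overlap, max_background):
    """Scan candidate backgrounds and return one where P is largest."""
    # The background cannot be smaller than the union of the two lists.
    smallest_background = size_a + size_b - overlap
    return max(
        range(smallest_background, max_background + 1),
        key=lambda n: two_sided_p(size_a, size_b, overlap, n),
    )

# Toy example: lists of 50 and 40 entities sharing 10.
best = least_significant_background(50, 40, 10, max_background=600)
print(f"least significant background: about {best}")
```

The peak lands near the background at which the observed overlap equals the overlap expected by chance, roughly |a| x |b| / overlap (50 x 40 / 10 = 200 in the toy example). The P-value is flat at its maximum over a small range of backgrounds, so the scan returns one value from that range rather than a uniquely "correct" figure.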

If the above sounds crude, that’s because it is. We used an “integrated human proteome” list from pax-db.org for our list of abundant proteins. What about specific tissue types? What about mice and other critters? Shouldn’t we test our algorithm against transcriptomic studies, not just proteomic studies? Is abundance the only dimension on which GO lists can be biased? Why not write a paper, or build a new tool? I’m really not motivated to carry this exercise much further. For now, I’m confident that a background of 900 for the ER list is far superior to a background of 20,000 for just about any application of Fisher’s exact test.

We’ve entered about 120 “adjusted” GO lists into our database. These don’t include KEGG or Reactome lists…there may be licensing issues there. We call them “adjusted” because, unseen by the user, there’s a background figure for each list. If your own study has a background |A| of 20,000 and the GO list has a background |B| of 900, the 900 figure (not the 20,000) will be used by our “Fisher” app.

Note that we’ve added an “external list” term to the “Experiment” filter that can be used for most of our tools. So, just as an example, go to the “Relevant Studies” app. Enter any gene you like in the “identifiers” box. Go to “Experiments” and select “External List”. Submit. You’ll receive a list of GO lists in which your gene was found. Bear in mind that there are only 120 such lists in our database, so your gene may not be found. Or…go to the Fisher tool and enter a gene list. Go to “Experiments” and select “External List”. You’ll get a list of GO lists that most significantly overlap with your own list. Of course, these P-values may disappoint, as their backgrounds may be heavily adjusted. That’s the way it is.

Another objection is this: what if my gene list |a| really is overweighted with abundant entities? In other words, IFN-A treatment actually upregulates abundant entities. Well, then, the P-value will be unfairly insignificant. Realistically, I question whether any treatment could result in a wholesale upregulation of abundant entities versus non-abundant entities. It could be quite a burden on the cell. In any case, think of our spin on GO lists as an alternative or “second opinion” to the standard approach, not the absolute most correct approach. There are plenty of tools that don’t consider a GO list’s background…try our tool also!

What are the appropriate backgrounds for common GO lists? Take a look below. If the adjusted background was higher than 20,000, we set the background at 20,000.¹ There are, of course, thousands upon thousands of these lists that folks have generated. We certainly don’t intend to become yet another all-inclusive depot for them. But if there’s a particular list you’d like us to adjust and add to the database, let us know.

adjusted background	GO list
900	GOBP_ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM (background-adjusted)
1100	GOCC_BLOOD_MICROPARTICLE (background-adjusted)
1800	GOBP_VIRAL_GENE_EXPRESSION (background-adjusted)
2600	GOBP_AEROBIC_RESPIRATION (background-adjusted)
2600	GOBP_ACUTE_INFLAMMATORY_RESPONSE (background-adjusted)
2800	GOBP_ANAPHASE_PROMOTING_COMPLEX_DEPENDENT_CATABOLIC_PROCESS (background-adjusted)
3400	50% GO poly-a RNA binding (background-adjusted)
3400	GOCC_VACUOLAR_LUMEN (background-adjusted)
3900	GOBP_TELOMERE_ORGANIZATION (background-adjusted)
4200	50% GO RNA-binding (background-adjusted)
4400	GO secretory granule (background-adjusted)
4500	GOBP_NIK_NF_KAPPAB_SIGNALING (background-adjusted)
4500	GOBP_REGULATION_OF_LIPASE_ACTIVITY (background-adjusted)
4700	GO_PROTEASOME_ACCESSORY_COMPLEX (background-adjusted)
4700	GOMF_INTEGRIN_BINDING (background-adjusted)
4800	GOBP_GLUTATHIONE_METABOLIC_PROCESS (background-adjusted)
4900	PID_INTEGRIN1_PATHWAY (background-adjusted)
4900	WP_ALLOGRAFT_REJECTION (background-adjusted)
4900	GOCC_MHC_PROTEIN_COMPLEX (background-adjusted)
5000	GOBP_LAMELLIPODIUM_ORGANIZATION (background-adjusted)
5200	GOMF_KINASE_INHIBITOR_ACTIVITY (background-adjusted)
5500	GOBP_CELLULAR_RESPIRATION (background-adjusted)
5700	GOBP_STEROL_BIOSYNTHETIC_PROCESS (background-adjusted)
5700	GOCC_BRUSH_BORDER (background-adjusted)
5900	GOMF_ISOMERASE_ACTIVITY (background-adjusted)
5900	GOBP_RESPONSE_TO_LEUKEMIA_INHIBITORY_FACTOR (background-adjusted)
6000	WP_SENESCENCE_AND_AUTOPHAGY_IN_CANCER (background-adjusted)
6000	GOBP_NEURON_PROJECTION_REGENERATION (background-adjusted)
6100	GOBP_PROTEIN_TETRAMERIZATION (background-adjusted)
6200	WP_MYOMETRIAL_RELAXATION_AND_CONTRACTION_PATHWAYS (background-adjusted)
6700	GO cofactor metabolic process (background-adjusted)
6900	GOBP_POSITIVE_REGULATION_OF_LIPID_METABOLIC_PROCESS (background-adjusted)
6900	GOBP_REGULATION_OF_VIRAL_LIFE_CYCLE (background-adjusted)
7000	GOBP_RESPONSE_TO_ESTRADIOL (background-adjusted)
7100	GOBP_NEGATIVE_REGULATION_OF_IMMUNE_EFFECTOR_PROCESS (background-adjusted)
7400	GOCC_SPECIFIC_GRANULE (background-adjusted)
7500	GOMF_OXIDOREDUCTASE_ACTIVITY_ACTING_ON_NAD_P_H (background-adjusted)
7500	WP_MECP2_AND_ASSOCIATED_RETT_SYNDROME (background-adjusted)
7600	GOCC_I_BAND (background-adjusted)
7700	GOBP_LIPID_OXIDATION (background-adjusted)
7700	GOBP_ESTABLISHMENT_OF_CELL_POLARITY (background-adjusted)
7700	GOBP_TRANSCRIPTION_COUPLED_NUCLEOTIDE_EXCISION_REPAIR (background-adjusted)
7800	GOBP_TRANSITION_METAL_ION_HOMEOSTASIS (background-adjusted)
7900	GO_NEGATIVE_REGULATION_OF_VIRAL_GENOME_REPLICATION (background-adjusted)
8000	GOBP_COLLAGEN_METABOLIC_PROCESS (background-adjusted)
8000	GOBP_REGULATION_OF_ALCOHOL_BIOSYNTHETIC_PROCESS (background-adjusted)
8100	Regulation Of Interferon-Gamma Production (GO: background-adjusted)
8100	GOBP_PLASMA_MEMBRANE_ORGANIZATION (background-adjusted)
8200	WP_SPINAL_CORD_INJURY (background-adjusted)
8200	WP_GENOTOXICITY_PATHWAY (background-adjusted)
8300	GOBP_NEGATIVE_REGULATION_OF_MAPK_CASCADE (background-adjusted)
8700	45% GO small molecule process (background-adjusted)
8700	GOBP_MUSCLE_ADAPTATION (background-adjusted)
8700	GOBP_ORGANOPHOSPHATE_CATABOLIC_PROCESS (background-adjusted)
8800	GOBP_MRNA_TRANSPORT (background-adjusted)
8900	GOBP_RESPONSE_TO_KETONE (background-adjusted)
9000	GOBP_RESPONSE_TO_INTERFERON_GAMMA (background-adjusted)
9000	GOBP_POSITIVE_REGULATION_OF_LIPID_TRANSPORT (background-adjusted)
9300	GOBP_RESPONSE_TO_XENOBIOTIC_STIMULUS (background-adjusted)
9600	GOBP_MUSCLE_CELL_DEVELOPMENT (background-adjusted)
9900	GOBP_NEURAL_CREST_CELL_DIFFERENTIATION (background-adjusted)
10000	GOMF_DNA_DEPENDENT_ATPASE_ACTIVITY (background-adjusted)
10000	GOBP_INOSITOL_PHOSPHATE_MEDIATED_SIGNALING (background-adjusted)
10500	GOBP_COLLAGEN_FIBRIL_ORGANIZATION (background-adjusted)
10800	50% GO Mitochondria (background-adjusted)
10900	GOBP_REGULATION_OF_CELL_JUNCTION_ASSEMBLY (background-adjusted)
11000	GOCC_MIDBODY (background-adjusted)
11000	GOBP_FEMALE_GAMETE_GENERATION (background-adjusted)
11100	GOBP_THIOESTER_METABOLIC_PROCESS (background-adjusted)
11100	GO_MITOCHONDRION (35%)(background-adjusted)
11500	GOBP_REGULATION_OF_SODIUM_ION_TRANSMEMBRANE_TRANSPORT (background-adjusted)
12000	WP_G1_TO_S_CELL_CYCLE_CONTROL (background-adjusted)
12000	WP_GASTRIN_SIGNALING_PATHWAY (background-adjusted)
12500	GOBP_T_CELL_MIGRATION (background-adjusted)
12500	GOBP_POSITIVE_REGULATION_OF_CELL_SUBSTRATE_ADHESION (background-adjusted)
12700	GOBP_RESPONSE_TO_ESTRADIOL (background-adjusted)
13000	GO_NEURON_PROJECTION (50%)(M17462; background-adjusted)
13500	WP_TGFBETA_SIGNALING_PATHWAY (background-adjusted)
13500	GOBP_REGULATION_OF_PHOSPHATIDYLINOSITOL_3_KINASE_SIGNALING (background-adjusted)
13700	GOBP_MEIOTIC_CELL_CYCLE (background-adjusted)
14000	GOBP_DNA_METHYLATION (background-adjusted)
14100	WP_P53_TRANSCRIPTIONAL_GENE_NETWORK (background-adjusted)
14500	GOBP_ENDOTHELIUM_DEVELOPMENT (background-adjusted)
15000	GOBP_POSITIVE_REGULATION_OF_AXONOGENESIS (background-adjusted)
16000	GOBP_CELLULAR_CARBOHYDRATE_BIOSYNTHETIC_PROCESS (background-adjusted)
16200	GOBP_DEMETHYLATION (background-adjusted)
17000	GOBP_BIOMINERALIZATION (background-adjusted)
17500	GOCC_EXTRINSIC_COMPONENT_OF_PLASMA_MEMBRANE (background-adjusted)
18000	WP_B_CELL_RECEPTOR_SIGNALING_PATHWAY (background-adjusted)
19400	GOBP_NOTCH_SIGNALING_PATHWAY (background-adjusted)
20000	GO_ANION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY (background-adjusted)
20000	GO_CATION_CHANNEL_COMPLEX (background-adjusted)
20000	GO_FOREBRAIN_DEVELOPMENT (background-adjusted)
20000	GOBP_RESPONSE_TO_STARVATION (background-adjusted)
20000	GOBP_SISTER_CHROMATID_SEGREGATION (background-adjusted)
20000	GOBP_SYNAPSE_ASSEMBLY (background-adjusted)
20000	GOBP_PATHWAY_RESTRICTED_SMAD_PROTEIN_PHOSPHORYLATION (background-adjusted)
20000	WP_COPPER_HOMEOSTASIS (background-adjusted)
20000	GOBP_VESICLE_MEDIATED_TRANSPORT_IN_SYNAPSE (background-adjusted)
20000	GOBP_METANEPHROS_DEVELOPMENT (background-adjusted)
20000	GOBP_REGULATION_OF_SYNAPTIC_PLASTICITY (background-adjusted)
20000	WP_EPITHELIAL_TO_MESENCHYMAL_TRANSITION_IN_COLORECTAL_CANCER (background-adjusted)
20000	GOBP_EMBRYONIC_SKELETAL_SYSTEM_MORPHOGENESIS (background-adjusted)
20000	GOCC_CULLIN_RING_UBIQUITIN_LIGASE_COMPLEX (background-adjusted)
20000	GOBP_REGULATION_OF_BMP_SIGNALING_PATHWAY (background-adjusted)
20000	GOBP_APPENDAGE_MORPHOGENESIS (background-adjusted)
20000	GOBP_SPHINGOLIPID_METABOLIC_PROCESS (background-adjusted)
20000	GOBP_DNA_DEPENDENT_DNA_REPLICATION (background-adjusted)
20000	GOCC_CENTRIOLE (background-adjusted)
20000	GOBP_BILE_ACID_METABOLIC_PROCESS (background-adjusted)
20000	GOBP_CARDIAC_CHAMBER_DEVELOPMENT (background-adjusted)
20000	GOMF_VOLTAGE_GATED_ION_CHANNEL_ACTIVITY (background-adjusted)
20000	GOBP_AMINE_TRANSPORT (background-adjusted)
20000	GOBP_ODONTOGENESIS (background-adjusted)
20000	GOBP_CIRCADIAN_REGULATION_OF_GENE_EXPRESSION (background-adjusted)

1) Again, crude logic. We simply find it difficult to believe that a gene list generated by scanning papers would have an effective background greater than 20,000. I guess it’s possible, perhaps in the case of processes involving cascades that are initiated by entities of low abundance.

whatismygene.com


WhatIsMyGene

Thursday, October 21, 2021

Adjusting GO Lists for Background

