Thursday, October 21, 2021

Adjusting GO Lists for Background

If you want to see if your gene set overlaps significantly with some other set, what do you need? First, you’ll need your gene set |a|. Let’s say it’s a list of 100 proteins upregulated in the HepG2 cell line on IFN-A treatment. Next, you need some other set |b|. Let’s say it’s a list of 200 proteins upregulated when Huh7 cells are infected for 24 hours with influenza virus. From there, it’s easy to find the intersection between set |a| and set |b|. There might be 20 genes common to both sets. Are you ready to perform Fisher’s exact test? No. You still need the joint background of those two studies. That’ll be the intersection between ALL proteins identified in the HepG2 study |A| and ALL proteins identified in the Huh7 study |B|. Very roughly, the figure might be around 5,000 in a proteomic study. There are issues regarding what exactly constitutes an “identified” entity, particularly in microarray studies (briefly: under what criteria do you say that a probe had zero or questionable signal, and thus that its cognate transcript was not identified?), but we’re not going to worry about that here.
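To make the arithmetic concrete, here is a minimal Python sketch using the example numbers above (100, 200, 20 in common, joint background of 5,000). It computes only the one-sided, strong-overlap P-value from the hypergeometric distribution. The function name overlap_p_value is made up for illustration and is not part of any tool.

```python
from math import comb

def overlap_p_value(a_size, b_size, overlap, background):
    """One-sided Fisher's exact test (hypergeometric upper tail):
    chance of seeing at least `overlap` shared entities when a_size
    entities are drawn at random from `background`, b_size of which
    belong to the other list."""
    total = comb(background, a_size)
    tail = sum(
        comb(b_size, k) * comb(background - b_size, a_size - k)
        for k in range(overlap, min(a_size, b_size) + 1)
    )
    return tail / total  # exact integers until this final division

# Example figures from the text: |a| = 100, |b| = 200, 20 in common, background 5,000.
p = overlap_p_value(a_size=100, b_size=200, overlap=20, background=5000)
print(f"P = {p:.3g}")
```

Swapping 5,000 for 20,000 in the last call shows how strongly the answer depends on the background figure.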

Just to nip potential confusion in the bud: we can talk about the joint background of two studies, as above. We can also talk about the background of a single study, which is simply all the identified genes in that one study. You'll need the two individual backgrounds to derive the joint background.
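In code, the joint background is nothing more than a set intersection. A tiny sketch, with made-up identifier sets standing in for the thousands of proteins a real study would report:

```python
# Hypothetical identifier sets, for illustration only.
identified_hepg2 = {"ACTB", "GAPDH", "STAT1", "MX1", "ISG15"}  # background of study A
identified_huh7 = {"ACTB", "GAPDH", "STAT1", "IFIT1", "ALB"}   # background of study B

# Joint background: entities identified in BOTH studies.
joint_background = identified_hepg2 & identified_huh7
print(len(joint_background), sorted(joint_background))
```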

As we’ve pointed out before, small errors in a background figure might not be of importance. Large errors can, however, screw up your results, particularly when you rely on a list of studies, ranked by P-values, that are supposedly relevant to your own study results. If a background error causes a study to fall from a high rank to a low rank, you may never pay attention to a potentially important result.

Often, it’s a simple matter to calculate or estimate the joint background between two studies. However, how does one derive the intersection between identified entities in your own study |A| and gene ontology lists (GO, Reactome, KEGG, PANTHER, WikiPathways, etc.)? In this case, there is no big set |B| to play with, only little set |b|. Many of these lists are created by humans who screen multiple papers to derive gene sets that are relevant to particular processes, or cell types, or whatever. What’s the background behind such lists?

I’m not here to disrespect gene ontology lists. They represent our best efforts to categorize genes. My problem is with misuse of these lists. Without naming names, it does seem that numerous tools encourage misuse. In some cases, a P-value calculation is performed without any background at all. In other cases, the tool will ask for the complete list of identified entities in the user’s study |A|…however, there’s no underlying database of identified entities |B| behind the GO list. How could there be, when these lists are constructed by text-mining and human screening of multiple papers? To put it another way, there’s no larger pool |B| that also holds the entities that didn’t “make the cut” for GO list |b|. It seems that the assumption is made that GO lists are “perfect”…if |A| = 20,000, then |B| (all entities behind the GO list) = 20,000, and if |A| = 3,000, then |B| = 3,000. The size of the pool |B| from which list |b| is generated is not questioned.

I decided to apply backgrounds to GO lists. The main assumption is this: if such a list is biased, it’s biased according to abundance. In other words, at least some of these lists may be overweighted with abundant entities. They don’t contain the mix of abundant and not-so-abundant entities that you’d see in list |a| of a typical, empirical, big-data study.

Here’s a GO list that is heavily loaded with genes that are quite abundant in mammals: ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM (call it the “ER list”). Again, there’s nothing inherently wrong with the list. If you look at its component genes, you’ll find a lot of ribosomal genes. If you choose a GO analysis tool, and simply load a list of the top 500 (or so) most abundant proteins in the human proteome, you may find that the above GO list is ranked #1, or thereabouts, with an insanely significant P-value (say, 10^-60). So, without going into the dirty algorithmic details, we simply asked, “what background value do we need to assign to the ER list so that the P-value becomes insignificant when the ER list is examined against a list of the most abundant proteins in humans?” In the case of the ER list, we calculated that it was derived from a universe of about 900 proteins. If you run Fisher’s exact test with a list |a| of abundant proteins, the ER list |b|, and a background of 900, the P-value should not be very significant.
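A rough Python sketch of the effect, for the curious. The list sizes and overlap below are toy numbers, NOT the real ER list or abundance list. Only the two backgrounds (20,000 and 900) come from the discussion above, and only the strong-overlap side of the test is computed.

```python
from math import comb

def overlap_p_value(a_size, b_size, overlap, background):
    """One-sided Fisher's exact test: chance of at least `overlap` shared entities."""
    total = comb(background, a_size)
    tail = sum(
        comb(b_size, k) * comb(background - b_size, a_size - k)
        for k in range(overlap, min(a_size, b_size) + 1)
    )
    return tail / total

# Toy numbers (assumptions): 500 abundant proteins, a 100-gene GO list, 56 shared.
A_SIZE, B_SIZE, OVERLAP = 500, 100, 56

p_whole_genome = overlap_p_value(A_SIZE, B_SIZE, OVERLAP, background=20000)
p_adjusted = overlap_p_value(A_SIZE, B_SIZE, OVERLAP, background=900)

print(f"background 20,000: P = {p_whole_genome:.3g}")
print(f"background    900: P = {p_adjusted:.3g}")
```

The same overlap that looks spectacular against 20,000 looks unremarkable against 900, which is the whole point of the adjustment.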

Actually, you can quibble about the “becomes insignificant” bit above. Is that when P>.05? With Fisher’s exact test, there are actually two ways to get significance…by a strong overlap and by a weak or zero overlap between lists |a| and |b|. So which P are we talking about? Rather than spend time contemplating the precise P-value at which a “correct” background can be derived, we looked at proteomic studies that already have well-defined backgrounds. At what P-value do we tend to predict a proteomic study’s background when examined against a list of the most abundant proteins? It turns out that the background lying right between the two points where P crosses .05 (one on the strong-overlap side, one on the weak-overlap side) works quite well. More intuitively, it’s the background that gives the least significant P-value you can get when you hold all relevant values constant except for the background. If anything, this choice of P-value seems to be liberal with respect to GO lists. That is, you may still see some weak significance between the GO list and the abundance list when you use backgrounds derived this way.
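Here is one way the “least significant P-value” idea could be sketched in Python. This is not our production algorithm, just an illustration: it scans candidate backgrounds in steps of 100, computes a two-sided Fisher P-value for each, and keeps the candidate where P is largest. The function names, the step size, and the toy list sizes are all assumptions. The 20,000 ceiling on the scan is the only figure borrowed from this post.

```python
from math import comb

def two_sided_p(a_size, b_size, overlap, background):
    """Two-sided Fisher's exact P: total probability of every possible
    overlap that is no more likely than the observed one."""
    # Unnormalised hypergeometric weights for each possible overlap k.
    weights = [
        comb(b_size, k) * comb(background - b_size, a_size - k)
        for k in range(min(a_size, b_size) + 1)
    ]
    observed = weights[overlap]
    return sum(w for w in weights if w <= observed) / sum(weights)

def estimate_background(a_size, b_size, overlap, cap=20000, step=100):
    """Return the candidate background with the least significant P-value."""
    smallest = a_size + b_size - overlap   # fewest entities that can hold both lists
    first = -(-smallest // step) * step    # round up to a multiple of `step`
    return max(
        range(first, cap + 1, step),
        key=lambda n: two_sided_p(a_size, b_size, overlap, n),
    )

# Toy numbers (assumptions): 500 abundant proteins, a 100-gene list, 56 shared.
print(estimate_background(500, 100, 56))
```

As a sanity check, the winning background should sit close to (size of |a| × size of |b|) / overlap, since that is where the observed overlap matches what chance alone would predict.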

If the above sounds crude, that’s because it is. We used an “integrated human proteome” list from pax-db.org for our list of abundant proteins. What about specific tissue types? What about mice and other critters? Shouldn’t we test our algorithm against transcriptomic studies, not just proteomic studies? Is abundance the only dimension on which GO lists can be biased? Why not write a paper, or build a new tool? I’m really not motivated to carry this exercise much further. For now, I’m confident that a background of 900 for the ER list is far superior to a background of 20,000 for just about any application of Fisher’s exact test.

We’ve entered about 120 “adjusted” GO lists into our database. These don’t include KEGG or Reactome lists…there may be licensing issues there. We call them “adjusted” because, unseen by the user, there’s a background figure for each list. If your own study has a background |A| of 20,000 and the GO list has a background |B| of 900, the 900 figure (not the 20,000) will be used by our “Fisher” app.

Note that we’ve added an “external list” term to the “Experiment” filter that can be used for most of our tools. So, just as an example, go to the “Relevant Studies” app. Enter any gene you like in the “identifiers” box. Go to “Experiments” and select “External List”. Submit. You’ll receive a list of GO lists in which your gene was found. Bear in mind that there are only 120 such lists in our database, so your gene may not be found. Or…go to the Fisher tool and enter a gene list. Go to “Experiments” and select “External List”. You’ll get a list of GO lists that most significantly overlap with your own list. Of course, these P-values may disappoint, as their backgrounds may be heavily adjusted. That’s the way it is.

Another objection is this: what if my gene list |a| really is overweighted with abundant entities? In other words, IFN-A treatment actually upregulates abundant entities. Well, then, the P-value will be unfairly insignificant. Realistically, I question whether any treatment could result in a wholesale upregulation of abundant entities versus non-abundant entities. It could be quite a burden on the cell. In any case, think of our spin on GO lists as an alternative or “second opinion” to the standard approach, not the absolute most correct approach. There are plenty of tools that don’t consider a GO list’s background…try our tool also!

What are the appropriate backgrounds for common GO lists? Take a look below. If the adjusted background was higher than 20,000, we set the background at 20,000 (see note 1 below the table). There are, of course, thousands upon thousands of these lists that folks have generated. We certainly don’t intend to become yet another all-inclusive depot for them. But if there’s a particular list you’d like us to adjust and add to the database, let us know.


adjusted background    GO list
900                    GOBP_ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM (background-adjusted)
1100                   GOCC_BLOOD_MICROPARTICLE (background-adjusted)
1800                   GOBP_VIRAL_GENE_EXPRESSION (background-adjusted)
2600                   GOBP_AEROBIC_RESPIRATION (background-adjusted)
2600                   GOBP_ACUTE_INFLAMMATORY_RESPONSE (background-adjusted)
2800                   GOBP_ANAPHASE_PROMOTING_COMPLEX_DEPENDENT_CATABOLIC_PROCESS (background-adjusted)
3400                   50% GO poly-a RNA binding (background-adjusted)
3400                   GOCC_VACUOLAR_LUMEN (background-adjusted)
3900                   GOBP_TELOMERE_ORGANIZATION (background-adjusted)
4200                   50% GO RNA-binding (background-adjusted)
4400                   GO secretory granule (background-adjusted)
4500                   GOBP_NIK_NF_KAPPAB_SIGNALING (background-adjusted)
4500                   GOBP_REGULATION_OF_LIPASE_ACTIVITY (background-adjusted)
4700                   GO_PROTEASOME_ACCESSORY_COMPLEX (background-adjusted)
4700                   GOMF_INTEGRIN_BINDING (background-adjusted)
4800                   GOBP_GLUTATHIONE_METABOLIC_PROCESS (background-adjusted)
4900                   PID_INTEGRIN1_PATHWAY (background-adjusted)
4900                   WP_ALLOGRAFT_REJECTION (background-adjusted)
4900                   GOCC_MHC_PROTEIN_COMPLEX (background-adjusted)
5000                   GOBP_LAMELLIPODIUM_ORGANIZATION (background-adjusted)
5200                   GOMF_KINASE_INHIBITOR_ACTIVITY (background-adjusted)
5500                   GOBP_CELLULAR_RESPIRATION (background-adjusted)
5700                   GOBP_STEROL_BIOSYNTHETIC_PROCESS (background-adjusted)
5700                   GOCC_BRUSH_BORDER (background-adjusted)
5900                   GOMF_ISOMERASE_ACTIVITY (background-adjusted)
5900                   GOBP_RESPONSE_TO_LEUKEMIA_INHIBITORY_FACTOR (background-adjusted)
6000                   WP_SENESCENCE_AND_AUTOPHAGY_IN_CANCER (background-adjusted)
6000                   GOBP_NEURON_PROJECTION_REGENERATION (background-adjusted)
6100                   GOBP_PROTEIN_TETRAMERIZATION (background-adjusted)
6200                   WP_MYOMETRIAL_RELAXATION_AND_CONTRACTION_PATHWAYS (background-adjusted)
6700                   GO cofactor metabolic process (background-adjusted)
6900                   GOBP_POSITIVE_REGULATION_OF_LIPID_METABOLIC_PROCESS (background-adjusted)
6900                   GOBP_REGULATION_OF_VIRAL_LIFE_CYCLE (background-adjusted)
7000                   GOBP_RESPONSE_TO_ESTRADIOL (background-adjusted)
7100                   GOBP_NEGATIVE_REGULATION_OF_IMMUNE_EFFECTOR_PROCESS (background-adjusted)
7400                   GOCC_SPECIFIC_GRANULE (background-adjusted)
7500                   GOMF_OXIDOREDUCTASE_ACTIVITY_ACTING_ON_NAD_P_H (background-adjusted)
7500                   WP_MECP2_AND_ASSOCIATED_RETT_SYNDROME (background-adjusted)
7600                   GOCC_I_BAND (background-adjusted)
7700                   GOBP_LIPID_OXIDATION (background-adjusted)
7700                   GOBP_ESTABLISHMENT_OF_CELL_POLARITY (background-adjusted)
7700                   GOBP_TRANSCRIPTION_COUPLED_NUCLEOTIDE_EXCISION_REPAIR (background-adjusted)
7800                   GOBP_TRANSITION_METAL_ION_HOMEOSTASIS (background-adjusted)
7900                   GO_NEGATIVE_REGULATION_OF_VIRAL_GENOME_REPLICATION (background-adjusted)
8000                   GOBP_COLLAGEN_METABOLIC_PROCESS (background-adjusted)
8000                   GOBP_REGULATION_OF_ALCOHOL_BIOSYNTHETIC_PROCESS (background-adjusted)
8100                   Regulation Of Interferon-GammaProduction (GO: background-adjusted)
8100                   GOBP_PLASMA_MEMBRANE_ORGANIZATION (background-adjusted)
8200                   WP_SPINAL_CORD_INJURY (background-adjusted)
8200                   WP_GENOTOXICITY_PATHWAY (background-adjusted)
8300                   GOBP_NEGATIVE_REGULATION_OF_MAPK_CASCADE (background-adjusted)
8700                   45% GO small molecule process (background-adjusted)
8700                   GOBP_MUSCLE_ADAPTATION (background-adjusted)
8700                   GOBP_ORGANOPHOSPHATE_CATABOLIC_PROCESS (background-adjusted)
8800                   GOBP_MRNA_TRANSPORT (background-adjusted)
8900                   GOBP_RESPONSE_TO_KETONE (background-adjusted)
9000                   GOBP_RESPONSE_TO_INTERFERON_GAMMA (background-adjusted)
9000                   GOBP_POSITIVE_REGULATION_OF_LIPID_TRANSPORT (background-adjusted)
9300                   GOBP_RESPONSE_TO_XENOBIOTIC_STIMULUS (background-adjusted)
9600                   GOBP_MUSCLE_CELL_DEVELOPMENT (background-adjusted)
9900                   GOBP_NEURAL_CREST_CELL_DIFFERENTIATION (background-adjusted)
10000                  GOMF_DNA_DEPENDENT_ATPASE_ACTIVITY (background-adjusted)
10000                  GOBP_INOSITOL_PHOSPHATE_MEDIATED_SIGNALING (background-adjusted)
10500                  GOBP_COLLAGEN_FIBRIL_ORGANIZATION (background-adjusted)
10800                  50% GO Mitochondria (background-adjusted)
10900                  GOBP_REGULATION_OF_CELL_JUNCTION_ASSEMBLY (background-adjusted)
11000                  GOCC_MIDBODY (background-adjusted)
11000                  GOBP_FEMALE_GAMETE_GENERATION (background-adjusted)
11100                  GOBP_THIOESTER_METABOLIC_PROCESS (background-adjusted)
11100                  GO_MITOCHONDRION (35%)(background-adjusted)
11500                  GOBP_REGULATION_OF_SODIUM_ION_TRANSMEMBRANE_TRANSPORT (background-adjusted)
12000                  WP_G1_TO_S_CELL_CYCLE_CONTROL (background-adjusted)
12000                  WP_GASTRIN_SIGNALING_PATHWAY (background-adjusted)
12500                  GOBP_T_CELL_MIGRATION (background-adjusted)
12500                  GOBP_POSITIVE_REGULATION_OF_CELL_SUBSTRATE_ADHESION (background-adjusted)
12700                  GOBP_RESPONSE_TO_ESTRADIOL (background-adjusted)
13000                  GO_NEURON_PROJECTION (50%)(M17462; background-adjusted)
13500                  WP_TGFBETA_SIGNALING_PATHWAY (background-adjusted)
13500                  GOBP_REGULATION_OF_PHOSPHATIDYLINOSITOL_3_KINASE_SIGNALING (background-adjusted)
13700                  GOBP_MEIOTIC_CELL_CYCLE (background-adjusted)
14000                  GOBP_DNA_METHYLATION (background-adjusted)
14100                  WP_P53_TRANSCRIPTIONAL_GENE_NETWORK (background-adjusted)
14500                  GOBP_ENDOTHELIUM_DEVELOPMENT (background-adjusted)
15000                  GOBP_POSITIVE_REGULATION_OF_AXONOGENESIS (background-adjusted)
16000                  GOBP_CELLULAR_CARBOHYDRATE_BIOSYNTHETIC_PROCESS (background-adjusted)
16200                  GOBP_DEMETHYLATION (background-adjusted)
17000                  GOBP_BIOMINERALIZATION (background-adjusted)
17500                  GOCC_EXTRINSIC_COMPONENT_OF_PLASMA_MEMBRANE (background-adjusted)
18000                  WP_B_CELL_RECEPTOR_SIGNALING_PATHWAY (background-adjusted)
19400                  GOBP_NOTCH_SIGNALING_PATHWAY (background-adjusted)
20000                  GO_ANION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY (background-adjusted)
20000                  GO_CATION_CHANNEL_COMPLEX (background_adjusted)
20000                  GO_FOREBRAIN_DEVELOPMENT (background-adjusted)
20000                  GOBP_RESPONSE_TO_STARVATION (background-adjusted)
20000                  GOBP_SISTER_CHROMATID_SEGREGATION (background-adjusted)
20000                  GOBP_SYNAPSE_ASSEMBLY (background-adjusted)
20000                  GOBP_PATHWAY_RESTRICTED_SMAD_PROTEIN_PHOSPHORYLATION (background-adjusted)
20000                  WP_COPPER_HOMEOSTASIS (background-adjusted)
20000                  GOBP_VESICLE_MEDIATED_TRANSPORT_IN_SYNAPSE (background-adjusted)
20000                  GOBP_METANEPHROS_DEVELOPMENT (background-adjusted)
20000                  GOBP_REGULATION_OF_SYNAPTIC_PLASTICITY (background-adjusted)
20000                  WP_EPITHELIAL_TO_MESENCHYMAL_TRANSITION_IN_COLORECTAL_CANCER (background-adjusted)
20000                  GOBP_EMBRYONIC_SKELETAL_SYSTEM_MORPHOGENESIS (background-adjusted)
20000                  GOCC_CULLIN_RING_UBIQUITIN_LIGASE_COMPLEX (background-adjusted)
20000                  GOBP_REGULATION_OF_BMP_SIGNALING_PATHWAY (background-adjusted)
20000                  GOBP_APPENDAGE_MORPHOGENESIS (background-adjusted)
20000                  GOBP_SPHINGOLIPID_METABOLIC_PROCESS (background-adjusted)
20000                  GOBP_DNA_DEPENDENT_DNA_REPLICATION (background-adjusted)
20000                  GOCC_CENTRIOLE (background-adjusted)
20000                  GOBP_BILE_ACID_METABOLIC_PROCESS (background-adjusted)
20000                  GOBP_CARDIAC_CHAMBER_DEVELOPMENT (background-adjusted)
20000                  GOMF_VOLTAGE_GATED_ION_CHANNEL_ACTIVITY (background-adjusted)
20000                  GOBP_AMINE_TRANSPORT (background-adjusted)
20000                  GOBP_ODONTOGENESIS (background-adjusted)
20000                  GOBP_CIRCADIAN_REGULATION_OF_GENE_EXPRESSION (background-adjusted)

 

1) Again, crude logic. We simply find it difficult to believe that a gene list generated by scanning papers would have an effective background greater than 20,000. I guess it’s possible, perhaps in the case of processes involving cascades that are initiated by entities of low abundance.


whatismygene.com 

