Friday, July 15, 2022

Perturb-Seq in Our Database

We've updated our database with data from a recent, massive, perturb-seq study in which more than 8,000 transcripts were knocked-down using crispr-i. See our previous post, and Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq for more info. This new data means that a total of about 85,000 gene lists can now be found in our database.

The study actually generated two massive knockdown datasets, one involving K562 cells, and the other involving the RPE1 line. We've added both of these datasets to the database. As we pointed out in the previous post, the K562 matrix can be read as either a series of knockdown studies (e.g. what is upregulated when gene XYZ is knocked down?), or as a crispr screen (e.g. what are the genes whose knockdown downregulates XYZ?). As you'll see below, you can examine the K562 dataset in either way. In the case of the RPE1 set, only the results of knockdowns can be examined. This is because "only" about 2,000 genes were knocked down in the RPE1 study, meaning that the background of the RPE1 crispr-screen results is fairly small. We like big backgrounds. Looking at our previous post, the K562 results can be represented as a fairly square-shaped matrix, but the RPE1 results are a long rectangle.

Integrating this data into our database wasn't entirely straightforward. The sheer volume of the data can slow down processing considerably. Also, the presence of massive data generated from a single lab, with a single protocol, with only two cell lines, wreaks havoc on the output of some of our programs. Thus, we had to tweak several of our apps to accommodate the new data.

For the Relevant Studies app, the addition of the data is not problematic. You may not be interested in these perturb-seq results, however, so a "standard" search excludes perturb-seq results. You can include perturb-seq results by scrolling down to the bottom of the black bar on the left side of the page and making a choice in the "Database" box. If, for some reason, you're only interested in perturb-seq, you can type "perturb-seq" in the "Keyword" box (be sure, of course, to choose perturb-seq in the Database box as well).

For the Co-Expression app, we don't want to mix the results of a single monster study involving two cell lines/one lab/one protocol with our standard database. All the pitfalls of the perturb-seq approach* would come to the fore and dominate results if we mixed databases. Thus you can go to "Database" and choose the standard database, the perturb-seq knockdown database, or the perturb-seq crispr-screen database, but you can't combine these datasets**. Since the "Regulation" app also performs co-expression analysis, the same database limitations apply. 

For the Fisher app, Match Studies app, and Third Set app, our main concern is speed. Thus, as with the Relevant studies app, the default is to use our standard database. Again, you can change that in the Database box.

No insight is gained by including the perturb-seq studies in our "Cell Types" app. Thus only the standard database is interrogated when you ask the question, "in what cell types does my gene of interest tend to be perturbed?"

Presumably, the frequency with which these sorts of massive datasets are released will increase in the future. We'll add these studies to the database as they come.


*Unmentioned in our previous post is the following: the results of the K562 and RPE1 knockdowns overlap quite weakly. That is, if you choose a gene that was knocked down in both studies, and look for correlation in the two altered transcriptomes, you'll probably be disappointed. The perturb-seq paper devotes a paragraph to this result, but does not really offer insight into why such weak correlation is seen. Hopefully, our previous post makes it somewhat clear that overlap between transcriptomic studies that target the same gene, but use different cell lines and targeting approaches (knockout/down, drug, etc) is not unusual or unexpected.


**You can't combine the crispr-i and crispr-screen results, either. My underlying thinking is that since one is just the transpose of the other, it's a bit incestuous to mix the two sets and then perform analysis. There's probably a name for an associated statistical error here. Admittedly, I haven't thought this out fully, but it just seems safer to keep the two sets separate.


whatismygene.com 


No comments:

Post a Comment

A Preprint

It has been a while since we posted. That's largely because of the effort put into generating a paper. Check it out on BioRxiv . This is...