CRISPRi validation of rare variant associations #71

Open
mtegtmey opened this issue Sep 8, 2022 · 47 comments

Comments

@mtegtmey
Contributor

mtegtmey commented Sep 8, 2022

New data has been generated and transferred to @bethac07.

Here is the per well metadata for the plate.
cmQTL_CRISPRi_metadata.xlsx

@mtegtmey mtegtmey changed the title CRISPRi validation of rare variants associations CRISPRi validation of rare variant associations Sep 8, 2022
@shntnu
Collaborator

shntnu commented Sep 8, 2022

@bethac07 asked

  • Do you want us to use the old pipeline and CP3, or proceed as if this was a new batch?
  • Likewise, do you want us to use the old R-based workflow downstream, or is it ok to use the recipe and pycytominer? I did check that at the time of your notes, the default aggregation in cytominer_scripts was mean (as it currently is in collate.py), which I would think/hope would be the major difference.

@AnneCarpenter replied

I think the goal here is to spot check a few genes vs a few features, so I think it's fine to use any version of CP that's convenient and I don't think it's important for the profiling to be identical either.

@mtegtmey -- the goal (as stated by Anne) sounds right to me but please confirm

@mtegtmey
Contributor Author

mtegtmey commented Sep 8, 2022

@bethac07 @shntnu I agree with Anne. Any version of CP should be fine since we have strong priors going into these validations.

@bethac07
Contributor

bethac07 commented Sep 9, 2022

Something I wanted to flag- some of these nuclei look pretty weird/bad, at least compared to what I'm used to looking at (cancer cells) - note the range of brightnesses, that one that's got weird holes in it, etc. (Ignore the bad segmentation for now, that's fixable). It's been literally years though since I did the original assay dev on these, so it's possible I'm misremembering, AND I don't know exactly how these were treated - is there any reason we should expect this? I assume I should try to keep everything?

image

@AnneCarpenter

My first thought is that it's physiological and related to differentiation state but I don't really know why that popped up for me, I've no evidence/knowledge! Curious if @mtegtmey remembers anything.

@bethac07
Contributor

bethac07 commented Sep 9, 2022

Definitely possible Anne! My first thought is it's an incomplete drug selection, so knowing if selection was used here would be super helpful.

@mtegtmey
Contributor Author

mtegtmey commented Sep 10, 2022 via email

@bethac07
Contributor

bethac07 commented Sep 15, 2022

I've had a chance to dig into it a bit further now, and thankfully, it seems to just have been a technical artifact from a bad analysis setting. Briefly, after whole-plate illumination correction we nearly always do an "enhancement" on the nuclei channel - it helps sharpen nuclear edges a bit, removes large debris, etc. In this particular case, though, it seemed to be removing real signal and leading to this effect (in at least many of the cases). I switched to a "gentler" method of doing background removal and it seems to be performing much better on this particular data! I'm hopeful I can get analysis started today and backends by tomorrow. (cc @shntnu)

image

@bethac07
Contributor

There are some QC issues with the plate that I'm noticing - a few wells just have standard "schmutz" (the bright blue/yellow bits), but additionally something was in the light path for a good chunk of the plate, and unfortunately it was not static but moving around. It shows below as a red "band", but it's not actually bright; rather, it seems to block the signal more in the blue/green part of the spectrum, so it's a "hole" we have to fill instead, which is much harder (if it were truly bright we could just block it out, which is my plan for the schmutz). Doing per-field background calculation can help a bit with it, but not fully. I don't think there's anything to do here but proceed, but FYI.

image

image

@bethac07
Contributor

bethac07 commented Sep 15, 2022

I chatted with the other image analysts, and none of us could come up with a good way to solve this floating debris issue, nor had we come across it before - we've had images where the debris was fixed and static, but never this particular issue. It does seem to be relatively consistent across the sites within a given well, just not across wells on the plate.

Basically, there are going to be a couple of options -

  1. Live with it - if you're planning to look at per-well aggregated features, we could consider median-aggregating rather than mean-aggregating. If you're looking at single-cell data, consider clipping the ends of the cell distribution (in feature space) for each well.
  2. Image again. Since there also seems to be some in-well debris, maybe doing an extra change/wash of PBS (or whatever you're storing in) will help a bit, but more importantly we should hopefully not see this floating debris issue again.
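A minimal sketch of what option 1 could look like for a single feature in a single well, using only the Python standard library. The function names, the 10% trim fraction, and the cell values are all made up for illustration; this is not the recipe's or pycytominer's implementation.

```python
import statistics


def trim_tails(values, fraction=0.1):
    """Drop the lowest and highest `fraction` of single-cell values from one well."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of cells to drop at each end
    return ordered[k:len(ordered) - k] if k else ordered


def aggregate_well(values, how="median"):
    """Collapse single-cell values for one well into one per-well value."""
    if how == "median":
        return statistics.median(values)
    if how == "mean":
        return statistics.fmean(values)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical well: 20 ordinary cells plus 2 cells sitting under debris.
clean_cells = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0] * 2
well = clean_cells + [25.0, 30.0]

print("mean, untrimmed:  ", aggregate_well(well, "mean"))
print("median, untrimmed:", aggregate_well(well, "median"))
print("mean, trimmed:    ", aggregate_well(trim_tails(well), "mean"))
```

The untrimmed mean is pulled up by the two artifact cells, while the median and the trimmed mean stay close to the bulk of the well.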

@mtegtmey Is 2 possible and/or plausible? Or are our strong priors strong enough that we feel ok going ahead with the images we currently have? Let me know how you want me to proceed.

@mtegtmey
Contributor Author

mtegtmey commented Sep 15, 2022 via email

@bethac07
Contributor

Unfortunately, the floaty thing is NOT consistent across wells- it's present in about the top quarter of the plate, and nowhere else. Since your replicates are pretty tightly grouped physically across the plate, that means it will affect some samples in every well and some not at all. If we're median aggregating, it's PROBABLY fine, because it definitely doesn't cover more than 50% of the cells, BUT if the features you're hoping to look at don't involve "100% of cells get 10% higher/lower (x)" but rather "10% of cells get 100% higher/lower (x)", by switching to median aggregation we lose the ability to detect that - does that make sense?
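To make the "100% of cells get 10% higher" versus "10% of cells get 100% higher" point concrete, here is a toy comparison with made-up values (100 hypothetical cells per well, baseline feature value 1.0). Both perturbations shift the per-well mean by the same amount, but only the uniform shift moves the median.

```python
import statistics

baseline = [1.0] * 100

# Every cell is 10% higher.
uniform_shift = [1.1] * 100

# 10% of cells are 100% higher, the rest unchanged.
subpopulation_shift = [2.0] * 10 + [1.0] * 90

for name, cells in [
    ("baseline", baseline),
    ("uniform shift", uniform_shift),
    ("subpopulation shift", subpopulation_shift),
]:
    print(name, "mean:", statistics.fmean(cells), "median:", statistics.median(cells))
```

Under median aggregation the subpopulation well is indistinguishable from the baseline well, which is the detection ability that would be lost.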

If you have a backup plate, or even if it's possible to image that same plate again (the behavior of the floaty thing says to me it was some piece of dust that was on the bottom of the plate/fell onto the objective from the air, so as long as whoever is imaging it gives the plate bottom a quick swipe with some lens paper, we shouldn't have the same issue again - like I said, I've never seen it before in 6+ years analyzing plates from that microscope!), I think running the imaging one more time is our best chance of getting maximum-quality data. I don't think the data quality we're going to have even with the data we have now is going to be BAD, I'm just saying what's ideal. In practice, though, we rarely ever have ideal data, so if the decision is "live with it, and maybe consider median aggregating, but just keep an eye out for our phenotypes in that corner" (we can always run it both ways - mean AND median aggregated), I don't think it's ruined by any means - we should just keep in mind that it might be somewhat aberrated as we then look downstream.

@mtegtmey
Contributor Author

mtegtmey commented Sep 15, 2022

Ok - I've booked the Phenix for Monday morning. I'm working on coordinating a handoff to IP folks who can set up and run the imaging for the backup plate and transfer the images to you.

It could also be possible to simply remove any/all impacted wells from the downstream processing. Each condition has 28 replicate wells on the plate (56 for controls), so even chopping 1/4 of the total wells should still leave us with plenty per condition to accomplish our goals in this experiment.
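A rough sketch of how dropping impacted wells and recounting replicates per condition might look. The affected rows (A to D), the well-naming scheme, and the tiny layout below are all assumptions for illustration; the real affected region and plate map would have to come from the images and the metadata sheet.

```python
from collections import Counter

# Assumption: the artifact is confined to these plate rows.
AFFECTED_ROWS = set("ABCD")


def keep_well(well):
    """True if the well (e.g. 'E05') lies outside the assumed affected rows."""
    return well[0].upper() not in AFFECTED_ROWS


def replicates_remaining(well_to_condition):
    """Count wells per condition after dropping the affected rows."""
    return Counter(
        condition
        for well, condition in well_to_condition.items()
        if keep_well(well)
    )


# Hypothetical mini layout, not the real plate map.
layout = {"A01": "control", "B02": "PRLR", "E05": "PRLR", "H12": "control"}
print(replicates_remaining(layout))
```

Because replicates are grouped physically on the plate, the loss will not be spread evenly, so counting per condition is more informative than assuming every condition loses a quarter of its wells.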

@bethac07
Contributor

> It could also be possible to simply remove any/all impacted wells from the downstream processing. Each condition has 28 replicate wells on the plate (56 for controls), so even chopping 1/4 of the total wells should still leave us with plenty per condition to accomplish our goals in this experiment.

Wow, I hadn't realized there were so many - I still think it's worth re-imaging the backup plate (and thank you for arranging that!), because given how much work it takes to get to this point, if we can get cleaner data, we should get cleaner data, but if for some reason you decide not to re-image, that's good to know. (We might also think about using it as an internal test case to see how much an artifact like this REALLY messes with our ability to detect phenotypes.)

@mtegtmey
Contributor Author

@bethac07 new images should get transferred to /imaging/analysis/2018_06_05_cmQTL later this afternoon! I imagine everything would be all set for analysis by tomorrow morning. I'll update if there are any issues.

@mtegtmey
Contributor Author

mtegtmey commented Oct 6, 2022

@bethac07 were you able to see if the new image set has the same floating debris issue?

@bethac07
Contributor

bethac07 commented Oct 6, 2022

It does have some of them still, so I think they must have been inside the wells - I think possibly one or more solutions must not have been filtered fully. The second plate was definitely less bad. I analyzed both plates, just to get a sense of how much if at all this is going to affect the profiles - I should have profiles to Shantanu later today.

@mtegtmey
Contributor Author

mtegtmey commented Oct 6, 2022

ok thanks for the update! I wonder what's going on. Did it still tend to be in the upper corner?

@bethac07
Contributor

bethac07 commented Oct 6, 2022

Yup, it was the same part of the plate. I can post the plate images later today.

@bethac07
Contributor

bethac07 commented Oct 6, 2022

So the two plates look - pretty similar! Note that I had to remove the Costes features in both batches - I don't know why those are misbehave-y again, but FYI. If you want to play with these in Morpheus yourself, I've uploaded GCT files with Costes features removed. I know overall profile similarity isn't the goal here, but just wanted to graph it.

Original run, sorted by "sample ID"
image

Rerun, same
image

Top right plate corner, original run (I can send the full files and they are also on AWS, but since they're ~300MB each I can't attach them here).
image

Rerun
image

@AnneCarpenter

Great! What's the next step/handoff to whom?

@mtegtmey
Contributor Author

mtegtmey commented Oct 7, 2022

I think the clustering by gene target is promising! If the profiles are ready to go, I'm happy to take them over and run the analysis. I'm not sure if they needed some more fine tuning on @shntnu's end.

@bethac07
Contributor

Given that the clustering results seem to show reasonable signal, I don't think there's any harm in going forward with this data; it's not PERFECT data, but it is likely well within the range of "can get reproducible results from". I believe your plan was to query particular features, so should be good to go for those; if you want to do more clustering work, the only thing I'd personally recommend is removing the Costes colocalization features, since they seemed to be poorly behaved in both batches here (this is not the first time we've seen this with Costes features). I removed them in the GCT files that I uploaded for playing with in Morpheus; I just didn't want to mess with the underlying profiles themselves. If @shntnu signs off, I think we're good!

@AnneCarpenter

Who will be the one to check the genes vs the features? @MarziehHaghighi did such an analysis in our ORF data and might be called in (she's to be an author on this cmQTL paper) to do the exact same analysis here. Beth would it be clear to her where the files are? It's a bummer we are meeting at noon today because it would really help to know if this worked out or not to decide next steps for submitting the paper!

@AnneCarpenter

(in a pinch we could look at the 3 features in Morpheus and just rank-order samples by each of those 3 and see if the hoped-for gene names are at the top/bottom of the list):
Cytoplasm_AreaShape_Zernike_9_3
Cells_RadialDistribution_RadialCV_Mito_1of4
Cytoplasm_Granularity_3_RNA
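The rank-ordering idea above can also be sketched outside Morpheus. This is a minimal illustration: the `gene` label key and every numeric value are placeholders, not real profile data, and the real per-well profiles would be read from the profile files.

```python
def rank_by_feature(profiles, feature, label_key="gene"):
    """Return (label, value) pairs sorted from highest to lowest feature value."""
    rows = [(p[label_key], float(p[feature])) for p in profiles if feature in p]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder per-well profiles with invented values.
profiles = [
    {"gene": "control", "Cytoplasm_Granularity_3_RNA": 0.02},
    {"gene": "TSPAN15", "Cytoplasm_Granularity_3_RNA": 0.91},
    {"gene": "PRLR", "Cytoplasm_Granularity_3_RNA": -0.35},
]

for label, value in rank_by_feature(profiles, "Cytoplasm_Granularity_3_RNA"):
    print(label, value)
```

If the hoped-for gene sits at the top or bottom of the list for its feature, that matches the quick check described above.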

@mtegtmey
Contributor Author

mtegtmey commented Oct 11, 2022 via email

@mtegtmey
Contributor Author

mtegtmey commented Oct 11, 2022

OK, here are the results of running a simple Welch's two-sample t-test on those features for which we had a rare variant burden association. There are two genes, PRLR and KCNK6, where the specific features weren't present in the most recent run. (I have only checked plate two, so I will see if those features show up in Plate 1.) These are very promising results! ZNF436 shows only a very subtle change in this feature, but according to the gene expression data, the knockdown efficiency was only about 10%.
Screen Shot 2022-10-11 at 9 00 54 AM
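For reference, a standard-library sketch of the Welch statistic for one feature, comparing knockdown wells against control wells. It computes only the t statistic and the Welch-Satterthwaite degrees of freedom; getting a p-value needs a t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`. The input numbers are arbitrary, not values from this plate, and this is not necessarily how the analysis above was run.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard errors, using the unbiased sample variance of each group.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Arbitrary example values standing in for per-well feature values.
knockdown_wells = [1.0, 2.0, 3.0, 4.0, 5.0]
control_wells = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(knockdown_wells, control_wells)
print(t_stat, dof)
```

Unlike Student's t-test, this does not assume the two groups share a variance, which suits unequal group sizes such as 28 knockdown wells against 56 control wells.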

@bethac07
Contributor

> There are two genes, PRLR and KCNK6, where the specific features weren't present in the most recent run. (I have only checked plate two, so I will see if those features show up in Plate 1.)

What were the features? Were you looking in the normalized.csv or the feature_selected.csv? Everything measured should be present in the normalized; a couple feature names changed slightly between 3 and 4, so if they aren't present there, LMK and I can try to help you find the matching ones.

@mtegtmey
Contributor Author

mtegtmey commented Oct 11, 2022 via email

@shntnu
Collaborator

shntnu commented Oct 11, 2022

normalized.csv

@bethac07 Thanks again for taking this off my plate 🙏

  • Could you clarify what this was normalized to? The controls, presumably?
  • How did you run the profiling – was it the profiling recipe? If so, which version of the recipe? (this is just for our notes)

@mtegtmey
Contributor Author

OK, I dug the other two feature associations from the normalized.csv data. We're 5/5 it seems on validating some of our top hits!
Screen Shot 2022-10-11 at 10 26 49 AM

@AnneCarpenter

I know I shouldn't be SHOCKED, but that's truly fantastic news!

@bethac07
Contributor

@shntnu
I did whole plate normalization, due to the small overall number of samples (and because that's how previous batches were run, because there weren't negative controls), but I could rerun with normalize_negcon if we wanted. The version of the recipe/template and the config file are already committed to the repo.
@mtegtmey Yay!

@shntnu
Collaborator

shntnu commented Oct 11, 2022

@mtegtmey

Wow! If it's easy, would it be possible to plot all 5 features x all 5 genes for the CRISPRi data? (and maybe later for the iPSC data)

It will be reassuring to know that we're seeing gene-specific effects here (although the clustering is already reassuring in that sense)

@shntnu
Collaborator

shntnu commented Oct 11, 2022

> I did whole plate normalization, due to the small overall number of samples (and because that's how previous batches were run, because there weren't negative controls), but I could rerun with normalize_negcon if we wanted.

I think that makes sense, Beth, because (IIUC) the genes are not expected to be related in any way (if they were, whole plate would not be a good idea)
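A toy illustration of why the choice of reference set matters, using a plain z-score as a stand-in for whatever the recipe's normalization step actually computes. All values are invented, and `standardize` is a hypothetical helper, not a pycytominer function.

```python
import statistics


def standardize(values, reference):
    """Z-score `values` against the mean and sample SD of `reference`."""
    mu = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    return [(v - mu) / sd for v in values]


# Invented per-well values for one feature.
controls = [1.0, 1.2, 0.8]
treated = [2.0, 2.2, 1.8, 2.1, 1.9, 2.0]
plate = controls + treated

# Whole-plate normalization: every well contributes to the reference.
whole_plate_treated = standardize(treated, plate)
# Negative-control normalization: only control wells define the reference.
negcon_treated = standardize(treated, controls)

print("whole plate:", statistics.fmean(whole_plate_treated))
print("negcon:     ", statistics.fmean(negcon_treated))
```

When most wells on the plate move in the same direction, the whole-plate reference absorbs part of that shift and the apparent effect shrinks, which is the situation the comment above flags as a case where whole-plate normalization would not be a good idea.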

> The version of the recipe/template and the config file are already committed to the repo.

Thank you!

@mtegtmey
Contributor Author

This may make things less exciting, but it does appear that many of these features change across the various genes. We are reassured that they seem to cluster by gene target in Morpheus, but we should think about these results. I suppose it's possible that knocking down these specific genes could impact each of these individual features.

From the wet-lab perspective, each of the cells were treated identically minus the different gene targets. The control samples are also infected with non-targeting sgRNAs, so they are exposed to the same chemical selection, as well as having free-floating dCas9 in the nucleus (which I'm sure causes some phenotype).

Screen Shot 2022-10-11 at 10 46 14 AM

@mtegtmey
Contributor Author

Do we have any strong feelings or thoughts on this?

Though we do see each perturbation impacting these features, the direction of the association with the change in the feature is the same, which is promising.

WASF2, PRLR, and TSPAN15 are all known to regulate proliferation/cell adhesion to varying degrees. So, I think completely knocking out these genes could very likely impact this specific set of features. But we could think about an alternative way to normalize the data if we feel a little uneasy about this.

@AnneCarpenter

I don't have a clear picture of what to think. It comes down to how confident we are that the negative control is a reliable/good neg control (and not itself a weird outlier for some reason). I don't think you have any reason to think it isn't a good control. In a perfect world we'd have dozens of other genes in this plot (or other kinds of neg control) to reassure ourselves that each gene of interest is relatively unusual in its feature of interest; but we don't have this kind of data.

I agree, it's possible that these 5 genes are not expected to give random/different phenotypes if they share some biological functions. We already know it's the case that when adhesion/proliferation are impacted then tons of features all change.

It's reassuring that at least in the two left-most cases, there's at least one sample that doesn't look like the others and instead looks closer to the neg control, which suggests that not ALL gene knockdowns cause the given phenotype.

Is each dot here a well, btw?

Our analysis up to this point says "changes in this gene cause changes in this phenotype" but we did not explicitly aim to choose examples where "this phenotype" would be super unique relative to all other genes (right?) So I guess it's possible to get a phenotype that's also impacted by lots of other genes (esp if that phenotype is something 'generic' like cell growth... I don't recall whether we felt that these 5 genes gave generic vs unique phenotypes in general/by eye?)

(I have no actual conclusion, just thinking out loud).

@mtegtmey
Contributor Author

mtegtmey commented Oct 12, 2022

> Is each dot here a well, btw?

Right, I was using per well level data for the analysis.

> Our analysis up to this point says "changes in this gene cause changes in this phenotype" but we did not explicitly aim to choose examples where "this phenotype" would be super unique relative to all other genes (right?) So I guess it's possible to get a phenotype that's also impacted by lots of other genes (esp if that phenotype is something 'generic' like cell growth... I don't recall whether we felt that these 5 genes gave generic vs unique phenotypes in general/by eye?)

They were honestly chosen because we didn't have anything else to explore. They were the rare variant burdens with either suggestive or significant associations to features that were the most easily interpretable from a biological sense (the genes themselves).

@AnneCarpenter

Yep, ok, so it's not necessarily discouraging that many of them impacted similar phenotypes (since we never aimed at choosing any that were distinctive)

@mtegtmey
Contributor Author

mtegtmey commented Oct 13, 2022

I'm pulling together images to see if we can visualize the changes by eye. Just looking for feedback to see if I am approaching this the right way. Below are representative images from a control well and one with cells where I've knocked down TSPAN15. The feature associated with this gene is Cytoplasm_Granularity_3_RNA, so I'm looking at the SYTO stain and for what my brain thinks is 'granularity' in the cell body.

I think on average, cells have more little holes/crevices in the knockdown sample, but I'm sure I am just trying to convince myself that's the case. In the differential test, the TSPAN15 sample has a higher score for this feature relative to the control.

Screen Shot 2022-10-13 at 8 36 07 AM

@bethac07
Contributor

I think those holes are bigger than what Granularity_3 would likely be measuring. But one way to check if there's actually a visual difference is to scramble up a bunch of control and treated images and see what your blinded classification accuracy is.

For my money, I would bet that the only 2 potentially visible of the 5 phenotypes are the Eccentricity and the Radial CV.

@AnneCarpenter

At a fine resolution based on this one pair of images, it does seem TSPAN15 is a bit blurrier in the cytoplasm, there are smaller textures in the control... but not sure if that's the right direction for this granularity metric (TSPAN15 KD is higher than control in the metric)

@bethac07
Contributor

So Granularity 3 will mean "after removing larger features and subsampling (I would have to check the pipeline to recall if by a factor of 2 or 4), and removing the dots that are 1 pixel across or 2 pixels across, a relatively higher fraction of the data that's left is 3 pixels across and is removed when we remove dots 3 pixels across". I don't think a human brain can see that. (Erin has tried looking at some Granularity stuff by eye, unsuccessfully).
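The idea behind a granularity spectrum can be illustrated on a 1D toy signal with successive morphological openings. This is a deliberately simplified sketch, not CellProfiler's implementation: it skips the subsampling and background removal described above, works in one dimension, and uses a flat window of growing width as the structuring element. The signal values are invented.

```python
def opening_1d(signal, width):
    """Grayscale opening of a 1D signal with a flat window of `width` samples."""
    n = len(signal)
    if width > n:
        raise ValueError("window is wider than the signal")
    # Erosion: minimum over every window that fits fully inside the signal.
    eroded = [min(signal[j:j + width]) for j in range(n - width + 1)]
    # Dilation of the erosion: maximum over all windows covering sample i.
    return [
        max(eroded[max(0, i - width + 1):min(i, n - width) + 1])
        for i in range(n)
    ]


def granularity_spectrum(signal, n_scales):
    """Percent of total signal removed at each scale by successively wider openings."""
    total = sum(signal)
    previous = total
    spectrum = []
    for scale in range(1, n_scales + 1):
        # A window of width scale + 1 removes bright dots up to `scale` samples wide.
        remaining = sum(opening_1d(signal, scale + 1))
        spectrum.append(100 * (previous - remaining) / total)
        previous = remaining
    return spectrum


# Toy signal: one dot 1 sample wide, one 2 wide, one 3 wide.
signal = [0, 0, 4, 0, 0, 3, 3, 0, 0, 2, 2, 2, 0, 0]
print(granularity_spectrum(signal, 3))  # [25.0, 37.5, 37.5]
```

In this toy, the third entry plays the role of "Granularity 3": the share of the total signal that disappears only once dots 3 samples across are removed.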

@mtegtmey
Contributor Author

OK, this makes sense. I will follow your advice Beth and see about training a classifier on the images.

I did look at some wells comparing PRLR and controls (feature of interest is Cells_RadialDistribution_RadialCV_Mito_1of4). I do feel like here I can eyeball a difference in where the mitotracker is staining throughout the cell. We expect the PRLR_sgRNA to have a lower score in this feature relative to the Control_sgRNA.

Screen Shot 2022-10-13 at 12 43 02 PM

@bethac07
Contributor

bethac07 commented Oct 13, 2022

I don't think you have to get as fancy as training a classifier, just scramble some images by hiding the file names or renaming them. This is some code I wrote to randomize data; copy the mapping to an Excel sheet, and then you could just hide the real name in the Excel file, write down your guesses, and then un-hide the real name to see if you're right.
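The code linked in the comment above is not reproduced in this thread. As a stand-in, here is a minimal sketch of the same idea, assuming the images are ordinary files on disk: copy them under shuffled, uninformative names and write the name mapping to a separate CSV key. The function name, file-naming scheme, and CSV columns are hypothetical.

```python
import csv
import random
import shutil
from pathlib import Path


def blind_copy(image_paths, out_dir, key_path, seed=None):
    """Copy images under shuffled anonymous names and save the mapping to a CSV key."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shuffled = [Path(p) for p in image_paths]
    random.Random(seed).shuffle(shuffled)

    mapping = []
    for index, source in enumerate(shuffled, start=1):
        blinded_name = f"image_{index:03d}{source.suffix}"
        shutil.copy2(source, out_dir / blinded_name)
        mapping.append((blinded_name, source.name))

    # Keep the key outside the folder being scored so it is not seen by accident.
    with open(key_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["blinded_name", "original_name"])
        writer.writerows(mapping)
    return mapping


# Example call with hypothetical paths:
# blind_copy(Path("picked_images").glob("*.tiff"), "blinded", "key.csv")
```

Score the images in the output folder first, then open the key to check the guesses.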

RE: PRLR: That feature means that we expect in the mutant the mitochondrial distribution in the inside 1/4 of the cell (aka, immediately perinuclear) to be more symmetric (in the sense of not all the mitochondria on one side of the nucleus, but evenly distributed on all sides of the nucleus). I guess I COULD see that in the two images you posted, but there's a very good chance it's my brain tricking me, I'm not sure if I could pass a scramble test on it.
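A toy version of the radial CV idea for one ring, assuming the ring is split into angular wedges and the feature is the coefficient of variation of the wedge intensities. The use of 8 wedges and of the population standard deviation are assumptions here, and the intensities are invented; check the CellProfiler documentation for the exact definition.

```python
import statistics


def radial_cv(wedge_intensities):
    """Coefficient of variation of intensity across the angular wedges of one ring."""
    mean = statistics.fmean(wedge_intensities)
    if mean == 0:
        return 0.0
    return statistics.pstdev(wedge_intensities) / mean


# Mitochondria spread evenly around the nucleus.
even_ring = [10.0] * 8
# Mitochondria piled up on one side of the nucleus.
lopsided_ring = [40.0, 40.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print("even:    ", radial_cv(even_ring))
print("lopsided:", radial_cv(lopsided_ring))
```

An even ring gives a CV of zero and a one-sided ring gives a larger value, consistent with a lower score meaning a more symmetric perinuclear distribution.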

@mtegtmey
Contributor Author

I randomly sampled 8 images (4 from a control sgRNA and 4 from PRLR sgRNA). I was able to pick the four images from each condition without knowing their source. Here are the particular images I sampled in case anyone wants to check my sanity of distinguishing them.

Screen Shot 2022-10-13 at 2 05 04 PM
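As a rough guide to how impressive a perfect score is: if a rater knows there are 4 images per condition and guesses purely at random, the chance of splitting all 8 images correctly is one over the number of ways to choose 4 of 8. The sketch below assumes exactly that setup (known group sizes, random guessing).

```python
import math


def chance_of_perfect_sort(n_images, n_in_group):
    """Probability of a perfect split by random guessing when group sizes are known."""
    return 1 / math.comb(n_images, n_in_group)


print(math.comb(8, 4))  # 70
print(chance_of_perfect_sort(8, 4))
```

So a single perfect sort of these 8 images would happen by luck about 1 time in 70, and a larger scrambled set would make the test stricter.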

@AnneCarpenter

Matt tells me I got a perfect score too :D but didn't put my answer here so as not to contaminate anyone else who wants to try. FWIW, I was looking primarily at the 'stringiness' vs 'blobbiness' of mito esp in that ring around the nucleus, which I guess makes sense it (roughly) corresponds to being evenly distributed vs not as much.
