Clarifications about Cell Painting data #32

Open
shntnu opened this issue Mar 27, 2020 · 17 comments
Labels
Discussion and Notes Documenting ideas/discussions

Comments

@shntnu
Collaborator

shntnu commented Mar 27, 2020

This thread is to address general questions about Cell Painting data. Discuss dataset-specific and analysis-specific issues in a separate thread.

@sasgari asked:

Besides the missing features, some features have zero or negative values. I am not sure if these zeros and/or negative values are valid measurements or if they should be treated as missing.

cc @jatinarora-upmc

@shntnu shntnu added the Discussion and Notes Documenting ideas/discussions label Mar 27, 2020
@shntnu
Collaborator Author

shntnu commented Mar 27, 2020

Besides the missing features, some features have zero or negative values. I am not sure if these zeros and/or negative values are valid measurements or if they should be treated as missing.

Feature values can indeed be negative, and those are valid measurements rather than missing values. In fact, features can have very different distributions at the single-cell level; see, for example, this figure:

image
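To make the distinction between missing and zero/negative values concrete, here is a minimal base R sketch. The data frame and its two column names are hypothetical and only illustrate the idea: `NA` marks a missing measurement, whereas zeros and negative numbers (for example, a negative correlation between two channels) are ordinary values.

```r
# Hypothetical single-cell measurements; the column names are illustrative only.
cells <- data.frame(
  Cells_Intensity_MeanIntensity_DNA = c(0.12, 0.00, 0.31, NA),
  Cells_Correlation_Correlation_DNA_RNA = c(-0.25, 0.40, 0.00, 0.10)
)

# Count, per feature, how many values are missing, zero, or negative.
summarize_feature <- function(x) {
  c(
    n_missing  = sum(is.na(x)),
    n_zero     = sum(x == 0, na.rm = TRUE),
    n_negative = sum(x < 0, na.rm = TRUE)
  )
}

# One column per feature, one row per count.
feature_summary <- sapply(cells, summarize_feature)
print(feature_summary)
```

Only the `n_missing` row reflects values that need to be handled as missing; the other two rows count valid measurements.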

@sasgari
Collaborator

sasgari commented Mar 28, 2020

Thanks @shntnu!

@shntnu
Collaborator Author

shntnu commented Mar 28, 2020

@sasgari Note that this is for single-cell-level data, of course. Aggregated or "pseudo-bulk" profiles will have a different distribution: they follow the sampling distribution of the corresponding statistic (e.g. mean or median).
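As a rough illustration of what aggregation does, the base R sketch below collapses a toy single-cell table to one value per well using the mean and the median. The table, the `Metadata_Well` column, and the values are assumptions made for the example, not data from this project.

```r
# Hypothetical single-cell table: two wells, one feature (all names are illustrative).
cells <- data.frame(
  Metadata_Well = c("A01", "A01", "A01", "B01", "B01", "B01"),
  Cells_AreaShape_Area = c(100, 120, 500, 90, 95, 100)
)

# Collapse single cells to one profile per well, using the mean and the median.
well_mean <- aggregate(Cells_AreaShape_Area ~ Metadata_Well, data = cells, FUN = mean)
well_median <- aggregate(Cells_AreaShape_Area ~ Metadata_Well, data = cells, FUN = median)

print(well_mean)
print(well_median)
```

In well A01 a single large cell pulls the mean well above the median, which is one reason aggregated profiles do not look like the single-cell distributions they came from.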

@shntnu
Collaborator Author

shntnu commented Apr 6, 2020

@jatinarora-upmc asked - what are Costes features?

These are features used to measure the correlation between channels (in Cell Painting, each channel corresponds to one stain, except for the AGP channel, which corresponds to two stains).

There are many methods to measure correlation between channels. Costes' method evaluates the correlation in pixels below each threshold in the data, and then selects either the threshold with the minimum correlation or the highest threshold with a non-positive correlation (from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5200903/).
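The threshold search described above can be sketched in a few lines of base R. This is a simplified illustration, not CellProfiler's implementation: the function name `costes_threshold`, the use of ordinary least squares to tie the two channel thresholds together, the fixed number of steps, and the toy pixel values are all assumptions made for the example.

```r
# Simplified sketch of the threshold search described above (not CellProfiler's code).
# ch1, ch2: pixel intensities of two channels, as vectors of the same length.
costes_threshold <- function(ch1, ch2, n_steps = 301) {
  # Assumption: an ordinary least-squares line links the two channel thresholds.
  fit <- lm(ch2 ~ ch1)
  intercept <- unname(coef(fit)[1])
  slope <- unname(coef(fit)[2])

  # Walk the channel-1 threshold down from the maximum intensity.
  for (t1 in seq(max(ch1), min(ch1), length.out = n_steps)) {
    t2 <- slope * t1 + intercept
    below <- ch1 < t1 | ch2 < t2
    if (sum(below) < 3) next
    r <- suppressWarnings(cor(ch1[below], ch2[below]))
    # Stop at the highest threshold whose sub-threshold pixels are not positively correlated.
    if (!is.na(r) && r <= 0) return(c(t1 = t1, t2 = t2))
  }
  c(t1 = NA_real_, t2 = NA_real_)
}

# Toy data: four uncorrelated background pixels plus three correlated signal pixels.
ch1 <- c(0, 1, 0, 1, 10, 20, 30)
ch2 <- c(1, 0, 0, 1, 10, 20, 30)
thresholds <- costes_threshold(ch1, ch2)
print(thresholds)
```

On this toy input the search stops once only the background pixels remain below the thresholds, which is the intuition behind the method: everything above the selected thresholds is treated as signal.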

@shntnu
Collaborator Author

shntnu commented Apr 6, 2020

@jatinarora-upmc asked - Nucleus is identified using DNA channel, but cell is identified using nucleus and cytoplasmic RNA channel. I wonder why cells are not identified using plasma membrane channel?

From: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5223290/:

First, the nuclei are identified from the Hoechst image because it is a high-contrast stain for a well-separated organelle; subsequently, the nucleus along with an appropriate channel is used to delineate the cell body [49]. We have found the SYTO 14 image is the most amenable for finding cell edges, as it has fairly distinct boundaries between touching cells.

We did use AGP in the past but switched to RNA later.

@shntnu
Collaborator Author

shntnu commented May 19, 2020

What do Cell Painting features mean? Learn more here.

@jatinarora-upmc
Collaborator

jatinarora-upmc commented Jun 12, 2020

@shntnu the number of adjacent neighbors (Cells_Neighbors_NumberOfNeighbors_Adjacent) for isolated cells is 0, but Cells_Neighbors_PercentTouching_Adjacent is not 0. This is confusing, as both should be 0. Could you please help us understand this?
Keeping @sasgari in the loop as well.

@shntnu
Collaborator Author

shntnu commented Jun 12, 2020

I assume you are looking at single cells? Because that won't hold at the aggregate level.

For single-cell data, I looked at the sample of 4994 cells in this repo and found this anomaly only once. Do you see it more often? If so, I can probe further.

library(dplyr)
library(tidyr)

# Keep cells where exactly one of the two measurements is zero,
# i.e. where the two features disagree about whether the cell is isolated.
sampled_cells %>%
  select(
    Cells_Neighbors_NumberOfNeighbors_Adjacent,
    Cells_Neighbors_PercentTouching_Adjacent
  ) %>%
  filter(
    xor(
      Cells_Neighbors_NumberOfNeighbors_Adjacent == 0,
      Cells_Neighbors_PercentTouching_Adjacent == 0
    )
  ) %>%
  pivot_longer(everything())

name value
Cells_Neighbors_NumberOfNeighbors_Adjacent 0.0000000
Cells_Neighbors_PercentTouching_Adjacent 0.5747126

@jatinarora-upmc
Collaborator

Actually, I averaged the data from single cells to donor level for each plate individually, and Cells_Neighbors_PercentTouching_Adjacent is non-0 for isolated cells on all plates.

@shntnu
Collaborator Author

shntnu commented Jun 12, 2020

Actually, I averaged the data from single cells to donor level for each plate individually, and Cells_Neighbors_PercentTouching_Adjacent is non-0 for isolated cells on all plates.

That's definitely odd, but I wonder if it might be something in your code? As you see below, the anomaly occurs only once among the 288 cells with no adjacent neighbors (I can't explain that without more digging, but it is certainly a rare event; < 0.5% in this sample).

# Total number of sampled cells
sampled_cells %>% tally()

n
4994

# Among cells with no adjacent neighbors, count each distinct PercentTouching value
sampled_cells %>%
  select(
    Cells_Neighbors_NumberOfNeighbors_Adjacent,
    Cells_Neighbors_PercentTouching_Adjacent
  ) %>%
  filter(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0) %>%
  group_by(Cells_Neighbors_PercentTouching_Adjacent) %>%
  tally()

Cells_Neighbors_PercentTouching_Adjacent n
0.0000000 287
0.5747126 1

@jatinarora-upmc
Collaborator

You are right. I checked one plate, cmqtlpl261-2019, and it has ~22k isolated cells (Cells_Neighbors_NumberOfNeighbors_Adjacent == 0), of which 145 cells show the anomaly (Cells_Neighbors_PercentTouching_Adjacent != 0); these are present in all donors/cell lines. When I average the single-cell-level features to donor level, Cells_Neighbors_PercentTouching_Adjacent becomes non-0. So, all set for now. BTW, what is the reason for this anomaly?
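The effect described here, where a handful of anomalous cells makes the donor-level average non-zero, can be reproduced on a toy table. The donor labels and the `Metadata_Donor` column below are hypothetical; only the two feature names and the value 0.5747126 come from this thread.

```r
# Hypothetical isolated cells (NumberOfNeighbors_Adjacent == 0) from two donors.
isolated <- data.frame(
  Metadata_Donor = c("donor_1", "donor_1", "donor_1", "donor_1", "donor_2", "donor_2"),
  Cells_Neighbors_PercentTouching_Adjacent = c(0, 0, 0, 0.5747126, 0, 0)
)

# Donor-level mean: a single non-zero cell is enough to make it non-zero.
donor_mean <- aggregate(
  Cells_Neighbors_PercentTouching_Adjacent ~ Metadata_Donor,
  data = isolated, FUN = mean
)

# Fraction of isolated cells per donor that show the anomaly.
donor_anomaly_rate <- aggregate(
  Cells_Neighbors_PercentTouching_Adjacent ~ Metadata_Donor,
  data = isolated, FUN = function(x) mean(x != 0)
)

print(donor_mean)
print(donor_anomaly_rate)
```

Reporting the anomaly rate next to the mean makes it easier to see that a non-zero average can come from very few cells.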

@shntnu
Collaborator Author

shntnu commented Jun 13, 2020

I think it's to do with their position. This one cell seems to lie on the edge of the image and something funky must be happening to the calculation of the percentage. You can safely ignore this case (i.e. consider Cells_Neighbors_NumberOfNeighbors_Adjacent to be correct, and ignore Cells_Neighbors_PercentTouching_Adjacent)

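library(ggplot2)

# For cells with no adjacent neighbors, compare the X position of cells
# with and without a zero PercentTouching value.
sampled_cells %>%
  filter(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0) %>%
  ggplot(aes(Cells_Neighbors_PercentTouching_Adjacent == 0,
             Cells_Location_Center_X)) +
  geom_boxplot()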

image

@bethac07
Contributor

bethac07 commented Jun 15, 2020 via email

@shntnu
Collaborator Author

shntnu commented Jun 15, 2020

Thanks @bethac07

Here's the MeasureObjectNeighbors documentation for our reference.

NumberOfNeighbors: Number of neighbor objects.
PercentTouching: Percent of the object’s boundary pixels that touch neighbors, after the objects have been expanded to the specified distance. Note that this measurement is only available if you use the same set of objects for both objects and neighbors.

@jatinarora-upmc Given that this is an edge case (literally as well!), it doesn't really matter how we handle it. But if you wanted to be really rigorous, you'd modify the definition of isolated to be Cells_Neighbors_PercentTouching_Adjacent == 0
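A small base R sketch of the two candidate definitions of "isolated", using a made-up table (the values are illustrative; only the feature names come from this thread):

```r
# Hypothetical cells; the second row mimics the edge-of-image anomaly.
cells <- data.frame(
  Cells_Neighbors_NumberOfNeighbors_Adjacent = c(0, 0, 2, 0, 1),
  Cells_Neighbors_PercentTouching_Adjacent = c(0, 0.5747126, 35.2, 0, 12.5)
)

# Definition by neighbor count: keeps the anomalous cell.
isolated_by_count <- cells[cells$Cells_Neighbors_NumberOfNeighbors_Adjacent == 0, ]

# Stricter definition by touching percentage: drops the anomalous cell.
isolated_strict <- cells[cells$Cells_Neighbors_PercentTouching_Adjacent == 0, ]

print(nrow(isolated_by_count))
print(nrow(isolated_strict))
```

With the stricter definition, any average of Cells_Neighbors_PercentTouching_Adjacent over isolated cells is zero by construction.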

@shntnu
Collaborator Author

shntnu commented Jun 23, 2020

Soumya had asked what Zernike features mean. Here are my quick notes that I sent via email.


Briefly, these features represent subtle properties of shape, and the higher the index, the more nuanced the shape (e.g. Zernike 9 is more nuanced than Zernike 8).

Less briefly:
You can represent any 2D function on the unit disk as a linear combination of the orthogonal basis functions defined by the Zernike polynomials (shown all the way below). Both a cell and its shape can be thought of as 2D functions.

Take any cell below, and

  1. look at the top image: you can think of this as a 2D function where X,Y is the location of a pixel, and f(X,Y) is the brightness of the pixel.
  2. look at the corresponding cell in the bottom image: you can also think of this as a 2D function where X,Y is the location of a pixel, and f(X,Y) is 1 if you are inside the cell and 0 if you are outside the cell.

Cells_AreaShape_Zernike_9_1 is a shape feature, so what gets decomposed is the shape: a binary (0 or 1) 2D function that is broken into its components using the Zernike basis. CellProfiler does that for you and gives you the coefficients as shape features.

Another intuition that's helpful is the regular notion of moments in stats: you can use higher-order moments to describe more nuanced aspects of a distribution; same thing with shape.

Yet another (precise) intuition is that you are doing a series expansion of a 2D function in an orthogonal polynomial basis, analogous to a power series or Fourier series expansion.
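To make the decomposition concrete, here is a minimal base R sketch that projects a binary mask onto a single Zernike basis function and returns the magnitude of the coefficient. It uses the standard textbook definition of the Zernike polynomials on a unit disk covering the whole image; the function names, the grid size, and the test shapes are assumptions for illustration. CellProfiler's own computation differs in how it fits the disk to each object and how it normalizes, so these numbers are not expected to match CellProfiler's feature values.

```r
# Radial part R_n^m(rho) of the Zernike polynomial (standard textbook definition).
zernike_radial <- function(n, m, rho) {
  m <- abs(m)
  k <- 0:((n - m) / 2)
  coefs <- (-1)^k * factorial(n - k) /
    (factorial(k) * factorial((n + m) / 2 - k) * factorial((n - m) / 2 - k))
  # Sum the terms coefs[k] * rho^(n - 2k) for every pixel.
  as.vector(outer(rho, n - 2 * k, `^`) %*% coefs)
}

# Magnitude of the (n, m) Zernike coefficient of a square binary mask (1 inside, 0 outside).
zernike_magnitude <- function(mask, n, m) {
  stopifnot(nrow(mask) == ncol(mask), n >= abs(m), (n - abs(m)) %% 2 == 0)
  size <- nrow(mask)
  # Map pixel centers onto [-1, 1] x [-1, 1].
  coords <- (seq_len(size) - (size + 1) / 2) / (size / 2)
  x <- matrix(coords, size, size, byrow = TRUE)
  y <- matrix(coords, size, size)
  rho <- sqrt(x^2 + y^2)
  theta <- atan2(y, x)
  inside <- rho <= 1
  basis <- zernike_radial(n, m, rho[inside]) * exp(-1i * m * theta[inside])
  pixel_area <- (2 / size)^2
  Mod(sum(mask[inside] * basis) * pixel_area * (n + 1) / pi)
}

# Two toy shapes on a 64 x 64 grid: a centered circle and an elongated ellipse.
size <- 64
coords <- (seq_len(size) - (size + 1) / 2) / (size / 2)
x <- matrix(coords, size, size, byrow = TRUE)
y <- matrix(coords, size, size)
circle <- (x^2 + y^2 <= 0.6^2) * 1
ellipse <- ((x / 0.8)^2 + (y / 0.4)^2 <= 1) * 1

z00_circle <- zernike_magnitude(circle, 0, 0)
z22_circle <- zernike_magnitude(circle, 2, 2)
z22_ellipse <- zernike_magnitude(ellipse, 2, 2)
print(c(z00_circle = z00_circle, z22_circle = z22_circle, z22_ellipse = z22_ellipse))
```

For the centered circle, the (2, 2) term should be close to zero because the shape has no preferred direction, while the elongated ellipse gives a clearly non-zero value. That is the sense in which different Zernike terms pick up different aspects of shape.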



@AnneCarpenter's explanation from #63 (comment)

Q2: Here is a guide to the Zernikes: https://en.wikipedia.org/wiki/File:Zernike_polynomials2.png
Zernike0_0 should honestly have almost perfect correlation with one of the more commonly named shape metrics because it's really asking whether the cell matches a circle shape. For 3_1 you look at that pyramid for the one that says Z with a 1 on top and a 3 on the bottom (I think). You can see it has a red and blue stripe at the edges, and a red and blue blob in the middle. What this means: picture the shape of the cell superimposed on top… it will score high for this Zernike the more blue is covered and the more red you see. Our cells aren't allowed to have holes in them, so I can imagine two cell shapes that would score highly: one is almost a perfect circle but just a little flattened at the red side. The other would be almost a crescent such that the middle red blob is exposed (but it’s not a great fit because a big chunk wouldn’t align well).
6_4 isn’t shown but you can follow the right hand side of the pyramid and see it would be mostly a circle with wiggly edges (probably not far off from a circle!). I'm a bit surprised that they'd be anticorrelated to 0_0, really.

@jatinarora-upmc
Collaborator

Hi @shntnu, I was wondering if I could skip the RadialDistribution features (all or some, such as FracAtD), as they show the distribution of total intensity, but I cannot decide since I don’t have much functional interpretation of these features. What would be your recommendation?

@shntnu
Collaborator Author

shntnu commented Jul 13, 2020

RadialDistribution features have been pretty informative in past experiments, so I would not advise dropping them. See https://forum.image.sc/t/radial-distribution-module/17272 for an explanation.
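For some intuition about what a FracAtD-style feature captures, here is a rough base R sketch that splits an object into concentric rings and reports the fraction of total intensity in each ring. The function name `frac_at_d`, the normalized `radius` input (0 at the object center, 1 at its edge), and the toy values are assumptions for illustration; this is not CellProfiler's implementation.

```r
# Fraction of an object's total intensity that falls in each concentric ring.
# radius: normalized distance of each pixel from the object center (0 = center, 1 = edge).
frac_at_d <- function(intensity, radius, n_bins = 4) {
  bin <- cut(radius, breaks = seq(0, 1, length.out = n_bins + 1),
             include.lowest = TRUE, labels = FALSE)
  ring_total <- tapply(intensity, factor(bin, levels = seq_len(n_bins)), sum)
  # Rings that contain no pixels contribute zero intensity.
  ring_total[is.na(ring_total)] <- 0
  as.vector(ring_total) / sum(intensity)
}

# Toy object with six pixels.
radius <- c(0.1, 0.2, 0.3, 0.6, 0.7, 0.9)
intensity <- c(4, 4, 2, 4, 2, 4)
ring_fractions <- frac_at_d(intensity, radius)
print(ring_fractions)
```

The fractions sum to 1, so a stain that moves from the center of the cell toward its edge shows up as a shift of weight from the inner rings to the outer rings.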
