Clarifications about Cell Painting data #32

shntnu · 2020-03-27T23:19:01Z

This thread is to address general questions about Cell Painting data. Discuss dataset-specific and analysis-specific issues in a separate thread.

@sasgari asked:

Besides the missing features, some features have zero or negative values. I am not sure whether these zeros and/or negative values are valid measurements or should be treated as missing.

cc @jatinarora-upmc

shntnu · 2020-03-27T23:23:03Z

@sasgari asked - Besides the missing features, some features have zero or negative values. I am not sure whether these zeros and/or negative values are valid measurements or should be treated as missing.

Feature values can indeed be negative. In fact, they can have very different distributions at the single-cell level; e.g., see this figure.

sasgari · 2020-03-28T14:02:06Z

Thanks @shntnu!

shntnu · 2020-03-28T16:30:28Z

@sasgari Note that this is for single-cell level data, of course. Aggregated or "pseudo-bulk" profiles will have a different distribution (they follow the sampling distributions of the corresponding statistics, e.g. mean or median).
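To make the single-cell versus aggregate distinction concrete, here is a minimal Python sketch. The feature values are simulated (a shifted log-normal chosen only because it is skewed and can go negative), and the well and cell counts are hypothetical; none of this comes from the actual dataset.

```python
import random
from statistics import median, stdev

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_cell():
    """Hypothetical single-cell feature value: skewed, and may be negative."""
    return random.lognormvariate(0.0, 1.0) - 1.0


# Assumption: 200 wells with 500 cells each (made-up numbers).
wells = [[simulate_cell() for _ in range(500)] for _ in range(200)]

# Single-cell level: every cell contributes one value.
single_cell_values = [value for well in wells for value in well]

# Aggregated ("pseudo-bulk") level: one median per well.
well_medians = [median(well) for well in wells]

print("single-cell spread:", round(stdev(single_cell_values), 3))
print("well-median spread:", round(stdev(well_medians), 3))
```

The per-well medians cluster much more tightly than the single-cell values, which is why checks that hold cell by cell need not hold after aggregation.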

shntnu · 2020-04-06T13:25:02Z

@jatinarora-upmc asked - what are Costes features?

These are features used to measure the correlation between channels (in Cell Painting, each channel corresponds to one stain, except for the AGP channel, which corresponds to two stains).

There are many methods to measure correlation between channels. Costes' method evaluates the correlation in pixels below each threshold in the data, and then selects either the threshold with the minimum correlation or the highest threshold with a non-positive correlation (from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5200903/).
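The threshold search described above can be sketched roughly as follows. This is a deliberately simplified illustration, not CellProfiler's implementation: it applies one shared threshold to both channels (published versions of the method tie the two channel thresholds together through a regression line), and the function names, pixel values, and candidate thresholds are all made up.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    if len(xs) < 2:
        return None
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def costes_like_threshold(ch1, ch2, candidates):
    """Highest candidate threshold whose below-threshold pixels are not positively correlated.

    Simplification (assumption): the same threshold is applied to both channels.
    """
    for t in sorted(candidates, reverse=True):
        below = [(a, b) for a, b in zip(ch1, ch2) if a < t and b < t]
        if len(below) < 2:
            continue
        r = pearson([a for a, _ in below], [b for _, b in below])
        if r is not None and r <= 0:
            return t
    return None


# Toy pixel intensities: dim background is anti-correlated, bright signal is correlated.
ch1 = [1, 2, 3, 4, 10, 20, 30, 40]
ch2 = [4, 3, 2, 1, 10, 20, 30, 40]
threshold = costes_like_threshold(ch1, ch2, candidates=[5, 15, 25, 45])
print("selected threshold:", threshold)
```

Pixels above the selected threshold are then treated as signal when computing the colocalization measurements.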

shntnu · 2020-04-06T13:30:19Z

@jatinarora-upmc asked - The nucleus is identified using the DNA channel, but the cell is identified using the nucleus and the cytoplasmic RNA channel. I wonder why cells are not identified using the plasma membrane channel?

From: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5223290/:

First, the nuclei are identified from the Hoechst image because it is a high-contrast stain for a well-separated organelle; subsequently, the nucleus along with an appropriate channel is used to delineate the cell body [49]. We have found the SYTO 14 image is the most amenable for finding cell edges, as it has fairly distinct boundaries between touching cells.

We did use AGP in the past but switched to RNA later.

shntnu · 2020-05-19T15:52:18Z

What do Cell Painting features mean? Learn more here.

jatinarora-upmc · 2020-06-12T16:14:19Z

@shntnu the number of adjacent neighbors (Cells_Neighbors_NumberOfNeighbors_Adjacent) for isolated cells is 0, but Cells_Neighbors_PercentTouching_Adjacent is not 0. This is confusing, as both should be 0. Could you please help us understand this?
Keeping @sasgari in the loop as well.

shntnu · 2020-06-12T17:07:29Z

I assume you are looking at single cells? Because that won't hold at the aggregate level.

For single-cell data, I looked at the sample of 4994 cells in this repo and found that this anomaly occurs only once. Do you see it more often? If so, I can probe further.

sampled_cells %>%
  select(
    Cells_Neighbors_NumberOfNeighbors_Adjacent,
    Cells_Neighbors_PercentTouching_Adjacent
  ) %>%
  filter(
    xor(
      Cells_Neighbors_NumberOfNeighbors_Adjacent == 0,
      Cells_Neighbors_PercentTouching_Adjacent == 0
    )
  ) %>%
  pivot_longer(everything())

name	value
Cells_Neighbors_NumberOfNeighbors_Adjacent	0.0000000
Cells_Neighbors_PercentTouching_Adjacent	0.5747126

jatinarora-upmc · 2020-06-12T17:57:46Z

Actually, I averaged the data from single cells to donor level for each plate individually, and Cells_Neighbors_PercentTouching_Adjacent is non-zero for isolated cells on all plates.

shntnu · 2020-06-12T20:29:03Z

@jatinarora-upmc wrote - Actually, I averaged the data from single cells to donor level for each plate individually, and Cells_Neighbors_PercentTouching_Adjacent is non-zero for isolated cells on all plates.

That's definitely odd, but I wonder if it might be something in your code? As you see below, that anomaly occurs only once among the 288 cells with no adjacent neighbors (I can't explain that without more digging, but it is certainly a rare event; < 0.5% in this sample).

sampled_cells %>% tally()

n
4994

sampled_cells %>%
  select(
    Cells_Neighbors_NumberOfNeighbors_Adjacent,
    Cells_Neighbors_PercentTouching_Adjacent
  ) %>%
  filter(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0) %>%
  group_by(Cells_Neighbors_PercentTouching_Adjacent) %>%
  tally()

Cells_Neighbors_PercentTouching_Adjacent	n
0.0000000	287
0.5747126	1

jatinarora-upmc · 2020-06-12T22:25:42Z

You are right. I checked one plate, cmqtlpl261-2019: it has ~22k isolated cells (Cells_Neighbors_NumberOfNeighbors_Adjacent == 0), of which 145 show the anomaly (Cells_Neighbors_PercentTouching_Adjacent != 0), and these are present in all donors/cell lines. When I average the single-cell features to donor level, Cells_Neighbors_PercentTouching_Adjacent therefore becomes non-zero. So, all set for now. BTW, what is the reason for this anomaly?
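The arithmetic behind that observation can be checked with the counts from the 4994-cell sample earlier in this thread (287 isolated cells at exactly 0 and one at 0.5747126). This is a small Python sketch; the variable names are illustrative only.

```python
from statistics import mean, median

# Counts taken from the tally shown earlier in the thread:
# 287 isolated cells with PercentTouching == 0, and 1 anomalous cell.
percent_touching = [0.0] * 287 + [0.5747126]

group_mean = mean(percent_touching)
group_median = median(percent_touching)

# A single non-zero cell is enough to make the mean non-zero,
# while the median of the same values stays at exactly 0.
print("mean:", group_mean)
print("median:", group_median)
```

The mean comes out at roughly 0.002, so a non-zero averaged value for isolated cells is expected even when the anomaly is rare.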

shntnu · 2020-06-13T01:03:54Z

I think it has to do with position. This one cell seems to lie on the edge of the image, and something funky must be happening in the calculation of the percentage. You can safely ignore this case (i.e., consider Cells_Neighbors_NumberOfNeighbors_Adjacent to be correct, and ignore Cells_Neighbors_PercentTouching_Adjacent).

sampled_cells %>% 
  filter(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0) %>%
  ggplot(aes(Cells_Neighbors_PercentTouching_Adjacent == 0,
             Cells_Location_Center_X)) + 
  geom_boxplot()

bethac07 · 2020-06-15T15:25:37Z

So based on this conversation, this is my guess: those cells DO have neighbors, but those neighbors are cells that are ultimately excluded for touching the edge of the image. So the cell does indeed have 1) some % of its border touching another cell but also 2) 0 "accepted" neighbors.


shntnu · 2020-06-15T22:53:23Z

Thanks @bethac07

Here's the MeasureObjectNeighbors documentation for our reference.

NumberOfNeighbors: Number of neighbor objects.
PercentTouching: Percent of the object’s boundary pixels that touch neighbors, after the objects have been expanded to the specified distance. Note that this measurement is only available if you use the same set of objects for both objects and neighbors.

@jatinarora-upmc Given that this is an edge case (literally as well!), it doesn't really matter how we handle it. But if you wanted to be really rigorous, you'd modify the definition of isolated to be Cells_Neighbors_PercentTouching_Adjacent == 0
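The two candidate definitions of "isolated" can be compared side by side. Below is a minimal Python sketch on a hand-made table: the two column names are the real feature names, but the `id` field and all values except 0.5747126 are hypothetical.

```python
NEIGHBORS = "Cells_Neighbors_NumberOfNeighbors_Adjacent"
TOUCHING = "Cells_Neighbors_PercentTouching_Adjacent"

# Hypothetical cells; only the 0.5747126 value is taken from the thread.
cells = [
    {"id": 1, NEIGHBORS: 0, TOUCHING: 0.0},
    {"id": 2, NEIGHBORS: 2, TOUCHING: 31.5},
    {"id": 3, NEIGHBORS: 0, TOUCHING: 0.5747126},  # the edge-case pattern
    {"id": 4, NEIGHBORS: 1, TOUCHING: 12.0},
]

# Definition A: no accepted adjacent neighbors.
isolated_by_count = [c["id"] for c in cells if c[NEIGHBORS] == 0]

# Definition B (stricter): no boundary pixels touching any neighbor.
isolated_by_touching = [c["id"] for c in cells if c[TOUCHING] == 0]

# Cells where the two definitions disagree.
disagreement = sorted(set(isolated_by_count) ^ set(isolated_by_touching))

print("isolated by count:", isolated_by_count)
print("isolated by touching:", isolated_by_touching)
print("disagreement:", disagreement)
```

Only the cell that touches an excluded neighbor is classified differently, which matches how rare the anomaly is in the sampled data.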

shntnu · 2020-06-23T20:50:22Z

Soumya had asked what Zernike features mean. Here are my quick notes that I sent via email.

Briefly, these features represent subtle properties of shape, and the higher the index, the more nuanced the shape (e.g. Zernike 9 is more nuanced than Zernike 8).

Less briefly:
You can represent any 2D function (defined on the unit disk) as a linear combination of the Zernike polynomials, which form an orthogonal basis (all the way below). Both a cell and its shape can be thought of as a 2D function.

Take any cell below, and

Look at the top image: you can think of this as a 2D function where X,Y is the location of a pixel, and f(X,Y) is the brightness of the pixel.
Look at the corresponding cell in the bottom image: you can also think of this as a 2D function where X,Y is the location of a pixel, and f(X,Y) is 1 if you are inside the cell and 0 if you are outside the cell.

Cells_AreaShape_Zernike_9_1 is a shape feature, so you have shape – a binary (0 or 1) 2D function – that needs to be decomposed into its components using the Zernike basis. CellProfiler does that for you and gives you the coefficients as shape features.

Another intuition that's helpful is the regular notion of moments in stats: you can use higher-order moments to describe more nuanced aspects of a distribution; same thing with shape.

Yet another (precise) intuition is that you are doing a power series expansion of a 2D function.
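To illustrate the decomposition described above, here is a rough Python sketch that projects a binary mask onto one Zernike basis function and returns the magnitude of the coefficient. It uses the textbook radial polynomial and an (n+1)/pi normalization; CellProfiler's exact scaling and its mapping of a cell onto the unit disk may differ, so treat the function names, grid size, and masks as assumptions for illustration.

```python
import cmath
import math
from math import factorial


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_n^m(rho), for 0 <= m <= n and n - m even."""
    return sum(
        (-1) ** k * factorial(n - k)
        / (factorial(k) * factorial((n + m) // 2 - k) * factorial((n - m) // 2 - k))
        * rho ** (n - 2 * k)
        for k in range((n - m) // 2 + 1)
    )


def zernike_magnitude(mask, n, m):
    """|Z_nm| of a square binary mask whose pixel centers are mapped onto [-1, 1]^2."""
    if not (0 <= m <= n) or (n - m) % 2 != 0:
        raise ValueError("need 0 <= m <= n with n - m even")
    size = len(mask)
    pixel_area = (2.0 / size) ** 2
    total = 0j
    for row in range(size):
        y = (2 * row + 1) / size - 1
        for col in range(size):
            x = (2 * col + 1) / size - 1
            rho = math.hypot(x, y)
            if rho <= 1 and mask[row][col]:  # only pixels inside the unit disk count
                theta = math.atan2(y, x)
                total += radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    return abs(total) * (n + 1) / math.pi * pixel_area


SIZE = 64
full_disk = [[1] * SIZE for _ in range(SIZE)]  # clipped to the disk inside the function
half_disk = [[1 if col >= SIZE // 2 else 0 for col in range(SIZE)] for _ in range(SIZE)]

print("disk      |Z_0_0|:", round(zernike_magnitude(full_disk, 0, 0), 3))
print("disk      |Z_1_1|:", round(zernike_magnitude(full_disk, 1, 1), 3))
print("half disk |Z_1_1|:", round(zernike_magnitude(half_disk, 1, 1), 3))
```

A perfectly round shape loads almost entirely on the lowest-order term, while the lopsided half disk picks up a clearly non-zero 1_1 coefficient; higher indices capture progressively finer departures from a circle.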

@AnneCarpenter's explanation from #63 (comment)

Q2: Here is a guide to the Zernikes: https://en.wikipedia.org/wiki/File:Zernike_polynomials2.png
Zernike0_0 should honestly have almost perfect correlation with one of the more commonly named shape metrics because it's really asking whether the cell matches a circle shape. For 3_1 you look at that pyramid for the one that says Z with a 1 on top and a 3 on the bottom (I think). You can see it has a red and blue stripe at the edges, and a red and blue blob in the middle. What this means: picture the shape of the cell superimposed on top… it will score high for this Zernike the more blue is covered and the more red you see - our cells aren't allowed to have holes in them, so I can imagine two cell shapes that would score highly: one is almost a perfect circle but just a little flattened at the red side. The other would be almost a crescent such that the middle red blob is exposed (but it’s not a great fit because a big chunk wouldn’t align well).
6_4 isn’t shown but you can follow the right hand side of the pyramid and see it would be mostly a circle with wiggly edges (probably not far off from a circle!). I'm a bit surprised that they'd be anticorrelated to 0_0, really.

jatinarora-upmc · 2020-07-13T14:54:04Z

Hi @shntnu, I was wondering if I could skip the RadialDistribution features (all, or some such as FracAtD), as they show the distribution of total intensity. I cannot decide because I don’t have much functional interpretation of these features. What would be your recommendation?

shntnu · 2020-07-13T14:56:38Z

RadialDistribution features have been pretty informative in past experiments, so I would not advise dropping them. See https://forum.image.sc/t/radial-distribution-module/17272 for an explanation.
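For intuition about what a FracAtD-style feature captures, here is a simplified Python sketch: it bins pixels by their normalized distance from the object center and reports the fraction of total intensity in each ring. The function name, bin count, and pixel values are hypothetical, and CellProfiler's own computation of the normalized radius is more involved than what is assumed here.

```python
def frac_at_d(pixels, n_bins=4):
    """Fraction of total intensity per radial bin.

    pixels: iterable of (normalized_radius, intensity) pairs, with the radius
    already scaled to [0, 1] (0 = object center, 1 = object edge); assumption.
    """
    totals = [0.0] * n_bins
    for radius, intensity in pixels:
        bin_index = min(int(radius * n_bins), n_bins - 1)  # radius == 1 goes in the last bin
        totals[bin_index] += intensity
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0] * n_bins
    return [t / grand_total for t in totals]


# Hypothetical object whose stain is concentrated near the center.
pixels = [(0.1, 8.0), (0.2, 6.0), (0.4, 3.0), (0.6, 2.0), (0.9, 1.0)]
fractions = frac_at_d(pixels)
print("fraction of intensity per ring (center to edge):", fractions)
```

A stain that relocates from the center toward the periphery would shift these fractions toward the outer bins even if total intensity stays the same, which is one way such features can carry signal that plain intensity features miss.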

shntnu added the Discussion and Notes Documenting ideas/discussions label Mar 27, 2020

broadinstitute deleted a comment from jatinarora-upmc Apr 23, 2020

shntnu mentioned this issue Jul 20, 2020

July 20 2020 Discussions (cell count confounders, cell health predictions) #47

Closed

shntnu mentioned this issue Dec 10, 2020

Nov 2020 Discussions (associations with znf436) #63

Closed
