Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calculating insulation score get inconsistant results #350

Open
jiangshan529 opened this issue Mar 25, 2024 · 6 comments
Open

calculating insulation score get inconsistant results #350

jiangshan529 opened this issue Mar 25, 2024 · 6 comments

Comments

@jiangshan529
Copy link

Hi, I am using GENOVA to calculate insulation score. the code is

hic <- load_contacts(signal_path = './hela_s3_insitu_hic_hg38_4DNFIAS8LV1C_5k.cool')
insulation <- insulation_score(hic,
window = 25
)
data <- setNames(
insulation$insula_score[-4],
c("chrom", "start", "end", "value")
)
data <- data[is.finite(data$value),]
gr <- with(data, GRanges(chrom, IRanges(start + 1, end)))
score(gr) <- data$value
export.bedGraph(gr, here::here("my_bedgraph.bedGraph")).

Previously, when I use window 25 for resolution of 5k, it always gives me narrow windows of insulation score switching between negative and positive values. But this time when I use the same code for another file of 5k insulation(around 500M reads), it gives me very wide window, and the most negative values cannot align to TAD boundaries in hic map. Could you please give me some hints on what's the problem here?

image
the first track is the insulation score this time, the second track is previous calculated insulation score. Thanks!

@teunbrand
Copy link
Collaborator

Is this specific to this region or genome wide?
Also have you looked at the contact maps themselves and do they noticably differ near the diagonal?

@jiangshan529
Copy link
Author

Is this specific to this region or genome wide? Also have you looked at the contact maps themselves and do they noticably differ near the diagonal?

It's genome-wide. And I extracted matrix from some regions and plot the hic heatmap, it looks alright.

@teunbrand
Copy link
Collaborator

So the only difference between the top and bottom is the data itself, right? No differences in data resolution, preprocessing, analysis code such as parameters, versions of software, etc?

@jiangshan529
Copy link
Author

So the only difference between the top and bottom is the data itself, right? No differences in data resolution, preprocessing, analysis code such as parameters, versions of software, etc?

yeah, for the ins calculation itself, I also tested the code with my previous data, there's no difference. However, I also cannot find what' the difference for the data preprocessing itself. I use bwa mem to map, then pairtools for contact matrics, and then juicertools to generate .hic file, then hic2cool to extract 5kb cool file, then cooler balance to normalize the data, then use GENOVA to calculate ins. I also tried not to use cooler balance, but it still gives me the same wired ins value.

@teunbrand
Copy link
Collaborator

I sympathise with the detective work, but if this is a data issue and not a software issue, there is little I can do. There isn't anything obvious in the tracks that I recognise as a particular problem with a clear solution.

@jiangshan529
Copy link
Author

I sympathise with the detective work, but if this is a data issue and not a software issue, there is little I can do. There isn't anything obvious in the tracks that I recognise as a particular problem with a clear solution.

Hi, Teun. Now I have a hic file with 1.7 billion reads, it repots error when i calculate ins score:

insulation <- insulation_score(hic, window = 25)
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 772086250 rows; more than 46325175 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
hic
A GENOVA contacts object named 'wt_merge_all_1.48B_UU_dedup.pairs.mcool' at a resolution of 5 kb.
Contains the following slots:

  • MAT : Triplet format matrix containing 676777675 informative bins.
  • IDX : 617669 genomic indices in BED format.
  • CHRS : A vector of 24 chromosome names.
  • CENTROMERES: Locations of 24 centromeres.
    This object is assigned the colour 'black'
    0 bins are masked.
    Some chromosomes have been removed.
    The data have not been Z-score normalised.
    The original data were loaded in as balanced data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants