partition.pl vs partition_gmap.pl vs partition_gmap.py #165

Open
diriano opened this issue Jul 15, 2023 · 1 comment


diriano commented Jul 15, 2023

Dear @tangerzhang and @tanghaibao,

I see some inconsistencies among partition.pl, partition_gmap.pl, and partition_gmap.py, and I would like to know whether they are intended or an error.

In partition_gmap.pl (d5bb1e5), line 65 reads:

next if(exists($rdb{$rname})); ### only retain single-end reads

The comment is misleading: this line does not retain single-end reads; it only keeps the first alignment of a given read.
The same line is present in partition.pl (b49ddea) line 51.
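To make the effect of that line concrete, here is a minimal Python sketch of the same logic. It assumes that `%rdb` records each read name once it has been seen; the function name and the sample records are hypothetical and are not taken from the scripts.

```python
def first_alignment_per_read(records):
    """Mimic `next if(exists($rdb{$rname}))`: keep the first record per read name."""
    seen = set()  # plays the role of %rdb (assumption: it stores seen read names)
    kept = []
    for read_name, contig in records:
        if read_name in seen:
            continue  # same effect as the Perl `next`
        seen.add(read_name)
        kept.append((read_name, contig))
    return kept


# Hypothetical (read name, contig) records
records = [
    ("read1", "ctgA"),  # first mate of a pair: kept
    ("read1", "ctgB"),  # second mate of the same pair: dropped
    ("read2", "ctgC"),  # a read with a single alignment: kept
]
kept = first_alignment_per_read(records)
print(kept)
```

In this sketch the second alignment of `read1` is discarded regardless of whether the read is paired or single-end, which is why the comment in the script does not seem to describe what the line does.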

Nothing like that is present in partition_gmap.py (80ce6a9). This, I understand, would be the correct behavior, keeping all alignments from contigs originating from the same chromosome.

I have been using partition.pl with an 11 Gb sugarcane genome and a 250 GB pruned BAM; it ran out of memory on a machine with 0.5 TB of RAM. I have rewritten partition.pl to use much less memory, at least 20x less than your version. This version uses BioPerl to index the assembled genome and makes a single pass through the streamed BAM file (without loading it into memory); it is available here. There is also another version that streams the BAM file as well. Both work only on the set of contigs in a given chromosome.
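The single-pass idea can be sketched in Python as follows. This is not the rewritten Perl script, only an illustration under stated assumptions: the input is SAM text (for example, the output of `samtools view -h`), and a record is kept when both its contig and its mate's contig belong to the chromosome's contig set. The function name and the sample data are hypothetical.

```python
import io


def stream_chromosome_records(sam_lines, contigs):
    """Yield SAM records whose contig and mate contig are both in `contigs`.

    Lines are handled one at a time, so memory use does not grow with input size.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        rname, rnext = fields[2], fields[6]  # RNAME and RNEXT columns
        if rnext == "=":
            rnext = rname  # SAM shorthand: mate maps to the same reference
        if rname in contigs and rnext in contigs:
            yield line


# Tiny in-memory stand-in for a streamed SAM file
sam = io.StringIO(
    "@SQ\tSN:ctgA\tLN:1000\n"
    "r1\t65\tctgA\t10\t60\t50M\tctgB\t200\t0\t*\t*\n"
    "r2\t65\tctgA\t30\t60\t50M\tctgZ\t500\t0\t*\t*\n"
    "r3\t65\tctgB\t40\t60\t50M\t=\t90\t0\t*\t*\n"
)
kept = [rec.split("\t")[0] for rec in stream_chromosome_records(sam, {"ctgA", "ctgB"})]
print(kept)
```

Because the function is a generator over any iterable of lines, it could be fed from a pipe instead of a file held in memory; only the contig set has to stay resident.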

I would appreciate your comments on this.
Thanks a lot in advance.
Best,
Diego


diriano commented Jul 15, 2023

Another point: in partition.pl, line 76 reads

next if($count>1);

This only allows a contig to be assigned to one and only one chromosome from the reference. Each time the script is run, a different chromosome could be chosen. Perhaps a better way to choose would be to take the one with the highest number of hits, similar to what is done in partition_gmap.py in lines 53-57.
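A minimal Python sketch of the "highest number of hits" idea is shown here. It is not the code from partition_gmap.py; the function name, the input format (pairs of contig and chromosome), and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def assign_contigs(hits):
    """Assign each contig to the chromosome with the most hits.

    `hits` is an iterable of (contig, chromosome) pairs. Ties are broken by
    chromosome name (an assumed rule) so the result is the same on every run.
    """
    counts = {}
    for contig, chrom in hits:
        counts.setdefault(contig, Counter())[chrom] += 1
    return {
        contig: min(per_chrom.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for contig, per_chrom in counts.items()
    }


# Hypothetical hits
hits = [
    ("ctg1", "Chr01"),
    ("ctg1", "Chr02"),
    ("ctg1", "Chr02"),  # Chr02 has the most hits for ctg1
    ("ctg2", "Chr03"),
    ("ctg3", "Chr05"),
    ("ctg3", "Chr04"),  # tie for ctg3: resolved by name
]
assignment = assign_contigs(hits)
print(assignment)
```

Sorting on `(-count, name)` keeps the choice independent of iteration order, so repeated runs on the same input give the same assignment.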
Any thoughts?
Thanks
