Error in step 8 of run_genespace() #171

melop · 2024-10-23T03:27:27Z

Thank you for developing genespace! I have been using it in many projects.
A similar dataset previously ran fine, but after I added another species the following error occurred:

    ############################
    8. Constructing syntenic pan-gene sets ...
    **WARNING**: genomes Aquifoliales_Ilex_paraguariensis, Escalloniales_Escallonia_herrerae have < 75% of genes on chromosomes that contain > 10 genes. Synteny is not a useful metric for these genomes. Be very careful with your pan-gene sets.
    Camellia_lanceoleosa : Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
      Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
    Calls: run_genespace ... merge -> merge.data.table -> [ -> [.data.table -> vecseq
    Execution halted

These are 41 eudicot genomes that have a pretty deep divergence.
Thank you for any pointers.

LovellHAGSC · 2024-10-23T20:03:49Z

I have seen this issue before. It can be caused by a couple of different (but rare) situations, typically when a single gene is assigned to a great many blocks (maybe the plants you are working with have gone through a lot of WGDs?) while ploidy is set to 1. Is it possible that some of your genomes are flagged as ploidy = 1 in init_genespace but have several WGDs on the branch to other genomes in the phylogeny? If so, try resetting the ploidy parameter to accurately reflect the WGDs in the phylogeny. If not, I don't have a good solution to this problem, but we can probably get what you need out of the software. Are you just hoping to get a riparian plot out of it, or do you want the pan-gene sets too?
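To make the "one gene in many blocks" failure mode concrete, here is a minimal Python sketch of how duplicated keys inflate a join. It is not GENESPACE or data.table code: the function name `estimated_join_rows` and the toy gene IDs are hypothetical. It relies only on the fact that an inner equi-join yields, for each key, the product of that key's counts on the two sides.

```python
from collections import Counter

# Row limit quoted in the data.table error message above.
INT32_LIMIT = 2**31


def estimated_join_rows(left_keys, right_keys):
    """Rows an inner equi-join would produce: sum over keys of n_left * n_right."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sum(n * right[key] for key, n in left.items() if key in right)


# Hypothetical toy data: gene "g1" is assigned to three blocks on each side.
left_keys = ["g1", "g1", "g1", "g2"]
right_keys = ["g1", "g1", "g1", "g2", "g3"]

rows = estimated_join_rows(left_keys, right_keys)
print(rows)                # 3*3 + 1*1 = 10
print(rows > INT32_LIMIT)  # False for this toy input
```

A key repeated 50,000 times on both sides would already produce 2.5 billion rows, which exceeds the 2^31 limit, so a modest number of heavily duplicated genes could plausibly be enough to trigger the error.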

melop · 2024-10-24T09:50:21Z

I am hoping to use the pan-gene sets too. The situation is that WGD definitely happened, but the plants then went through rediploidization.

LovellHAGSC · 2024-10-24T15:09:54Z

I see. It might be worth reading the very last methods section in the genespace paper, which details why using deeply diverged genomes with histories of nested whole genome duplications can be challenging. The take-home is that even if your genomes have apparently diploidized, they almost always contain syntenic blocks from both homeologs. So, you have to set your ploidy to reflect the WGD history.
For example, if you compare Arabidopsis to common bean, which are both diploid species with haploid assemblies, you would treat common bean as 2x (one WGD) and Arabidopsis as 4x (two nested WGDs) for syntenic comparisons. Indeed, you get 2x - 4x dotplots in this comparison.
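The arithmetic behind that example can be sketched in Python as follows. This is an illustration only, under the assumption that each nested event multiplies the expected copy number (2 for a duplication, 3 for a triplication). The helper `expected_copy_number` and the genome labels are hypothetical; the resulting numbers are what one would supply as the ploidy values in init_genespace.

```python
from math import prod


def expected_copy_number(event_multiplicities):
    """Product of per-event multiplicities (2 = duplication, 3 = triplication).

    An empty history gives 1, i.e. a genome that is truly 1x.
    """
    return prod(event_multiplicities)


# WGD counts are taken from the comment above; the labels are illustrative.
wgd_history = {
    "common_bean": [2],     # one WGD -> 2x
    "arabidopsis": [2, 2],  # two nested WGDs -> 4x
}

ploidy = {genome: expected_copy_number(events)
          for genome, events in wgd_history.items()}
print(ploidy)  # {'common_bean': 2, 'arabidopsis': 4}
```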

LovellHAGSC · 2024-10-24T15:12:23Z

In your case, I bet there are no genomes that are truly 1x relative to all the others. This makes the underlying graph structure very complex and causes this particular error. I have yet to be able to recreate it myself, but likely this is because I haven't tried as complex a run as you have tried here.
So, long story short, I won't be able to resolve this issue. I'd suggest either running GENESPACE on smaller subsets of genomes or including a genome that is truly 1x in the run and giving the ploidy as the phylogenetically expected copy number given the WGDs in your set of genomes.

LovellHAGSC · 2024-10-24T21:19:32Z

Oh, I also didn't see this note: "Aquifoliales_Ilex_paraguariensis, Escalloniales_Escallonia_herrerae have < 75% of genes on chromosomes that contain > 10 genes. Synteny is not a useful metric for these genomes. Be very careful with your pan-gene sets. Camellia_lanceoleosa ..." Do you have some genomes that are not scaffolded?
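One way to screen genomes for this before a run is to compute the same kind of statistic the warning describes. The Python sketch below follows the wording of the warning (fraction of genes on sequences that carry more than 10 genes, flagged below 75%). It is not GENESPACE's own implementation, and the function name and toy sequence IDs are hypothetical. The input is one sequence ID per gene, for example the first column of a gene BED file.

```python
from collections import Counter


def fraction_on_large_sequences(gene_seqids, min_genes=10):
    """Fraction of genes located on sequences that carry more than min_genes genes."""
    counts = Counter(gene_seqids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(n for n in counts.values() if n > min_genes)
    return kept / total


# Hypothetical assembly: 2 chromosomes with 40 genes each,
# plus 30 small scaffolds with 2 genes each.
seqids = (["chr1"] * 40 + ["chr2"] * 40
          + [f"scaf{i}" for i in range(30) for _ in range(2)])

frac = fraction_on_large_sequences(seqids)
print(round(frac, 3))  # 80 / 140 = 0.571
print(frac < 0.75)     # True: this toy genome would be flagged
```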

melop · 2024-10-25T03:15:54Z

Thank you for the explanation! Right - it seems that plants underwent many rounds of WGD and indeed most of these genomes are probably ancient tetraploids of some sort. It makes sense. Right, I noticed these notes too. After removing three species that are unscaffolded, the run was ok.

LovellHAGSC · 2024-10-29T23:05:22Z

Good to hear!
