restart from vlsv does not work #2

Open
rjarvinen opened this issue Jan 9, 2018 · 7 comments

Comments

@rjarvinen
Member

Restarting RHybrid from a VLSV file produces the following error:
(VLSV) ERROR: VLSV::getMPIDatatype called with datatype::UNKNOWN datatype, returning MPI_DATATYPE_NULL!

Not sure yet where the problem is: VLSV, Corsair, or the RHybrid user code.

@sandroos
Contributor

sandroos commented Jan 9, 2018

It's https://github.com/fmihpc/vlsv/blob/master/vlsv_common_mpi.cpp#L104 that is failing. You could do a quick check of the VLSV file footer to verify that all arrays have a sensible datatype and byte size.

@rjarvinen
Member Author

Yes, that's it. The particle arrays have the "unknown" datatype in the restart file:

<DYNAMIC arraysize="29311" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_H+_particles" type="celldata" vectorsize="1">1199269</DYNAMIC>
<DYNAMIC arraysize="1418" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_He++_particles" type="celldata" vectorsize="1">2847597</DYNAMIC>
<DYNAMIC arraysize="3067" datasize="56" datatype="unknown" mesh="SpatialGrid" name="iono_O+_particles" type="celldata" vectorsize="1">2933917</DYNAMIC>
<DYNAMIC arraysize="3794" datasize="56" datatype="unknown" mesh="SpatialGrid" name="iono_O2+_particles" type="celldata" vectorsize="1">3112581</DYNAMIC>
<DYNAMIC arraysize="1099" datasize="56" datatype="unknown" mesh="SpatialGrid" name="exo_H+_particles" type="celldata" vectorsize="1">3331957</DYNAMIC>
<DYNAMIC arraysize="1841" datasize="56" datatype="unknown" mesh="SpatialGrid" name="exo_O+_particles" type="celldata" vectorsize="1">3400413</DYNAMIC>
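
The footer check suggested above can be scripted. The following is a minimal sketch in Python, not part of VLSV or Corsair. It assumes the footer XML has already been extracted from the restart file into a string (locating the footer inside the file is not shown), and that the entries sit under a single root element, written here as `<VLSV>`. The function name `find_unknown_arrays` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the footer listing above, wrapped in a root
# element (assumed here to be <VLSV>) so that the snippet parses on its own.
FOOTER = """<VLSV>
<DYNAMIC arraysize="29311" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_H+_particles" type="celldata" vectorsize="1">1199269</DYNAMIC>
<DYNAMIC arraysize="1418" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_He++_particles" type="celldata" vectorsize="1">2847597</DYNAMIC>
</VLSV>"""


def find_unknown_arrays(footer_xml):
    """Return (tag, name, datasize) for every array whose datatype is 'unknown'."""
    root = ET.fromstring(footer_xml)
    suspicious = []
    for elem in root.iter():
        if elem is root:
            continue  # skip the wrapping root element itself
        if elem.get("datatype") == "unknown":
            suspicious.append(
                (elem.tag, elem.get("name"), int(elem.get("datasize")))
            )
    return suspicious


if __name__ == "__main__":
    for tag, name, datasize in find_unknown_arrays(FOOTER):
        print(f"{tag} {name}: datatype=unknown, datasize={datasize}")
```

A datasize of 56 bytes would be consistent with seven 8-byte floating-point values per particle, although the footer alone does not confirm that layout.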

@sandroos
Contributor

I don't have any code compiled at the moment, nor do I have VirtualBox installed, so I can't quickly check this myself. Do you have any old restart files around? I think those particle populations have always had the 'unknown' datatype, so the change must be in VLSV.

@sandroos
Contributor

Hi, sorry for the delay. I found the issue in the VLSV library. I'll have a pull request ready soon so that you can test it. I can't promise 100% that the old restart data will be usable, however.

@sandroos
Contributor

sandroos commented Jan 27, 2018

Please try this: fmihpc/vlsv#32

@sandroos
Contributor

Marked this as 'help wanted' and 'wontfix', as the issue is in the VLSV library and is tracked by issue fmihpc/vlsv#31.

@rjarvinen
Member Author

rjarvinen commented Jun 6, 2018

An update on the breakpointing issue. I am using the fmihpc/vlsv#32 patch for VLSV.

Starting a Corsair/RHybrid run from a breakpoint file sometimes results in a few corrupted particles in the particle lists. The corruption includes floating-point quantities of a particle (x, y, z, vx, vy, vz, w) being mixed with each other (e.g. vx gets the value of w or vice versa) and uninitialized floating-point values (1e-300 etc.).
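
One way to spot the uninitialized-looking values is a simple scan over the raw particle records. The sketch below is in Python and is not the removal method used in RHybrid. It assumes each particle is stored as seven little-endian doubles (x, y, z, vx, vy, vz, w), which would match the 56-byte datasize in the footer but is not confirmed. The threshold `tiny` and the function names are hypothetical.

```python
import math
import struct

# Assumption: one particle = 7 little-endian doubles = 56 bytes.
FIELDS = ("x", "y", "z", "vx", "vy", "vz", "w")
RECORD = struct.Struct("<7d")


def is_suspicious(values, tiny=1e-200):
    """Flag NaN/Inf and nonzero values of implausibly small magnitude."""
    for value in values:
        if not math.isfinite(value):
            return True
        if value != 0.0 and abs(value) < tiny:
            return True
    return False


def find_corrupted(buffer):
    """Return the indices of suspicious particle records in a raw byte buffer."""
    bad = []
    for index, values in enumerate(RECORD.iter_unpack(buffer)):
        if is_suspicious(values):
            bad.append(index)
    return bad


if __name__ == "__main__":
    good = RECORD.pack(1.0, 2.0, 3.0, 4.0e5, 0.0, 0.0, 1.0e20)
    bad = RECORD.pack(1.0, 2.0, 3.0, 1e-300, 0.0, 0.0, 1.0e20)
    print(find_corrupted(good + bad + good))
```

This only catches values like 1e-300 or NaN. Swapped fields (e.g. vx holding the value of w) would need physics-based bounds on each field, which depend on the run.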

These corrupted particles are not present in the particle lists when the restart files are written in the original run, so I believe they must arise when the restart file is written or read. I am not sure whether the cause is in Corsair or VLSV.

I have tried calling the updatePartitioning function of particle_list_skeleton.h just before writeRestart is called. This did not help. The corrupted particles associated with a restart seem to occur less frequently (or maybe not at all) if the mesh is not repartitioned during the original run.

One possible cause could be a SIZE(DYNAMIC) array that for some reason contains a wrong number of particles for a cell/block. Or there could be some other issue with how Corsair writes or reads dynamic user arrays. Or maybe this is still related to the byte size issue that was fixed in the VLSV "corsair_restart_fix" patch fmihpc/vlsv#32.
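
The first hypothesis could be tested with a consistency check between the per-cell particle counts and the total number of stored particles. The sketch below is in Python and assumes both quantities can be read separately from the restart file. How Corsair actually stores them is not confirmed here, and `check_cell_sizes` is a hypothetical helper.

```python
def check_cell_sizes(cell_sizes, total_particles):
    """Return (offsets, problems) for a list of per-cell particle counts.

    offsets[i] is the index of the first particle of cell i in the flat
    particle array; problems is a list of human-readable messages.
    """
    offsets = []
    problems = []
    running = 0
    for cell, count in enumerate(cell_sizes):
        if count < 0:
            problems.append(f"cell {cell}: negative particle count {count}")
        offsets.append(running)
        running += count
    if running != total_particles:
        problems.append(
            f"sizes sum to {running}, but {total_particles} particles are stored"
        )
    return offsets, problems


if __name__ == "__main__":
    offsets, problems = check_cell_sizes([2, 0, 3], 6)
    print(offsets)
    print(problems)
```

A single wrong count shifts the offsets of all later cells, so particles would be assigned to the wrong cells even if each record is intact. That would not by itself explain fields being mixed within a particle.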

I have not been able to construct a (very) minimal example outside of our Cray system yet. So this is not an easy one to debug.

[Figure: run03_crop]

Figure caption: An example of how the corrupted particles are distributed across the MPI processes. On the right is the particle density after restarting with the corrupted particles removed (by this method: fmihpc/rhybrid@f85cb77), and on the left are the MPI process ranks.
