restart from vlsv does not work #2
Comments
The failure is at https://github.com/fmihpc/vlsv/blob/master/vlsv_common_mpi.cpp#L104. As a quick check, you could inspect the VLSV file footer and verify that all arrays have a sensible datatype and byte size.
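A footer check along these lines could be sketched as below. It assumes the footer has already been extracted from the file as an XML string and that each array element carries `datatype` and `datasize` attributes. The sample footer, the tag names, the accepted type names and sizes, and the helper `find_suspect_arrays` are illustrative assumptions, not taken from the VLSV sources.

```python
import xml.etree.ElementTree as ET

# Assumed sets of valid datatype names and byte sizes; adjust to the real format.
KNOWN_TYPES = {"int", "uint", "float"}
KNOWN_SIZES = {1, 2, 4, 8}


def find_suspect_arrays(footer_xml):
    """Return (tag, name, datatype, datasize) for arrays with a suspicious type or size."""
    suspects = []
    root = ET.fromstring(footer_xml)
    for elem in root.iter():
        # Skip the root and any element that does not describe an array.
        if elem is root:
            continue
        if "datatype" not in elem.attrib and "datasize" not in elem.attrib:
            continue
        datatype = elem.get("datatype", "")
        datasize = elem.get("datasize", "")
        size_ok = datasize.isdigit() and int(datasize) in KNOWN_SIZES
        if datatype not in KNOWN_TYPES or not size_ok:
            suspects.append((elem.tag, elem.get("name"), datatype, datasize))
    return suspects


# Hypothetical footer, for illustration only.
SAMPLE_FOOTER = """<VLSV>
  <MESH name="SpatialGrid" arraysize="100" datasize="8" datatype="uint" vectorsize="1">64</MESH>
  <PARTICLES name="H+" arraysize="500" datasize="1" datatype="unknown" vectorsize="1">1024</PARTICLES>
</VLSV>"""

suspects = find_suspect_arrays(SAMPLE_FOOTER)
for tag, name, datatype, datasize in suspects:
    print(f"{tag} '{name}': datatype={datatype}, datasize={datasize}")
```

Any array reported here would be a candidate for the datatype lookup that fails during restart.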
Yes, that's it. The particle arrays have an "unknown" datatype in the restart file.
I don't have any code compiled at the moment, nor VirtualBox installed, so I can't quickly check this myself. Do you have any old restart files around? I think those particle populations have always had the 'unknown' datatype, so the change must be in vlsv.
Hi, sorry for the delay. I found the issue in the VLSV library. I'll have a pull request ready soon so that you can test it. I can't promise 100% that the old restart data will be usable, however.
Please try this: fmihpc/vlsv#32
Marked this as 'help wanted' and 'wontfix', as the issue is in the VLSV library and is tracked by issue fmihpc/vlsv#31.
An update on the breakpointing issue. I am using the fmihpc/vlsv#32 patch for VLSV. Starting a Corsair/RHybrid run from a breakpoint file sometimes results in a few corrupted particles in the particle lists. The corruption takes two forms: the floating-point quantities of a particle (x, y, z, vx, vy, vz, w) get mixed up with each other (e.g. vx gets the value of w or vice versa), or they hold uninitialized floating-point values (1e-300 etc.). These corrupted particles are not present in the particle lists when the restart files are written in the original run, so I believe they must be introduced when writing or reading the restart file. I am not sure whether the cause is in Corsair or in VLSV.

I have tried calling the updatePartitioning function of particle_list_skeleton.h just before writeRestart is called. This did not help. The corrupted particles associated with a restart seem to occur less frequently (or maybe not at all) if the mesh is not repartitioned during the original run.

One possible cause is a SIZE(DYNAMIC) array that for some reason includes the wrong number of particles in a cell/block, or some other issue with how Corsair writes or reads dynamic user arrays. Or maybe this is still related to the byte size issue, which was fixed in the VLSV "corsair_restart_fix" patch fmihpc/vlsv#32.

I have not yet been able to construct a (very) minimal example outside of our Cray system, so this is not an easy one to debug.

Figure caption: An example of how the corrupted particles are distributed across the MPI processes. On the right is the particle density after restarting with the corrupted particles removed (by this method: fmihpc/rhybrid@f85cb77), and on the left are the MPI process ranks.
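For reference, a sanity check that flags the two kinds of corruption described above (swapped fields and uninitialized-looking values) could be sketched as follows. This is only an illustration and not necessarily the method used in the linked rhybrid commit. The bounds, the tuple layout, and the helpers `is_suspect` and `split_particles` are all hypothetical and would need to be matched to the actual run.

```python
import math

# Hypothetical bounds; pick values that match the simulation domain and units.
MAX_ABS_POSITION = 1.0e9    # largest plausible |x|, |y|, |z|
MAX_ABS_VELOCITY = 3.0e8    # largest plausible |vx|, |vy|, |vz|
MIN_ABS_NONZERO = 1.0e-100  # smaller non-zero magnitudes look uninitialized


def is_suspect(particle):
    """Return True if a particle (x, y, z, vx, vy, vz, w) looks corrupted."""
    x, y, z, vx, vy, vz, w = particle
    for value in particle:
        # NaN or infinity is never a valid particle quantity.
        if not math.isfinite(value):
            return True
        # Values such as 1e-300 are typical of uninitialized memory.
        if value != 0.0 and abs(value) < MIN_ABS_NONZERO:
            return True
    if any(abs(c) > MAX_ABS_POSITION for c in (x, y, z)):
        return True
    # A weight stored in a velocity slot usually violates the velocity bound.
    if any(abs(c) > MAX_ABS_VELOCITY for c in (vx, vy, vz)):
        return True
    # Weights are assumed to be strictly positive.
    return w <= 0.0


def split_particles(particles):
    """Split a particle list into (plausible, suspect) lists."""
    good = [p for p in particles if not is_suspect(p)]
    bad = [p for p in particles if is_suspect(p)]
    return good, bad


particles = [
    (1.0e6, 2.0e6, -3.0e6, 4.0e5, 0.0, -1.0e4, 1.0e20),  # plausible
    (1.0e6, 2.0e6, -3.0e6, 1.0e-300, 0.0, 0.0, 1.0e20),  # uninitialized-looking vx
    (1.0e6, 2.0e6, -3.0e6, 1.0e20, 0.0, 0.0, 4.0e5),     # vx and w swapped
]
good, bad = split_particles(particles)
print(len(good), "kept,", len(bad), "flagged")
```

A check like this only catches swaps between quantities of very different magnitude (such as vx and w); a swap between two velocity components would pass unnoticed.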
Restarting rhybrid from a vlsv file produces errors:
(VLSV) ERROR: VLSV::getMPIDatatype called with datatype::UNKNOWN datatype, returning MPI_DATATYPE_NULL!
Not sure yet where the problem is: vlsv, corsair or rhybrid user code.