restart from vlsv does not work #2

rjarvinen · 2018-01-09T12:25:38Z

Restarting RHybrid from a VLSV file produces errors:
(VLSV) ERROR: VLSV::getMPIDatatype called with datatype::UNKNOWN datatype, returning MPI_DATATYPE_NULL!

Not sure yet where the problem is: vlsv, corsair or rhybrid user code.

sandroos · 2018-01-09T18:10:27Z

The failing call is at https://github.com/fmihpc/vlsv/blob/master/vlsv_common_mpi.cpp#L104. You could do a quick check of the VLSV file footer to confirm that all arrays have a sensible datatype and byte size.

rjarvinen · 2018-01-10T09:14:23Z

Yes, that's it. The particle arrays have the "unknown" datatype in the restart file:

<DYNAMIC arraysize="29311" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_H+_particles" type="celldata" vectorsize="1">1199269</DYNAMIC>
<DYNAMIC arraysize="1418" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_He++_particles" type="celldata" vectorsize="1">2847597</DYNAMIC>
<DYNAMIC arraysize="3067" datasize="56" datatype="unknown" mesh="SpatialGrid" name="iono_O+_particles" type="celldata" vectorsize="1">2933917</DYNAMIC>
<DYNAMIC arraysize="3794" datasize="56" datatype="unknown" mesh="SpatialGrid" name="iono_O2+_particles" type="celldata" vectorsize="1">3112581</DYNAMIC>
<DYNAMIC arraysize="1099" datasize="56" datatype="unknown" mesh="SpatialGrid" name="exo_H+_particles" type="celldata" vectorsize="1">3331957</DYNAMIC>
<DYNAMIC arraysize="1841" datasize="56" datatype="unknown" mesh="SpatialGrid" name="exo_O+_particles" type="celldata" vectorsize="1">3400413</DYNAMIC>
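The footer check suggested above can be scripted. The following is a minimal sketch, assuming the footer XML has already been extracted from the VLSV file into a string. The `<VLSV>` wrapper element, the `KNOWN_DATATYPES` set, and the function name `find_suspect_arrays` are assumptions made for illustration and are not taken from the VLSV library.

```python
import xml.etree.ElementTree as ET

# Footer entries copied from the listing above. The <VLSV> wrapper is an
# assumption so that the excerpt parses as a single XML document.
FOOTER = """<VLSV>
<DYNAMIC arraysize="29311" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_H+_particles" type="celldata" vectorsize="1">1199269</DYNAMIC>
<DYNAMIC arraysize="1418" datasize="56" datatype="unknown" mesh="SpatialGrid" name="sw_He++_particles" type="celldata" vectorsize="1">2847597</DYNAMIC>
<DYNAMIC arraysize="3067" datasize="56" datatype="unknown" mesh="SpatialGrid" name="iono_O+_particles" type="celldata" vectorsize="1">2933917</DYNAMIC>
<DYNAMIC arraysize="3794" datasize="56" datatype="unknown" mesh="SpatialGrid" name="iono_O2+_particles" type="celldata" vectorsize="1">3112581</DYNAMIC>
<DYNAMIC arraysize="1099" datasize="56" datatype="unknown" mesh="SpatialGrid" name="exo_H+_particles" type="celldata" vectorsize="1">3331957</DYNAMIC>
<DYNAMIC arraysize="1841" datasize="56" datatype="unknown" mesh="SpatialGrid" name="exo_O+_particles" type="celldata" vectorsize="1">3400413</DYNAMIC>
</VLSV>"""

# Assumption: the datatype strings a reader can map to a basic MPI type.
KNOWN_DATATYPES = {"int", "uint", "float"}


def find_suspect_arrays(footer_xml):
    """Return (tag, name, datatype, datasize) for arrays that look unreadable."""
    root = ET.fromstring(footer_xml)
    suspects = []
    for elem in root.iter():
        datatype = elem.get("datatype")
        if datatype is None:
            continue  # element does not describe an array
        datasize = int(elem.get("datasize", "0"))
        if datatype.lower() not in KNOWN_DATATYPES or datasize <= 0:
            suspects.append((elem.tag, elem.get("name"), datatype, datasize))
    return suspects


if __name__ == "__main__":
    for tag, name, datatype, datasize in find_suspect_arrays(FOOTER):
        print(f"{tag} {name}: datatype={datatype} datasize={datasize}")
```

Run against the excerpt, this flags all six particle arrays. Note that a `datasize` of 56 bytes is consistent with seven 8-byte values per particle, which would match the (x,y,z,vx,vy,vz,w) layout mentioned later in this thread, but that mapping is an inference and not something the footer states.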

sandroos · 2018-01-11T19:33:57Z

I don't have any code compiled at the moment nor a virtualbox installed, so I can't quickly check this myself. Do you have any old restart files around? I think those particle pops have always had 'unknown' datatype, so the change must be in vlsv.

sandroos · 2018-01-27T21:14:48Z

Hi, sorry for the delay. I found the issue in the VLSV library. I'll have a pull request ready soon so that you can test it. I can't promise 100% that the old restart data will be usable, however.

sandroos · 2018-01-27T21:31:21Z

Please try this fmihpc/vlsv#32

sandroos · 2018-01-27T21:32:30Z

Marked this as 'help wanted' and 'wontfix', as the issue is in the VLSV library and is tracked by issue fmihpc/vlsv#31

rjarvinen · 2018-06-06T12:50:10Z

An update on the breakpointing issue. I am using the fmihpc/vlsv#32 patch for VLSV.

Starting a Corsair/RHybrid run from a breakpoint file sometimes results in a few corrupted particles in the particle lists. The corruptions include floating-point quantities of a particle (x,y,z,vx,vy,vz,w) being mixed with each other (e.g. vx gets the value of w or vice versa) and uninitialized floating-point values (1e-300 etc.).
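The symptoms described above suggest a simple screening pass over the particle lists after a restart. The following is a minimal sketch, not the method used in fmihpc/rhybrid@f85cb77. The function name `is_suspect`, the `TINY` threshold, and the box and speed limits are all assumptions that would need to be set per run.

```python
import math

# Assumption: any nonzero magnitude below this is treated as uninitialized
# memory (the thread mentions values such as 1e-300).
TINY = 1e-200


def is_suspect(particle, box_min, box_max, v_max):
    """Heuristically flag a particle tuple (x, y, z, vx, vy, vz, w)."""
    x, y, z, vx, vy, vz, w = particle
    # NaN or infinity in any component.
    if not all(math.isfinite(v) for v in particle):
        return True
    # Implausibly small nonzero values, typical of uninitialized doubles.
    if any(v != 0.0 and abs(v) < TINY for v in particle):
        return True
    # Position outside the simulation box.
    for coord, lo, hi in zip((x, y, z), box_min, box_max):
        if not lo <= coord <= hi:
            return True
    # Speed above a physical limit; this catches a large weight landing in
    # a velocity component.
    if math.sqrt(vx * vx + vy * vy + vz * vz) > v_max:
        return True
    # A macroparticle weight should be positive.
    return w <= 0.0


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    box_min, box_max, v_max = (-1e7,) * 3, (1e7,) * 3, 3e8
    particles = [
        (1e6, 0.0, 0.0, -4e5, 0.0, 0.0, 1e20),     # plausible
        (1e6, 0.0, 0.0, 1e-300, 0.0, 0.0, 1e20),   # uninitialized vx
        (1e6, 0.0, 0.0, 1e20, 0.0, 0.0, -4e5),     # vx and w swapped
    ]
    for p in particles:
        print(is_suspect(p, box_min, box_max, v_max))
```

A check like this does not catch every swap: a positive velocity landing in the weight slot passes the sign test, so a lower bound on the weight may be needed as well.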

These corrupted particles are not present in the particle lists when the restart files are written in the original run. I believe they must arise when the restart file is written or read. I am not sure whether it happens because of Corsair or VLSV.

I have tried applying the updatePartitioning function of particle_list_skeleton.h just before writeRestart is called. This did not help. The corrupted particles associated with a restart seem to occur less frequently (or maybe not at all) if the mesh is not repartitioned during the original run.

I think one possible issue could arise if a SIZE(DYNAMIC) array for some reason contained a wrong number of particles for a cell/block. It could also be some other issue in how Corsair writes or reads dynamic user arrays. Or maybe this is still related to the bytesize issue, which was fixed in the VLSV "corsair_restart_fix" patch fmihpc/vlsv#32.
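The first hypothesis can be tested directly on a restart file once the arrays are loaded. The following is a minimal sketch, assuming the size array holds one particle count per cell/block and that these counts should sum to the `arraysize` of the matching DYNAMIC array. Both the layout and the function name `check_dynamic_sizes` are assumptions, not confirmed Corsair or VLSV behaviour.

```python
def check_dynamic_sizes(counts, arraysize):
    """Compare per-cell particle counts against the stored array length.

    counts    -- per-cell/block particle counts read from the size array
    arraysize -- 'arraysize' attribute of the matching DYNAMIC array
    Returns (offsets, problems): the start offset of each cell's particles
    and a list of human-readable inconsistencies (empty if none found).
    """
    problems = []
    offsets = []
    total = 0
    for cell, count in enumerate(counts):
        if count < 0:
            problems.append(f"cell {cell}: negative count {count}")
            count = 0  # keep the offsets monotonic for the remaining cells
        offsets.append(total)
        total += count
    if total != arraysize:
        problems.append(f"counts sum to {total}, but arraysize is {arraysize}")
    return offsets, problems


if __name__ == "__main__":
    # Hypothetical counts for a four-cell mesh, for illustration only.
    offsets, problems = check_dynamic_sizes([3, 0, 5, 2], arraysize=10)
    print(offsets, problems)
    offsets, problems = check_dynamic_sizes([3, 0, 5, 2], arraysize=11)
    print(offsets, problems)
```

If the sums disagree for a population, a reader walking the array with these offsets would read past a cell boundary and land in the middle of a particle record, which is one possible way to end up with shifted fields.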

I have not been able to construct a (very) minimal example outside of our Cray system yet. So this is not an easy one to debug.

Figure caption: An example of how the corrupted particles are distributed across the MPI processes. On the right is the particle density after restarting with corrupted particles removed (by this method: fmihpc/rhybrid@f85cb77) and on the left are the MPI process ranks.

sandroos self-assigned this Jan 27, 2018

sandroos added help wanted wontfix labels Jan 27, 2018
