HRv4 hangs on orion and hercules #2486
This happens at high ATM resolution C1152. |
I made a HRv4 test run on orion as well. As reported previously, it hung at the beginning of the run. The log file is at /work2/noaa/stmp/rsun/ROTDIRS/HRv4 HOMEgfs=/work/noaa/global/rsun/git/global-workflow.hr.v4 (source) |
@RuiyuSun Denise reports that the privacy settings on your directories are preventing her from accessing them. Could you check on that and report back when it's fixed so others can look at your forecast? |
@DeniseWorthen I made the changes. Please try again. |
I've made a few test runs on my end and here are some observations:
Consistently, all the runs I have made stall out at the same point as @RuiyuSun's runs:
With high resolution runs (C768 & C1152) we've had to use different numbers of write grid tasks on various machines. I've tried a few and all are stalling though. This is using ESMF managed threading, so one thing to try might be moving away from that. To run a high res test case:
Change C1152 to C768 to run that resolution and also change your HPC_ACCOUNT, pslot, as desired. Lastly, if you want to turn off waves, you change that in C1152_S2SW.yaml. If you want to change resources, look in global-workflow/parm/config/gfs/config.ufs in the C768/C1152 section. If you want to run S2S only, change the app in global-workflow/ci/cases/hires/C1152_S2SW.yaml My latest run log files can be found at: |
@GeorgeVandenberghe-NOAA suggested trying 2 write groups with 240 tasks in them. I meant to try that but tried 2 write groups with 360 tasks per group unintentionally, but I did turn on all PET files as @LarissaReames-NOAA thought that might have helpful info. The rundirectory is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800 The log file is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t06/COMROOT/C1152t06/logs/2019120300/gfs_fcst_seg0.log The PET logs to me also point to write group issues. Any help with this would be greatly appreciated. Tagging @aerorahul for awareness. |
Thanks to everyone for the work on this. Has anyone tried this configuration with the write component off? That might help isolate where the problem is (hopefully) and then we can direct this accordingly for further debugging. |
I have not tried this without the write component. |
@JessicaMeixner-NOAA and others, I grabbed the run directory from the last experiment you ran (/work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800), changed it to run just ATM component and converted it to run with traditional threading. It is currently running in /work2/noaa/stmp/djovic/stmp/fcst.272800, and it passed the initialization phase and finished writing 000 and 003 hour outputs successfully. I submitted the job with just 30 min wall-clock time limit, so it will fail soon. I suggest you try running full coupled version with traditional threading if it's easy to reconfigure it. |
some good news: |
my 48hr run finished |
@DusanJovic-NOAA I tried running without ESMF threading - but am struggling to get it set up correctly and to run through. @aerorahul is it expected that turning off ESMF managed threading in the workflow should work? I'm also trying on hercules to replicate @jiandewang's success but with S2SW. |
I also launched one S2SW but it's still in pending status |
WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 with S2S did not work on orion: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t03/COMROOT/C1152t03/logs/2019120300/gfs_fcst_seg0.log |
mine is on hercules |
@JessicaMeixner-NOAA my gut feeling is the issue is related to the memory/node, hercules has more than orion. Maybe you can try 5 on orion |
Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files and I think we are waiting for that kind of work to be in the ufs-weather-model repo. @DusanJovic-NOAA |
I only changed ufs.configure:
And, I added job_card by copying one of the job_card from regression test run and changed:
80 is the number of cores on hercules compute nodes
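Ok. Yes. That makes sense for the atm-only. Does your ufs.configure have a line for ATM_omp_num_threads: @[atm_omp_num_threads]? @[atm_omp_num_threads] would have been 4. Did you remove it? Or does it not matter since globalResourceControl is set to false?
The original value for ATM_petlist_bounds must have been 0 755 that you changed to 0 3023, I am assuming. |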
OMP_NUM_THREADS performance is inconsistent and generally poor if
ATM_omp_num_threads: @[atm_omp_num_threads]
is not removed when esmf managed threading is set to false.
…On Fri, Nov 8, 2024 at 7:52 PM Rahul Mahajan ***@***.***> wrote:
I only changed ufs.configure:
1. remove all components except ATM
2. change globalResourceControl: from true to false
3. change ATM_petlist_bounds: to be 0 3023 - these numbers are the lower and upper bounds of MPI ranks used by the ATM model, in this case 24*16*6 + 2*360, where 24 and 16 are layout values from input.nml and 2*360 are write comp values from model_configure
And, I added job_card by copying one of the job_card from regression test run and changed:
1. export OMP_NUM_THREADS=4 - where 4 is a number of OMP threads
2. srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is a number of MPI ranks, 4 is a number of threads
3. #SBATCH --nodes=152
#SBATCH --ntasks-per-node=80
80 is the number of cores on hercules compute nodes; 152 is the minimal number of nodes such that 152*80 >= 3024*4 (MPI ranks times threads)
Ok. Yes. That makes sense for the atm-only.
Does your ufs.configure have a line for
ATM_omp_num_threads: @[atm_omp_num_threads]
@[atm_omp_num_threads] would have been 4. Did you remove it? Or does it
not matter since globalResourceControl is set to false?
The original value for ATM_petlist_bounds must have been 0 755 that you
changed to 0 3023, I am assuming.
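The rank and node counts quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, the factor of 6 is the number of cubed-sphere tiles implied by 24*16*6, and the node count assumes every thread gets its own core.

```python
import math


def atm_mpi_ranks(layout_x, layout_y, write_groups, write_tasks_per_group, tiles=6):
    """Compute ranks on all tiles plus the write-component ranks."""
    compute_ranks = layout_x * layout_y * tiles
    write_ranks = write_groups * write_tasks_per_group
    return compute_ranks + write_ranks


def nodes_needed(mpi_ranks, threads_per_rank, cores_per_node):
    """Smallest node count that gives every thread its own core (assumption)."""
    return math.ceil(mpi_ranks * threads_per_rank / cores_per_node)


ranks = atm_mpi_ranks(24, 16, 2, 360)
print(ranks)                                  # 3024
print(f"ATM_petlist_bounds: 0 {ranks - 1}")   # ATM_petlist_bounds: 0 3023
print(nodes_needed(ranks, 4, 80))             # 152
```

With 4 threads per rank the job needs 3024*4 = 12096 cores, which is why 152 nodes of 80 cores is the minimum.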
--
George W Vandenberghe
Lynker Technologies at NOAA/NWS/NCEP/EMC
5830 University Research Ct., Rm. 2141
College Park, MD 20740
***@***.***
301-683-3769(work) 3017751547(cell)
|
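I just fixed my comment about ATM_omp_num_threads:. I set it to 1 from 4, I'm not sure if it's ignored when globalResourceControl is set to false. The original value for ATM_petlist_bounds was something like 12 thousand or something like that, that included MPI ranks times 4 threads. |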
Yes, ESMF managed threading requires several times more ranks, and ESMF fails
when the rank count goes above 21000 or so. This is a VERY serious issue
for resolution increases unless it is fixed. It was reported in February.
…On Fri, Nov 8, 2024 at 7:56 PM Dusan Jovic ***@***.***> wrote:
I just fixed my comment about ATM_omp_num_threads:. I set it to 1 from 4,
I'm not sure if it's ignored when globalResourceControl is set to false
The original value for ATM_petlist_bounds was something like 12 thousand
or something like that, that included MPI ranks times 4 threads.
|
@JessicaMeixner-NOAA BLUF, you/someone from the applications team could try traditional threading and we could gain some insight on performance at those resolutions. Thanks~ |
I have MANY test cases that use traditional threading and have converted
others from managed to traditional threading. It's generally
needed at high resolution to get decent run rates.
…On Fri, Nov 8, 2024 at 8:02 PM Rahul Mahajan ***@***.***> wrote:
@JessicaMeixner-NOAA <https://github.com/JessicaMeixner-NOAA>
I think the global-workflow is coded to use the correct ufs_configure
template and set the appropriate values for PETLIST_BOUNDS and
OMP_NUM_THREADS in the ufs_configure file.
The default in the global-workflow is to use ESMF_THREADING = YES. I am
pretty sure one could use traditional threading as well, but is an
unconfirmed fact as there was still work being done to confirm traditional
threading will work on WCOSS2 with the Slingshot updates and whatnot.
Details on that are fuzzy to me at the moment.
BLUF, you/someone from the applications team could try traditional
threading and we could gain some insight on performance at those
resolutions. Thanks~
|
Ok. @GeorgeVandenberghe-NOAA. Where do we employ traditional threading C768 and up? If so, we can set a flag in the global-workflow for those resolutions to use traditional threading. It should be easy enough to set that up. |
I don't know because I usually get CWD testcases from others and work from
there, but yes, that's an excellent idea. We should probably
also use a multiple-stanza MPI launcher for the different components to
minimize core wastage for components that don't thread, particularly WAVE.
|
Unfortunately I was unable to replicate @jiandewang hercules success for HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for orion either. |
Note this was with added waves - so this might have also failed for @jiandewang if he has used waves. |
summary for more tests I did on HERCULES: |
@DeniseWorthen Thanks so much for your efforts. Please proceed to return to the grid imprint issue (#2466). @JessicaMeixner-NOAA I think the ability to run with traditional threading (no managed threading) was added to GW earlier this year (see GW Issue 2277). However, I'm not sure if it's working. If it's not, I'd recommend proceeding with opening a new issue for this feature. Since something might already exist, hopefully it's not too much of a lift to get it going. This will hopefully get you working in the short-ish term. Now, there's still something going on that we need to understand. @GeorgeVandenberghe-NOAA Would you be able to continue digging into this issue? |
@JacobCarley-NOAA a comment from @aerorahul earlier in this thread:
I'll open a g-w issue (update: g-w issue: NOAA-EMC/global-workflow#3122) |
I intend to but if I encounter hangs I need people who know the component
codes to figure out where and why the hangs
are occurring. Debugging is very slow on Orion where I have encountered a
hang with 7008 mpi ranks, 1400 wave ranks and 24x32 atm decomposition
WITHOUT esmf managed threading. It looks like an issue with large numbers
of ranks which we get first with ESMF managed threading but eventually at
higher resolution, without this setting too. This is DIFFERENT from the
ESMF bug where we still can't spawn more than 21K ranks without a segfault
in the ESMF code somewhere.
|
Thanks @GeorgeVandenberghe-NOAA! Just send me a quick note offline (email is fine) when you need a component expert to jump in and I'll be happy to coordinate accordingly. |
It looks like the hangs are related to the total number of WAVE tasks but are also related to total resource usage. I have verified that a 16x16 decomposition (ATM) with traditional threads (two per rank) and 1400 wave ranks does not hang on either Orion or Hercules but a 24x32 decomposition with 1400 wave ranks does. 998 rank runs do get through with a 24x32 decomposition. So it looks like total job resources is a contributing issue. It isn't just a hard barrier that we can't run 1400 wave tasks on orion or hercules. |
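@RuiyuSun I have implemented a traditional threading option in the global-workflow with suggestions from @junwang-noaa and @DusanJovic-NOAA. global-workflow PR 3149 (NOAA-EMC/global-workflow#3149) is under review. I have tested the case of C768 S2SW on Hercules. Please see the details and changes in the open PR. |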
I have gotten ESMF managed threading cases to work with low resource usage
on both Hercules and Orion. I tried this because ESMF managed threading remains
easier to support in the workflow. With higher resource usage I am
seeing hangs either upon initiation of the wave model or somewhere vaguely
in ESMF not involving the wave model. I will capture these cases and
report. We can run retrospectives with traditional threading but not with
ESMF managed threading. The latter is too slow, but I am still looking for
special cases where it might be fast enough. Turnaround on both Hercules
and Orion is very slow, but C1152 coupled is also a large, resource-intensive
system and the R&D machines are not really large enough to support these
retrospectives anyway. WCOSS2 and Gaea C5 and C6 are large enough.
…On Wed, Dec 11, 2024 at 2:50 PM Rahul Mahajan ***@***.***> wrote:
@RuiyuSun <https://github.com/RuiyuSun>
I have implemented a traditional threading option in the global-workflow
with suggestions from @junwang-noaa <https://github.com/junwang-noaa> and
@DusanJovic-NOAA <https://github.com/DusanJovic-NOAA>. global-workflow PR
3149 <NOAA-EMC/global-workflow#3149> is under
review.
I have tested the case of C768 S2SW on Hercules. Please see the details
and changes in the open PR.
|
I am curious, has the HRv4 configuration (with ESMF-managed threading) been run on Gaea and/or WCOSS2. If so, does it also hang in the same way? Sorry if this was discussed already, and I overlooked it in the discussion. |
No. It is much more reliable on WCOSS2 and Gaea. Orion and Hercules are
the two systems where we see these hangs. It would likely happen on Hera too,
but Hera is too small and busy to even try this on that system.
The constraint on ESMF managed threading on Gaea and WCOSS2 is the
inability to spawn more than about 21000 MPI ranks so we have to go to
traditional threading for high resolution and fast runtimes to stay under
21000 MPI ranks.
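A rough way to see why traditional threading keeps a job under that ceiling: with ESMF managed threading the launcher has to spawn one rank slot per thread, while with traditional threading it spawns only the MPI tasks. The sketch below uses hypothetical function names, treats 21000 as the approximate figure quoted in this thread rather than an exact bound, and uses a made-up 6000-task job purely for illustration.

```python
ESMF_RANK_LIMIT = 21000  # approximate figure from this thread, not an exact bound


def launched_ranks(mpi_tasks, threads_per_task, esmf_managed):
    """Rank slots requested at launch under each threading mode."""
    return mpi_tasks * threads_per_task if esmf_managed else mpi_tasks


def exceeds_limit(mpi_tasks, threads_per_task, esmf_managed):
    return launched_ranks(mpi_tasks, threads_per_task, esmf_managed) > ESMF_RANK_LIMIT


# ATM-only case from earlier in the thread: 3024 tasks, 4 threads each.
print(launched_ranks(3024, 4, esmf_managed=True))    # 12096
print(exceeds_limit(3024, 4, esmf_managed=True))     # False

# Hypothetical larger job: 6000 tasks, 4 threads each.
print(exceeds_limit(6000, 4, esmf_managed=True))     # True
print(exceeds_limit(6000, 4, esmf_managed=False))    # False
```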
|
@theurich we ran HR4 on WCOSS2. I can't remember if we had trouble finding a node combination that worked, but all of HR4 ran reliably - although @jiandewang could correct if I'm wrong. I don't think HR4 was run on gaea but George might have run tests there. |
@theurich I don't want to muddy the water, but we have been seeing issues w/ ESMF-MT on Gaea-C6 for just the RTs. See #2448 (comment) and #2448 (comment) |
That is good to know. So an interesting twist here is that we think that ESMF v8.8.0b09 addresses the issue with the higher MPI task count on GAEA and WCOSS2. In fact it was sort of our hope of pushing this beta out for UFS... but then we learned about this new issue on Hercules and Orion with HRv4 and ESMF-MT. ... which we are working on now, but seems like a separate issue. Bottom line, if someone could try a larger > 21kPET job on GAEA or WCOSS2 with ESMF v8.8.0b09 that would be interesting. The recommendation with this beta is to NOT switch to UCX, but use the default OFI on Cray Slingshot. |
@DeniseWorthen interesting... I will read through those issues. Thanks! |
I can try it on Gaea but it has to be compatible with MAPL/2.40.3. I have
to build it myself on Gaea.
We cannot build ESMF tests on WCOSS2. Anyone who does so will be fired
and separated from NOAA. Policy
|
The hangs on Hercules and Orion appear to be resource (rank count total and
between components, both are terms in reliability probability) related and
occur with traditional
threads as well as ESMF threads but at higher node counts with traditional
threads. They have been reported for eight weeks and the only mitigation
so far has been to find successful combinations
of resource use between the components that do not hang. We have
successful combinations for traditional threads and I am still looking for
one that is fast enough with ESMF managed threads on both Orion and
Hercules. We do have ESMF managed threads configurations that run on both
systems but they are too slow for retrospectives.
|
We ran hundreds of HR4 cases on wcoss2 and didn't have any hanging case, and we adjusted resources for FV3 for speed and job turnaround purposes (not for the purpose of node combination) without issue. |
I have run hr4 and hr4.v2 on Gaea. Memory is an issue and I have to
specify more memory for the I/O ranks by spawning more ranks per I/O
group. Most of my tests also disabled ESMF managed threading but I think I
did have a few where that worked at low resource usage. I have found the
Gaea software dependency stacks to be unreliable and unstable and build my
own as an alternative.
|
The following ESMF MANAGED THREADING combination works on Orion:
sh decompose 8 16 256 2 240 120 984 hr4 32
That is an 8x16 decomposition, two ATM threads, 256 ranks per I/O group, two I/O
groups, 240 OCN, 120 ICE, 984 WAVE and 32 ranks per node.
The same combination works on Hercules but wastes nodes. Attempts to run
64 ranks per node on Hercules with the Orion numbers above hang.
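To get a feel for the size of this working combination, the component counts can be tallied as below. This is a sketch under several assumptions that the thread does not confirm: six ATM tiles, ATM rank slots doubled by the two managed threads, single-threaded OCN, ICE and WAVE, and "32 ranks per node" applied to the total. The function name is hypothetical and says nothing about how the `decompose` script itself works.

```python
import math


def coupled_rank_slots(layout_x, layout_y, io_groups, ranks_per_io_group,
                       atm_threads, ocn, ice, wav, tiles=6):
    """Rank slots per component under ESMF managed threading (assumed model)."""
    atm_tasks = layout_x * layout_y * tiles + io_groups * ranks_per_io_group
    return {"ATM": atm_tasks * atm_threads, "OCN": ocn, "ICE": ice, "WAV": wav}


slots = coupled_rank_slots(8, 16, 2, 256, 2, 240, 120, 984)
total = sum(slots.values())
print(slots)                   # {'ATM': 2560, 'OCN': 240, 'ICE': 120, 'WAV': 984}
print(total)                   # 3904
print(math.ceil(total / 32))   # 122 nodes at 32 rank slots per node
```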
|
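Gerhard (@theurich) suggests we have this variable FI_MLX_INJECT_LIMIT=0 set in job scripts on Hercules (and probably Orion). I tried c1152s2sw test on Hercules, using ESMF managed threading, and it works fine. |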
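In a shell job card the suggested setting, FI_MLX_INJECT_LIMIT=0, is a single export line placed before the launcher command. For anyone driving runs from a Python wrapper instead, a minimal sketch looks like this; the helper name is hypothetical and nothing here is taken from the global-workflow code.

```python
import os


def job_environment(base_env=None):
    """Copy the environment and add the setting suggested in this thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["FI_MLX_INJECT_LIMIT"] = "0"  # environment values must be strings
    return env


env = job_environment({"OMP_NUM_THREADS": "1"})
print(env["FI_MLX_INJECT_LIMIT"])  # 0
```

The resulting dictionary would be passed as `env=` to `subprocess.run` when starting the launcher, so the variable reaches every rank without changing the calling shell.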
Will try for my failed cases
…On Thu, Dec 12, 2024 at 8:42 AM Dusan Jovic ***@***.***> wrote:
Gerhard (@theurich) suggests we have this
variable FI_MLX_INJECT_LIMIT=0 set in job scripts on Hercules (and
probably Orion). I tried c1152s2sw test on Hercules, using ESMF managed
threading, and it works fine.
|
This also seems to help me - I'll post log files later (it's still pretty slow, so I am running for shorter so the full run finishes). |
@JessicaMeixner-NOAA do you have ESMF profiling enabled for these runs. If so, I would be interested in looking at the profile summary to see if anything obvious sticks out wrt performance. |
@theurich I do - that's actually why I shortened the forecast length so we could actually get the report and not go over the wallclock. I'll post the location here when complete. |
What does this do and how did you find it?
On a side note, EVERY MPI implementation I have encountered has had these
not well documented "trick level" settings we need to get
BASIC stuff to run. And the MPI we do for NWP is pretty basic:
alltoalls, broadcasts, scatters, reductions and large message send
receives with or without buffering. (The MPI standard
states, by the way, that a "correct" code must work with zero buffering.)
|
On a side note, this is much more reliable on GAEA. However we cannot run
with ESMF managed threading and full node usage because we cannot run 128
MPI ranks per node on Gaea. When I try, it fails with an error, probably
memory exhaustion, and this happens even with post and history writes turned
off. I have no idea where the memory exhaustion is happening (the nodes
have 256 G of memory!!). ESMF managed threading disables conventional
threading, so if we have to run sparsely (64 ranks per node works) we
will only effectively use 64 cores. With traditional threading I can run
the ATM component efficiently, for example 32 ranks per node and 4 threads
per rank. Components that don't thread are of course inefficient this way.
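The utilization argument comes down to simple arithmetic. In this sketch the 128 cores per node is inferred from the statement that 128 MPI ranks per node would be full node usage, and the function name is hypothetical.

```python
CORES_PER_NODE = 128  # inferred from "128 MPI ranks per node" being full usage


def busy_cores(ranks_per_node, threads_per_rank):
    """Cores doing work on one node, assuming one core per thread."""
    return ranks_per_node * threads_per_rank


# ESMF managed threading run sparsely: each rank slot is one core.
sparse = busy_cores(64, 1)
# Traditional threading: 32 ranks per node, 4 OpenMP threads each.
traditional = busy_cores(32, 4)

print(sparse, sparse / CORES_PER_NODE)            # 64 0.5
print(traditional, traditional / CORES_PER_NODE)  # 128 1.0
```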
|
@theurich In this configuration the wave component is slow - I'll try increasing wave nodes and using PIO. Note both are using a branch of WW3 as I was trying to test initialization speedup (which we do see in the second): /work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t03/COMROOT/t03/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary /work2/noaa/marine/jmeixner/hercules/TestNewGridNewThreads/t04/COMROOT/t04/gfs.20200913/00/model/atmos/history/ESMF_Profile.summary Edit: Wave is slower than [fv3_fcst] RunPhase1 , but not [ATM] RunPhase1 |
I agree that using PIO for WW3 might help. See the memory figures I posted here, the middle panels (VmRSS) and the drop in memory req'd when not loading everything onto the last PET for binary restarts. That said, I can't explain why the memory sizes are << node memory, regardless. It seems like there should be plenty of memory, even w/o PIO+WW3. |
@JessicaMeixner-NOAA Just to let you know that I don't have an account on Hercules. Will need a copy of the |
@theurich theurich They are on hera here: |
George V. noticed that the HRv4 does not work on Hercules or Orion. It hangs sometime after WW3 starts. There is no relevant message in the log files about the hang.
To Reproduce: Run an HRv4 experiment on Hercules or Orion
Additional context
Output