JEDI bundle run fails on Hera (Rocky8) #78

Closed
chan-hoo opened this issue Mar 25, 2024 · 6 comments · Fixed by #102
@chan-hoo
Collaborator

chan-hoo commented Mar 25, 2024

The run_ana (analysis) task fails with the following error when it runs the JEDI bundle after the Rocky8 transition on Hera:

==== backtrace (tid:2140722) ====
 0 0x00000000000534e9 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000012cf0 __funlockfile()  :0
 2 0x0000000000030068 eckit::mpi::Parallel::create()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/cache/build_stage/spack-stage-eckit-1.24.4-vg3tc4msqqmo7nlok3wwgn3iwvwwcnzw/spack-src/src/eckit/mpi/Parallel.cc:702
 3 0x000000000014ae28 eckit::mpi::Comm::gather<double>()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.3.0/envs/unified-env/install/intel/2021.5.0/eckit-1.20.2-3cx4lx2/include/eckit/mpi/Comm.h:541
 4 0x000000000014ae28 util::printRunStats()  /scratch1/NCEPDEV/stmp2/Rhaesung.Kim/jedi/jedi-bundle/oops/src/oops/util/printRunStats.cc:35
 5 0x00000000000fcd67 oops::Run::execute()  /scratch1/NCEPDEV/stmp2/Rhaesung.Kim/jedi/jedi-bundle/oops/src/oops/runs/Run.cc:180
 6 0x0000000000484152 main()  /scratch1/NCEPDEV/stmp2/Rhaesung.Kim/jedi/jedi-bundle/fv3-jedi/src/mains/fv3jediLETKF.cc:22
 7 0x000000000003ad85 __libc_start_main()  ???:0
 8 0x0000000000483fe9 _start()  ???:0
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 2140722 RUNNING AT h5c53
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

@chan-hoo chan-hoo self-assigned this Mar 25, 2024
@chan-hoo chan-hoo added the bug Something isn't working label Mar 25, 2024
@chan-hoo chan-hoo moved this from Todo to In Progress in land-DA_workflow_management Mar 25, 2024
@natalie-perlin
Collaborator

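@chan-hoo, the backtrace appears to mix spack-stack and eckit package versions: frame 2 points to eckit 1.24.4 built under spack-stack-1.5.0, while frame 3 uses eckit 1.20.2 headers from spack-stack-1.3.0. These are the spack-stack environments built for Rocky8: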
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.3.0/envs/unified-env-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.3.1/envs/unified-env-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.0/envs/unified-env-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/ufs-pio-2.5.10-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/fms-test-mar24-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/gsi-addon-env-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/gsi-addon-dev-rocky8
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8

The 1.3.0 and 1.3.1 stacks might be incomplete (per Ratko's comment), so it would be best to update the stack versions and paths.
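
One way to spot this kind of mismatch is to scan the backtrace for versioned directory names. The following is a minimal sketch, not a tool from this project: the function names are hypothetical, and it assumes versions appear in paths as `spack-stack-X.Y.Z` and `eckit-X.Y.Z`, as they do in frames 2 and 3 above.

```python
import re
from collections import defaultdict

# Matches directory-name fragments such as "spack-stack-1.5.0" or "eckit-1.24.4".
VERSIONED_DIR = re.compile(r"\b(spack-stack|eckit)-(\d+(?:\.\d+)+)")


def version_key(version):
    """Sort key that compares versions numerically, component by component."""
    return tuple(int(part) for part in version.split("."))


def versions_in_backtrace(text):
    """Map each package name to the sorted list of versions found in the text."""
    found = defaultdict(set)
    for name, version in VERSIONED_DIR.findall(text):
        found[name].add(version)
    return {name: sorted(vs, key=version_key) for name, vs in found.items()}


def mixed_packages(text):
    """Return only the packages that appear with more than one version."""
    return {n: vs for n, vs in versions_in_backtrace(text).items() if len(vs) > 1}


# Frames 2 and 3 of the backtrace reported in this issue.
BACKTRACE = """\
 2 0x0000000000030068 eckit::mpi::Parallel::create()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/cache/build_stage/spack-stage-eckit-1.24.4-vg3tc4msqqmo7nlok3wwgn3iwvwwcnzw/spack-src/src/eckit/mpi/Parallel.cc:702
 3 0x000000000014ae28 eckit::mpi::Comm::gather<double>()  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.3.0/envs/unified-env/install/intel/2021.5.0/eckit-1.20.2-3cx4lx2/include/eckit/mpi/Comm.h:541
"""

if __name__ == "__main__":
    for package, versions in mixed_packages(BACKTRACE).items():
        print(f"{package}: mixed versions {', '.join(versions)}")
```

A package listed by `mixed_packages` only indicates that two versions show up in the paths; it does not by itself prove that the mismatch causes the crash.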

@natalie-perlin
Collaborator

@chan-hoo -
How do I reproduce the error you were seeing?

@chan-hoo
Collaborator Author

chan-hoo commented Mar 27, 2024

@natalie-perlin, I updated the spack-stack version to 1.5.0 for land-DA_workflow in my feature branch, and the above error appeared when I tested that branch on Hera. It worked well on Orion, so I guess this is related to Rocky8. As you mentioned, the best way would be to update the spack-stack version used by the JEDI bundle. I'd like to know whether the SI team can do that.

To reproduce the above error:
  1. Check out my feature branch on Hera:
     git clone -b bugfix/mod_update --recursive https://github.com/chan-hoo/land-DA_workflow
  2. Build the app and set the configuration. You can follow the steps in the user's guide (2.1.3): https://land-da-workflow.readthedocs.io/en/develop/BuildingRunningTesting/BuildRunLandDA.html
  3. Copy my land_analysis.yaml file and change only EXP_BASEDIR and ACCOUNT in it:
     /scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/landda_test/land-DA_workflow/parm
  4. Run rocotorun.
  5. Check the error file run_ana.log:
     /scratch2/NCEPDEV/fv3-cam/Chan-hoo.Jeon/landda_test/com/output/logs/run_era5/run_ana.log
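
The step of changing only EXP_BASEDIR and ACCOUNT in a copied land_analysis.yaml could be scripted roughly as follows. This is a sketch under assumptions: `set_yaml_scalars` is a hypothetical helper, the sample text is a made-up stand-in for the real file (whose layout is not shown in this thread), and the replacement values are placeholders.

```python
import re

# Matches "<indent>KEY:" at the start of a line.
KEY_LINE = re.compile(r"^([ \t]*)([A-Za-z_][A-Za-z0-9_]*)[ \t]*:")


def set_yaml_scalars(yaml_text, updates):
    """Rewrite every `KEY: value` line whose key is in `updates`.

    Indentation is kept, the new value is written in double quotes, and any
    trailing comment on a rewritten line is dropped. All other lines pass
    through unchanged. Raises KeyError if a requested key never appears.
    """
    remaining = set(updates)
    new_lines = []
    for line in yaml_text.splitlines():
        match = KEY_LINE.match(line)
        if match and match.group(2) in updates:
            indent, key = match.groups()
            new_lines.append(f'{indent}{key}: "{updates[key]}"')
            remaining.discard(key)
        else:
            new_lines.append(line)
    if remaining:
        raise KeyError(f"keys not found: {sorted(remaining)}")
    return "\n".join(new_lines) + "\n"


# Hypothetical stand-in for land_analysis.yaml; only JEDI_INSTALL is taken
# from this thread, the other values are placeholders.
SAMPLE = """\
ACCOUNT: "original-account"
EXP_BASEDIR: "/path/to/original/basedir"
JEDI_INSTALL: "/scratch2/NAGAPE/epic/UFS_Land-DA/jedi"
"""

if __name__ == "__main__":
    print(set_yaml_scalars(SAMPLE, {"ACCOUNT": "my-account",
                                    "EXP_BASEDIR": "/path/to/my/basedir"}))
```

Editing the text line by line avoids a YAML library dependency and leaves the rest of the file byte-for-byte intact, at the cost of not understanding nesting: a key that occurs at several levels would be rewritten everywhere.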

If you have any questions, please let me know.

@chan-hoo
Collaborator Author

@natalie-perlin, as you can see in land_analysis.yaml, the land-DA_workflow just uses the pre-compiled JEDI bundle set by JEDI_INSTALL: "/scratch2/NAGAPE/epic/UFS_Land-DA/jedi". Therefore, that pre-compiled build is what should be updated.
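
To confirm which pre-compiled JEDI build a given configuration points at, the JEDI_INSTALL value can be read straight from the YAML text. A minimal sketch, assuming the key sits on a single `KEY: value` line as quoted above; `read_yaml_scalar` is a hypothetical helper, not part of the workflow.

```python
import os
import re


def read_yaml_scalar(yaml_text, key):
    """Return the value on the first `key: value` line, or None if absent.

    Surrounding single or double quotes are stripped. Nested structure and
    multi-line values are not handled.
    """
    pattern = re.compile(
        rf"^[ \t]*{re.escape(key)}[ \t]*:[ \t]*(.*?)[ \t]*$", re.MULTILINE
    )
    match = pattern.search(yaml_text)
    if match is None:
        return None
    return match.group(1).strip("\"'")


# The JEDI_INSTALL line as quoted in this thread.
CONFIG_TEXT = 'JEDI_INSTALL: "/scratch2/NAGAPE/epic/UFS_Land-DA/jedi"\n'

if __name__ == "__main__":
    jedi_install = read_yaml_scalar(CONFIG_TEXT, "JEDI_INSTALL")
    status = "exists" if os.path.isdir(jedi_install) else "not found on this machine"
    print(f"JEDI_INSTALL = {jedi_install} ({status})")
```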

@natalie-perlin
Collaborator

@chan-hoo - yes, you are right, JEDI bundle needs to be rebuilt on Rocky OS.

@ulmononian
Collaborator

@natalie-perlin @chan-hoo @jkbk2004 I will try to build the JEDI bundle with the Rocky8 Hera stacks (1.5.0 and probably 1.3.0). Right now, though, ecflow is not loading from /scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles, so I am looking into this.

@chan-hoo chan-hoo moved this from In Progress to On Hold in land-DA_workflow_management May 16, 2024
@chan-hoo chan-hoo moved this from On Hold to In Progress in land-DA_workflow_management May 20, 2024
@chan-hoo chan-hoo linked a pull request May 21, 2024 that will close this issue
@chan-hoo chan-hoo moved this from In Progress to In Review in land-DA_workflow_management May 21, 2024
@github-project-automation github-project-automation bot moved this from In Review to Done in land-DA_workflow_management May 22, 2024