GCHP integration test not properly flagging failed run #2527

Open
msulprizio opened this issue Oct 17, 2024 · 2 comments
Labels: category: Bug, never stale, topic: Benchmarking and Testing


@msulprizio

Your name

Melissa Sulprizio

Your affiliation

Harvard

What happened? What did you expect to happen?

My latest GCHP integration tests for PR #2510 indicate all simulations passed:

==============================================================================
GCHP: Execution Test Results

CodeDir       : 0f0aa59 Cloud-J submodule update for 8.0.1 release
MAPL          : 231d53cc Merge pull request #36 from geoschem/feature/improve_hflux_regridding
GMAO_Shared   : 4ddb3ec Merge pull request #2 from geoschem/feature/mapl-upgrade
ESMA_cmake    : ad5deba Added ecbuild as a submodule of ESMA_cmake
gFTL-shared   : 4b82492 Merge branch 'upstream_v1.5.0' into feature/v1.5.0
FMS           : 259759d Merge pull request #3 from geoschem/feature/update_gmao_libs
FVdycoreCubed : af42462 Merge PR #8 (Add PLEadv diagnostic for offline advection in GCHP)
geos-chem     : e29adfd37 Update HEMCO_Config.rc for carbon simulations read chemistry inputs based on species defines
HEMCO         : a3d0c9a Merge PR #289 containing a fix for the stale issue Github workflow
yaFyaml       : 19afe50 Merge branch 'upstream_v1.0.4' into feature/v1.0.4
pFlogger      : 2c4b724 Merge branch 'upstream_v1.9.1' into feature/v1.9.1
Cloud-J       : f8a2b7f Update version number for 8.0.1 release
HETP          : 2a99b24 Merge pull request #2 from geoschem/bugfix/initialize_local_variables

Number of execution tests: 12

Submitted as SLURM job: 52096271
==============================================================================
 
Execution tests:
------------------------------------------------------------------------------
gchp_merra2_carbon..................................Execute Simulation....PASS
gchp_merra2_carbon_CH4..............................Execute Simulation....PASS
gchp_merra2_carbon_CO...............................Execute Simulation....PASS
gchp_merra2_carbon_CO2..............................Execute Simulation....PASS
gchp_merra2_carbon_OCS..............................Execute Simulation....PASS
gchp_merra2_fullchem................................Execute Simulation....PASS
gchp_merra2_fullchem_alldiags.......................Execute Simulation....PASS
gchp_merra2_fullchem_benchmark......................Execute Simulation....PASS
gchp_merra2_fullchem_RRTMG..........................Execute Simulation....PASS
gchp_merra2_fullchem_TOMAS15........................Execute Simulation....PASS
gchp_merra2_tagO3...................................Execute Simulation....PASS
gchp_merra2_TransportTracers........................Execute Simulation....PASS
 
Summary of test results:
------------------------------------------------------------------------------
Execution tests passed: 12
Execution tests failed: 0
Execution tests not yet completed: 0

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%  All execution tests passed!  %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

However, upon further investigation, the gchp_merra2_carbon_CO2 simulation actually failed. There is no output in gchp_merra2_carbon_CO2/OutputDir/ aside from geoschem_species_metadata.yml. The log file execute.gchp_merra2_carbon_CO2.log indicates the model crashed with an ExtData.rc error (which I will debug separately).

I would have expected the gchp_merra2_carbon_CO2 simulation to be reported as failed in results.execute.log.
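
One symptom described above is that OutputDir holds nothing except geoschem_species_metadata.yml. A minimal sketch of a check for that condition follows. The helper name `has_model_output` is hypothetical and is not part of integrationTest.sh; it assumes only the run directory layout mentioned in this report.

```shell
# Hypothetical helper (not part of the integration test scripts).
# Returns 0 if <rundir>/OutputDir contains at least one file other than
# geoschem_species_metadata.yml, and 1 otherwise.
has_model_output() {
    local rundir="$1"
    local outdir="${rundir}/OutputDir"
    local found

    # A missing OutputDir counts as "no output".
    [ -d "${outdir}" ] || return 1

    # Look for any regular file that is not the species metadata file.
    found=$(find "${outdir}" -type f ! -name 'geoschem_species_metadata.yml' | head -n 1)
    [ -n "${found}" ]
}
```

For example, `has_model_output gchp_merra2_carbon_CO2 || echo "no model output"` would print the message for the run directory described here.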

What are the steps to reproduce the bug?

Using the branch for PR #2510 within the GCHP wrapper, run a GCHP integration test. My command:

./integrationTest.sh -d ~/RD/TestCarbon/GCHP_IntTest -t all -e ~/envs/gnu10/gchp.rocky+gnu10.minimal.env 

Please attach any relevant configuration and log files.

No response

What GEOS-Chem version were you using?

bugfix/carbon_co2 branch (14.5.0 with CO2 fixes)

What environment were you running GEOS-Chem on?

Local cluster

What compiler and version were you using?

gcc 10.2.0

Will you be addressing this bug yourself?

Yes

In what configuration were you running GEOS-Chem?

GCHP

What simulation were you running?

Carbon

At what resolution were you running GEOS-Chem?

4x5

What meteorology fields did you use?

MERRA-2

Additional information

No response

@msulprizio msulprizio added the category: Bug Something isn't working label Oct 17, 2024
@msulprizio msulprizio self-assigned this Oct 17, 2024
@lizziel

lizziel commented Oct 17, 2024

It looks like the integration test is considered successful on Cannon if the return code of srun is 0. I wonder if the return code from srun can be zero even when there is an error. Would you be able to reproduce the problem, this time printing the return code ($?) in this part of the code?
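
A minimal sketch of that kind of debugging print, assuming nothing about the actual test script: `run_and_report` is a hypothetical wrapper, and the command passed to it stands in for the srun call, which is not reproduced here.

```shell
# Hypothetical wrapper: run a command, print its return code, and pass the
# same code back to the caller so any existing pass/fail logic is unchanged.
run_and_report() {
    local rc
    "$@"
    rc=$?
    echo "Return code: ${rc}"
    return "${rc}"
}
```

Capturing `$?` immediately after the command matters, because any intervening command (even an `echo`) overwrites it.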

@msulprizio

I reran the integration tests at the same commit and printed out the error code. It returns 0 for all simulations. See slurm-52325154.out.txt.

I confirmed I'm still getting errors in execute.gchp_merra2_carbon_CO2.log as expected since I haven't pushed any fixes for the ExtData issue:

                                                          Mem/Swap Used (MB) at HISTMAPL_GenericInitialize=  1.572E+05  0.000E+00
                                                          Mem/Swap Used (MB) at EXTDATAMAPL_GenericInitialize=  1.572E+05  0.000E+00
pe=00000 FAIL at line=00804    ExtDataGridCompMod.F90                   <Found   1 unfulfilled imports in extdata>
pe=00000 FAIL at line=01807    MAPL_Generic.F90                         <status=1>
pe=00000 FAIL at line=00808    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00661    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00959    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00311    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00258    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00192    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00169    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00029    GCHPctm.F90                              <status=1>
...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 23 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(0, N/A)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
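
Since the return code is 0 even though the log records a crash, one possible workaround is to scan the log as well as checking the return code. The sketch below is an assumption about how that could look, not the fix adopted in the test scripts. The helper names `log_has_failure` and `simulation_passed` are hypothetical, and the two search strings are taken from the log excerpt above.

```shell
# Hypothetical helper: returns 0 if the log contains either failure marker
# seen in the excerpt above ("FAIL at line=" or "MPI_ABORT was invoked").
log_has_failure() {
    local logfile="$1"
    grep -q -e 'FAIL at line=' -e 'MPI_ABORT was invoked' "${logfile}"
}

# Hypothetical helper: a simulation passes only if the return code is 0,
# the log file exists, and the log contains no failure marker.
simulation_passed() {
    local rc="$1"
    local logfile="$2"

    [ "${rc}" -eq 0 ] || return 1
    [ -f "${logfile}" ] || return 1
    ! log_has_failure "${logfile}"
}
```

With this check, `simulation_passed 0 execute.gchp_merra2_carbon_CO2.log` would report failure for the log shown here, because the log contains both markers.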

@yantosca yantosca added the topic: Benchmarking and Testing Related to CI, integration tests, or scientific benchmarking label Oct 29, 2024
@yantosca yantosca added the never stale Never label this issue as stale label Nov 21, 2024