4.2. Frequently Asked Questions (FAQ)
4.2.1. My tasks went DEAD. Why might this be?
The most common reason for the first few tasks to go DEAD is an improper path in the config.yaml file that gets propagated to the land_analysis.xml Rocoto XML file.
In particular, exp_basedir must be set to the directory above land-DA_workflow. For example, if land-DA_workflow resides at Users/Jane.Doe/landda/land-DA_workflow, then exp_basedir must be set to Users/Jane.Doe/landda. After correcting config.yaml, users will need to regenerate the workflow XML by running:
./setup_wflow_env.py -p=<platform>
Then rewind the DEAD tasks using rocotorewind (see Section 4.2.2), and use rocotorun and rocotostat to advance the workflow and check its status (see Section 2.1.5.2 for how to do this).
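The exp_basedir rule above can be checked with a short shell sketch. It only illustrates the relationship between the two paths; the workflow_dir value is the hypothetical example path from this section, and the variable names are chosen here for illustration rather than taken from the workflow scripts.

```shell
# Hypothetical location of the workflow checkout (the example path used above).
workflow_dir="Users/Jane.Doe/landda/land-DA_workflow"

# exp_basedir is the parent directory of land-DA_workflow.
exp_basedir="$(dirname "$workflow_dir")"

echo "exp_basedir: $exp_basedir"   # prints "exp_basedir: Users/Jane.Doe/landda"
```

The printed value is what the exp_basedir entry in config.yaml should contain for this example layout.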
If the first few tasks run successfully but later tasks go DEAD, users will need to check the experiment log files, located at $EXP_BASEDIR/ptmp/<envir>/com/output/logs. It may also be useful to check that the JEDI directory and other paths and values are correct in config.yaml.
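One way to narrow down which log file to read first is sketched below. The scan_logs helper is hypothetical (it is not part of the workflow), and it assumes that a failing task writes the word "error" somewhere in its log, which is only a heuristic.

```shell
# Hypothetical helper: list the newest files in a log directory and flag
# those that mention "error" (case-insensitive).
# Usage: scan_logs <log_dir> [count]
scan_logs() (
  log_dir="$1"
  count="${2:-5}"   # number of most recent files to inspect (default 5)
  # Newest files first; keep only the first $count entries.
  ls -t "$log_dir" | head -n "$count" | while read -r name; do
    # Skip subdirectories and anything else that is not a regular file.
    [ -f "$log_dir/$name" ] || continue
    if grep -qi "error" "$log_dir/$name"; then
      echo "possible error: $log_dir/$name"
    fi
  done
)
```

For example, scan_logs "$EXP_BASEDIR/ptmp/<envir>/com/output/logs" 10 would inspect the ten newest files, with <envir> replaced by the value used in the experiment.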
4.2.2. How do I restart a DEAD task?
On platforms that utilize Rocoto workflow software (including Ursa and Hercules), if something goes wrong with the workflow, a task may end up in the DEAD state:
$ rocotostat -w land_analysis.xml -d land_analysis.db
CYCLE          TASK         JOBID     STATE       EXIT STATUS   TRIES   DURATION
================================================================================
202501190000   jcb          8215490   SUCCEEDED   0             1       6.0
202501190000   prep_data    8215491   SUCCEEDED   0             1       21.0
202501190000   pre_anal     8215492   SUCCEEDED   0             1       6.0
202501190000   analysis     8215496   SUCCEEDED   0             1       152.0
202501190000   post_anal    8215519   SUCCEEDED   0             1       23.0
202501190000   forecast     8215551   DEAD        256           1       -
202501190000   plot_stats   -         -           -             -       -
This means that the DEAD task has not completed successfully, so the workflow has stopped. Once the issue has been identified and fixed (e.g., by referencing the log files in $EXP_BASEDIR/ptmp/<envir>/com/output/logs), users can rewind, or “undo,” the failed task using the rocotorewind command:
rocotorewind -w land_analysis.xml -d land_analysis.db -v 10 -c 202501190000 -t forecast
where -c specifies the cycle date (first column of rocotostat output) and -t specifies the task name (second column of rocotostat output). This resets the number of tries to 0, as though the task had not been run. After using rocotorewind, the next time rocotorun is used to advance the workflow, the job will be resubmitted.
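When several tasks are DEAD, the rewind step can be scripted. The sketch below is illustrative only: list_dead_tasks and rewind_dead_tasks are hypothetical helper names, and the parsing assumes that rocotostat prints whitespace-separated columns in the order shown above (cycle, task, job ID, state). It uses only the -w, -d, -c, and -t options that appear in this section.

```shell
# Hypothetical helper: read rocotostat output on standard input and print
# "CYCLE TASK" for every row whose fourth column (STATE) is DEAD.
list_dead_tasks() {
  awk '$4 == "DEAD" { print $1, $2 }'
}

# Hypothetical helper: rewind every DEAD task in the given workflow.
# Usage: rewind_dead_tasks <workflow.xml> <workflow.db>
rewind_dead_tasks() {
  rocotostat -w "$1" -d "$2" | list_dead_tasks | while read -r cycle task; do
    echo "Rewinding $task for cycle $cycle"
    # Redirect stdin so rocotorewind cannot consume the loop's input.
    rocotorewind -w "$1" -d "$2" -c "$cycle" -t "$task" < /dev/null
  done
}
```

With these functions defined, rewind_dead_tasks land_analysis.xml land_analysis.db would rewind the forecast task in the example above. The underlying problem should still be fixed first, since a rewound task that fails again will simply return to the DEAD state.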