Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Submitting jobs with row

Overview

This section explains how to submit jobs to the scheduler with row.

Preliminary steps

Row has built-in support for a number of clusters:

  • Andes (OLCF)
  • Anvil (Purdue)
  • Delta (NCSA)
  • Frontier (OLCF)
  • Great Lakes (University of Michigan)

You can skip to the next heading if you are using one of these clusters.

If not, then you need to create a configuration file that describe your cluster. You may also need to define launchers specific to your cluster.

  • $HOME/.config/row/clusters.toml gives your cluster a name, instructions on how to identify it, and lists the partitions your cluster provides.
  • $HOME/.config/row/launchers.toml defines how the launcher command prefixes (e.g. MPI, OpenMP) expand. The default for MPI is to use srun. If this doesn't work on your cluster, write launchers.toml to use a different command and/or options.

Many clusters have separate partitions for different resources (e.g. shared, whole node, GPU, etc...). Your final script must request the correct --partition to execute the command and charge accounts properly. clusters.toml describes rules by which row automatically selects partitions when it generates job scripts.

tip

Feel free to ask on the discussion board if you need help writing configuration files for your cluster.

Check that the output of row show cluster and row show launchers is what you expect before continuing.

Checking your job script

For demonstration purposes, this guide will continue using the Hello, workflow example. In fact, you already learned how to submit jobs in that section.

However, you should always check that the job script is correct before you submit on a cluster. The --dry-run option prints the submission script (or scripts) instead of submitting with sbatch:

row submit --dry-run

Remember, YOU ARE RESPONSIBLE for the content of the scripts that you submit. Make sure that the script is requesting the correct resources and is routed to the correct partition.

For example, the example workflow generates this job script on Anvil:

#!/bin/bash
#SBATCH --job-name=hello-directory0+2
#SBATCH --partition=shared
#SBATCH --ntasks=1
#SBATCH --time=180

directories=(
'directory0'
'directory1'
'directory2'
)

export ACTION_CLUSTER="anvil"
export ACTION_NAME="hello"
export ACTION_PROCESSES="1"
export ACTION_WALLTIME_IN_MINUTES="180"

trap 'printf %s\\n "${directories[@]}" | /home/x-joaander/.cargo/bin/row scan --no-progress -a hello - || exit 3' EXIT
for directory in "${directories[@]}"
do
    echo "Hello, $directory!" || { >&2 echo "[ERROR row::action] Error executing command."; exit 2; }
done

Notice the selection of 1 task on the shared partition. This is correct for Anvil, where the shared partition allows jobs smaller than one node and charges based on the number of CPU cores quested.

note

When using row on one of the built-in clusters row should always select the correct partition for your jobs. Please open an issue if it does not.

Submitting jobs

When you are sure that the job script is correct, submit it with:

row submit

If your cluster does not default to the correct account, you can set it in workflow.toml:

[default.action.submit_options.<cluster name>]
account = "<my account>"

The submitted status

Row tracks the Job IDs that it submits. Every time you execute row show status (or just about any row command), it will execute squeue in the background to see which jobs are still submitted.

Use the row show family of commands to query details about submitted jobs. For the hello workflow:

row show status

will show:

Action Completed Submitted Eligible Waiting Remaining cost
hello          0         3        0       0    3 CPU-hours

Similarly,

row show directories --action hello

will show something like:

Directory  Status    Job ID
directory0 submitted anvil/5044933
directory1 submitted anvil/5044933
directory2 submitted anvil/5044933

row submit is safe to use while submitted jobs remain in the queue. Submitted directories are not eligible for execution, so row submit will not submit them again.

Wait a moment for the job to finish executing (you can verify with squeue --me). Then row show status should indicate that the jobs are eligible once more (recall that the hello example creates no products, so it will never complete).

Next steps

Now you know how to use all the features of row. You are ready to deploy it and responsibly execute jobs on thousands of directories in your workflows. Read on in the next section if you would like to learn how to use signac to manage your workspace and/or write your action commands in Python.


Development of row is led by the Glotzer Group at the University of Michigan.

Copyright © 2024-2025 The Regents of the University of Michigan.