Submitting jobs manually
Overview
This section gives background information on clusters, nodes, and job schedulers. It also outlines your responsibilities when using row on a shared resource.
Clusters
If you are interested in using row, you probably have access to a cluster where you can execute the jobs in your workflows. Row is a tool that makes it easy to generate thousands of jobs. Please use it responsibly.
With that warning out of the way, let's cover some of the basics you need to know.
note
This guide is generic and covers only the topics directly related to row. You can find more information in your cluster's documentation.
Login and compute nodes
Clusters are large groups of computers called nodes. When you log in to a cluster, you are given direct access to a login node. A typical cluster might have 2-4 login nodes. Login nodes are SHARED RESOURCES that many others actively use. You should use login nodes to edit text files, submit jobs, check on job status, and maybe compile source code. In general, you should restrict your login node usage to commands that will execute and complete immediately (or within a minute).
You should execute everything that takes longer than a minute or otherwise uses extensive resources on one or more compute nodes. Typical clusters have thousands of compute nodes.
Job scheduler
The job scheduler controls access to the compute nodes. It ensures that each job gets exclusive access to the resources it needs to execute. To see what jobs are currently scheduled, run
squeue
on a login node.
note
This guide assumes your cluster uses Slurm. Refer to your cluster's documentation for equivalent commands if it uses a different scheduler.
You will likely see some PENDING and RUNNING jobs. RUNNING jobs have been assigned a number of (possibly fractional) compute nodes and are currently executing on those resources. PENDING jobs are waiting for the resources that they request to become available.
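For illustration only, the default squeue output looks something like the following (the job IDs, names, users, and partition here are made up; the ST column shows PD for PENDING and R for RUNNING):
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345  standard    job-a    user1  R    1:23:45      1 node0123
12346  standard    job-b    user2 PD       0:00      4 (Priority)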
Submitting a job
You should understand how to submit a job manually before you use row to automate
the process. Start with a "Hello, world" job. Place this text in a file called job.sh:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=1
echo "Hello, World!"
taskset -cp $$
The first line of the script tells Slurm that this is a bash script. The next two are options that will be processed by the scheduler:
--ntasks=1 requests that the job scheduler allocate at least 1 CPU core (it may allocate and charge your account for more, see below).
--time=1 indicates that the script will execute in 1 minute or less.
The last two lines are the body of our script. This example prints "Hello, World!" and then the list of CPU cores the job is allowed to execute on.
To submit the job to the scheduler, execute:
sbatch job.sh
important
Check the documentation for your cluster before submitting this job. If sbatch reports an error, you may also need to set --account, --partition, or other options.
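For example, these options take the same #SBATCH form as the others. The account and partition names below are placeholders; substitute the values listed in your cluster's documentation:
#!/bin/bash
# my-account and standard are placeholder names - use your cluster's values.
#SBATCH --account=my-account
#SBATCH --partition=standard
#SBATCH --ntasks=1
#SBATCH --time=1
echo "Hello, World!"
taskset -cp $$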
When sbatch successfully submits the job, it will report the job's ID. You can monitor the status of the job with:
squeue --me
The job will first appear in the PENDING state. Once a compute node with the requested resources becomes available, the scheduler will start the job executing on that node. squeue will then report that the job is RUNNING. The job should complete after a few moments, at which point squeue will no longer list it.
At this time, you should see a file slurm-<Job ID>.out appear in your current
directory. Inspect its contents to see the output of the script. For example:
Hello, World!
pid 830675's current affinity list: 99
note
If you see more than one number in the affinity list (e.g. 0-127), then the
scheduler gave your job access to more CPU cores than --ntasks=1 asks for.
This may be because your cluster allocates whole nodes to jobs. Refer to
your cluster's documentation to see specific details on how jobs are allocated
to nodes and charged for resource usage. Remember, it is YOUR RESPONSIBILITY (not
row's) to understand whether --ntasks=1 costs 1 CPU-hour per hour or more (e.g.
128 CPU-hours per hour). If your cluster lacks a shared partition, then you need to structure your actions and groups so that they use all the cores you are given, or else those resources are wasted.
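As a minimal sketch of one way to fill a whole node outside of row (my_program is a hypothetical placeholder for your own executable), a job can launch one process per allocated core in the background and wait for all of them to finish:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=10
# SLURM_CPUS_ON_NODE is set by Slurm to the number of CPU cores allocated on this node.
# Launch one copy of my_program (a placeholder) per allocated core, in the background.
for i in $(seq 1 "$SLURM_CPUS_ON_NODE")
do
    ./my_program "$i" &
done
# Wait for all background processes to finish before the job script exits.
wait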
Requesting resources
There are many types of resources that you can request in a job script. One is time.
The above example requested 1 minute (--time=1). The --time option is a promise
to the scheduler that your job will complete in less than the given time. The
scheduler will use this information to efficiently plan other jobs to run after
yours. If your job is still running after the specified time limit, the scheduler
will terminate your job.
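Slurm accepts the time limit in several formats. For example, any one of the following lines (they are alternatives, not meant to be combined in a single script):
# 30 minutes:
#SBATCH --time=30
# 2 hours (hours:minutes:seconds):
#SBATCH --time=2:00:00
# 1 day and 12 hours (days-hours):
#SBATCH --time=1-12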
Another resource you can request is more CPU cores. For example, add --cpus-per-task=4
to the above script:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=1
echo "Hello, World!"
taskset -cp $$
Submit this script and see if the output is what you expect.
You can also request GPUs, memory, licenses, and other resources. In the next section, you will learn how to use row to automatically generate job scripts that request CPUs, GPUs, and time. You can set custom submit options to request others.
Most clusters also have separate partitions for certain resources (requested with --partition=<partition>). See your cluster's documentation for details.
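As a sketch only, a script requesting one GPU and 8 GB of memory might look like the following. The gpu partition name is a placeholder, and some clusters use --gres=gpu:1 instead of --gpus; check your cluster's documentation:
#!/bin/bash
# The partition name and GPU request syntax are placeholders - they vary between clusters.
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH --mem=8G
#SBATCH --time=10
# nvidia-smi lists the GPU(s) visible to the job.
nvidia-smi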
Next steps
Now that you know about compute nodes and job schedulers, you can learn how to define these resource requests in workflow.toml so that row can generate appropriate job scripts.