How to run and monitor jobs for Slurm on the Grid


This tutorial goes through the steps of editing a simple batch script and running it on the Grid. The job script can be used as a base to create your own batch scripts. There is also an overview of commands for monitoring and controlling jobs.

Step 1

Log on to the Grid.

Step 2

Copy the job script to your home directory. Type cp /wsu/el7/scripts/tutorial/simple_job.sh ~ (the trailing ~ is the destination, your home directory).

Step 3

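Typing ls lists the files in the current directory, which confirms that the script was copied. To view the contents of the script, type cat simple_job.sh.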

Edit the script to fit your needs. Type vim simple_job.sh.

Step 4

You are now in the vim text editor. Press i to enter insert mode so that you can edit.
Use the up and down arrows to scroll through the file.
Change the email address to your own.

Press Esc and then type :wq and press Enter to save and quit.
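
The contents of simple_job.sh are not reproduced in this tutorial. As a rough sketch only, a minimal Slurm batch script often looks something like the following. The job name, task count, time limit, output file name, and email address are placeholder assumptions, not the values used in the Grid's script. The --mail-user line is the kind of line you would change to your own address.

```shell
#!/bin/bash
# Hypothetical minimal batch script; every value below is a placeholder.
#SBATCH --job-name=simple_job
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=simple_job_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_id@example.edu

# Everything after the #SBATCH lines runs on the allocated compute node.
# SLURM_JOB_ID is set by Slurm inside a job; outside a job it is unset.
node_name=$(uname -n)
echo "Job ${SLURM_JOB_ID:-unknown} running on $node_name"
```

In the --output pattern, %j is replaced by the job id, so each run writes to its own file.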

Step 5

Now that the script is edited, you can submit it to run. Batch scripts are submitted with the sbatch command: sbatch simple_job.sh.
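
If you want to use the job id in later commands, it can help to capture it at submission time. The sketch below assumes a helper named submit_job, which is a hypothetical name and not part of Slurm. It relies on sbatch's --parsable option, which prints only the job id (followed by a semicolon and the cluster name on multi-cluster setups).

```shell
# submit_job is a hypothetical helper name, not a Slurm command.
# $1 = path to the batch script.
submit_job() {
    # --parsable prints "jobid" or "jobid;cluster" instead of a sentence;
    # cut keeps only the part before any semicolon.
    sbatch --parsable "$1" | cut -d ';' -f 1
}

# Example use (commented out so that sourcing this file submits nothing):
# jobid=$(submit_job simple_job.sh)
# echo "Submitted job $jobid"
```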

Step 6

The job will be submitted and sbatch will print its job id, for example Submitted batch job 243915. In this example, the job id is 243915. You can check the status of your jobs by entering qme.

The same can be done by using the command squeue -u <username>.

This will output the following information:

Squeue Output	Definition
JOBID	Unique number assigned to each job
PARTITION	Partition the job is scheduled to run or is running on
NAME	Name of the job, typically the job script name
USER	User who owns the job
ST	Current state of the job
TIME	Amount of time job has been running
NODES	Number of nodes job is scheduled to run across
NODELIST(REASON)	If running, the list of the nodes the job is running on. If pending, the reason the job is waiting
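
When only the state of one job matters, squeue can be asked for just the ST column. The following is a sketch under the assumption that you want to wait in a shell until a job leaves the queue; job_state and wait_for_job are hypothetical helper names. Note that a failed squeue call also produces empty output, so the loop can end early if the scheduler is briefly unreachable.

```shell
# job_state and wait_for_job are hypothetical helper names.

# Print the compact state code (PD, R, CG, ...) of one job.
# -h drops the header line, -j selects the job, -o "%t" prints only the state.
job_state() {
    squeue -h -j "$1" -o "%t" 2>/dev/null
}

# Block until the job no longer appears in squeue output.
# $1 = job id, $2 = seconds between checks (defaults to 30).
wait_for_job() {
    while [ -n "$(job_state "$1")" ]; do
        sleep "${2:-30}"
    done
}
```

For example, wait_for_job 243915 60 checks once a minute and returns when job 243915 is no longer listed.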

These are the possible job states:

Code	State	Meaning
CA	Canceled	Job was canceled
CD	Completed	Job completed
CF	Configuring	Job resources being configured
CG	Completing	Job is completing
F	Failed	Job terminated with non-zero exit code
NF	Node Fail	Job terminated due to failure of node(s)
PD	Pending	Job is waiting for compute node(s)
R	Running	Job is running on compute node(s)
TO	Timeout	Job terminated upon reaching its time limit
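
The table above can be turned into a small lookup for use in scripts. This sketch assumes a helper named describe_state (a hypothetical name) and covers only the codes listed in this tutorial.

```shell
# describe_state is a hypothetical helper name.
# $1 = compact state code as shown in the ST column of squeue.
describe_state() {
    case "$1" in
        CA) echo "Canceled: job was canceled" ;;
        CD) echo "Completed: job completed" ;;
        CF) echo "Configuring: job resources being configured" ;;
        CG) echo "Completing: job is completing" ;;
        F)  echo "Failed: job terminated with non-zero exit code" ;;
        NF) echo "Node Fail: job terminated due to failure of node(s)" ;;
        PD) echo "Pending: job is waiting for compute node(s)" ;;
        R)  echo "Running: job is running on compute node(s)" ;;
        TO) echo "Timeout: job terminated upon reaching its time limit" ;;
        *)  echo "Unknown state code: $1" ;;
    esac
}
```

For example, describe_state PD prints the Pending line.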

Step 7

A useful command for getting detailed job information is scontrol show job <jobid>.
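
The output of scontrol show job is a series of Key=Value pairs, so a single field can be picked out with standard text tools. The sketch below assumes a helper named job_field (a hypothetical name). It splits on spaces, so values that themselves contain spaces would be cut short.

```shell
# job_field is a hypothetical helper name.
# $1 = job id, $2 = field name such as JobState or NodeList.
job_field() {
    # -o prints each job on one line; tr puts every Key=Value pair on its
    # own line so that grep can select the requested key.
    scontrol -o show job "$1" | tr ' ' '\n' | grep "^$2=" | head -n 1 | cut -d '=' -f 2-
}
```

For example, job_field 243915 JobState prints only the state of job 243915.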

Step 8

After your job has completed, you can get additional information using the command sacct.

Command	Meaning
sacct -j <jobid>	Get information based on job id
sacct -j <jobid> --format=JobID,Jobname,partition,state,time,MaxRss,MaxVMSize,nodelist	Add the --format option for more detailed output. See Reference for the complete list of options.
sacct -u <username>	View information for all jobs of a user
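
For use in scripts, sacct can print pipe-delimited output without a header. The following is a sketch with hypothetical helper names job_summary and job_final_state; the field list (JobID, State, Elapsed) is an illustrative choice.

```shell
# job_summary and job_final_state are hypothetical helper names.

# Print one "JobID|State|Elapsed" line for the job.
# -X limits output to the job allocation (no individual steps),
# -n drops the header, -P separates fields with "|".
job_summary() {
    sacct -j "$1" -X -n -P --format=JobID,State,Elapsed
}

# Print only the State field of that line.
job_final_state() {
    job_summary "$1" | head -n 1 | cut -d '|' -f 2
}
```

Because -X hides the job steps, step-level fields such as MaxRSS are left out of this sketch; drop -X if you need them.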

Step 9

There are a few useful commands for controlling jobs.

Command	Meaning
scancel <jobid>	Cancel one job
scancel -u <username>	Cancel all jobs for a user
scontrol hold <jobid>	Hold a job from being scheduled
scontrol release <jobid>	Release a job to be scheduled
scontrol requeue <jobid>	Requeue (cancel and rerun) a job
scancel <jobid>_<index>	Cancel an indexed job in a job array
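
The cancel commands can be narrowed with scancel's --state option. As a sketch, the hypothetical helper cancel_pending below cancels only the jobs of a user that are still waiting in the queue and leaves running jobs alone.

```shell
# cancel_pending is a hypothetical helper name.
# $1 = username (defaults to the current user from $USER).
cancel_pending() {
    # --state=PENDING restricts the cancel to jobs that have not started.
    scancel --user="${1:-$USER}" --state=PENDING
}
```

Check the list with squeue -u <username> first, since cancelled jobs cannot be restored.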
