HOWTO submit jobs

From HPC
Revision as of 11:32, 4 July 2013 by Cwmoller

Submitting jobs

TORQUE comes with very complete man pages. Therefore, for complete documentation of TORQUE commands you are encouraged to type man pbs and go from there. Jobs are submitted using the qsub command. Type man qsub for information on the plethora of options that it offers.

Let's say I have an executable called "myprog". Let me try and submit it to TORQUE:

[username@head002 ~]$ qsub myprog
qsub:  file must be an ascii script

Oops... That didn't work because qsub expects a shell script. Any shell should work, so use your favorite one. So I write a simple script called "myscript.sh":

#!/bin/bash
cd $PBS_O_WORKDIR
./myprog argument1 argument2

and then I submit it:

[username@head002 ~]$ qsub myscript.sh
31.head002.sun.ac.za

That worked! Note the use of the $PBS_O_WORKDIR environment variable. This is important, since by default TORQUE on our cluster will start executing the commands in your shell script from your home directory. To go to the directory in which you executed qsub, cd to $PBS_O_WORKDIR. There are several other useful TORQUE environment variables that we will encounter later.
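If you also want to run the script by hand before submitting it, a small guard helps, because $PBS_O_WORKDIR is only set inside a job. The following is a minimal sketch of that idea; the workdir variable is just a name I picked for illustration:

```shell
#!/bin/bash
# PBS_O_WORKDIR is set by TORQUE inside a job. Outside a job, fall back to
# the current directory so the script can be tried by hand.
workdir="${PBS_O_WORKDIR:-$PWD}"

# Quote the path (it may contain spaces) and stop if the cd fails, so the
# program never runs in the wrong directory.
cd "$workdir" || exit 1
echo "Running in $(pwd)"
```

Quoting the variable and checking the result of cd are general shell habits, not something TORQUE requires.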

Specifying job parameters

By default, any script you submit will run on a single processor for a maximum of 24 hours. The name of the job will be the name of the script, and it will not email you when it starts, finishes, or is interrupted. stdout and stderr are collected into separate files named after the job number. You can affect the default behavior of TORQUE by passing it parameters. These parameters can be specified on the command line or inside the shell script itself. For example, let's say I want to send stdout and stderr to a file that is different from the default:

[username@head002 ~]$ qsub -e myprog.err -o myprog.out myscript.sh

Alternatively, I can actually edit myscript.sh to include these parameters. I can specify any TORQUE command line parameter I want in a line that begins with "#PBS":

#!/bin/bash
#PBS -e myprog.err
#PBS -o myprog.out
cd $PBS_O_WORKDIR
./myprog argument1 argument2

Now I just submit my modified script with no command-line arguments:

[username@head002 ~]$ qsub myscript.sh

Useful PBS parameters

Here is an example of a more involved script that requests only 1 hour of execution time, renames the job, and sends email when the job begins, ends, or aborts:

#!/bin/bash
 
# Name of my job:
#PBS -N My-Program
 
# Run for 1 hour:
#PBS -l walltime=1:00:00
 
# Where to write stderr:
#PBS -e myprog.err
 
# Where to write stdout: 
#PBS -o myprog.out
 
# Send me email when my job aborts, begins, or ends
#PBS -m abe
 
# This command switches to the directory from which the "qsub" command was run:
cd $PBS_O_WORKDIR
 
#  Now run my program
./myprog argument1 argument2
 
echo Done!

Some more useful PBS parameters:

  • -M: Specify your email address.
  • -j oe: Merge standard output and standard error into the standard output file.
  • -V: Export all your environment variables to the batch job.
  • -I: Run an interactive job (see below).

Once again, you are encouraged to consult the qsub manpage for more options.
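As a rough sketch of how the non-interactive options above fit together, here is a script that merges stderr into a single log file, mails a placeholder address, and exports the submitting environment. The address user@domain and the file name myprog.log are placeholders, and the job_id variable is only there for illustration:

```shell
#!/bin/bash
#PBS -N My-Program
# Merge stderr into stdout and write both to one file:
#PBS -j oe
#PBS -o myprog.log
# Mail the address below when the job aborts or ends:
#PBS -m ae
#PBS -M user@domain
# Export the environment of the submitting shell to the job:
#PBS -V

cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# PBS_JOBID is set by TORQUE inside a job; use a marker when run by hand.
job_id="${PBS_JOBID:-not-in-a-job}"
echo "Job $job_id started on ${HOSTNAME:-unknown}"
```

Replace the final echo with the command that starts your own program.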

Special concerns for running OpenMP programs

By default, PBS assigns you 1 core on 1 node. You can, however, request up to 48 cores per node. If you want to run an OpenMP program, you must therefore specify the number of processors per node with the flag -l nodes=1:ppn=<cores>, where <cores> is the number of OpenMP threads you wish to use. Keep in mind that you must still set the OMP_NUM_THREADS environment variable to the same value within your script, e.g.:

#!/bin/bash
#PBS -N My-OpenMP-Script
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
 
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
./MyOpenMPProgram
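To avoid keeping the ppn value and OMP_NUM_THREADS in sync by hand, one option is to count the lines of the nodefile, which holds one line per allocated core (the same trick the Fluent and Abaqus examples use). This is only a sketch: it assumes a single-node job, and the fallback to 1 thread outside a job is my own choice:

```shell
#!/bin/bash
#PBS -N My-OpenMP-Script
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00

cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# The nodefile has one line per allocated core. Outside a job the variable
# is unset, so default to a single thread.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    OMP_NUM_THREADS=$(wc -l < "$PBS_NODEFILE")
else
    OMP_NUM_THREADS=1
fi
export OMP_NUM_THREADS
echo "Using $OMP_NUM_THREADS OpenMP thread(s)"

# Only start the program if it is present and executable.
if [ -x ./MyOpenMPProgram ]; then
    ./MyOpenMPProgram
fi
```

With this in place, changing ppn in the #PBS line is enough to change the thread count.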

Using the PBS_NODEFILE for multi-node jobs

Until now, we have only dealt with jobs that run on a single node. In such a job, your PBS script is automatically executed on the node assigned by the scheduler. If you ask for more than one node, however, your script will still only execute on the first of the nodes allocated to you. To use the remaining nodes, you must either use MPI or launch processes on them yourself. But which nodes can you run on? PBS lists them in a file whose location is stored in the PBS_NODEFILE environment variable. In your shell script, you can find out which nodes your job may use by reading that file:

#!/bin/bash
#PBS -l nodes=2:ppn=8
 
echo "The nodefile for this job is stored at $PBS_NODEFILE"
# Print the file directly so that each entry stays on its own line
cat "$PBS_NODEFILE"

When you run this job, you should then get output similar to:

The nodefile for this job is stored at /var/spool/torque/aux/33.head002.sun.ac.za
comp001.sun.ac.za
comp001.sun.ac.za
comp001.sun.ac.za
comp001.sun.ac.za
comp001.sun.ac.za
comp001.sun.ac.za
comp001.sun.ac.za
comp001.sun.ac.za
comp002.sun.ac.za
comp002.sun.ac.za
comp002.sun.ac.za
comp002.sun.ac.za
comp002.sun.ac.za
comp002.sun.ac.za
comp002.sun.ac.za
comp002.sun.ac.za

If you have an application that manually forks processes onto the nodes of your job, you are responsible for parsing the PBS_NODEFILE to determine which nodes those are. If we ever catch you running on nodes that are not yours, we will provide your name and contact info to the other HPC users whose jobs you have interfered with and let vigilante justice take its course.
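Parsing the nodefile can be as simple as the sketch below, which counts the total number of cores and then lists each distinct node with the number of cores granted on it. So that it can be tried outside a job, it builds a sample nodefile when PBS_NODEFILE is unset; that sample simply mirrors the example output above and is not real scheduler data:

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=8

# Outside a TORQUE job PBS_NODEFILE is unset, so create a sample file
# (illustrative contents only) for a dry run.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    for node in comp001.sun.ac.za comp002.sun.ac.za; do
        for _ in 1 2 3 4 5 6 7 8; do echo "$node"; done
    done > "$PBS_NODEFILE"
fi

# Total number of allocated cores: one line per core.
np=$(wc -l < "$PBS_NODEFILE")
echo "Allocated $np cores in total"

# Each distinct node, with the number of cores granted on it.
nodes=$(sort -u "$PBS_NODEFILE")
for node in $nodes; do
    cores=$(grep -c -x -F "$node" "$PBS_NODEFILE")
    echo "$node: $cores cores"
done
```

An application that starts its own processes should only ever use the names in that list.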

MPI jobs also require you to feed the PBS_NODEFILE to mpirun.
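What that looks like depends on the MPI implementation installed, so treat the following as a sketch rather than the exact command for our cluster. It assumes an mpirun that accepts -np and -machinefile (both Open MPI and MPICH do), and my_mpi_prog is a hypothetical program name:

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=1:00:00

cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Hypothetical program name; replace with your own MPI executable.
prog=./my_mpi_prog

if [ -n "${PBS_NODEFILE:-}" ]; then
    # One MPI rank per allocated core (one nodefile line per core).
    np=$(wc -l < "$PBS_NODEFILE")
    mpirun -np "$np" -machinefile "$PBS_NODEFILE" "$prog"
else
    echo "PBS_NODEFILE is not set; submit this script with qsub." >&2
fi
```

Check man mpirun for the option names your MPI installation actually uses.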

Examples

Fluent

A Fluent script requesting 4 cores on 1 node. The -m ae option mails you when the job aborts or ends, and -M sets the email address to send to.

#!/bin/bash
#
#PBS -l nodes=1:ppn=4
#PBS -m ae
#PBS -M user@domain

cd $PBS_O_WORKDIR
 
export PATH=/apps/ansys.inc/ansys_inc/v145/fluent/bin:$PATH
 
export LM_LICENSE_FILE=/apps/ansys.inc/ansys_inc/shared_files/licensing/license.dat

# Automatically calculate the number of processors
np=$(cat $PBS_NODEFILE | wc -l)

fluent 3d -g -t$np -ssh -i xyz.jou -cnf=${PBS_NODEFILE}

Abaqus

An Abaqus script requesting 4 cores on 1 node. The -m ae option mails you when the job aborts or ends, and -M sets the email address to send to.

#!/bin/bash
#
#PBS -l nodes=1:ppn=4
#PBS -m ae
#PBS -M user@domain

cd $PBS_O_WORKDIR

# the input file without the .inp extension
JOBNAME=xyz

# Automatically calculate the number of processors
np=$(cat $PBS_NODEFILE | wc -l)

/apps/Abaqus/Commands/abaqus job=$JOBNAME analysis cpus=$np interactive
wait