Queensland University of Technology, Brisbane, Australia - Information Technology Services

Batch Processing


To get the best performance for all users of the system, a balance must be struck between running jobs interactively and scheduling them to run in batch mode. Many applications, including editors and compilers, must be used interactively. However, once applications (including Matlab scripts) have been developed, we strongly encourage users to run their jobs through the batch system. The main reason is checkpointing, i.e., taking a snapshot of a process so that it can pick up where it left off after a crash. Checkpointing is quite tricky for the average user to do by hand, but PBS handles it easily. In fact, by default, PBS checkpoints all running jobs when the machine is brought down gently, unless otherwise specified (see the g98 example below).

The queues are grouped into three categories: general, special and application specific. Each category is described below, together with the resource characteristics of each queue.

General Queues

Four queues fall into the general purpose category. They are aimed at jobs with small to average memory and CPU time requirements, and have a relatively high priority to ensure a quick turnover of jobs.

Name       Max Jobs  Jobs/User  Memory  #CPUs  Time  Chkpt min  Priority  Description
gen_30min  12        4          256M    4      30m   n/a        5         Jobs which require up to 30 minutes of CPU time, with a minimum memory requirement.
gen_2hour  12        4          256M    4      2hr   n/a        10        Jobs which will run for up to 2 hours of CPU time, with a minimum memory requirement.
gen_4hour  12        4          512M    4      4hr   180mins    15        Jobs which will run for up to 4 hours of CPU time, with a minimum memory requirement.
gen_8hour  9         3          512M    4      8hr   210mins    20        Jobs requiring a significant amount of CPU time (up to 8 hours) and/or a larger memory allocation, up to 512M.
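As a rough illustration of how the table above can be read, the following sketch picks the smallest general queue whose limits cover a job's estimated needs. The function name pick_general_queue is hypothetical (it is not part of PBS or of this site's tools), and the sketch assumes that the Memory and Time columns are per-job limits.

```shell
# Hypothetical helper, not a PBS command: choose the smallest general queue
# whose CPU-time and memory limits (taken from the table) cover the job.
# Assumption: the Memory and Time columns are per-job limits.
# Usage: pick_general_queue CPU_MINUTES MEMORY_MB
pick_general_queue() {
    cpu_min=$1
    mem_mb=$2
    if [ "$cpu_min" -le 30 ] && [ "$mem_mb" -le 256 ]; then
        echo gen_30min
    elif [ "$cpu_min" -le 120 ] && [ "$mem_mb" -le 256 ]; then
        echo gen_2hour
    elif [ "$cpu_min" -le 240 ] && [ "$mem_mb" -le 512 ]; then
        echo gen_4hour
    elif [ "$cpu_min" -le 480 ] && [ "$mem_mb" -le 512 ]; then
        echo gen_8hour
    else
        # Nothing in the general category fits; a special queue is needed.
        echo none
        return 1
    fi
}
```

For example, a job estimated at 90 minutes of CPU time and 300 MB of memory exceeds the 256M limit of gen_2hour, so the sketch suggests gen_4hour.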

Special Queues

The special queues are designed to provide larger allocations of memory and/or CPU time, with some additional load scheduling constraints. The gen_nolimit and gen_largemem queues place no limit on the amount of CPU time a job may use. However, jobs in the gen_offpeak queue are restricted to run from 7:30pm to 7:30am Monday to Thursday, and from 7:30pm Friday to 7:30am Monday; jobs that spill into peak periods will be checkpointed and then held until the next offpeak period. Note that the gen_offpeak queue treats Brisbane's public holidays (e.g., Labour Day, Exhibition holiday) as normal days.

Name          Max Jobs  Jobs/User  Memory  #CPUs  Time       Chkpt min  Priority  Description
gen_nolimit   18        6          1.5G    4      unlimited  720mins    30        Jobs requiring an indeterminate amount of CPU time and/or a large memory allocation.
gen_offpeak   20        6          2G      2      unlimited  300mins    5         Jobs* requiring an indeterminate amount of CPU time and/or a very large memory requirement. Jobs only run overnight and on weekends.
gen_largemem  2         2          4G      1      unlimited  720mins    30        Jobs requiring large memory allocations but no more than 1 CPU.

* PBS checkpoints gen_offpeak jobs that spill into peak periods. Once checkpointed, each job is held and its resources (memory, CPUs, any license tokens) are released. It is therefore recommended that processes run in this queue be "home-grown" C, C++ or Fortran programs that are straightforward to checkpoint. Jobs submitted with the '-c n' flag run a very real risk of being REMOVED from the queue by a system administrator.
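The offpeak window described above can be expressed as a small test. This is only a sketch: the function name is_offpeak is hypothetical, and it assumes that 7:30am itself counts as peak and 7:30pm itself counts as offpeak. In line with the note on public holidays, it uses no holiday calendar.

```shell
# Hypothetical helper, not a PBS command: succeed if the given time falls
# inside the gen_offpeak window.
# Assumption: 07:30 exactly is peak, 19:30 exactly is offpeak.
# Usage: is_offpeak DAY HOUR MINUTE   (DAY: 1=Monday ... 7=Sunday, as from `date +%u`)
is_offpeak() {
    day=$1
    hour=${2#0}            # strip one leading zero so "08" is not read as octal
    hour=${hour:-0}        # "0" or "00" would otherwise end up empty
    minute=${3#0}
    minute=${minute:-0}
    now=$((hour * 60 + minute))

    # Saturday and Sunday are entirely offpeak.
    if [ "$day" -ge 6 ]; then
        return 0
    fi
    # Monday to Friday: peak runs from 07:30 (minute 450) up to 19:30 (minute 1170).
    if [ "$now" -lt 450 ] || [ "$now" -ge 1170 ]; then
        return 0
    fi
    return 1
}
```

To test the current time, the function could be called as is_offpeak "$(date +%u)" "$(date +%H)" "$(date +%M)".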

Application Specific Queues

Abaqus

The following queues are used to run Abaqus finite element analysis jobs only. The number of concurrent Abaqus jobs is controlled by the Abaqus licensing software through the use of network tokens. Note, Abaqus 5.8 jobs relying on the site version of abaqus.env are automatically run over 6 processors. Currently, Abaqus 6.2 jobs only run over one processor; contact HPC staff to learn how to run in parallel.

Name          Max Jobs  Jobs/User  Memory  #CPUs  Time       Chkpt min  Priority  Description
abaqus_1hour  2         2          512M    6      1hr        n/a        8         Small Abaqus jobs which will complete within 60 minutes.
abaqus        7         3          1G      10     unlimited  n/a        20        Larger Abaqus jobs.

Upon completion of an Abaqus job, our PBS epilogue script does a number of things to ensure disk and licensing resources are not abused. Firstly, job temporary files with the following extensions are removed: .eig, .fct, .lnc, .opr, .scr, .sdb, .sol, .023. Secondly, job files with the following extensions will be compressed using 'gzip': .fil, .dat, .odb, .db, .jou. (These files can be uncompressed later using 'gunzip'.) Thirdly, the job's temporary directory residing on /scratch/abaqustmp will be removed. Finally, license tokens are returned to the Abaqus license manager.
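The first two epilogue steps can be pictured with the sketch below. It is not the site's actual epilogue script: the function name abaqus_cleanup is hypothetical, and the sketch assumes that job files are named jobname.extension and sit in a single directory. The removal of the /scratch/abaqustmp directory and the return of license tokens are site specific and are left out.

```shell
# Illustrative sketch only; NOT the site's real PBS epilogue.
# Assumption: files are named JOBNAME.EXT inside DIRECTORY.
# Usage: abaqus_cleanup DIRECTORY JOBNAME
abaqus_cleanup() {
    dir=$1
    job=$2

    # Step 1: remove temporary files with the listed extensions.
    for ext in eig fct lnc opr scr sdb sol 023; do
        rm -f "$dir/$job.$ext"
    done

    # Step 2: compress result files with the listed extensions, if present.
    for ext in fil dat odb db jou; do
        if [ -f "$dir/$job.$ext" ]; then
            gzip -f "$dir/$job.$ext"
        fi
    done
}
```

Files with other extensions, such as the job's input file, are left untouched.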

Gaussian 98

Due to the extensive run-times associated with Gaussian, all jobs in this queue are run with a nice value of 10.

Name  Max Jobs  Jobs/User  Memory  #CPUs  Time       Chkpt min  Priority  Description
g98   20        15         1G      1      unlimited  1440mins   30        Gaussian 98 jobs only.

Batch System Commands

The following commands can be used to submit batch jobs and monitor their progress. For further reading material, download the PBS user guide.

Command                           Meaning
% qstat -Q                        display what queues are available for you to use
% qstat                           get a concise overview of all running and queued jobs
% qstat -u username               find out what username's jobs are currently doing
% qstat -f | more                 find out what everyone's jobs are currently doing (very detailed)
% qstat -f jobid | grep comment   see if (and when) a job with identifier jobid started running, and if not, why not
% qsub script                     submit your job script to the default queue (gen_2hour)
% qsub -q queue_name script       submit your job script to a specific queue
% qdel jobid                      abort a job with identifier jobid
% qhold jobid                     place a hold on a job with identifier jobid
% qrls jobid                      release a hold on a job with identifier jobid
% xpbs &                          run the X-windows batch monitoring application in background

You can view a modified version of 'qstat' (note: accessible only if you are working inside the QUT domain).

Examples

To submit a (non-Abaqus) job to a PBS queue, you must first create a submit 'script'. Say user bloggs wants to run her program 'newton' out of her home directory, then 'laguerre' out of her work directory. The submit script (let's call it 'myjobs') would look something like:

cd /home/bloggs
newton
cd /work/bloggs
laguerre

To submit the script to the gen_4hour queue:

% qsub -q gen_4hour -m abe -o output.dat -e error.dat -c c myjobs

This command line will submit the script 'myjobs' to the gen_4hour batch queue. An email message will be sent at the beginning (b) and end (e) of her job, or if it aborts (a). The file 'output.dat' will contain anything that usually gets sent to the screen (-o), except errors (-e), which will go into 'error.dat'. Checkpointing (-c) will occur at the queue's default interval (c), which for gen_4hour is the 180 minutes listed under 'Chkpt min'.

Now, say user bloggs wanted to run a larger job 'mylargejob' in the gen_nolimit queue. Also, say she wanted to checkpoint once a day instead of once every 12 hours and was not concerned with receiving email messages:

% qsub -q gen_nolimit -o output.dat -e error.dat -c c=1440 mylargejob

Next, say bloggs wanted to run a Gaussian 98 job in the g98 queue but did not want to checkpoint (because g98 does this already) and did not want the job to be restarted (-r) if the machine rebooted:

% qsub -r n -q g98 -o output.dat -e error.dat -c n my_g98job

Finally, imagine bloggs wanted to run a 40-hour job in the offpeak queue using the default checkpoint interval:

% qsub -q gen_offpeak -o output.dat -e error.dat -c c my_offpeakjob

If the job is submitted on Thursday afternoon, we would expect the following: (i) the job runs for 12 hours on Thursday night and is checkpointed twice along the way, (ii) the job is checkpointed and held at 7:30am Friday, (iii) the job is released at 7:30pm Friday and continues running over the weekend.
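The arithmetic behind this timeline can be checked with a few lines of shell. The function names below are hypothetical, and the sketch assumes the job accumulates CPU time at the same rate as wall-clock time, so that the 300-minute checkpoint interval of gen_offpeak maps directly onto the 12-hour overnight window.

```shell
# Hypothetical helpers for the timeline arithmetic; not PBS commands.
# Assumption: CPU time accrues at wall-clock rate while the job is running.

# Number of periodic checkpoints that fit inside a run window.
# Usage: periodic_checkpoints WINDOW_MINUTES INTERVAL_MINUTES
periodic_checkpoints() {
    echo $(( $1 / $2 ))
}

# Run time still outstanding after part of the job has completed.
# Usage: remaining_minutes TOTAL_MINUTES MINUTES_ALREADY_RUN
remaining_minutes() {
    echo $(( $1 - $2 ))
}

# Thursday night: 7:30pm to 7:30am is 720 minutes, with a 300-minute interval.
periodic_checkpoints 720 300     # prints 2

# A 40-hour (2400-minute) job, held on Friday morning after 720 minutes:
remaining_minutes 2400 720       # prints 1680 (28 hours)
```

The weekend window from 7:30pm Friday to 7:30am Monday is 60 hours, so the remaining 28 hours fit within it.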

Checkpointing manually

Ideally, with appropriate job submission, there should be no need to checkpoint manually. However, if you forgot to specify the '-c' flag with qsub (or you're just feeling paranoid), the following command shows how you can checkpoint M running jobs:

% qhold -h n jobid_1 jobid_2 ... jobid_M

The '-h n' flags specify that the hold is not permanent, rather only temporary for checkpointing purposes. A successful manual checkpoint via 'qhold -h n' has a return code of 0. An unsuccessful manual checkpoint may yield a positive return code (e.g., 213) accompanied by the error message:

% qhold: Server returned error code for job jobid.sirius.qut.edu.au
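Because the return code signals success or failure, a small wrapper can report which jobs failed to checkpoint. The sketch below is one possible approach: the function name checkpoint_jobs and the QHOLD variable are hypothetical. It calls qhold once per job, rather than once for all M jobs as above, so that a failure can be tied to a specific job identifier.

```shell
# QHOLD names the command to run. It defaults to the real qhold and can be
# pointed at a stub when trying the function out away from the batch system.
QHOLD=${QHOLD:-qhold}

# Hypothetical wrapper: checkpoint each job in turn and report the outcome.
# Returns the number of jobs whose checkpoint failed (0 means all succeeded).
# Usage: checkpoint_jobs JOBID...
checkpoint_jobs() {
    failed=0
    for jobid in "$@"; do
        if "$QHOLD" -h n "$jobid"; then
            echo "checkpointed: $jobid"
        else
            # $? still holds the return code of the qhold call here.
            echo "FAILED (return code $?): $jobid"
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}
```

A non-zero result from checkpoint_jobs then indicates that at least one job needs a closer look, for instance with 'qstat -f jobid | grep comment'.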

Caveats:
(i) This method will not checkpoint a job that was submitted with a 'never checkpoint' option, e.g., qsub -c n ...
(ii) Checkpointing processes that use 3rd-party software, like Abaqus, is best handled by restart files.
(iii) Binaries spawned by 'mpirun' need a few extra flags for checkpointing to succeed; contact the HPC group.
(iv) After checkpointing with 'qhold -h n', your jobs may no longer be in a running state, for example when the queue is full or sirius is heavily loaded. Don't worry: they will eventually resume from the point at which they were checkpointed, and will not restart from the very beginning.

Matlab Usage
Submitting Matlab jobs to the PBS batch system:

1. Create a script file that starts Matlab 6.x without the Java virtual machine (and therefore without the GUI) and runs the file input.m:
     matlab -nojvm -r input
2. Create your .m input file containing the commands that you want Matlab to execute (with quit at the end*).
3. Submit your job to a batch queue using your script file.
    For example, to submit the script file 'test' to the 'gen_30min' queue, use the command:

    %  qsub -o test.out -e test.err -m e -q gen_30min test

*THE QUIT COMMAND MUST BE AT THE END OF YOUR M-FILE, OTHERWISE MATLAB WILL NOT TERMINATE AT THE END OF YOUR JOB!!
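
A missing quit is easy to overlook, so it may be worth checking the M-file before submitting. The sketch below is one way to do that; the function name ends_with_quit is hypothetical. It only looks at the last non-blank line of the file, accepts an optional trailing semicolon, and does not understand Matlab comments, so treat it as a convenience check rather than a guarantee.

```shell
# Hypothetical pre-submission check, not part of Matlab or PBS.
# Succeeds if the last non-blank line of the M-file is "quit" (optionally
# followed by a semicolon); trailing comments are not handled.
# Usage: ends_with_quit FILE.m
ends_with_quit() {
    # Drop blank lines, keep the last remaining line, then trim leading
    # whitespace and any trailing whitespace or semicolons.
    last=$(grep -v '^[[:space:]]*$' "$1" | tail -n 1 |
        sed 's/^[[:space:]]*//; s/[[:space:];]*$//')
    [ "$last" = "quit" ]
}
```

For example, 'ends_with_quit input.m && qsub -o test.out -e test.err -m e -q gen_30min test' would submit the job only if the check passes.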