Batch Processing
To get the best performance for all users of the system, a balance must be struck between running interactive jobs and scheduling jobs to run in batch mode. Many applications, including editors and compilers, must be used interactively. However, once applications (including Matlab scripts) have been developed, we strongly encourage users to run their jobs through the batch system. Why? The main reason is that, for an average user, checkpointing (i.e., taking a snapshot of a process so that it can pick up where it left off after a crash) is quite tricky, whereas PBS handles checkpointing easily. In fact, by default, PBS checkpoints all running jobs when the machine is brought down gently (unless otherwise specified; see the g98 example below).

The queues are grouped into three categories: general, special and application specific. Each category is described below together with the resource characteristics of each queue.

General Queues

There are four queues in the general purpose category. They are targeted towards jobs with small to average memory and CPU time requirements, and they have a relatively high priority to ensure a quick turnover of jobs.
Special Queues

The special queues are designed to provide larger allocations of memory and/or CPU time with some additional load scheduling constraints. The gen_nolimit and gen_largemem queues place no limit on the amount of CPU time a job may use. However, jobs in the gen_offpeak queue are restricted to run from 7:30pm until 7:30am Monday to Thursday, and from 7:30pm Friday until 7:30am Monday; jobs that spill into peak periods will be checkpointed and then held until the next offpeak period. Note that the gen_offpeak queue treats Brisbane's public holidays (e.g., Labour Day, Exhibition holiday) as normal days.
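The gen_offpeak window described above can be expressed as a small shell test. This is only a sketch of the stated schedule, not the scheduler's own logic: the function name is_offpeak and its two arguments are assumptions made for illustration, and the boundary handling (7:30am counts as peak, 7:30pm counts as offpeak) is one reasonable reading of the text. No holiday calendar is needed, since public holidays are treated as normal days.

```shell
# is_offpeak DAY HHMM  (hypothetical helper, not part of PBS)
#   DAY:  ISO day of week, 1 = Monday ... 7 = Sunday (as printed by `date +%u`)
#   HHMM: 24-hour time such as 0730 or 1930 (as printed by `date +%H%M`)
# Returns 0 (true) if the moment falls inside an offpeak period, 1 otherwise.
is_offpeak() {
    day=$1
    # Strip leading zeros so the value is always read as a decimal number.
    hhmm=$(echo "$2" | sed 's/^0*//')
    hhmm=${hhmm:-0}

    # Friday 7:30pm to Monday 7:30am: Saturday and Sunday are offpeak all day.
    if [ "$day" -ge 6 ]; then
        return 0
    fi

    # Monday to Friday: offpeak before 7:30am and from 7:30pm onward.
    if [ "$hhmm" -lt 730 ] || [ "$hhmm" -ge 1930 ]; then
        return 0
    fi

    return 1
}
```

For example, `is_offpeak "$(date +%u)" "$(date +%H%M)" && echo offpeak` reports whether the current moment is offpeak under these assumptions.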
* PBS checkpoints gen_offpeak jobs that spill into peak periods. Once checkpointed, each job is held and its resources (memory, CPUs, any license tokens) are released. It is therefore recommended that processes run in this queue be "home-grown" C, C++ or Fortran programs that are straightforward to checkpoint. Jobs submitted with the '-c n' flag run a very real risk of being REMOVED from the queue by a system administrator.

Application Specific Queues

Abaqus

The following queues are used to run Abaqus finite element analysis jobs only. The number of concurrent Abaqus jobs is controlled by the Abaqus licensing software through the use of network tokens. Note that Abaqus 5.8 jobs relying on the site version of abaqus.env are automatically run over 6 processors. Currently, Abaqus 6.2 jobs run over one processor only; contact HPC staff to learn how to run in parallel.
Upon completion of an Abaqus job, our PBS epilogue script does a number of things to ensure disk and licensing resources are not abused. Firstly, job temporary files with the following extensions are removed: .eig, .fct, .lnc, .opr, .scr, .sdb, .sol, .023. Secondly, job files with the following extensions are compressed using 'gzip': .fil, .dat, .odb, .db, .jou. (These files can be uncompressed later using 'gunzip'.) Thirdly, the job's temporary directory residing on /scratch/abaqustmp is removed. Finally, license tokens are returned to the Abaqus license manager.

Gaussian 98

Due to the extensive run-times associated with Gaussian, all jobs in this queue are run with a nice value of 10.
Batch System Commands

The following commands can be used to submit batch jobs and monitor their progress. For further reading material, download the PBS user guide.
You can view a modified version of 'qstat' (note: accessible only if you are working inside the QUT domain).

Examples

To submit a (non-Abaqus) job to a PBS queue, a submit 'script' must first be created. Say user bloggs wants to run her program 'newton' out of her home directory, then 'laguerre' out of her work directory. The submit script (let's call it 'myjobs') would look something like:

cd /home/bloggs

To submit the script to the gen_4hour queue:

% qsub -q gen_4hour -m abe -o output.dat -e error.dat -c c myjobs

This command line submits the script 'myjobs' to the gen_4hour batch queue. An email message will be sent at the beginning (b) and end (e) of her job, or if it aborts (a). The file 'output.dat' (-o) will contain anything that usually gets sent to the screen, except errors, which will go into 'error.dat' (-e). Checkpointing (-c) will occur at the default interval (c).

Now, say user bloggs wanted to run a larger job 'mylargejob' in the gen_nolimit queue. Also, say she wanted to checkpoint once a day instead of once every 12 hours and was not concerned with receiving email messages:

% qsub -q gen_nolimit -o output.dat -e error.dat -c c=1440 mylargejob

Next, say bloggs wanted to run a Gaussian 98 job in the g98 queue but did not want to checkpoint (because g98 does this already) and did not want the job to be restarted (-r) if the machine rebooted:

% qsub -r n -q g98 -o output.dat -e error.dat -c n my_g98job

Finally, imagine bloggs wanted to run a 40-hour job in the offpeak queue using the default checkpoint interval:

% qsub -q gen_offpeak -o output.dat -e error.dat -c c my_offpeakjob

If the job was submitted on Thursday afternoon, we would expect the following: (i) the job ran for 12 hours on Thursday evening and was checkpointed twice during the process, (ii) the job was checkpointed and held at 7:30am Friday, (iii) the job was released at 7:30pm Friday and continued running over the weekend.
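The 'myjobs' script in the first example above is shown only in part. As a rough sketch of what a complete two-step script of that kind might contain, the helper below prints one to standard output. Everything beyond the first 'cd' line is an assumption: make_submit_script is a hypothetical helper name, the work directory is a placeholder argument because its real path is not given here, and invoking the programs as './newton' and './laguerre' assumes each binary sits in the directory being entered.

```shell
# Hypothetical helper: prints a two-step submit script to standard output.
#   $1: directory holding 'newton'   (e.g. /home/bloggs, as in the example)
#   $2: directory holding 'laguerre' (placeholder; the real path is not given)
make_submit_script() {
    # The unquoted here-document expands $1 and $2 to the function arguments.
    cat <<EOF
cd $1
./newton
cd $2
./laguerre
EOF
}
```

Redirecting the output to a file, for instance `make_submit_script /home/bloggs /path/to/work > myjobs`, gives a script that could then be passed to qsub as in the examples above.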
Checkpointing manually

Ideally, through appropriate job submission, there should be no need to checkpoint manually. However, if you forgot to specify the '-c' flag with qsub (or you're just feeling paranoid), the following command checkpoints M running jobs:

% qhold -h n jobid_1 jobid_2 ... jobid_M

The '-h n' flag specifies that the hold is not permanent, but only temporary for checkpointing purposes. A successful manual checkpoint via 'qhold -h n' has a return code of 0. An unsuccessful manual checkpoint may yield a positive return code (e.g., 213) accompanied by the error message:

% qhold: Server returned error code for job jobid.sirius.qut.edu.au

Caveats:
(i) This method will not checkpoint a job that was submitted with a 'never checkpoint' option, e.g., qsub -c n ...
(ii) Checkpointing processes that use 3rd-party software, like Abaqus, is best handled by restart files.
(iii) Binaries spawned by 'mpirun' need a few extra flags for checkpointing to succeed; contact the HPC group.
(iv) After checkpointing with 'qhold -h n', your jobs may no longer be in a running state; this can happen when the queue is full or sirius is heavily loaded. Don't worry, they will eventually resume from the point at which they were checkpointed and will not restart from the very beginning.

Matlab Usage