BatchSystemAdmin
The HPCf Batch System
boreas.hpcf.upr.edu runs OpenPBS and Maui to manage CPU resources. This page explains how we set up OpenPBS and Maui to manage the queue. User-level documentation on the batch queue is available on our public web site:
http://www.hpcf.upr.edu/docs/batch/
PBS
What is PBS?
The Portable Batch System, PBS, is a batch job and computer system resource management package. It was developed with the intent to be conformant with the POSIX 1003.2d Batch Environment Standard. As such, it will accept batch jobs, a shell script and control attributes, preserve and protect the job until it is run, run the job, and deliver output back to the submitter.
Homepage and Download
The program, along with most of its documentation, is available only to registered users. The available documentation consists of:
- PBS Commands: html version of common man pages
- Using xpbs: Documentation for xpbs, a GUI to PBS commands
- System Administrator's Guide: provides information on building, installing and configuring OpenPBS.
Additional Useful Sites
http://www.msi.umn.edu/smp/info/jobs/ Supercomputing Institute at University of Minnesota
http://www.hpc.gatech.edu/starting/o2000.html#HEADING19 PBS information at the Georgia Tech HPC Group
http://www.chpc.utah.edu/policies/pbs.html PBS User Guide at the University of Utah's Center for High Performance Computing
Installation
PBS is distributed as source code. The distribution is installed under /usr/people/humberto/src/OpenPBS_2_3_12. The configure command used to compile the version currently running was:
# Necessary to be able to build tcl/tk for 64 bits
> setenv LDFLAGS "-L/usr/freeware/lib64"
# Set compilation configuration options
> ./configure --enable-docs --set-server-home=/usr/local/spool/pbs --enable-plock-daemons=7 --enable-syslog --with-scp --enable-nodemask --enable-array --set-cc --enable-gui --with-tcl=/usr/freeware --set-default-server=boreas.hpcf.upr.edu --set-cflags=-64
On manutara, I disabled the gui (no xpbs):
./configure \
    --enable-docs \
    --set-server-home=/usr/local/spool/pbs \
    --enable-plock-daemons=7 \
    --enable-syslog \
    --with-scp \
    --set-cc \
    --set-default-server=manutara.hpcf.upr.edu \
    --set-cflags=-64
Configuration
PBS consists of three daemons, each one with its own configuration. The PBS Server (pbs_server) is in charge of accepting submitted jobs and then, according to the scheduler's decisions, putting each job to run. The PBS Mom (pbs_mom) runs on every node controlled by PBS and is in charge of running the jobs, monitoring the node so it won't get overloaded, and informing the server of the status of jobs. Finally, the PBS scheduler (pbs_sched) is in charge of deciding what runs when and where (we replace pbs_sched with maui). The server presents the scheduler with a list of jobs and attributes, and based on a user-configurable policy the scheduler determines the order in which jobs will run.
The configuration files are stored in every daemon's private directory, under
$PBS_HOME/
Server Configuration
Unlike Mom and the Job Scheduler, the Job Server (pbs_server) is configured while it is running. Only the nodes file has to be created before the server is started. This file ($PBS_HOME/server_priv/nodes) lists all the nodes controlled by PBS. It's not necessary if the system has only one exclusive node, i.e., one that is used by one and only one job at a time. If there's only one node in the system and it is timeshared, it has to be declared in this file. For example, the nodes file for boreas consists of:
boreas.hpcf.upr.edu np=20
The np attribute declares the number of virtual processors that this node has.
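As a quick sanity check on a nodes file, the np values can be added up to see how many virtual processors PBS will hand out. This is a minimal sketch, not part of the installed setup: the function name total_np is made up for illustration, and it assumes that a node listed without np= counts as one processor.

```shell
# Sum the np= values declared in a PBS nodes file.
# Usage: total_np /path/to/nodes
total_np() {
    awk '
        /^[ \t]*(#|$)/ { next }     # skip comments and blank lines
        {
            np = 1                  # assumption: a node without np= counts as 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4) + 0
            total += np
        }
        END { print total + 0 }
    ' "$1"
}
```

For example, total_np $PBS_HOME/server_priv/nodes prints a single number that can be compared with the size of the batch cpuset.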
The complete syntax is:
After the nodes file is ready, the server is started and its configuration is done through the qmgr(8) command. qmgr provides the means to send commands to the server, either as arguments to qmgr (using the -c flag) or through its interface (by typing qmgr, a prompt ">" will come up waiting for commands).
The syntax for commands is (more details on the man page):
command server [names] [attr OP value[,attr OP value,...]]
command queue [names] [attr OP value[,attr OP value,...]]
command node [names] [attr OP value[,attr OP value,...]]
Where command is the command to perform on an object (objects being server, queue and node). Commands are:
- active -> sets the active objects, primarily to avoid repeating object names in commands
- create -> create a new object
- delete -> destroy an object
- set -> define or alter the attributes of the object
- unset -> clear attributes of the object
- list -> list the current attributes and values for the object
- print -> print all queue and server attributes in a format usable as input for qmgr
names is a list of one or more names of specific objects. The name list has the form:
[name][@server][,queue_name[@server]...]
The name is declared when the object is first created.
To record the changes made to the server configuration, use the command:
qmgr -c "print server" > $PBS_HOME/server_priv/server_conf
When the server is started, the configuration file may be reread using:
qmgr < $PBS_HOME/server_priv/server_conf
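The save and replay commands above can be wrapped so that a failed qmgr run never overwrites the last good copy of server_conf. The following is only a sketch: the function names are hypothetical, and the QMGR and PBS_HOME defaults are assumptions taken from the paths used elsewhere on this page.

```shell
# Sketch: save and replay the pbs_server configuration through qmgr.
# QMGR and PBS_HOME can be overridden from the environment.
QMGR=${QMGR:-/usr/local/bin/qmgr}
PBS_HOME=${PBS_HOME:-/usr/local/spool/pbs}

save_server_conf() {
    conf=$PBS_HOME/server_priv/server_conf
    tmp=$conf.new
    # Write to a temporary file first so that a failed qmgr run
    # does not truncate the previous copy.
    if "$QMGR" -c "print server" > "$tmp"; then
        mv "$tmp" "$conf"
    else
        rm -f "$tmp"
        return 1
    fi
}

load_server_conf() {
    # Feed the saved commands back to the running server.
    "$QMGR" < "$PBS_HOME/server_priv/server_conf"
}
```

Running save_server_conf after each qmgr session keeps the file in step with the live server; load_server_conf is the same replay step shown above.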
To start the server the first time:
# /usr/local/sbin/pbs_server -t create
The server and mom must run as root; maui will run as the batch user.
See below for example server configurations.
Mom Configuration (page 31 Admin Guide)
The configuration file is stored in $PBS_HOME/mom_priv/config. Mom doesn't require this file to exist: another file may be specified using -c, or, if the default values are acceptable, mom may be run without reading any config file. The current configuration is the one suggested on the maui site for integrating maui and PBS.
Current configuration options are:
$logevent 0x1FF
$clienthost boreas.hpcf.upr.edu
$restricted boreas.hpcf.upr.edu
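Because the three mom options above differ between machines only in the host name, the file can be produced by a small helper. This is an illustrative sketch for the single-host case, where the server, the scheduler and mom all run on the same machine; gen_mom_config is a made-up name, not a PBS tool.

```shell
# Print a pbs_mom config for a single-host setup.
# Usage: gen_mom_config boreas.hpcf.upr.edu > $PBS_HOME/mom_priv/config
gen_mom_config() {
    host=$1
    printf '$logevent 0x1FF\n'
    printf '$clienthost %s\n' "$host"
    printf '$restricted %s\n' "$host"
}
```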
Scheduler Configuration
We replace the PBS scheduler with maui. Maui documentation is extensive, and is available at http://www.supercluster.org/.
Maui runs under the batch userid, and is set up in /usr/people/batch/pbs/maui-3.0.5
Maui by default dedicates a machine to a job, so the most important configuration change is to specify NODEACCESSPOLICY = SHARED
The current configuration reads:
# maui.cfg 3.0
SERVERHOST boreas.hpcf.upr.edu
# primary admin must be first in list
ADMIN1 batch
ADMIN2 humberto william
RMTYPE[0] PBS
# parameters documented at http://supercluster.org/documentation/maui/parameters.html
# use the showconfig command to display current maui settings
RMPOLLINTERVAL 00:01:00

SERVERPORT 42559
SERVERMODE NORMAL

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

#DEFAULTDOMAIN

# Priority Weights
QUEUETIMEWEIGHT 2
XFACTORWEIGHT 1000
RESOURCEWEIGHT 800

# FairShare
FSPOLICY OFF
FSDEPTH 7
FSINTERVAL 86400
FSDECAY 0.80

# Policies
BACKFILLPOLICY ON
BACKFILLTYPE FIRSTFIT
ALLOCATIONPOLICY MINRESOURCE
RESERVATIONPOLICY CURRENTHIGHEST

MAXJOBPERUSERPOLICY OFF
MAXJOBPERUSERCOUNT 8

MAXPROCPERUSERPOLICY OFF
MAXPROCPERUSERCOUNT 256

MAXPROCSECONDPERUSERPOLICY OFF
MAXPROCSECONDPERUSERCOUNT 36864000

MAXJOBQUEUEDPERUSERPOLICY OFF
MAXJOBQUEUEDPERUSERCOUNT 2

MAXPROCPERGROUPPOLICY OFF
SMAXPROCPERGROUPCOUNT 128
MAXPROCPERGROUPCOUNT 160

NODEACCESSPOLICY SHARED
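Since a missing NODEACCESSPOLICY line would leave maui dedicating the whole machine to one job, it may be worth checking the file before restarting the scheduler. The sketch below reads a parameter out of a maui.cfg-style file. The function names are hypothetical, and it assumes one "NAME VALUE" pair per line, with the last occurrence winning.

```shell
# Print the value of a parameter from a maui.cfg-style file.
# Lines starting with '#' are ignored.
# Usage: maui_param maui.cfg NODEACCESSPOLICY
maui_param() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }
        $1 == name { value = $2; found = 1 }
        END { if (found) print value; else exit 1 }
    ' "$1"
}

# Succeed only if the file sets NODEACCESSPOLICY to SHARED.
check_shared_nodes() {
    [ "$(maui_param "$1" NODEACCESSPOLICY)" = "SHARED" ]
}
```

A call such as check_shared_nodes maui.cfg || echo "nodes are not shared" could run before maui is started.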
Starting and Stopping the Services
The daemons are started automatically at boot time by the /etc/init.d/pbs script. The mom and the scheduler don't need any special command line options (see the pbs_mom and pbs_sched man pages). The most relevant option for the server (pbs_server) is the -t flag, which specifies what happens to jobs that were running when the server shut down. The default value is warm, which means that all rerunnable jobs that were running when the server went down are requeued. The current setting is to start up hot, i.e., all jobs are requeued except non-rerunnable jobs that were executing, and any rerunnable job that was executing when the server went down is run immediately (see the man page for details and the /etc/init.d/pbs script). To stop the daemons, a simple TERM signal should work (that's what the script does).
-- Main.RicardoBaratto
Setting up the batch cpusets
See the IRIX Admin: Resource Administration book chapter on cpusets, available from: http://techpubs.sgi.com/library/dynaweb_bin/ebt-bin/0650/nph-infosrch.cgi/infosrchtpl/SGI_Admin/IA_Resource/@InfoSearch__BookTextView/4458
We have an open boot cpuset with 4 processors (see man boot_cpuset) defined in
/etc/config/boot_cpuset.config
MEMORY_LOCAL
MEMORY_MANDATORY
CPU 0
CPU 1
CPU 2
CPU 3
and enabled with chkconfig:
# chkconfig -f boot_cpuset on
An exclusive Batch cpuset is defined in
/etc/config/batch_cpuset.config
This is the batch_cpuset.config file:
EXCLUSIVE
MEMORY_LOCAL
MEMORY_MANDATORY
CPU 4
CPU 5
CPU 6
CPU 7
CPU 8
CPU 9
CPU 10
CPU 11
CPU 12
CPU 13
CPU 14
CPU 15
CPU 16
CPU 17
CPU 18
CPU 19
CPU 20
CPU 21
CPU 22
CPU 23
The permissions are set up so only root can execute on the cpuset (PBS runs as root).
# ls -l /etc/config/batch_cpuset.config
-rwx------ 1 root sys 268 Mar 13 09:36 batch_cpuset.config
The boot cpuset is configured at boot time, and init is attached to it. All system processes will therefore be restricted to the first 4 cpus in the machine.
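The two cpuset files differ only in the EXCLUSIVE flag and the CPU range, so they can be generated rather than typed by hand. The helper below is a hypothetical sketch (gen_cpuset_config is not an IRIX command) that prints one keyword per line for a contiguous range of CPUs.

```shell
# Print a cpuset configuration covering CPUs first..last.
# A third argument of "exclusive" adds the EXCLUSIVE flag used by the batch cpuset.
# Usage: gen_cpuset_config 4 23 exclusive > /etc/config/batch_cpuset.config
gen_cpuset_config() {
    first=$1
    last=$2
    if [ "${3:-}" = "exclusive" ]; then
        echo EXCLUSIVE
    fi
    echo MEMORY_LOCAL
    echo MEMORY_MANDATORY
    cpu=$first
    while [ "$cpu" -le "$last" ]; do
        echo "CPU $cpu"
        cpu=$((cpu + 1))
    done
}
```

With this, gen_cpuset_config 0 3 reproduces the boot cpuset file and gen_cpuset_config 4 23 exclusive reproduces the batch one.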
To use the other 20 cpus, we need to start the batch cpuset and attach the PBS system to the cpuset. This is done by the /etc/init.d/pbs scripts:
I've found that maui must also be attached to the batch cpuset for backfilling to work reliably.
#!/bin/sh
##
## Start/stop pbs services
##

IS_ON=/sbin/chkconfig

PBS_SERVER=pbs_server
# (from man pbs_server)
# When pbs server is restarted after a crash, all jobs are requeued
# except non-rerunnable jobs that were executing. Any rerunnable job
# which was executing when the server went down will be run immediately.
# After those jobs are restarted, then normal scheduling takes place for
# all remaining queued jobs. If a job cannot be restarted immediately
# the server will attempt to restart it periodically for up to 5 minutes.
# (check man page for details)
PBS_SERVER_OPTIONS="-t hot"
PBS_MOM=pbs_mom
PBS_SCHED=pbs_sched
PBS_QMGR=qmgr
PBS_SBINDIR=/usr/local/sbin
PBS_BINDIR=/usr/local/bin
CONFIG_FILE=/usr/local/spool/pbs/server_priv/server_conf

MAUIBINDIR=/usr/local/bin
MAUI=maui
MAUIUSER=batch
SU=/sbin/su

if $IS_ON verbose || test -t 1; then
    ECHO=echo
    VERBOSE=-v
else    # quiet startup and shutdown
    ECHO=:
    VERBOSE=
fi

case "$1" in
start)
    if $IS_ON pbs ; then
        $ECHO "PBS Services: "
        $ECHO "Starting Batch cpuset: "
        cpuset -q Batch -c -f /etc/config/batch_cpuset.config

        $ECHO -n "PBS mom: $PBS_MOM"
        cpuset -q Batch -A $PBS_SBINDIR/$PBS_MOM
        $ECHO "."

        # Use maui instead of pbs scheduler
        # $ECHO -n "PBS scheduler: $PBS_SCHED"
        # $PBS_SBINDIR/$PBS_SCHED
        # $ECHO "."

        $ECHO -n "PBS server: $PBS_SERVER"
        cpuset -q Batch -A $PBS_SBINDIR/$PBS_SERVER $PBS_SERVER_OPTIONS
        $PBS_BINDIR/$PBS_QMGR < $CONFIG_FILE > /dev/null 2>&1

        $ECHO -n "maui scheduler: $MAUI"
        cpuset -q Batch -A $SU $MAUIUSER -c "$MAUIBINDIR/$MAUI"
        $ECHO "."
    fi
    ;;
stop)
    killall $PBS_MOM
    killall $MAUI
    killall $PBS_SERVER
    cpuset -q Batch -m
    cpuset -q Batch -d
    ;;
*)
    $ECHO "usage: $0 {start|stop}"
esac
PBS was set up with no scheduler of its own; maui handles scheduling. I set up the queues and nodes per the OpenPBS docs:
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default resources_default.mem = 100mb
set queue default resources_default.ncpus = 1
set queue default resources_default.walltime = 00:00:00
set queue default resources_available.mem = 16gb
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = False
set server acl_host_enable = True
set server acl_hosts = *.hpcf.upr.edu
set server managers = batch@boreas.hpcf.upr.edu
set server operators = batch@boreas.hpcf.upr.edu
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
The node can be created as follows:
create node boreas.hpcf.upr.edu np=20
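When comparing machines it helps to pull a single server attribute out of saved qmgr output, for instance to confirm what the scheduling attribute is set to on a host where maui does the scheduling. A minimal sketch follows, assuming the "set server NAME = VALUE" layout shown above with single-word values; server_attr is a name made up for this example.

```shell
# Print the value of a server attribute from saved "print server" output.
# Exits non-zero if the attribute is not set in the file.
# Usage: server_attr $PBS_HOME/server_priv/server_conf scheduling
server_attr() {
    awk -v attr="$2" '
        $1 == "set" && $2 == "server" && $3 == attr && $4 == "=" {
            print $5
            found = 1
        }
        END { exit !found }
    ' "$1"
}
```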
-- Main.HumbertoOrtiz - 13 Mar 2001
PBS configuration on manutara/ehecatl
The PBS setup on manutara/ehecatl consists of two nodes: one on manutara (4 cpus) and the second on ehecatl (8 cpus). Manutara is the master node of the cluster and runs the server and the maui scheduler. The nodes file on manutara, /usr/local/spool/pbs/server_priv/nodes, is:
ehecatl.uprm.edu np=8
manutara.uprm.edu np=4
There are two cpusets on manutara: the batch cpuset for jobs assigned by PBS and the boot cpuset for system and interactive jobs.
The batch cpuset is defined in /etc/config/batch_cpuset.config and consists of 4 CPUs with the same speed, 250 MHz:
EXCLUSIVE
MEMORY_LOCAL
MEMORY_MANDATORY
CPU 2
CPU 3
CPU 4
CPU 5
The boot cpuset consists of the remaining 4 cpus and is defined in /etc/config/boot_cpuset.config:
MEMORY_LOCAL
MEMORY_MANDATORY
CPU 0
CPU 1
CPU 6
CPU 7
All three daemons run on manutara: pbs_server, maui (in place of pbs_sched), and pbs_mom. The /etc/init.d/pbs script on manutara therefore attaches all of them to the batch cpuset when it starts PBS and detaches them when it stops. It looks like the pbs script on boreas.
The ehecatl node doesn't have any cpusets, and the /etc/init.d/pbs script on ehecatl serves only to start or stop mom:
#!/bin/sh
##
## Start/stop pbs services
##

IS_ON=/sbin/chkconfig

# (from man pbs_server)
# (check man page for details)
PBS_MOM=pbs_mom
PBS_SBINDIR=/usr/local/sbin

if $IS_ON verbose || test -t 1; then
    ECHO=echo
    VERBOSE=-v
else    # quiet startup and shutdown
    ECHO=:
    VERBOSE=
fi

case "$1" in
start)
    if $IS_ON pbs ; then
        $ECHO "PBS Services: "
        $ECHO -n "PBS mom: $PBS_MOM"
        $PBS_SBINDIR/$PBS_MOM
        $ECHO "."
    fi
    ;;
stop)
    killall $PBS_MOM
    ;;
*)
    $ECHO "usage: $0 {start|stop}"
esac
Server configuration on manutara and ehecatl:
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default resources_default.mem = 150mb
set queue default resources_default.ncpus = 1
set queue default resources_default.walltime = 00:00:00
set queue default resources_available.mem = 2gb
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = False
set server max_user_run = 8
set server acl_host_enable = True
set server acl_hosts = *.uprm.edu
set server managers = batch@manutara.uprm.edu
set server operators = batch@manutara.uprm.edu
set server default_queue = default
set server log_events = 511
set server mail_from = root@hpcf.upr.edu
set server query_other_jobs = True
set server scheduler_iteration = 600
Mom configuration on manutara:
$logevent 0x1ff
$clienthost manutara.uprm.edu
$restricted *.uprm.edu
$usecp ehecatl.uprm.edu:/usr/people /usr/people
$usecp ehecatl.uprm.edu:/disk4 /disk4
Mom configuration on ehecatl:
$logevent 0x1ff
$clienthost manutara.uprm.edu
$restricted *.uprm.edu
$usecp manutara.uprm.edu:/usr/people /usr/people
$usecp manutara.uprm.edu:/disk4 /disk4
NFS configuration on manutara/ehecatl.
Manutara is the master node and exports two NFS filesystems to its client, ehecatl: /usr/people and /disk4. Both of them contain users' home directories. The exported filesystems are described in the /etc/exports file on manutara:
/usr/people -rw,access=ehecatl.uprm.edu
/disk4 -rw,access=ehecatl.uprm.edu
After editing the /etc/exports file, execute the exportfs command:
exportfs -a
The /etc/fstab file on a client has to list the filesystems exported by the master node.
This is the /etc/fstab file on ehecatl:
/dev/root / xfs rw,raw=/dev/rroot 0 0
manutara.uprm.edu:/usr/people /usr/people nfs rw 0 1
manutara.uprm.edu:/disk4 /disk4 nfs rw 0 1
After editing the /etc/fstab file, execute the mount command:
mount -t nfs -a
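Before running the mount command on a client, it can be useful to confirm that each filesystem exported by the master really has an nfs line in the client's fstab. The function below is a sketch only (fstab_has_nfs is a hypothetical helper), assuming the usual whitespace-separated fstab fields: device, mount point, type.

```shell
# Succeed if the fstab-style file $1 has an nfs entry for the remote
# filesystem $2 (host:/path) mounted at the directory $3.
# Usage: fstab_has_nfs /etc/fstab manutara.uprm.edu:/disk4 /disk4
fstab_has_nfs() {
    awk -v dev="$2" -v dir="$3" '
        /^[ \t]*#/ { next }
        $1 == dev && $2 == dir && $3 == "nfs" { found = 1 }
        END { exit !found }
    ' "$1"
}
```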
The RPC mountd services have to be uncommented in the /etc/inetd.conf files on both manutara and ehecatl:
# RPC-based services
# These use the portmapper instead of /etc/services.
#
# we only support mountd versions 1 and 3
mountd/1,3 stream rpc/tcp wait/lc root /usr/etc/rpc.mountd mountd
mountd/1,3 dgram rpc/udp wait/lc root /usr/etc/rpc.mountd mountd
sgi_mountd/1 stream rpc/tcp wait/lc root /usr/etc/rpc.mountd mountd
sgi_mountd/1 dgram rpc/udp wait/lc root /usr/etc/rpc.mountd mountd
#rstatd/1-3 dgram rpc/udp wait root /usr/etc/rpc.rstatd rstatd
Be sure to make inetd reread these files after changing them, with the following command:
/etc/killall -HUP inetd
NIS configuration on manutara/ehecatl.
yp, ypmaster, and ypserv must be enabled on manutara via the chkconfig command:
chkconfig -f yp on
chkconfig -f ypmaster on
chkconfig -f ypserv on
yp must be enabled on ehecatl via the chkconfig command:
chkconfig -f yp on
Stop and start /etc/init.d/network:
/etc/init.d/network stop
/etc/init.d/network start
-- Main.ElenaLeyderman - 06 Sep 2001
OpenPBS and maui on cafeina
I set up OpenPBS and maui on cafeina, more or less following the instructions above, with some minor and some major changes. I'm not running any cpusets on cafeina; I hope all our users will be well behaved and use the honor system for batch jobs. If not, I'll kill them (or their jobs anyway).
I installed versions OpenPBS_2_3_16.tar.gz and maui-3.0.7p8.tar.gz.
OpenPBS configuration flags were:
./configure --enable-docs --set-server-home=/usr/local/spool/pbs --enable-plock-daemons=7 --enable-syslog --with-scp --enable-nodemask --set-cc --enable-gui --with-tcl=/usr/freeware --set-default-server=boreas.hpcf.upr.edu --set-cflags=-64
maui runs as the batch user, and is installed in
/usr/people/batch/src/maui-3.0.7/
I created a nodes file for PBS in
/usr/local/spool/pbs/server_priv/nodes
with the line
cafeina.hpcf.upr.edu np=32
But it didn't seem to work. I did create the node after I ran qmgr with the line:
create node cafeina.hpcf.upr.edu np=32
and there is a server_conf file in the server_priv directory with the correct node definition, so the name of the file may have changed. Until I created the node in qmgr, maui and pbs were running, but maui did not detect any processors available.
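If the nodes file is ignored as described here, the same definitions can be fed to qmgr instead. The following sketch converts nodes-file lines into create node commands. The function name is made up, and it assumes plain host names with an optional np= attribute and no other node properties.

```shell
# Turn each line of a PBS nodes file into a qmgr "create node" command.
# Usage: nodes_to_qmgr /usr/local/spool/pbs/server_priv/nodes | qmgr
nodes_to_qmgr() {
    awk '
        /^[ \t]*(#|$)/ { next }     # skip comments and blank lines
        {
            cmd = "create node " $1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) cmd = cmd " " $i
            print cmd
        }
    ' "$1"
}
```

Running the function without the pipe first shows exactly what would be sent to the server.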
The mom configuration file reads:
$logevent 0x1FF
$clienthost cafeina.hpcf.upr.edu
$restricted cafeina.hpcf.upr.edu
The queues were created as follows:
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default resources_default.mem = 100mb
set queue default resources_default.ncpus = 1
set queue default resources_default.walltime = 00:00:00
set queue default resources_available.mem = 24gb
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server acl_hosts = *.hpcf.upr.edu
set server managers = batch@cafeina.hpcf.upr.edu
set server operators = batch@cafeina.hpcf.upr.edu
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
Here is the /etc/init.d/pbs script (no cpusets):
#!/bin/sh
##
## Start/stop pbs services
##

IS_ON=/sbin/chkconfig

PBS_SERVER=pbs_server
PBS_SERVER_OPTIONS="-t hot"
PBS_MOM=pbs_mom
PBS_SBINDIR=/usr/local/sbin

MAUIBINDIR=/usr/local/bin
MAUI=maui
MAUIUSER=batch
SU=/sbin/su

if $IS_ON verbose || test -t 1; then
    ECHO=echo
    VERBOSE=-v
else    # quiet startup and shutdown
    ECHO=:
    VERBOSE=
fi

case "$1" in
start)
    if $IS_ON pbs ; then
        $ECHO "PBS Services: "
        $ECHO -n "PBS mom: $PBS_MOM"
        $PBS_SBINDIR/$PBS_MOM
        $ECHO "."

        $ECHO -n "PBS server: $PBS_SERVER"
        $PBS_SBINDIR/$PBS_SERVER $PBS_SERVER_OPTIONS

        $ECHO -n "maui scheduler: $MAUI"
        $SU $MAUIUSER -c "$MAUIBINDIR/$MAUI"
        $ECHO "."
    fi
    ;;
stop)
    killall $PBS_MOM
    killall $MAUI
    killall $PBS_SERVER
    ;;
*)
    $ECHO "usage: $0 {start|stop}"
esac
Remember to run
chkconfig -f pbs on
-- Main.HumbertoOrtiz - 04 Sep 2002