Slurm
In this section, I will discuss Slurm and its ecosystem, including how it interacts with MPI, Python, Mathematica and so on.
In a word, Slurm is very powerful, but frankly its documentation is poorly presented and its community is neither large nor very active.
MPI
PMI
PMI is the key interface for understanding how Slurm works with MPI. Both Slurm and each MPI implementation ship a lower-level layer that wires up processes and resources, called PMI. For srun, Slurm's PMI is used; it can be selected with --mpi=<pmi>, and all supported PMI plugins can be listed with srun --mpi=list.
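A minimal sketch of how this looks in practice; the PMI type pmi2 and the resource numbers are assumptions that depend on the local Slurm build:

```bash
# List the PMI plugins supported by this Slurm installation
srun --mpi=list

# Launch an MPI program through Slurm with an explicit PMI type
# (assuming pmi2 appears in the list above)
srun --mpi=pmi2 -N 2 -n 8 ./mpi_program
```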
Note that a program compiled with mpicxx depends on the PMI implementation of the corresponding MPI library. The PMI used for compiling (provided by the MPI library) and for running (srun provided by Slurm, or mpirun provided by the MPI library) must be the same. If not, the common symptom is that every process considers itself a standalone rank 0 process.
To make them consistent, there are two approaches: compile Slurm with broader PMI support, or compile the MPI implementation with Slurm's PMI support. Note that the PMI support of an apt-installed Slurm is very limited, and the PMI header files are missing from the standard packages, so compiling MPI implementations against Slurm's PMI is also subtle.
Therefore, the best practice with minimal maintenance effort is to always use mpirun inside an sbatch script and avoid srun. An sbatch script is still the highly recommended way to submit tasks, rather than calling mpirun -host <hostname,list> directly. Firstly, sbatch jobs are under the control and accounting of Slurm, and the task keeps running after you log out. Secondly, the environment variables of the master node are broadcast to the compute nodes before the task starts, which is very handy. Besides, for a bare mpirun to reach the compute nodes, ORTE needs an SSH connection, which is blocked by Slurm's PAM plugin, so the only way to run jobs on the compute nodes is via the Slurm interface anyway.
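A minimal sketch of such an sbatch script; the program name and resource numbers are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=mpi-demo
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=01:00:00

# Launch with the MPI-provided mpirun instead of srun, so the PMI used at
# compile time and at run time come from the same MPI installation.
mpirun ./mpi_program
```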
MPI and OpenMP hybrid
Remember the -fopenmp flag for mpicc (use -openmp for the Intel compiler and -mp for the PGI compiler); the rest is similar to the plain MPI workflow. See here and here for script demos. Also see a nice C example explicitly using MPI and pthreads here. Remember to adjust the environment variables OMP_NUM_THREADS and KMP_AFFINITY for better performance; see the Intel doc for affinity configuration.
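A hybrid sketch along these lines; the node/task/thread counts and the program name are only illustrative:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8

# One MPI rank per task, each spawning OMP_NUM_THREADS OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun ./hybrid_program   # built e.g. with: mpicc -fopenmp hybrid.c -o hybrid_program
```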
Note that running mpiexec directly as root is not allowed.
cron-like jobs
See here, and the example sbatch script below.
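A minimal sketch of one common cron-like pattern, a self-resubmitting script; the script path, interval and payload are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=cronjob
#SBATCH --time=00:10:00

# Re-queue this script to start again one day later, before running the payload
sbatch --begin=now+1day /path/to/cronjob.sh

# ... the periodic work goes here ...
```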
parallel job array submission
See here and here, and here in Mandarin.
For job arrays, some important notes. Use #SBATCH --array=1-5 to name the jobs jobid_1 and so on. In the sbatch script, use the environment variable ${SLURM_ARRAY_TASK_ID} to get the index of the specific task. For -N or -n, just give the values needed by a single task. In #SBATCH lines, %A expands to the job ID and %a to the array index. See here for more use-case demos.
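A small sketch of a job-array script; the input-file naming convention is made up for illustration:

```bash
#!/bin/bash
#SBATCH --array=1-5
#SBATCH --ntasks=1
#SBATCH --output=slurm-%A_%a.out   # %A = job ID, %a = array index

# Each array task handles its own input file, selected by the array index
./process_data input_${SLURM_ARRAY_TASK_ID}.dat
```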
allocate a compute node interactively
See this blog for details. In short, srun -N 1 -n 1 -w node1 --pty bash -i is the Slurm way of ssh-ing into a node. Alternatively, salloc -n2 -N1 -t 1:00:00 allocates a node with the given time and resources (2 CPU cores); then ssh to it (-X is supported in this case).
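A possible interactive workflow under these assumptions (node name and resource numbers are placeholders):

```bash
# Option 1: a Slurm-native interactive shell on a specific node
srun -N 1 -n 1 -w node1 --pty bash -i

# Option 2: reserve resources first, then ssh into the allocated node
salloc -n 2 -N 1 -t 1:00:00   # opens a shell holding the allocation on the login node
squeue -u "$USER"             # find out which node was granted
ssh -X node1                  # X11 forwarding is supported in this case
```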
X11 forwarding option for srun: doc. I still have no clear idea how -n relates to the real number of CPU cores the job can use (it seems to be one per thread).
Management and Accounting
Use the Weight option on the NodeName line in slurm.conf to change the priority of assigned nodes; see this post. Also, the CPU cores and sockets must be given explicitly in slurm.conf; Slurm cannot auto-detect them.
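A sketch of what such node lines might look like; the topology numbers and weights are illustrative, not taken from any real cluster:

```bash
# Hypothetical slurm.conf node definitions; lower-Weight nodes are preferred:
#   NodeName=c1 CPUs=28 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 Weight=10
#   NodeName=c2 CPUs=28 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 Weight=50
# Reload the configuration after editing (some node parameters may still need a daemon restart)
scontrol reconfigure
```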
Details and roles of the sacctmgr family of commands: ref (better than the official doc).
billing
In slurm.conf, set AccountingStorageTRES=gres/gpu,gres/gpu:tesla, and on the partition line set TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0". If TRESBillingWeights is not defined, the job is billed against the total number of allocated CPUs.
shares
Share in the user entries of sacctmgr; shares from the user's perspective: doc.
GrpTRESMins in sacctmgr limits a user's total running time (in TRES-minutes); with the multifactor priority plugin, the raw usage can be checked by sshare.
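A sketch of how such a cap and the usage check might look; the user name and the limit value are examples only:

```bash
# Cap the total CPU time of one user's association at 10000 CPU-minutes
sacctmgr modify user where user=alice set GrpTRESMins=cpu=10000

# With the multifactor priority plugin enabled, inspect raw usage per association
sshare -a -l
```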
partition
Nodes may be in more than one partition.
scontrol show partition
sacctmgr add user name partition=gpu, or in slurm.conf: PartitionName=MONTHLY1 AllowAccounts=diamond Nodes=compute-0-0. More conf parameters include AllocNodes (no idea how it differs from Nodes), Default=YES, Hidden, State, PreemptMode, TRESBillingWeights.
A blank list of nodes (i.e. "Nodes= ") can be used if one wants a partition to exist but have no resources (possibly on a temporary basis).
A partition can be specified from the user's perspective via an environment variable, a batch script option, or a command line option, as sketched below.
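The three ways side by side; the partition name gpu and the script name are placeholders:

```bash
export SBATCH_PARTITION=gpu      # environment variable read by sbatch
#SBATCH --partition=gpu          # batch script option (inside the job script)
sbatch --partition=gpu job.sh    # command line option
```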
Jobs can be preempted between queues. This also needs the global setting PreemptType=preempt/partition_prio.
resource management
sacctmgr show tres
In slurm.conf, GresTypes is a comma-delimited list of generic resources to be managed (e.g. GresTypes=gpu,mps). These resources may have an associated GRES plugin of the same name providing additional functionality. No generic resources are managed by default. Ensure this parameter is consistent across all nodes in the cluster for proper operation. The slurmctld daemon must be restarted for changes to this parameter to become effective.
There is also Gres on the node lines of slurm.conf, a comma-delimited list of generic resource specifications for a node. The format is "<name>[:<type>][:no_consume]:<number>[K|M|G]". The first field is the resource name, which matches the GresTypes configuration parameter. The optional type field may be used to identify a model of that generic resource. A generic resource can also be specified as non-consumable (i.e. multiple jobs can use the same generic resource) with the optional field ":no_consume". The final field must specify the generic resource count. A suffix of "K", "M", "G", "T" or "P" may be used to multiply the number by 1024, 1048576, 1073741824, etc. respectively (e.g. "Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G"). By default a node has no generic resources, and the maximum count is that of an unsigned 64-bit integer.
More in the gres.conf man page. The gres.conf file may be omitted completely if the configuration information in slurm.conf fully describes all GRES. The file will always be located in the same directory as slurm.conf.
Jobs will not be allocated any generic resources unless specifically requested at job submit time, using options such as:
--gres: generic resources required per node (the recommended way); the general syntax for job submission is --gres=gpu:kepler:2
--gpus: GPUs required per job
Besides, MPS is a finer-grained way to schedule GPU jobs, where one GPU can be divided into smaller shares.
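For example (the GRES type kepler and the script name are assumptions that depend on the local gres.conf; --gpus needs a reasonably recent Slurm):

```bash
sbatch --gres=gpu:kepler:2 job.sh   # two "kepler" GPUs per node
sbatch --gpus=1 job.sh              # one GPU for the whole job
```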
QOS
QOS settings via sacctmgr: ref.
The user hierarchy is cluster-account-user; this triple is called an association.
Set AccountingStorageEnforce=limits,qos in slurm.conf; this line is essential for QOS to take effect. limits also implies the associations option, meaning users who have not been added to Slurm's accounting cannot use Slurm.
Note: A user's account can not be changed directly. A new association needs to be created for the user with the new account. Then the association with the old account can be deleted.
Note the username is the same for the OS and Slurm.
GrpTRESMins=cpu=10,mem=20 would make 2 different limits 1 for 10 cpu minutes and 1 for 20 MB memory minutes. This is the case for each limit that deals with TRES. To remove the limit -1 is used i.e. GrpTRESMins=cpu=-1 would remove only the cpu TRES limit. When dealing with Memory as a TRES all limits are in MB.
Node limits in QOS are weird: the system seems to count each task as a node, so the node count increases even when several tasks share the same node. Instead, try limiting by CPU. Is "cpu" counted in cores or threads? Experiments suggest that cpu in the QOS context is counted in cores, so cpu=28 may allow up to 56 threads. It is worth noting, however, that --cpus-per-task appears to be given in threads (not sure; there is conflicting evidence). Slurm seems to have a mixed notion of "CPU".
QOS limits are more flexible to fine-tune than accounts, which cannot be changed without deleting the user. Alternatively, sacctmgr modify user where user=example set defaultaccount=groupb at least changes the default account.
sacctmgr add coord names=blah adds coordinators, i.e. users who can modify other users' QOS and so on.
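A sketch of a typical QOS workflow; the QOS name, the limit and the user name are examples only:

```bash
# Create a QOS, give it a group CPU cap, and attach it to a user
sacctmgr add qos smallcpu
sacctmgr modify qos smallcpu set GrpTRES=cpu=28
sacctmgr modify user where user=alice set qos+=smallcpu
```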
PAM module
merged into ansible workflow
sudo apt install libpam-slurm
Edit /etc/pam.d/sshd and add the line account required pam_slurm.so. The order of the plugins is very important: pam_slurm_adopt.so should be the last PAM module in the account stack. Add the line account required pam_access.so after it, then edit /etc/security/access.conf and add the following.
Issue: ssh always seems to fail even when there is a task running on the corresponding node. This is not a big problem, since the Slurm version of ssh still works. Solved: the issue was due to a mismatch between pam_slurm and pam_slurm_adopt.
scontrol
scontrol show node c1, to see how many cores are really allocated.
scontrol show job 237, to see the status of a given job.
scontrol show assoc, to check all details of users, accounts and QOS.
sacct -a, to see all users' jobs, past and current.
Bringing a node back into service from the down state: restarting slurmctld or slurmd won't work, see this post. Use scontrol update nodename=c2 state=IDLE, or scontrol update nodename=node10 state=RESUME; use RESUME if there are jobs running on the node (so).
scontrol ping checks the status of the master and backup nodes for slurmctld.
Usage preparation for subway
Use the job name as an identifier; in sbatch use #SBATCH --job-name uuid. The job can be canceled by scancel -n uuid. Check job history with sacct --name=ab345-98iy6-iu7-10299 --format=User,JobID,Jobname%30,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist. However, by default scontrol can only access that information for about five minutes after the job finishes, after which it is purged from memory. In other words, scontrol can only be used to view info of currently running jobs.
misc
There is no conf option to randomize node assignment: post; somewhat hard to believe.
sinfo state for nodes: doc
power save modes
sdiag to check scheduling-relevant info and RPC calls
sprio to check job priority factors and so on
sattach jobid to directly see stdout and stderr of the running job
sshare
More on sacct usage to check history job status: post
strigger: event trigger, e.g. strigger --set --node --down --program=/usr/sbin/slurm_admin_notify
sacctmgr show stats for RPC call statistics of sacctmgr
scontrol reconfigure lets slurmctld reload the conf, though some parameters take effect only after a restart
scontrol show config
An asterisk * in the sinfo status indicates the node is unreachable.
In some systems, squeue is aliased as alias squeue="squeue -u <user>", so you cannot directly view other users' jobs. But you can unalias squeue, after which squeue shows all jobs of all users.
singularity plugin readme
The srun -n parameter seems to work well: Mathematica LaunchKernels and NumPy with MKL multithreading both cannot exceed the limit set by -n.
A job's expected start time can be seen using the squeue --start command.
PrivateData controls whether some info is accessible to normal users.
PrivateData
This controls what type of information is hidden from regular users. By default, all information is visible to all users. User SlurmUser and root can always view all information. Multiple values may be specified with a comma separator. Acceptable values include:
accounts, jobs, nodes, users and so on.
Test legal nodelist syntax
sinfo -R shows the reasons for node status; a node stuck in completing for a long time is an indication that something is wrong, most probably that the master and the node somehow got disconnected.
If SelectType=select/linear is configured, all resources on the selected nodes will be allocated to the job/step. If SelectType=select/cons_res is configured, individual sockets, cores and threads may be allocated from the selected nodes.
The sbatch --mem limit seems not to work: the task can easily go over the memory limit without any problem. The PropagateResourceLimits or PropagateResourceLimitsExcept parameters in slurm.conf avoid propagation of the specified limits; configure these two parameters unless you want your login-shell ulimit to be effective on the compute nodes.
Mailprog example: script
Resource reservation by scontrol: doc
More references
Network topology in slurm: doc
Scheduling configuration: doc
Job preemption guide: doc
High throughput guide, i.e. fine-tuning Slurm for bursts of short jobs: doc
Large system fine tuning: doc
Gres: doc
Shuguang doc on slurm: doc
Tianhe doc on slurm in admin's perspective: doc
Heterogeneous job submission: doc (including block, plane and cyclic allocation)
Parameters controlling multi-core and multi-thread allocation: doc
Job status and info to elasticsearch: doc
Slurm configuration explanation (way better than the official doc): doc
sbatch script examples: more to explore
Elastic scaling on cloud
Slurm doc on cloud bursting; it seems there is already some support for a hybrid HPC scheme.
post: slurmctld checks the boot time to make sure resume works; use -b for slurmd when testing the elastic feature of Slurm locally.
The NodeName should always be consistent with the local hostname of a newly provisioned machine.
AWS
The AWS approach to automatically scale compute nodes up and down with Slurm plugins: blog.
AWS Slurm plugin repo, quoted as:
"burst from an on-prem SLURM headnode that is managing an on-prem compute cluster. You need to ensure that you can resolve AWS private addresses either through an AWS DirectConnect and/or VPN layer."
So the hybrid scheme is indeed available now.
Working with other tools
Python
Jupyter
Directly accessing a Jupyter server on the compute nodes: doc
SSH forwarding across four machines: post. A solution for Jupyter on the compute nodes.
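A simplified sketch with only two hops (login node and compute node); hostnames and the port are placeholders:

```bash
# 1) On the local machine: forward local port 8888 to the login node
ssh -L 8888:localhost:8888 user@login-node
# 2) On the login node: forward its port 8888 on to the compute node running Jupyter
ssh -L 8888:localhost:8888 user@c1
# Then open http://localhost:8888 in the local browser.
```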
mpi4py
mpirun -n 2 python3 pympi.py; see here for a working Slurm example.
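Wrapped in an sbatch script, following the mpirun-inside-sbatch practice above (the file name and resources are placeholders):

```bash
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --time=00:10:00

# Run the mpi4py script with two MPI ranks via the MPI-provided launcher
mpirun python3 pympi.py
```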
ipyparallel
To be studied in detail.
Combining Slurm, ipyparallel and Celery.
Mathematica
Remote kernels with the help of Tunnel
See the manual of the tunnel tool, or the more general reference in the Mathematica doc ParallelTools/tutorial/ConnectionMethods.
Some recap: copy tunnel.m into /opt/mathematica/11.0.1/Kernels/Packages, add both tunnel.sh and tunnel_sub.sh into ~/.Mathematica/FrontEnd, and chmod +x the two scripts. (It may be possible to configure the tunnel scripts in a system-wide context?)
tunnel.sh is for the remote controller kernel, launched from the GUI; tunnel_sub.sh is for the compute kernels, launched indirectly by the controller kernel.
Use the local GUI to call remote Mathematica kernels on the master node. Of course you need to configure passwordless SSH login for the user (or add the password as above). The two files in the launch command are tunnel.sh and the math kernel.
From the master node, further launch more remote kernels on the compute nodes. You can achieve a 224-thread ParallelTable on our cluster!
The standard Mathematica parallel and remote kernel workflow is settled as above.
References:
https://rcc.uchicago.edu/docs/software/environments/mathematica/index.html Mathematica usage in an HPC manual
https://taoofmac.com/space/blog/2016/08/10/0830 Mathematica on a Raspberry Pi cluster
The Slurm sbatch script to utilize a Spark cluster: demo script, so.
Spark-enabled Jupyter: doc, blog.
database
Using a database instance in HPC: doc