Slurm
In this section, I will discuss Slurm and its ecosystem, including how it interacts with MPI, Python, Mathematica, and so on.
In a word, Slurm is very powerful, but honestly speaking, its documentation is poorly presented and its community is neither large nor very active.
PMI is the key interface for understanding how Slurm works with MPI. Both Slurm and the MPI libraries ship a lower-level process-management layer, called PMI, that handles resource allocation. For srun, Slurm's own PMI is used; it can be selected with --mpi=<pmi>, and all supported PMI plugins can be listed with srun --mpi=list.
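As a concrete illustration (a minimal sketch: the plugin name pmi2 and the binary ./a.out are assumptions, check what your own build actually reports):

```bash
# list the PMI plugins this Slurm build supports
srun --mpi=list
# launch with an explicit PMI plugin (pmi2 is a common choice, if the build supports it)
srun --mpi=pmi2 -n 4 ./a.out
```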
Note that the program is compiled with mpicxx, which depends on the PMI implementation of the corresponding MPI library. The PMI used for compiling (provided by the MPI library) and the PMI used for running (srun provided by Slurm, or mpirun provided by the MPI library) must be the same. If they are not, the common symptom is that each process thinks it is a standalone rank-0 process.
To make them consistent, there are two approaches: either compile Slurm with more PMI support, or compile the corresponding MPI implementation with Slurm's PMI support. Note that the supported PMI set is very limited for an apt installed Slurm, and the PMI header file is also missing from a standard package installation, so compiling MPI implementations with Slurm PMI support is also subtle.
Therefore, the best practice with minimal maintenance effort is to always use mpirun within an sbatch script and avoid srun. The sbatch script is still the highly recommended way to submit tasks, rather than directly using mpirun -host <hostname,list>. Firstly, an sbatch task is under the control and accounting of Slurm, and it keeps running even after logout. Secondly, the environment variables of the master node are broadcast to the compute nodes before the task begins, which is very handy. Besides, for mpirun to reach the compute nodes, ORTE needs an ssh connection, which is blocked by the PAM plugin of Slurm, so the only way to run jobs on compute nodes is via the Slurm interface.
Remember the -fopenmp flag for mpicc (use -openmp for the Intel compiler and -mp for the PGI compiler); the rest is similar to the plain MPI workflow. See and for script demos. Also see a nice C code example explicitly using MPI and pthreads. Remember to adjust the environment variables OMP_NUM_THREADS and KMP_AFFINITY for better performance; see for affinity configuration.
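A minimal hybrid MPI+OpenMP sbatch sketch following the mpirun-inside-sbatch practice above (the node/task counts and the binary ./hybrid_app are illustrative assumptions):

```bash
#!/bin/bash
#SBATCH -N 2                  # two nodes (illustrative)
#SBATCH --ntasks-per-node=2   # two MPI ranks per node
#SBATCH --cpus-per-task=14    # CPUs reserved per rank (illustrative)

# one OpenMP thread per CPU allocated to each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# thread pinning; only meaningful with the Intel OpenMP runtime
export KMP_AFFINITY=compact

mpirun ./hybrid_app           # hypothetical binary built with mpicc -fopenmp
```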
Direct mpiexec is not allowed for root. See , and the example sbatch script below.
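A sketch of such an sbatch script (the job name, node/rank counts and ./a.out are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=mpi-demo   # placeholder job name
#SBATCH -N 2                  # nodes (illustrative)
#SBATCH -n 56                 # total MPI ranks (illustrative)
#SBATCH -o mpi-demo.%j.out    # %j expands to the job ID

# per the practice above: mpirun inside sbatch instead of srun
mpirun ./a.out                # placeholder for the mpicxx-compiled program
```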
? Still no clear idea about the relation between -n and the real number of CPU cores the job can use (it seems to be one per thread).
In slurm.conf, set AccountingStorageTRES=gres/gpu,gres/gpu:tesla. And in the partition line, TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0". If TRESBillingWeights is not defined, the job is billed against the total number of allocated CPUs.
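Put together, such a configuration might look like the sketch below (the partition name, node list and weights are illustrative):

```
AccountingStorageTRES=gres/gpu,gres/gpu:tesla
PartitionName=gpu Nodes=c[1-4] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
```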
The shares value is set per user with sacctmgr. GrpTRESMins in sacctmgr modify limits a user's total running time, and with the multifactor priority plugin the raw usage can be checked by sshare.
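For example (a sketch; the user name alice and the numbers are hypothetical):

```bash
# set the fair-share "shares" value for a user
sacctmgr modify user where name=alice set fairshare=10
# cap the user's accumulated CPU-minutes
sacctmgr modify user where name=alice set GrpTRESMins=cpu=10000
# inspect raw usage and fair-share factors
sshare -a
```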
Nodes may be in more than one partition.
scontrol show partition
sacctmgr add user name partition=gpu, or in slurm.conf: PartitionName=MONTHLY1 AllowAccounts=diamond Nodes=compute-0-0. More conf parameters include AllocNodes (not sure how it differs from Nodes), Default=YES, Hidden, State, PreemptMode, TRESBillingWeights.
A blank list of nodes (i.e. "Nodes= ") can be used if one wants a partition to exist, but have no resources (possibly on a temporary basis).
Specify the partition from the user's perspective, via one of the following (see the sketch below):
environment variable
batch script option
command line option
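A sketch of the three methods (the partition name gpu and the script name job.sh are assumptions; SBATCH_PARTITION is the input environment variable read by sbatch):

```bash
# environment variable (read by sbatch at submit time)
export SBATCH_PARTITION=gpu
# batch script option (inside the sbatch script)
#SBATCH --partition=gpu
# command line option
sbatch --partition=gpu job.sh
```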
Jobs can be preempted between queues (partitions). This also needs the global setting PreemptType=preempt/partition_prio.
sacctmgr show tres
In slurm.conf, we have GresTypes as a comma-delimited list of generic resources to be managed (e.g. GresTypes=gpu,mps). These resources may have an associated GRES plugin of the same name providing additional functionality. No generic resources are managed by default. Ensure this parameter is consistent across all nodes in the cluster for proper operation. The slurmctld daemon must be restarted for changes to this parameter to become effective.
There is also Gres in the node line of slurm.conf, a comma-delimited list of generic resource specifications for a node. The format is "<name>[:<type>][:no_consume]:<number>[K|M|G]". The first field is the resource name, which matches the GresTypes configuration parameter name. The optional type field might be used to identify a model of that generic resource. A generic resource can also be specified as non-consumable (i.e. multiple jobs can use the same generic resource) with the optional field ":no_consume". The final field must specify a generic resources count. A suffix of "K", "M", "G", "T" or "P" may be used to multiply the number by 1024, 1048576, 1073741824, etc. respectively (e.g. "Gres=gpu:tesla:1,gpu:kepler:1,bandwidth:lustre:no_consume:4G"). By default a node has no generic resources and its maximum count is that of an unsigned 64-bit integer.
Jobs will not be allocated any generic resources unless specifically requested at job submit time using the options:
--gres: generic resources required per node, the recommended way; general syntax for job submission: --gres=gpu:kepler:2
--gpus: GPUs required per job
Besides, MPS is a finer-grained way to schedule GPU jobs, where one GPU resource can be divided into smaller shares.
User system hierarchy: cluster - account - user; this triple is called an association.
AccountingStorageEnforce=limits,qos in slurm.conf. This line is important for QOS to work. The limits option also implies the associations option, meaning users who have not been added to Slurm's database cannot use Slurm.
Note: A user's account can not be changed directly. A new association needs to be created for the user with the new account. Then the association with the old account can be deleted.
Note the username must be the same for the OS and Slurm.
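A sketch of building the cluster-account-user association with sacctmgr (all names are hypothetical):

```bash
sacctmgr add cluster mycluster
sacctmgr add account physics Description="physics group" Organization="uni"
sacctmgr add user alice Account=physics
```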
GrpTRESMins=cpu=10,mem=20 would make 2 different limits 1 for 10 cpu minutes and 1 for 20 MB memory minutes. This is the case for each limit that deals with TRES. To remove the limit -1 is used i.e. GrpTRESMins=cpu=-1 would remove only the cpu TRES limit. When dealing with Memory as a TRES all limits are in MB.
Node QOS is weird. It seems the system counts each task as a node: even if more tasks share the same node, the node count increases. Instead, try limiting by CPU. Is "cpu" counted in cores or threads? Experiments: cpu in the QOS context is counted by cores, e.g. cpu=28 may effectively limit to 56 threads. It is worth noting, however, that --cpus-per-task is given in threads (not sure now, there is conflicting evidence...)! Slurm seems to have a mixed-up conception of "cpu".
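A sketch of creating such a CPU-limiting QOS and attaching it to a user (the QOS name, user name and value are illustrative):

```bash
sacctmgr add qos cpu28
sacctmgr modify qos cpu28 set GrpTRES=cpu=28
sacctmgr modify user where name=alice set qos=cpu28
```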
A QOS limit is more flexible to fine-tune than an account limit, since an account cannot be changed without deleting the user. Alternatively, sacctmgr modify user where user=example set defaultaccount=groupb at least changes the default group.
sacctmgr add coord names=blah adds coordinators, i.e. users who can modify other users' QOS and so on.
merged into ansible workflow
sudo apt install libpam-slurm. Edit /etc/pam.d/sshd and add the line account required pam_slurm.so. The order of plugins is very important: pam_slurm_adopt.so should be the last PAM module in the account stack. Also add the line account required pam_access.so.
Edit /etc/security/access.conf, add the following
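(The exact entries are site-specific; a typical sketch that permits root and denies other direct logins looks like the following. Adjust the allowed users or groups to your site.)

```
# /etc/security/access.conf sketch: allow root, deny everyone else
+:root:ALL
-:ALL:ALL
```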
Issue: ssh always seems to fail, even if the user has a task running on the corresponding node. It is not a big issue, though, since the Slurm version of "ssh" still works. Solved: this issue was due to a mismatch between pam_slurm and pam_slurm_adopt.
scontrol show node c1, to see how many cores are really allocated.
scontrol show job 237, to see the status of a given job.
scontrol show assoc, to check all details of users, accounts and QOS.
sacct -a, to see all users' jobs, past and current.
scontrol ping, to check the status of the master and backup nodes for slurmctld.
Use the job name as an identifier; in sbatch use #SBATCH --job-name uuid. The job can then be canceled by scancel -n uuid. Check a history job with sacct --name=ab345-98iy6-iu7-10299 --format=User,JobID,Jobname%30,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist.
However, by default, scontrol can only access that information for about five minutes after the job finishes, after which it is purged from memory. In other words, scontrol can only be used to view info on currently running jobs.
sdiag, to check scheduling-relevant info and RPC calls.
sprio, to check job priority factors and so on.
sattach jobid, to directly see stdout and stderr of a running job.
sshare
strigger: event trigger, e.g. strigger --set --node --down --program=/usr/sbin/slurm_admin_notify.
sacctmgr show stats, for RPC call statistics on the accounting side.
scontrol reconfigure lets slurmctld reload the conf, though some parameters only take effect after a restart. scontrol show config shows the current configuration.
An asterisk (*) in the node state shown by sinfo indicates the node is unreachable.
On some systems, squeue is aliased as alias squeue="squeue -u <user>", so you cannot directly view other users' jobs. But you can unalias squeue, after which squeue shows all jobs by all users.
The srun -n parameter seems to work well: neither Mathematica LaunchKernels nor NumPy with multi-threaded MKL can exceed the limit set by -n.
A job's expected start time can be seen using the squeue --start command.
PrivateData controls whether some info is accessible to normal users. It determines what type of information is hidden from regular users. By default, all information is visible to all users. The SlurmUser and root users can always view all information. Multiple values may be specified with a comma separator. Acceptable values include: accounts, jobs, nodes, users and so on.
Test legal nodelist syntax
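One way to do this is to let scontrol expand a hostlist expression (the node names are illustrative):

```bash
# prints c1, c2, c3 and c5, one per line, if the expression is legal
scontrol show hostnames "c[1-3,5]"
```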
sinfo -R shows reasons for node status. A node stuck in the completing state for a long time is an indication that something is wrong; the most probable cause is that the master and the node somehow got disconnected.
If SelectType=select/linear is configured, all resources on the selected nodes will be allocated to the job/step. If SelectType=select/cons_res is configured, individual sockets, cores and threads may be allocated from the selected nodes.
The sbatch --mem limit seems not to work: the task can easily go over the memory limit without any problem.
The PropagateResourceLimits or PropagateResourceLimitsExcept parameters in slurm.conf control propagation of the submitting shell's resource limits and can avoid propagation of specified limits. Configure these two parameters unless you want the submit-side ulimit values to be effective on the compute nodes.
NodeName should always be consistent with the local hostname for newly provisioned machines.
Burst from an on-prem Slurm head node that is managing an on-prem compute cluster. You need to ensure that you can resolve AWS private addresses, either through AWS Direct Connect and/or a VPN layer. So the hybrid scheme is indeed available now.
To be studied in detail.
Some recap: copy tunnel.m into /opt/mathematica/11.0.1/Kernels/Packages, add both tunnel.sh and tunnel_sub.sh into ~/.Mathematica/FrontEnd, and chmod +x the two scripts. (It may be possible to configure the tunnel scripts in a system-wide context?) tunnel.sh is for the remote controlling kernel, launched from the GUI; tunnel_sub.sh is for compute kernels, launched indirectly by the controlling kernel.
Use the local GUI to launch Mathematica remote kernels on the master node. Of course you need to configure passwordless ssh login for the user, or add the password above. The two files in the launch command are for tunnel.sh and the math kernel. On the master node, further launch more remote kernels on compute nodes; you can achieve a 224-thread parallel table on our cluster! The standard Mathematica parallel and remote kernel workflow is settled as above.
References:
spark
See and and
For job arrays, some important notes. Use #SBATCH --array=1-5 to name the jobs as jobid_1 and so on. In the sbatch script, use the env var ${SLURM_ARRAY_TASK_ID} to get the id of the specific task. For -N or -n in Slurm, just use the value for one task. %A in the #SBATCH line becomes the job ID; %a in the #SBATCH line becomes the array index. See for more use-case demos.
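A job-array sbatch sketch putting these pieces together (the job name, program and input naming scheme are hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=array-demo      # placeholder name
#SBATCH --array=1-5                # tasks <jobid>_1 ... <jobid>_5
#SBATCH -n 1                       # resources are specified per array task
#SBATCH -o slurm-%A_%a.out         # %A = job ID, %a = array index

# each array task handles its own input, e.g. input_1.dat ... input_5.dat
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```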
See for details. In short: srun -N 1 -n 1 -w node1 --pty bash -i, a Slurm way of ssh. Or salloc -n2 -N1 -t 1:00:00 to allocate a node with the given time and resources (2 CPU cores), then ssh to it (-X is supported in this case).
x11 forwarding option for srun:
Use the Weight option in the NodeName line in slurm.conf to change the priority of assigned nodes (nodes with lower weight are preferred), see . Also, the CPU cores and sockets must be given explicitly in slurm.conf; Slurm has no ability to auto-detect them.
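A sketch NodeName line with explicit topology and a weight (the hostname and hardware counts are made up for illustration):

```
NodeName=c1 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=128000 Weight=10 State=UNKNOWN
```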
Details and roles of the sacctmgr family of commands: (better than the official doc)
Share in user's perspective:
More on . The gres.conf file may be omitted completely if the configuration information in the slurm.conf file fully describes all GRES. The file will always be located in the same directory as the slurm.conf file.
QOS settings by sacctmgr: .
Bringing a node back to service from the down state: restarting slurmctld or slurmd won't work, see . scontrol update nodename=c2 state=IDLE, or scontrol update nodename=node10 state=resume; use resume if there is a job running on the node!
There is no conf option to randomize node assignment: , which is somewhat hard to believe.
The ~ suffix in sinfo node states indicates power-save mode.
more on sacct usage to check history job status:
singularity plugin
Mailprog example:
Resource reservation by scontrol:
Network topology in slurm:
Scheduling configuration:
Job preemption guide:
High-throughput guide, namely fine-tuning Slurm for bursts of short jobs:
Large system fine tuning:
Gres:
Shuguang doc on slurm:
Tianhe doc on Slurm from the admin's perspective:
Heterogeneous job submission: (including block, plane and cyclic allocation)
Parameters for controlling multi-core and multi-thread allocation:
Job status and info to elasticsearch:
Slurm configuration explanation, way better than the official doc:
sbatch script examples:
. Seems to already have some support for the hybrid HPC scheme.
slurmctld will check the boot time to make sure resume works; use -b for slurmd in local tests of the elastic feature of Slurm.
AWS approach to achieve automatic scale-up and scale-down of compute nodes with Slurm plugins: , quoted as
Directly access jupyter servers on compute nodes. SSH forwarding with four machines: . A solution for jupyter on compute nodes.
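A sketch of chained local port forwarding to reach a jupyter server on a compute node (the hostnames, user and port are hypothetical):

```bash
# on the laptop: forward local port 8888 to the login node
ssh -L 8888:localhost:8888 user@login.cluster
# then, on the login node: forward its port 8888 to the compute node running jupyter
ssh -L 8888:localhost:8888 user@c1
# finally open http://localhost:8888 in the local browser
```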
mpirun -n 2 python3 pympi.py; see for a working Slurm example.
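A minimal sketch of what such a pympi.py could contain (assuming mpi4py is installed; the file name simply follows the command above):

```python
# pympi.py: minimal MPI sanity check via mpi4py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# with a consistent PMI setup each process reports a distinct rank;
# if every process prints "rank 0 of 1", suspect the PMI mismatch discussed earlier
print(f"rank {rank} of {size}")
```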
See of the tunnel tool. Or the more general reference in the Mathematica docs: ParallelTools/tutorial/ConnectionMethods.
mathematica usage as some hpc manual
mathematica on a Raspberry Pi cluster
remote kernel launching
The Slurm sbatch script to utilize a Spark cluster: , .
Spark-enabled jupyter: , .
Using a database instance in HPC: