VM Test
On this page, I record all the operations in the VM environment as a pretest before deploying on HPC2.
KVM settings
I chose the KVM scheme for virtualization. Test host environment: Ubuntu 16.04.
VM create
Install the relevant packages: sudo apt-get install qemu-kvm libvirt-bin virtinst bridge-utils cpu-checker.
Check that the installation and configuration succeeded with kvm-ok.
VM guest environment: Ubuntu 18.04 LTS server.
qemu-img create -f qcow2 master.img 5G
Note: one can also omit the qemu-img create command and instead pass --disk size=5 as an argument in the following line. The default path of the images is /var/lib/libvirt/images/.
sudo virt-install --name=master --memory=2048 --vcpus=1 --cdrom=ubuntu-18.04.2-live-server-amd64.iso -f master.img --graphics vnc,listen=0.0.0.0,password=<key>
Use Screen Sharing on macOS to access the VNC desktop and finish the installation; the default port is 5900.
Note: for multiple VMs, if the VNC port is not specified, the second VM gets port 5901, and so on.
sudo virsh list to check which VMs are currently running.
VM delete: virsh destroy domain && virsh undefine domain to drop the VM, but one still needs to delete the image. Try virsh vol-list default and virsh vol-delete --pool default name. The pool is found by virsh pool-list.
VM volume resize: sudo virsh vol-resize master.qcow2 --pool default 10G, then sudo virsh vol-info master.qcow2 --pool default to check. Never resize an online VM; it is easy to corrupt the disk! For the resize operation inside the VM, see my blog here.
Network configuration
To simulate the cluster network topology in VM clusters, see my blog here.
Basic
Python3 preparation
The operations below apply only to the master VM. Since Ansible can be installed directly via apt, the Python part might be moved later so that it is controlled by Ansible.
The following Python workflow is somewhat deprecated for our real cluster settings.
Final decision: manage Python under Spack.
Change the apt source to the TUNA mirror.
sudo apt update && sudo apt install python3-pip
There is a pip3 upgrade issue on Ubuntu 18.04; the revert command is sudo python3 -m pip uninstall pip. See here for the pip upgrade issue.
Change the default path of dist packages: create the file /etc/pip.conf and add the install target to it (a sketch is given below), and then sudo pip3 install every package you want.
export PYTHONPATH=/opt/python3/dist-packages so that python3 automatically includes it in sys.path.
One can upgrade pip by sudo pip3 install pip -U, and in .bashrc add alias pip3="python3 -m pip".
After these settings, users cannot install packages into their own home dir unless they create a file ~/.pip/pip.conf to overwrite the default installation path (also sketched below). It is worth noting that, after this user-specific config, packages are installed into the home dir even if we use sudo; only sudo su to root switches the installation dir back to /opt.
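A plausible sketch of the two pip config files; the exact contents from my original notes are not preserved here, and the /opt/python3/dist-packages target is inferred from the PYTHONPATH export above:

```
# /etc/pip.conf -- assumed system-wide config redirecting sudo pip3 installs into /opt
[install]
target = /opt/python3/dist-packages

# ~/.pip/pip.conf -- assumed per-user override falling back to the user scheme (~/.local)
[install]
user = true
```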
NFS and NTP
The NTP part can be controlled by Ansible later in production.
On master, sudo apt install nfs-kernel-server. Configure /etc/exports; note there should not be any space after the comma in the option list, nor between the network and the parentheses:
/opt 192.168.48.0/24(rw,sync)
On node1, sudo apt install nfs-common, and then:
sudo showmount -e master
sudo mount -t nfs master:/opt /opt
NFS config file path on Ubuntu: /etc/default/nfs-kernel-server, where one can change the number of NFS server daemons (the relevant line is sketched below). Remember to export PYTHONPATH on this node as well; then, using python3, the external packages work.
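For reference, the daemon-count knob in that file looks like this (8 is the Ubuntu default):

```
# /etc/default/nfs-kernel-server
# Number of servers to start up
RPCNFSDCOUNT=8
```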
On master, sudo apt install ntp ntpstat; note there is a newer toolset called timedatectl on Ubuntu, so be cautious. Edit /etc/ntp.conf (a sketch is below), then sudo service ntp restart.
ntpq -p or ntpstat to check the current state of the NTP service and its upstream.
For node1, also configure the NTP service, but with master as the server.
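A minimal sketch of the two /etc/ntp.conf variants, assuming the 192.168.48.0/24 subnet from the NFS export above and arbitrary upstream pool servers:

```
# /etc/ntp.conf on master (sketch)
server 0.ubuntu.pool.ntp.org iburst
server 1.ubuntu.pool.ntp.org iburst
# allow cluster nodes on the LAN subnet to sync from master
restrict 192.168.48.0 mask 255.255.255.0 nomodify notrap

# /etc/ntp.conf on node1 (sketch): use master as the only upstream
server master iburst
```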
Possible issue on NFS: gitlab post
SSH
Export home and mount it on node1 via NFS.
ssh-keygen -b 4096, and vim authorized_keys to add the public key line; chmod 600 the key file. Now one can ssh hostname freely without a password.
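A sketch of the key setup, assuming home is already NFS-shared so the same authorized_keys is visible on every node:

```
ssh-keygen -b 4096                                   # generate the key pair under ~/.ssh
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys      # append the public key line
chmod 600 ~/.ssh/authorized_keys                     # tighten permissions on the key file
ssh node1 hostname                                   # should now work without a password
```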
Ansible
On master, sudo apt install ansible.
Edit /etc/ansible/hosts for a temporary ping test (just follow an all-in-one roles package), then ansible all -m ping. For the vars part, you actually must specify python3; otherwise python is consulted, which doesn't exist on Ubuntu 18.04 by default. See this issue for reference.
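A sketch of such an inventory; the group name is arbitrary, and the interpreter var is the fix for the missing python binary on Ubuntu 18.04:

```
# /etc/ansible/hosts (sketch)
[cluster]
master
node1

[cluster:vars]
ansible_python_interpreter=/usr/bin/python3
```

With this in place, ansible all -m ping should return pong from every host.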
create a standalone dir for ansible roles and config management.
ansible-playbook site.yml -vv --ask-become-pass
Back up the Ansible roles to another server:
rsync -avz -e "ssh -p port" hpc/ user@ip:/home/user/
parallel ad-hoc commands
ansible all -a "bash /blah/proxy.cfg"
Remote Desktop
Edit sshd_config on the server side to turn on the X11Forwarding option.
Check that the xauth binary exists on the server: which xauth.
For the client, use ssh -X, or add ForwardX11 yes in .ssh/config (a sketch is below).
For a mac client, install XQuartz for X11 display support.
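A sketch of both sides of the X11 forwarding config; the host alias is an assumption:

```
# server: /etc/ssh/sshd_config
X11Forwarding yes

# client: ~/.ssh/config (equivalent to passing -X every time)
Host master
    ForwardX11 yes
```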
Such desktop forwarding can even be chained more than once: say A does ssh -X to B and B does ssh -X to C, then xclock on C will display the clock on A's screen.
quota
apt install quota
Edit /etc/fstab and use usrquota,grpquota instead of defaults for the root / entry, being sure to separate all options with a comma and no spaces (see the sketch at the end of this section).
sudo mount -o remount / to remount root; check the effect by cat /proc/mounts | grep ' / '.
sudo quotacheck -ugm /. This command creates the files /aquota.user and /aquota.group.
sudo quotaon -v / to turn on quota.
sudo setquota -u test 200M 220M 0 0 /; the arguments are block soft limit, block hard limit, inode soft limit, inode hard limit, filesystem.
sudo repquota -s / to generate reports of user usage on the disk.
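A sketch of the root entry in /etc/fstab with the quota options; the UUID and filesystem type are placeholders for whatever the installer wrote:

```
# /etc/fstab (sketch)
UUID=xxxx-xxxx  /  ext4  usrquota,grpquota  0  1
```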
slurm
munge
Munge comes with slurm via Ubuntu apt; make the file /etc/munge/munge.key identical on all nodes and restart the munge service. For a test, see here (and the quick check below).
Error: munged: Error: Logfile is insecure: world-writable permissions without sticky bit. Fix: chmod o+t /var.
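A quick sanity check that the shared key works across nodes (munge and unmunge ship with the munge package):

```
munge -n | unmunge             # local encode/decode test
munge -n | ssh node1 unmunge   # cross-node test; should decode successfully if both keys match
```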
debug
slurmd -Dcvv
Sometimes the slurmctld and slurmd services get very messy; one may first ps aux | grep slurm, kill them all by pid, and then try starting the services again.
integrate with mpi
Note intel-mpi is a little different to use! A direct srun doesn't work; see here. Instead use an sbatch script as follows.
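A minimal sketch of such a script; the job name, node counts, and a.out binary are placeholders:

```
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# let mpirun (not srun) launch the ranks, which sidesteps the PMI mismatch
mpirun ./a.out
```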
For Intel MPI, the path should be something like ./a.out, though for OpenMPI, a.out is enough.
Note the different implementations of the PMI interface, which is the key for srun to work as expected. But it is OK if srun itself doesn't work; we can combine sbatch and mpirun to achieve the same effect.
Note slurm can also interact with containers such as Docker or Singularity in principle.
account slurmdbd
sudo apt install slurmdbd mysql-server python-mysqldb
sudo service mysql status
Ubuntu MySQL has a default root user with an empty password, but one can only log in via sudo.
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'pw';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
in mysql. Possible MySQL missing-table issue: issue.
slurmdbd behavior is rather weird and requires an explicit sacctmgr add cluster, as well as restarting the slurmdbd and slurmctld services several times in some order (see the sketch below).
Anyway, slurm together with its ecosystem and docs is… totally a mess. Good luck.
Note that the pid file is configured twice (once in the etc config and once in the systemd service); make them consistent for all three services: slurmd, slurmdbd, slurmctld.
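A sketch of the registration and restart dance; the cluster name here is an assumption and must match ClusterName in slurm.conf:

```
sudo sacctmgr add cluster hpctest      # register the cluster in the accounting DB
sudo service slurmdbd restart
sudo service slurmctld restart
sacctmgr show cluster                  # verify the cluster is now listed
```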
Experiments on hadoop & spark
spack install spark +hadoop ^jdk@1.8.0_212. Cautions: 1. Spark is not well supported by JDK 9+; there may be a java.lang.IllegalArgumentException when running pyspark with Java >= 9, see this blog. 2. Things are made much worse by the fact that Spack couldn't handle the JDK install well, due to some invalid URL, cookies, or licence issue for Oracle Java (the newest JDK version seems OK to install by Spack, though).
My workaround: install JDK 8 with sudo apt install openjdk-8-jdk. Then add the JDK as an external package by editing packages.yaml for Spack (sketched below), and remember to spack install jdk@1.8.0_212 to add it to the Spack DB. Then try the spack install command shown at the beginning.
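A sketch of the packages.yaml entry for the apt-installed JDK; the prefix is the usual Ubuntu location and is an assumption, and newer Spack versions use an externals: list instead of paths::

```
# ~/.spack/packages.yaml (sketch)
packages:
  jdk:
    paths:
      jdk@1.8.0_212: /usr/lib/jvm/java-8-openjdk-amd64
    buildable: false
```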
To use pyspark, spack load hadoop spark jdk. Then directly use the Python interpreter provided by Spark via pyspark. Or pip install pyspark for some Python and use it there by import pyspark. Remember to set both env vars PYSPARK_PYTHON=python3 and PYSPARK_DRIVER_PYTHON=python3 to avoid a Python version mismatch in the Spark workers if you choose the original Python.
cluster mode settings
Reference: this post.
In spark/conf, cp slaves.template slaves and add the slave hostnames into the slaves file. cp spark-env.sh.template spark-env.sh and add the following lines (sketched below).
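A sketch of typical spark-env.sh lines for this setup; the exact variables depend on the deployment, and spack location -i resolves the real prefixes:

```
# spark/conf/spark-env.sh (sketch)
export JAVA_HOME=$(spack location -i jdk)
export HADOOP_CONF_DIR=$(spack location -i hadoop)/etc/hadoop
export SPARK_MASTER_HOST=master
```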
Make sure the conf dir is synced to all nodes (which it is by default if Spack installs in home and home is shared within the cluster).
Start hadoop first: `spack location -i hadoop`/sbin/start-all.sh.
Then start the spark master and slaves: `spack location -i spark`/sbin/start-master.sh and start-slaves.sh.
Task submission: pyspark --master spark://master:7077. (Note hadoop might have some issues in the above setting and running scheme.) Actually, one should edit etc/hadoop/yarn-site.xml to make pyspark --master yarn work as expected; it may be due to the memory limit of the VMs. See the post here and the sketch below. Moreover, one apparently still needs a slaves file in hadoop/etc, even though the installation does not create one.
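A sketch of the yarn-site.xml properties that usually address the VM memory limit; the values are guesses for 2 GB VMs and go inside the existing configuration element:

```
<!-- etc/hadoop/yarn-site.xml (sketch) -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>
```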
Standalone mode doc; I prefer this approach now.
env vars
SPARK_MASTER_WEBUI_PORT for the web UI port.
[Pyspark: Exception: Java gateway process exited before sending the driver its port number](https://stackoverflow.com/questions/31841509/pyspark-exception-java-gateway-process-exited-before-sending-the-driver-its-po): spack load jdk to set JAVA_HOME, otherwise this error appears when creating the Spark sc (SparkContext).
CGROUP
Setup on Ubuntu 16.04 for a test. ref, ref. cgroup: CPU setting.
sudo apt install cgroup-tools
sudo vim /etc/cgconfig.conf and sudo vim /etc/cgrules.conf (both sketched below).
check cgroup working status:
cat /sys/fs/cgroup/cpu/app/calculation/tasks
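A sketch of the two files, assuming the app/calculation group checked above and a hypothetical user test:

```
# /etc/cgconfig.conf (sketch)
group app/calculation {
    cpu {
        cpu.shares = 512;
    }
}

# /etc/cgrules.conf (sketch): <user> <controllers> <destination group>
test    cpu    app/calculation
```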
Misc
To limit the CPU usage, combine cpu.cfs_quota_us and cpu.cfs_period_us; the ratio is the number of cores one can maximally utilize. This is a hard limit, compared to cpu.shares, which is just a soft limit and a nice-like primitive.
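For example, capping the group above at two cores (a sketch under cgroup v1; 200000/100000 = 2):

```
echo 100000 | sudo tee /sys/fs/cgroup/cpu/app/calculation/cpu.cfs_period_us
echo 200000 | sudo tee /sys/fs/cgroup/cpu/app/calculation/cpu.cfs_quota_us
```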
Intended workflow
How to achieve it with minimal steps.
For master node
OS installation
create user account with sudo
plug in two ethernet cables and enable the WAN network
manually assign the LAN address: ip addr add ip dev lan_nic
enable and run sshd, and generate keys with ssh-keygen
apt install ansible
config inventory and vars for ansible roles
run ansible on master only
For slave node
OS installation
create user account with the same name and password
ethernet cable to master switch
sshd enable and run
paste the pub key into authorized_keys
For master node
run ansible again for all nodes