Real Setup
This section reviews the actual setup of the cluster.
hardware
switch
Ports 23 and 24 are moved into VLAN2 on the S1720-28GWR-4P.
AP
The reserved IP is 192.168.1.250/24. DHCP is turned off and the device sits in AP mode. Clients connecting through it get IPs assigned by the master node's DHCP service in the 48 subnet.
master server
LAN: upper RJ45 port (10/24); WAN: lower RJ45 port (44/24).
The master node is currently set to a static IP on the WAN side.
Hard disks (positions may not be accurate now): lower left for the 2.5" SSD; lower right for sdb, the old 2T HDD; upper left for sdc, the new 3.5" 2T HDD.
computation server
The leftmost RJ45 port is used (not the iDRAC one). The NIC name is eno1 in the Ubuntu OS.
outdated server
CentOS 6.6, named d1. For operations on this server, please refer to this section.
Note: d1 is not ready to be opened to users; it is currently offline.
software
basics on master
First use fdisk to create one partition on each of sdb and sdc, then use mkfs.ext4 /dev/sdb1 (and likewise /dev/sdc1) to format the two disks. Mount the two 2T HDDs /dev/sdb1 and /dev/sdc1 at /DATA and /BACKUP, where /DATA has permissions similar to /tmp and is shared over NFS. Namely, chmod a+w /DATA, chmod a+t /DATA. Meanwhile, /BACKUP is writable only by root. There are several root crontab backup tasks, managed via rsync -az, copying /home/ubuntu and /opt to /BACKUP. Besides, the config file /etc/fstab is set up so that /DATA and /BACKUP are mounted automatically on reboot.
The backup crontab and the fstab mount config have not been included in the ansible workflow, for flexibility. A sketch of the whole procedure follows.
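A minimal sketch of the above (device names as above; the crontab schedule is an assumption for illustration):
sudo fdisk /dev/sdb                                # create one partition, sdb1; likewise for sdc
sudo mkfs.ext4 /dev/sdb1 && sudo mkfs.ext4 /dev/sdc1
sudo mkdir -p /DATA /BACKUP
sudo mount /dev/sdb1 /DATA && sudo mount /dev/sdc1 /BACKUP
sudo chmod a+w /DATA && sudo chmod a+t /DATA       # /tmp-like permissions
# /etc/fstab entries for automatic mounting on reboot:
# /dev/sdb1  /DATA    ext4  defaults  0  2
# /dev/sdc1  /BACKUP  ext4  defaults  0  2
# root crontab line in the spirit of the rsync backup tasks (daily 3am assumed):
# 0 3 * * * rsync -az /home/ubuntu /opt /BACKUP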
For the network, NFS, NTP, and apt setups, please see the relevant section in the Virtual Machine part.
Note that the netplan config logic is merging instead of overwriting, so one must add dhcp4: false to config.yaml to make sure no DHCP is used.
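A minimal config.yaml fragment illustrating the merge pitfall (the interface name and address here are assumptions, not the actual config):
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: false        # must be explicit, or a dhcp4: true from another file survives the merge
      addresses: [192.168.48.1/24]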
Note: for the proxy part, some software does not honor http_proxy and needs its proxy set in its own way. Such apps include apt, git, crontab, and docker (note there are four different types of proxies you may want to configure for docker!). (Maybe /etc/environment is a better place for the http proxy variables.) Also note that apt-add-repository uses the http_proxy env var instead of apt.conf, meaning you need sudo -E.
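For example, apt can be given its own proxy via a conf file, and git via git config (the proxy address below is an assumption):
# /etc/apt/apt.conf.d/95proxy
Acquire::http::Proxy "http://192.168.48.1:8123/";
Acquire::https::Proxy "http://192.168.48.1:8123/";
# git:
git config --global http.proxy http://192.168.48.1:8123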
swap partition
The swap partitions on c4 to c9 are a bit annoying: their large size can silently slow down programs through heavy IO. Turn them off with sudo swapoff /dev/sda3, and also delete the swap line in /etc/fstab. However, the swap partition sits before the root partition, so there is no easy way to reclaim the disk space. Personally, I always recommend using a swap.img file as swap, which is much more flexible than a swap partition (see this for swap file config).
Temporary way to utilize the empty swap partitions: mount them locally under the /tmp/extra dir on c4 to c9, via the ansible commands below.
ansible -i hosts cn[3:8] -m filesystem -a "dev=/dev/sda3 fstype=ext4 force=yes" --become -K
ansible -i hosts cn[3:8] -m file -a "path=/tmp/extra state=directory mode=01777" --become -K
ansible -i hosts cn[3:8] -m mount -a "path=/tmp/extra src=/dev/sda3 fstype=ext4 state=mounted" --become -K
Note how cn[3:8] corresponds to c4 to c9.
ansible
sudo apt install ansible
on the master node.
Please see the ansible playbooks in HPC2. The playbooks are now open source; see here.
Test command: ansible-playbook site_test.yml -i hosts_test -vv
, remembering to change the group role in hosts_test; this test runs against a remote VM server.
gpu drivers
driver-418 seems to have vanished from the PPA; install 430 instead on c9.
nvlink commands from nvidia-smi: post
spack
into ansible workflow
Just see the practice notes on spack. Combine spack with lmod to provide a consistent interface for package management.
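A sketch of that combination following the spack docs (paths depend on the local install):
spack install lmod
. $(spack location -i lmod)/lmod/lmod/init/bash    # enable the module command
. $SPACK_ROOT/share/spack/setup-env.sh             # put spack-generated modules on MODULEPATH
module avail                                       # installed packages now show as modules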
some spack things to note
After spack install rclone, a go folder shows up in the home directory, outside the spack folder! This seems to be because GOPATH keeps its default value instead of being set inside the spack folder. See this blog for more info on go project and package organization at the filesystem level. Solved by this commit.
python
into ansible workflow, together with the intel parallel studio installation; always have users use intel python and its conda
Preferred way: intel python + spack pip: spack load intel-parallel-studio, spack load py-setuptools, spack load py-pip. Environments are created with intel python + conda.
Never use the admin account's global pip. Reason: the packages would be installed in ~/.pip. If spack-pip later installs some package, any dependency already present in ~/.pip is picked up automatically, but that folder is not accessible to other users, which may lead to chaos in the python packages. For normal users, however, global pip3 is the recommended way to install packages.
jupyter
Use intel python and pip as root, pip install jupyter ipyparallel jupyter_contrib_nbextensions
. Somehow the cluster tab works after several trials; the exact fix is unknown though.
mathematica
Installed by the bash script into /opt/mathematica/<verno>; the script itself sits in the bin subdirectory of that path. Add it as a package in the spack override repo, then spack load mathematica
to use it.
Possible issue: activation has to be carried out on a per-user, per-node basis. Maybe have a look at MathLM (the network license manager) in the future. A script has already been written to activate all nodes at once.
To utilize remote kernel launching with a better interface, add tunnel.m under Kernels/packages. Besides, one should also add tunnel.sh and tunnel_sub.sh in the home directory ~/.Mathematica/FrontEnd.
One-liner to make it usable by a given user: ansible -i /home/ubuntu/hpc/hosts all -m command -a "/usr/bin/python3 /home/ubuntu/softwares/mmaact/automma.py '/opt/mathematica/11.0.1/bin/math'" --become-user=<user> -Kvv --become
.
matlab
Mount the ISO to a dir and use ssh -X to install via X11 forwarding. Remember to umount and then mount iso2. It is worth noting that the matlab installer works over ssh -X forwarding, while for matlab itself only ssh -Y works in the remote desktop scheme.
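A hypothetical sequence for the two-disc install (ISO paths and the mount point are assumptions):
ssh -X admin@node                                  # X11 forwarding for the graphical installer
sudo mount -o loop /DATA/iso/matlab_dvd1.iso /mnt/matlab
sudo /mnt/matlab/install
# when the installer asks for the second disc:
sudo umount /mnt/matlab
sudo mount -o loop /DATA/iso/matlab_dvd2.iso /mnt/matlab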
singularity
spack install singularity
, remember to check spack edit singularity
: there is a post-install warning asking you to run a script that sets the s bit on some files. This is crucial for singularity to be runnable by normal users.
ganglia
into ansible workflow
apt-get install ganglia-monitor ganglia-monitor-python gmetad ganglia-webfrontend
ganglia-monitor is the client-side gmond. On clients, only ganglia-monitor and ganglia-monitor-python should be installed; the latter is needed for the python modules that collect node metrics.
Please refer to this post on the configuration file changes.
The ganglia configuration and installation workflow has been merged into the ansible playbooks.
gmetric can be customized to report extra metrics; see the temperature example here.
Apache password-protected sites: digitalocean
To make gmetric expire stale metrics (mailing list): pass a finite int to the -d (dmax) flag of the gmetric CLI. Also see this post.
For the GPU monitoring part, from the beginning I was thinking about incorporating the nvidia plugin for ganglia (github). But that solution is too invasive and, based on search results on the internet, success is not guaranteed. So finally I decided to write a small script that collects GPU data from nvidia-smi and sends it to gmond via the gmetric command, just as I did for CPU temperature.
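A minimal sketch of that idea (metric names and dmax are assumptions, not the actual script):
#!/bin/bash
# collect per-GPU stats from nvidia-smi and push them to gmond via gmetric
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits |
while IFS=', ' read -r idx util mem; do
  gmetric --name "gpu${idx}_util" --value "$util" --type uint16 --units '%' --dmax 120
  gmetric --name "gpu${idx}_mem_used" --value "$mem" --type uint32 --units MB --dmax 120
done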
The default apt package for the ganglia web frontend has no default cluster view; fixed by this, as reported in this issue.
ELK
into ansible workflow
add elastic repo and key
apt install elasticsearch; ES binds to localhost instead of master
apt install kibana and configure nginx reverse proxy
apt install logstash and configure the pipeline; note the IP binding in the beats input config (the string IP needs "" quotes)
apt install filebeat (on all nodes)
sudo filebeat setup --template -E output.logstash.enabled=false -E 'output.elasticsearch.hosts=["localhost:9200"]'
sudo filebeat setup -e -E output.logstash.enabled=false -E output.elasticsearch.hosts=['localhost:9200'] -E setup.kibana.host=localhost:5601
sudo filebeat setup --pipelines --modules system nginx mysql -E output.logstash.enabled=false -E output.elasticsearch.hosts=['localhost:9200'] -E setup.kibana.host=localhost:5601
(ref) Summary of the above three commands to init filebeat: the -E flags temporarily point the output at the ES database so that templates and pipelines get written, since filebeat itself is configured to output to logstash. Note that filebeat setup needs this temporary link to the ES database; it is a must.
Working test:
curl -X GET "localhost:9200/_cat/indices?v"
disable logstash ssl (seems to be disabled by default)
edit the index in the logstash output (add beat.hostname to the index name), as sketched below
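A hypothetical fragment of the logstash output section showing that index change (field names follow the 6.x beats convention):
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{[@metadata][beat]}-%{[beat][hostname]}-%{+YYYY.MM.dd}"
  }
}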
filebeat system module, timezone problem: post, post with logstash setup
Detailed explanation of the timestamp mismatch for the system module of filebeat: there are two types of log files. The first type carries time zone information or explicitly claims UTC timestamps, so the filebeat parser can confidently write the data into ES with UTC timestamps. ES always accepts UTC timestamps, and @timestamp is time-zone agnostic. The reason the timestamps look right in kibana is that kibana by default renders timestamps in the browser's timezone, i.e. the local OS.
Coming back to the second type of log files, like syslog and auth.log in ubuntu: they carry timestamps but no indication of whether those are UTC or localtime. Actually, rsyslog can keep running on UTC even after the OS is switched to another timezone; this persists until the rsyslog service is restarted. So to parse these logs and write UTC timestamps to ES, filebeat must have a way to specify whether the literal time strings in syslog should be converted by some timezone before being written to ES. This is in principle configured in /etc/filebeat/modules.d/system.yml: set the variable var.convert_timezone to true (the default seems to be false), and in principle you now get the correct time view in kibana.
The mechanism behind convert_timezone is an ingest pipeline (pipeline basics) in ES; that is, the conversion happens on the ES side instead of being handled by filebeat itself.
But reconfiguring filebeat turns out not to be that easy. There are two totally different cases. In the first, filebeat outputs directly to ES, which seems to be the default case supported by the docs. Here one should first stop the filebeat service, delete all previous pipelines via curl, and then start the filebeat service again; everything should now be fine. In this case, restarting filebeat automatically regenerates the pipelines in ES, so you don't need to care about them.
The second use case has filebeat output to logstash. I don't think this setup is very meaningful nowadays, since filebeat seems quite powerful by itself. However, if you insist on this approach and try to fix the timezone problem, it is a little subtle: a simple change to system.yml won't work, though that may be a bug rather than a feature. There are several differences from the direct-to-ES case. Here the pipelines in ES cannot be autogenerated when filebeat starts, which is fair since filebeat never talks to ES directly at normal runtime. So after stopping the filebeat service and deleting all previous pipelines in ES, you need to generate the pipelines by hand using the filebeat setup tool as indicated in the bash block above. Apart from setup --pipelines, setup -e is also suggested just in case: setup -e configures both the index template in ES and the dashboard in kibana, while setup --pipelines, as the option suggests, adds the pipelines to ES. One can restart filebeat after these two commands. The extra -E flags give temporary configuration at setup time, overriding the defaults in filebeat.yml; this is necessary because filebeat must connect to ES directly at configuration time to write the pipelines and index templates into ES. It is also worth noting the -M options for generating pipelines: it turns out that hacking convert_timezone in the yml files under modules.d does not work for logstash output. Instead, you MUST specify it explicitly via -M options when generating the pipelines. In short, -E is for general configuration overrides while -M is for module configuration overrides.
A given pipeline in ES can be inspected with
curl -XGET 'http://localhost:9200/_ingest/pipeline/filebeat-6.8.0-nginx-e*'
, making sure there is a timezone key field in it when convert_timezone is enabled. Also, a debugging tip: set the kibana time range to "this week", so you can notice when something is written into the future by a misconfiguration.
In sum, time is a big topic and a subtle issue in the ELK stack, and in development in general. Pay attention and be careful!
to change config of filebeat
service filebeat stop
curl -XDELETE 'http://localhost:9200/_ingest/pipeline/filebeat*'
run the two setup steps from the summary
service filebeat start
sudo /usr/share/elasticsearch/bin/elasticsearch-setup-passwords interactive
(Thanks to the elastic folks, x-pack security is free as of 6.8.0+.) Configuring the password in logstash requires quotes.
To query ES with user authentication, just add the
-u esuser:espass
option to curl commands.
Basic debug:
unset http_proxy
, then curl -v --user <user>:<pass> -XGET 'http://master:9200/_cluster/health?pretty'
For an ES cluster, failure of some ES node may lead to HTTP authentication error 401; it may have nothing to do with user passwords or authentication.
Misc note:
For debug tests against ES, remember that curl will go through the proxy!
the "no JAVA_HOME specified" warning in the ES service log doesn't matter
logstash config intro, grok official guide
actually it is OK that the hostname is missing, but the log volume from compute nodes is just too small compared to the master… It is not an issue of the ELK stack, but of non-up-to-date syslog (time mismatch)
timezone issue of syslog: every daemon can suffer the timezone issue, and it is only solved by a service restart! Case solved.
ES basic query syntax: doc
"pipeline does not exist" error when modules are enabled for filebeat: post. Run
curl -XDELETE "http://localhost:9200/_ingest/pipeline/filebeat-*"
to resolve conflicts with possible old pipelines? The actual problem: the pipelines for all modules must be loaded in a single invocation; otherwise the later one overwrites the earlier one.
Timestamps in ES are always UTC, but kibana shows them in the browser's default timezone.
metric monitoring and optimization: blog. Pay attention to heap size.
cluster conf
ssl is a must
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil ca --pass "" --out elastic-stack-ca.p12
,sudo /usr/share/elasticsearch/bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12 --pass "" --out elastic-certificates.p12 --ca-pass ""
, we only need the final elastic-certificates.p12 file.
elastalert
into ansible workflow
In general, a cool, reasonable and easy-to-follow tool. The logic flow is nicer than tools that sit as middleware inside logstash: here we just query the ES database periodically and send alerts according to some predefined rules.
pip3 install elastalert
pip3 install "elasticsearch>=5.0.0"
apt install elastalert
elastalert-create-index
For mail configuration, see this issue. It is better to use the campus mail as From, since the cluster is air-gapped from the Internet. But other smtp servers also work; just set up a port forwarding on the proxy server.
Note: to use elastalert-test-rule
, first unset http_proxy
so that the localhost ES is reachable.
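A minimal sketch of a rule file, assuming a frequency rule over the filebeat indices (name, threshold, and query are all assumptions):
# rules/ssh_fail.yaml
name: ssh-auth-failures
type: frequency
index: filebeat-*
num_events: 20
timeframe:
  minutes: 10
filter:
- query:
    query_string:
      query: "system.auth.ssh.event: Failed"
alert:
- email
email:
- admin@example.com
Test a single rule with elastalert-test-rule rules/ssh_fail.yaml (after the unset http_proxy above).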
quota
See the reference on ubuntu 18.04 quota commands: digital ocean; it is a very good write-up.
sudo apt install quota
Needs further experiments on the VM cluster before being applied; always be careful with disk stuff.
See the VM corresponding part for operations.
Not included in ansible for flexibility reasons. Only used on the master node.
ulimit
merged into ansible workflow
user level vs shell level: ref.
The hard limit can only be changed by root; the soft limit is what actually applies and can be raised by any user up to the hard limit, but you have to change it yourself first.
ulimit config files live in /etc/security/limits.d
, see demo
check by ulimit -a
for individual users.
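A hypothetical drop-in illustrating the format (the values are placeholders, not the cluster's actual limits):
# /etc/security/limits.d/90-cluster.conf
#<domain>  <type>  <item>   <value>
*          soft    nofile   4096
*          hard    nofile   65536
*          hard    nproc    8192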
fail2ban
sudo apt install fail2ban
sudo fail2ban-client set sshd unbanip 166.
numa
apt install numactl
apt install hwloc
cgroup
sudo apt install cgroup-tools
tinc
sudo apt install tinc
Combine the tinc VPN with the HTTP proxy so that the proxy's IP is not exposed to everyone: make the HTTP proxy listen only on the tinc interface (a config sketch follows after the commands below).
tincd -n netname -K
to generate key pairs, and tincd -n netname
to start the daemon. For debug usage, try tincd -n netname -d5 -D
for a foreground daemon with verbose output. In each tinc daemon debug window, quit the daemon by pressing CTRL-\
.
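A minimal sketch of the tinc side, with netname, node names, and addresses assumed for illustration:
# /etc/tinc/netname/tinc.conf on the proxy server
Name = proxy
Interface = tinc
# /etc/tinc/netname/hosts/proxy (tincd -n netname -K appends the public key here)
Address = <public ip of proxy>
Subnet = 10.26.11.2/32
# the interface IP itself is assigned in the tinc-up script, e.g.
# ip addr add 10.26.11.2/24 dev $INTERFACE && ip link set $INTERFACE up
The HTTP proxy daemon is then bound to the proxy's tinc address only, so it is unreachable from outside the VPN.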
sudo iptables -t nat -I POSTROUTING 1 -o tinc -s 192.168.48.0/24 ! -d 192.168.48.0/24 -j SNAT --to-source 10.26.11.1
on the master node makes the compute nodes work without any modification on them. (This new SNAT rule is hopefully also managed by the ansible playbooks.) Check the current rules with sudo iptables -t nat -nLv.
jumbo frame
merged into ansible
ip link set eth0 mtu 9000
Test: ping -M do -s 8972 master
; do means fragmentation is forbidden. Header bytes (28 for IP + ICMP) are added automatically, so -s 9000 will not pass; see this post.
MTU setting in netplan has issues on ubuntu 18.04, so it basically does not work through netplan apply. One possible solution is to add a MAC address match in netplan, sketched below.
The benchmarks shows little gain in enabling jumbo frames.
Using MTU 8500 instead of 9000 due to an issue with the Intel I219-LM.
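A hypothetical netplan fragment for that workaround (the MAC address is a placeholder):
network:
  version: 2
  ethernets:
    eno1:
      match:
        macaddress: "aa:bb:cc:dd:ee:ff"
      set-name: eno1
      mtu: 8500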
docker
Installation reference (somehow docker's apt source gets added to the main sources.list file instead of a separate file in the sources.list.d dir)
Add relevant user to docker group: ref
proxy settings on docker server: ref
change default docker image path to /DATA: ref
docker hub image speedup post
Warning: only trusted users can be added to the docker group, which talks directly to the docker daemon. It is not designed for normal users; it is reserved for the administrator to debug. See the security issues of docker and also this post. For normal use of containers, please try singularity instead.
tmpreaper
sudo apt install tmpreaper
, see usage. Configured automatically as a crontab task.
Ubuntu 18.04 seems to have a default cleaner: so; see man tmpfiles.d:
"Age: If omitted or set to "-", no automatic clean-up is done." So it seems there is no automatic deletion by default?
mail
mailutils seems to use the hostname as the From address no matter what myhostname is configured in postfix; use -aFrom on the command line instead. Working example: echo "hello"|mail --debug-level 3 -s "subject" -aFrom user@some.localdomain receiver@mails.tsinghua.edu.cn
. Or echo "hello"|mail -s "go" user@mails.tsinghua.edu.cn -r ubuntu@master.localdomain
.
All nodes now ship with smartmontools.
Use the customized smail script for slurm to work around the wrong sender address format.
Use sudo postsuper -d ALL
to clean the mail queue (inspect with postqueue -p); see here.
Create aliases.db by sudo newaliases
see here.
backup
legacy approach (deprecated)
new approach based on restic
merged into ansible
apt install restic
ignorefile
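A minimal sketch of a restic job in this spirit (the repo path, password file, and retention policy are assumptions):
export RESTIC_PASSWORD_FILE=/root/.restic-pass
restic -r /BACKUP/restic init                      # once, to create the repository
restic -r /BACKUP/restic backup /home/ubuntu /opt --exclude-file=/root/ignorefile
restic -r /BACKUP/restic forget --keep-daily 7 --keep-weekly 4 --prune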
Approach to recovering the whole OS at the hard-disk level: post
RAID1 on c8
sudo apt install smartmontools
on c8; it depends on postfix, which I have configured as local-only (not a big fan of postfix).
/dev/sdb is a hardware RAID 1 array; it can be checked with sudo smartctl --all -i /dev/sdb -d megaraid,1
. Check health status with sudo smartctl -H /dev/sdb -d megaraid,2
. Foreground short self test on disk: sudo smartctl -t short -C /dev/sdb -d megaraid,1
. Basic operations of smartctl: ref.
It seems smartd is also enabled as a service.
RAID5 on c9
(June 4, 2021) Six 8T HDDs in hardware RAID5 were added to c9, mounted at /DATA.c9 and shared with other nodes via NFS. Remember to check the disk health of this RAID5 regularly (maybe once a month) in case one disk fails.
some benchmarks
network
iperf: the master-to-computation-node bandwidth is around 940 Mbit/s, near the limit of the Gigabit NIC.
iperf for ipv6:
iperf -sV
,iperf -c <remote> -B <src> -V
iperf for udp:
iperf -su
,iperf -c <remote> -u -b 1000M
. You should specify the UDP bandwidth yourself; otherwise it reports around 1 Mbit/s. -r: first send, then receive; -d: both directions at the same time.
Frequent commands related to ethtool: post
cpu
Tried the linpack benchmark shipped with MKL; see results here.
memory
The memory runs at 2400 even though the DIMMs are rated 2666; the speed is limited by the Xeon 5120 CPU.
disk
by brute force
dd if=/dev/zero of=/tmp/output bs=8k count=50k
cn3 ssd: 1.4GB/s
cn3 writing to the home folder, which is shared via NFS and stored on the master's SSD: 89.1 MB/s
master ssd: 1.3 GB/s
master /DATA, sdb1: 1.3 GB/s (hard to believe for an HDD), with similar results for sdc on the master; this is weird (likely the page cache or the disk controller cache: dd without oflag=direct or conv=fdatasync mostly measures buffered writes).
dn1: sdb2: 557 MB/s (similar for the rest of sdb); sda (under LVM): 471 MB/s
electrical consumption
For c[1-8], the peak power is about 260W and 1.5A. For c9 with two 2080Ti, it is about 700W and 3A.