Admin Workflow

This section reviews the general workflows for day-to-day administration of the cluster.

some scenarios

add a new user

  • Add the user account and password info to the ansible user playbook and run it.

  • The password can be generated by openssl rand -base64 32; see here for more approaches to generating random passwords.

  • Run sacctmgr add user <name> account=n3 to make the user available to slurm. (already merged into ansible workflow)

  • sudo setquota -u <name> 60G 80G 0 0 / (already merged into ansible workflow)

  • Create the user's directory under the /DATA path. (already merged into ansible workflow)

  • Optionally add a "<user> cpu usersoft/" line to /etc/cgrules.conf and run sudo cgrulesengd. (Edit the file in roles/cgroup/files and run the ansible cgroup role instead.) Not included in the ansible workflow.

  • Activate mathematica with an ansible one-liner. Not included in the ansible workflow.

  • Add a backup crontab for the user's home directory. Not included; backups are now handled by restic as a whole.
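
The manual steps above can be sketched in Python. This is only an illustration under assumptions, not part of the ansible playbooks: the function names and the example user name alice are hypothetical, while the account n3 and the quota values are taken from the list above. The commands are only built and printed, never executed.

```python
import base64
import secrets


def generate_password(num_bytes=32):
    """Return base64 text of num_bytes random bytes, like `openssl rand -base64 32`."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def new_user_commands(name, account="n3", soft="60G", hard="80G"):
    """Build (but do not run) the slurm and quota commands for a new user."""
    return [
        ["sacctmgr", "add", "user", name, f"account={account}"],
        ["sudo", "setquota", "-u", name, soft, hard, "0", "0", "/"],
    ]


if __name__ == "__main__":
    # "alice" is a hypothetical user name used for illustration.
    print(generate_password())
    for cmd in new_user_commands("alice"):
        print(" ".join(cmd))
```

Keeping the commands as argument lists makes it easy to hand them to subprocess.run later without shell quoting problems.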

add new compute nodes

There is a bootstrap script hosted on master in /home/ubuntu/bootstrap. You can start a simple HTTP server in this directory on master. On each newly introduced node, create .ssh (mkdir .ssh) as the ubuntu user, then run bash <(curl -s http://192.168.48.10:8000/bt.sh) as root. After that, the node is ready for the ansible workflows run from master.
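
One possible way to start the simple HTTP server mentioned above is Python's standard http.server module. The sketch below is an assumption about how it could be done, not the method actually used on the cluster; the function name is hypothetical, and port 8000 is taken from the URL above.

```python
import functools
import http.server


def make_bootstrap_server(directory, port=8000, bind="0.0.0.0"):
    """Create an HTTP server that exposes the files in `directory`.

    The caller starts it with serve_forever() and stops it with ctrl-c.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((bind, port), handler)
```

On master, make_bootstrap_server("/home/ubuntu/bootstrap").serve_forever() would then serve bt.sh on port 8000. Running python3 -m http.server 8000 inside that directory has the same effect.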

after reboot

  • It is highly recommended to run all ansible playbooks once after a reboot (at least network, basic and elk for a compute node reboot).

  • Configure cgroups with sudo cgconfigparser -l /etc/cgconfig.conf && sudo cgrulesengd.

  • Start the tinc vpn with sudo tincd -n debug.

  • Optionally run sudo ethtool -K enp0s31f6 tso off gso off on master.

  • iptables (nat rules) on master are not persistent; see this issue for further development of the ansible roles to make iptables persistent.

  • The hostname is not made persistent by the hostname module of ansible; see issue. Solved by a switch option in cloud.cfg.

  • The MTU is not made persistent by netplan, due to the default cloud-init in ubuntu (this holds even after adding a mac match to netplan).

  • You may need to run scontrol update nodename=cx state=IDLE by hand to bring the nodes online again in slurm.

  • For cn nodes, run ansible-playbook -l cn[x-1] -Kvv site.yml (non-persistent items: start ntp, start filebeat[?], stop snap, enable jumbo frames).
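
The non-persistent steps above could be collected in a small checklist runner. This is a hedged sketch: the names POST_REBOOT_STEPS, resume_command and run_steps are hypothetical, the node name c1 is only an example, and the commands are copied from the list above. By default it only prints the steps.

```python
import subprocess

# Commands copied from the checklist above. The ethtool line applies to master only.
POST_REBOOT_STEPS = [
    "sudo cgconfigparser -l /etc/cgconfig.conf",
    "sudo cgrulesengd",
    "sudo tincd -n debug",
    "sudo ethtool -K enp0s31f6 tso off gso off",
]


def resume_command(nodename):
    """Return the scontrol command that sets a node back to IDLE in slurm."""
    return f"scontrol update nodename={nodename} state=IDLE"


def run_steps(steps, dry_run=True):
    """Print each step and, unless dry_run is set, execute it.

    Execution stops at the first failing command. Returns the number of steps.
    """
    for step in steps:
        print(step)
        if not dry_run:
            subprocess.run(step.split(), check=True)
    return len(steps)


if __name__ == "__main__":
    # Dry run: prints the steps without executing anything.
    run_steps(POST_REBOOT_STEPS + [resume_command("c1")])
```

The dry-run default is deliberate, so that the list can be reviewed before anything is executed with sudo.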

renew intel licence

There is no actual renewal; obtaining a new serial number is enough: https://registrationcenter.intel.com/en/products

After applying for a new serial number, obtain the licence file by following the instructions at https://software.intel.com/en-us/articles/resend-license-file, then put the licence file in /opt/intel/licenses/. Relevant link: https://registrationcenter.intel.com/en/products/license.

summary on works beyond ansible workflow

All the extras below are on the master node. The bottom line is that every task on compute nodes should be merged into the ansible workflow.

  • Partition the local non-system disks and make filesystems.

  • Hard disk mounts and fstab configuration (one time forever, required before basic roles; could easily be merged into the basic role). Already merged into ansible workflow.

  • Possible nvidia drivers install and reboot if GPU is available.(already merged into ansible workflow) Cuda and cudnn can be managed by spack.

  • quota initial configure (one time forever, required before user roles)

  • intel parallel studio install (one time forever) (no need to install before any roles, possible issue for python path maybe in python roles)

  • mathematica install and add virtual mathematica packages in spack (one time forever) (no need to install before any roles). Similar for Matlab (but it has a predefined recipe).

  • backup crontabs (one time forever? maybe find some more advanced tools) (no need to configure before any roles) changed to use restic for backup

  • python packages install and jupyter configure (continuing work) (no need to install before any roles)

  • spack packages install by specs and spack env maintenance (continuing work) (no need to install before any roles)

  • sacctmgr cluster, qos, priority and account add (continuing work for advanced scenario, minimum setup required before user roles after slurm roles)

  • two line of commands to final set up ELK stack on master (should find some more elegant way in the future) merged into ansible workflow

  • tinc vpn set up on master node

  • docker set up on master node

  • restic backup setup on master node, merged into ansible workflow

some checks by hand in a lower frequency

Some checks should be run manually on a roughly weekly basis.

  • smartctl-based hard disk health checks

  • restic backup integrity and snapshot checks

  • Check postqueue -p on each node to see whether any mail is stalled (sudo postsuper -d ALL clears all stalled mail in the queue).

  • check the usage of localdisk: ansible all -m shell -a "df -HT|grep /dev/sda"
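
The output of the df check above can be filtered automatically. The following is a minimal sketch, assuming the usual seven-column layout of df -HT output; the function names, the 90 percent threshold and the sample text are all made up for illustration.

```python
def parse_df_usage(df_output, device_prefix="/dev/sda"):
    """Return {device: percent_used} for `df -HT` lines matching device_prefix."""
    usage = {}
    for line in df_output.splitlines():
        fields = line.split()
        # df -HT columns: Filesystem Type Size Used Avail Use% Mounted-on
        if len(fields) >= 7 and fields[0].startswith(device_prefix):
            usage[fields[0]] = int(fields[5].rstrip("%"))
    return usage


def over_threshold(usage, threshold=90):
    """Return the devices whose usage is at or above threshold percent, sorted."""
    return sorted(dev for dev, pct in usage.items() if pct >= threshold)


if __name__ == "__main__":
    # Made-up sample in the shape of `df -HT` output, for illustration only.
    sample = (
        "Filesystem Type Size Used Avail Use% Mounted on\n"
        "/dev/sda1 ext4 472G 440G 8.0G 99% /\n"
        "/dev/sda2 ext4 1.0T 200G 800G 20% /scratch\n"
    )
    print(over_threshold(parse_df_usage(sample)))
```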

install softwares or libraries beyond spack

Installation paths:

  • Large commercial software: /opt/softwarename/version/.

  • Open source libraries built from source: /home/ubuntu/softwares/softwarename. This directory may contain name+ver directories for the installed versions and directories whose names end with src for the source files.

Ideally, each software directory should contain a self-contained information file named softwarename.info. It should describe what each subdirectory is for, any notes and warnings about configuring and installing the software and, most importantly, the install process for each installed version, such as the options passed to ./configure. You may even want to capture all stdout of configure and make in each installed version directory, in files named configure.out, make.out and so on. To record this output more easily, run script cmake.out and then run cmake; remember to press ctrl-d or type exit to stop the recording. See the documentation of the linux script command for more.
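
As an alternative to the script command, the build output can also be captured from Python. This is a sketch under assumptions: run_and_record is a hypothetical helper, not an existing tool on the cluster, and it records stderr together with stdout.

```python
import subprocess
import sys


def run_and_record(cmd, log_path):
    """Run cmd, echo its combined stdout/stderr, and save a copy to log_path.

    Returns the exit code of the command.
    """
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # still visible on the terminal
            log.write(line)         # and kept for the .info notes
        return proc.wait()
```

For example, run_and_record(["make"], "make.out") inside a source directory would leave the full build log next to the sources.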

Finally, you may want to bring such libraries under the control of spack. This usually involves two steps. If the library already has a package in the spack repo, you only need to add its external path in the spack config packages.yaml. If loading it requires further fine-tuning of the module files, you also need to adjust the spack config modules.yaml. (All these config changes should go through the ansible workflow.) If the software is not registered in spack, you first need to add a package.py to the repo as a placeholder. Something like the following is enough.

# placeholder for external mma
from spack import *

class Mathematica(Package):
    homepage = "https://www.wolfram.com/mathematica/"

    version('11.0')
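
Placeholders of this shape could be generated for other external software with a small helper. This is a rough sketch: the helper names are hypothetical, and the class-name rule is a simplification assumed here (capitalize each hyphen-separated part), so spack's own naming rule may differ for unusual package names.

```python
PLACEHOLDER_TEMPLATE = '''# placeholder for external {name}
from spack import *


class {class_name}(Package):
    homepage = "{homepage}"

    version('{version}')
'''


def class_name_for(package_name):
    """CamelCase a simple hyphenated package name (assumed simplification)."""
    return "".join(part.capitalize() for part in package_name.split("-"))


def render_placeholder(package_name, homepage, version):
    """Return the text of a minimal package.py placeholder."""
    return PLACEHOLDER_TEMPLATE.format(
        name=package_name,
        class_name=class_name_for(package_name),
        homepage=homepage,
        version=version,
    )


if __name__ == "__main__":
    print(render_placeholder(
        "mathematica", "https://www.wolfram.com/mathematica/", "11.0"
    ))
```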

The inspiration for this standard software installation workflow comes from [this post](http://www.admin-magazine.com/HPC/Articles/Warewulf-Cluster-Manager-Development-and-Run-Time/(language)/eng-US).

Known Issues

  • MTU cannot be set through the netplan yml file, even after including the mac:address line in the yml.

    Current workaround: set the MTU directly with the ip command in ansible playbooks.

  • Bursts of NFS error logs. On master we have:

    nfsd4_validate_stateid: 270 callbacks suppressed
    NFSD: client 192.168.*.* testing state ID with incorrect client ID

    On the client we have: NFS: nfs4_reclaim_open_state: Lock reclaim failed, but at a lower frequency. Meanwhile, the NFS IO speed drops to 1/10 of the normal speed.

    Current workaround: no better solution has been found so far, but rebooting the clients seems to work and makes all these errors vanish. Possibly related to some Linux kernel NFS bugs.

  • Bursts of errors: watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [192.168.*.10-m:27452] on one of the compute nodes, ending in a crash of the machine, although it still answers ping at that point.

    Current workaround: maybe related to the bug above, but not sure. A hard reboot solves everything.

  • The ganglia client logs an error claiming that some python module won't work: /usr/sbin/gmond[2358]: [PYTHON] Can't call the metric handler function for [tcpext_tcploss_percentage] in the python module [netstats]. This is not actually the case: after a reboot the errors vanish and every loaded module works for gmond.

    Current workaround: restarting the ganglia-monitor service should be enough, but the error may still recur regularly.

  • Sometimes, after a restart, gmond cannot collect all metrics and some are missing.

    Current workaround: the cause is unknown. Try restarting the gmond service, although this may still not work. In sum, gmond is somewhat fragile and tends to miss some metrics. Maybe related to this; try setting send_metadata_interval to a nonzero value if gmond uses unicast. Also be patient after a restart: gmond may start collecting the missing metrics several minutes or hours later.

  • On some new machines c[4-8], logrotate doesn't work as expected although the conf is the same as on the earlier machines. /var/lib/logrotate/status shows a new rotate time while the log isn't rotated at all, which should be the cause; it is still unclear why the status file records a wrong rotate time.

    Current workaround: sudo logrotate -v -f /etc/logrotate.conf (verbose and force rotate).

    Possibly related to the cron service, which needs to be restarted to pick up the new timezone set on the machine.

  • (Fully solved) Asymmetric network performance in the LAN: the master to cn direction only reaches 666Mb (2/3 Gbit).

    The problem is now narrowed down to master only. (Specifically this nic: I219)

    Possible issue: post

    Solution: sudo ethtool -K enp0s31f6 tso off gso off from here; the side effects are not fully known. (Note that this command does not persist across reboots.)

    Related kernel commit

  • (Fully solved) The Apache2 module in filebeat doesn't support converting the time variable for pipelines, even when called explicitly, which leaves a time mismatch for apache2/error.log.

    Current workaround: the support was merged into filebeat only recently, after the 6.8.0 release. You can also hack it on your own, see this issue.

  • docker pull may hit a permission issue in a tmux shell.

  • (Fully solved) c9 automatically enters the drain state with the reason "batch job complete failure". In syslog: slurmstepd: error: Domain socket directory /tmp/slurmd: No such file or directory. Running mkdir /tmp/slurmd on c9 seems to mitigate the problem. There will be a new warning, slurmstepd error: Unable to get current working directory: No such file or directory, but slurm works ok. (Update: c6 also seems to go through this issue.)

    Workaround: the cause is the missing /tmp/slurmd, which tmpreaper deletes if it is not used for a long time, leading to the failure of slurmd on these nodes. So just add TMPREAPER_PROTECT_EXTRA='/tmp/slurm*' to /etc/tmpreaper.conf.

  • Kibana and possibly its backend ES server have become extremely slow to respond in recent months. The ES node may fail for no obvious reason. (Maybe due to the small heap size memory limit; validated.)

  • Sometimes ansible renders the template host[0] as m instead of master. This is not an error in the template, since it happens nondeterministically. (Pay attention to this: a wrong rendering may crash the ES cluster.)

  • mail fails to send mail to tsinghua email addresses from the cluster. It is highly likely that the tsinghua mta has started restricting this ip, since other ips can send mail successfully with the same command. Alerts sent by elastalert through tsinghua mail also fail with the error ERROR:root:Error while running alert email: Error connecting to SMTP host: SMTP AUTH extension not supported by server. It seems that the tsinghua mail service blocks the cluster ip and denies service.

    Current workaround: use a relay machine for port forwarding, linking tsinghua mail port 25 to relay port 26 (port 25 is used by the relay's local mail service), and set the smtp host on the cluster master to the relay machine with port 26. This works and indeed shows that the master ip is on the blacklist of the tsinghua mail service. (And the relay seems to be blocked now...)

  • ulimit -u is still 2048 in tmux sessions. A restart may fix this. (Validated)

  • c3 goes offline in slurm for no explicit reason, and slurmd cannot start, with the following log:

    slurmd-c3: error: Unable to register: Zero Bytes were transmitted or received
    slurmd-c3: debug: Unable to register with slurm controller, retrying
    slurmd-c3: debug: slurm_recv_timeout at 0 of 4, recv zero bytes

    Currently no clues.

    Current workaround: rebooting c3 solves this.

  • ureadahead[948]: ureadahead:..: Ignored relative path floods syslog on restart. According to stackoverflow, this is normal for a machine rebooted after more than 1 year of uptime, which is common in HPC. The module serves no purpose here and can be disabled as a system service, but this has not been tried since reboots are not that common.

  • Possible stall of the nfs client on compute nodes (ssh fails since /home is inaccessible, and the log is flooded with RPC request reserved 108 but used 356). Such logs are often reported in the nfs community; the bug appears without any explicit trigger, and no solution has been found.

    Current workaround: hardware reboot of the affected node. This issue is extremely dangerous because it floods the log on master instead of on the compute node, and may thus eat up all disk space on master. To guard against that, sudo setquota -u syslog 30G 50G 0 0 / has been set explicitly on master to limit the maximum size of the log files.
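
Several of the issues above show up as recognizable log lines, so a periodic scan can give early warning. The following is a sketch only: the dictionary keys and the function name are hypothetical, and the patterns are taken from the log excerpts quoted in this section.

```python
import re
from collections import Counter

# Signatures quoted from the issues above; extend as new ones appear.
SIGNATURES = {
    "nfs_stateid": re.compile(r"testing state ID with incorrect client ID"),
    "nfs_reclaim": re.compile(r"nfs4_reclaim_open_state: Lock reclaim failed"),
    "soft_lockup": re.compile(r"BUG: soft lockup"),
    "rpc_reserved": re.compile(r"RPC request reserved \d+ but used \d+"),
    "slurmd_socket": re.compile(r"Domain socket directory /tmp/slurmd"),
}


def count_signatures(lines):
    """Count how many lines match each known issue signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Since the function accepts any iterable of lines, an open file such as /var/log/syslog can be passed directly without loading it into memory.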
