ToDo
A list of future tasks to be implemented on HPC2; entries may be brief or somewhat detailed. Less urgent tasks are set in italics.
Hardware Level
Network related
- Divide the S1720 switch into two VLANs
- DHCP on the master node with very long lease time 
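As a rough illustration of the long-lease DHCP item, the sketch below renders a subnet block with a long lease time. It assumes ISC dhcpd is the DHCP server (this page does not say which one is used); the function name `dhcpd_subnet_block`, the 10.0.0.0/24 network, the address range, and the one-year lease are all hypothetical placeholders.

```python
import ipaddress


def dhcpd_subnet_block(cidr, first_host, last_host, lease_days):
    """Render an ISC dhcpd.conf subnet block with a long lease time.

    Assumption: ISC dhcpd syntax. All addresses are placeholders.
    """
    net = ipaddress.ip_network(cidr)
    # dhcpd expects lease times in seconds.
    lease_seconds = lease_days * 24 * 60 * 60
    # Refuse ranges that fall outside the declared subnet.
    for addr in (first_host, last_host):
        if ipaddress.ip_address(addr) not in net:
            raise ValueError(f"{addr} is not inside {cidr}")
    lines = [
        f"subnet {net.network_address} netmask {net.netmask} {{",
        f"  range {first_host} {last_host};",
        f"  default-lease-time {lease_seconds};",
        f"  max-lease-time {lease_seconds};",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical cluster LAN with a one-year lease.
    print(dhcpd_subnet_block("10.0.0.0/24", "10.0.0.10", "10.0.0.200", 365))
```

Generating the block from a script, rather than editing it by hand, would fit the later goal of keeping node configuration inside ansible workflows.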
Server related
- iDRAC configuration 
- OpenManage configuration
- include more old machines in the cluster
Software Level
OS related
- set locale to en_US and the timezone on all systems
- ntp server for master node 
- nfs settings 
- dhcp server for master node 
- change apt and pip sources
- ? disable password login in LAN
- user resource limit
  - disk limit in home
  - resource limit in master
  - qos in slurm
  - pam in slurm
- sshfs 
- rclone 
- rsyslog forward (by ELK stack) 
- gitlab 
- syncthing 
DevOps software
- cobbler 
- ansible - create playbooks for:
  - user management and their ssh keys
  - apt tools install
  - service state management
  - network configuration (dhcp server with fixed ip on master, dhcp auto on nodes, and nat)
- ? nagios
- ganglia 
- modules 
- backup tools 
HPC software
- spack 
- slurm 
- intel parallel studio (ifort, icc, intelmpi, intelpython, and mkl included)
- Eigen and armadillo 
- boost 
- GSL 
- distributed mathematica
  - automatic activation
  - launch remote kernel via cli
- SLEPc, PETSc, and external linear and eigensolver packages
- Julia 
- ipyparallel 
- Matlab 
Modern cloud and distributed system tools
- OpenStack 
- Spark 
- Kubernetes 
- Elasticsearch 
- ? ceph
Misc
- open-source the ansible playbooks of this HPC
- double-check ansible authorized keys
- possibly set up one more v2ray inbound whose only outbound goes to the 176 network, so that other users can access jupyter notebook
- spack install hpl for the linpack benchmark (just use the mkl one)
- backup mysql database 
- ? backup elastic database 
- bootstrap setup for new machines by curl scripts 
- more careful division of playbooks as new roles come in: gpu partition, shared storage on compute nodes, backup management node (backup of slurmctld, slurmdbd, and possibly the elastic node)
- make the mount logic more robust and add support on [sn]
- add a debug partition queue and a gpu queue for slurm 
- replication of ES 
- authentication of ES
- make hostnames consistent via ansible on ubuntu 18.04 (a detailed study of the cloud-init subsystem)
- ganglia incomplete metric collection 
- ganglia gpu plugin 
- elastalert for mail alerting 
- MPI benchmark 
- GlusterFS 
Tasks to explore
- uniformity of users
- uniformity of environment variables (propagate by sbatch) 
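One way to explore environment propagation is to stop relying on sbatch's default, which passes the whole submit-time environment to the job, and to name the variables explicitly with `--export`. The sketch below only builds the command line; the helper name `sbatch_command`, the script name, and the variable names are hypothetical.

```python
import os


def sbatch_command(script, var_names, environ=None):
    """Build an sbatch command that propagates only the named variables.

    Hypothetical helper: it relies on sbatch's --export option, which
    accepts a comma-separated list of variable names (or NONE).
    """
    environ = os.environ if environ is None else environ
    # Fail early if a variable we want to propagate is not set at submit time.
    missing = [name for name in var_names if name not in environ]
    if missing:
        raise KeyError(f"not set in submit environment: {missing}")
    export = ",".join(var_names) if var_names else "NONE"
    return ["sbatch", f"--export={export}", script]


if __name__ == "__main__":
    # Example with an explicit, fake environment.
    fake_env = {"PATH": "/usr/bin", "OMP_NUM_THREADS": "4"}
    print(sbatch_command("job.sh", ["PATH", "OMP_NUM_THREADS"], fake_env))
```

The returned list can be handed to `subprocess.run` on a machine where sbatch is installed. Listing the variables explicitly makes it easier to spot differences between users' login environments.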
Warning
- name system with version number 
- ? try to install all useful things in opt or home, which will be exported
Future directions
- combination of k8s and slurm 
- hybrid HPC/cloud setup and elastic scaling 
Design principles
- more on master, less on slaves; everything on slaves should be covered by ansible workflows
- more on ansible playbooks, less by hand 