ToDo
A list of future tasks to be implemented on HPC2; entries may be brief or somewhat detailed. Less urgent tasks are shown in italics.
Hardware Level
Network related
Divide the S1720 switch into two VLANs
DHCP on the master node with very long lease time
Server related
iDRAC configuration
OpenManage configuration
include more old machines in the cluster
Software Level
OS related
set locale to en_US and timezone on all systems
ntp server for master node
nfs settings
dhcp server for master node
apt and pip source change
? disable password login within the LAN
user resource limit
disk limit in home
resource limit in master
qos in slurm
pam in slurm
sshfs
rclone
rsyslog forward (by ELK stack)
gitlab
syncthing
DevOps software
cobbler
ansible
create playbooks for:
user management and their ssh key
apt tools install
service state management
network configuration (DHCP server with fixed IP on master, automatic DHCP on nodes, and NAT)
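A minimal sketch of what the network playbook might template for the master's DHCP server, assuming ISC dhcpd is the server in use (the list does not say which one). The subnet, address range, and router address below are hypothetical placeholders, and render_dhcpd_subnet is an illustrative helper, not part of any existing playbook. The lease time is set to one year to reflect the "very long lease time" item; the router option is assumed to point at the master because it does the NAT.

```python
import ipaddress

ONE_YEAR = 365 * 24 * 3600  # lease length in seconds (assumed "very long")


def render_dhcpd_subnet(cidr, range_start, range_end, router, lease_seconds):
    """Render an ISC dhcpd.conf subnet declaration as text."""
    net = ipaddress.ip_network(cidr)
    # Reject addresses outside the subnet so a typo fails early.
    for addr in (range_start, range_end, router):
        if ipaddress.ip_address(addr) not in net:
            raise ValueError(f"{addr} is not inside {cidr}")
    lines = [
        f"subnet {net.network_address} netmask {net.netmask} {{",
        f"  range {range_start} {range_end};",
        f"  option routers {router};",  # assumed: master is the NAT gateway
        f"  default-lease-time {lease_seconds};",
        f"  max-lease-time {lease_seconds};",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical addressing; replace with the real cluster subnet.
    print(render_dhcpd_subnet("10.0.0.0/24", "10.0.0.100",
                              "10.0.0.200", "10.0.0.1", ONE_YEAR))
```

In an Ansible workflow the same text would more naturally come from a Jinja2 template; the function only makes the intended contents explicit.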
?nagios
ganglia
modules
backup tools
HPC software
spack
slurm
Intel Parallel Studio (ifort, icc, Intel MPI, Intel Python and MKL included)
Eigen and Armadillo
boost
GSL
distributed Mathematica
automatic activation
launch remote kernel via cli
SLEPc, PETSc and external linear and eigenvalue solver packages
Julia
ipyparallel
Matlab
Modern cloud and distributed system tools
OpenStack
Spark
Kubernetes
Elasticsearch
?ceph
Misc
open-source the Ansible playbooks of this HPC
ansible authorized key double check
possibly set up one more v2ray inbound with outbounds only to the 176 network, for other users to access Jupyter notebooks
spack install hpl for the LINPACK benchmark (just use the MKL one)
backup mysql database
? backup elastic database
bootstrap setup for new machines by curl scripts
more careful division of playbooks as new roles come in: GPU partition, shared storage on compute nodes, backup management node (backup of slurmctld, slurmdbd and possibly the Elastic node)
make the mount logic more robust and add support on [sn]
add a debug partition queue and a gpu queue for slurm
replication of ES
authentication of ES
make hostnames consistent via Ansible on Ubuntu 18.04 (a detailed study of the cloud-init subsystem)
ganglia incomplete metric collection
ganglia gpu plugin
elastalert for mail alerting
MPI benchmark
GlusterFS
tasks to explore
uniformity of users
uniformity of environment variables (propagate by sbatch)
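One way to explore the environment-variable item is to make the export policy explicit on every submission through sbatch's --export option, rather than relying on whatever the submitting shell happens to contain. The sketch below only builds the command line and does not submit anything; sbatch_command and the variable names in the example are hypothetical. It assumes values contain no commas, since --export uses commas as separators.

```python
import shlex


def sbatch_command(script, export_vars=None, extra_env=None):
    """Build an sbatch argv list with an explicit --export policy.

    export_vars: names whose current values should be propagated.
    extra_env:   NAME -> value pairs to set for the job.
    With neither given, the whole submitting environment is exported.
    """
    parts = list(export_vars or [])
    parts += [f"{name}={value}" for name, value in (extra_env or {}).items()]
    export = ",".join(parts) if parts else "ALL"
    return ["sbatch", f"--export={export}", script]


if __name__ == "__main__":
    # Hypothetical job script and variables, for illustration only.
    cmd = sbatch_command("job.sh", ["PATH", "LD_LIBRARY_PATH"],
                         {"OMP_NUM_THREADS": "4"})
    print(shlex.join(cmd))
```

Listing variables explicitly keeps jobs reproducible across users, at the cost of having to maintain the list; exporting ALL is simpler but inherits every difference between login environments.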
warning
name system with version number
? try installing all useful things in opt or home, which will be exported
Future directions
combination of k8s and slurm
hybrid HPC/cloud setup and elastic scaling
design principle
more on the master, less on the slaves; everything on a slave should be included in Ansible workflows
more on ansible playbooks, less by hand
Last updated