Helios Cluster: Overview

For large parallel computations, SNS has a 64-node Beowulf cluster. Each node has dual fourteen-core 64-bit Intel Xeon Broadwell processors, for a total of 1792 processor cores, and 128 GB of RAM (about 4.5 GB per core). For low-latency message passing, all nodes are interconnected with EDR InfiniBand.

Job queuing is provided by SLURM. OpenMPI is available for MPI applications. The operating system is PUIAS Linux 7.

All nodes mount the same /usr/local, /home, and /data filesystems as the other computers in SNS, providing the same computing environment that is available throughout the rest of SNS.

All nodes have access to our parallel filesystem through /scratch.

1TB of local scratch is available on each node in /scratch_local.

Access to the Helios cluster is restricted. If you would like to use the Helios cluster, please contact the computing staff.

The following hosts have been configured as SLURM submit hosts, meaning that you can submit jobs from them and execute most SLURM commands from them:

  • Selene
  • Eos

This arrangement makes it easier to take advantage of the cluster: you no longer need to maintain a separate user environment on the cluster or transfer your data to and from the cluster before and after runs, and you can develop, test, and debug your codes in an environment identical to the cluster nodes. In other words, if your program runs on any of the servers listed above, it should run on the cluster without modification.
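As a sketch of a typical submission from one of the hosts above, a minimal MPI batch script might look like the following. The job name, executable, and resource values here are illustrative placeholders, not site-specific requirements:

```shell
#!/bin/bash
#SBATCH --job-name=example-mpi     # illustrative job name
#SBATCH --nodes=2                  # request 2 of the 64 nodes
#SBATCH --ntasks-per-node=28      # one MPI rank per core (28 cores/node)
#SBATCH --time=12:00:00            # 12 hours of wall time
#SBATCH --output=job-%j.out        # %j expands to the job ID

# Launch the MPI program under SLURM's process manager.
# "my_mpi_program" is a placeholder for your own executable.
srun ./my_mpi_program
```

Save this as, say, `job.slurm` and submit it with `sbatch job.slurm`; `squeue -u $USER` then shows the job's status in the queue.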

The cluster determines job scheduling and priority using Fair Share, a per-user score based on past usage: the more jobs you run, the lower your score will temporarily be.
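If the cluster's SLURM accounting is configured in the usual way, you can inspect your current fair-share standing with the standard `sshare` command (a sketch; the exact columns shown depend on the site's configuration):

```shell
# Show usage and the resulting FairShare factor for your own
# associations; -U restricts output to the invoking user.
sshare -U
```

A lower FairShare value in this output corresponds to lower scheduling priority until your recent usage decays.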

Jobs will be assigned a quality of service (QOS) based on the length of time requested for the job.

  QOS      Time limit    Cores available
  ------   ----------    ---------------
  short    24 hours      1792
  medium   72 hours      896
  long     168 hours     448
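In other words, the QOS follows from the `--time` value you request at submission. As a sketch (the script name is a placeholder):

```shell
# A 20-hour request falls within the "short" QOS (24-hour limit):
sbatch --time=20:00:00 job.slurm

# A 100-hour request falls within the "long" QOS (168-hour limit):
sbatch --time=100:00:00 job.slurm
```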

The current maximum allowed run time is 168 hours (7 days). Users who need to run longer than this should build restart-file support into their jobs, so that a long computation can be split into a sequence of runs that each comply with these limits.
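One common pattern for working within the time limit is a job that asks SLURM for a warning signal near the end of its allocation and then resubmits itself, resuming from the application's own restart files. The sketch below assumes your code writes restart files periodically and can resume from the newest one; the executable, flags, and checkpoint filename are all placeholders:

```shell
#!/bin/bash
#SBATCH --time=168:00:00
#SBATCH --signal=B:USR1@300   # send SIGUSR1 to the batch shell 5 min before the limit

# On SIGUSR1, resubmit this same script; the next run will resume
# from the latest restart file the application has written.
trap 'sbatch "$0"' USR1

# "my_solver" and "latest.chk" are placeholders for your own program
# and its restart file. Run in the background so the trap can fire.
srun ./my_solver --restart-from latest.chk &
wait
```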