For large parallel computations, SNS has a 64-node Beowulf cluster. Each node has dual eight-core 64-bit Intel Xeon Sandy Bridge processors, providing a total of 1024 processor cores. Each node has 32 GB RAM (2 GB/core). For low-latency message passing, all the nodes are interconnected using 4x FDR InfiniBand.
Job queuing is provided by Open Grid Scheduler/Grid Engine (OGS) 2011.11. Open MPI is available for MPI applications. The operating system is PUIAS Linux 6, the same operating system used on the rest of the SNS computers.
All nodes mount the same /usr/local, /home, and /data filesystems as the other computers in SNS, providing the same computing environment that is available throughout the rest of SNS.
All nodes have access to our parallel filesystem through /scratch.
Access to the Hyperion cluster is restricted. If you would like to use the Hyperion cluster, please contact the computing staff.
Because the cluster uses the same /usr/local, /home, and /data filesystems, and uses Grid Engine, which separates job submission hosts from execution hosts, it's no longer necessary to log into the cluster. The following hosts have been configured as Grid Engine submit hosts, meaning that you can submit jobs from them and execute most SGE commands from them:
This arrangement makes it easier to take advantage of the cluster. You no longer need to maintain a separate user environment on the cluster or transfer your data to and from the cluster before and after cluster runs. It also allows you to develop, test, and debug your code in an environment that is identical to the cluster nodes. In other words, if you can run your program on any of the servers listed above, it should run on the cluster without any modification.
To allow many users to have access to the cluster at once and to prevent one (or a few) users from monopolizing its computing power, we ask that you follow these job size and run time limits:
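As an illustration of submitting from a submit host, the following Python sketch assembles a qsub command line for a parallel job. The flags -N, -cwd, -pe, and -l h_rt= are standard Grid Engine options, but the parallel environment name "orte", the script name "run_sim.sh", and the helper build_qsub_command are assumptions made for this example. Whether the cluster enforces h_rt is also not stated here. Running "qconf -spl" on a submit host lists the parallel environments that are actually configured.

```python
import shlex


def build_qsub_command(script, cores, hours, pe_name="orte", job_name="myjob"):
    """Return a qsub argument list for a parallel Grid Engine job.

    pe_name is a hypothetical parallel environment name; replace it with
    one that exists on the cluster.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return [
        "qsub",
        "-N", job_name,              # job name shown by qstat
        "-cwd",                      # run the job in the current directory
        "-pe", pe_name, str(cores),  # parallel environment and slot count
        "-l", f"h_rt={hours}:00:00", # requested wall-clock run time
        script,
    ]


cmd = build_qsub_command("run_sim.sh", cores=128, hours=72)
print(" ".join(shlex.quote(arg) for arg in cmd))
# prints: qsub -N myjob -cwd -pe orte 128 -l h_rt=72:00:00 run_sim.sh
```

The list form can be passed directly to subprocess.run on a submit host, which avoids shell quoting problems in script names.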
Job size          Maximum run time
513-1024 cores    12 hours
257-512 cores     24 hours
129-256 cores     48 hours
65-128 cores      72 hours
1-64 cores        no limit
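The limits above can be expressed as a small lookup, which may be handy in a job-submission wrapper. This is only a sketch: the function name max_runtime_hours and the LIMITS table layout are made up for this example, and None stands for "no limit".

```python
# (largest core count in the band, maximum run time in hours)
# None means the band has no run time limit.
LIMITS = [
    (64, None),
    (128, 72),
    (256, 48),
    (512, 24),
    (1024, 12),
]


def max_runtime_hours(cores):
    """Return the requested maximum run time in hours for a job size."""
    if cores < 1 or cores > 1024:
        raise ValueError("the cluster has between 1 and 1024 cores")
    for upper_bound, hours in LIMITS:
        if cores <= upper_bound:
            return hours


print(max_runtime_hours(200))  # prints: 48
```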
Users who need to run for longer than these time windows should add restart-file support to their jobs, so that each individual run saves its state, stays within these limits, and can be resumed by a follow-up job.
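A minimal sketch of the restart-file pattern in Python follows. It assumes the work can be divided into discrete steps and that the state fits in a small JSON file. The function name run_with_restart, the state layout, and the stand-in "work" (summing step numbers) are all hypothetical; a real code would save its own arrays and parameters instead.

```python
import json
import os
import time


def run_with_restart(total_steps, checkpoint_path, budget_seconds,
                     clock=time.monotonic):
    """Run until done or until the time budget is used up, then save state.

    If checkpoint_path exists, the run resumes from the saved step.
    Returns the state dictionary that was written to the restart file.
    """
    state = {"step": 0, "value": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    start = clock()
    while state["step"] < total_steps:
        if clock() - start >= budget_seconds:
            break  # out of time: stop cleanly and leave the rest for later
        state["value"] += state["step"]  # stand-in for one step of real work
        state["step"] += 1

    # Write to a temporary file first so that an interrupted write
    # never leaves a truncated restart file behind.
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_path)
    return state
```

Each job would call this with a budget somewhat below the applicable limit (for example, 11.5 hours for a 12 hour window) and resubmit itself until the saved step count reaches total_steps. Writing the restart file on /scratch or /home keeps it visible to whichever node runs the next job.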
You can view the current utilization of the cluster here.