# SGI Altix (Outdated)

**Warning:** This page is deprecated! The SGI Altix is a former system that is no longer in operation.
## System
The SGI Altix 4700 is a shared memory system with dual core Intel Itanium 2 CPUs (Montecito) running the Linux operating system SUSE SLES 10 with a 2.6 kernel. Currently, the following Altix partitions are installed at ZIH:
Name | Total Cores | Compute Cores | Memory per Core |
---|---|---|---|
Mars | 384 | 348 | 1 GB |
Jupiter | 512 | 506 | 4 GB |
Saturn | 512 | 506 | 4 GB |
Uranus | 512 | 506 | 4 GB |
Neptun | 128 | 128 | 1 GB |
The jobs for these partitions (except Neptun) are scheduled by the Platform LSF batch system running on `mars.hrsk.tu-dresden.de`. The actual placement of a submitted job may depend on factors like memory size, number of processors, and time limit.
## Filesystems

All partitions share the same CXFS filesystems `/work` and `/fastfs`.
## ccNUMA Architecture

The SGI Altix has a ccNUMA architecture, which stands for Cache Coherent Non-Uniform Memory Access. It can be considered an SM-MIMD (shared memory - multiple instruction multiple data) machine. The SGI ccNUMA system has the following properties:
- Memory is physically distributed but logically shared.
- Memory is kept coherent automatically by hardware.
- Coherent memory: memory is always valid (caches hold copies).
- Granularity is the L3 cache line (128 B).
- The bandwidth of NUMAlink4 is 6.4 GB/s.
The ccNUMA is a compromise between a distributed memory system and a flat symmetric multiprocessing (SMP) machine. Although the memory is shared, the access properties are not the same.
## Compute Module
The basic compute module of an Altix system is shown below.
It consists of one dual core Intel Itanium 2 "Montecito" processor, the local memory of 4 GB (2 GB on Mars), and the communication component, the so-called SHUB. All resources are shared by both cores. They have a common front side bus, so that the accumulated memory bandwidth for both cores together is not higher than for just one core.
The SHUB connects local and remote resources. Via the SHUB and NUMAlink, all CPUs can access remote memory in the whole system. Naturally, local memory provides the fastest access. There are some hints and commands that may help you to get optimal memory allocation and process placement (an illustration follows below). Four of these blades are grouped together with a NUMA router in a compute brick. All bricks are connected with NUMAlink4 in a fat-tree topology.
*Figure: Remote memory access via SHUBs and NUMAlink*
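As an illustration only: on SGI Altix systems, process and memory placement is typically influenced with tools such as `dplace` from SGI ProPack, assuming they are installed here (CPU numbers and program names are placeholders):

```bash
# Pin a serial program to CPU 0 of the current cpuset
dplace -c0 ./a.out

# With the SGI MPI library, skip the MPI shepherd process (-s1) so that
# the MPI ranks end up on consecutive CPUs
mpirun -np 16 dplace -s1 ./a.out
```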
## CPU
The current SGI Altix is based on the dual core Intel Itanium 2 processor (code name "Montecito"). One core has the following basic properties:
Component | Value |
---|---|
Clock rate | 1.6 GHz |
Integer units | 6 |
Floating point units (multiply-add) | 2 |
Peak performance | 6.4 GFLOPS |
L1 cache | 2 x 16 kB, 1 clock latency |
L2 cache | 256 kB, 5 clock latency |
L3 cache | 9 MB, 12 clock latency |
Front side bus | 128 bit x 200 MHz |
The theoretical peak performance of all Altix partitions is hence about 13.1 TFLOPS.
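This figure follows directly from the numbers above: one core reaches 6.4 GFLOPS (two multiply-add units, i.e. four floating point operations per cycle, at 1.6 GHz), and the five partitions together provide 2048 cores:

$$
P_{\text{peak}} = (384 + 512 + 512 + 512 + 128)\ \text{cores} \times 6.4\ \tfrac{\text{GFLOPS}}{\text{core}} \approx 13.1\ \text{TFLOPS}
$$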
The processor has hardware support for efficient software pipelining. For many scientific applications it provides a high sustained performance, exceeding the performance of RISC CPUs with similar peak performance. The downside is that the compiler has to explicitly discover and exploit the parallelism in the application.
## Usage

### Compiling Parallel Applications
This installation of the Message Passing Interface supports the MPI 1.2 standard with a few MPI-2 features (see `man mpi`). There is no command like `mpicc`; instead, you just use the normal compiler (e.g. `icc`, `icpc`, or `ifort`) and append `-lmpi` to the linker command line. Since the include files as well as the library reside in standard directories, there is no need to add additional library or include paths. A short compile example is given after the notes below.
- Note for C++ programmers: You need to link with `-lmpi++abi1002 -lmpi` instead of `-lmpi`.
- Note for Fortran programmers: The MPI module is only provided for the Intel compiler and does not work with `gfortran`.
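The following is only a sketch: the source file names and the `-O2` optimization flag are placeholders, while the libraries are those mentioned above.

```bash
# C: compile and link an MPI program by appending -lmpi to the normal compiler call
icc -O2 -o hello_mpi hello_mpi.c -lmpi

# C++: link the additional C++ ABI library as noted above
icpc -O2 -o hello_mpi_cxx hello_mpi.cpp -lmpi++abi1002 -lmpi

# Fortran (the MPI module works with the Intel compiler only)
ifort -O2 -o hello_mpi_f hello_mpi.f90 -lmpi
```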
Please follow the guidelines below to run your parallel program using the batch system on Mars.
### Batch System
Applications on an HPC system cannot be run on the login node. They have to be submitted to compute nodes with dedicated resources for the user's job. Normally, a job is submitted with the following specifications:

- number of CPU cores,
- whether the requested CPU cores have to be located on one node (OpenMP programs) or can be distributed over several nodes (MPI),
- memory per process,
- maximum wall clock time (after reaching this limit the process is killed automatically),
- files for redirection of output and error messages,
- executable and command line parameters.

An example LSF submission covering these items is sketched in the next section.
#### LSF

The batch system on the Altix is LSF, see also the general information on LSF.
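A minimal sketch of a submission that specifies the resources listed above (all values, file names, and the program are placeholders; the exact meaning of each option, in particular the unit of the memory limit, depends on the local LSF configuration):

```bash
# Request 16 cores for 8 hours of wall clock time, set a per-process memory
# limit (the unit of -M depends on the LSF configuration), and redirect
# stdout/stderr to per-job files (%J is replaced by the job ID).
bsub -n 16 -W 08:00 -M 1900000 -o myjob.%J.out -e myjob.%J.err mpirun -np 16 ./a.out
```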
#### Submission of Parallel Jobs
The MPI library running on the Altix is provided by SGI and highly optimized for the ccNUMA architecture of this machine. However, communication within a partition is faster than across partitions. Take this into consideration when you submit your job.
Single-partition jobs can be started like this:

```bash
bsub -R "span[hosts=1]" -n 16 mpirun -np 16 a.out
```
Really large jobs with more than 256 CPUs might run over multiple partitions. Cross-partition jobs can be submitted via PAM like this:

```bash
bsub -n 1024 pamrun a.out
```
#### Batch Queues

Batch Queue | Admitted Users | Available CPUs | Default Runtime | Max. Runtime |
---|---|---|---|---|
`interactive` | all | min. 1, max. 32 | 12h | 12h |
`small` | all | min. 1, max. 63 | 12h | 120h |
`intermediate` | all | min. 64, max. 255 | 12h | 120h |
`large` | all | min. 256, max. 1866 | 12h | 24h |
`ilr` | selected users | min. 1, max. 768 | 12h | 24h |
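To target one of these queues explicitly, a job can be submitted with LSF's `-q` option; the following is just an example consistent with the limits in the table (program and values are placeholders):

```bash
# Submit a 32-core job with a 24-hour wall clock limit to the "small" queue
bsub -q small -n 32 -W 24:00 mpirun -np 32 ./a.out
```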