
User guide

System architecture

The architecture of the hybrid cluster of the IBiSCo Data Center (Infrastructure for BIg data and Scientific Computing) can be represented as a set of multiple layers (Figure 1).
The lowest layer of the architecture consists of the hardware, comprising calculation and storage nodes; the highest layer is the application level, which allows users to submit their tasks.
The intermediate level of the architecture consists of the CUDA and MPI libraries, which make the two other levels communicate with each other. The Lustre distributed file system, which offers a high-performance paradigm of access to resources, also belongs to this level.
To guarantee the correct working conditions of the cluster, a high-performance network was necessary; InfiniBand technology was therefore chosen.

Figure 1: Diagram of levels of the cluster

The hardware level

The cluster comprises 36 nodes and 2 switches, placed in 4 racks of the Data Center. They perform two functions: calculation and storage. To support the calculation there are 128 GPUs, distributed among 32 nodes (4 GPUs per node). To support storage 320 Terabytes are available distributed among 4 nodes (80 Terabytes per node).
As previously mentioned, to ensure access to resources and low-latency broadband communication between nodes, InfiniBand has been used. This technology provides Remote Direct Memory Access (RDMA) which allows direct access from a local system to a remote memory location (main memory or GPU memory of a remote node) without involving the CPUs in operations, thus reducing delay in the message transfer.
The peculiarity of the cluster lies in its heterogeneity, given by the coexistence of two technologies: InfiniBand and Ethernet. Despite the advantages of InfiniBand, Ethernet cannot be eliminated since it is essential for maintenance operations, monitoring the infrastructure, and ensuring access to resources outside the cluster.

The Storage Node Architecture

The cluster has 4 Dell R740 storage nodes, each offering 16 16 TB SAS HDDs and 8 1.9 TB SATA SSDs. Each node is equipped with two InfiniBand EDR ports, one of which is connected to the Mellanox InfiniBand switch dedicated to storage, which guarantees a 100 Gb/s connection to all the compute nodes.
While the aforementioned nodes are dedicated to the home and scratch areas of the users, a separate 3 PB storage system will also be available. It will be accessible via Ethernet and can be used as a repository where users can move data to and from the InfiniBand-connected storage systems when they need large amounts of data over the time span of their job. As for the Ethernet network, each node is equipped with 2 ports at 25 Gb/s for connection to the Data Center core switch.

The Compute Node Architecture

The cluster compute nodes are 32 Dell C4140s, each equipped with 4 NVIDIA V100 GPUs, 2 Ethernet ports at 10 Gb/s each, 2 InfiniBand ports at 100 Gb/s each, 2 Intel Gen 2 Xeon Gold CPUs, and 2 480 GB SATA solid-state disks. Each node is also equipped with 22 RAM modules of 64 GB each, for a total of 1.375 TiB (1408 GB). Each GPU is equipped with 32 GB of RAM. The nodes can be divided into 3 sub-clusters. Table 1 below shows the hardware features of each sub-cluster.

Sub-Cluster | Number of nodes | InfiniBand NIC type           | Processor type                       | Cores
1           | 21              | Mellanox MT28908 [ConnectX-6] | Intel Xeon Gold 6240R CPU @ 2.40 GHz | 48
2           | 4               | Mellanox MT27800 [ConnectX-5] | Intel Xeon Gold 6230R CPU @ 2.10 GHz | 56
3           | 7               | Mellanox MT27800 [ConnectX-5] | Intel Xeon Gold 6230R CPU @ 2.10 GHz | 56

Table 1: Hardware specifications

One of the goals of the cluster is to make the infrastructure scalable based on application requirements. C4140 servers provide both PCIe 3.0 and NVLink 2.0 communication channels, enabling GPU-CPU communication over PCIe and GPU-GPU communication over NVLink.
When the application only needs CPUs, parallelism is guaranteed by the two CPUs available on each C4140 node. When the application requires GPU parallelism, it is recommended to use a more efficient interconnection system than the common PCIe bus. Compared to PCIe 3.0, NVLink 2.0 uses a point-to-point connection which offers advantages in terms of communication performance. While the maximum throughput of a PCIe 3.0 x16 link is 16 GB/s, a single NVLink 2.0 link reaches 25 GB/s per direction; being a point-to-point architecture, the total bidirectional bandwidth per link is 50 GB/s. In our specific case, each V100 GPU has 2 links to each of the other GPUs, for a total of 6 links and up to 300 GB/s (6 × 50 GB/s) of aggregate bidirectional bandwidth (see Figure 2). When applications require parallelism over more than 4 GPUs, an efficient inter-node communication technology, such as InfiniBand, must be provided.
The next section describes the middleware needed to enable these communication models. InfiniBand RDMA technology ensures that data is not copied between network layers but transferred directly to the NIC buffers; applications can directly access remote memory, without CPU involvement, ensuring extremely low latency, on the order of hundreds of nanoseconds.

Figure 2: NVLink 2.0 and PCIe 3.0 in the C4140 nodes; the network of the hybrid cluster

HPC cluster network as a combination of Ethernet and InfiniBand technologies

The combination of High-Throughput Computing (HTC) with High-Performance Computing (HPC) is an important aspect of an efficient cluster network. Even if InfiniBand technology is characterized by its low latency, not all applications benefit from such a high-performance network, because their workflow could make limited use of interprocess communication. It is therefore useful to also have Ethernet interconnections, allowing an application to exploit the technology more suitable for its needs. To achieve this versatility, our cluster implements a hybrid InfiniBand-Ethernet network.

It is imperative that network traffic is guaranteed even in the event of a failure of a network device: to satisfy this demand, there are two identical switches, and each connection to the first switch is replicated on the second one. In case of failure of the first switch, an automatic failover service enables the connections via the second one. Finally, the Ethernet Link Aggregation Group (LAG) protocol is used to improve the global efficiency of the network: the result is a doubling of the bandwidth when both switches are active. To this end, our framework links Huawei CE-8861-4C-EI Ethernet switches in a ring topology via 2×100 Gb/s stack links, and the multi-chassis Link Aggregation Group (mLAG) protocol is used to aggregate the 25 Gb/s ports.

To guarantee high efficiency also for heterogeneous application workflows, we implemented two InfiniBand networks dedicated to different tasks: one for interprocess communication and the other for data access. Each network uses a Mellanox SB7800 switch.

The Middleware level

To allow efficient use of the technologies described, it is necessary to provide a software layer that supports communication between the layers.
The Message Passing Interface (MPI) is a programming paradigm widely used in parallel applications that offers mechanisms for interprocess communication. This communication can take place in different forms: point-to-point, one-sided, or collective. The MPI standard has evolved by introducing support for CUDA-aware communications in heterogeneous GPU-enabled contexts, thus enabling optimized communication between host and GPU. For this reason, modern MPI distributions provide CUDA-aware MPI libraries built on the Unified Communication X (UCX) framework. The middleware installed on our cluster is based on a combination of the following technologies:

  • OpenFabrics Enterprise Distribution (OFED) for drivers and libraries needed by Mellanox InfiniBand NICs.
  • CUDA Toolkit for drivers, libraries and development environment required to take advantage of NVIDIA GPUs.
  • The CUDA-aware MPI implementation of OpenMPI, built on the UCX framework (a quick check is shown after this list).
  • A Lustre distributed file system to access storage resources with a high performance paradigm.
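
As a quick sanity check of this middleware stack, one can ask the installed OpenMPI whether it was built with CUDA support (a minimal sketch; the exact output format may vary between OpenMPI versions):

  $ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

On a correctly configured node, this should report the value true.
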
The process communication sub-layer

Moving data between processes continues to be a critical aspect of harnessing the full potential of a GPU. Modern GPUs provide data copying mechanisms that can facilitate and improve the performance of communications between GPU processes. In this regard, NVIDIA has introduced CUDA Inter-Process Copy (IPC) and GPUDirect Remote Direct Memory Access (RDMA) technologies for intra- and inter-node GPU process communications, and has partnered with Mellanox to make the latter available on InfiniBand-based clusters. Another noteworthy technology is NVIDIA gdrcopy, which enables inter-node GPU-to-GPU communications optimized for small messages, overcoming the significant overhead that the rendezvous protocol introduces when exchanging synchronization messages for small payloads.

The UCX framework combines the aforementioned technologies with communication libraries (e.g., OpenMPI). This framework is driven by the UCF Consortium (Unified Communication Framework), a collaboration between industry, laboratories, and universities to create production-grade communication frameworks and open standards for data-centric, high-performance applications. UCX claims to be a communication framework optimized for modern, high-bandwidth, low-latency networks. It exposes a set of abstract communication primitives that automatically use the best hardware resources available at execution time. Supported technologies include RDMA (both InfiniBand and RoCE), TCP, GPU, shared memory, and atomic network operations. In addition, UCX implements best practices for the transfer of messages of all sizes, based on the experience gained while running applications on the largest data centers and supercomputers in the world.
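
In practice, UCX can be steered through environment variables; for example, the set of transports it may use can be restricted with UCX_TLS. The following is a minimal sketch (the transport list and the application name ./app are illustrative assumptions to be adapted to the node configuration):

  $ ucx_info -d                                # list the devices and transports UCX detects on the node
  $ mpirun -np 4 --mca pml ucx \
      -x UCX_TLS=rc,sm,cuda_copy,cuda_ipc,gdr_copy ./app   # run an MPI job forcing the UCX PML and a given transport set
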
Table 2 lists the software characteristics of the cluster nodes.

Component              | Version
OS                     | Linux CentOS 7
InfiniBand driver      | NVIDIA MLNX_OFED 5.3-1.0.0.1
CUDA driver & runtime  | 11.1
NVIDIA GPUDirect-RDMA  | 1.1
NVIDIA gdrcopy         | 2.2
UCX                    | 1.10.0
MPI distribution       | Open MPI 4.1.0rc5
BLAS distribution      | Intel MKL 2021.1.1

Table 2: Software Specifications

The Lustre Distributed File System

As stated earlier, a key aspect of high-performance computing is the efficient delivery of data to and from the compute nodes. While InfiniBand technology can bridge the gap between CPU, memory, and I/O latencies, the storage components could be a bottleneck for the entire workload. The implementation adopted in our hybrid cluster is based on Lustre, a high-performance parallel and distributed file system. High performance is ensured by Lustre's flexibility in supporting multiple storage technologies, from the common Ethernet and TCP/IP-based ones to high-speed, low-latency ones such as InfiniBand, RDMA, and RoCE. In the Lustre architecture, all compute nodes in the cluster are clients, while all data is stored on Object Storage Targets (OSTs) hosted on the storage nodes. Communications are handled by a Management Service (MGS), and all metadata are stored on Metadata Targets (MDTs).
In the implemented Lustre architecture (see Figure 3), both the Management Service and the Metadata Service (MDS) are configured on one storage node, with the Metadata Targets (MDTs) stored on a 4-disk RAID-10 SSD array. The other 3 storage nodes host the OSTs for the two file systems exposed by Lustre, one for the home directories of the users and one for the scratch area of the jobs. In particular, the home file system is characterized by large needs for disk space and fault tolerance, so it is composed of 6 OSTs, each stored on a 30 TB 3-disk SAS RAID-5 HDD array. The scratch area, on the other hand, is characterized by fast disk access times without the need for redundancy; it is therefore composed of 6 OSTs stored on 1.8 TB SSD disks.

Figure 3: The implementation of the Lustre architecture


Obtaining login credentials

Currently, a potential user must ask for an account from the IBiSCo reference colleague of her/his institution, providing some identification data. The reference colleague sends the data to the IBiSCo administrators, who send back the access data with a temporary password.

ATTENTION: the TEMPORARY password must be changed at the first access

To change the password from the command line, use the “yppasswd” command. yppasswd creates or changes a password valid on every resource of the cluster (not only on the front-end server), since it is a network password in a Network Information Service (NIS).
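
A typical session looks like the following (a sketch; the exact prompts depend on the NIS configuration):

  $ yppasswd
  Changing NIS account information for <USERNAME> on <NIS-server>.
  Please enter old password:
  Changing NIS password for <USERNAME> on <NIS-server>.
  Please enter new password:
  Please retype new password:
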

Access procedure

To access the system (in particular its front-end or UI - User Interface), a user needs to connect via the SSH protocol to the host ibiscohpc-ui.scope.unina.it. Access is currently only in non-graphical terminal emulation mode. However, the account is valid for all cluster resources.

Access example from unix-like systems:

$ ssh ibiscohpc-ui.scope.unina.it -l <USERNAME>

To access IBiSCo from Windows systems, a simple option is PuTTY, freely available at https://www.putty.org/. From Windows 10 onwards it is also possible to use OpenSSH in a command window (CMD.exe or PowerShell.exe); it is pre-installed (if it is not active, it simply has to be enabled in the Optional Features).

Job preparation and submission

The resource manager SLURM is installed on the system to manage the cluster resources. Complete documentation is available at https://slurm.schedmd.com/.

SLURM is an open-source software system for cluster management; it is highly scalable and integrates fault-tolerance and job-scheduling mechanisms.

SLURM basic concepts

The main components of SLURM are:

  • nodes - computing nodes;
  • partitions - logical groups of nodes;
  • jobs - allocations of resources assigned to a user for a given time interval;
  • job steps - sets of (typically parallel) activities inside a job.

Partitions can be thought of as job queues, each of which defines constraints on job size, time limits, resource usage permissions by users, etc.
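
The constraints defined for a given partition can be inspected with the scontrol command described below; for example (assuming the hpc partition shown in the examples of this guide):

  $ scontrol show partition hpc
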

SLURM provides centralized management through a daemon, slurmctld, which monitors resources and jobs. Each computing node is managed by its own daemon, slurmd, which takes care of handling requests for activity.

Some tools available to the user are:

  • srun - start a job;
  • sbatch - submit batch scripts;
  • salloc - request the allocation of resources (nodes), with any constraints (e.g., number of processors per node);
  • scancel - terminate queued or running jobs;
  • sinfo - get information about the system status;
  • squeue - get the status of the job queue;
  • sacct - get information about jobs.

A complete list of available commands is in the man pages (also available online at https://slurm.schedmd.com/man_index.html): man <cmd>

Examples of use of some basic commands

System and resources information

sinfo - Know and verify the status of resources (existing partitions and related nodes, …) and the general status of the system:

Example: $ sinfo

Output:

  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  hpc*      up     infinite   32     idle  ibiscohpc-wn[01-32]

Output shows partition information; in this example:

  • there is a partition named “hpc” (* marks the default partition);
  • the partition is available (status: up);
  • the partition can be used without time limits;
  • the partition consists of 32 nodes;
  • its status is idle;
  • the available nodes are named ibiscohpc-wn01, ibiscohpc-wn02, …, ibiscohpc-wn32.

squeue - Know the status of the job queue:

Example: $ squeue

Output:

  JOBID PARTITION     NAME     USER         ST      TIME  NODES NODELIST(REASON)
  4815  hpc           sleep    cnr-isas     R       0:04 

Output shows, for each job:

  • job identifier;
  • name of partition on which the job was launched;
  • job name;
  • name of the user who launched the job;
  • job status (running);
  • job execution time.

scontrol - show detailed information about jobs and resources

Example (detailed information about ibiscohpc-wn02 node)

$ scontrol show node ibiscohpc-wn02

Output:

  NodeName=ibiscohpc-wn02 Arch=x86_64 CoresPerSocket=24
     CPUAlloc=0 CPUTot=96 CPULoad=0.01
     AvailableFeatures=HyperThread
     ActiveFeatures=HyperThread
     Gres=gpu:tesla:4(S:0)
     NodeAddr=ibiscohpc-wn02 NodeHostName=ibiscohpc-wn02 Version=20.11.5
     OS=Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018
     RealMemory=1546503 AllocMem=0 FreeMem=1528903 Sockets=2 Boards=1
     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
     Partitions=hpc
     BootTime=2022-02-01T16:24:43 SlurmdStartTime=2022-02-01T16:25:25
     CfgTRES=cpu=96,mem=1546503M,billing=96
     AllocTRES=
     CapWatts=n/a
     CurrentWatts=0 AveWatts=0
     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
     Comment=(null)

Job preparation and submission

srun - manage the execution of a parallel job on the cluster managed by SLURM. If necessary, srun allocates the resources for the job execution.

Some useful srun parameters are:

-c, --cpus-per-task=<ncpus>

  • number of CPUs allocated per process. By default, one CPU is used per process.

-l, --label

  • prepend to each line of stdout the number of the task to which the output refers.

-N, --nodes=<minnodes>[-maxnodes]

  • minimum number (minnodes) of nodes to allocate for the job and, optionally, the maximum one.
  • If the parameter is not specified, the command allocates the nodes needed to satisfy the requirements specified by the -n and -c parameters.
  • If the values are outside the allowed range for the associated partition, the job is placed in a PENDING state. This allows for possible execution at a later time, when the partition limit is possibly changed.

-n, --ntasks=<number>

  • number of tasks to run. srun allocates the necessary resources based on the number of required tasks (by default, one task per node is assumed but, using the --cpus-per-task option, this behavior can be changed).

Example, interactively access a node, from UI:

  $ salloc  srun --pty /bin/bash

Example, submit a batch job, from UI:

  $ echo -e '#!/bin/sh\nhostname' | sbatch

Example, submit an MPI interactive job with <N> tasks, from UI:

  $ srun -n <N> <EXEFILE>
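
For non-interactive work, the same parameters can be collected in a batch script and submitted with sbatch. The following is a minimal sketch (the job name, the resource numbers, and the executable my_program are placeholders to adapt):

  #!/bin/bash
  #SBATCH --job-name=myjob           # name shown by squeue
  #SBATCH --nodes=2                  # number of nodes to allocate
  #SBATCH --ntasks=8                 # total number of tasks
  #SBATCH --cpus-per-task=4          # CPUs allocated per task
  #SBATCH --output=job_%j.out        # output file (%j expands to the job ID)

  srun ./my_program                  # launch the tasks on the allocated resources

Saved, e.g., as job.sh, the script is submitted with: $ sbatch job.sh
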

Available file systems

Users of the resource currently have the ability to use the following file systems:

  • /lustre/home/ - file system shared between the nodes and the UI, created using Lustre technology, where the users' home directories reside;
  • /lustre/scratch - file system shared between the nodes, created using Lustre technology, to be used as a scratch area;
  • /home/scratch - file system local to each node, to be used as a scratch area.
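
The space available on the Lustre file systems can be checked with the standard Lustre client utility (a sketch, assuming the lfs tool is in the user's PATH):

  $ lfs df -h /lustre/home      # usage of the home file system, per OST
  $ lfs df -h /lustre/scratch   # usage of the scratch file system, per OST
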

Available software

COMPILERS

INTEL : compilers and libraries

To use Intel's suite of compilers and libraries, you need to execute (interactively, or inside any script in which they are needed) the command:

. /nfsexports/intel/oneapi/setvars.sh 
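
After sourcing the script, the Intel tools should be available in the current shell; a quick check (the compiler driver names depend on the oneAPI version installed, so treat icx here as an example):

  $ . /nfsexports/intel/oneapi/setvars.sh
  $ which icx          # oneAPI C/C++ compiler driver
  $ icx --version
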
NVIDIA HPC SDK (compiler suites, libraries, etc. provided by NVIDIA)
  • Ver 20.10 - is available in the directory /nfsexports/SOFTWARE/nvidia/hpc_sdk/
OpenMPI
  • Ver 4.1.0rc5 - configured to be CUDA-aware, is available in the directory /usr/mpi/gcc/openmpi-4.1.0rc5/
Julia
  • Ver 1.6.1 - interpreter available in the directory /nfsexports/SOFTWARE/julia-1.6.1
FFTW libraries
  • ver 3.3.10 - compiled with Intel compilers available in the directory /nfsexports/SOFTWARE/fftw-3.3.10
Anaconda 3 environment
  • available in the directory /nfsexports/anaconda3

Complete software packages for specific applications

Matlab
  • Ver R2020b - available in the directory /nfsexports/SOFTWARE/MATLAB/R2020b/bin
Quantum ESPRESSO
  • Ver 7.0 - available in the directory /nfsexports/SOFTWARE/qe-7.0
OpenFOAM
  • Ver 7.0 - available in the directory /nfsexports/SOFTWARE/OpenFOAM-7.0/
Rheotool
  • Ver 5.0 - available in the same directory as OpenFOAM

For Python users

Base

To use Python, it is necessary to start the conda environment using the following commands:

source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
<commands execution> [Example: python example.py]
conda deactivate 
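
The same setup can be used non-interactively inside a batch job; a minimal sketch (the resource numbers and example.py are placeholders):

  #!/bin/bash
  #SBATCH --nodes=1            # a single node
  #SBATCH --ntasks=1           # a single task

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh   # start the conda environment
  python example.py                                        # run the Python code
  conda deactivate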

Tensorflow

The tensorflow sub-environment is activated after starting the conda environment:

source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
conda activate tensorflowgpu
<commands execution> [Example: python example.py]
conda deactivate
conda deactivate

Bio-Informatics

To use the bioconda sub-environment, the following commands have to be executed:

source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
conda activate bioconda
<commands execution> [Example: python example.py]
conda deactivate
conda deactivate

Packages list

To list the available packages in the given environment, run the command:

conda list

Parallel computation in python

  • The effective usage of the HPC cluster requires parallelizing the processes. Code can be parallelized by distributing the tasks among the available nodes and their respective CPUs and GPUs. This information can be specified in a simple submission bash script (e.g., a sub.sh file) as follows:
#!/bin/bash
#SBATCH --nodes=[nnodes]                     #number of nodes
#SBATCH --ntasks-per-node=[ntasks per node]  #number of tasks per node
#SBATCH --gres=gpu:[ngpu]                    #number of GPUs per node

Example of parallel jobs submission

Suppose a given Python code has to be executed for different values of a variable “rep”. It is possible to execute the Python code in parallel during the job submission process by creating temporary files, one for each value rep=a1, a2, … The Python code example.py can contain the line:

 rep=REP 

The submission script sub.sh can be used to parallelize the process in the following way:

#!/bin/bash
#SBATCH --nodes=[nnodes]                     #number of nodes
#SBATCH --ntasks-per-node=[ntasks per node]  #number of tasks per node
#SBATCH --gres=gpu:[ngpu]                    #number of GPUs per node
NPROC=[nprocesses]          #number of processing units to be used

program=example.py          #Python code containing the placeholder REP
tmpstring=tmp               #prefix of the temporary files generated

count=0                     #count of the launched processes
for rep in {1..10}          #the value of rep runs from 1 to 10
do
    tmpprogram=${tmpstring}_${rep}.py             #temporary file name for each value of rep
    sed -e "s/REP/$rep/g" $program > $tmpprogram  #create the temporary file, replacing the placeholder REP with the current value of rep
    python $tmpprogram &    #run the temporary file in the background
    (( count++ ))           #increase the count number
    [[ $(( count % NPROC )) -eq 0 ]] && wait      #after every NPROC launches, wait for the parallel programs to finish
done
wait                        #wait for any remaining background processes
rm ${tmpstring}*            #optionally remove the temporary files after the execution
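
Once prepared, the script is submitted from the UI in the usual way:

  $ sbatch sub.sh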
  • Parallelization can also be implemented within the Python code itself. For example, the evaluation of a function for different variable values can be done in parallel. Python offers many packages to parallelize a given process; the most basic among them is multiprocessing.
  • The Keras module of TensorFlow and the PyTorch framework, which are mainly used for machine learning, detect the GPUs automatically.