====== User guide ======

**System architecture**

The architecture of the hybrid cluster of the IBiSCo Data Center (Infrastructure for BIg data and Scientific Computing) can be represented as a set of multiple layers (Figure 1).\\
The lowest layer of the architecture consists of the hardware, characterized by calculation and storage nodes; the upper layer is the application level, which allows users to submit their tasks.\\
The intermediate level of the architecture consists of the set of CUDA and MPI libraries which make the two levels communicate with each other. The Lustre distributed file system, which offers a high-performance paradigm of access to resources, also belongs to this level.\\
To guarantee the correct working conditions of the cluster it was necessary to use a high-performance network. Therefore, InfiniBand technology was chosen.

{{:wiki:thearchitecturecluster.png?400|Diagram of the levels of the cluster}}

//Figure 1: Diagram of the levels of the cluster//

=== The hardware level ===

The cluster comprises 36 nodes and 2 switches, placed in 4 racks of the Data Center. The nodes perform two functions: calculation and storage. To support calculation there are 128 GPUs, distributed among 32 nodes (4 GPUs per node). To support storage, 320 Terabytes are available, distributed among 4 nodes (80 Terabytes per node). \\
As previously mentioned, to ensure access to resources and low-latency, high-bandwidth communication between nodes, InfiniBand has been used. This technology provides Remote Direct Memory Access (RDMA), which allows direct access from a local system to a remote memory location (main memory or GPU memory of a remote node) without involving the CPUs in the operations, thus reducing message transfer delay. \\
The peculiarity of the cluster lies in its heterogeneity, given by the coexistence of two technologies: InfiniBand and Ethernet. Despite the advantages of InfiniBand, Ethernet cannot be eliminated, since it is essential for maintenance operations, monitoring the infrastructure, and ensuring access to resources outside the cluster.

== The Storage Node Architecture ==

The cluster has 4 Dell R740 storage nodes, each offering sixteen 16 TB SAS HDDs and eight 1.9 TB SATA SSDs. Each node is equipped with two InfiniBand EDR ports, one of which is connected to the Mellanox InfiniBand switch dedicated to storage, which guarantees a 100 Gb/s connection to all the compute nodes. \\
While the aforementioned nodes are dedicated to the home and scratch areas of the users, a separate 3 PB storage system will also be available. It will be accessible via Ethernet and can be used as a repository from which users can move data to and from the storage systems connected to the InfiniBand network when they need large amounts of data over the time span of their job. As for the Ethernet network, each node is equipped with 2 ports at 25 Gb/s for connection to the Data Center core switch.

== The Compute Node Architecture ==

The cluster compute nodes are 32 Dell C4140s, each equipped with 4 NVIDIA V100 GPUs, 2 Ethernet ports at 10 Gb/s each, 2 InfiniBand ports at 100 Gb/s each, 2 Intel Gen 2 Xeon Gold CPUs, and two 480 GB SATA solid-state disks. Each node is also equipped with twenty-two 64 GB RAM modules, for a total of 1.375 TiB (1408 GB). Each GPU is equipped with 32 GB of RAM. The nodes can be divided into 3 sub-clusters. Table 1 below shows the hardware features peculiar to each sub-cluster.
^ Sub-Cluster ^ Number of nodes ^ InfiniBand NIC Type ^ Processor Type ^ Number of cores per node ^
| 1 | 21 | Mellanox MT28908 [ConnectX-6] | Intel Xeon Gold 6240R CPU@2.40GHz | 48 |
| 2 | 4 | Mellanox MT27800 [ConnectX-5] | Intel Xeon Gold 6230R CPU@2.10GHz | 56 |
| 3 | 7 | Mellanox MT27800 [ConnectX-5] | Intel Xeon Gold 6230R CPU@2.10GHz | 56 |

//Table 1: Hardware specifications//

== NVLink and InfiniBand: High-performance interconnections for intra- and inter-node GPU communications ==

One of the goals of the cluster is to make the infrastructure scalable based on application requirements. C4140 servers are available with both PCIe 3.0 and NVLink 2.0 communication channels, enabling GPU-CPU and GPU-GPU communications over PCIe or NVLink respectively. \\
When the application only needs CPUs, parallelism is guaranteed by the two CPUs available on each C4140 node. When the application requires GPU parallelism, it is recommended to use a more efficient interconnection system than the more common PCIe bus. Compared to PCIe 3.0, NVLink 2.0 uses a point-to-point connection which offers advantages in terms of communication performance. While the maximum throughput of a PCIe 3.0 bus link is 16 GB/s, NVLink reaches 25 GB/s; moreover, being a point-to-point architecture, the total bidirectional bandwidth per link is 50 GB/s. In our specific case, each V100 GPU has 2 links to each of the other GPUs, for a total of 6 links and up to 300 GB/s of aggregate bidirectional bandwidth (see Figure 2). When applications require parallelism over more than 4 GPUs it is necessary to provide an efficient inter-node communication technology, such as InfiniBand. \\
The next section describes the middleware needed to enable these communication models. InfiniBand RDMA technology ensures that data is not copied between network layers but transferred directly to the NIC buffers; applications can directly access remote memory, without CPU involvement, ensuring extremely low latency, on the order of magnitude of hundreds of nanoseconds.

{{:wiki:pcivsnvlink1.png?300|NVLink and PCIe in the C4140}} {{:wiki:pcivsnvlink2.png?300|NVLink 2.0 and PCIe 3.0}} {{:wiki:clusternet.png?300|The hybrid cluster network}}

//Figure 2//

=== HPC cluster network as a combination of Ethernet and InfiniBand technologies ===

The combination of High-Throughput Computing (HTC) and High-Performance Computing (HPC) is an important aspect of an efficient cluster network. Even if InfiniBand technology is characterized by its low latency, not all applications benefit from such a high-performance network, because their workflow could make limited use of interprocess communication. Therefore it is useful to also have Ethernet network interconnections, allowing an application to exploit the technology most suitable for its needs. To achieve this versatility, our cluster implements a hybrid InfiniBand-Ethernet network. It is imperative that network traffic is guaranteed even in the event of a failure of a network device: to satisfy this demand, there are two identical switches and each connection to the first switch is replicated on the second one. In case of failure of the first switch, an automatic failover service enables the connections via the second one. Finally, the Ethernet Link Aggregation Group (LAG) protocol is used to improve the global efficiency of the network: the result is a doubling of the bandwidth when both switches are active. With this aim, in our framework the Huawei CE-8861-4C-EI Ethernet switches are linked in a ring topology via 2x100 Gb/s stack links.
The multi-chassis Link Aggregation Group (mLAG) protocol is then used to aggregate the 25 Gb/s ports. To guarantee high efficiency also for heterogeneous application workflows, we implemented two InfiniBand networks dedicated to different tasks: one for interprocess communication and the other for data access. Each network uses a Mellanox SB7800 switch.

=== The Middleware level ===

To allow efficient use of the technologies described, it is necessary to provide a software layer that supports communication between the layers. \\
The Message Passing Interface (MPI) is a programming paradigm widely used in parallel applications that offers mechanisms for interprocess communication. This communication can take place in different forms: point-to-point, one-sided, or collective. The MPI standard has evolved by introducing support for CUDA-aware communications in heterogeneous GPU-enabled contexts, thus enabling optimized communication between host and GPU. For this reason, modern MPI distributions provide CUDA-aware MPI libraries built on the Unified Communication X (UCX) framework.

The middleware installed on our cluster is based on a combination of the following technologies:
  * OpenFabrics Enterprise Distribution (OFED) for the drivers and libraries needed by the Mellanox InfiniBand NICs.
  * CUDA Toolkit for the drivers, libraries and development environment required to take advantage of the NVIDIA GPUs.
  * The CUDA-aware OpenMPI implementation of MPI, through the UCX framework.
  * The Lustre distributed file system, to access storage resources with a high-performance paradigm.

== The process communication sub-layer ==

Moving data between processes continues to be a critical aspect of harnessing the full potential of a GPU. Modern GPUs provide some data copying mechanisms that can facilitate and improve the performance of communications between GPU processes. In this regard, NVIDIA has introduced CUDA Inter-Process Copy (IPC) and GPUDirect Remote Direct Memory Access (RDMA) technologies for intra- and inter-node GPU process communications. NVIDIA has partnered with Mellanox to make this solution available for InfiniBand-based clusters. Another noteworthy technology is NVIDIA gdrcopy, which enables inter-node GPU-to-GPU communications optimized for small messages, overcoming the significant overhead introduced by exchanging synchronization messages via the rendezvous protocol for small messages.

The UCX framework is provided to combine the aforementioned technologies with communication libraries (e.g., OpenMPI). This framework is driven by the UCF Consortium (Unified Communication Framework), a collaboration between industry, laboratories, and universities to create production-grade communication frameworks and open standards for data-centric, high-performance applications. UCX claims to be a communication framework optimized for modern, high-bandwidth, low-latency networks. It exposes a set of abstract communication primitives that automatically use the best hardware resources available at execution time. Supported technologies include RDMA (both InfiniBand and RoCE), TCP, GPU, shared memory, and atomic network operations. In addition, UCX implements some best practices for the transfer of messages of all sizes, based on the experience gained while running applications on the largest data centers and supercomputers in the world. \\
Table 2 lists the software characteristics of the nodes in the cluster.
^ OS ^ InfiniBand Driver ^ CUDA Driver & Runtime Version ^ NVIDIA GPUDirect-RDMA ^ NVIDIA gdrcopy ^ UCX ^ MPI Distribution ^ BLAS Distribution ^
| Linux CentOS 7 | NVIDIA MLNX_OFED 5.3-1.0.0.1 | 11.1 | 1.1 | 2.2 | 1.10.0 | Open MPI 4.1.0rc5 | Intel MKL 2021.1.1 |

//Table 2: Software Specifications//

== The Lustre Distributed File System ==

As stated earlier, a key aspect of high-performance computing is the efficient delivery of data to and from the compute nodes. While InfiniBand technology can bridge the gap between CPU, memory, and I/O latencies, the storage components could be a bottleneck for the entire workload. The implementation adopted in our hybrid cluster is based on Lustre, a high-performance parallel and distributed file system. High performance is ensured by Lustre's flexibility in supporting multiple storage technologies, from the common Ethernet- and TCP/IP-based ones to high-speed, low-latency ones such as InfiniBand, RDMA and RoCE. In the Lustre architecture, all compute nodes in the cluster are clients, while all data is stored on Object Storage Targets (OSTs) running on the storage nodes. Communications are handled by a Management Service (MGS) and all metadata are stored on a Metadata Target (MDT). \\
In the implemented Lustre architecture (see Figure 3), both the Management Service and the Metadata Service (MDS) are configured on one storage node, with the Metadata Targets (MDTs) stored on a 4-disk RAID-10 SSD array. The other 3 storage nodes host the OSTs for the two file systems exposed by Lustre, one for the home directories of the users and one for the scratch area of the jobs. In particular, the home file system is characterized by large needs in terms of disk space and fault tolerance, so it is composed of 6 OSTs, each stored on a 30 TB RAID-5 array of 3 SAS HDDs. The scratch area, on the other hand, is characterized by fast disk access times without the need for redundancy; therefore, it is composed of 6 OSTs stored on 1.8 TB SSDs.

{{:wiki:lustrearch.png?600|The implementation of the Lustre architecture}}

// Figure 3: The implementation of the Lustre architecture //

----

==== Obtaining login credentials ====

Currently, a potential user must request an account from the IBiSCo reference colleague of her/his institution, providing some identification data. The reference colleague sends the data to the IBiSCo administrators, who send back the access data with a temporary password.

ATTENTION: the TEMPORARY password must be changed at the first access.

To change the password from the command line, use the ''yppasswd'' command. yppasswd creates or changes a password valid on every resource of the cluster, not only on the front-end server (it is a network password in a Network Information Service - NIS).

==== Access procedure ====

To access the system (in particular its front-end or UI - User Interface) a user needs to connect via the SSH protocol to the host ibiscohpc-ui.scope.unina.it. Access is currently only in non-graphical terminal emulation mode. However, the account is valid for all cluster resources.

Access example from unix-like systems:

''$ ssh ibiscohpc-ui.scope.unina.it -l <username>''

To access IBiSCo from Windows systems, a simple option is PuTTY, freely available at ''https://www.putty.org/''. From Windows 10 onwards it is also possible to use OpenSSH in a command window (CMD.exe or Powershell.exe). It is pre-installed (if it is not active, it simply has to be enabled in the Optional Features).
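For example, a first access followed by the mandatory password change might look as follows (''<username>'' is a placeholder for the login name received from the administrators):

  $ ssh ibiscohpc-ui.scope.unina.it -l <username>
  $ yppasswd

''yppasswd'' asks for the current (temporary) password and then for the new one, which becomes valid on all cluster resources.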
==== Job preparation and submission ====

The SLURM resource manager is installed on the system to manage the cluster resources. Complete documentation is available at ''https://slurm.schedmd.com/''. SLURM is an open-source software system for cluster management; it is highly scalable and integrates fault-tolerance and job-scheduling mechanisms.

==== SLURM basic concepts ====

The main components of SLURM are:
  * //**nodes**// - computing nodes;
  * //**partitions**// - logical groups of nodes;
  * //**jobs**// - allocation of the resources assigned to a user for a given time interval;
  * //**job steps**// - sets of (typically parallel) activities inside a job.

Partitions can be thought of as //**job queues**//, each of which defines constraints on job size, time limits, resource usage permissions by users, etc. SLURM provides centralized management through a daemon, //**slurmctld**//, which monitors resources and jobs. Each node is managed by a daemon, //**slurmd**//, which takes care of handling requests for activity.

Some tools available to the user are:
  * [[https://slurm.schedmd.com/srun.html|srun]] - start a job;
  * [[https://slurm.schedmd.com/sbatch.html|sbatch]] - submit batch scripts;
  * [[https://slurm.schedmd.com/salloc.html|salloc]] - request the allocation of resources (nodes), with any constraints (e.g., number of processors per node);
  * [[https://slurm.schedmd.com/scancel.html|scancel]] - terminate queued or running jobs;
  * [[https://slurm.schedmd.com/sinfo.html|sinfo]] - show information about the system status;
  * [[https://slurm.schedmd.com/squeue.html|squeue]] - show the job queue status;
  * [[https://slurm.schedmd.com/sacct.html|sacct]] - get accounting information about jobs;
  * …

A complete list of available commands is in man (also available online at ''https://slurm.schedmd.com/man_index.html''):

''man <command>''

==== Examples of use of some basic commands ====

== system and resources information ==

''**sinfo**'' - check the resource status (existing partitions and related nodes, ...) and the general system status.

Example:

''$ sinfo''

Output:

  PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
  hpc*      up    infinite  32    idle  ibiscohpc-wn[01-32]

The output shows partition information; in this example:
  * there is a partition named "hpc" (the * marks the default partition);
  * the partition is available (status: up);
  * the partition can be used without time limits;
  * the partition consists of 32 nodes;
  * its status is idle;
  * the available nodes are named ''ibiscohpc-wn01'', ''ibiscohpc-wn02'', ..., ''ibiscohpc-wn32''.

''**squeue**'' - check the job queue status.

Example:

''$ squeue''

Output:

  JOBID PARTITION  NAME     USER  ST  TIME  NODES NODELIST(REASON)
  4815  hpc        sleep cnr-isas  R  0:04

The output shows, for each job:
  * the job identifier;
  * the name of the partition on which the job was launched;
  * the job name;
  * the name of the user who launched the job;
  * the job status (running);
  * the job execution time.
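Standard SLURM filtering options can also be useful here; for instance (shown only as a hint, ''<username>'' being a placeholder):

  $ squeue -u <username>   # show only the jobs of a given user
  $ squeue -l              # long output format with additional details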
''**scontrol**'' - show detailed information about jobs and resources.

Example (detailed information about the ''ibiscohpc-wn02'' node):

''$ scontrol show node ibiscohpc-wn02''

Output:

  NodeName=ibiscohpc-wn02 Arch=x86_64 CoresPerSocket=24
  CPUAlloc=0 CPUTot=96 CPULoad=0.01
  AvailableFeatures=HyperThread
  ActiveFeatures=HyperThread
  Gres=gpu:tesla:4(S:0)
  NodeAddr=ibiscohpc-wn02 NodeHostName=ibiscohpc-wn02 Version=20.11.5
  OS=Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018
  RealMemory=1546503 AllocMem=0 FreeMem=1528903 Sockets=2 Boards=1
  State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
  Partitions=hpc
  BootTime=2022-02-01T16:24:43 SlurmdStartTime=2022-02-01T16:25:25
  CfgTRES=cpu=96,mem=1546503M,billing=96
  AllocTRES=
  CapWatts=n/a
  CurrentWatts=0 AveWatts=0
  ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
  Comment=(null)

== job preparation and submission ==

''**srun**'' - run a parallel job on the cluster managed by SLURM. If necessary, srun allocates the resources for the job execution.

Some useful srun parameters are:

''-c'', ''--cpus-per-task=<ncpus>''
  * number of CPUs allocated per process. By default, one CPU is used per process.

''-l'', ''--label''
  * prepends to each line of stdout the number of the task to which the output refers.

''-N'', ''--nodes=<minnodes>[-maxnodes]''
  * minimum number (''minnodes'') of nodes to allocate for the job and, optionally, the maximum one.
  * If the parameter is not specified, the command allocates the nodes needed to satisfy the requirements specified by the parameters ''-n'' and ''-c''.
  * If the values are outside the allowed range for the associated partition, the job is placed in a ''PENDING'' state. This allows for possible execution at a later time, when the partition limit is possibly changed.

''-n'', ''--ntasks=<ntasks>''
  * number of tasks to run. ''srun'' allocates the necessary resources based on the number of required tasks (by default, one task is run per node but, using the ''--cpus-per-task'' option, this behavior can be changed).

Example, interactively access a node, from the UI:

  $ salloc srun --pty /bin/bash

Example, submit a batch job, from the UI:

  $ echo -e '#!/bin/sh\nhostname' | sbatch

Example, submit an interactive MPI job with ''<ntasks>'' tasks, from the UI:

  $ srun -n <ntasks> <mpi_program>

==== Available file systems ====

Users of the resource currently have the ability to use the following file systems:
  * ''/lustre/home/'' - file system shared between the nodes and the UI, created using Lustre technology, where the users' home directories reside;
  * ''/lustre/scratch'' - file system shared between the nodes, created using Lustre technology, to be used as a scratch area;
  * ''/home/scratch'' - file system local to each node, to be used as a scratch area.

==== Available software ====

=== COMPILERS ===

== INTEL: compilers and libraries ==

To use Intel's suite of compilers and libraries, you need to source (interactively, or inside any script in which they are needed) the following setup file:

  . /nfsexports/intel/oneapi/setvars.sh
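As a minimal usage sketch (assuming the classic Intel C compiler ''icc'' is included in the installed oneAPI release; ''hello.c'' is just a hypothetical source file):

  $ . /nfsexports/intel/oneapi/setvars.sh   # load the Intel environment into the current shell
  $ icc -O2 hello.c -o hello                # compile with the Intel compiler

After sourcing ''setvars.sh'', the Intel compilers and libraries (including the MKL listed in Table 2) are available in the current shell environment.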
== NVIDIA HPC SDK (compiler suites, libraries, etc. provided by NVIDIA) ==

  * //**Ver 20.10**// - available in the directory ''/nfsexports/SOFTWARE/nvidia/hpc_sdk/''

== OpenMPI ==

  * //**Ver 4.1.0rc5**// - configured to be CUDA-aware, available in the directory ''/usr/mpi/gcc/openmpi-4.1.0rc5/''

== Julia ==

  * //**Ver 1.6.1**// - interpreter available in the directory ''/nfsexports/SOFTWARE/julia-1.6.1''

== FFTW libraries ==

  * //**Ver 3.3.10**// - compiled with the Intel compilers, available in the directory ''/nfsexports/SOFTWARE/fftw-3.3.10''

== Anaconda 3 environment ==

  * available in the directory ''/nfsexports/anaconda3''

=== Complete software packages for specific applications ===

== Matlab ==

  * //**Ver R2020b**// - available in the directory ''/nfsexports/SOFTWARE/MATLAB/R2020b/bin''

== Quantum ESPRESSO ==

  * //**Ver 7.0**// - available in the directory ''/nfsexports/SOFTWARE/qe-7.0''

== OpenFOAM ==

  * //**Ver 7.0**// - available in the directory ''/nfsexports/SOFTWARE/OpenFOAM-7.0/''

== Rheotool ==

  * //**Ver 5.0**// - available in the same directory as OpenFOAM

==== For python users ====

=== Base ===

To use Python, it is necessary to start the conda environment using the following commands:

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  [Example: python example.py]
  conda deactivate

=== Tensorflow ===

The tensorflow sub-environment can be activated after starting the conda environment:

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  conda activate tensorflowgpu
  [Example: python example.py]
  conda deactivate
  conda deactivate

(The first ''conda deactivate'' leaves the sub-environment, the second one leaves the base environment.)

=== Bio-Informatics ===

To use the bioconda sub-environment, the following commands have to be executed:

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  conda activate bioconda
  [Example: python example.py]
  conda deactivate
  conda deactivate

=== Packages list ===

To list the packages available in the current environment, run the command:

  conda list

=== Parallel computation in python ===

  * Effective usage of the HPC resources is achieved by parallelizing processes. Codes can be parallelized by distributing the tasks among the available nodes and their respective CPUs and GPUs. This information can be specified in a simple bash submission script (e.g., a ''sub.sh'' file) as follows:

  #!/bin/bash
  #SBATCH --nodes=[nnodes]                      #number of nodes
  #SBATCH --ntasks-per-node=[ntasks per node]   #number of cores per node
  #SBATCH --gres=gpu:[ngpu]                     #number of GPUs per node

=== Example of parallel jobs submission ===

Suppose a given Python code has to be executed for different values of a variable "rep". It is possible to execute the Python codes in parallel during the job submission process by creating temporary files, one for each value rep=a1, a2, ... The Python code ''example.py'' can contain a line:

  rep=REP

The submission script ''sub.sh'' can then be used to parallelize the process in the following way:

  #!/bin/bash
  #SBATCH --nodes=[nnodes]                      #number of nodes
  #SBATCH --ntasks-per-node=[ntasks per node]   #number of cores per node
  #SBATCH --gres=gpu:[ngpu]                     #number of GPUs per node
  NPROC=[nprocesses]        #number of processing units to be accessed
  program=example.py        #the python code to be executed (defined here so that $program below is set)
  tmpstring=tmp             #prefix of the temporary files generated
  count=0                   #counter of the launched processes
  for rep in {1..10}        #the value of rep runs from 1 to 10
  do
      tmpprogram=${tmpstring}_${rep}.py             #temporary file name for this value of rep
      sed -e "s/REP/$rep/g" $program > $tmpprogram  #replace REP in the .py file with the current value of rep
      python $tmpprogram &                          #run the temporary file in the background
      (( count++ ))                                 #increase the count of launched processes
      [[ $(( count % NPROC )) -eq 0 ]] && wait      #every NPROC launches, wait for the running programs to finish
  done
  rm ${tmpstring}*          #optionally remove the temporary files after the execution
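The script above can then be submitted from the UI with the usual SLURM command (''sub.sh'' being the file name used in this example):

  $ sbatch sub.sh

The job status can be followed with ''squeue'', as shown earlier.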
  * Parallel job submissions can also be done through a job array submission; a minimal sketch is shown after this list. More information about job arrays can be found at [[https://slurm.schedmd.com/job_array.html]].
  * Parallelization can also be implemented within the Python code itself. For example, the evaluation of a function for different variable values can be done in parallel. Python offers many packages to parallelize a given process; the basic one among them is [[https://docs.python.org/3/library/multiprocessing.html|multiprocessing]].
  * Keras (included in TensorFlow) and PyTorch, the modules mainly used for machine learning, detect the GPUs automatically.
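As a minimal sketch of a job array submission (assuming, unlike the ''sed''-based example above, that ''example.py'' reads the value of rep from the command line):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --gres=gpu:1
  #SBATCH --array=1-10                       #one array task per value of rep
  #each array task receives its own index in the SLURM_ARRAY_TASK_ID variable
  python example.py ${SLURM_ARRAY_TASK_ID}

Each of the 10 array tasks is scheduled independently by SLURM, so the manual NPROC/wait bookkeeping of the previous script is not needed.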