===== Instructions and tips on using Ibisco software =====

==== Obtaining login credentials ====

Currently a potential user must request an account from the Ibisco reference colleague of her/his institution, providing some identification data. The reference colleague sends the data to the Ibisco administrators, who send back the access data with a temporary password.

ATTENTION: the TEMPORARY password must be changed at the first access.

To change the password from the command line use the ''yppasswd'' command. Yppasswd creates or changes a password valid on every resource of the cluster (not only on the front-end server), i.e. a network password in a Network Information Service (NIS).

The login procedure will change slightly in a few months; see the Access procedure section below.

==== Access procedure ====

To access the system (in particular its front-end or UI - User Interface) a user needs to connect via the SSH protocol to the host ibiscohpc-ui.scope.unina.it. Access is currently only in non-graphical terminal emulation mode. However, the account is valid for all cluster resources.

Currently access uses the SSH "user-password" method, as shown below.

Access example from unix-like systems:

''$ ssh ibiscohpc-ui.scope.unina.it -l [username]''

To access Ibisco from Windows systems a simple option is PuTTY, freely available at ''https://www.putty.org/''. From Windows 10 onwards it is also possible to use OpenSSH in a command window (CMD.exe or Powershell.exe). It is pre-installed (if it is not active, it simply has to be enabled in the Optional Features).

In a few months access to the cluster will be exclusively via the "user-SSH key" method (other secure access methods are being studied).\\
Current users are invited to generate their key pair and upload the public key to the server in their home directory.\\
New users, when asking for an account, will follow a slightly different procedure: they will generate the key pair but will not upload the public key to the server (they will not yet have access); they will send it to the Ibisco admins. The admins will copy it, with the right permissions, into the home directory of the new user. After that the user will be able to enter the system without typing a server password (but he/she will still have to type a passphrase, see below).\\
Once inside, the user will create a server password with ''yppasswd'', valid for access to all the nodes of the cluster.\\
Obviously, it is important to keep the private key and the passphrase in a secret and safe place; otherwise, as in all security matters, the advantages brought by safer access methods will vanish.

Here we show a possible way to generate the key pair on Linux and on Windows. In any case there is a lot of documentation on the internet about how to do it.

**on a linux system**\\
from your home directory execute\\
''$ ssh-keygen -t rsa''\\
Press Enter at the first question (file name).\\
In response to the prompt "Enter passphrase", enter a key passphrase to protect access to the private key. Using a passphrase enhances security, and a passphrase is recommended.\\
The key pair is generated by the system.
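As a minimal sketch (the user name ''alice'' is illustrative and the exact prompt wording depends on the OpenSSH version), a key-generation session might look like this:

  $ ssh-keygen -t rsa
  Generating public/private rsa key pair.
  Enter file in which to save the key (/home/alice/.ssh/id_rsa):    # press Enter
  Enter passphrase (empty for no passphrase):                       # type a passphrase
  Enter same passphrase again:                                      # repeat it

  $ ls ~/.ssh/id_rsa*
  /home/alice/.ssh/id_rsa      # private key: keep it (and the passphrase) safe
  /home/alice/.ssh/id_rsa.pub  # public key: this is the file to copy/send to the server

The private key never leaves your machine; only ''id_rsa.pub'' is uploaded or sent, as described below.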
  * If you still have password access to the system (existing users), you can run the following command, which copies your public key to the server and appends it to the file ''~/.ssh/authorized_keys'' with the right permissions: ''$ ssh-copy-id -i ~/.ssh/id_rsa.pub [username]@ibiscohpc-ui.scope.unina.it''
  * If you are a new user, simply send the file ''~/.ssh/id_rsa.pub'' by mail to the Ibisco admins: they will copy it into your ''.ssh'' directory on the cluster with the right permissions.

**on a Windows system**\\
We suggest PuTTY, a package for Windows that simplifies the use of Windows as an SSH client and the management of connections to remote hosts.\\
To create the key pair (https://the.earth.li/~sgtatham/putty/0.77/htmldoc/) you can follow this procedure:

  - Run PUTTYGEN.EXE
  - Leave the standard choices (Key -> SSH-2 RSA Key, Use probable primes, show fingerprint as SHA256; Parameters -> RSA, 2048 bit)
  - Press the "Generate" button and follow the indications
  - When prompted for a passphrase, enter a good one and save it in a safe place
  - Save the private key in a safe directory or on an external USB device (remember the path, needed to run a session with PuTTY)
  - Copy the whole content of the box under "Public key for pasting ..." (copy-paste) into a file ''id_rsa.pub''. It will have the right format to be accepted by OpenSSH (the SSH package available on Linux and therefore also on IBiSco)
  - Send the public key by mail to the admins: as written before, they will copy it into your ''.ssh'' directory on the cluster with the right permissions.

==== Available file systems ====

Users of the resource currently have the ability to use the following file systems:

  * ''/lustre/home/'' - file system shared between nodes and UI, created using Lustre technology, where users' homes reside
  * ''/lustre/scratch'' - file system shared between nodes, created using Lustre technology, to be used as a scratch area
  * ''/home/scratch'' - file system local to each node, to be used as a scratch area
  * ''/ibiscostorage'' - new scratch area shared among UI and computation nodes (available from 07/10/2022), **not** Lustre based

ATTENTION: ''/lustre/scratch'' and ''/home/scratch'' are ONLY accessible from the nodes (i.e. when one of them is accessed), not from the UI.

In-depth documentation on Lustre is available online at ''https://www.lustre.org/''.
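As an illustrative sketch (not an official template; the program and file names are hypothetical, and the SLURM directives are omitted for brevity), a batch job might stage its data on the shared scratch area and copy the results back to the home file system at the end:

  #!/bin/bash
  # create a per-job working directory on the shared Lustre scratch area
  # ($SLURM_JOB_ID is set by SLURM when the job runs on a node)
  WORKDIR=/lustre/scratch/$USER/run_$SLURM_JOB_ID
  mkdir -p "$WORKDIR"
  cd "$WORKDIR"

  cp ~/input.dat .            # stage input data from the Lustre home
  ./my_program input.dat      # hypothetical executable

  cp results.dat ~/           # copy results back to the home file system
  rm -rf "$WORKDIR"           # free the scratch space when done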
==== Job preparation and submission ====

=== Premise: new job management rules active from 9/10/2022 ===

To improve the use of resources, the job management rules have been changed.

  * New usage policies based on //fairshare// mechanisms have been implemented.
  * New queues for job submission have been defined:

  - **sequential** queue:
    * accepts only sequential jobs with a number of tasks not exceeding 1,
    * that do not use GP-GPUs,
    * for a total number of jobs running on it not exceeding 128,
    * with a maximum execution time limit of 1 week;
  - **parallel** queue:
    * accepts only parallel jobs with a number of tasks greater than 1 and less than 1580,
    * that use no more than 64 GP-GPUs,
    * with a maximum execution time limit of 1 week;
  - **gpus** queue:
    * accepts only jobs that use no more than 64 GP-GPUs,
    * with a number of tasks less than 1580,
    * with a maximum execution time limit of 1 week;
  - **hparallel** queue:
    * accepts only parallel jobs with a number of tasks greater than 1580 and less than 3160,
    * that make use of at least 64 GP-GPUs,
    * with a maximum execution time limit of 1 day.

From 9 October the current queue will be disabled and only those defined here will be active, and they must be selected explicitly. For example, to submit a job to the **parallel** queue, execute\\

  $ srun -p parallel

(a complete batch-script sketch is shown below, after the list of SLURM commands). If a job does not comply with the rules of the queue used, it will be terminated.

=== Use of resources ===

The resource manager SLURM is installed on the system to manage the cluster resources. Complete documentation is available at ''https://slurm.schedmd.com/''. SLURM is an open source software system for cluster management; it is highly scalable and integrates fault-tolerance and job scheduling mechanisms.

==== SLURM basic concepts ====

The main components of SLURM are:

  * //**nodes**// - computing nodes;
  * //**partitions**// - logical groups of nodes;
  * //**jobs**// - allocation of the resources assigned to a user for a given time interval;
  * //**job steps**// - sets of (typically parallel) activities inside a job.

Partitions can be thought of as //**job queues**//, each of which defines constraints on job size, time limits, resource usage permissions by users, etc.

SLURM provides centralized management through a daemon, //**slurmctld**//, which monitors resources and jobs. Each node is managed by a daemon, //**slurmd**//, which handles the requests for activity.

Some tools available to the user are:

  * [[https://slurm.schedmd.com/srun.html|srun]] - start a job;
  * [[https://slurm.schedmd.com/sbatch.html|sbatch]] - submit batch scripts;
  * [[https://slurm.schedmd.com/salloc.html|salloc]] - request the allocation of resources (nodes), with any constraints (e.g., number of processors per node);
  * [[https://slurm.schedmd.com/scancel.html|scancel]] - terminate queued or running jobs;
  * [[https://slurm.schedmd.com/sinfo.html|sinfo]] - get information about the system status;
  * [[https://slurm.schedmd.com/squeue.html|squeue]] - get the status of jobs;
  * [[https://slurm.schedmd.com/sacct.html|sacct]] - get accounting information about jobs;
  * …

A complete list of available commands is in the man pages (also available online at ''https://slurm.schedmd.com/man_index.html''): ''man [command]''
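As a sketch of a complete submission (the resource values, time limit and program name are illustrative, not prescriptions), a batch script for the **parallel** queue might look like this:

  #!/bin/bash
  #SBATCH --job-name=my_parallel_job    # illustrative job name
  #SBATCH --partition=parallel          # one of the queues defined above
  #SBATCH --ntasks=64                   # must respect the task limits of the chosen queue
  #SBATCH --time=1-00:00:00             # wall-clock limit (here 1 day), within the queue maximum
  #SBATCH --output=job_%j.out           # %j is replaced by the job id

  srun ./my_mpi_program                 # placeholder for your parallel executable

The script is submitted from the UI with ''sbatch'', e.g. ''$ sbatch myjob.sh'', and monitored with ''squeue'' (both commands are described below).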
==== Examples of use of some basic commands ====

== system and resources information ==

''**sinfo**'' - Check the status of resources (existing partitions and related nodes, ...) and the general system status.

Example:

  $ sinfo

Output:

  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  hpc*         up   infinite     32   idle ibiscohpc-wn[01-32]

The output shows partition information; in this example:

  * there is a partition named "hpc" (* marks the default partition);
  * the partition is available (status: up);
  * the partition can be used without time limits;
  * the partition consists of 32 nodes;
  * its status is idle;
  * the available nodes are named ''ibiscohpc-wn01'', ''ibiscohpc-wn02'', ..., ''ibiscohpc-wn32''.

''**squeue**'' - Check the status of the job queue.

Example:

  $ squeue

Output:

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   4815       hpc    sleep cnr-isas  R       0:04

The output shows, for each job:

  * the job identifier;
  * the name of the partition on which the job was launched;
  * the job name;
  * the name of the user who launched the job;
  * the job status (R = running);
  * the job execution time.

''**scontrol**'' - Detailed information about jobs and resources.

Example (detailed information about the ''ibiscohpc-wn02'' node):

  $ scontrol show node ibiscohpc-wn02

Output:

  NodeName=ibiscohpc-wn02 Arch=x86_64 CoresPerSocket=24
     CPUAlloc=0 CPUTot=96 CPULoad=0.01
     AvailableFeatures=HyperThread
     ActiveFeatures=HyperThread
     Gres=gpu:tesla:4(S:0)
     NodeAddr=ibiscohpc-wn02 NodeHostName=ibiscohpc-wn02 Version=20.11.5
     OS=Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018
     RealMemory=1546503 AllocMem=0 FreeMem=1528903 Sockets=2 Boards=1
     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
     Partitions=hpc
     BootTime=2022-02-01T16:24:43 SlurmdStartTime=2022-02-01T16:25:25
     CfgTRES=cpu=96,mem=1546503M,billing=96
     AllocTRES=
     CapWatts=n/a
     CurrentWatts=0 AveWatts=0
     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
     Comment=(null)

== job preparation and submission ==

''**srun**'' - Manage the execution of a parallel job on the cluster managed by SLURM. If necessary, srun allocates the resources for the job execution.

Some useful srun parameters are:

''-c'', ''--cpus-per-task=<ncpus>''
  * number of CPUs allocated per process. By default, one CPU is used per process.

''-l'', ''--label''
  * prepend to each line of stdout the number of the task to which the output refers.

''-N'', ''--nodes=<minnodes>[-maxnodes]''
  * minimum number (''minnodes'') of nodes to allocate for the job and, optionally, the maximum one.
  * If the parameter is not specified, the command allocates the nodes needed to satisfy the requirements specified by the parameters ''-n'' and ''-c''.
  * If the values are outside the allowed range for the associated partition, the job is placed in a ''PENDING'' state. This allows for possible execution at a later time, when the partition limit is possibly changed.

''-n'', ''--ntasks=<ntasks>''
  * number of tasks to run. ''srun'' allocates the necessary resources based on the number of required tasks (by default, one node is requested for each task but, using the ''--cpus-per-task'' option, this behavior can be changed).

Example, interactively access a node, from the UI:

  $ salloc srun --pty /bin/bash

Example, submit a batch job, from the UI:

  $ echo -e '#!/bin/sh\nhostname' | sbatch

Example, submit an MPI interactive job with [ntasks] tasks, from the UI:

  $ srun -n [ntasks] [program]

**Important command when using OpenMP**

Add the following command to the script used to submit an OpenMP job:

  export OMP_NUM_THREADS=[nthreads]

  * It specifies the number of threads to be used when running a code parallelized with OpenMP. The maximum value should be the number of processors available (see the sketch below).
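As a sketch, assuming a hypothetical OpenMP executable ''my_omp_program'', a batch script can derive the thread count from the number of CPUs requested per task:

  #!/bin/bash
  #SBATCH --partition=[queue]      # choose a queue compatible with the job
  #SBATCH --ntasks=1               # a single task...
  #SBATCH --cpus-per-task=16       # ...with 16 CPUs available for its OpenMP threads

  # use the CPUs granted by SLURM as the OpenMP thread count
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

  srun ./my_omp_program            # placeholder for your OpenMP executable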
For more information about OpenMP, please check [[https://www.openmp.org/spec-html/5.0/openmpse50.html|openMP]].

==== Tips for using the Intel OneAPI suite (compilers, libraries, etc. provided by Intel) ====

To use Intel's suite of compilers and libraries, you need to source (interactively, or inside any script in which they are needed) the command

  . /nfsexports/intel/oneapi/setvars.sh

==== Tips for using Red Hat Developer Toolset ====

For details about "Red Hat Developer Toolset" see
https://access.redhat.com/documentation/en-us/red_hat_developer_toolset/7/html/user_guide/chap-red_hat_developer_toolset

Here we report some examples showing how one can access the various development environments:

  * Create a bash subshell in which the tools are active (in this case gcc/g++/gfortran/... v.10):

  $ scl enable devtoolset-10 bash

  * Make the tools operational in the current shell (in this case gcc/g++/gfortran/... v.10):

  $ source scl_source enable devtoolset-10

==== Tip for using Singularity ====

The following script is an example of how to use Singularity:

  #!/bin/bash
  singularity run library://godlovedc/funny/lolcow

==== Tips for python users with the new conda environment ====

=== Base ===

To use python, it is necessary to start the conda environment using the following commands:

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  [Example: python example.py]
  conda deactivate

=== Tensorflow ===

The tensorflow sub-environment is activated after starting the conda environment:

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  conda activate tensorflowgpu
  [Example: python example.py]
  conda deactivate
  conda deactivate

=== Bio-Informatics ===

To use the bioconda sub-environment, the following commands have to be executed:

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  conda activate bioconda
  [Example: python example.py]
  conda deactivate
  conda deactivate

=== Pytorch ===

To use the Pytorch sub-environment, the following commands have to be executed:

  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  conda activate pytorchenv
  [Example: python example.py]
  conda deactivate
  conda deactivate

=== Packages list ===

To list the packages available in the current environment, run the command:

  conda list

==== Other tips for python ====

=== Parallel computation in python ===

  * The HPC system is used effectively by parallelizing processes. Code can be parallelized by distributing the tasks among the available nodes and their respective CPUs and GPUs. This information can be specified in a simple submission bash script (e.g. a ''sub.sh'' file) as follows (a complete sketch is shown after this block):

  #!/bin/bash
  #SBATCH --nodes=[nnodes]                     #number of nodes
  #SBATCH --ntasks-per-node=[ntasks per node]  #number of cores per node
  #SBATCH --gres=gpu:[ngpu]                    #number of GPUs per node
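As a sketch of a complete script built from the directives above and the conda setup described earlier (the resource values and the script name ''example.py'' are illustrative):

  #!/bin/bash
  #SBATCH --nodes=1                 # illustrative values: one node...
  #SBATCH --ntasks-per-node=8       # ...8 cores on that node...
  #SBATCH --gres=gpu:1              # ...and 1 GPU

  # activate the conda environment (here the tensorflow sub-environment)
  source /nfsexports/SOFTWARE/anaconda3.OK/setupconda.sh
  conda activate tensorflowgpu

  python example.py                 # the python code to run

  conda deactivate
  conda deactivate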
=== Example of parallel jobs submission ===

Suppose a given python code has to be executed for different values of a variable "rep". It is possible to execute the python codes in parallel during the job submission process by creating temporary files, one for each value rep=a1, a2, ...

The python code example.py can contain a line:

  rep=REP

The submission script sub.sh can be used to parallelize the process in the following way:

  #!/bin/bash
  #SBATCH --nodes=[nnodes]                     #number of nodes
  #SBATCH --ntasks-per-node=[ntasks per node]  #number of cores per node
  #SBATCH --gres=gpu:[ngpu]                    #number of GPUs per node

  NPROC=[nprocesses]     #number of processing units to be used at a time
  program=example.py     #the python code containing the placeholder REP
  tmpstring=tmp          #prefix of the temporary files that will be generated
  count=0                #counter of the temporary files

  for rep in {1..10}     #the value of rep runs from 1 to 10
  do
      tmpprogram=${tmpstring}_${rep}.py            #temporary file name for this value of rep
      sed -e "s/REP/$rep/g" $program > $tmpprogram #replace REP in the .py with the current rep
      python $tmpprogram &                         #run the temporary file in the background
      (( count++ ))                                #increase the count
      [[ $(( count % NPROC )) -eq 0 ]] && wait     #every NPROC launches, wait for them to finish
  done
  wait                   #wait for any remaining background jobs
  rm ${tmpstring}*       #optionally remove the temporary files after the execution

  * Parallel job submission can also be done with a job array; more information about job arrays can be found at [[https://slurm.schedmd.com/job_array.html]] (a minimal sketch is shown after this list).
  * Parallelization can also be implemented within the python code itself. For example, the evaluation of a function for different variable values can be done in parallel. Python offers many packages to parallelize a given process; the basic one among them is [[https://docs.python.org/3/library/multiprocessing.html|multiprocessing]].
  * Machine-learning frameworks such as Keras (included in TensorFlow) and PyTorch detect the available GPUs automatically.
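As a minimal sketch of the job-array alternative mentioned above (the file names and the GPU request are illustrative), each array task receives its own value of ''rep'' through the ''SLURM_ARRAY_TASK_ID'' environment variable:

  #!/bin/bash
  #SBATCH --array=1-10          # one array task per value of rep
  #SBATCH --ntasks=1            # each array task is a single process
  #SBATCH --gres=gpu:1          # optional: one GPU per array task

  # SLURM sets SLURM_ARRAY_TASK_ID to the index of the current array task
  rep=$SLURM_ARRAY_TASK_ID
  sed -e "s/REP/$rep/g" example.py > tmp_${rep}.py
  python tmp_${rep}.py

Submitted with ''sbatch'', this script generates ten independent jobs, one for each value of ''rep''.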
==== Tips for using gmsh (a mesh generator) ====

To use gmsh it is necessary to configure the execution environment (shell), in order to guarantee the availability of the necessary libraries, by running the following command:

  $ scl enable devtoolset-10 bash

Within the shell configured in this way, it is then possible to execute the gmsh command available in the directory

  /nfsexports/SOFTWARE/gmsh-4.10.1-source/install/bin

When the ad-hoc configured shell is no longer needed, you can terminate it with the command

  $ exit

==== Tips for using FOAM Ver 9.0 ====

To use this version, available in the directory /nfsexports/SOFTWARE/OpenFOAM-9.0/, you need to configure the environment as follows:

  $ source /nfsexports/SOFTWARE/OpenFOAM-9.0/etc/bashrc
  $ source /nfsexports/SOFTWARE/intel/oneapi/compiler/latest/env/vars.sh

==== Tips for using Matlab for the execution of parallel jobs on the IBiSCo HPC cluster ====

== Basic tips ==

  * To use the Matlab command window, please use ''ssh ibiscohpc-ui.scope.unina.it -l [username] -Y'' when logging into the IBISCO cluster.
  * Set up the Matlab environment by using the command ''/nfsexports/SOFTWARE/MATLAB/R2020b/bin/matlab''. This opens the MathWorks command window, where you will be able to add the settings file (see below).
  * Matlab version R2022a can be accessed using the command ''/nfsexports/SOFTWARE/MATLAB/R2022a/bin/matlab''.

== Configuration and execution ==

Attached you will find an example of a **Profile File** that can be used to configure the multi-node parallel machine for the execution of Matlab parallel jobs on the IBiSCo HPC cluster.\\
{{:wiki:SlurmIBISCOHPC.mlsettings.lsettings.zip|Example of configuration file for parallel execution}}

//The file must be decompressed before use.//

To be accessed by a Matlab program, the user **must first import** that file by starting the **Cluster Profile Manager** on the Matlab desktop (on the **Home** tab, in the **Environment** area, select **Parallel** > **Create and Manage Clusters**).\\

{{:wiki:FinestraConfigurazioneParallel.png?400|Parallel Configuration Window}}
//Figure 1: Parallel Configuration Window//

In the **Create and Manage Clusters** window, select the **Import** option.

{{:wiki:FinestraImportpParallel.png?400|Import Configuration Window}}
//Figure 2: Import Configuration Window//

Once the profile has been imported, it can be referenced by a Matlab parallel program using the profile name 'SlurmIBISCOHPC', i.e.

  mypool=parpool('SlurmIBISCOHPC', ___, ___, ___, ___)
  ...
  delete(mypool);

To modify the 'SlurmIBISCOHPC' profile the user can use

  - the **Create and Manage Clusters** window
  - the Matlab profile commands such as saveProfile (https://it.mathworks.com/help/parallel-computing/saveprofile.html)

=== Example of running a parallel matlab script ===

This is an example of using **parfor** to parallelize a for loop (demonstrated at [[https://it.mathworks.com/help/parallel-computing/decide-when-to-use-parfor.html?s_eid=PSM_15028|MathWorks]]). This example calculates the spectral radius of a matrix and converts a for-loop into a parfor-loop.

Open a file named **test.m** with the following code:

  mypool=parpool('SlurmIBISCOHPC', 5)  % 5 is the number of workers
  n = 100;
  A = 200;
  a = zeros(n);
  parfor i = 1:n
      a(i) = max(abs(eig(rand(A))));
  end
  delete(mypool);
  quit

To run this code, the following command executed on the UI can be used:

  /nfsexports/SOFTWARE/MATLAB/R2020b/bin/matlab -nodisplay -nosplash -nodesktop -r test

----

[[wiki:startdoc|Start]]\\