== The Storage Node Architecture ==

The cluster has 4 Dell R740 storage nodes, each offering 16 16 TB SAS HDDs and 8 1.9 TB SATA SSDs. Each node is equipped with two InfiniBand EDR ports, one of which is connected to the Mellanox InfiniBand switch dedicated to storage, which guarantees a 100 Gb/s connection to all the compute nodes. \\ While the aforementioned nodes are dedicated to the home and scratch areas of users, a separate 3 PB storage system
== The Compute Node Architecture ==
Complete documentation is available at ''

SLURM is an open-source software system for cluster management; it is highly scalable and integrates fault-tolerance and job scheduling mechanisms.
==== SLURM basic concepts ====
  * …
A complete list of available commands is in the man pages (also available online).
==== Examples of use of some basic commands ====
==== Preparation and submission of jobs ====

  * to interactively access a node from the UI, run

    salloc

  * to submit a batch job from the UI, run (a complete example script is sketched after this list)

    echo -e '#

  * to submit an interactive MPI job with <N> tasks, from the UI, run

    srun -n <N> <

  * to check the status of the resources from the UI, run

    sinfo

  * to check the status of jobs from the UI, run

    squeue

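For reference, a minimal sketch of a batch script that could be submitted with ''sbatch'' is shown below; the job name, resource values and output file name are illustrative placeholders, not site defaults.

<code>
#!/bin/bash
#SBATCH --job-name=test_job        #illustrative job name
#SBATCH --nodes=1                  #number of nodes
#SBATCH --ntasks-per-node=1        #number of tasks (cores) per node
#SBATCH --time=00:10:00            #requested wall-clock time
#SBATCH --output=test_job_%j.out   #standard output file (%j is replaced by the job id)

srun hostname                      #the command(s) to be executed
</code>

Once submitted, the job can be monitored with ''squeue'' and the overall resource status checked with ''sinfo'', as noted above.
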
==== Available file systems ====

== NVIDIA HPC SDK (compiler suites, libraries, etc. provided by NVIDIA) ==

  * //**Ver 20.10**// - available in the directory ''

== OpenMPI ==

  * //**Ver 4.1.0rc5**//

== Julia ==

  * //**Ver 1.6.1**// - interpreter available in the directory ''

== FFTW libraries ==

  * //**Ver 3.3.10**// - compiled with the Intel compilers, available in the directory ''

== Anaconda 3 environment ==

  * available in the directory ''

=== Complete software packages for specific applications ===

== Matlab ==

  * //**Ver R2020b**// - available in the directory ''

== Quantum ESPRESSO ==

  * //**Ver 7.0**// - available in the directory ''

== OpenFOAM ==

  * //**Ver 7.0**// - available in the directory ''

== Rheotool ==

  * //**Ver 5.0**// - available in the same directory as OpenFOAM

==== For Python users ====

=== Base ===

To use Python, it is necessary to start the conda environment using the following command,

<code>
…
</code>

To leave the base environment, run

<code>
conda deactivate </code>
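
For batch jobs, the environment has to be activated inside the submission script as well. A minimal sketch, assuming a standard Anaconda layout; ''[anaconda3 dir]'' and ''[python code].py'' are placeholders, not actual paths on the cluster:

<code>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

#assumption: standard Anaconda layout; replace [anaconda3 dir] with the Anaconda 3 directory listed above
source [anaconda3 dir]/bin/activate

python [python code].py            #placeholder for the user's code
</code>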

=== Tensorflow ===

The tensorflow sub-environment is activated, after starting the conda environment, as follows:

<code>
conda activate tensorflowgpu </code>

To leave it, deactivate both the sub-environment and the base environment:

<code>
conda deactivate
conda deactivate </code>
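
Once the ''tensorflowgpu'' environment is active, one can quickly verify that the GPUs are detected; this is a sketch assuming a TensorFlow 2.x installation:

<code>
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
</code>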

=== Bio-Informatics ===

To use the bioconda sub-environment, activate it after starting the conda environment:

<code>
conda activate bioconda </code>

To leave it, deactivate both the sub-environment and the base environment:

<code>
conda deactivate
conda deactivate </code>

=== Packages list ===

To list the available packages in the active environment, use:

<code> conda list </code>
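
To check whether a specific package is installed, the output can be filtered; ''numpy'' here is only an illustrative package name:

<code> conda list | grep numpy </code>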

=== Parallel computation in Python ===

  * The effective usage of the HPC cluster requires parallelizing the processes. The codes can be parallelized by distributing the tasks among the available nodes and their respective CPUs and GPUs. This information can be specified in a simple submission bash script (a sub.sh file) as follows:

<code>
#SBATCH --nodes=[nnodes]                    #number of nodes
#SBATCH --ntasks-per-node=[ntasks per node] #number of cores per node
#SBATCH --gres=gpu:[ngpus per node]         #number of GPUs per node
</code>

=== Example of parallel job submission ===

Suppose a given Python code has to be executed for different values of a given variable:

<code>
…
</code>

The submission script sub.sh, submitted from the UI with ''sbatch sub.sh'', can be used to parallelize the process in the following way:

<code>
#!/bin/bash
#SBATCH --nodes=[nnodes]                    #number of nodes
#SBATCH --ntasks-per-node=[ntasks per node] #number of cores per node
#SBATCH --gres=gpu:[ngpus per node]         #number of GPUs per node

NPROC=[nprocesses]            #number of processes to run concurrently
program=[python code].py      #the Python code to be executed

tmpstring=tmp

count=0
for rep in {1..10};
do
  tmpprogram=${tmpstring}_${rep}.py
  #substitute the current value of the variable into a temporary copy of the code
  sed -e "s/[variable name]/${rep}/" $program > $tmpprogram
  python $tmpprogram &                      #run the temporary files in the background
  (( count++ ))                             #count the launched processes
  [[ $(( count % NPROC )) -eq 0 ]] && wait  #wait for the parallel programs to finish.
done
rm ${tmpstring}*                            #remove the temporary files
</code>

  * Parallel job submission can also be done through a job array (a sketch follows this list). More information about job arrays can be found at [[https://

  * Parallelization can also be implemented within the Python code itself: for example, the evaluation of a function for different values of a variable can be done in parallel. Python offers many packages to parallelize a given process; the basic one among them is [[https:// (a sketch using the standard ''multiprocessing'' module is also given after this list).

  * The Keras (in TensorFlow) and PyTorch modules, which are mainly used for machine learning, detect the GPUs automatically.
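
As a minimal illustration of the job-array approach: each array task gets its own index in ''SLURM_ARRAY_TASK_ID''. This sketch assumes the Python code reads the value of the variable from the command line (instead of the sed substitution used above); the array range, resources and script name are placeholders.

<code>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --array=1-10                 #launch 10 array tasks with indices 1..10

#each array task runs the same code with a different value, taken from the array index
python [python code].py ${SLURM_ARRAY_TASK_ID}
</code>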
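
As an illustration of in-code parallelism, here is a sketch using the standard ''multiprocessing'' module; the function, the values and the number of worker processes are placeholders chosen for the example and should be tuned to the allocated cores.

<code>
from multiprocessing import Pool

def evaluate(x):
    #placeholder for the real computation on one value of the variable
    return x * x

if __name__ == "__main__":
    values = range(1, 11)                 #the different values of the variable
    with Pool(processes=4) as pool:       #4 worker processes (illustrative)
        results = pool.map(evaluate, values)
    print(results)
</code>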