The new (January 2021) NIC5 cluster is available! It features 70 nodes with two 32-core AMD EPYC Rome 7542 CPUs at 2.9 GHz and 256 GB of RAM, 3 nodes with 1 TB of RAM, 520 TB of fast BeeGFS /scratch and a 100 Gbps Infiniband HDR interconnect, for a total of 4672 cores. See http://www.ceci-hpc.be/clusters.html#nic5 and https://www.campus.uliege.be/nic5.
Following the security incident of 05/06/2020, this paragraph contains important and up-to-date information concerning the status, access and usage of NIC4:
- New CÉCI accounts are no longer created on NIC4, and existing accounts are no longer automatically renewed. Existing users are strongly encouraged to back up their important data (/home and /scratch), delete unneeded files, and migrate to NIC5.
- The new NIC4 login node login-nic4.segi.ulg.ac.be is available from the gateways installed in each University.
- Users who have requested the renewal of their account since 09/06/2020 have access to their files in their /home and /scratch directories and are encouraged to retrieve their important data and delete unneeded files.
- The documentation about how to connect to the clusters through the new gateways has been updated; see the Access and General Documentation section below.
- ULiège users who are working from outside the University network must use the latest ULiège VPN https://my.segi.uliege.be/new-vpn to connect to the gwceci.uliege.be gateway. Using a PC or server inside the University to connect to the gateway instead of the VPN is strongly discouraged for obvious security reasons.
- The former NIC4 login node that has been compromised is in quarantine. We repurposed a computing node into the new login-nic4 login node, with a full update of the kernel and the application of all the existing security updates. Because of these updates, the new login node has more recent versions of some libraries and of the Infiniband drivers than the computing nodes.
- Therefore, some programs that were compiled on the old login node may no longer run on the new login node, but will still continue to run through Slurm on the compute nodes. It is also probable that you will not be able to compile MPI programs on the new login node. We will keep you informed here of possible workarounds. In the meantime, to compile and run new MPI programs, use the Lemaitre3 or NIC5 clusters.
- So, before launching any program interactively on the new login node, or submitting a Slurm job, please back up all your existing programs/executables! Then test whether previous Slurm jobs that succeeded before still run properly when submitted as-is, without any change (again, keep a full backup of the original programs and all input files relevant to these previous jobs).
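The backup step above can be sketched with rsync, assuming an SSH alias 'nic4' for login-nic4.segi.ulg.ac.be (configured through your university gateway) and a local destination directory of your choice; the paths here are illustrative only:

```shell
# Pull your NIC4 /home and /scratch data to a local machine before testing.
# 'nic4' is an assumed SSH alias for login-nic4.segi.ulg.ac.be (via the gateway).
mkdir -p ~/nic4-backup/home ~/nic4-backup/scratch
rsync -avz --progress nic4:~/          ~/nic4-backup/home/
rsync -avz --progress nic4:'$SCRATCH/' ~/nic4-backup/scratch/
```

The single quotes around '$SCRATCH/' let the remote shell, not your local one, expand the variable.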
NIC4 is the old (February 2014!) High Performance Computing (HPC) massively parallel cluster of the University of Liège, installed in the framework of the Consortium des Équipements de Calcul Intensif (CÉCI) and funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11.
Hosted at the SEGI facility (Sart Tilman B26), it features 128 compute nodes with two 8-core Intel E5-2650 processors at 2.0 GHz and 64 GB of RAM (4 GB/core), interconnected with a QDR Infiniband network (2:1 blocking factor), and having exclusive access to a fast 144 TB FHGFS parallel filesystem.
Access and General Documentation
Anyone who is officially affiliated with a university member of the CÉCI consortium (ULiège, UCLouvain, ULB, UNamur and UMons), is holder of an official university email address and is endorsed by a supervisor with a permanent academic or scientific position, can claim access to the NIC4 supercomputer and all the CÉCI supercomputing infrastructure.
The first step is to create a CÉCI account by visiting the login.ceci-hpc.be website (your computer must be connected to your university network, see below) and entering your email address. The full procedure is explained here.
The clusters must be accessed with a secure shell (SSH), through a gateway. More information here. ULiège users who are working from outside the University network must use the latest ULiège VPN https://my.segi.uliege.be/new-vpn to connect to the gwceci.uliege.be gateway. Using a PC or server inside the University to connect to the gateway instead of the VPN is strongly discouraged for obvious security reasons.
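For convenience, the gateway hop can be put in your SSH configuration; a minimal sketch, assuming your CÉCI key is in ~/.ssh/id_rsa.ceci and 'myceciuser' stands for your actual CÉCI login:

```shell
# Hypothetical ~/.ssh/config fragment (adapt the user name and key path).
Host gwceci
    HostName gwceci.uliege.be
    User myceciuser
    IdentityFile ~/.ssh/id_rsa.ceci
Host nic4
    HostName login-nic4.segi.ulg.ac.be
    User myceciuser
    IdentityFile ~/.ssh/id_rsa.ceci
    ProxyJump gwceci        # hop through the gateway automatically
```

With this in place, 'ssh nic4' connects through the gateway in one step (ProxyJump requires OpenSSH 7.3 or later).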
To start working, you will need to write a submission script describing the resources you need and the operations you want to perform, and submit that script to the resource manager/job scheduler. The one installed on the CÉCI clusters is named Slurm. Find more information on how to do that here.
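As a minimal sketch of such a submission script (the job name, output file and resource values are placeholders to adapt):

```shell
#!/bin/bash
# Minimal Slurm submission script (sketch): adapt names and resources.
#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out    # %j is replaced by the job ID
#SBATCH --time=0-01:00:00        # 1 hour (the NIC4 maximum is 3 days)
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4000       # in MB; NIC4 provides 4 GB per core

echo "Hello from job ${SLURM_JOB_ID:-interactive} on $(hostname)"
```

Submit it with 'sbatch hello.sh' and monitor it with 'squeue -u $USER'.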
Preferred kind of jobs and queue configuration
NIC4 is ideally suited for massively parallel jobs (MPI, several dozens of cores) with many communications and/or a lot of (parallel) disk I/O operations.
Accordingly, the default queue is configured to allow jobs of maximum 3 days. If your job takes longer, increase the degree of parallelism, or implement checkpointing, a mechanism that allows stopping a computation at a given moment and restarting it at a later time (more on this here). The maximum number of cores per user is currently set to 480 (these settings are indicative only, and may change depending on the load on the cluster).
The 128 nodes have 64 GB of RAM and thus 4 GB per core. To maximize the occupation rate of all the cores of the cluster, do not launch MPI jobs with more than 4 GB per core.
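The resource request of an MPI job respecting that limit could look like the following sketch (the module name is taken from this page; './my_mpi_program' is a placeholder for your own executable):

```shell
#!/bin/bash
# Sketch of an MPI job staying at 4 GB per core (adapt to your program).
#SBATCH --job-name=mpi_job
#SBATCH --ntasks=64              # 64 MPI ranks = 4 full 16-core nodes
#SBATCH --mem-per-cpu=4000       # in MB: at most 4 GB per core
#SBATCH --time=2-00:00:00        # 2 days, under the 3-day limit

module load openmpi/qlc/gcc/64/1.6.4
mpirun ./my_mpi_program          # './my_mpi_program' is a placeholder
```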
A small number of SMP/OpenMP parallel jobs running on one node are allowed, but they should also try to respect the 4 GB of RAM per core limitation. If your jobs need more RAM or longer running time, use other CÉCI clusters. And if the parallel efficiency of your OpenMP job is not ideal, don't forget that running 2 jobs with 8 cores each should be more efficient than a single job with 16 cores!
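A single-node OpenMP job following the same guideline could be sketched as follows ('./my_omp_program' is a placeholder):

```shell
#!/bin/bash
# Sketch of a single-node SMP/OpenMP job (adapt names and resources).
#SBATCH --job-name=omp_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # 8 threads on one node (16 cores maximum)
#SBATCH --mem-per-cpu=4000       # in MB: respect the 4 GB/core guideline
#SBATCH --time=1-00:00:00

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-8}
./my_omp_program                 # placeholder executable
```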
A small number of serial jobs using no more than 16 GB of RAM are also allowed. If your jobs need more RAM or longer running time, use other CÉCI clusters (Hercules2, Dragon2).
To enforce these guidelines, the maximum number of concurrently running jobs per user is set to 128, and the maximum number of jobs a user can submit at the same time is set to 256 (sum of the RUNNING and PENDING jobs). Again, these settings are indicative only, and may change depending on the load on the cluster.
See also Which cluster should I use?
The home directories of the users ($HOME on the /home/ partition) are hosted on a 70 TB NFS server, with a quota of 20 GB per user (which can be increased upon motivated request). Check your local quota with the 'quota' command or with the more general 'ceci-quota' command.
The $HOME directories ARE NOT BACKED UP! It is your responsibility to keep a copy of your important files and directories. Your home directory is well suited to hold your configuration files, your programs (sources and executables), your input data and your important result files (if they fit within your quota). Don't launch your jobs directly in your home directory; instead, work in your $SCRATCH directory (see below).
NIC4 has a second independent storage system, a 144 TB very fast parallel distributed FHGFS file system. Each user automatically has a $SCRATCH = $GLOBALSCRATCH directory on that /scratch partition, where jobs should be launched. There is no quota on this partition.
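A typical session therefore stages data to $SCRATCH, runs there, and copies important results back under the $HOME quota; the file and script names below are illustrative only:

```shell
# Work in $SCRATCH, keep only important results in $HOME (illustrative names).
mkdir -p "$SCRATCH/myrun" && cd "$SCRATCH/myrun"
cp ~/inputs/config.dat .         # hypothetical input file staged from $HOME
sbatch ~/jobs/myjob.sh           # hypothetical submission script
# ...once the job has finished:
cp results.dat ~/results/        # keep results within your 20 GB quota
```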
Moreover, the CÉCI common filesystem is fully available on the 6 CÉCI clusters, and the /CECI/ partition is directly accessible from all the login and compute nodes on all the clusters. Make sure to try it out by just typing 'cd $CECIHOME ; pwd'. More information here.
'module available' shows you the list of all the installed software and applications. If you need more information about a module, use 'module show software_name' and/or 'module help software_name'.
BLAS and LAPACK: If you need the BLAS and/or LAPACK libraries, use the 'openblas/0.2.20' module which provides an optimized and multithreaded implementation of the BLAS and LAPACK libraries (the number of threads is controlled by the $OPENBLAS_NUM_THREADS environment variable which is set by default to 1), or use the Intel MKL library 'intel/mkl/64/11.1/2013_sp1.1.106'.
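Linking against OpenBLAS can be sketched as follows ('myprog.c' is a placeholder source file):

```shell
# Compile against the multithreaded OpenBLAS and control its thread count.
module load openblas/0.2.20
gcc -O2 -o myprog myprog.c -lopenblas    # 'myprog.c' is a placeholder
export OPENBLAS_NUM_THREADS=4            # the default set by the module is 1
./myprog
```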
MPI (Message Passing Interface): try first OpenMPI version 1.6.4, compiled with GCC ('openmpi/qlc/gcc/64/1.6.4') or with Intel compilers ('openmpi/qlc/intel/64/1.6.4').
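Compiling and test-running an MPI code with the GCC stack can be sketched as follows ('hello_mpi.c' is a placeholder; note the caveat above about MPI compilation on the new login node):

```shell
# Compile an MPI program with the GCC OpenMPI stack and run it on 4 ranks.
module load openmpi/qlc/gcc/64/1.6.4
mpicc -O2 -o hello_mpi hello_mpi.c       # 'hello_mpi.c' is a placeholder
mpirun -np 4 ./hello_mpi                 # small interactive test
```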
Python: The basic version of Python installed with the system is 2.6.6. If you need a more recent version, there is a 'python/2.7.6' module. If you need a very recent Python2, or Python3, see the EasyBuild section below.
Matlab/Octave: Matlab licenses are expensive and not easy to manage at the ULg level for a cluster available to users from other universities, so Matlab will not be installed on NIC4. But we provide a 100% free and mostly compatible alternative: Octave! ('module load EasyBuild Octave/4.4.1-foss-2018b ; octave --gui'). And there is also Scilab ('scilab/5.4.1'). Moreover, we installed modules ('mcr/R201*_v*') for different versions of the Matlab Compiler/Matlab Runtime, which enable the execution of compiled MATLAB applications or components on computers that do not have MATLAB installed (see also https://indico.cism.ucl.ac.be/event/19/ ).
EasyBuild: loading the 'EasyBuild' module gives you access to an experimental and untested list of additional software ('module load EasyBuild ; module avail'). EasyBuild modules using MPI may have some problems. Do not try to mix modules coming from the main '---- /cm/shared/modulefiles ----' section with modules coming from the '---- /home/easybuild/Modules/modulefiles ----' section.
Last edit: March 3rd 2021 at 16:00 by David Colignon