NIC5 is the new (January 2021) High Performance Computing (HPC) massively parallel cluster of the University of Liège, installed in the framework of the Consortium des Équipements de Calcul Intensif (CÉCI) and funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11.

Hosted at the SEGI facility (Sart Tilman B26), it features 70 nodes with two 32-core AMD EPYC Rome 7542 CPUs at 2.9 GHz and 250 GB of RAM, 3 nodes with 1 TB of RAM, 520 TB of fast BeeGFS /scratch storage and a 100 Gbps InfiniBand HDR interconnect, for a total of 4672 cores.

The previous NIC4 supercomputer has its own dedicated page here.

Access and General Documentation

As for the 5 other CÉCI clusters, the main source of information concerning access to and usage of NIC5 is the CÉCI website.

Anyone who is officially affiliated with a member university of the CÉCI consortium (ULiège, UCLouvain, ULB, UNamur and UMons), holds an official university email address and is endorsed by a supervisor with a permanent academic or scientific position can request access to the NIC5 supercomputer and to the whole CÉCI supercomputing infrastructure.

The first step is to create a CÉCI account by visiting the login.ceci-hpc.be website (your computer must be connected to your university network, or you have to use a VPN, see below) and entering your email address. The full procedure is explained here.

The clusters must be accessed with a secure shell (SSH), through a gateway. More information here. ULiège users who are working from outside the University network must use the latest ULiège VPN https://my.segi.uliege.be/new-vpn to connect to the gwceci.uliege.be gateway. Using a PC or server inside the University to connect to the gateway instead of the VPN is strongly discouraged, for obvious security reasons.
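For illustration, a minimal ~/.ssh/config sketch to reach NIC5 through the gateway could look as follows; the host aliases, the cluster hostname and the key path are assumptions, and the CÉCI website provides a wizard that generates the exact configuration for your account:

    # Sketch of ~/.ssh/config (hostnames, login and key path are placeholders)
    Host gwceci
        Hostname gwceci.uliege.be
        User your-ceci-login
        IdentityFile ~/.ssh/id_rsa.ceci       # path to your CÉCI private key (assumed)

    Host nic5
        Hostname nic5.uliege.be               # assumed cluster login node
        User your-ceci-login
        IdentityFile ~/.ssh/id_rsa.ceci
        ProxyJump gwceci                      # hop through the ULiège gateway

With such a configuration, 'ssh nic5' connects in a single step from outside the cluster.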

To start working, you will need to write a submission script describing the resources you need and the operations you want to perform, and submit that script to the resource manager/job scheduler. The one installed on the CÉCI clusters is named Slurm. Find more information on how to do that here.
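As a minimal sketch of such a script (the job name, module and executable are placeholders, and the resource values are purely illustrative):

    #!/bin/bash
    # Minimal Slurm submission script sketch (names and values are placeholders)
    #SBATCH --job-name=my_test           # illustrative job name
    #SBATCH --ntasks=1                   # one task
    #SBATCH --cpus-per-task=1            # one core
    #SBATCH --mem-per-cpu=2000M          # memory per core
    #SBATCH --time=01:00:00              # one hour of walltime

    module load SomeSoftware             # placeholder module name
    srun ./my_program                    # placeholder executable

The script is then submitted with 'sbatch my_script.sh', and 'squeue -u $USER' shows its state in the queue.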

Please read carefully the FAQ and the Documentation sections before submitting your question, preferably via the CÉCI support wizard.

Specific Documentation

Preferred kind of jobs and queue configuration

NIC5 is ideally suited for MPI parallel jobs (several dozen cores) with heavy communication and/or a lot of (parallel) disk I/O operations, and for SMP/OpenMP parallel jobs running on a single node.

Accordingly, the default queue is configured with a maximum walltime of 2 days. If your job takes longer, increase the degree of parallelism, or implement checkpointing, a mechanism that allows stopping a computation at a given moment and restarting it at a later time (more on this here). If the parallel efficiency of your OpenMP job is not very good at high core count, don't forget that running 2 jobs with 16 cores each should be more efficient than a single job with 32 cores! And if your jobs still need a longer running time, use other CÉCI clusters (Hercules2, Dragon2). The maximum number of cores per user is currently set to 1024 (these settings are indicative only, and may change depending on the load on the cluster).
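As an illustration, the resource requests for a 16-core OpenMP job bound by the 2-day walltime limit could be sketched as follows (job name and values are indicative only):

    #SBATCH --job-name=openmp_16c        # illustrative job name
    #SBATCH --nodes=1                    # OpenMP jobs run on a single node
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16           # 16 OpenMP threads
    #SBATCH --time=2-00:00:00            # maximum walltime of the default queue

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_openmp_program                  # placeholder executable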

70 nodes have 250 GB of RAM, and thus 3.9 GB per core. To maximize the occupation rate of all the cores of the cluster, try not to launch MPI or OpenMP jobs with more than 3.9 GB (or 4000 MB) per core.
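The corresponding Slurm directives for an MPI job respecting this limit could be sketched as follows (the task count and module name are illustrative):

    #SBATCH --ntasks=64                  # e.g. 64 MPI ranks, one full node
    #SBATCH --mem-per-cpu=3900M          # stay at or below ~3.9 GB per core

    module load SomeMPILibrary           # placeholder module name
    srun ./my_mpi_program                # placeholder executable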

3 nodes in the Slurm partition "hmem" have 1 TB of RAM. Do not use these nodes to run jobs that could run on the standard 250 GB nodes!
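Jobs that genuinely need more memory than a standard node offers have to be sent explicitly to that partition, for instance (the memory value below is purely illustrative):

    #SBATCH --partition=hmem             # 1 TB nodes, only for jobs that really need them
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=500G                   # illustrative: more than a 250 GB node can provide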

To enforce these guidelines, the maximum number of concurrently running jobs per user is set to 256, and the maximum number of jobs a user can submit at the same time is set to 512 (sum of the RUNNING and PENDING jobs). Again, these settings are indicative only, and may change depending on the load on the cluster.

See also Which cluster should I use?

Storage

The home directories of the users ($HOME, on the /home/ partition) are hosted on a 50 TB NFS server, with a quota of 100 GB and 200,000 files per user (which can be increased upon motivated request). Check your local quota with the 'quota' command or with the more general 'ceci-quota' command.
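For example, from a login node:

    quota                                # local quota on the NFS /home
    ceci-quota                           # more general overview of your CÉCI storage quotas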

The $HOME directories ARE NOT BACKED UP! It is your responsibility to keep a copy of your important files and directories elsewhere. Your home directory is well suited to hold your configuration files, your programs (sources and executables), your input data and your important result files (if they fit within your quota). Don't launch your jobs directly in your home directory; instead, work in your $GLOBALSCRATCH directory (see below).

NIC5 has a second, independent storage system: a 520 TB fast parallel distributed BeeGFS file system. Each user automatically has a $GLOBALSCRATCH directory on that /scratch partition, from which jobs should be launched. There is a quota of 5 TB and 500,000 files per user (which can be increased upon motivated request). This scratch partition is, of course, not backed up.
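A common pattern inside a submission script, sketched below with placeholder names, is to stage the work in $GLOBALSCRATCH and copy only the important results back to $HOME:

    # Sketch of a scratch-based workflow (directories and file names are placeholders)
    WORKDIR=$GLOBALSCRATCH/my_run_$SLURM_JOB_ID
    mkdir -p "$WORKDIR"
    cp $HOME/inputs/my_input.dat "$WORKDIR"/     # placeholder input file
    cd "$WORKDIR"

    srun ./my_program my_input.dat               # placeholder executable

    cp results.out $HOME/results/                # keep only the important results in $HOME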

Moreover, the CÉCI common filesystem is fully available on the 6 CÉCI clusters, and the /CECI/ partition is directly accessible from all the login and compute nodes on all the clusters. Make sure to try it out by just typing 'cd $CECIHOME ; pwd'. More information here.


Last edit: March 3rd 2021 by David Colignon
