CNAG's Frequently Asked Questions (FAQ)
Table of Contents
- Lustre Filesystem
- Support Team
Backup Policy ↩
The backup policy on the CNAG cluster is to back up daily the contents of $HOME and certain directories of /project related to the sequencer runs and the production team.
Files in /scratch, or in /project directories outside the vetted set, are not backed up.
Which filesystems are available? What is their intended usage? ↩
CNAG’s cluster uses a shared filesystem (Lustre) so all your data is accessible from all nodes. Quotas are enforced to limit user data on Lustre.
- /home (NFS)
- /apps (NFS)
- /project (Lustre)
- /scratch (Lustre)
There is also a local hard disk in each node, accessible only from that node.
/home (NFS)
This is your home filesystem, where you should store your personal files and scripts. This directory should remain private.
This filesystem is backed up once a day.
/apps (NFS)
This is where applications are installed on the machine. Users are not allowed to install software or modify existing installations on this filesystem.
/project (Lustre)
This filesystem is intended to store production data that cannot be regenerated programmatically, typically the results of sequencers.
Some directories are backed up once a day.
/scratch (Lustre)
This filesystem offers the highest performance for distributed reads and writes and should be used for your daily operation. Temporary or intermediate files should be stored on the local hard drives whenever possible.
This filesystem is not backed up.
Local hard disk
This disk is intended for temporary files that do not need to be recovered: mesh partitions local to the node, MPI temporary communications between nodes, decompressed files pending filtering before recompression, sort intermediate files, files split by line count… The data is not recoverable after the job ends, so you must copy anything you want to keep to either /project or /scratch (depending on your team).
This filesystem is identified by the $TMPDIR environment variable; each job has a private directory in that filesystem, pointed to by the variable.
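The pattern above can be sketched as a small script: do the intermediate work inside the per-job local directory and copy only the final result to shared storage. This is a minimal sketch, assuming a POSIX shell; the fallback to mktemp is only so it also runs outside a job, where $TMPDIR may not be set, and /path/to/results is a placeholder for your /project or /scratch area.

```shell
#!/bin/bash
# Use the per-job node-local directory for intermediates; fall back to a
# temporary directory when running outside a job (assumption for this sketch).
WORKDIR="${TMPDIR:-$(mktemp -d)}"

# Example intermediate work: sort a file inside the node-local directory.
printf 'b\nc\na\n' > "$WORKDIR/input.txt"
sort "$WORKDIR/input.txt" > "$WORKDIR/sorted.txt"

# Copy the final result back to shared storage BEFORE the job ends;
# the destination below is a placeholder.
# cp "$WORKDIR/sorted.txt" /path/to/results/
cat "$WORKDIR/sorted.txt"
```

The key point is that nothing under $WORKDIR survives the job, so the copy-out step must happen inside the job script itself.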
How can I check how much free disk space I have available? What is a quota? ↩
All the filesystems available to you have quotas set. A quota limits the amount of space available to a user or group of users. All users have a per-user quota set on /project and /scratch. You can check your current usage with the command:
Also remember that the filesystem slows down when you write near your quota limit. Plan your quota needs, request any quota increase you may need in advance, and clean up and curate the data you keep on the filesystems. If you reach your limit, your jobs may fail.
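As a hedged fallback, assuming the standard Lustre client tools are installed and direct access to them is permitted, the generic Lustre quota query looks like this (the /scratch path is a placeholder):

```shell
# Query the per-user quota on a Lustre filesystem (guarded so the snippet
# also runs on machines without the lfs client installed).
if command -v lfs >/dev/null 2>&1; then
    lfs quota -u "${USER:-$(id -un)}" /scratch   # shows used space, quota and limit
else
    echo "lfs not available on this machine"
fi
```

The columns reported distinguish the soft quota (where writes start to degrade) from the hard limit (where writes fail outright).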
How can I open a GUI in the logins? ↩
To open GUIs you need to connect with the -X parameter in your ssh command (Linux/macOS) or with some form of X11 forwarding (Windows). Example:
ssh -X -l username <login>
ERROR: “/usr/bin/manpath: can’t set the locale; make sure $LC_* and $LANG are correct” or “cannot set LC_CTYPE locale” ↩
This error is related to the locale (the language-dependent character encoding) of your system being different from, or incompatible with, the cluster's. If you find yourself in this situation, please try the following:
LANG=es_ES.UTF-8 ssh -l username <login>
Some macOS versions have a bug in Terminal that ignores the previous setting and causes this error. You can disable this behaviour by unchecking “Set locale environment variables on startup” in Terminal Settings -> Advanced.
My job failed and I see a message like “OOM Killed…” in the logs. What is this? ↩
This is a message from the OS kernel stating that your process consumed too much memory, exceeding the node’s limits, and was therefore killed (OOM = Out Of Memory).
If you encounter this problem, try increasing the number of cores per task your jobs request, which increases the memory available to each task. It is recommended you contact our Support Team the first time you tweak these settings.
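A hypothetical sketch of such a tweak in a SLURM submission script follows; the directive names are standard SLURM, but the job name and values are placeholders, not site-specific recommendations:

```shell
#!/bin/bash
# Request more cores per task so each task gets a larger memory share.
#SBATCH --job-name=myjob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4    # was 1; more cores per task -> more memory per task

# Inside a running job, SLURM exports the granted value:
echo "cores per task: ${SLURM_CPUS_PER_TASK:-unset}"
```

Note the #SBATCH lines are comments to the shell and are only interpreted by sbatch at submission time.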
My job has been waiting for a long time. How can I see when it will be executed? ↩
There is no reliable method to predict when a certain job will start executing. The system is designed so all jobs will be executed eventually, but it may take longer for some of them.
My job exits with some error just after starting. What is wrong? ↩
This is usually related to configuration problems with the input or the application itself. Before retrying your submissions, please manually test a minimal execution to see what the problem is. If nothing turns up, or you don’t know how to fix it, please contact our Support Team with a detailed explanation of the problems encountered, the location and results of the tests you tried, and how to reproduce your execution.
What is SLURM? ↩
SLURM stands for Simple Linux Utility for Resource Management, and it is used as the job scheduler. This program interprets the amount of resources a job requires and controls how many resources are available for new jobs. It also tracks the execution state of jobs and interfaces with the MPI libraries to distribute resources within an allocation.
If you need more information please check the User’s Guide.
How can I check the status of my jobs? What does the status message […] mean? ↩
Use mnq and check the User’s Guide.
SLURM reports several state messages for the jobs. Some imply critical issues with the job that will prevent it from running while others are just informative messages. Here we list a few of them with their meaning. If your message doesn’t appear in this list, do not hesitate to contact our Support Team.
There is a limit on the number of jobs a user or group may have running at the same time. This message means your job is held back because your group has filled all its running slots; it must wait until one of the running jobs finishes before it can start.
The job does not have enough priority to start reserving resources for execution. As time passes and higher-priority jobs are executed, this job will reach the highest priority, start reserving resources, and eventually be executed.
There are not enough resources available to satisfy the job’s requirements. The job will wait until enough resources become available. This usually means the machine is full.
Some of the requested nodes are not available. They may be reserved or in maintenance, among other reasons. This message may appear when using certain partitions or reservations. You may contact our Support Team for further information.
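To see these messages for your own jobs, a sketch using the standard SLURM squeue client (assuming it is in your path; the format string is generic, not a site-specific command) is:

```shell
# List your jobs with their state (%T) and, for pending jobs,
# the reason they are waiting (%r). Guarded so the snippet also
# runs on machines without SLURM installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(id -un)}" -o "%.10i %.9P %.8T %r"
else
    echo "squeue not available on this machine"
fi
```

The reason column (%r) is where the state messages discussed above appear for jobs in the PENDING state.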
Lustre Filesystem ↩
What is Lustre? ↩
Lustre is a distributed high-performance filesystem available on the CNAG cluster. This filesystem is accessible from all logins and compute nodes at the same time. Right now there are two Lustre filesystems: /project and /scratch.
The Guide mentions special commands for Lustre. Which are those? Why should I use them? ↩
Lustre is specifically designed to provide high performance in a distributed environment, and some of the traditional UNIX commands (rm, ls, find…) are inefficient on it and may cause unnecessary and harmful load, even to the point of outages in extreme situations. Therefore, both the Lustre developers and the Support Team at BSC provide commands to perform those tasks as efficiently as possible.
Lustre provided commands
They are the same as the original UNIX commands, preceded by lfs. They do not have as many features as their UNIX counterparts, but most common usage is covered. You can check them all with:
- lfs ls: Efficient ls for Lustre
- lfs find: Efficient find for Lustre
- lfs cp: Common copy command
- lfs df: Accurate description of filesystem usage
There are many more commands but they are not useful for typical users.
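As an illustration, a common use of the Lustre-aware find is scanning for stale files that are candidates for cleanup. This is a minimal sketch assuming the Lustre client tools are installed; the /scratch path is a placeholder for your own directory:

```shell
# List regular files older than 30 days under a path, using lfs find
# so the scan stays efficient on Lustre. Guarded so the snippet also
# runs on machines without the lfs client.
if command -v lfs >/dev/null 2>&1; then
    lfs find /scratch -type f -mtime +30
else
    echo "lfs not available on this machine"
fi
```

Unlike plain find, lfs find queries the Lustre metadata servers directly instead of stat-ing every file, which is what keeps the load low.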
BSC Support provided commands
The Support Team at BSC provides some commands for ease of use. You can access them by:
module load bsc
And the commands are:
- lrm: Efficient rm for Lustre in two steps.
- lls: Shorthand for lfs ls.
- lcp: Shorthand for lfs cp.
- lfind: Shorthand for lfs find.
The lrm command searches efficiently for all files matching the specified pattern under a given path and creates a shell script that deletes each file individually. You can check the script to make sure every listed file can be deleted; then it just requires execution.
module load bsc
# This creates script rmlist.sh
lrm /path/to/my/files/*.bai
# Check its contents
cat rmlist.sh
# Everything correct, delete them
bash rmlist.sh
Support Team ↩
How/when may I contact Support Team? ↩
You may send an e-mail at any time and it will be answered on the next working day during office hours (9:00 - 18:00 CET). Bank holidays follow the Barcelona calendar.
- cnag_support AT bsc DOT es