HPC User Portal Guide
Table of Contents
- What is HPC User Portal
- Logging in
- Job monitoring
- Maintenance and Machine Status
- Machine statistics
- Periodic benchmarks
What is HPC User Portal
The HPC User Portal is a job and resource monitoring platform developed by the BSC Support Team’s software engineers. With it, every HPC machine user can check the status and general resource usage metrics of the jobs they have launched. The portal also provides historical machine statistics (in terms of available and allocated CPUs) for the primary BSC machines. The platform is still growing, so it will offer more information over time.
Before this system was implemented, users had to contact support to obtain resource usage metrics and information about their jobs. That is no longer necessary. This guide explains how to use the portal and what it can do for you.
Logging in
The portal is available at this URL: https://userportal.bsc.es
You will be greeted with the login screen. If you have never logged in before, you can request a password using the “forgot password” procedure. The only thing requested is the e-mail address associated with your BSC account. After filling in the form, you will receive an e-mail containing a URL where you can set up your password.
With that out of the way, we can proceed to the next sections.
Job monitoring
The main page of the HPC User Portal is the job monitoring screen. It lists all the jobs launched on all machines by every account you have. Each entry summarizes the general characteristics of the job (such as its name, user, status and node/task configuration). If a listed job is in the “running” status, the entry also shows its current CPU and memory usage.
By default no filtering is applied, so the page shows all your jobs. You can narrow the list down using coarse filters or more specific search options. For example, you can list only the jobs launched on a specific machine by clicking on the “All machines” button and selecting the desired one. The same can be done with the “All accounts” button if you have more than one account.
If your search needs more granularity, you can filter jobs by their characteristics using the search function located at the top right corner of the page. It brings up a form where you can specify constraints such as job ID, job status, QOS or launch date.
Once you have located the desired job, you can check its properties using either the preview button or the view button. The difference between them is the level of detail they give. The preview also uses a pop-up window instead of a full page. We’ll skip the preview and show the “view” option directly, as the preview contains just a subset of the same data. Here’s an example of a job:
As you can see, the “view” function gives specific details about the job execution. One of the most interesting features is the job metric histogram, which shows how the job has evolved over the course of its execution in terms of CPU usage, memory usage and power consumption. You can also download this information by clicking on the top right corner of each histogram and selecting your preferred format (CSV is one of the options).
This type of information used to be accessible only to support, so users had to request it specifically. With this new system, users can quickly access it themselves.
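As an illustration of what can be done with a downloaded CSV file, the following Python sketch computes the number of samples, the mean and the peak of one metric. This guide does not describe the layout of the exported file, so the column names used here (timestamp, cpu_usage) and the sample values are assumptions made for the example. Check the header of your own file and adjust them accordingly.

```python
import csv
import io
from statistics import mean


def summarize_metric(csv_file, value_column):
    """Return (sample count, mean, max) for one numeric column of a CSV export."""
    reader = csv.DictReader(csv_file)
    values = []
    for row in reader:
        cell = (row.get(value_column) or "").strip()
        if not cell:
            continue  # skip samples with no recorded value
        values.append(float(cell))
    if not values:
        raise ValueError(f"no numeric data in column {value_column!r}")
    return len(values), mean(values), max(values)


# Hypothetical export contents (assumed layout, not the portal's documented one).
# For a real file, pass open(path, newline="") instead of the StringIO object.
sample = io.StringIO(
    "timestamp,cpu_usage\n"
    "2024-01-01T10:00:00,50.0\n"
    "2024-01-01T10:01:00,100.0\n"
    "2024-01-01T10:02:00,\n"
    "2024-01-01T10:03:00,75.0\n"
)
count, average, peak = summarize_metric(sample, "cpu_usage")
print(f"{count} samples, mean {average:.1f}, peak {peak:.1f}")
```

The function takes the column name as a parameter so that the same code can be reused for the memory and power exports once you know how their columns are named.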
Maintenance and Machine Status
The standard procedure for announcing machine maintenance dates and incidents was (and still is) sending e-mails to all the affected users. Some users may find keeping track of all the dates a bit burdensome. For this reason, the HPC User Portal includes a section where you can follow the current operational status of all the machines, along with all scheduled maintenance periods and any ongoing system issues. You can access it from the dropdown menu located at the top right corner of the page, next to your user name. Here’s how it looks:
As you can see, it lists all scheduled and past maintenance periods, specifying the machines affected and the estimated time that they will last (or lasted). The first row lists all the machines with their associated status color. Green means that the machine is not affected by any maintenance or issue. Otherwise, the color is red.
Each user can also check how much CPU time they have used on each machine over a defined time period. You can access the accounting page through the same menu you used to access the maintenance status page. You will need to specify which account and which machine to display, along with a start and end date. Here’s how it looks:
Each point in the diagram can be checked to see the exact amount of time spent that day. For regular users, the data shown is restricted to their own accounts. If you are a group responsible or team leader, you can also check the accounting of all your group members, as well as the accounting of the whole group. This can be used as a monitoring tool to keep the group’s CPU time usage in check.
It’s important to note that if there is no relevant time consumption in the specified range of days, the portal might not be able to generate a diagram, as it won’t have the data it needs.
Machine statistics
One of the most recent additions is the “Machines Stats” tab, where you can check a chronological diagram of the global resource usage of the machines in terms of CPU allocation and job and queue status. This way, every user can check the general usage and state of a machine at a given time.
Here we can see an example of the diagrams:
Periodic benchmarks
One useful feature of the portal is the visualization of historical performance metrics for various programs that are used as benchmarks. These benchmarks are run several times each day. The goal of this system is to help track down system issues that may be slowing down the HPC machines.
This is the list of software used for benchmarking purposes:
- HPCG (GPU version)
- Linpack (GPU version)
And finally, here’s an example of the feature: