Administration Guide

This page documents the processes that support the development, deployment, and continuous integration of the Spack infrastructure at NERSC.

Login Access

You can access Cori and Perlmutter; for more details see Connecting to NERSC. If either system is down, you can log in to the data transfer nodes (dtn[01-04].nersc.gov) and reach the appropriate system from there. Please check the NERSC MOTD at https://www.nersc.gov/live-status/motd/ for live system status updates.

To access a TDS system such as muller or gerty, you will need to log in to one of the other systems (cori, perlmutter, dtn) first and then run the following:

ssh dtn01.nersc.gov
ssh gerty

It is probably a good idea to run usgrsu once you are on the correct login node; otherwise you may be prompted for the e4s user's password.

The e4s user is a collaboration account, a shared account used for Spack deployments. You will need to log in as the e4s user via the usgrsu command, or use sshproxy to obtain a 24-hour credential and then ssh as the collaboration account. When prompted for a password, enter the NERSC password for your own username, not for the e4s user.
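
A minimal sketch of both approaches (the exact usgrsu and sshproxy invocations are illustrative; consult the NERSC collaboration account and sshproxy documentation for the authoritative syntax):

# on a NERSC login node: switch to the e4s collaboration account
usgrsu e4s

# or, from your local machine: request a 24-hour key for the collaboration account, then ssh
./sshproxy.sh -c e4s
ssh -i ~/.ssh/e4s e4s@perlmutter.nersc.gov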

Only members of the c_e4s unix group have access to the collaboration account. You can run the following to see the list of users that belong to the group. If you do not belong to this group and should, please file a ticket at https://help.nersc.gov.

getent group c_e4s

Production Software Stack

The Spack stack is installed on the shared filesystem at /global/common/software/spackecp. The project space has quota limits on both disk space and inode count. To check the quota usage, run the following:

cfsquota /global/common/software/spackecp

The production installations of the E4S stack on Perlmutter are stored in the perlmutter subdirectory, with one directory per stack version, as follows:

(spack-pyenv) e4s:login22> ls -ld /global/common/software/spackecp/perlmutter/e4s-*
drwxrwsr-x+ 8 e4s spackecp 512 Jun  6 10:09 /global/common/software/spackecp/perlmutter/e4s-21.11
drwxrwsr-x+ 9 e4s spackecp 512 Jan 12 07:34 /global/common/software/spackecp/perlmutter/e4s-22.05
drwxrwsr-x+ 5 e4s spackecp 512 Mar 28 10:24 /global/common/software/spackecp/perlmutter/e4s-22.11

Within each installation you will see several subdirectories, each named after a unique identifier from the CI job. The default symbolic link points to the active production stack:

(spack-pyenv) e4s:login22> ls -l /global/common/software/spackecp/perlmutter/e4s-22.11/
total 4
drwxrwsr-x+ 3 e4s spackecp 512 Mar  6 14:40 82028
drwxrwsr-x+ 3 e4s spackecp 512 Mar 28 10:16 82069
drwxrwsr-x+ 3 e4s spackecp 512 Mar 28 08:34 83104
lrwxrwxrwx  1 e4s spackecp   5 Mar 28 10:24 default -> 83104

We have one modulefile per E4S stack, named e4s/<version>, with a symbolic link spack/e4s-<version>. In the modulefile you will see the path to the root of the Spack installation. As we can see from the example below, the root of Spack is located at /global/common/software/spackecp/perlmutter/e4s-22.11/default/spack.

(spack-pyenv) e4s:login22> module --redirect --raw show e4s/22.11 | grep root
local root = "/global/common/software/spackecp/perlmutter/e4s-22.11/default/spack"
         spack_setup = pathJoin(root, "share/spack/setup-env.sh")
         spack_setup = pathJoin(root, "share/spack/setup-env.csh")
         spack_setup = pathJoin(root, "share/spack/setup-env.fish")
   remove_path("PATH", pathJoin(root, "bin"))

(spack-pyenv) e4s:login22> ls -l /global/common/software/spackecp/perlmutter/e4s-22.11/default/spack
total 100
drwxrwsr-x+ 2 e4s spackecp   512 Mar 28 08:54 bin
-rw-rw-r--  1 e4s spackecp 55695 Mar 28 08:35 CHANGELOG.md
-rw-rw-r--  1 e4s spackecp  1941 Mar 28 08:35 CITATION.cff
-rw-rw-r--  1 e4s spackecp  3262 Mar 28 08:35 COPYRIGHT
drwxrwsr-x+ 3 e4s spackecp   512 Mar 28 08:35 etc
drwxrwsr-x+ 3 e4s spackecp   512 Mar 28 08:35 lib
-rw-rw-r--  1 e4s spackecp 11358 Mar 28 08:35 LICENSE-APACHE
-rw-rw-r--  1 e4s spackecp  1107 Mar 28 08:35 LICENSE-MIT
-rw-rw-r--  1 e4s spackecp  1167 Mar 28 08:35 NOTICE
drwxrwsr-x+ 3 e4s spackecp   512 Mar 28 08:35 opt
-rw-rw-r--  1 e4s spackecp  2946 Mar 28 08:35 pyproject.toml
-rw-rw-r--  1 e4s spackecp   764 Mar 28 08:35 pytest.ini
-rw-rw-r--  1 e4s spackecp  6522 Mar 28 08:35 README.md
-rw-rw-r--  1 e4s spackecp   699 Mar 28 08:35 SECURITY.md
drwxrwsr-x+ 3 e4s spackecp   512 Mar 28 08:35 share
drwxrwsr-x+ 3 e4s spackecp   512 Mar 28 08:35 var
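
To use a production stack interactively you can load the corresponding module. The sketch below assumes the modulefile sources Spack's setup-env.sh (as the spack_setup lines above suggest) so that the spack command becomes available:

module load e4s/22.11        # or: module load spack/e4s-22.11
spack find                   # list the packages installed in the production instance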

Changing the Production Stack Within a Release

To change the production path, you will need to point the default symbolic link at the desired run. First navigate to the directory containing the production installation. For example, let's change to the root of e4s-22.11 and remove the symbolic link:

cd /global/common/software/spackecp/perlmutter/e4s-22.11/
unlink default

Next, create a symbolic link to the new run directory:

ln -s <DIRECTORY_ID> default
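
For example, to point production at the 82069 run from the listing above:

ln -s 82069 default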

Troubleshooting GitLab Runner

Once you are logged in as the e4s user, log in to the system whose runner you need to manage. You can check the runner status in the GitLab project under Settings > CI/CD > Runners; if the GitLab runner is down, you will need to restart it. To check the status of the runner service on the system, use the command shown below; if you see a message like the following, the runner is active and running.
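
For example, to query the Perlmutter runner service:

systemctl --user status perlmutter-e4s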

● perlmutter-e4s.service - Gitlab runner for e4s runner on perlmutter
  Loaded: loaded (/global/homes/e/e4s/.config/systemd/user/perlmutter-e4s.service; enabled; vendor preset: disabled)
  Active: active (running) since Mon 2023-06-05 10:36:39 PDT; 23h ago
Main PID: 140477 (gitlab-runner)
   Tasks: 47 (limit: 39321)
  Memory: 11.9G
     CPU: 1d 5h 43min 43.685s
  CGroup: /user.slice/user-93315.slice/user@93315.service/app.slice/perlmutter-e4s.service
          └─ 140477 /global/homes/e/e4s/jacamar/gitlab-runner run -c /global/homes/e/e4s/.gitlab-runner/perlmutter.config.toml

If the runner is not active, you can restart it by running:

systemctl --user restart perlmutter-e4s

Systemd user service files are used to manage the GitLab runners. These files are the following:

(spack-pyenv) e4s:login22> ls -l ~/.config/systemd/user/*.service
-rw-rw-r-- 1 e4s e4s 326 May  9 07:32 /global/homes/e/e4s/.config/systemd/user/muller-e4s.service
-rw-rw-r-- 1 e4s e4s 334 May  9 07:30 /global/homes/e/e4s/.config/systemd/user/perlmutter-e4s.service
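
The exact contents depend on the deployment, but a minimal sketch of the Perlmutter user unit might look like the following (the Description and ExecStart values match the status output above; the Restart and WantedBy settings are assumptions):

[Unit]
Description=Gitlab runner for e4s runner on perlmutter

[Service]
ExecStart=/global/homes/e/e4s/jacamar/gitlab-runner run -c /global/homes/e/e4s/.gitlab-runner/perlmutter.config.toml
Restart=always

[Install]
WantedBy=default.target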

The gitlab-runner command should be accessible to the e4s user. To register a runner, run gitlab-runner register and follow the prompts. By default the runner configuration is written to ~/.gitlab-runner/config.toml; however, we recommend keeping a separate configuration file per system. For instance, to register a runner for muller, run gitlab-runner register -c ~/.gitlab-runner/muller.config.toml, and the runner configuration will be written to ~/.gitlab-runner/muller.config.toml. For more details on runner registration, see https://docs.gitlab.com/runner/register/.
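
For example, a muller registration might begin roughly as follows (the URL is the NERSC GitLab instance referenced elsewhere on this page; the registration token, executor, and remaining settings are requested interactively at the prompts):

gitlab-runner register -c ~/.gitlab-runner/muller.config.toml --url https://gitlab.nersc.gov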

Sometimes you may see unexpected results during CI jobs if you have changed the GitLab runner configuration and multiple gitlab-runner processes are running on different nodes. Therefore, we recommend using pdsh to search across all nodes, find the stray process, and then terminate it. The command below checks for the gitlab-runner process of the perlmutter-e4s service across all Perlmutter login nodes.

pdsh -w login[01-40] systemctl --user status perlmutter-e4s 2>&1 < /dev/null
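
Once you have identified the node running a stray process, you can stop the service there in the same way, for example (narrow the node list to the affected node):

pdsh -w login[01-40] systemctl --user stop perlmutter-e4s 2>&1 < /dev/null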

Jacamar

The GitLab runners use Jacamar CI; there should be a jacamar.toml file in the following location:

e4s:login27> ls -l ~/.gitlab-runner/jacamar.toml
-rw-rw-r-- 1 e4s e4s 758 Aug 11 08:57 /global/homes/e/e4s/.gitlab-runner/jacamar.toml

Updates to the Jacamar configuration take effect on the runner automatically; there is no need to restart the GitLab runner.

The jacamar and jacamar-auth binaries are located in the following directory; if we need to upgrade Jacamar, the new binaries should be placed here:

e4s:login27> ls -l ~/jacamar/binaries/
total 15684
-rwxr-xr-x 1 e4s e4s 6283264 Jul  7 15:50 jacamar
-rwxr-xr-x 1 e4s e4s 9773296 Jul  7 15:50 jacamar-auth
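
An upgrade is then a matter of replacing those two binaries with the ones from the new release; a minimal sketch, assuming the new binaries have already been downloaded and unpacked into a staging directory (the path below is a placeholder):

cp /path/to/new-release/jacamar /path/to/new-release/jacamar-auth ~/jacamar/binaries/
chmod +x ~/jacamar/binaries/jacamar ~/jacamar/binaries/jacamar-auth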

Test for NERSC System Changes

NERSC uses ReFrame to test system health after maintenance. To ensure the earliest possible notification of system changes that could affect E4S builds, a test has been added; it can be found at https://gitlab.nersc.gov/nersc/consulting/reframe-at-nersc/reframe-nersc-tests.