Administration Guide
This page holds documentation for the processes that support the development, deployment and continuous integration activities of Spack infrastructure at NERSC.
Troubleshooting GitLab Runner
You will need to log in as the e4s user via the collabsu or usgrsu command. This will prompt you for a password, which is the NERSC password for your own username, not for the e4s user. Note that collabsu is not present on Perlmutter, so you must use usgrsu to switch user accounts there.
collabsu e4s
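Since collabsu is missing on some systems, a small helper can pick whichever switching command is available. The following is a minimal sketch: pick_switch_cmd is a hypothetical helper name, and it assumes usgrsu accepts the target user as its argument in the same way as collabsu.

```shell
# Sketch: choose the account-switching command available on this system.
# pick_switch_cmd is a hypothetical helper, not part of any NERSC tooling.
pick_switch_cmd() {
    if command -v collabsu >/dev/null 2>&1; then
        echo "collabsu"
    elif command -v usgrsu >/dev/null 2>&1; then
        echo "usgrsu"
    else
        echo "neither collabsu nor usgrsu found in PATH" >&2
        return 1
    fi
}

# Usage (prompts for your own NERSC password).
# Assumption: usgrsu takes the target user as an argument, like collabsu.
#   "$(pick_switch_cmd)" e4s
```

The helper only prints the command name, so you can review what will be run before switching accounts.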
Once you are logged in as e4s, you can log in to the desired system to restart the runner. You can check the runner status by navigating to Settings > CI/CD > Runners in the GitLab project. If the GitLab runner is down, you will need to restart it using the script located in the $HOME/cron directory of the e4s user.
The gitlab-runner command should be accessible to the e4s user. To register a runner, run gitlab-runner register and follow the prompts. The runner configuration will be written to ~/.gitlab-runner/config.toml. However, we recommend that you create a separate config.toml for each system, or copy the file to a separate location. For instance, to register a runner for muller you can run
gitlab-runner register -c ~/.gitlab-runner/muller.config.toml
which writes the runner configuration to ~/.gitlab-runner/muller.config.toml. For more details regarding runner registration, please see https://docs.gitlab.com/runner/register/.
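The per-system naming used in this guide (muller.config.toml, perlmutter.config.toml) can be captured in a small helper so that the same path is used both when registering and when checking a runner. This is a minimal sketch: runner_config_path is a hypothetical function name, and it assumes every system follows the <system>.config.toml pattern.

```shell
# Sketch: build the per-system runner config path.
# runner_config_path is a hypothetical helper; the <system>.config.toml
# pattern is assumed from the muller and perlmutter examples in this guide.
runner_config_path() {
    system="${1:-}"
    if [ -z "$system" ]; then
        echo "usage: runner_config_path <system>" >&2
        return 1
    fi
    echo "$HOME/.gitlab-runner/${system}.config.toml"
}

# Usage:
#   gitlab-runner register -c "$(runner_config_path muller)"
```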
To restart a runner, run the restart-gitlab.sh script, which should be present in $HOME/cron:
bash $HOME/cron/restart-gitlab.sh
You can check whether the GitLab runner process is running via pgrep, assuming you are on the right node. Example output is shown below; you should see only one GitLab runner process running on a node.
e4s:login27> pgrep -a -u e4s gitlab-runner
52769 gitlab-runner run -c /global/homes/e/e4s/.gitlab-runner/perlmutter.config.toml
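The "exactly one process" rule can be checked mechanically by counting the lines that pgrep prints. The following is a minimal sketch: check_single_runner is a hypothetical helper that reads pgrep -a output on standard input, and its messages and return codes are illustrative choices.

```shell
# Sketch: verify that exactly one gitlab-runner process is listed.
# check_single_runner is a hypothetical helper; it reads `pgrep -a` output
# from stdin and counts the lines that mention gitlab-runner.
check_single_runner() {
    count=$(grep -c 'gitlab-runner' || true)
    case "$count" in
        1) echo "ok: one gitlab-runner process" ;;
        0) echo "down: no gitlab-runner process found"; return 1 ;;
        *) echo "warning: $count gitlab-runner processes found"; return 2 ;;
    esac
}

# Usage on a login node:
#   pgrep -a -u e4s gitlab-runner | check_single_runner
```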
You may see unexpected results during CI jobs if you changed the GitLab runner configuration while multiple gitlab-runner processes are running on different nodes. Therefore, we recommend using pdsh to search all nodes for runner processes, so that you can find the extra processes and terminate them. For instance, you can run pgrep across all Cori login nodes (cori01-12) to find any GitLab runner process; if you see multiple processes, log in to the particular node and terminate the extra process.
pdsh -w cori[01-12] pgrep -a -u e4s gitlab-runner
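pdsh prefixes each output line with the name of the node that produced it, so the nodes hosting a runner can be extracted from the output. The following is a minimal sketch: nodes_with_runner is a hypothetical helper, and the sample lines in the usage note are made up for illustration.

```shell
# Sketch: list the nodes on which pdsh found a gitlab-runner process.
# nodes_with_runner is a hypothetical helper; it reads pdsh output
# ("<node>: <pgrep line>") from stdin and prints each node name once.
nodes_with_runner() {
    awk -F': ' '/gitlab-runner/ && $1 !~ /^pdsh@/ { print $1 }' | sort -u
}

# Usage:
#   pdsh -w cori[01-12] pgrep -a -u e4s gitlab-runner | nodes_with_runner
```

If more than one node is printed, log in to the unexpected node and terminate the runner process there.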
Jacamar
The GitLab runners use Jacamar CI. There should be a jacamar.toml file in the following location:
e4s:login27> ls -l ~/.gitlab-runner/jacamar.toml
-rw-rw-r-- 1 e4s e4s 758 Aug 11 08:57 /global/homes/e/e4s/.gitlab-runner/jacamar.toml
Any updates to the Jacamar configuration are applied to the runner; there is no need to restart the GitLab runner.
The jacamar and jacamar-auth binaries are located in the following directory. If we need to upgrade Jacamar, the new binaries should be placed in this location:
e4s:login27> ls -l ~/jacamar/binaries/
total 15684
-rwxr-xr-x 1 e4s e4s 6283264 Jul 7 15:50 jacamar
-rwxr-xr-x 1 e4s e4s 9773296 Jul 7 15:50 jacamar-auth
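An upgrade amounts to copying the two new binaries into this directory with executable permissions. The following is a minimal sketch under stated assumptions: install_jacamar is a hypothetical helper, the source directory holding the new binaries is supplied by you, and keeping a .bak copy of the previous binary is an illustrative choice rather than an established procedure.

```shell
# Sketch: place new Jacamar binaries into the binaries directory.
# install_jacamar is a hypothetical helper. $1 is a directory that already
# contains the new jacamar and jacamar-auth; $2 defaults to the location
# shown above. The .bak copies are an assumption, kept as a fallback.
install_jacamar() {
    src="${1:-}"
    dest="${2:-$HOME/jacamar/binaries}"
    for bin in jacamar jacamar-auth; do
        if [ ! -f "$src/$bin" ]; then
            echo "missing $src/$bin" >&2
            return 1
        fi
    done
    mkdir -p "$dest"
    for bin in jacamar jacamar-auth; do
        if [ -f "$dest/$bin" ]; then
            cp -p "$dest/$bin" "$dest/$bin.bak"
        fi
        install -m 755 "$src/$bin" "$dest/$bin"
    done
}

# Usage:
#   install_jacamar /path/to/new/binaries
```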
Login Access
You can access Cori and Perlmutter; for more details see Connecting to NERSC. If either system is down, you can access the data transfer nodes (dtn[01-04].nersc.gov) and then access the appropriate system from there. Please check the NERSC MOTD at https://www.nersc.gov/live-status/motd/ for live updates on system status.
To access TDS systems such as muller or gerty, you will first need to access one of the other systems (cori, perlmutter, dtn) and then connect from there, for example:
ssh dtn01.nersc.gov
ssh gerty
It is probably a good idea to run collabsu or usgrsu only once you are on the correct login node; otherwise you may be prompted for a password for the e4s user.
Slack Webhook
The restart-gitlab.sh script is responsible for restarting the GitLab runner and sending a notification to the NERSC Slack channel #spack-infrastructure. This action uses curl and requires a webhook URL, which must be saved as a secret. The secret is called SLACK_WEBHOOK and can be updated at https://software.nersc.gov/NERSC/spack-infrastructure/-/settings/ci_cd. The relevant command is:
curl -X POST --data-urlencode "payload={\"channel\": \"#spack-infrastructure\", \"username\": \"webhookbot\", \"text\": \"Restarting E4S runner on ${NERSC_HOST} at node: ${desired_host}. \", \"icon_emoji\": \":ghost:\"}" "$SLACK_WEBHOOK"
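The escaped quotes in the command above are easy to get wrong, so the payload can be built separately with printf and inspected before it is sent. The following is a minimal sketch: slack_payload is a hypothetical helper, and it does not escape special characters, so it assumes the host and node names contain no quotes or backslashes.

```shell
# Sketch: build the Slack payload separately from the curl call.
# slack_payload is a hypothetical helper; $1 is the system name and $2 the
# node name. No JSON escaping is done, so plain host names are assumed.
slack_payload() {
    printf '{"channel": "#spack-infrastructure", "username": "webhookbot", "text": "Restarting E4S runner on %s at node: %s.", "icon_emoji": ":ghost:"}' "$1" "$2"
}

# Usage:
#   curl -X POST --data-urlencode \
#        "payload=$(slack_payload "$NERSC_HOST" "$desired_host")" \
#        "$SLACK_WEBHOOK"
```

Running slack_payload on its own prints the JSON, which makes it possible to test the message text without posting to Slack.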
Test for NERSC System Changes
NERSC uses ReFrame to test system health after maintenance. In order to ensure the earliest possible notification of system changes that will affect E4S builds, a test has been added. This test can be found at https://gitlab.nersc.gov/nersc/consulting/reframe-at-nersc/reframe-nersc-tests.