
Slurm plugin

The slurm plugin gathers measurements about Slurm jobs.

Requirements

  • A node with Slurm installed and running.
  • The Slurm plugin relies on cgroups, so cgroup support must be enabled in your Slurm cluster. Refer to the official Slurm documentation on cgroups for setup instructions.

Metrics

Here are the metrics collected by the plugin's sources.

| Name | Type | Unit | Description | Resource | Resource Consumer | Attributes |
|---|---|---|---|---|---|---|
| cpu_time_delta | Delta | nanoseconds | time spent by the job executing on the CPU | LocalMachine | Cgroup | see below |
| cpu_percent | Gauge | Percent (0 to 100) | cpu_time_delta / delta_t (1 core used fully = 100%) | LocalMachine | Cgroup | see below |
| memory_usage | Gauge | Bytes | total memory usage of the job | LocalMachine | Cgroup | see below |
| cgroup_memory_anonymous | Gauge | Bytes | anonymous memory usage | LocalMachine | Cgroup | see below |
| cgroup_memory_file | Gauge | Bytes | memory used to cache filesystem data | LocalMachine | Cgroup | see below |
| cgroup_memory_kernel_stack | Gauge | Bytes | memory allocated to kernel stacks | LocalMachine | Cgroup | see below |
| cgroup_memory_pagetables | Gauge | Bytes | memory reserved for the page tables | LocalMachine | Cgroup | see below |
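The four cgroup_memory_* metrics correspond to counters that the Linux kernel exposes per cgroup. The sketch below assumes they are read from a cgroup v2 memory.stat file, whose anon, file, kernel_stack and pagetables keys are given in bytes; the function name parse_memory_stat and the key-to-metric mapping are illustrative assumptions, not the plugin's actual code.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parses the "key value" lines of a cgroup v2
/// memory.stat file into a map (values are in bytes for memory counters).
fn parse_memory_stat(content: &str) -> HashMap<String, u64> {
    content
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?;
            let value = parts.next()?.parse::<u64>().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

fn main() {
    // Made-up excerpt of a memory.stat file for a job cgroup.
    let content = "anon 1048576\nfile 4096\nkernel_stack 16384\npagetables 8192\nshmem 0\n";
    let stats = parse_memory_stat(content);
    // Assumed mapping from memory.stat keys to the plugin's metric names.
    for (metric, key) in [
        ("cgroup_memory_anonymous", "anon"),
        ("cgroup_memory_file", "file"),
        ("cgroup_memory_kernel_stack", "kernel_stack"),
        ("cgroup_memory_pagetables", "pagetables"),
    ] {
        println!("{metric} = {} bytes", stats[key]);
    }
}
```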

Attributes

The measurements produced by the slurm plugin have the following attributes:

  • job_id: id of the Slurm job, for example 10707.
  • job_step: step number within the Slurm job, for example 2 (the full id of the job step is 10707.2, and the job_step attribute contains only the step number, 2).
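To make the relationship between the two attributes concrete, here is a minimal sketch that splits a full job step id such as 10707.2 into the values carried by job_id and job_step. The helper split_job_id is hypothetical and only illustrates the convention described above.

```rust
/// Hypothetical helper: splits a full Slurm id such as "10707.2" into
/// (job_id, job_step). A bare job id like "10707" has no step part.
fn split_job_id(full_id: &str) -> (String, Option<String>) {
    match full_id.split_once('.') {
        Some((job, step)) => (job.to_string(), Some(step.to_string())),
        None => (full_id.to_string(), None),
    }
}

fn main() {
    let (job_id, job_step) = split_job_id("10707.2");
    println!("job_id={job_id}, job_step={:?}", job_step);
}
```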

The cpu measurements have an additional attribute kind, which can be one of:

  • total: time spent in kernel and user mode
  • system: time spent in kernel mode only
  • user: time spent in user mode only
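The following sketch applies the cpu_percent formula from the metrics table to each kind, assuming that delta_t equals a poll_interval of 1 second and using made-up CPU time deltas. It also shows that total is the sum of the system and user times. The function name cpu_percent mirrors the metric name but is only illustrative.

```rust
/// Converts a CPU time delta (ns) measured over an interval of delta_t_ns (ns)
/// into a percentage where one fully used core = 100%.
fn cpu_percent(cpu_time_delta_ns: u64, delta_t_ns: u64) -> f64 {
    100.0 * cpu_time_delta_ns as f64 / delta_t_ns as f64
}

fn main() {
    let delta_t_ns: u64 = 1_000_000_000; // assumed poll_interval = "1s"
    // Made-up CPU time deltas for one job over that interval.
    let user_ns: u64 = 600_000_000;
    let system_ns: u64 = 200_000_000;
    let total_ns = user_ns + system_ns; // total = kernel mode + user mode
    for (kind, delta) in [("user", user_ns), ("system", system_ns), ("total", total_ns)] {
        println!("kind={kind}: cpu_percent = {:.1}", cpu_percent(delta, delta_t_ns));
    }
}
```

With these values, the program reports 60.0 for user, 20.0 for system and 80.0 for total.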

Annotation of the Measurements Provided by Other Plugins

Other plugins, such as the process-to-cgroup-bridge, can produce measurements related to the cgroups of Slurm jobs. However, they cannot add job-specific information (such as the job id) to the measurements.

To add this information, enable the annotation feature of the slurm plugin with the following configuration option:

annotate_foreign_measurements = true

Make sure that the slurm plugin is enabled after the plugins whose measurements you want to annotate. For instance, the [plugins.slurm] section should come after the [plugins.process-to-cgroup-bridge] section in the configuration file:

[plugins.process-to-cgroup-bridge]
…

[plugins.slurm]
…
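A minimal sketch of the idea behind annotation: a measurement whose consumer is a cgroup can be linked to a Slurm job by inspecting the cgroup path. This assumes Slurm names its cgroup directories job_<id> and step_<step> (the example path below is an assumption about a cgroup v2 layout). The function job_info_from_cgroup is hypothetical and is not the plugin's actual implementation.

```rust
/// Hypothetical helper: extracts (job_id, job_step) from a cgroup path,
/// assuming Slurm's job_<id>/step_<step> directory naming.
/// Returns None if the path does not belong to a Slurm job.
fn job_info_from_cgroup(path: &str) -> Option<(String, Option<String>)> {
    let mut job_id = None;
    let mut step = None;
    for component in path.split('/') {
        if let Some(id) = component.strip_prefix("job_") {
            job_id = Some(id.to_string());
        } else if let Some(s) = component.strip_prefix("step_") {
            step = Some(s.to_string());
        }
    }
    job_id.map(|id| (id, step))
}

fn main() {
    // Assumed example path of a job step cgroup (cgroup v2).
    let path = "/sys/fs/cgroup/system.slice/slurmstepd.scope/job_10707/step_2/user/task_0";
    match job_info_from_cgroup(path) {
        Some((job_id, job_step)) => println!("job_id={job_id}, job_step={:?}", job_step),
        None => println!("not a Slurm job cgroup"),
    }
}
```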

Configuration

Here is an example of how to configure this plugin. Put the following in the configuration file of the Alumet agent (usually alumet-config.toml).

[plugins.slurm]
# Interval between two measurements
poll_interval = "1s"
# Interval between two scans of the cgroup v1 hierarchies.
# Only applies to cgroup v1 hierarchies (cgroup v2 hierarchies are watched with inotify instead).
cgroupv1_refresh_interval = "30s"
# Only monitor the job cgroup related metrics and skip the others
jobs_only = true
# If true, the slurm sources are started in a paused state (only for advanced setups with a control plugin enabled)
add_source_in_pause_state = false