AMD GPU plugin

Measures AMD GPU hardware metrics using the ROCm software stack and the amdsmi library. The plugin-amdgpu plugin currently detects the AMD GPUs installed on a machine and collects the following metrics on each of them.

Requirements

  • Linux operating system.
  • AMD GPU(s).
  • The amd-smi-lib package, installed on the system.
  • Permissions for GPU access configured on the system, so that all metrics can be collected: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/prerequisites.html#configuring-permissions-for-gpu-access.
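
The permissions requirement can be checked before starting the plugin. The sketch below is only an illustration: it assumes that GPU access is granted through membership in the `render` and `video` groups (an assumption, check the linked ROCm page for the exact list on your distribution), and the helper names `missing_gpu_groups` and `current_user_groups` are hypothetical.

```python
import grp
import os

# Assumption: groups commonly required for GPU access on ROCm systems.
# The linked ROCm page is the reference for the exact list.
REQUIRED_GROUPS = ("render", "video")


def missing_gpu_groups(user_groups, required=REQUIRED_GROUPS):
    """Return the required group names that are absent from user_groups."""
    present = set(user_groups)
    return [name for name in required if name not in present]


def current_user_groups():
    """Names of the groups the current process belongs to (Linux only)."""
    gids = set(os.getgroups()) | {os.getegid()}
    names = []
    for gid in sorted(gids):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # This gid has no entry in the group database; skip it.
            continue
    return names


if __name__ == "__main__":
    missing = missing_gpu_groups(current_user_groups())
    print("missing groups:", ", ".join(missing) if missing else "none")
```

If the script reports a missing group, some metrics may fail to be collected until the permissions are fixed as described on the ROCm page.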

Metrics

Here are the metrics collected by the plugin source:

| Name | Type | Unit | Description | Resource | ResourceConsumer | Attributes |
|---|---|---|---|---|---|---|
| amd_gpu_activity_usage | Gauge | percentage | GPU activity usage | GPU | LocalMachine | activity_type |
| amd_gpu_energy_consumption | CounterDiff | milli-joule | Average between 2 measurement points, based on the energy consumed since the last start-up | GPU | LocalMachine | |
| amd_gpu_memory_usage | Gauge | megabyte | Video compute memory (VRAM) and graphics translation table (GTT) memory usage | GPU | LocalMachine | memory_type |
| amd_gpu_power_consumption | Gauge | watt | Estimated average electricity consumption | GPU | LocalMachine | |
| amd_gpu_temperature | Gauge | celsius | Temperature values from the sensors of the AMD GPU, reported per zone | GPU | LocalMachine | thermal_zone |
| amd_gpu_voltage | Gauge | millivolt | Voltage of an AMD GPU | GPU | LocalMachine | |
| amd_gpu_process_compute_unit_occupancy | Gauge | percentage | Compute units used | process | pid | process_name |
| amd_gpu_process_memory_usage | Gauge | byte | Process memory usage | process | pid | process_name |
| amd_gpu_process_engine_usage_encode | Gauge | nanosecond | Process encode engine usage | process | pid | process_name |
| amd_gpu_process_engine_gfx | Gauge | nanosecond | Process GFX engine usage | process | pid | process_name |
| amd_gpu_process_memory_usage_cpu | Gauge | byte | Process CPU memory usage | process | pid | process_name |
| amd_gpu_process_memory_usage_gtt | Gauge | byte | Process GTT memory usage | process | pid | process_name |
| amd_gpu_process_memory_usage_vram | Gauge | byte | Process VRAM memory usage | process | pid | process_name |
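
Because amd_gpu_energy_consumption is a CounterDiff expressed in milli-joules, a value can be turned into an average power in watts by dividing it by the time elapsed between the two measurement points. The sketch below shows that arithmetic. It assumes each value covers exactly one measurement interval, and the function name `average_power_watts` is hypothetical.

```python
def average_power_watts(energy_mj, interval_s):
    """Average power in watts from an energy delta in milli-joules.

    energy_mj:  energy consumed during the interval, in milli-joules
    interval_s: duration of the interval, in seconds
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    joules = energy_mj / 1000.0  # 1 joule = 1000 milli-joules
    return joules / interval_s   # 1 watt = 1 joule per second


# 150 000 mJ consumed over a 1 s interval corresponds to 150 W on average.
print(average_power_watts(150_000, 1.0))
```

The result can be compared with amd_gpu_power_consumption as a sanity check, keeping in mind that both values are estimates.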

Attributes

activity_type

The activity type identifies which unit or component of the AMD GPU is in use:

| Value | Description |
|---|---|
| graphic_core | Main graphics core of the AMD GPU |
| memory_management | Manages memory access and address translation |
| unified_memory_controller | Memory controller that manages access to VRAM by organising read/write operations |

memory_type

The memory type identifies the kind of memory consumed by an AMD GPU:

| Value | Description |
|---|---|
| memory_graphic_translation_table | Buffer memory for system management, used as an interface between the GPU and system memory |
| memory_video_computing | Dedicated, integrated video memory of the GPU, used to store graphics data for rendering |

thermal_zone

AMD GPUs are divided into several zones, each associated with a thermal sensor, so that the hardware temperature can be analysed precisely:

| Value | Description |
|---|---|
| thermal_global | Global temperature measured on the AMD GPU hardware |
| thermal_hotspot | Value measured by a probe able to locate the maximum temperature on the AMD GPU hardware |
| thermal_high_bandwidth_memory_X | Temperature measured on GPUs equipped with High Bandwidth Memory (HBM), which is designed to deliver high data transfer rates while minimizing power consumption. Each index X (0 to 3) corresponds to a specific HBM stack |
| thermal_pci_bus | Temperature of the PCI data bus only, which is the interface between the GPU and other components |
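
Since thermal_high_bandwidth_memory_X stands for several concrete values, it can help to expand the full list when filtering or validating measurements. The sketch below builds that list from the table above; the function name `thermal_zone_values` is hypothetical.

```python
HBM_STACK_COUNT = 4  # indices 0 to 3, as described in the table


def thermal_zone_values():
    """Return every concrete thermal_zone value listed in the table."""
    zones = ["thermal_global", "thermal_hotspot"]
    # Expand the "X" placeholder into one value per HBM stack.
    zones += [
        f"thermal_high_bandwidth_memory_{index}"
        for index in range(HBM_STACK_COUNT)
    ]
    zones.append("thermal_pci_bus")
    return zones


for zone in thermal_zone_values():
    print(zone)
```

A given GPU may report only some of these zones, for example a GPU without HBM has no HBM stack temperatures to report.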

process_name

| Value | Description |
|---|---|
| process_name | Name of the process, read as an ASCII character array from the process parameters and converted to a regular UTF-8 encoded string |

Configuration

Here is an example configuration for the plugin. It is part of the ALUMET configuration file (e.g. alumet-config.toml).

[plugins.amd-gpu]
# Time between each activation of the counter source.
poll_interval = "1s"
# Initial interval between two flushes of AMD GPU measurements.
flush_interval = "5s"
# On startup, the plugin inspects the GPU devices and detects their features.
# If `skip_failed_devices = true`, inspection failures will be logged and the plugin will continue.
# If `skip_failed_devices = false`, the first failure will make the plugin's startup fail.
skip_failed_devices = true

More information

Because the current amdsmi library is not fully thread-safe, all GPUs are polled by a single source.
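
The idea of a single source can be pictured as one object that visits every device in turn, so that no two library calls ever run concurrently. The sketch below is an illustration in Python, not the plugin's actual implementation; `SingleGpuSource` and the `read_metrics` callback are hypothetical stand-ins for the real library calls.

```python
class SingleGpuSource:
    """Polls every device from one place, one device at a time."""

    def __init__(self, devices, read_metrics):
        self._devices = list(devices)
        # Hypothetical callback standing in for the library calls.
        self._read_metrics = read_metrics

    def poll(self):
        """Read all devices sequentially and return their measurements."""
        measurements = {}
        for device in self._devices:
            # Only one call into the library is in flight at any time.
            measurements[device] = self._read_metrics(device)
        return measurements


def fake_read_metrics(device):
    """Placeholder reader returning made-up values for the example."""
    return {"device": device, "temperature_celsius": 40 + len(device)}


source = SingleGpuSource(["gpu0", "gpu1"], fake_read_metrics)
print(source.poll())
```

The trade-off of this design is that a slow device delays the readings of the devices polled after it.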

Not all software uses the GPU to its full extent. For instance, to obtain non-zero values for the video encoding/decoding metrics, run video software such as ffmpeg.