AMD GPU plugin
Measures AMD GPU hardware metrics using the ROCm software and the amdsmi library.
The plugin-amdgpu plugin detects the AMD GPUs installed on a machine and collects the metrics listed below for each of them.
Requirements
- Linux operating system.
- AMD GPU(s).
- The amd-smi-lib package installed.
- Permissions for GPU access configured on the system, so that all metrics can be collected: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/prerequisites.html#configuring-permissions-for-gpu-access.
Metrics
Here are the metrics collected by the plugin source:
| Name | Type | Unit | Description | Resource | ResourceConsumer | Attributes |
|---|---|---|---|---|---|---|
| amd_gpu_activity_usage | Gauge | percentage | GPU activity usage | GPU | LocalMachine | activity_type |
| amd_gpu_energy_consumption | CounterDiff | milli-joule | Energy consumed between two measurement points, derived from the energy consumed since the last start-up | GPU | LocalMachine | |
| amd_gpu_memory_usage | Gauge | megabyte | Usage of video memory (VRAM) and graphics translation table memory (GTT) | GPU | LocalMachine | memory_type |
| amd_gpu_power_consumption | Gauge | watt | Estimated average power consumption | GPU | LocalMachine | |
| amd_gpu_temperature | Gauge | celsius | Temperature reported by the GPU's sensors, one value per thermal zone | GPU | LocalMachine | thermal_zone |
| amd_gpu_voltage | Gauge | millivolt | Voltage measured on an AMD GPU | GPU | LocalMachine | |
| amd_gpu_process_compute_unit_occupancy | Gauge | percentage | Share of compute units used by the process | process | pid | process_name |
| amd_gpu_process_memory_usage | Gauge | byte | Process memory usage | process | pid | process_name |
| amd_gpu_process_engine_usage_encode | Gauge | nanosecond | Process encode engine usage | process | pid | process_name |
| amd_gpu_process_engine_gfx | Gauge | nanosecond | Process GFX engine usage | process | pid | process_name |
| amd_gpu_process_memory_usage_cpu | Gauge | byte | Process CPU memory usage | process | pid | process_name |
| amd_gpu_process_memory_usage_gtt | Gauge | byte | Process GTT memory usage | process | pid | process_name |
| amd_gpu_process_memory_usage_vram | Gauge | byte | Process VRAM memory usage | process | pid | process_name |
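The amd_gpu_energy_consumption metric has the CounterDiff type: each published value is the increase of a cumulative counter between two successive readings. The sketch below illustrates the idea in Python. The CounterDiff class, its update method and the sample readings are hypothetical and only serve as an illustration; they are not the plugin's implementation.

```python
from typing import Optional


class CounterDiff:
    """Turns readings of an ever-increasing counter into per-interval deltas.

    Hypothetical helper written for this example; not the plugin's API.
    """

    def __init__(self) -> None:
        self._previous: Optional[int] = None

    def update(self, reading: int) -> Optional[int]:
        """Return the increase since the previous reading.

        The first call returns None because there is nothing to compare with.
        """
        previous, self._previous = self._previous, reading
        if previous is None:
            return None
        return reading - previous


# Made-up cumulative energy readings in millijoules, one per poll.
readings_mj = [120_000, 120_850, 121_900, 121_900]

diff = CounterDiff()
deltas = [diff.update(reading) for reading in readings_mj]
print(deltas)  # [None, 850, 1050, 0]
```

With a poll interval of 1 s, a delta of 850 mJ corresponds to an average power of 0.85 W over that interval.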
Attributes
activity_type
The activity type identifies which unit or component of the AMD GPU is in use:
| Value | Description |
|---|---|
| graphic_core | Main graphics core of the AMD GPU |
| memory_management | Manages memory access and address translation |
| unified_memory_controller | Memory controller that manages access to VRAM by organising read and write operations |
memory_type
The memory type identifies the kind of memory consumed by an AMD GPU:
| Value | Description |
|---|---|
| memory_graphic_translation_table | Buffer memory, managed by the system, used as an interface between the GPU and system memory |
| memory_video_computing | Dedicated video memory integrated into the GPU, used to store graphics data for rendering |
thermal_zone
AMD GPUs are divided into several zones, each associated with a thermal sensor, so that the hardware temperature can be analysed precisely:
| Value | Description |
|---|---|
| thermal_global | Global temperature measured on the AMD GPU |
| thermal_hotspot | Value measured by a probe able to locate the maximum temperature on the AMD GPU |
| thermal_high_bandwidth_memory_X | Temperature measured on GPUs equipped with High Bandwidth Memory (HBM), which is designed to deliver high data transfer rates while minimizing power consumption. Each index X (0 to 3) corresponds to a specific HBM stack |
| thermal_pci_bus | Temperature of the PCI data bus only, the interface between the GPU and other components |
process_name
| Value | Description |
|---|---|
| process_name | Name of the process, read from the process parameters as an ASCII character array and converted to a regular UTF-8 encoded string |
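As an illustration of the conversion described for process_name, the following sketch decodes a fixed-size, NUL-padded ASCII buffer into a string. The 32-byte buffer size and the decode_process_name function are assumptions made for this example; the actual layout used by amdsmi may differ.

```python
def decode_process_name(raw: bytes) -> str:
    """Decode a NUL-padded byte buffer into a string.

    Everything from the first NUL byte onwards is padding and is dropped.
    Bytes that are not valid UTF-8 are replaced instead of raising an error.
    """
    name, _, _ = raw.partition(b"\x00")
    return name.decode("utf-8", errors="replace")


# A made-up 32-byte buffer holding the name "ffmpeg" followed by padding.
buffer = b"ffmpeg".ljust(32, b"\x00")
process_name = decode_process_name(buffer)
print(process_name)  # ffmpeg
```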
Configuration
Here is an example configuration for the plugin. It is part of the ALUMET configuration file (e.g. alumet-config.toml).
[plugins.amd-gpu]
# Time between each activation of the counter source.
poll_interval = "1s"
# Initial interval between two flushes of AMD GPU measurements.
flush_interval = "5s"
# On startup, the plugin inspects the GPU devices and detects their features.
# If `skip_failed_devices = true`, inspection failures will be logged and the plugin will continue.
# If `skip_failed_devices = false`, the first failure will make the plugin's startup fail.
skip_failed_devices = true
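The effect of skip_failed_devices can be sketched as follows. This only illustrates the behaviour described in the comments above. All names (discover_devices, fake_inspect, DeviceInspectionError) and the device identifiers are hypothetical, and this is not the plugin's code.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("amd-gpu-example")


class DeviceInspectionError(Exception):
    """Hypothetical error raised when a device cannot be inspected."""


def discover_devices(
    device_ids: List[str],
    inspect_device: Callable[[str], dict],
    skip_failed_devices: bool,
) -> List[dict]:
    """Inspect every device and return the features of those that succeeded."""
    usable = []
    for device_id in device_ids:
        try:
            usable.append(inspect_device(device_id))
        except DeviceInspectionError as err:
            if not skip_failed_devices:
                # The first failure aborts the startup.
                raise
            # The failure is logged and the remaining devices are still inspected.
            logger.warning("skipping device %s: %s", device_id, err)
    return usable


def fake_inspect(device_id: str) -> dict:
    """Stand-in for a real inspection; pretends that gpu1 is broken."""
    if device_id == "gpu1":
        raise DeviceInspectionError("inspection failed")
    return {"id": device_id}


devices = discover_devices(
    ["gpu0", "gpu1", "gpu2"], fake_inspect, skip_failed_devices=True
)
print([device["id"] for device in devices])  # ['gpu0', 'gpu2']
```

Calling the same function with skip_failed_devices=False re-raises the error for gpu1 instead of returning a list.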
More information
Because the current amdsmi library is not truly thread-safe, all GPUs are polled by a single source.
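A minimal sketch of this design: instead of one polling source per GPU running concurrently, a single source walks over all devices one after another, so the library is never called from two threads at once. The SingleSource class, the fake_read_metrics function, the device identifiers and the returned value are hypothetical.

```python
from typing import Callable, Dict, List


class SingleSource:
    """Polls every GPU from one place, one device after another."""

    def __init__(
        self,
        device_ids: List[str],
        read_metrics: Callable[[str], Dict[str, float]],
    ) -> None:
        self._device_ids = device_ids
        self._read_metrics = read_metrics

    def poll(self) -> Dict[str, Dict[str, float]]:
        """Read all devices sequentially and return measurements keyed by device."""
        return {
            device_id: self._read_metrics(device_id)
            for device_id in self._device_ids
        }


call_order: List[str] = []


def fake_read_metrics(device_id: str) -> Dict[str, float]:
    """Stand-in for calls into a library that must not be used concurrently."""
    call_order.append(device_id)
    return {"amd_gpu_power_consumption": 42.0}


source = SingleSource(["gpu0", "gpu1"], fake_read_metrics)
measurements = source.poll()
print(call_order)  # ['gpu0', 'gpu1']
```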
Not all software uses the GPU to its full extent.
For instance, to obtain non-zero values for the video encoding/decoding metrics, run video software such as ffmpeg.