26 changes: 12 additions & 14 deletions docs/PLUGIN_DOC.md
@@ -22,7 +22,7 @@
| OsPlugin | sh -c '( lsb_release -ds &#124;&#124; (cat /etc/*release &#124; grep PRETTY_NAME) &#124;&#124; uname -om ) 2>/dev/null &#124; head -n1'<br>cat /etc/*release &#124; grep VERSION_ID<br>wmic os get Version /value<br>wmic os get Caption /Value | **Analyzer Args:**<br>- `exp_os`: Union[str, list] — Expected OS name/version string(s) to match (e.g. from lsb_release or /etc/os-release).<br>- `exact_match`: bool — If True, require exact match for exp_os; otherwise substring match. | - | [OsDataModel](#OsDataModel-Model) | [OsCollector](#Collector-Class-OsCollector) | [OsAnalyzer](#Data-Analyzer-Class-OsAnalyzer) |
| PackagePlugin | dnf list --installed<br>dpkg-query -W<br>pacman -Q<br>cat /etc/*release<br>wmic product get name,version | **Analyzer Args:**<br>- `exp_package_ver`: Dict[str, Optional[str]] — Map package name -> expected version (None = any version). Checked against installed packages.<br>- `regex_match`: bool — If True, match package versions with regex; otherwise exact or prefix match.<br>- `rocm_regex`: Optional[str] — Optional regex to identify ROCm package version (used when enable_rocm_regex is True).<br>- `enable_rocm_regex`: bool — If True, use rocm_regex (or default pattern) to extract ROCm version for checks. | - | [PackageDataModel](#PackageDataModel-Model) | [PackageCollector](#Collector-Class-PackageCollector) | [PackageAnalyzer](#Data-Analyzer-Class-PackageAnalyzer) |
| PciePlugin | lspci -d {vendor_id}: -nn<br>lspci -x<br>lspci -xxxx<br>lspci -PP<br>lspci -PP -d {vendor_id}:{dev_id}<br>lspci -vvv<br>lspci -vvvt | **Analyzer Args:**<br>- `exp_speed`: int — Expected PCIe link speed (generation 1–5).<br>- `exp_width`: int — Expected PCIe link width in lanes (1–16).<br>- `exp_sriov_count`: int — Expected SR-IOV virtual function count.<br>- `exp_gpu_count_override`: Optional[int] — Override expected GPU count for validation.<br>- `exp_max_payload_size`: Union[Dict[int, int], int, NoneType] — Expected max payload size: int for all devices, or dict keyed by device ID.<br>- `exp_max_rd_req_size`: Union[Dict[int, int], int, NoneType] — Expected max read request size: int for all devices, or dict keyed by device ID.<br>- `exp_ten_bit_tag_req_en`: Union[Dict[int, int], int, NoneType] — Expected 10-bit tag request enable: int for all devices, or dict keyed by device ID. | - | [PcieDataModel](#PcieDataModel-Model) | [PcieCollector](#Collector-Class-PcieCollector) | [PcieAnalyzer](#Data-Analyzer-Class-PcieAnalyzer) |
-| ProcessPlugin | top -b -n 1<br>rocm-smi --showpids<br>top -b -n 1 -o %CPU | **Analyzer Args:**<br>- `max_kfd_processes`: int — Maximum allowed number of KFD (Kernel Fusion Driver) processes; 0 disables the check.<br>- `max_cpu_usage`: float — Maximum allowed CPU usage (percent) for process checks. | **Collection Args:**<br>- `top_n_process`: int — Number of top processes by CPU usage to collect (e.g. for top -b -n 1 -o %CPU). | [ProcessDataModel](#ProcessDataModel-Model) | [ProcessCollector](#Collector-Class-ProcessCollector) | [ProcessAnalyzer](#Data-Analyzer-Class-ProcessAnalyzer) |
+| ProcessPlugin | cat /proc/stat<br>shell loop over /proc/*/stat (with ``__SAMPLER__`` marker)<br>batched ``cat /proc/<pid>/comm`` | **Analyzer Args:**<br>- `max_cpu_usage`: float — Maximum allowed aggregate CPU usage (percent). | **Collection Args:**<br>- `top_n_process`: int — Max process rows ranked by CPU share over the sample window.<br>- `sample_interval_seconds`: float — Wall seconds between two /proc samples (default 1.0). | [ProcessDataModel](#ProcessDataModel-Model) | [ProcessCollector](#Collector-Class-ProcessCollector) | [ProcessAnalyzer](#Data-Analyzer-Class-ProcessAnalyzer) |
| RdmaPlugin | rdma link -j<br>rdma dev<br>rdma link<br>rdma statistic -j | - | - | [RdmaDataModel](#RdmaDataModel-Model) | [RdmaCollector](#Collector-Class-RdmaCollector) | [RdmaAnalyzer](#Data-Analyzer-Class-RdmaAnalyzer) |
| RocmPlugin | {rocm_path}/opencl/bin/*/clinfo<br>env &#124; grep -Ei 'rocm&#124;hsa&#124;hip&#124;mpi&#124;openmp&#124;ucx&#124;miopen'<br>ls /sys/class/kfd/kfd/proc/<br>grep -i -E 'rocm' /etc/ld.so.conf.d/*<br>{rocm_path}/bin/rocminfo<br>ls -v -d {rocm_path}*<br>ls -v -d {rocm_path}-[3-7]* &#124; tail -1<br>ldconfig -p &#124; grep -i -E 'rocm'<br>grep . -r {rocm_path}/.info/* | **Analyzer Args:**<br>- `exp_rocm`: Union[str, list] — Expected ROCm version string(s) to match (e.g. from rocminfo).<br>- `exp_rocm_latest`: str — Expected 'latest' ROCm path or version string for versioned installs.<br>- `exp_rocm_sub_versions`: dict[str, Union[str, list]] — Map sub-version name (e.g. version_rocm) to expected string or list of allowed strings. | **Collection Args:**<br>- `rocm_path`: str — Base path to ROCm installation (e.g. /opt/rocm). Used for rocminfo, clinfo, and version discovery. | [RocmDataModel](#RocmDataModel-Model) | [RocmCollector](#Collector-Class-RocmCollector) | [RocmAnalyzer](#Data-Analyzer-Class-RocmAnalyzer) |
| StoragePlugin | sh -c 'df -lH -B1 &#124; grep -v 'boot''<br>wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace | - | **Collection Args:**<br>- `skip_sudo`: bool — If True, do not use sudo when running df and related storage commands. | [StorageDataModel](#StorageDataModel-Model) | [StorageCollector](#Collector-Class-StorageCollector) | [StorageAnalyzer](#Data-Analyzer-Class-StorageAnalyzer) |
@@ -727,7 +727,7 @@ PcieDataModel

### Description

-Collect Process details
+Collect aggregate CPU usage and top processes from Linux ``/proc`` (two samples of ``/proc/stat`` and ``/proc/<pid>/stat``; no ``top`` or ROCm SMI).

**Bases**: ['InBandDataCollector']

@@ -736,19 +736,19 @@ Collect Process details
### Class Variables

 - **SUPPORTED_OS_FAMILY**: `{<OSFamily.LINUX: 3>}`
-- **CMD_KFD**: `rocm-smi --showpids`
-- **CMD_CPU_USAGE**: `top -b -n 1`
-- **CMD_PROCESS**: `top -b -n 1 -o %CPU `
+- **CMD_PROC_STAT**: read aggregate CPU counters from ``/proc/stat``
+- **CMD_PROC_PID_STAT_DUMP**: shell loop dumping ``/proc/<pid>/stat`` with ``__SAMPLER__`` marker
+- **CMD_PROC_COMM_BATCH**: batched ``comm`` reads; format with ``{pids}`` (space-separated PID list)

### Provides Data

ProcessDataModel

### Commands

-- top -b -n 1
-- rocm-smi --showpids
-- top -b -n 1 -o %CPU
+- ``CMD_PROC_STAT`` (`cat /proc/stat`)
+- ``CMD_PROC_PID_STAT_DUMP``
+- ``CMD_PROC_COMM_BATCH.format(pids=...)``
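
Reviewer note: the two-sample `/proc/stat` approach described in this hunk can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code; the function names `parse_cpu_line` and `cpu_usage_percent` are hypothetical, and counting `iowait` as idle time is an assumption that the PR does not state.

```python
def parse_cpu_line(stat_text: str) -> list[int]:
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are "cpu0", "cpu1", ...
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_usage_percent(first_sample: str, second_sample: str) -> float:
    """Aggregate non-idle CPU percent between two /proc/stat snapshots."""
    before = parse_cpu_line(first_sample)
    after = parse_cpu_line(second_sample)
    # Column order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum the first 8 only.
    total_delta = sum(after[:8]) - sum(before[:8])
    # Assumption: idle time is idle + iowait (columns 3 and 4).
    idle_delta = sum(after[3:5]) - sum(before[3:5])
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta
```

In a real collector the two snapshots would be taken `sample_interval_seconds` apart; here they are plain strings so the arithmetic can be checked in isolation.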

## Collector Class RdmaCollector

@@ -1250,9 +1250,8 @@ class for collection of PCIe data.

### Model annotations and fields

-- **kfd_process**: `Optional[int]`
-- **cpu_usage**: `Optional[float]`
-- **processes**: `Optional[list[tuple[str, str]]]`
+- **cpu_usage**: `Optional[float]` — Aggregate non-idle CPU percent over the sample window.
+- **processes**: `Optional[list[tuple[str, str]]]` — Up to ``top_n_process`` rows: ``(comm, cpu_share_percent_str)``.

## RdmaDataModel Model

@@ -1650,7 +1649,7 @@ Check PCIe Data for errors

### Description

-Check cpu and kfd processes are within allowed maximum cpu and gpu usage
+Check aggregate ``cpu_usage`` against ``max_cpu_usage`` (see [ProcessDataModel](#ProcessDataModel-Model)).

**Bases**: ['DataAnalyzer']

@@ -2004,8 +2003,7 @@ Arguments for PCIe analyzer

### Annotations / fields

-- **max_kfd_processes**: `int` — Maximum allowed number of KFD (Kernel Fusion Driver) processes; 0 disables the check.
-- **max_cpu_usage**: `float` — Maximum allowed CPU usage (percent) for process checks.
+- **max_cpu_usage**: `float` — Maximum allowed aggregate CPU usage (percent) for process checks.

## Analyzer Args Class RocmAnalyzerArgs

19 changes: 5 additions & 14 deletions nodescraper/plugins/inband/process/analyzer_args.py
@@ -31,23 +31,14 @@


 class ProcessAnalyzerArgs(AnalyzerArgs):
-    max_kfd_processes: int = Field(
-        default=0,
-        description="Maximum allowed number of KFD (Kernel Fusion Driver) processes; 0 disables the check.",
-    )
     max_cpu_usage: float = Field(
         default=20.0,
-        description="Maximum allowed CPU usage (percent) for process checks.",
+        description="Maximum allowed aggregate CPU usage (percent) for process checks.",
     )

     @classmethod
     def build_from_model(cls, datamodel: ProcessDataModel) -> "ProcessAnalyzerArgs":
-        """build analyzer args from data model
-
-        Args:
-            datamodel (ProcessDataModel): data model for plugin
-
-        Returns:
-            ProcessAnalyzerArgs: instance of analyzer args class
-        """
-        return cls(max_kfd_processes=datamodel.kfd_process, max_cpu_usage=datamodel.cpu_usage)
+        """Build analyzer args from collected process data (threshold defaults if cpu_usage unset)."""
+        if datamodel.cpu_usage is not None:
+            return cls(max_cpu_usage=float(datamodel.cpu_usage))
+        return cls()
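
Reviewer note: the fallback in the new `build_from_model` can be exercised in isolation with stand-in classes. The sketch below uses plain dataclasses instead of the project's pydantic-based `AnalyzerArgs` and `ProcessDataModel`; the `...Sketch` class names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessDataModelSketch:
    """Stand-in for ProcessDataModel; only the field used here."""
    cpu_usage: Optional[float] = None


@dataclass
class ProcessAnalyzerArgsSketch:
    """Stand-in for ProcessAnalyzerArgs with the same default threshold."""
    max_cpu_usage: float = 20.0

    @classmethod
    def build_from_model(
        cls, datamodel: ProcessDataModelSketch
    ) -> "ProcessAnalyzerArgsSketch":
        # "is not None" (rather than a truthiness test) keeps a measured 0.0 as the threshold.
        if datamodel.cpu_usage is not None:
            return cls(max_cpu_usage=float(datamodel.cpu_usage))
        # No measurement available: fall back to the class default.
        return cls()
```

The old one-liner passed `datamodel.cpu_usage` through unconditionally, so an unset value reached a field typed as `float`; the new branch avoids that by falling back to the default.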
6 changes: 5 additions & 1 deletion nodescraper/plugins/inband/process/collector_args.py
@@ -31,5 +31,9 @@
 class ProcessCollectorArgs(CollectorArgs):
     top_n_process: int = Field(
         default=10,
-        description="Number of top processes by CPU usage to collect (e.g. for top -b -n 1 -o %CPU).",
+        description="Max process rows to return, ranked by CPU share over the sample window (from /proc).",
     )
+    sample_interval_seconds: float = Field(
+        default=1.0,
+        description="Wall time between two /proc samples for CPU utilization (must be > 0; invalid values use 1.0).",
+    )
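
Reviewer note: a rough sketch of how these two arguments might be consumed, assuming the collector ranks processes by the change in `utime + stime` between two `/proc/<pid>/stat` dumps. All function names are hypothetical, and the one-decimal string format for the share is an assumption; the PR only says the second tuple element is a percent string.

```python
def sanitize_interval(value: float, default: float = 1.0) -> float:
    """Fall back to the default when the interval is not a positive number."""
    # NaN compares False against 0, so it also falls back to the default.
    return value if value > 0 else default


def parse_pid_stat(line: str) -> tuple[int, str, int]:
    """Return (pid, comm, utime + stime in ticks) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 1:].split()
    # rest[0] is field 3 (state), so fields 14 and 15 (utime, stime) are rest[11] and rest[12].
    return pid, comm, int(rest[11]) + int(rest[12])


def top_processes(
    first: list[str], second: list[str], total_delta: int, top_n: int
) -> list[tuple[str, str]]:
    """Rank processes by their share of total_delta CPU ticks between two dumps."""
    before = {pid: ticks for pid, _, ticks in map(parse_pid_stat, first)}
    rows = []
    for pid, comm, ticks in map(parse_pid_stat, second):
        if pid not in before or total_delta <= 0:
            continue  # process started mid-window, or no elapsed ticks
        delta = ticks - before[pid]
        if delta > 0:
            rows.append((comm, 100.0 * delta / total_delta))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(comm, f"{share:.1f}") for comm, share in rows[:top_n]]
```

Here `total_delta` would be the aggregate tick delta taken from the two `/proc/stat` samples over the same window.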
26 changes: 4 additions & 22 deletions nodescraper/plugins/inband/process/process_analyzer.py
@@ -34,43 +34,27 @@


 class ProcessAnalyzer(DataAnalyzer[ProcessDataModel, ProcessAnalyzerArgs]):
-    """Check cpu and kfd processes are within allowed maximum cpu and gpu usage"""
+    """Check aggregate ``cpu_usage`` against ``max_cpu_usage``."""

     DATA_MODEL = ProcessDataModel

     def analyze_data(
         self, data: ProcessDataModel, args: Optional[ProcessAnalyzerArgs] = None
     ) -> TaskResult:
         """
-        Analyze the process data to check if the number of KFD processes and CPU usage
-        are within the allowed limits.
+        Analyze process data: compare aggregate CPU usage to the configured limit.

         Args:
             data (ProcessDataModel): The process data to analyze.
-            args (Optional[ProcessAnalyzerArgs], optional): The process analysis arguments. Defaults to None.
+            args (Optional[ProcessAnalyzerArgs], optional): Analysis arguments. Defaults to None.

         Returns:
-            TaskResult: The result of the analysis, containing any events logged during the process.
+            TaskResult: The result of the analysis, including any logged events.
         """
         if not args:
             args = ProcessAnalyzerArgs()

-        has_errors = False
-        if data.kfd_process is not None and data.kfd_process > args.max_kfd_processes:
-            has_errors = True
-            self._log_event(
-                category=EventCategory.OS,
-                description=f"Kfd processes {data.kfd_process} exeed max limit {args.max_kfd_processes}",
-                data={
-                    "kfd_process": data.kfd_process,
-                    "kfd_process_limit": args.max_kfd_processes,
-                },
-                priority=EventPriority.CRITICAL,
-                console_log=True,
-            )
-
         if data.cpu_usage is not None and data.cpu_usage > args.max_cpu_usage:
-            has_errors = True
             self._log_event(
                 category=EventCategory.OS,
                 description=f"CPU usage {data.cpu_usage} exceeds limit {args.max_cpu_usage}",
@@ -81,8 +65,6 @@ def analyze_data(
                 priority=EventPriority.CRITICAL,
                 console_log=True,
             )
-
-        if has_errors:
             self.result.status = ExecutionStatus.ERROR
             self.result.message = "Process limits exceeded"
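
Reviewer note: with the KFD branch gone, the analyzer's decision reduces to a single strict comparison. The stand-alone predicate below is a hypothetical helper, not part of the PR; it only mirrors the condition in `analyze_data` to make the boundary behavior explicit.

```python
from typing import Optional


def cpu_limit_exceeded(cpu_usage: Optional[float], max_cpu_usage: float = 20.0) -> bool:
    """Mirror of the analyzer condition: missing data never fails, and the limit is exclusive."""
    return cpu_usage is not None and cpu_usage > max_cpu_usage
```

Because the comparison is strict, a reading exactly equal to `max_cpu_usage` passes. That matters when the threshold is derived from a measured value through `build_from_model`.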
