collector: add ebsnvme collector for Amazon EBS NVMe performance stats (Feat/ebsnvme collector) #3695

Open
AllenXieSZ wants to merge 3 commits into prometheus:master from AllenXieSZ:feat/ebsnvme-collector

@AllenXieSZ

This adds a new disabled-by-default Linux collector, ebsnvme, that exposes the Amazon EBS detailed performance statistics that Nitro-based EC2 instances vend through the EBS NVMe device log page (log page 0xD0).

What it does

For every EBS-backed NVMe device, the collector:

Maps NVMe devices to EBS volume IDs and mount paths via lsblk -nd --json -o NAME,SERIAL,MOUNTPOINT.
Reads EBS statistics log page 0xD0 via an NVMe admin ioctl (NVME_IOCTL_ADMIN_CMD), opening the device read-only.
Parses the binary EBS statistics structure (validated by its magic number 0x3C23B510).
Exposes the values as Prometheus metrics labelled by volume_id, device, and mount_path.
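
The device-mapping step above can be sketched as follows. This is a minimal illustration, not the collector's actual code: the `ebsDevice` type and `parseLsblk` function are hypothetical names, and the sketch assumes that lsblk's JSON output uses the lowercase keys `name`, `serial`, and `mountpoint`, and that an EBS device reports its volume ID as the serial with the hyphen dropped (for example `vol0123456789abcdef0`).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lsblkOutput mirrors the JSON printed by
// `lsblk -nd --json -o NAME,SERIAL,MOUNTPOINT` (key names are an assumption).
type lsblkOutput struct {
	BlockDevices []struct {
		Name       string  `json:"name"`
		Serial     *string `json:"serial"`
		MountPoint *string `json:"mountpoint"`
	} `json:"blockdevices"`
}

// ebsDevice is a hypothetical record holding the three label values.
type ebsDevice struct {
	VolumeID  string
	Device    string
	MountPath string
}

// parseLsblk keeps NVMe devices whose serial looks like an EBS volume ID.
func parseLsblk(raw []byte) ([]ebsDevice, error) {
	var out lsblkOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, fmt.Errorf("decoding lsblk output: %w", err)
	}
	var devs []ebsDevice
	for _, bd := range out.BlockDevices {
		if !strings.HasPrefix(bd.Name, "nvme") || bd.Serial == nil {
			continue
		}
		serial := strings.TrimSpace(*bd.Serial)
		if !strings.HasPrefix(serial, "vol") {
			continue // not an EBS volume, e.g. instance-store NVMe
		}
		// Assumption: the serial is the volume ID without its hyphen.
		volID := serial
		if !strings.HasPrefix(serial, "vol-") {
			volID = "vol-" + strings.TrimPrefix(serial, "vol")
		}
		// A device with no direct mount point is labelled NotMounted.
		mount := "NotMounted"
		if bd.MountPoint != nil && *bd.MountPoint != "" {
			mount = *bd.MountPoint
		}
		devs = append(devs, ebsDevice{VolumeID: volID, Device: bd.Name, MountPath: mount})
	}
	return devs, nil
}

func main() {
	sample := []byte(`{"blockdevices":[
		{"name":"nvme0n1","serial":"vol0123456789abcdef0","mountpoint":null},
		{"name":"nvme1n1","serial":"vol0fedcba9876543210","mountpoint":"/var/lib/mysql"}
	]}`)
	devs, err := parseLsblk(sample)
	if err != nil {
		panic(err)
	}
	for _, d := range devs {
		fmt.Printf("%s %s %s\n", d.VolumeID, d.Device, d.MountPath)
	}
}
```

Decoding `serial` and `mountpoint` as pointers lets the sketch tell a JSON `null` apart from an empty string, which is what drives the NotMounted fallback.
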

Metrics (namespace node_ebs_*)

node_ebs_read_ops_total, node_ebs_write_ops_total
node_ebs_read_bytes_total, node_ebs_write_bytes_total
node_ebs_read_seconds_total, node_ebs_write_seconds_total
node_ebs_exceeded_iops_seconds_total, node_ebs_exceeded_tp_seconds_total
node_ebs_ec2_exceeded_iops_seconds_total, node_ebs_ec2_exceeded_tp_seconds_total
node_ebs_volume_queue_length
node_ebs_read_io_latency_seconds, node_ebs_write_io_latency_seconds (histograms)

Each metric Help string references the corresponding official EBS statistic name (e.g. total_read_ops, ebs_volume_performance_exceeded_iops, read_io_latency_histogram).
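
For the two latency histograms, one plausible conversion from per-bin counts to Prometheus buckets is sketched below. The `bin` type and `toCumulative` function are hypothetical, and the sketch assumes that the bins arrive sorted by ascending upper bound and that the bounds are expressed in microseconds; neither detail is confirmed by this PR. Prometheus histogram buckets are cumulative, so each bucket carries the running total of all bins up to its upper bound.

```go
package main

import "fmt"

// bin is a hypothetical in-memory form of one EBS latency histogram bin.
type bin struct {
	LowerUS uint64 // lower bound in microseconds (assumed unit)
	UpperUS uint64 // upper bound in microseconds (assumed unit)
	Count   uint64 // operations that completed within this bin
}

// toCumulative turns per-bin counts into cumulative buckets keyed by the
// upper bound in seconds. It assumes bins are sorted by ascending UpperUS.
func toCumulative(bins []bin) (map[float64]uint64, uint64) {
	buckets := make(map[float64]uint64, len(bins))
	var total uint64
	for _, b := range bins {
		total += b.Count
		buckets[float64(b.UpperUS)/1e6] = total
	}
	return buckets, total
}

func main() {
	bins := []bin{
		{LowerUS: 0, UpperUS: 1000, Count: 5},
		{LowerUS: 1000, UpperUS: 2000, Count: 3},
		{LowerUS: 2000, UpperUS: 4000, Count: 2},
	}
	buckets, total := toCumulative(bins)
	fmt.Println(buckets, total)
}
```

The returned map has the `map[float64]uint64` shape that client_golang's `prometheus.MustNewConstHistogram` accepts for its buckets argument, together with a total count.
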

Why disabled by default

The collector is specific to Linux on Nitro-based EC2 instances with EBS NVMe volumes and issues an NVMe admin ioctl per device, so it has nothing to report elsewhere. Enable it with --collector.ebsnvme.

Attribution / licensing

The EBS log-page parsing logic is derived from the Amazon EBS CSI Driver (pkg/metrics/nvme.go, Apache-2.0, Copyright The Kubernetes Authors). This is noted in the file header alongside the standard Prometheus Apache-2.0 header.

Tested on real AWS EBS + EC2

Validated live on a Nitro-based EC2 instance (us-east-2) with 6 attached EBS NVMe volumes running a MySQL workload (data / redo / binlog / undo / relay_log on separate EBS volumes):

gofmt clean; go build OK; unit tests pass (TestParseEBSLogPageInvalidMagic, TestParseEBSLogPageValid, TestConvertEBSHistogram).
Scraped by Prometheus for 30+ minutes: target up 100%, scrape duration steady 24–47 ms, all node_ebs_* series present (6 per metric, one per volume).
Grafana panels (read/write latency + P99, throughput, I/O size, IOPS, EBS/EC2 exceeded counters, queue length) render correctly per-volume, with the mount_path label distinguishing each MySQL data path. Screenshots attached below.
An example Grafana dashboard built entirely on these node_ebs_* metrics (14 panels, Prometheus datasource templated) is attached.

Notes

Commit is DCO signed-off.
mount_path is NotMounted for a device with no direct mount point (e.g. a disk mounted only through one of its partitions).

[Attached screenshots: sample2, sample3, sample1]

SuperQ and others added 3 commits April 7, 2026 17:44
* Fix kernel_hung for no data (prometheus#3613)

Return an ErrNoData for the kernel_hung collector if the file does
not exist.

Fixes: prometheus#3612

Signed-off-by: Ben Kochie <superq@gmail.com>

* Release v1.11.1 (prometheus#3615)

* [BUGFIX] Fix kernel_hung for no data prometheus#3613

Signed-off-by: Ben Kochie <superq@gmail.com>

---------

Signed-off-by: Ben Kochie <superq@gmail.com>

Add a new (disabled-by-default) Linux collector, ebsnvme, that exposes the
Amazon EBS detailed performance statistics vended by Nitro-based EC2
instances through the EBS NVMe device log page (log page 0xD0).

The collector reads the EBS statistics log page from each EBS-backed NVMe
device via an NVMe admin ioctl, parses the binary structure, and exposes
read/write ops, bytes, time, IOPS/throughput exceeded counters, queue
length, and read/write latency histograms. Metrics are labelled by
volume_id, device, and mount_path.

The log-page parsing logic is derived from the Amazon EBS CSI Driver
(pkg/metrics/nvme.go), Apache-2.0, Copyright The Kubernetes Authors.

Statistic names and semantics follow the Amazon EBS User Guide:
https://docs.aws.amazon.com/ebs/latest/userguide/nvme-detailed-performance-stats.html

Signed-off-by: Allen Xie <weifeng.xie@qq.com>

Opening the NVMe character device and issuing the admin passthru ioctl
requires CAP_SYS_ADMIN (in practice, root). Note this in the package doc
comment and the README so users running node_exporter as an unprivileged
user understand why no node_ebs_* metrics appear.

Signed-off-by: Allen Xie <weifeng.xie@qq.com>

@SuperQ left a comment

Member

I'm not sure this is an appropriate feature for the node_exporter. I think it is a bit too vendor-specific.

What do you think, @discordianfish?
