`get_disks_observation_counts` maps each serial's count to that
serial's "most recent" device_name (so renames like ata8 -> sdh keep
the badge attached). When several physical disks have passed through
the same kernel name across reboots (common with NVMe, where the
kernel probes in a different order depending on which slots are
populated), disk_registry keeps a row per (device_name, serial) seen,
and the "most recent" device_name for a serial may now be in use by
an entirely different disk.

Concrete case from the wild: serial 211716800490 was nvme0n1 during
the previous boot and earned a real I/O observation. After removing
four of five NVMes, the surviving disk (serial 243332800236) booted
into nvme0n1. The badge layer mirrored 211716800490's count onto
nvme0n1 — which is now a different physical disk — and showed
"1 obs." on the wrong drive, while the modal (which scopes by the
current (device_name, serial) registry row) found nothing and
rendered an empty history.

Only mirror a serial's count onto its device_name when that
device_name is currently owned by the same serial, determined from
the freshest disk_registry row. The serial-keyed entry stays
unconditional so observations remain reachable when the disk is
re-plugged under another device name.
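
A minimal sketch of the ownership guard described above, not the actual
implementation. The function name `build_observation_counts`, the
`(device_name, serial, last_seen)` row shape, and the use of a numeric
`last_seen` to pick the freshest row are all assumptions made for
illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical row shape: (device_name, serial, last_seen)
RegistryRow = Tuple[str, str, float]


def build_observation_counts(
    counts_by_serial: Dict[str, int],
    registry_rows: Iterable[RegistryRow],
) -> Dict[str, int]:
    """Key counts by serial, and by device_name only when ownership matches."""
    owner: Dict[str, Tuple[str, float]] = {}        # device_name -> (serial, last_seen)
    latest_name: Dict[str, Tuple[str, float]] = {}  # serial -> (device_name, last_seen)

    for device_name, serial, last_seen in registry_rows:
        # The freshest row per device_name decides who currently owns it.
        if device_name not in owner or last_seen > owner[device_name][1]:
            owner[device_name] = (serial, last_seen)
        # The freshest row per serial gives its "most recent" device_name.
        if serial not in latest_name or last_seen > latest_name[serial][1]:
            latest_name[serial] = (device_name, last_seen)

    result: Dict[str, int] = {}
    for serial, count in counts_by_serial.items():
        # The serial-keyed entry is unconditional.
        result[serial] = count
        entry = latest_name.get(serial)
        if entry is None:
            continue
        device_name = entry[0]
        # Mirror onto the device_name only if this serial still owns it.
        if owner[device_name][0] == serial:
            result[device_name] = count
    return result
```

With the rows from the case above (211716800490 on nvme0n1 first,
243332800236 on nvme0n1 later), the count stays reachable under the
old serial but is no longer mirrored onto nvme0n1.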
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The UPDATE in `_resolve_error_impl` only touched `resolved_at`: the
`reason` argument every caller passes was silently dropped, and the
`resolution_reason` / `resolution_type` columns stayed NULL for every
auto-resolved error. The columns were added in an earlier sprint
for exactly this audit-log purpose, but the writer was never updated
to populate them.

Fix the SQL to write `resolution_reason = ?` and tag
`resolution_type = COALESCE(existing, 'auto')` so admin-cleared
errors (whose type is set elsewhere) keep their value while the
default auto path correctly labels itself.
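
A rough sketch of the corrected UPDATE, assuming a SQLite backend (the
`?` placeholder suggests it, but that is a guess). The table name
`errors`, the `error_key` column, and both helper functions are
hypothetical; only `resolved_at`, `resolution_reason`, and
`resolution_type` come from the change itself.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with a hypothetical schema for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE errors (
            error_key         TEXT PRIMARY KEY,
            resolved_at       TEXT,
            resolution_reason TEXT,
            resolution_type   TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO errors (error_key, resolution_type) VALUES (?, ?)",
        # disk_sda stands in for an error whose type was set elsewhere.
        [("disk_nvme2n1", None), ("disk_sda", "manual")],
    )
    return conn


def resolve_error(conn: sqlite3.Connection, error_key: str, reason: str) -> None:
    """Mark an error resolved, recording the reason and defaulting the type."""
    conn.execute(
        """
        UPDATE errors
           SET resolved_at       = CURRENT_TIMESTAMP,
               resolution_reason = ?,
               -- Keep a type that is already set; otherwise label it 'auto'.
               resolution_type   = COALESCE(resolution_type, 'auto')
         WHERE error_key = ?
           AND resolved_at IS NULL
        """,
        (reason, error_key),
    )
    conn.commit()
```

COALESCE returns its first non-NULL argument, so a row whose
`resolution_type` was already set keeps that value, while an untouched
row is labelled 'auto'.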
Verified end-to-end on the lab host: re-injected the `disk_nvme2n1`
warning and waited one scan cycle; the row now reads
`resolution_type='auto'` and
`resolution_reason='Transient I/O cleared, SMART now reports healthy'`.
Previously these columns stayed NULL even though the resolve_error
call passed a descriptive reason.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>