mirror of
https://github.com/MacRimi/ProxMenux.git
synced 2026-06-03 13:54:41 +00:00
9677c5cb19
When a host gets transient I/O events on a disk while smartctl is
momentarily unavailable (the canonical case: late in a noisy
shutdown), the disk-scan code records a `disk_<name>` WARNING tagged
"SMART: unavailable" exactly once and trusts the next scan to clear
it. That trust is misplaced: the clear path only fires when the
device shows up in the current dmesg window with zero events. After
a reboot, dmesg is empty for that device — so the device never gets
iterated, resolve_error is never called, and the dashboard stays
orange for a disk whose SMART now reports PASSED.
Caught on a lab host where `disk_nvme2n1` had been stuck as WARNING
for hours after a reboot. SMART was 100% healthy at the moment of
inspection (Critical Warning 0x00, 0 media errors, 100% spare). The
error's first_seen and last_seen were identical and pre-dated the
current boot, confirming a one-shot record that nothing had cleared.
Fix: add a `_reconcile_stale_disk_warnings()` pass at the top of
`_check_disks_optimized()`. For every active `disk_*` error
(skipping `disk_fs_*`, which is already reconciled separately):
- device gone from /dev/ → resolve "Device no longer present"
- device present + SMART PASSED → resolve "Transient I/O cleared,
SMART now reports healthy"
- device present + SMART UNKNOWN/FAILED → leave active so the
main loop can re-classify on the next dmesg window
Acknowledged errors are left alone so the user's explicit dismiss
intent isn't overridden.
Verified end-to-end: re-injected the original `disk_nvme2n1`
warning into the persistence DB on the lab host, waited one scan
cycle, error was resolved automatically with `resolved_at` set and
`resolution_reason = 'Transient I/O cleared, SMART now reports
healthy'`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>