Commit Graph

1873 Commits

Author SHA1 Message Date
MacRimi 7a1fe0b0fc storage-overview: drive disk-temp breakdown from configurable thresholds
`getDiskHealthBreakdown` carried its own hardcoded ladder (HDD ≤45
normal, ≤55 warning) that was much stricter than the configurable
defaults consumed by `getTempColor` via `useDiskTempThresholds`
(HDD warn 60, hot 65). HDDs at 48 °C therefore rendered a green
"Healthy 48°C" badge on the card but were tallied as "warning" in
the top-of-page "X normal, Y warning, Z critical" summary, leaving
the user with the misleading "6 normal, 5 warning" line.

Use the same threshold map as the per-disk badge so the colour and
the count are always consistent, and so Settings → Health Monitor
Thresholds → Disk temperature actually applies to the breakdown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 23:13:55 +02:00
MacRimi 3c5beb0286 Persist resolution_reason on resolve_error so the audit log is useful
The UPDATE in `_resolve_error_impl` only touched `resolved_at` — the
`reason` argument every caller passes was silently dropped, and the
`resolution_reason` / `resolution_type` columns stayed NULL for every
auto-resolved error. The columns were added back in a previous sprint
for exactly this audit-log purpose, but the writer was never updated
to populate them.

Fix the SQL to write `resolution_reason = ?` and tag
`resolution_type = COALESCE(existing, 'auto')` so admin-cleared
errors (whose type is set elsewhere) keep their value while the
default auto path correctly labels itself.

Verified end-to-end on the lab host: re-injected the `disk_nvme2n1`
warning, waited one scan cycle, the row now reads
`resolution_type='auto'` and
`resolution_reason='Transient I/O cleared, SMART now reports healthy'`
— previously these columns stayed NULL even though the resolve_error
call passed a descriptive reason.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 23:02:52 +02:00
MacRimi 9677c5cb19 Health Monitor: reconcile stale disk warnings across reboots
When a host gets transient I/O events on a disk while smartctl is
momentarily unavailable (the canonical case: late in a noisy
shutdown), the disk-scan code records a `disk_<name>` WARNING tagged
"SMART: unavailable" exactly once and trusts the next scan to clear
it. That trust is misplaced: the clear path only fires when the
device shows up in the current dmesg window with zero events. After
a reboot, dmesg is empty for that device — so the device never gets
iterated, resolve_error is never called, and the dashboard stays
orange for a disk whose SMART now reports PASSED.

Caught on a lab host where `disk_nvme2n1` had been stuck as WARNING
for hours after a reboot. SMART was 100% healthy at the moment of
inspection (Critical Warning 0x00, 0 media errors, 100% spare). The
error's first_seen and last_seen were identical and pre-dated the
current boot, confirming a one-shot record that nothing had cleared.

Fix: add a `_reconcile_stale_disk_warnings()` pass at the top of
`_check_disks_optimized()`. For every active `disk_*` error
(skipping `disk_fs_*`, which is already reconciled separately):

  - device gone from /dev/   → resolve "Device no longer present"
  - device present + SMART PASSED → resolve "Transient I/O cleared,
    SMART now reports healthy"
  - device present + SMART UNKNOWN/FAILED → leave active so the
    main loop can re-classify on the next dmesg window

Acknowledged errors are left alone so the user's explicit dismiss
intent isn't overridden.

Verified end-to-end: re-injected the original `disk_nvme2n1`
warning into the persistence DB on the lab host, waited one scan
cycle, error was resolved automatically with `resolved_at` set and
`resolution_reason = 'Transient I/O cleared, SMART now reports
healthy'`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:54:14 +02:00
MacRimi d25faedc2b Rebuild AppImage with actual Next.js 15.1.9 + always reconcile node_modules
The previous bump commit (2f24de25) shipped a binary that still carried
Next.js 15.1.6 in the bundled chunks even though AppImage/package.json
was at 15.1.9. Root cause: build_appimage.sh only ran `npm install`
when `node_modules` did not exist; on the .50 build host node_modules
had been cached since the 1.2.1 build cycle, so the bump was silently
ignored and the build re-used the stale tree.

Fix the script: always run `npm install --legacy-peer-deps` on every
build. npm reconciles against the lockfile in under a second when
everything is already in sync, so the change is free on a warm tree
and correct on a stale one.

Rebuild from a clean node_modules on .50, redeploy to all four hosts
(SHA 4602b8d4aa130c6f...), runtime grep confirms the bundle now
contains 15.1.9 with no traces of 15.1.6 left. Same architecture and
threat model as before — Flask serves the static export on :8008,
no Next.js runtime — but the version banner now matches the lockfile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:37:44 +02:00
MacRimi 2f24de2592 Bump Next.js to 15.1.9 + doc nav handles in-page anchors + help_info_menu
Three changes that fold into the v1.2.2 release PR:

1. AppImage: bump Next.js 15.1.6 -> 15.1.9 (CVE-2025-55182)
   GHSA-9qr9-h5gf-34mp / React2Shell is a pre-auth RCE in React Server
   Components when Server Functions deserialize attacker payloads. The
   ProxMenux Monitor ships Next.js in `output: "export"` mode behind
   Flask on :8008, so there is no runtime Next.js server and no
   "use server" directive in the source tree — the exploitable path is
   not reachable. Bumping to 15.1.9 anyway because OpenVAS and similar
   scanners flag the version string from the JS bundle regardless of
   architecture; raising the floor removes false-positive noise across
   every install. Reported by @rost43 in #219.

2. web/components/ui/doc-navigation.tsx: handle sidebar entries that
   point to in-page anchors. The Storage Share Manager sidebar has
   entries for `/docs/storage-share#host` and
   `/docs/storage-share#lxc-net` as section headers, but
   usePathname() does not include the hash so every visit collapsed
   to the parent page. As a result Next/Previous on /docs/storage-share
   stayed stuck at #host, and Next from .../lxc-mount-points/ pointed
   back at #host instead of #lxc-net. Read window.location.hash on
   mount (and on hashchange) and try the pathname+hash match before
   falling back to the pathname-only lookup. SSR hydrates with an
   empty hash and refreshes once mounted — brief render before
   hydration is the same as the previous behaviour, so no regression.

3. scripts/help_info_menu.sh: user-side improvement (mirrored from
   develop).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:31:12 +02:00
MacRimi 4f3750a8ab Fix CI: add pagefind to web devDeps + portable AppImage cache path
Two regressions surfaced after the 1.2.2 release merge to main, both
in workflows that auto-trigger on push to main:

* Deploy to GitHub Pages — build failed with `pagefind: not found`
  (exit 127) after Next.js prerendered all 241 routes. pagefind was
  not declared in web/package.json; the local build only worked
  because the project root had its own package.json with pagefind
  as a devDep (the one we just gitignored in the previous commit).
  Add `pagefind: ^1.5.2` to web/package.json devDependencies and
  regenerate web/package-lock.json so `npm ci` in CI puts the
  binary at web/node_modules/.bin/pagefind.

* Build ProxMenux Monitor AppImage — failed at the first step with
  `mkdir: cannot create directory '/var/cache/proxmenux-build':
  Permission denied`. The cache path was hardcoded to /var/cache/,
  which is writable when the script runs as root (the .50 host
  manual build) but not as the unprivileged GitHub Actions runner.
  Switch to `${XDG_CACHE_HOME:-$HOME/.cache}/proxmenux-build/` —
  works identically in both environments.

Verified locally: `cd web && npm ci && npm run build` produces 2804
files in out/, 231 pages indexed by pagefind, root redirect intact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 13:34:30 +02:00
MacRimi 964f2083b6 Merge main into develop to resolve conflicts before 1.2.2 release PR
# Conflicts:
#	AppImage/ProxMenux-Monitor.AppImage.sha256
#	LICENSE
#	install_proxmenux_beta.sh
2026-05-31 13:24:58 +02:00
MacRimi 9e8434a16b Release 1.2.2 stable — consolidated v1.2.1.x cycle
Promote the v1.2.1.x beta cycle to stable: version markers bumped
from 1.2.1.4-beta to 1.2.2 across version.txt, AppImage/package.json,
flask_server.py (3 places) and the four UI labels in login,
proxmox-dashboard, storage-overview and release-notes-modal.

Replace AppImage/ProxMenux-1.2.1.4-beta.AppImage with
ProxMenux-1.2.2.AppImage and regenerate the .sha256 sidecar
(097e2344675d4b21f1dd18c531c956c299a6507fbc3d0c9695418063581ba2b0).
The new binary is verified on all 4 lab hosts (.50 / .55 / .89 /
1.10) — same sha, all services active, runtime version markers
report 1.2.2.

CHANGELOG["1.2.2"] in release-notes-modal.tsx consolidates every beta
in the 1.2.1.x line (12 added / 13 changed / 18 fixed), and
CURRENT_VERSION_FEATURES is rewritten with the four stable highlights:
Health Monitor Thresholds, granular dismiss control (per-event
duration + Active Suppressions panel), Apprise notification channel
parity, and LXC update detection.
2026-05-31 13:15:39 +02:00
MacRimi 853bcbde35 Fix AppRise 2026-05-31 11:03:04 +02:00
MacRimi 2442ca63be Update AppImage 1.2.1.4 2026-05-31 10:36:16 +02:00
MacRimi 91ded0125e Update AppImage 1.2.1.4 2026-05-30 22:14:51 +02:00
MacRimi 4bf49675d2 Update ProxMenux 1.2.1.4-beta 2026-05-30 21:54:32 +02:00
MacRimi fe1297936f Update AppImage 1.2.1.3 2026-05-27 17:55:41 +02:00
MacRimi a3aa5d9c1a Update flask_server.py 2026-05-25 18:01:24 +02:00
MacRimi b299227da2 Update AppImage 1.2.1.3 2026-05-24 17:52:04 +02:00
MacRimi 3286fc315c Update AppImage 1.2.1.3 2026-05-24 16:42:44 +02:00
MacRimi 105576cf17 Update AppImage 2026-05-24 11:37:20 +02:00
MacRimi 4b934db7db Update AppImage 1.2.1.3 2026-05-23 21:27:18 +02:00
MacRimi f2a40b993a Update AppImage 1.2.1.3 2026-05-22 18:47:30 +02:00
MacRimi 840385272c Add ProxMenux beta 1.2.1.3 2026-05-22 18:24:03 +02:00
MacRimi 95d0667077 Update AppImage 1.2.1.2 2026-05-21 22:25:29 +02:00
MacRimi 56fac4c34b Update AppImage 1.2.1.2 2026-05-21 22:00:35 +02:00
MacRimi 2d523b030f Update AppImage 1.2.1.2 2026-05-21 21:41:27 +02:00
MacRimi f5b7a0a74b Update AppImage 1.2.1.2 2026-05-21 21:17:59 +02:00
MacRimi 3e9dd599a6 Update AppImage 1..2.1.2 2026-05-21 19:31:47 +02:00
MacRimi 9545587b67 Update AppImage 1.2.1.2 2026-05-21 17:24:09 +02:00
MacRimi ef22c88861 Update AppImage 1.2.1.2 2026-05-21 17:18:23 +02:00
MacRimi 298cd2c6d4 Update Beta 1.2.1.2 2026-05-20 19:47:42 +02:00
MacRimi 4112323961 Update AppImage 2026-05-20 18:14:32 +02:00
MacRimi 73389d842a Reset auth_fail cooldowns on NotificationManager.start()
Pedro Rico, 19/05: after reinstalling the Monitor from GitHub a real
SSH/web login failure went unnotified. Root cause was the auth_fail
cooldown surviving across the service restart — install_proxmenux_beta
extracts the new AppImage but leaves the notification_last_sent SQLite
table intact (desirable: we don't want to lose legitimate cooldowns
on every update). On startup `_load_cooldowns_from_db()` then loaded
the stale auth_fail row from the previous run into the in-memory
cache, and `_passes_cooldown` blocked the new event.

This extends the existing reset-on-start mechanism (already in place
for update_summary, proxmenux_update, post_install_update, …) to also
clear auth_fail rows. A security-relevant event shouldn't be silenced
because the same source IP happened to fail to log in yesterday.

- Rename `_UPDATE_EVENT_TYPES_RESET_ON_START` → `_EVENT_TYPES_RESET_ON_START`
  (the list no longer covers only update-status reports).
- Rename `_reset_update_cooldowns_on_start()` → `_reset_cooldowns_on_start()`
  for the same reason.
- Add `'auth_fail'` to the curated list.

High-frequency sources (log_critical_*, disk SMART errors, …) are
deliberately NOT on this list — they keep their 24h cooldown across
restarts to prevent inbox floods if the user toggles the service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:18:49 +02:00
MacRimi 4e26c5942f Reset update-type cooldowns on NotificationManager.start()
When the user reinstalls or restarts the Monitor (deploy of a new
beta AppImage), they expect to see a fresh "what's available now"
summary in Telegram/Gotify/etc. instead of silence — even if the
24h anti-spam cooldown for `update_summary` etc. hasn't expired yet.

Without this, the operator had to wait up to 24h after every
deploy before the next `update_summary`, `proxmenux_update`,
`post_install_update`, `pve_update`, `update_available`,
`nvidia_driver_update_available` or `secure_gateway_update_available`
notification fired. The 24h cooldown is the right default for steady
state (don't pester the user every poll cycle with the same "177
packages pending" reminder), but a service restart is an explicit
signal that the user wants a fresh status report.

- New _UPDATE_EVENT_TYPES_RESET_ON_START tuple lists the event types
  to clear (everything in the "*_update*" + "update_*" family).
- New _reset_update_cooldowns_on_start() runs at start() right after
  the running flag flips, before watchers/dispatcher come up.
- Patterns match both fingerprint shapes:
    "<host>:<entity>:<event_type>:"               trailing-colon form
    "<host>:<entity>:<event_type>"                no-suffix form (managed installs)
- In-memory `_cooldowns` cache is also pruned so the live dispatcher
  picks up the reset immediately, without waiting for the next
  `_load_cooldowns_from_db()` cycle.

Non-update cooldowns (auth_fail, log_critical_*, disk errors, …) are
preserved so a restart doesn't unleash a backlog of stale alerts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:03:44 +02:00
MacRimi 06e6ae417e Fix notification dispatch: NameError in _dispatch_to_channels (Quiet Hours)
`_dispatch_to_channels` does NOT receive the NotificationEvent object —
only the rendered primitives (title, body, severity, event_type, …).
The Quiet Hours + Daily Digest merge introduced two references to
`event.severity` / `event` inside this function, which raised
`NameError: name 'event' is not defined` for every event passing
through dispatch.

The dispatch loop swallows the exception with a broad `except`, so the
visible symptom was "the Test button works but no real event ever
arrives" — both for community beta users (multiple reports on
Telegram, 9-18 May) and verified live on a test host (id 905 in
notification_history confirms the pipeline post-fix).

- _dispatch_to_channels: read `severity` / `event_type` directly
  instead of `event.severity` / `event.event_type`.
- _should_buffer_for_digest: take (ch_name, severity, event_type)
  primitives instead of a NotificationEvent.
- _buffer_digest_event: same — take (ch_name, event_type,
  event_group, severity, title, body).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:23:29 +02:00
MacRimi 6eb1312c61 1.2.1.1-beta: notification + LXC + post-install fixes
- flask_notification_routes: PVE webhook X-Webhook-Secret written in
  standard base64 so PVE can decode it (GH #198)
- notification_channels: Gmail SMTP App Password handling — normalize
  tls_mode (None/empty → starttls), reject creds without host (false-
  positive sendmail delivery), surface "AUTH not advertised" hint
- notification_events: is_vzdump_active_on_host() reads /var/log/pve/
  tasks/active directly so backup_start fallback and vm_shutdown
  suppression survive a Monitor restart mid-backup
- notification_templates: extract --storage flag from vzdump log →
  "PBS-Cloud: vm/104/…" instead of generic "PBS:" prefix when multiple
  PBS endpoints exist
- health_monitor: pve_storage_capacity + zfs_pool_capacity respect
  per-item dismiss (don't keep category WARNING/CRITICAL after user
  dismisses); updates_check cache invalidated when /var/log/apt/
  history.log mtime advances
- lxc_mount_points: PVE volume size from subvol quota (df via
  /proc/<host_pid>/root/<target> + lxc.conf size=NNNG fallback);
  host_source_state detects "host detached" zombie binds; per-mount
  subprocess work parallelised via ThreadPoolExecutor so a CT with
  many bind mounts doesn't trip the Caddy 3s reverse-proxy timeout
- virtual-machines: "host detached" badge on bind mounts whose host
  source path disappeared
- auto/customizable_post_install: log2ram FUNC_VERSION 1.1 → 1.2; new
  log2ram-check.sh vacuums journal + truncates non-rotating logs
  (pveproxy/access.log, pveam.log) instead of only calling
  `log2ram write` (which leaves the tmpfs full); auto flow gains the
  missing SystemMaxUse in /etc/systemd/journald.conf

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:06:49 +02:00
MacRimi 81844fa456 Update AppImage 2026-05-18 18:09:15 +02:00
MacRimi 4bedeb9fcd Update appimage 2026-05-18 17:47:25 +02:00
jcastro 70ab072c79 Fix webhook loopback detection and update handoff 2026-05-14 14:33:27 +02:00
MacRimi bd9af49412 Add ProxMenux-Monitor AppImage SHA256 file 2026-05-10 06:10:09 +02:00
MacRimi ab5c7093eb Update AppImage 2026-05-10 05:19:36 +02:00
MacRimi b4e8c5101a Update AppImage 2026-05-10 05:11:51 +02:00
MacRimi 911886b90c Update AppImage 2026-05-10 05:00:00 +02:00
MacRimi c14b72456f Update AppImage 2026-05-10 04:46:33 +02:00
MacRimi 0288c14a29 Update AppImage 2026-05-09 23:22:45 +02:00
MacRimi 2f919de9e3 update beta ProxMenux 1.2.1.1-beta 2026-05-09 18:59:59 +02:00
MacRimi 5ed1fc44fd Update 1.2.1.1-beta 2026-05-09 18:31:47 +02:00
github-actions[bot] b8b49da99e Update AppImage beta build (2026-04-21 19:30:40) 2026-04-21 19:30:40 +00:00
github-actions[bot] 1119a45e93 Update AppImage release build (2026-04-21 19:28:24) 2026-04-21 19:28:24 +00:00
MacRimi c403300cd2 Merge pull request #185 from MacRimi/develop
new version v1.2.1
2026-04-21 21:14:12 +02:00
MacRimi c3a5d6201e Delete AppImage/ProxMenux-Monitor.AppImage.sha256 2026-04-21 21:12:02 +02:00
MacRimi 9e7350c3bb Delete AppImage/ProxMenux-1.2.0.AppImage 2026-04-21 21:09:21 +02:00
github-actions[bot] 9bf99c0fdd Update AppImage beta build (2026-04-19 11:49:17) 2026-04-19 11:49:17 +00:00