Alerts & Runbooks
Prometheus fires the rules below from docker/prometheus/alert-rules.yml. There is no Alertmanager — alerts render as Grafana annotations (red vertical lines) and are inspected through the Security & Access dashboard plus Loki. If/when Alertmanager is added, wire the webhook in docker-compose.yml and extend this doc.
Generated section: the table below is regenerated from `alert-rules.yml` by `make gen-docs`. Hand-edit the narrative below the table, not the table itself.
Rules
| Rule | Severity | For | Expression (abridged) | Meaning |
|---|---|---|---|---|
| HighUnauthorizedRate | warning | 2m | `sum(rate(traefik_entrypoint_requests_total{code=~"401\|403"}[5m])) > 0.5` | High unauthorized access rate |
| HighServerErrorRate | critical | 3m | `sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) > 0.1` | High server error rate (5xx) |
| AuthentikDown | critical | 1m | `up{job="authentik"} == 0` | Authentik SSO is down |
| TraefikDown | critical | 1m | `up{job="traefik"} == 0` | Traefik reverse proxy is down |
| DiskSpaceLow | warning | 5m | `(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.10` | Disk space below 10% |
| HighMemoryUsage | warning | 5m | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.90` | Memory usage above 90% |
| ContainerRestartLoop | warning | 1m | `increase(container_restart_count[15m]) > 3` | Container restarting repeatedly |
| ContainerOOMKilled | critical | 0m | `increase(container_oom_events_total[5m]) > 0` | Container OOM killed: {{ $labels.name }} |
| ContainerMemoryNearLimit | warning | 5m | `(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.90` | Container {{ $labels.name }} using >90% of memory limit |
Response runbooks
HighUnauthorizedRate
- Grafana → Explore → Loki: `{job="traefik-access"} |= "401"` (or `|= "403"`). Look for repeated client IPs.
- `docker exec h5h_crowdsec cscli decisions list`: confirm that CrowdSec has already banned them.
- If they are not banned: `docker exec h5h_crowdsec cscli decisions add -i <ip> -d 24h -R "manual ban"`.
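
For spotting repeated client IPs outside Grafana, a minimal sketch follows. It assumes the access log is in Traefik's default Common Log Format (client IP in field 1, HTTP status in field 9); if the log is JSON, this will not work as written. The function name `top_unauthorized_ips` and the `access.log` path in the usage note are hypothetical, not part of this repo.

```shell
# Hypothetical helper: rank client IPs by number of 401/403 responses.
# Assumption: input is Common Log Format, so $1 is the client IP and
# $9 is the HTTP status code.
# Optional argument: how many IPs to print (default 10).
top_unauthorized_ips() {
  awk '$9 == 401 || $9 == 403 { count[$1]++ }
       END { for (ip in count) print count[ip], ip }' |
    sort -k1,1nr -k2,2 |
    head -n "${1:-10}"
}
```

Usage, with a placeholder log path: `top_unauthorized_ips 5 < access.log`. Each output line is a count followed by the IP, highest count first, which gives candidates for `cscli decisions add`.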
HighServerErrorRate
- `make health`: is any container unhealthy?
- `docker compose -f docker/docker-compose.yml logs --tail=200 <service>`.
- Restart the offender: `make update-service s=<service>` (re-pulls and recreates).
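
The first two steps can be combined into one pass. This is a sketch only: it does not reproduce whatever `make health` does internally, it relies on containers having a Docker healthcheck defined, and the function names are hypothetical.

```shell
# Hypothetical helper: list containers whose Docker healthcheck reports
# "unhealthy". Containers without a healthcheck never match this filter.
unhealthy_containers() {
  docker ps --filter health=unhealthy --format '{{.Names}}'
}

# For each unhealthy container, print a header and its last 200 log lines.
triage_5xx() {
  names=$(unhealthy_containers)
  if [ -z "$names" ]; then
    echo "no unhealthy containers"
    return 0
  fi
  for name in $names; do
    echo "== $name =="
    docker logs --tail=200 "$name" 2>&1
  done
}
```

Run `triage_5xx` on the host; if it prints "no unhealthy containers", the 5xx responses are likely coming from a service that is up but failing, so check its logs directly.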
AuthentikDown / TraefikDown
See "Locked out — Authentik is down" below.
DiskSpaceLow
- `du -sh ~/Desktop/hemanth/h5h/data/*`: find the largest subdirectory.
- Prune Docker: `make clean`. Prune backups: `ls -t data/backups/*.tar.gz | tail -n +8 | xargs rm` (offen keeps 7 days by default).
- Check Loki chunks: `data/loki/chunks` grows with log volume; retention is 30 d but compaction runs every 10 min.
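
The backup-pruning one-liner deletes immediately. A more cautious variant with a dry-run default might look like the sketch below; `prune_backups` and the `DRY_RUN` variable are hypothetical names, and like the original it relies on `ls -t` ordering, so it assumes backup filenames contain no newlines.

```shell
# Hypothetical helper: keep the newest N *.tar.gz files in a directory.
# $1: backup directory, $2: number of files to keep (default 7).
# Prints what it would remove unless DRY_RUN=0 is set.
prune_backups() {
  dir=$1
  keep=${2:-7}
  ls -t "$dir"/*.tar.gz 2>/dev/null |
    tail -n +"$((keep + 1))" |
    while IFS= read -r f; do
      if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would remove $f"
      else
        rm -- "$f"
      fi
    done
}
```

Review the output of `prune_backups data/backups 7` first, then repeat with `DRY_RUN=0 prune_backups data/backups 7` to actually delete.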
HighMemoryUsage
- `make mem-check` shows containers and OOM events.
- If Immich ML is the culprit, confirm `MACHINE_LEARNING_MODEL_TTL=300` is still set; otherwise stop the photos profile: `make stop-profile p=photos`.
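
One way to confirm the TTL setting is to read it from a dotenv-style file. The sketch below assumes the variable lives in such a file; this doc does not say where it is defined, so the file path is left as a parameter, and `env_value` is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of a variable from a dotenv-style
# file. Commented-out lines are ignored because the match is anchored at
# the start of the line. If the key appears more than once, the last
# occurrence wins.
# $1: path to the file, $2: variable name.
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}
```

For example, `[ "$(env_value <env-file> MACHINE_LEARNING_MODEL_TTL)" = 300 ] || echo "TTL is not 300"`, where `<env-file>` is whichever file actually holds the setting.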
ContainerRestartLoop / ContainerOOMKilled / ContainerMemoryNearLimit
- Read logs: `docker compose -f docker/docker-compose.yml logs <service>`.
- Bump `mem_limit` in `docker-compose.yml` if the OOM is legitimate; otherwise find the leak.
- For Postgres, `memswap_limit == mem_limit` on purpose (prevents swap thrashing). Don't raise swap.
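
Before bumping `mem_limit`, it can help to see which containers are close to their limit right now. A minimal sketch, assuming `docker stats` output formatted as name and memory percentage; `near_limit` is a hypothetical filter, and for containers with no limit set, Docker reports the percentage against host memory instead.

```shell
# Hypothetical filter: read "<name> <mem%>" lines on stdin and print the
# ones above a threshold percentage (default 90).
# $1: threshold in percent.
near_limit() {
  awk -v limit="${1:-90}" '{
    pct = $2
    sub(/%$/, "", pct)              # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1, $2
  }'
}
```

Usage: `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | near_limit 90`. This is a point-in-time snapshot, whereas the alert requires the condition to hold for 5m.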
Recovery: "Locked out — Authentik is down"
Forward-auth routes through auth.h5h.me. If Authentik dies, every protected subdomain will 401 and you can't log in to fix it. Recovery path:
- SSH to the host (no SSO needed). If remote, connect via Tailscale.
- `docker compose -f code/docker/docker-compose.yml logs authentik-server authentik-postgres authentik-redis`: which one died?
- Most common: Postgres didn't come up. Run `docker compose -f code/docker/docker-compose.yml restart authentik-postgres` first, then restart `authentik-server`.
- If the DB is corrupt, restore from the offen backup:

      ls -t ~/Desktop/hemanth/h5h/data/backups/h5h-backup-*.tar.gz | head -1
      # Extract, copy postgres/authentik volume back, restart stack.

- If Authentik is healthy but you can't log in (forgot the admin password), reset with the bootstrap credentials already in `docker/.env`:

      docker exec -it h5h_authentik ak shell
      >>> from authentik.core.models import User
      >>> u = User.objects.get(email="<AUTHENTIK_BOOTSTRAP_EMAIL>")
      >>> u.set_password("<new-password>"); u.save()

- Once back up, re-run `make setup-authentik` if RBAC bindings look wrong (idempotent: the script uses `get_or_create`).
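
The "Postgres first, then server" restart order can be scripted so the server is only restarted once the database accepts connections. This is a sketch under assumptions: the Postgres image ships `pg_isready` (the official image does), and the `H5H_COMPOSE_FILE` variable and the `restart_authentik` name are hypothetical.

```shell
# Hypothetical helper: restart authentik-postgres, wait until it accepts
# connections, then restart authentik-server.
# Assumption: pg_isready is available inside the authentik-postgres container.
restart_authentik() {
  file=${H5H_COMPOSE_FILE:-code/docker/docker-compose.yml}

  docker compose -f "$file" restart authentik-postgres || return 1

  # Poll for up to about 60 seconds (30 tries, 2 seconds apart).
  tries=0
  until docker compose -f "$file" exec -T authentik-postgres pg_isready >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge 30 ]; then
      echo "authentik-postgres did not become ready" >&2
      return 1
    fi
    sleep 2
  done

  docker compose -f "$file" restart authentik-server
}
```

If the function fails with the "did not become ready" message, the database is probably not just slow to start, and the corrupt-DB restore step is the next thing to consider.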
Emergency stop
Keep only Traefik + Authentik + Tunnel + Dashboard (locks down everything else while leaving you a way to log in):
make emergency-stop
make status # verify
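
To double-check the result independently of `make status`, the running container names can be compared against an allowlist. This is a sketch only: `unexpected_running` is a hypothetical helper, and the substrings passed in the usage note are guesses at how the four kept services are named, so adjust them to the real container names.

```shell
# Hypothetical filter: read container names on stdin and print every name
# that does not contain any of the allowed substrings given as arguments.
unexpected_running() {
  while IFS= read -r name; do
    ok=0
    for allowed in "$@"; do
      case "$name" in
        *"$allowed"*) ok=1 ;;
      esac
    done
    [ "$ok" = 1 ] || echo "$name"
  done
}
```

Usage, with assumed name fragments: `docker ps --format '{{.Names}}' | unexpected_running traefik authentik tunnel dashboard`. Empty output means nothing outside the allowlist is still running.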