Alerts & Runbooks

Prometheus fires the rules below from docker/prometheus/alert-rules.yml. There is no Alertmanager — alerts render as Grafana annotations (red vertical lines) and are inspected through the Security & Access dashboard plus Loki. If/when Alertmanager is added, wire the webhook in docker-compose.yml and extend this doc.

Generated section — the table below is regenerated from alert-rules.yml by make gen-docs. Hand-edit the narrative below the table, not the table itself.

Rules

| Rule | Severity | For | Expression (abridged) | Meaning |
|---|---|---|---|---|
| HighUnauthorizedRate | warning | 2m | sum(rate(traefik_entrypoint_requests_total{code=~"401\|403"}[5m])) > 0.5 | High unauthorized access rate |
| HighServerErrorRate | critical | 3m | sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) > 0.1 | High server error rate (5xx) |
| AuthentikDown | critical | 1m | up{job="authentik"} == 0 | Authentik SSO is down |
| TraefikDown | critical | 1m | up{job="traefik"} == 0 | Traefik reverse proxy is down |
| DiskSpaceLow | warning | 5m | (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.10 | Disk space below 10% |
| HighMemoryUsage | warning | 5m | (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.90 | Memory usage above 90% |
| ContainerRestartLoop | warning | 1m | increase(container_restart_count[15m]) > 3 | Container restarting repeatedly |
| ContainerOOMKilled | critical | 0m | increase(container_oom_events_total[5m]) > 0 | Container OOM killed: {{ $labels.name }} |
| ContainerMemoryNearLimit | warning | 5m | (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.90 | Container {{ $labels.name }} using >90% of memory limit |

Response runbooks

HighUnauthorizedRate

  1. Grafana → Explore → Loki: {job="traefik-access"} |= "401" (or |= "403"). Look for repeated client IPs.
  2. docker exec h5h_crowdsec cscli decisions list to check whether CrowdSec has already banned them.
  3. If not banned: docker exec h5h_crowdsec cscli decisions add -i <ip> -d 24h -R "manual ban".
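
If Loki itself is unavailable, step 1 can be approximated from raw access log lines. The following is a minimal sketch, assuming the Traefik access log is in the default common log format (client IP in field 1, status code in field 9); the function name top_unauthorized_ips is hypothetical and not part of this repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads common-log-format lines on stdin and prints
# "<count> <client-ip>" for requests answered with 401 or 403,
# busiest clients first. Optional argument: how many clients to show.
top_unauthorized_ips() {
  local limit="${1:-10}"
  awk '$9 == 401 || $9 == 403 { hits[$1]++ }
       END { for (ip in hits) print hits[ip], ip }' |
    sort -k1,1nr -k2,2 |
    head -n "$limit"
}
```

Feed it the access log on stdin, for example top_unauthorized_ips 5 < <access-log-file>, then compare the top addresses against the output of cscli decisions list. If the log is configured as JSON instead, the field positions above do not apply.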

HighServerErrorRate

  1. make health — any container unhealthy?
  2. docker compose -f docker/docker-compose.yml logs --tail=200 <service>.
  3. Restart the offender: make update-service s=<service> (re-pulls + recreates).
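
To decide which service to pass to step 2, it can help to see which router is producing the 5xx responses. A rough sketch under the same assumption as before (Traefik's default common log format, where the router name is the fourth double-quoted field); errors_by_router is a hypothetical name, not an existing make target or script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads common-log-format lines on stdin and prints
# "<count> <router-name>" for responses with a 5xx status, worst first.
# Splitting on double quotes keeps user agents with spaces in one field:
#   $3 = " <status> <size> ", $8 = router name.
errors_by_router() {
  awk -F'"' '{
         split($3, parts, " ")          # parts[1] is the status code
         if (parts[1] ~ /^5[0-9][0-9]$/) count[$8]++
       }
       END { for (r in count) print count[r], r }' |
    sort -k1,1nr -k2,2
}
```

Run it as errors_by_router < <access-log-file>. The router names it prints come from the Traefik configuration, so mapping a router back to a compose service is left to the operator.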

AuthentikDown / TraefikDown

See "Locked out — Authentik is down" below.

DiskSpaceLow

  • du -sh ~/Desktop/hemanth/h5h/data/* to find the largest subdirectory.
  • Prune Docker: make clean. Prune backups: ls -t data/backups/*.tar.gz | tail -n +8 | xargs rm, which keeps the 7 newest archives (offen keeps 7 days by default).
  • Check Loki chunks: data/loki/chunks grows with log volume; retention is 30 d and compaction runs every 10 min.
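
The backup pruning one-liner deletes immediately. A slightly more cautious sketch of the same idea, with a dry run by default, is shown here; prune_backups and the DRY_RUN variable are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep the N newest *.tar.gz files in a directory.
#   prune_backups <dir> [keep]      (keep defaults to 7)
# Only prints what it would delete unless DRY_RUN=0 is set.
# Like the one-liner it replaces, it assumes file names without newlines.
prune_backups() {
  local dir="$1" keep="${2:-7}"
  # ls -t lists newest first; tail -n +K starts output at line K.
  ls -t "$dir"/*.tar.gz 2>/dev/null |
    tail -n +"$((keep + 1))" |
    while IFS= read -r file; do
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would remove $file"
      else
        rm -- "$file"
      fi
    done
}
```

Calling prune_backups data/backups lists the candidates; DRY_RUN=0 prune_backups data/backups actually removes them. An empty directory produces no output and no error.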

HighMemoryUsage

  • make mem-check shows containers and OOM events.
  • If Immich ML is the culprit, confirm MACHINE_LEARNING_MODEL_TTL=300 is still set; otherwise stop the photos profile: make stop-profile p=photos.
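
The HighMemoryUsage expression, 1 - MemAvailable / MemTotal, can be reproduced by hand to confirm what the alert saw. A minimal sketch, assuming a Linux host where /proc/meminfo exists; mem_used_percent is a hypothetical name and is not what make mem-check runs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints memory usage as a percentage, computed the
# same way as the alert rule: (1 - MemAvailable / MemTotal) * 100.
#   mem_used_percent [meminfo-file]   (defaults to /proc/meminfo)
mem_used_percent() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END {
         if (total <= 0) exit 1        # no MemTotal line found
         printf "%.1f\n", (1 - avail / total) * 100
       }' "$meminfo"
}
```

For example, with MemTotal at 16000000 kB and MemAvailable at 1200000 kB the function prints 92.5, which is above the 90% threshold in the rule.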

ContainerRestartLoop / ContainerOOMKilled / ContainerMemoryNearLimit

  • Read logs: docker compose -f docker/docker-compose.yml logs <service>.
  • Bump mem_limit in docker-compose.yml if the OOM is legitimate, otherwise find the leak.
  • For Postgres, memswap_limit == mem_limit on purpose (prevents swap thrashing). Don't raise swap.
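
To see which containers are close to their limit before the next OOM kill, the ContainerMemoryNearLimit check can be approximated from docker stats output. A sketch only: near_memory_limit is a hypothetical name, and it expects "name percentage" pairs on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "<container-name> <mem-percent>%" lines on
# stdin and prints those strictly above the threshold (default 90).
near_memory_limit() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '{
    pct = $2
    sub(/%$/, "", pct)                 # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1, $2
  }'
}
```

A matching input can be produced with docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | near_memory_limit 90. Note that Docker reports the percentage against the container's limit, or against host memory when no limit is set, so containers without mem_limit will rarely show up here.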

Recovery: "Locked out — Authentik is down"

Forward-auth routes through auth.h5h.me. If Authentik dies, every protected subdomain returns 401 and you cannot log in to fix it. Recovery path:

  1. SSH to the host (no SSO needed). If remote, connect via Tailscale.
  2. docker compose -f code/docker/docker-compose.yml logs authentik-server authentik-postgres authentik-redis — which died?
  3. Most common: Postgres didn't come up. docker compose -f code/docker/docker-compose.yml restart authentik-postgres first, then authentik-server.
  4. If DB is corrupt, restore from the offen backup:
    ```bash
    # Print the newest backup archive.
    ls -t ~/Desktop/hemanth/h5h/data/backups/h5h-backup-*.tar.gz | head -n 1
    # Extract, copy postgres/authentik volume back, restart stack.
    ```
  5. If Authentik is healthy but you can't log in (forgot admin password), reset with the bootstrap credentials already in docker/.env:
    ```bash
    docker exec -it h5h_authentik ak shell
    # ak shell opens a Python REPL; enter the following at the >>> prompt:
    >>> from authentik.core.models import User
    >>> u = User.objects.get(email="<AUTHENTIK_BOOTSTRAP_EMAIL>")
    >>> u.set_password("<new-password>"); u.save()
    ```
  6. Once back up, re-run make setup-authentik if RBAC bindings look wrong (idempotent — script uses get_or_create).
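
For step 2, a quick way to see which of the Authentik containers died, before reading any logs, is to filter container states. A minimal sketch; not_running is a hypothetical name, and the name=authentik filter in the usage note assumes all three container names contain "authentik".

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "<container-name> <state>" lines on stdin
# and prints every container whose state is not "running".
not_running() {
  awk '$2 != "running" { print $1 " (" $2 ")" }'
}
```

Usage: docker ps -a --filter name=authentik --format '{{.Names}} {{.State}}' | not_running. Empty output means every matching container is running, in which case the problem is more likely inside Authentik than in its dependencies.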

Emergency stop

Keep only Traefik + Authentik + Tunnel + Dashboard (locks down everything else while leaving you a way to log in):

```bash
make emergency-stop
make status  # verify
```

MIT License