Observability
It's often handy to know what's happening under the hood, especially when debugging a scan or task. the observability stack includes:
- Container logs: always available, the first stop
/api/heartbeatendpoint: health + config summary for uptime monitors- Sentry: opt-in error tracking with stack traces
- OpenTelemetry: opt-in distributed tracing + metrics
Logs
environment:
- LOG_LEVEL=INFO # DEBUG | INFO | WARNING | ERROR
- FORCE_COLOR=0 # 1 to force colour even when not a TTY
- NO_COLOR=1 # 1 to disable colour entirely
INFO is the default and the sane choice for production. Drop to DEBUG only while debugging a specific issue, because DEBUG is chatty.
Reading logs
Log lines are prefixed with module + timestamp:
INFO: [RomM][scan_handler][2026-04-18 11:37:40] Identified as PlayStation 🎮
ERROR: [RomM][ra_handler][2026-04-18 11:48:55] Invalid RetroAchievements API key
WARNING: [RomM][config_manager][2026-04-18 12:01:12] config.yml not found, using defaults
Some useful grep commands:
docker logs romm 2>&1 | grep ERROR
docker logs romm 2>&1 | grep -iE 'auth|oidc|oauth'
docker logs romm 2>&1 | grep -iE 'scan_handler.*Identified'
/api/heartbeat
A single-request endpoint to fetch health and config information. Works when not logged in, though some fields only appear for authenticated callers.
Wire this to your uptime monitor. A failure here means that the process is down or the DB/Valkey is unreachable.
# Basic uptime check
curl -fsS https://demo.romm.app/api/heartbeat > /dev/null \
&& echo "up" \
|| echo "down"
Per-metadata provider health:
Useful when a scan is matching poorly and you want to know whether a provider is down on their side or misconfigured on yours.
Sentry
Opt-in error tracking:
What's sent:
- Stack traces for unhandled exceptions
- Per-request timing on slow endpoints
- Redacted URL parameters (secrets stripped)
What's not sent: ROM filenames, user credentials, metadata provider API keys. Sensitive parameters are filtered before reporting.
OpenTelemetry
If you're already using OpenTelemetry with your other apps and want unified observability:
environment:
- OTEL_ENABLED=true
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
- OTEL_SERVICE_NAME=romm
- OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod
Standard OTEL env vars apply. The app emits:
- Traces: HTTP request spans, DB query spans, RQ job spans
- Metrics: request counts, durations, queue depth, scan progress
- Logs: structured log correlation with trace IDs
Exporters:
- OTLP gRPC (default, port
4317) - OTLP HTTP (port
4318): setOTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf.
Send to an OpenTelemetry Collector, then fan out to Tempo/Jaeger/Honeycomb/Datadog/Grafana/whatever you run.
Task status
Live task state is also exposed programmatically:
Returns an array of every scheduled/manual/watcher task with current status (idle, queued, running, failed) and last run time. Scrape this into your monitoring to alert on "Folder Scan hasn't run in 48 hours", which usually means RQ workers are dead.
Anti-patterns
- Don't parse unstructured log lines for metrics (use OTEL instead)
- Don't log at DEBUG in production as the volume is real and scans will drown in it
- Don't scrape HTML pages for health checks; HTML changes between versions while the API endpoint is stable
Minimum recommended stack
- Default
INFOlogs into the container logs → forwarded to Loki/Promtail/whatever you already run /api/heartbeathit every 60 seconds from Uptime Kuma/Gatus