Overview
Risk Legion includes health monitoring to support system reliability, performance visibility, and fast issue detection. Health checks are available at several levels: API, database, cache, and application.
Health Check Endpoints
Primary Health Endpoint
GET /health returns the overall system health status:
{
  "status": "healthy",
  "timestamp": "2026-01-16T10:30:00Z",
  "version": "1.0.0",
  "components": {
    "api": "healthy",
    "database": "healthy",
    "redis": "healthy"
  },
  "uptime_seconds": 86400
}
| Status | Description |
|---|---|
| healthy | All systems operational |
| degraded | Some components impaired |
| unhealthy | Critical components failing |
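The three statuses can be derived from the per-component results. The following is a minimal sketch, not the project's actual code: the function name `aggregate_status` and the `CRITICAL_COMPONENTS` set are hypothetical, and treating only the database as critical is an assumption taken from the `/health` handler shown later on this page.

```python
from typing import Mapping

# Assumption: only the database is critical; Redis and other components
# merely degrade the service when they fail.
CRITICAL_COMPONENTS = frozenset({"database"})


def aggregate_status(components: Mapping[str, str]) -> str:
    """Collapse per-component statuses into healthy / degraded / unhealthy."""
    # Any critical component reporting "unhealthy" fails the whole service.
    if any(components.get(name) == "unhealthy" for name in CRITICAL_COMPONENTS):
        return "unhealthy"
    # Any other impairment leaves the service running in a degraded state.
    if any(status != "healthy" for status in components.values()):
        return "degraded"
    return "healthy"
```

Keeping the rule in one function makes it easy to change which components count as critical without touching the individual checks.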
Component Health
Database Health
{
  "status": "healthy",
  "latency_ms": 12,
  "connection_pool": {
    "active": 5,
    "idle": 15,
    "max": 20
  }
}
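The `connection_pool` fields are enough to compute pool utilization on the client side. A small sketch, assuming the payload shape shown above; the helper names are hypothetical, and the 80% default mirrors the "Active > 80% of max" threshold listed under Database Metrics.

```python
def pool_utilization(pool: dict) -> float:
    """Return active connections as a fraction of the pool maximum."""
    if pool["max"] <= 0:
        raise ValueError("pool max must be positive")
    return pool["active"] / pool["max"]


def pool_saturated(pool: dict, threshold: float = 0.8) -> bool:
    """True when active connections exceed the given share of the maximum."""
    return pool_utilization(pool) > threshold


# Payload copied from the example response above.
payload = {
    "status": "healthy",
    "latency_ms": 12,
    "connection_pool": {"active": 5, "idle": 15, "max": 20},
}
print(f"{pool_utilization(payload['connection_pool']):.0%}")  # prints 25%
```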
Redis Health
{
  "status": "healthy",
  "latency_ms": 2,
  "memory_used_mb": 45,
  "memory_max_mb": 256
}
Implementation
FastAPI Health Endpoint
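# backend/app/routers/health.py
import time
from datetime import datetime, timezone

from fastapi import APIRouter, Response, status

# Assumed project modules: adjust these paths to wherever the async database
# client, the Redis client, and the settings object live in your codebase.
from app.core.cache import redis
from app.core.config import settings
from app.core.database import db

router = APIRouter()
start_time = time.time()


@router.get("/health")
async def health_check(response: Response):
    components = {}
    overall_status = "healthy"

    # Check database: a failure here makes the whole service unhealthy
    try:
        db_start = time.perf_counter()
        await db.execute("SELECT 1")
        db_latency = (time.perf_counter() - db_start) * 1000
        components["database"] = {
            "status": "healthy",
            "latency_ms": round(db_latency, 2)
        }
    except Exception as e:
        components["database"] = {
            "status": "unhealthy",
            "error": str(e)
        }
        overall_status = "unhealthy"

    # Check Redis: a failure only degrades the service
    try:
        redis_start = time.perf_counter()
        await redis.ping()
        redis_latency = (time.perf_counter() - redis_start) * 1000
        components["redis"] = {
            "status": "healthy",
            "latency_ms": round(redis_latency, 2)
        }
    except Exception as e:
        components["redis"] = {
            "status": "degraded",
            "error": str(e)
        }
        if overall_status == "healthy":
            overall_status = "degraded"

    # Return 503 when unhealthy so that `curl -f` and status-code checks fail
    if overall_status == "unhealthy":
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE

    return {
        "status": overall_status,
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "version": settings.APP_VERSION,
        "components": components,
        "uptime_seconds": int(time.time() - start_time)
    }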
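The handler above awaits each dependency without a deadline, so a hung connection could stall the health check itself. One possible way to bound each probe is `asyncio.wait_for`. This is a sketch under assumptions: `timed_check`, `fast_probe`, and `slow_probe` are hypothetical names, and the probes stand in for calls such as `db.execute("SELECT 1")` or `redis.ping()`.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def timed_check(
    probe: Callable[[], Awaitable[object]],
    timeout_s: float = 2.0,
    failure_status: str = "unhealthy",
) -> dict:
    """Run one dependency probe with a deadline; report status and latency."""
    started = time.perf_counter()
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"status": failure_status, "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"status": failure_status, "error": str(exc)}
    latency_ms = (time.perf_counter() - started) * 1000
    return {"status": "healthy", "latency_ms": round(latency_ms, 2)}


async def fast_probe() -> None:
    """Stand-in for a dependency that answers immediately."""
    await asyncio.sleep(0)


async def slow_probe() -> None:
    """Stand-in for a dependency that hangs longer than the deadline."""
    await asyncio.sleep(1)


async def run_demo() -> list:
    # gather() runs both probes concurrently, so the slowest one bounds the total.
    return await asyncio.gather(
        timed_check(fast_probe),
        timed_check(slow_probe, timeout_s=0.05, failure_status="degraded"),
    )
```

Running the probes concurrently also keeps the endpoint's worst-case response time close to the single longest timeout, which should stay below the health check timeout configured in Docker.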
Docker Health Check
# Dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
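Slim base images often ship without curl. In that case the same probe can be written with the Python standard library. The sketch below is an assumption-laden alternative, not part of the Risk Legion image: the function names and the idea of a separate `healthcheck.py` file are hypothetical.

```python
import http.client
import sys
import urllib.error
import urllib.request


def is_healthy(url: str, timeout_s: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers HTTP 4xx/5xx, refused connections, timeouts, and malformed URLs.
        return False


def main(argv: list) -> int:
    """Return a process exit code: 0 when healthy, 1 otherwise."""
    target = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return 0 if is_healthy(target) else 1
```

Saved as a hypothetical `healthcheck.py` that ends with `raise SystemExit(main(sys.argv))`, it could replace the curl command with `CMD python healthcheck.py`.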
Docker Compose Health Check
# docker-compose.yml
services:
  backend:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
Monitoring Stack
Metrics Collection
Risk Legion exposes Prometheus-compatible metrics:
Available Metrics:
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_progress | Gauge | Current active requests |
| db_query_duration_seconds | Histogram | Database query latency |
| cache_hits_total | Counter | Redis cache hits |
| cache_misses_total | Counter | Redis cache misses |
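Counters only ever increase, so a figure such as the cache hit ratio comes from the change in `cache_hits_total` and `cache_misses_total` between two samples rather than from their absolute values. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the reset handling is a simplification of what a monitoring system does.

```python
from typing import Optional


def counter_delta(previous: float, current: float) -> float:
    """Increase of a counter between two samples, tolerating a reset to zero."""
    if current < previous:
        # The process restarted and the counter started again from zero.
        return current
    return current - previous


def cache_hit_ratio(hits_delta: float, misses_delta: float) -> Optional[float]:
    """Share of lookups served from cache, or None when there was no traffic."""
    total = hits_delta + misses_delta
    if total == 0:
        return None
    return hits_delta / total


# Two hypothetical scrapes of the hit and miss counters.
hits = counter_delta(previous=1_000, current=1_090)
misses = counter_delta(previous=200, current=210)
print(cache_hit_ratio(hits, misses))  # prints 0.9
```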
Prometheus Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'risk-legion-api'
    static_configs:
      - targets: ['api:8000']
    metrics_path: /metrics
    scrape_interval: 15s
Logging
Structured Logging
import time

import structlog
from fastapi import FastAPI, Request

app = FastAPI()  # in the real application, reuse the existing app instance
logger = structlog.get_logger()


@app.middleware("http")
async def log_requests(request: Request, call_next):
    # Time the full request, including downstream middleware and the handler
    start_time = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start_time
    logger.info(
        "http_request",
        method=request.method,
        path=request.url.path,
        status_code=response.status_code,
        duration_ms=round(duration * 1000, 2),
        user_id=getattr(request.state, 'user_id', None)
    )
    return response
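Example log entry, assuming structlog is configured with a JSON renderer and that request_id is bound to the logging context elsewhere (the middleware above does not set it):
{
  "timestamp": "2026-01-16T10:30:00.123Z",
  "level": "info",
  "event": "http_request",
  "method": "GET",
  "path": "/api/v1/bras",
  "status_code": 200,
  "duration_ms": 45.23,
  "user_id": "user-uuid",
  "request_id": "req-uuid"
}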
Log Levels
| Level | Usage |
|---|---|
| DEBUG | Detailed debugging information |
| INFO | General operational events |
| WARNING | Unexpected but handled situations |
| ERROR | Errors requiring attention |
| CRITICAL | System-level failures |
Alerting
Alert Configuration
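# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname']
receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@risklegion.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'
        channel: '#alerts'

# alert_rules.yml (file name assumed)
# Alerting rules are evaluated by Prometheus, not Alertmanager, so they live in
# a rule file that prometheus.yml references under rule_files.
groups:
  - name: risk-legion-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
      - alert: HighLatency
        # histogram_quantile needs the per-bucket rates, aggregated by le
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: API latency above 2s
      - alert: DatabaseDown
        # Requires a scrape job named "database", for example a database exporter
        expr: up{job="database"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Database connection lost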
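The HighLatency rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation. The sketch below illustrates the idea in plain Python. It is a simplified approximation written for this page, not Prometheus's implementation, and the sample bucket values are invented.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The highest bucket must have an upper bound of +Inf, as Prometheus
    histograms do. Values are assumed to be non-negative.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bound; report that bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented sample: cumulative request counts per latency bucket (seconds).
sample_buckets = [(0.5, 50), (1.0, 80), (2.0, 95), (5.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the alert threshold; a bucket boundary at or near 2 seconds keeps the 2s alert meaningful.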
Dashboard Metrics
Application Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Request Rate | Requests per second | N/A (informational) |
| Error Rate | Share of requests returning 5xx | > 1% for 5 min |
| Latency P95 | 95th percentile response time | > 2 seconds |
| Active Users | Concurrent authenticated users | N/A (informational) |
Infrastructure Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| CPU Usage | Container CPU utilization | > 80% for 5 min |
| Memory Usage | Container memory utilization | > 85% for 5 min |
| Disk Usage | Volume utilization | > 80% |
| Network I/O | Bytes in/out | N/A (informational) |
Database Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Connection Pool | Active/idle connections | Active > 80% of max |
| Query Latency | Average query duration | > 500ms |
| Query Errors | Failed queries per second | > 0.1/s |
| Table Size | Database table sizes | N/A (informational) |
Error Tracking
Sentry Integration
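# backend/app/main.py
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

from app.core.config import settings  # assumed module path for the settings object

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.ENVIRONMENT,
    integrations=[FastApiIntegration()],
    traces_sample_rate=0.1,    # trace 10% of transactions
    profiles_sample_rate=0.1,  # profile 10% of sampled transactions
)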
Error Categorization
| Category | Examples |
|---|---|
| Authentication | Invalid tokens, session expired |
| Authorization | Permission denied, role mismatch |
| Validation | Invalid input, missing fields |
| Database | Connection errors, constraint violations |
| External | Third-party service failures |
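One way to apply these categories consistently is to map exception types to a category name in a single place. The sketch below is illustrative only: every exception class and the `categorize` helper are hypothetical and would need to be replaced by the application's real exception hierarchy.

```python
from typing import Dict, Type


# Hypothetical application exceptions, one per category in the table above.
class AuthenticationError(Exception):
    """Invalid token or expired session."""


class AuthorizationError(Exception):
    """Permission denied or role mismatch."""


class InputValidationError(Exception):
    """Invalid input or missing fields."""


class DatabaseError(Exception):
    """Connection error or constraint violation."""


class ExternalServiceError(Exception):
    """Failure in a third-party service."""


CATEGORY_BY_EXCEPTION: Dict[Type[Exception], str] = {
    AuthenticationError: "Authentication",
    AuthorizationError: "Authorization",
    InputValidationError: "Validation",
    DatabaseError: "Database",
    ExternalServiceError: "External",
}


def categorize(exc: BaseException) -> str:
    """Return the category for an exception, including subclasses."""
    for exc_type, category in CATEGORY_BY_EXCEPTION.items():
        if isinstance(exc, exc_type):
            return category
    return "Uncategorized"
```

The resulting string could then be attached to the Sentry event as a tag, for example with `sentry_sdk.set_tag`, so that issues can be filtered by category.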
Deployment Health
GitHub Actions Health
The CI/CD pipeline includes health verification:
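# Health check after deployment
- name: Verify Deployment
  run: |
    for i in {1..10}; do
      # "|| true" keeps the loop going when curl itself fails (for example on a
      # refused connection); the step runs under bash -e, which would otherwise
      # abort on the first failed attempt instead of retrying.
      response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://api.risklegion.com/health || true)
      if [ "$response" = "200" ]; then
        echo "Health check passed"
        exit 0
      fi
      echo "Attempt $i: Health check returned $response"
      sleep 5
    done
    echo "Health check failed after 10 attempts"
    exit 1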
Rollback Triggers
Automatic rollback is triggered when:
- The health check fails 3 consecutive times
- Error rate exceeds 5% for 5 minutes
- Critical alerts remain unresolved
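The three triggers can be expressed as a single decision function. This is a sketch of the logic only, with assumptions stated: `RollbackPolicy` and `rollback_reasons` are hypothetical names, the error rate is assumed to be sampled once per minute, and how the inputs are collected is left to the deployment tooling.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RollbackPolicy:
    consecutive_failures: int = 3   # failed health checks in a row
    max_error_rate: float = 0.05    # 5% of requests
    error_window_minutes: int = 5   # how long the rate must stay above the limit


def rollback_reasons(
    health_checks: Sequence[bool],
    error_rate_per_minute: Sequence[float],
    open_critical_alerts: int,
    policy: RollbackPolicy = RollbackPolicy(),
) -> List[str]:
    """Return the triggers that currently fire; an empty list means no rollback."""
    reasons: List[str] = []

    # Trigger 1: the most recent N health checks all failed.
    recent_checks = list(health_checks)[-policy.consecutive_failures:]
    if len(recent_checks) == policy.consecutive_failures and not any(recent_checks):
        reasons.append("health check failed on consecutive attempts")

    # Trigger 2: every sample in the window is above the error-rate limit.
    recent_rates = list(error_rate_per_minute)[-policy.error_window_minutes:]
    if len(recent_rates) == policy.error_window_minutes and all(
        rate > policy.max_error_rate for rate in recent_rates
    ):
        reasons.append("error rate above threshold for the full window")

    # Trigger 3: at least one critical alert is still open.
    if open_critical_alerts > 0:
        reasons.append("critical alerts unresolved")

    return reasons
```

Returning the list of reasons rather than a bare boolean lets the pipeline log why a rollback started.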
Runbooks
Database Connection Issues
1. Check Connection Pool: query active connections with SELECT count(*) FROM pg_stat_activity
2. Review Recent Changes: check deployment history and recent code changes
3. Restart Connection Pool: restart the application to reset the connection pool
4. Scale if Needed: increase max connections if the pool is consistently at capacity
High Latency
1. Check Slow Queries: review query performance using EXPLAIN ANALYZE
2. Check Resource Usage: monitor CPU, memory, and I/O metrics
3. Review Cache Hit Rate: check Redis cache effectiveness
4. Scale Resources: increase instance size or add replicas