Overview

Risk Legion provides health monitoring for system reliability, performance visibility, and fast issue detection. Health checks are available at multiple levels: API, database, cache, and application.

Health Check Endpoints

Primary Health Endpoint

GET /health
Returns overall system health status:
{
  "status": "healthy",
  "timestamp": "2026-01-16T10:30:00Z",
  "version": "1.0.0",
  "components": {
    "api": "healthy",
    "database": "healthy",
    "redis": "healthy"
  },
  "uptime_seconds": 86400
}
| Status | Description |
|---|---|
| healthy | All systems operational |
| degraded | Some components impaired |
| unhealthy | Critical components failing |
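
One way to read the three statuses is as the worst outcome across components, weighted by how critical each component is. The sketch below assumes the database is critical and Redis is not, which is how the implementation on this page treats them; `CRITICAL_COMPONENTS` and `overall_status` are illustrative names, not part of the Risk Legion API.

```python
# Hypothetical helper: names are illustrative, not taken from the Risk Legion codebase.
CRITICAL_COMPONENTS = {"database"}  # assumption: only the database is critical


def overall_status(components: dict[str, bool]) -> str:
    """Map per-component check results (True = check passed) to an overall status."""
    failed = {name for name, ok in components.items() if not ok}
    if failed & CRITICAL_COMPONENTS:
        return "unhealthy"  # at least one critical component is failing
    if failed:
        return "degraded"   # only non-critical components are impaired
    return "healthy"


print(overall_status({"database": True, "redis": False}))  # prints: degraded
```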

Component Health

Database Health

GET /health/database
{
  "status": "healthy",
  "latency_ms": 12,
  "connection_pool": {
    "active": 5,
    "idle": 15,
    "max": 20
  }
}

Redis Health

GET /health/redis
{
  "status": "healthy",
  "latency_ms": 2,
  "memory_used_mb": 45,
  "memory_max_mb": 256
}

Implementation

FastAPI Health Endpoint

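# backend/app/routers/health.py

import time
from datetime import datetime, timezone

from fastapi import APIRouter, Response, status

# `db`, `redis`, and `settings` are assumed to be provided by the application
# (database connection, async Redis client, and configuration object).

router = APIRouter()
start_time = time.monotonic()  # monotonic clock: unaffected by wall-clock changes

@router.get("/health")
async def health_check(response: Response):
    components = {}
    overall_status = "healthy"

    # Check database (critical: a failure makes the system unhealthy)
    try:
        db_start = time.perf_counter()
        await db.execute("SELECT 1")
        db_latency = (time.perf_counter() - db_start) * 1000
        components["database"] = {
            "status": "healthy",
            "latency_ms": round(db_latency, 2)
        }
    except Exception as e:
        components["database"] = {
            "status": "unhealthy",
            "error": str(e)
        }
        overall_status = "unhealthy"

    # Check Redis (non-critical: a failure only degrades the system)
    try:
        redis_start = time.perf_counter()
        await redis.ping()
        redis_latency = (time.perf_counter() - redis_start) * 1000
        components["redis"] = {
            "status": "healthy",
            "latency_ms": round(redis_latency, 2)
        }
    except Exception as e:
        components["redis"] = {
            "status": "degraded",
            "error": str(e)
        }
        if overall_status == "healthy":
            overall_status = "degraded"

    # Return 503 when unhealthy so that status-code-based checks
    # (curl -f, the deployment verification step) detect the failure.
    if overall_status == "unhealthy":
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE

    return {
        "status": overall_status,
        # datetime.utcnow() is deprecated; use an aware UTC timestamp instead
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z"),
        "version": settings.APP_VERSION,
        "components": components,
        "uptime_seconds": int(time.monotonic() - start_time)
    }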
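
A dependency that hangs rather than fails would stall the health endpoint itself. One possible refinement is to wrap each probe in a timeout and reuse the same timing logic for every component. This is a sketch under assumptions: `timed_check`, its 2-second default, and the stand-in probes are hypothetical and not part of the Risk Legion codebase.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def timed_check(
    probe: Callable[[], Awaitable[object]],
    timeout_s: float = 2.0,              # assumption: 2 s budget per dependency
    failure_status: str = "unhealthy",   # use "degraded" for non-critical components
) -> dict:
    """Run one dependency probe and report its status and latency."""
    started = time.perf_counter()
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"status": failure_status, "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"status": failure_status, "error": str(exc)}
    latency_ms = (time.perf_counter() - started) * 1000
    return {"status": "healthy", "latency_ms": round(latency_ms, 2)}


async def _demo() -> None:
    async def fast_probe() -> None:   # stands in for `db.execute("SELECT 1")`
        await asyncio.sleep(0.01)

    async def hung_probe() -> None:   # stands in for an unresponsive Redis
        await asyncio.sleep(10)

    print(await timed_check(fast_probe))
    print(await timed_check(hung_probe, timeout_s=0.05, failure_status="degraded"))


if __name__ == "__main__":
    asyncio.run(_demo())
```

In the endpoint, a call such as `await timed_check(redis.ping, failure_status="degraded")` could replace the hand-written try/except for Redis, assuming `redis.ping` is an async callable.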

Docker Health Check

# Dockerfile
# Requires curl to be installed in the image.
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

Docker Compose Health Check

# docker-compose.yml
services:
  backend:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
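
The curl-based checks assume curl is present in the backend image, which is often not the case for slim Python base images. A stdlib-only probe could stand in for it. The sketch below is one possible alternative, not something shipped with Risk Legion; the function names and the idea of a separate `healthcheck.py` script are assumptions.

```python
import http.client
import urllib.request


def is_healthy(url: str, timeout_s: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers URLError/HTTPError (4xx and 5xx responses),
        # refused connections, and socket timeouts.
        return False


def main(argv: list[str]) -> int:
    """Exit code for a container health check: 0 = healthy, 1 = unhealthy."""
    # Assumption: the API listens on port 8000 inside the container.
    target = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return 0 if is_healthy(target) else 1


# As a standalone script, finish with: sys.exit(main(sys.argv))
```

With such a script copied into the image, the compose `test` entry could become `["CMD", "python", "healthcheck.py"]`, assuming Python is on the container's PATH.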

Monitoring Stack

Metrics Collection

Risk Legion exposes Prometheus-compatible metrics:
GET /metrics
Available Metrics:
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_progress | Gauge | Current active requests |
| db_query_duration_seconds | Histogram | Database query latency |
| cache_hits_total | Counter | Redis cache hits |
| cache_misses_total | Counter | Redis cache misses |
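
To make the metric names concrete, the sketch below parses a small sample of Prometheus text exposition output and derives a cache hit ratio from the two cache counters. The sample values and the `method` label are invented for illustration, and the parser is deliberately simplified (it assumes no timestamps after the value); it is not a full implementation of the exposition format.

```python
from collections import defaultdict

# Illustrative payload in the shape GET /metrics might return; values are made up.
SAMPLE_METRICS = """\
# HELP cache_hits_total Redis cache hits
# TYPE cache_hits_total counter
cache_hits_total 900
# HELP cache_misses_total Redis cache misses
# TYPE cache_misses_total counter
cache_misses_total 100
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 950
http_requests_total{method="GET",status="500"} 50
"""


def sum_by_name(exposition: str) -> dict[str, float]:
    """Sum all samples of each metric, ignoring labels and comment lines."""
    totals: dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)       # "name{labels}" and "value"
        totals[series.split("{", 1)[0]] += float(value)
    return dict(totals)


def cache_hit_ratio(totals: dict[str, float]) -> float:
    """hits / (hits + misses), or 0.0 when there were no lookups."""
    hits = totals.get("cache_hits_total", 0.0)
    misses = totals.get("cache_misses_total", 0.0)
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


totals = sum_by_name(SAMPLE_METRICS)
print(f"cache hit ratio: {cache_hit_ratio(totals):.0%}")  # prints: cache hit ratio: 90%
```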

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'risk-legion-api'
    static_configs:
      - targets: ['api:8000']
    metrics_path: /metrics
    scrape_interval: 15s

Logging

Structured Logging

import time

import structlog
from fastapi import Request

# `app` is the application's existing FastAPI instance.
logger = structlog.get_logger()

@app.middleware("http")
async def log_requests(request: Request, call_next):
    # perf_counter is a monotonic clock intended for measuring durations
    start_time = time.perf_counter()

    response = await call_next(request)

    duration = time.perf_counter() - start_time

    logger.info(
        "http_request",
        method=request.method,
        path=request.url.path,
        status_code=response.status_code,
        duration_ms=round(duration * 1000, 2),
        user_id=getattr(request.state, 'user_id', None)
    )

    return response

Log Format

{
  "timestamp": "2026-01-16T10:30:00.123Z",
  "level": "info",
  "event": "http_request",
  "method": "GET",
  "path": "/api/v1/bras",
  "status_code": 200,
  "duration_ms": 45.23,
  "user_id": "user-uuid",
  "request_id": "req-uuid"
}
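
The sample record carries a `request_id`, which the middleware shown on this page does not pass explicitly, so it presumably comes from per-request context. The sketch below shows that idea with the standard library only; it swaps structlog for plain `json` and `contextvars` to stay self-contained, and `request_id_var` and `render_event` are hypothetical names. With structlog, `structlog.contextvars.bind_contextvars` together with the `merge_contextvars` processor serves the same purpose.

```python
import json
import uuid
from contextvars import ContextVar
from datetime import datetime, timezone

# Hypothetical context variable, set once per request (for example in middleware).
request_id_var = ContextVar("request_id", default=None)


def render_event(level: str, event: str, **fields) -> str:
    """Render one log record as JSON, adding timestamp, level, and request_id."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        "timestamp": now.replace("+00:00", "Z"),
        "level": level,
        "event": event,
        **fields,
        "request_id": request_id_var.get(),  # picked up from context, not passed in
    }
    return json.dumps(record)


request_id_var.set(f"req-{uuid.uuid4()}")
print(render_event("info", "http_request", method="GET", path="/api/v1/bras",
                   status_code=200, duration_ms=45.23, user_id="user-uuid"))
```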

Log Levels

| Level | Usage |
|---|---|
| DEBUG | Detailed debugging information |
| INFO | General operational events |
| WARNING | Unexpected but handled situations |
| ERROR | Errors requiring attention |
| CRITICAL | System-level failures |

Alerting

Alert Configuration

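# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname']

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@risklegion.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'
        channel: '#alerts'

# Alerting rules are evaluated by Prometheus, not Alertmanager, so they belong
# in a rule file referenced from prometheus.yml via rule_files.
# The file name and group name below are examples.
# alert_rules.yml
groups:
  - name: risk-legion
    rules:
      - alert: HighErrorRate
        # More than 0.1 5xx responses per second, per series
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected

      - alert: HighLatency
        # histogram_quantile needs per-bucket rates aggregated by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: P95 API latency above 2s

      - alert: DatabaseDown
        # Assumes a scrape job named "database" (for example, a database exporter)
        expr: up{job="database"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Database connection lost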
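
Prometheus stores a histogram as cumulative buckets, so a P95 latency is an estimate interpolated within the bucket that contains the 95th-percentile rank. The sketch below approximates that calculation to show where the number comes from; it is not Prometheus's own code, and the bucket boundaries and counts are invented.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper
    bound and ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented example: 100 requests, 98 of them at or under 1 second.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(f"p95 = {p95:.4f}s, alert = {p95 > 2}")
```

Because the result depends on bucket boundaries, the 2-second threshold is only meaningful if the histogram has a bucket boundary at or near 2 seconds.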

Dashboard Metrics

Application Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Request Rate | Requests per second | N/A (informational) |
| Error Rate | 5xx errors per second | > 1% for 5 min |
| Latency P95 | 95th percentile response time | > 2 seconds |
| Active Users | Concurrent authenticated users | N/A (informational) |

Infrastructure Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| CPU Usage | Container CPU utilization | > 80% for 5 min |
| Memory Usage | Container memory utilization | > 85% for 5 min |
| Disk Usage | Volume utilization | > 80% |
| Network I/O | Bytes in/out | N/A (informational) |
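
Thresholds of the form "> 80% for 5 min" fire only when the condition holds continuously for the whole period; a single dip below the threshold resets the timer. The sketch below illustrates that behavior, which mirrors the `for:` clause in Prometheus alerting rules. The class name and the sample readings are hypothetical.

```python
from typing import Optional


class SustainedThreshold:
    """Fires only after the value has stayed above the threshold for `for_seconds`."""

    def __init__(self, threshold: float, for_seconds: float) -> None:
        self.threshold = threshold
        self.for_seconds = for_seconds
        self._breach_since: Optional[float] = None  # when the current breach began

    def observe(self, value: float, now: float) -> bool:
        """Record one sample taken at time `now` (seconds); return True if firing."""
        if value > self.threshold:
            if self._breach_since is None:
                self._breach_since = now
            return now - self._breach_since >= self.for_seconds
        self._breach_since = None  # any sample at or below the threshold resets
        return False


cpu = SustainedThreshold(threshold=80.0, for_seconds=300.0)
for t, usage in [(0, 85.0), (150, 90.0), (300, 82.0), (330, 70.0), (360, 95.0)]:
    print(t, usage, cpu.observe(usage, now=t))
```

In this run only the sample at t=300 fires: the breach started at t=0 and has lasted 300 seconds. The dip at t=330 resets the timer, so the spike at t=360 starts a new pending period.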

Database Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Connection Pool | Active/idle connections | Active > 80% of max |
| Query Latency | Average query duration | > 500ms |
| Query Errors | Failed queries per second | > 0.1/s |
| Table Size | Database table sizes | N/A (informational) |
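
The connection pool threshold can be checked directly against the `connection_pool` object returned by GET /health/database. A minimal sketch, assuming the field names from that example response; the helper names are illustrative.

```python
def pool_utilization(pool: dict) -> float:
    """Fraction of the pool's maximum connections that are currently active."""
    return pool["active"] / pool["max"] if pool["max"] else 0.0


def pool_alert(pool: dict, threshold: float = 0.80) -> bool:
    """True when active connections exceed the threshold share of max."""
    return pool_utilization(pool) > threshold


sample = {"active": 5, "idle": 15, "max": 20}  # values from the /health/database example
print(f"{pool_utilization(sample):.0%}", pool_alert(sample))  # prints: 25% False
```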

Error Tracking

Sentry Integration

# backend/app/main.py

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.ENVIRONMENT,
    integrations=[FastApiIntegration()],
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1,
)

Error Categorization

| Category | Examples |
|---|---|
| Authentication | Invalid tokens, session expired |
| Authorization | Permission denied, role mismatch |
| Validation | Invalid input, missing fields |
| Database | Connection errors, constraint violations |
| External | Third-party service failures |

Deployment Health

GitHub Actions Health

The CI/CD pipeline includes health verification:
# Health check after deployment
- name: Verify Deployment
  run: |
    for i in {1..10}; do
      response=$(curl -s -o /dev/null -w "%{http_code}" https://api.risklegion.com/health)
      if [ "$response" = "200" ]; then
        echo "Health check passed"
        exit 0
      fi
      echo "Attempt $i: Health check returned $response"
      sleep 5
    done
    echo "Health check failed after 10 attempts"
    exit 1

Rollback Triggers

Automatic rollback is triggered when:
  • Health check fails for 3 consecutive checks
  • Error rate exceeds 5% for 5 minutes
  • Critical alerts remain unresolved
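
The three triggers can be read as independent conditions, any one of which is sufficient. The sketch below encodes them as a pure function for illustration. The data structure and field names are hypothetical, and because the page gives no time limit for unresolved critical alerts, the sketch treats any unresolved critical alert as a trigger.

```python
from dataclasses import dataclass


@dataclass
class DeploymentSignals:
    consecutive_health_failures: int
    error_rate: float                 # fraction of failing requests, 0.0 to 1.0
    error_rate_breach_seconds: float  # how long error_rate has stayed above 5%
    unresolved_critical_alerts: int


def rollback_reasons(s: DeploymentSignals) -> list[str]:
    """Return the triggers that currently apply; an empty list means no rollback."""
    reasons = []
    if s.consecutive_health_failures >= 3:
        reasons.append("health check failed 3 consecutive times")
    if s.error_rate > 0.05 and s.error_rate_breach_seconds >= 300:
        reasons.append("error rate above 5% for 5 minutes")
    if s.unresolved_critical_alerts > 0:  # assumption: no grace period
        reasons.append("critical alerts unresolved")
    return reasons


signals = DeploymentSignals(
    consecutive_health_failures=1,
    error_rate=0.08,
    error_rate_breach_seconds=320,
    unresolved_critical_alerts=0,
)
print(rollback_reasons(signals))  # prints: ['error rate above 5% for 5 minutes']
```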

Runbooks

Database Connection Issues

1. Check Connection Pool
   Query active connections: SELECT count(*) FROM pg_stat_activity
2. Review Recent Changes
   Check deployment history and recent code changes
3. Restart Connection Pool
   Restart the application to reset connection pool
4. Scale if Needed
   Increase max connections if consistently at capacity

High Latency

1. Check Slow Queries
   Review query performance using EXPLAIN ANALYZE
2. Check Resource Usage
   Monitor CPU, memory, and I/O metrics
3. Review Cache Hit Rate
   Check Redis cache effectiveness
4. Scale Resources
   Increase instance size or add replicas