
Resilience & Metrics

This document describes the resilience patterns and metrics collection implemented in the backend.

Circuit Breaker

The circuit breaker pattern prevents cascading failures when external services are unavailable.

States

| State | Description |
| --- | --- |
| CLOSED | Normal operation, requests pass through |
| OPEN | Service is failing, requests are rejected immediately |
| HALF_OPEN | Testing if service has recovered |
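
The transitions between these three states can be sketched as a small pure function. This is an illustrative model only, not the backend's implementation; the event names (`threshold_exceeded`, `reset_timeout_elapsed`, `probe_succeeded`, `probe_failed`) are hypothetical labels chosen for the example.

```typescript
// Illustrative model of the circuit states; event names are hypothetical.
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

type CircuitEvent =
  | "threshold_exceeded" // error rate crossed the configured threshold
  | "reset_timeout_elapsed" // the wait period after opening has passed
  | "probe_succeeded" // trial request in HALF_OPEN succeeded
  | "probe_failed"; // trial request in HALF_OPEN failed

function nextState(state: CircuitState, event: CircuitEvent): CircuitState {
  switch (state) {
    case "CLOSED":
      // Only a tripped threshold opens a closed circuit.
      return event === "threshold_exceeded" ? "OPEN" : "CLOSED";
    case "OPEN":
      // An open circuit waits for the reset timer before probing.
      return event === "reset_timeout_elapsed" ? "HALF_OPEN" : "OPEN";
    case "HALF_OPEN":
      if (event === "probe_succeeded") return "CLOSED";
      if (event === "probe_failed") return "OPEN";
      return "HALF_OPEN";
  }
}
```

Under this model, a failing service cycles OPEN -> HALF_OPEN -> OPEN until a probe succeeds, at which point the circuit closes again.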

Configuration

Default settings (can be overridden per service):

| Option | Default | Description |
| --- | --- | --- |
| timeout | 30000 ms | Request timeout |
| errorThresholdPercentage | 50% | Error rate to open circuit |
| resetTimeout | 30000 ms | Time before trying again |
| volumeThreshold | 5 | Min requests before applying threshold |
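
How `errorThresholdPercentage` and `volumeThreshold` interact can be sketched as follows. This is a simplified model under stated assumptions: the `BreakerOptions` interface and `shouldOpen` helper are hypothetical, the counts are taken as given rather than from a rolling window, and the strict `>` comparison at the boundary is an assumption rather than confirmed behavior.

```typescript
// Hypothetical option shape mirroring the defaults in the table above.
interface BreakerOptions {
  timeout: number; // ms
  errorThresholdPercentage: number; // percent, 0-100
  resetTimeout: number; // ms
  volumeThreshold: number; // minimum number of requests
}

const DEFAULT_OPTIONS: BreakerOptions = {
  timeout: 30000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  volumeThreshold: 5,
};

// Decides whether the circuit should open for the observed request counts.
function shouldOpen(
  total: number,
  failures: number,
  overrides: Partial<BreakerOptions> = {},
): boolean {
  const options: BreakerOptions = { ...DEFAULT_OPTIONS, ...overrides };
  // Too few requests: the error rate is not considered meaningful yet.
  if (total < options.volumeThreshold) return false;
  const errorRate = (failures / total) * 100;
  // Assumption: the circuit opens only when the rate strictly exceeds the threshold.
  return errorRate > options.errorThresholdPercentage;
}
```

With the defaults, 4 failures out of 4 requests would not open the circuit (below the volume threshold), while 6 failures out of 10 would.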

Usage

Basic Usage

import { Injectable } from "@nestjs/common";
import { CircuitBreakerService } from "@/common/circuit-breaker";

@Injectable()
export class MyService {
  // httpClient is assumed to be provided elsewhere in the class (not shown).
  constructor(private readonly circuitBreaker: CircuitBreakerService) {}

  async callExternalService() {
    return this.circuitBreaker.execute(
      "my-service", // circuit name
      () => this.httpClient.get("/api/data"), // protected call
      () => ({ fallback: true }), // Optional fallback
    );
  }
}

The ResilienceService combines the circuit breaker with metrics collection:

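import { Injectable } from "@nestjs/common";
import { ResilienceService } from "@/common/resilience";

@Injectable()
export class MyService {
  // httpClient is assumed to be provided elsewhere in the class (not shown).
  constructor(private readonly resilience: ResilienceService) {}

  // data: payload forwarded to the embeddings endpoint
  async callExternalService(data: unknown) {
    return this.resilience.call(
      {
        service: "python-service",
        operation: "embed-text",
        timeout: 30000, // ms
      },
      () => this.httpClient.post("/api/embeddings/text", data),
    );
  }
}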

Monitoring Circuit Breakers

// Get stats for a specific service
const stats = circuitBreaker.getStats("python-service");

// Get all circuit breaker stats
const allStats = circuitBreaker.getAllStats();

// Check if service is healthy
const isHealthy = circuitBreaker.isHealthy("python-service");

Prometheus Metrics

Metrics are exposed at /api/metrics in Prometheus format.

Available Metrics

HTTP Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| http_request_duration_seconds | Histogram | method, route, status_code | Request duration |
| http_requests_total | Counter | method, route, status_code | Total requests |
| http_request_errors_total | Counter | method, route, status_code | 5xx errors |

External Service Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| external_call_duration_seconds | Histogram | service, operation | Call duration |
| external_calls_total | Counter | service, operation, status | Total calls |
| external_call_errors_total | Counter | service, operation, error_type | Errors |

Circuit Breaker Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| circuit_breaker_state | Gauge | service | State (0=closed, 1=half-open, 2=open) |
| circuit_breaker_rejects_total | Counter | service | Rejected requests |
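
The numeric encoding of `circuit_breaker_state` can be captured in a small lookup. This is a sketch of the mapping stated in the table; the `toGaugeValue` and `fromGaugeValue` helper names are hypothetical.

```typescript
type CircuitState = "CLOSED" | "HALF_OPEN" | "OPEN";

// Encoding from the metrics table: 0=closed, 1=half-open, 2=open.
const STATE_GAUGE_VALUE: Record<CircuitState, number> = {
  CLOSED: 0,
  HALF_OPEN: 1,
  OPEN: 2,
};

// Value to report on the gauge for a given state.
function toGaugeValue(state: CircuitState): number {
  return STATE_GAUGE_VALUE[state];
}

// Reverse lookup, e.g. for labelling a dashboard value; undefined if unknown.
function fromGaugeValue(value: number): CircuitState | undefined {
  const states: CircuitState[] = ["CLOSED", "HALF_OPEN", "OPEN"];
  return states.find((state) => STATE_GAUGE_VALUE[state] === value);
}
```

Because the values are ordered by severity, a query such as `max(circuit_breaker_state)` surfaces the worst state across services.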

Queue Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| queue_jobs_total | Counter | queue, status | Total jobs |
| queue_job_duration_seconds | Histogram | queue, job_name | Job duration |
| queue_jobs_active | Gauge | queue | Active jobs |
| queue_jobs_failed_total | Counter | queue, job_name | Failed jobs |

Business Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| business_sessions_created_total | Counter | type | Sessions created |
| business_invoices_created_total | Counter | status | Invoices created |
| business_notifications_sent_total | Counter | type, channel | Notifications sent |

Grafana Dashboard

Example Prometheus queries for Grafana:

# Request rate
rate(http_requests_total[5m])

# Error rate (%), aggregated across all series so the ratio is errors / all requests
sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Circuit breaker states
circuit_breaker_state

# External service error rate
rate(external_call_errors_total[5m])

Built-in Alerting (Slack + Email)

The backend now includes a scheduler-based alerting loop (every minute) that sends notifications to Slack/email for:

  • health/ready status down (and recovery notification)
  • spikes on http_request_errors_total
  • spikes on queue_jobs_total{status="failed"} (failed jobs growth)

Configuration (see .env.example):

  • ALERTING_ENABLED
  • ALERTING_COOLDOWN_MINUTES
  • ALERT_NOTIFICATION_EMAIL
  • ALERT_SLACK_WEBHOOK_URL
  • ALERT_HTTP_ERRORS_SPIKE_ENABLED
  • ALERT_HTTP_ERRORS_SPIKE_THRESHOLD
  • ALERT_QUEUE_FAILED_SPIKE_ENABLED
  • ALERT_QUEUE_FAILED_SPIKE_THRESHOLD
  • ALERT_SLACK_WEBHOOK_URL_DEV / ALERT_SLACK_WEBHOOK_URL_PROD (environment-specific override)
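
One way a per-minute loop could combine a spike threshold with a cooldown is sketched below. This is not the scheduler's actual code: the `SpikeDetector` class is hypothetical, and it assumes the threshold is compared against the counter's growth between two consecutive ticks.

```typescript
// Hypothetical spike detector; the "delta per tick" semantics are an assumption.
class SpikeDetector {
  private readonly threshold: number; // e.g. ALERT_HTTP_ERRORS_SPIKE_THRESHOLD
  private readonly cooldownMs: number; // e.g. ALERTING_COOLDOWN_MINUTES * 60000
  private previousValue: number | null = null;
  private lastAlertAt: number | null = null;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns true when an alert should be sent for this tick.
  check(currentValue: number, nowMs: number): boolean {
    const previous = this.previousValue;
    this.previousValue = currentValue;
    if (previous === null) return false; // first tick only records a baseline

    // Counters restart from 0 when the process restarts; treat a drop as a fresh start.
    const delta = currentValue >= previous ? currentValue - previous : currentValue;
    if (delta < this.threshold) return false;

    // Suppress repeated alerts while the cooldown is still running.
    const inCooldown =
      this.lastAlertAt !== null && nowMs - this.lastAlertAt < this.cooldownMs;
    if (inCooldown) return false;

    this.lastAlertAt = nowMs;
    return true;
  }
}
```

A scheduler tick would read the current counter value, call `check(value, Date.now())`, and notify Slack/email only when it returns true.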

Implementation files:

  • backend/src/observability/observability-alerts.scheduler.ts
  • backend/src/observability/observability-alerts-notifier.service.ts
  • backend/src/observability/observability.module.ts
  • backend/src/app.controller.ts (POST /observability/alerts/test, GET /ops/status)
  • frontend/src/pages/admin/AdminPanelPage/components/ServerTab.tsx (Ops status + synthetic alert action)

Email alert rendering (observability notifier) now includes richer formatting for readability:

  • severity badge with emoji (🚨/⚠️)
  • event header with contextual icon (recovered, down, 📈 spike)
  • details block (service, event, environment, UTC timestamp)
  • dependency status block when present (Database, Redis)
  • summary block + recommended verification checklist
  • enriched plaintext fallback with the same key information

The notifier parses Database: and Redis: statuses from readiness alert messages to render explicit state chips in the email.
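
Extracting those statuses could look like the sketch below. The exact message format is not documented here, so the example assumes fragments such as `Database: up` and `Redis: down` appear somewhere in the message; `parseDependencyStatuses` is a hypothetical helper, not the notifier's actual function.

```typescript
type DependencyName = "Database" | "Redis";
type DependencyStatuses = Partial<Record<DependencyName, string>>;

// Assumes the message contains fragments like "Database: up" or "Redis: down".
function parseDependencyStatuses(message: string): DependencyStatuses {
  const statuses: DependencyStatuses = {};
  const pattern = /\b(Database|Redis):\s*([\w-]+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(message)) !== null) {
    const name = match[1] as DependencyName;
    // Normalize case so "UP" and "up" render the same chip.
    statuses[name] = (match[2] ?? "").toLowerCase();
  }
  return statuses;
}
```

A dependency that is absent from the message is simply missing from the result, which lets the email template skip its chip.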

/ops/status reports service state as:

  • up: reachable and responding
  • protected: reachable but access-protected (401/403)
  • down: unreachable or probe error (timeout/5xx/network)
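
The three states can be derived from a probe result roughly as follows. This is a sketch: the `ProbeResult` shape and `classifyProbe` name are hypothetical, and treating other 4xx responses as `up` (reachable and responding) is an assumption.

```typescript
type ServiceState = "up" | "protected" | "down";

// Hypothetical probe result: either an HTTP status code or an error description.
interface ProbeResult {
  statusCode?: number;
  error?: string; // timeout, DNS failure, connection refused, ...
}

function classifyProbe(probe: ProbeResult): ServiceState {
  // No response at all (timeout or network error) means the service is down.
  if (probe.error !== undefined || probe.statusCode === undefined) return "down";
  // Reachable, but the probe was rejected by authentication or authorization.
  if (probe.statusCode === 401 || probe.statusCode === 403) return "protected";
  // Server-side failures count as down.
  if (probe.statusCode >= 500) return "down";
  // Assumption: any other response means the service is reachable and responding.
  return "up";
}
```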

Prometheus server-side rules (in addition to scheduler-based alerts):

  • AapertureApiTargetDown
  • AapertureHttpRequestErrorsSpike
  • AapertureQueueJobsFailedSpike
  • AapertureApiLatencyP95High
  • AapertureHttpErrorRateSloWarning (HTTP error rate > 1% over 10m)
  • AapertureHttpErrorRateSloCritical (HTTP error rate > 5% over 5m)

Grafana SLO Panels (Observability Dashboard)

The provisioned Grafana dashboard aaperture-observability.json now includes SLO-oriented panels in addition to the base operational metrics:

  • API Success Rate (5m)
  • HTTP Error Rate by Route (5m)
  • SLO Burn Rate (1h / 6h)
  • Error Budget Remaining (30d, based on 99.5% SLO target)

This complements the existing p95 latency / queue failed rate / job duration panels and makes regressions visible without switching dashboards.
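
The arithmetic behind the burn rate and error budget panels can be illustrated with the usual SLO definitions. The dashboard's actual queries are not reproduced here; this sketch only assumes the 99.5% target stated above, and the function names are hypothetical.

```typescript
const SLO_TARGET = 0.995; // 99.5% of requests should succeed
const ERROR_BUDGET = 1 - SLO_TARGET; // allowed error ratio, about 0.005

// Burn rate: how fast the budget is consumed relative to the allowed pace.
// 1 means exactly on budget; 2 means the budget is being used twice as fast.
function burnRate(observedErrorRatio: number): number {
  return observedErrorRatio / ERROR_BUDGET;
}

// Fraction of the error budget still available over the SLO window (e.g. 30d).
// Negative values mean the budget is exhausted.
function errorBudgetRemaining(totalRequests: number, failedRequests: number): number {
  if (totalRequests === 0) return 1; // no traffic, nothing consumed
  const observedErrorRatio = failedRequests / totalRequests;
  return 1 - observedErrorRatio / ERROR_BUDGET;
}
```

For example, 2,000 failed requests out of 1,000,000 is an error ratio of 0.2%, which consumes about 40% of the budget and leaves about 60%; a 1% error ratio over a 1h window corresponds to a burn rate of about 2.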

Local Prometheus + Grafana Stack

Dev compose includes an optional monitoring profile:

docker compose -f infra/docker-compose.dev.yml --profile monitoring up -d prometheus grafana

URLs:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3001

Provisioned files:

  • Prometheus config: infra/monitoring/prometheus/prometheus.yml
  • Prometheus rules: infra/monitoring/prometheus/alerts.yml
  • Grafana dashboard: infra/monitoring/grafana/dashboards/aaperture-observability.json

Production Prometheus + Grafana Stack

Production compose now includes Prometheus and Grafana with localhost-only bindings:

  • Prometheus: 127.0.0.1:9090
  • Grafana: 127.0.0.1:3001

Access them through an SSH tunnel:

ssh -L 3001:127.0.0.1:3001 -L 9090:127.0.0.1:9090 <user>@<server>

Bull Board (/ops/queues, exposed via https://queues.aaperture.com in production) must be protected with:

  • QUEUE_BOARD_BASIC_AUTH_ENABLED=true
  • QUEUE_BOARD_BASIC_AUTH_USER
  • QUEUE_BOARD_BASIC_AUTH_PASSWORD

Ops subdomains (TLS/auth/reverse-proxy runbook): docs/OPS_SUBDOMAINS.md

End-to-End Validation Checklist

  1. Alert channels configured in .env:
    • ALERTING_ENABLED=true
    • ALERT_NOTIFICATION_EMAIL=<your email>
    • ALERT_SLACK_WEBHOOK_URL_DEV or ALERT_SLACK_WEBHOOK_URL_PROD
  2. API metrics endpoint returns metrics:
    • curl -s http://127.0.0.1:8080/api/metrics | head
  3. Prometheus target is up:
    • curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl, health}'
  4. Grafana dashboard loads:
    • Aaperture - Observability (latency p95, HTTP 5xx, failed jobs, job duration p95)
  5. Scheduler alert smoke test:
    • temporarily set ALERT_HTTP_ERRORS_SPIKE_THRESHOLD=1
    • generate a few 5xx responses
    • verify Slack/email alert, then restore threshold
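
Step 3 can also be checked programmatically. The sketch below filters the JSON returned by the Prometheus `/api/v1/targets` endpoint, modeling only the `scrapeUrl` and `health` fields used in the `jq` filter above; the `unhealthyTargets` helper is hypothetical.

```typescript
// Subset of the /api/v1/targets response that the check needs.
interface ActiveTarget {
  scrapeUrl: string;
  health: string; // "up", "down", or "unknown"
}

interface TargetsResponse {
  status: string;
  data: { activeTargets: ActiveTarget[] };
}

// Returns the scrape URLs of every target that is not currently healthy.
function unhealthyTargets(response: TargetsResponse): string[] {
  return response.data.activeTargets
    .filter((target) => target.health !== "up")
    .map((target) => target.scrapeUrl);
}
```

An empty result means every active target is being scraped successfully; any returned URL points at a target to investigate.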

Recording Business Metrics

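import { Injectable } from "@nestjs/common";
import { MetricsService } from "@/common/metrics";

@Injectable()
export class SessionsService {
  // repository and CreateSessionDto are assumed to be defined elsewhere (not shown).
  constructor(private readonly metrics: MetricsService) {}

  async create(data: CreateSessionDto) {
    const session = await this.repository.create(data);
    // Increment the sessions counter, labelled by session type
    this.metrics.recordSessionCreated(session.type);
    return session;
  }
}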

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         NestJS API                          │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                  ResilienceService                  │    │
│  │        (combines circuit breaker + metrics)         │    │
│  └───────────────────────┬─────────────────────────────┘    │
│                          │                                  │
│  ┌───────────────────────┼─────────────────────────────┐    │
│  │                       │                             │    │
│  │  ┌────────────────────▼────────────────────────┐    │    │
│  │  │            CircuitBreakerService            │    │    │
│  │  │              (opossum library)              │    │    │
│  │  └─────────────────────────────────────────────┘    │    │
│  │                                                     │    │
│  │  ┌─────────────────────────────────────────────┐    │    │
│  │  │               MetricsService                │    │    │
│  │  │            (prom-client library)            │    │    │
│  │  └─────────────────────────────────────────────┘    │    │
│  │                                                     │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                 MetricsInterceptor                  │    │
│  │         (auto-records HTTP request metrics)         │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                 │                           │
                 ▼                           ▼
        ┌─────────────────┐         ┌─────────────────┐
        │ External APIs   │         │   Prometheus    │
        │ (OpenAI, Python │         │   (scrapes      │
        │  Expo, etc.)    │         │  /api/metrics)  │
        └─────────────────┘         └─────────────────┘

Files

| File | Description |
| --- | --- |
| src/common/circuit-breaker/circuit-breaker.service.ts | Circuit breaker implementation |
| src/common/circuit-breaker/circuit-breaker.module.ts | NestJS module |
| src/common/metrics/metrics.service.ts | Metrics collection |
| src/common/metrics/metrics.controller.ts | /metrics endpoint |
| src/common/metrics/metrics.interceptor.ts | Auto HTTP metrics |
| src/common/resilience/resilience.service.ts | Combined wrapper |

Protected Services

The following external services use the resilience wrapper:

| Service | Circuit Name | Operations | Status |
| --- | --- | --- | --- |
| Python PDF Service | python-pdf | html-to-pdf, generate-invoice, generate-quote, generate-table-export | Integrated |
| Python Word Service | python-word | generate-table-export, generate-conversation-export, generate-roadmap | Integrated |
| Python Excel Service | python-excel | generate-single-sheet, generate-multi-sheet, generate-conversation-export | Integrated |
| Python Intent Service | python-intent | classify-intent, generate-filters, batch-classify | Integrated |
| Python ML Service | python-ml | embed-text, embed-texts, qdrant-* | To integrate |
| OpenAI | openai | chat, embeddings, vision | To integrate |
| Expo Push | expo-push | send-notification | To integrate |
| Google Calendar | google-calendar | sync, create-event | To integrate |
| PayPal | paypal | create-payment, verify | To integrate |
| Notion | notion | search, get-page | To integrate |