Resilience & Metrics
This document describes the resilience patterns and metrics collection implemented in the backend.
Circuit Breaker
The circuit breaker pattern prevents cascading failures when external services are unavailable.
States
| State | Description |
|---|---|
| CLOSED | Normal operation, requests pass through |
| OPEN | Service is failing, requests are rejected immediately |
| HALF_OPEN | Testing if service has recovered |
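
The transitions between these three states can be sketched as a small pure function. This is an illustrative model only, not the backend's implementation; the event names are assumptions introduced for the example.

```typescript
// Illustrative model of the three states; the event names are hypothetical.
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";
type CircuitEvent =
  | "threshold_exceeded" // error rate crossed the configured threshold
  | "reset_timeout_elapsed" // the reset timeout passed while the circuit was open
  | "probe_succeeded" // trial request in HALF_OPEN succeeded
  | "probe_failed"; // trial request in HALF_OPEN failed

function nextState(state: CircuitState, event: CircuitEvent): CircuitState {
  switch (state) {
    case "CLOSED":
      return event === "threshold_exceeded" ? "OPEN" : "CLOSED";
    case "OPEN":
      return event === "reset_timeout_elapsed" ? "HALF_OPEN" : "OPEN";
    case "HALF_OPEN":
      if (event === "probe_succeeded") return "CLOSED";
      if (event === "probe_failed") return "OPEN";
      return "HALF_OPEN";
  }
}
```

Events that do not apply to the current state leave it unchanged, so the function is safe to call with any combination.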
Configuration
Default settings (can be overridden per service):
| Option | Default | Description |
|---|---|---|
| timeout | 30000ms | Request timeout |
| errorThresholdPercentage | 50% | Error rate to open circuit |
| resetTimeout | 30000ms | Time before trying again |
| volumeThreshold | 5 | Min requests before applying threshold |
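
How these options interact can be illustrated with a short sketch. The option names and defaults come from the table above; the merge helper, the `shouldOpen` function, and the strict `>` comparison are assumptions made for illustration, not the backend's actual code.

```typescript
// Option names and defaults mirror the table; everything else is a simplified sketch.
interface BreakerOptions {
  timeout: number; // ms
  errorThresholdPercentage: number; // 0-100
  resetTimeout: number; // ms
  volumeThreshold: number; // minimum requests before the threshold applies
}

const DEFAULT_OPTIONS: BreakerOptions = {
  timeout: 30000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  volumeThreshold: 5,
};

// Per-service overrides win over the defaults.
function resolveOptions(overrides: Partial<BreakerOptions> = {}): BreakerOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Decide whether a closed circuit should open, given counts from the current window.
function shouldOpen(total: number, failures: number, options: BreakerOptions): boolean {
  if (total < options.volumeThreshold) return false; // not enough traffic to judge
  const errorPercentage = (failures / total) * 100;
  // Assumption: the boundary is exclusive (exactly 50% does not open the circuit).
  return errorPercentage > options.errorThresholdPercentage;
}
```

With the defaults, four failures out of four requests do not open the circuit because the volume threshold of 5 has not been reached, while six failures out of ten do.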
Usage
Basic Usage
import { Injectable } from "@nestjs/common";
import { CircuitBreakerService } from "@/common/circuit-breaker";

@Injectable()
export class MyService {
  // httpClient is assumed to be available on the class (e.g. injected elsewhere)
  constructor(private readonly circuitBreaker: CircuitBreakerService) {}

  async callExternalService() {
    return this.circuitBreaker.execute(
      "my-service", // circuit name
      () => this.httpClient.get("/api/data"), // protected call
      () => ({ fallback: true }), // Optional fallback
    );
  }
}
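The role of the optional fallback can be sketched in isolation. The helper below is a simplified stand-in, not the real CircuitBreakerService; in particular, whether the real service also invokes the fallback when the circuit is open is an assumption not confirmed here.

```typescript
// Simplified stand-in for an execute-with-fallback call (hypothetical helper).
async function executeWithFallback<T>(
  action: () => Promise<T>,
  fallback?: () => T,
): Promise<T> {
  try {
    return await action();
  } catch (error) {
    if (fallback) return fallback(); // degrade gracefully with the fallback value
    throw error; // no fallback: the caller sees the original error
  }
}
```

Without a fallback the caller has to handle the rejection itself, so a fallback is mainly useful when a degraded response is acceptable.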
With Resilience Service (Recommended)
The ResilienceService combines circuit breaker with metrics:
import { Injectable } from "@nestjs/common";
import { ResilienceService } from "@/common/resilience";

@Injectable()
export class MyService {
  constructor(private readonly resilience: ResilienceService) {}

  // `data` is the request payload; httpClient is assumed to be available on the class
  async callExternalService(data: unknown) {
    return this.resilience.call(
      {
        service: "python-service",
        operation: "embed-text",
        timeout: 30000, // ms
      },
      () => this.httpClient.post("/api/embeddings/text", data),
    );
  }
}
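To illustrate what "combines circuit breaker with metrics" can mean for the metrics half, here is a minimal sketch of a wrapper that times a call and records its outcome. The `callWithMetrics` helper, the `CallRecord` shape, and the injectable clock are hypothetical; the real ResilienceService may differ.

```typescript
// Hypothetical record of one external call; label names follow the metrics tables.
interface CallRecord {
  service: string;
  operation: string;
  status: "success" | "error";
  durationSeconds: number;
}

// Times the action and reports the outcome through `record`, whether it succeeds or fails.
async function callWithMetrics<T>(
  labels: { service: string; operation: string },
  action: () => Promise<T>,
  record: (entry: CallRecord) => void,
  now: () => number = Date.now, // injectable clock, handy for tests
): Promise<T> {
  const startedAt = now();
  const finish = (status: CallRecord["status"]): void =>
    record({ ...labels, status, durationSeconds: (now() - startedAt) / 1000 });
  try {
    const result = await action();
    finish("success");
    return result;
  } catch (error) {
    finish("error");
    throw error; // metrics are recorded, but the failure still propagates
  }
}
```

Recording in both branches keeps the duration histogram and the error counter consistent with the total call count.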
Monitoring Circuit Breakers
// Get stats for a specific service
const stats = circuitBreaker.getStats("python-service");
// Get all circuit breaker stats
const allStats = circuitBreaker.getAllStats();
// Check if service is healthy
const isHealthy = circuitBreaker.isHealthy("python-service");
Prometheus Metrics
Metrics are exposed at /api/metrics in Prometheus format.
Available Metrics
HTTP Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| http_request_duration_seconds | Histogram | method, route, status_code | Request duration |
| http_requests_total | Counter | method, route, status_code | Total requests |
| http_request_errors_total | Counter | method, route, status_code | 5xx errors |
External Service Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| external_call_duration_seconds | Histogram | service, operation | Call duration |
| external_calls_total | Counter | service, operation, status | Total calls |
| external_call_errors_total | Counter | service, operation, error_type | Errors |
Circuit Breaker Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| circuit_breaker_state | Gauge | service | State (0=closed, 1=half-open, 2=open) |
| circuit_breaker_rejects_total | Counter | service | Rejected requests |
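
Because the state is exported as a number, consumers need the mapping to read it back. The sketch below encodes the mapping stated in the table; the helper names are hypothetical.

```typescript
type CircuitState = "CLOSED" | "HALF_OPEN" | "OPEN";

// Encoding from the table: 0=closed, 1=half-open, 2=open.
const STATE_TO_GAUGE: Record<CircuitState, number> = { CLOSED: 0, HALF_OPEN: 1, OPEN: 2 };

// Decode a gauge sample back into a state name; unknown values yield undefined.
function gaugeToState(value: number): CircuitState | undefined {
  const states = Object.keys(STATE_TO_GAUGE) as CircuitState[];
  return states.find((state) => STATE_TO_GAUGE[state] === value);
}

// Human-readable line for a single sample (hypothetical helper).
function describeSample(service: string, value: number): string {
  const state = gaugeToState(value) ?? "UNKNOWN";
  return `${service}: ${state}`;
}
```

For example, `describeSample("python-pdf", 2)` yields `python-pdf: OPEN`.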
Queue Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| queue_jobs_total | Counter | queue, status | Total jobs |
| queue_job_duration_seconds | Histogram | queue, job_name | Job duration |
| queue_jobs_active | Gauge | queue | Active jobs |
| queue_jobs_failed_total | Counter | queue, job_name | Failed jobs |
Business Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| business_sessions_created_total | Counter | type | Sessions created |
| business_invoices_created_total | Counter | status | Invoices created |
| business_notifications_sent_total | Counter | type, channel | Notifications sent |
Grafana Dashboard
Example Prometheus queries for Grafana:
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_request_errors_total[5m]) / rate(http_requests_total[5m]) * 100
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Circuit breaker states
circuit_breaker_state
# External service error rate
rate(external_call_errors_total[5m])
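To make the P95 query less of a black box, the sketch below reproduces in TypeScript the kind of estimate `histogram_quantile` performs: it locates the cumulative bucket containing the target rank and interpolates linearly inside it. It is a simplified illustration (valid for 0 < q <= 1, with a `+Inf` bucket present), and the `Bucket` type and function name are introduced only for this example.

```typescript
// One cumulative histogram bucket: number of observations with value <= le.
interface Bucket {
  le: number; // upper bound; Infinity stands for the +Inf bucket
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const index = sorted.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (index === sorted.length - 1) return sorted[sorted.length - 2].le;
  const upper = sorted[index].le;
  const lower = index === 0 ? 0 : sorted[index - 1].le; // first bucket starts at 0
  const below = index === 0 ? 0 : sorted[index - 1].count;
  const inBucket = sorted[index].count - below;
  // Linear interpolation inside the bucket that contains the rank.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}
```

With buckets `le=0.1 → 50`, `le=0.5 → 90`, `le=1 → 100`, `+Inf → 100`, the 95th percentile lands halfway through the 0.5 to 1 bucket, so the estimate is about 0.75 seconds. The result is only as precise as the bucket boundaries allow.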
Built-in Alerting (Slack + Email)
The backend now includes a scheduler-based alerting loop (every minute) that sends notifications to Slack/email for:
- health/ready status down (and recovery notification)
- spikes on http_request_errors_total
- spikes on queue_jobs_total{status="failed"} (failed jobs growth)
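A spike check with a cooldown, as used by a loop of this kind, can be sketched as follows. This is an assumption about the general technique (compare the counter's growth since the previous tick against a threshold, then suppress repeats during the cooldown); the class and its parameters are hypothetical and do not describe the scheduler's actual code.

```typescript
// Hypothetical spike detector for a monotonically increasing counter.
class SpikeDetector {
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private previous: number | undefined;
  private lastAlertAt: number | undefined;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Called once per scheduler tick; returns true when an alert should be sent.
  check(current: number, nowMs: number): boolean {
    const previous = this.previous;
    this.previous = current;
    if (previous === undefined) return false; // first tick: no baseline yet
    const delta = current - previous;
    if (delta < 0) return false; // counter reset, e.g. after a process restart
    if (delta < this.threshold) return false;
    const inCooldown =
      this.lastAlertAt !== undefined && nowMs - this.lastAlertAt < this.cooldownMs;
    if (inCooldown) return false; // still within the cooldown window
    this.lastAlertAt = nowMs;
    return true;
  }
}
```

The cooldown keeps a sustained incident from producing one notification per minute; the threshold and cooldown would correspond to settings such as ALERT_HTTP_ERRORS_SPIKE_THRESHOLD and ALERTING_COOLDOWN_MINUTES.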
Configuration (see .env.example):
- ALERTING_ENABLED
- ALERTING_COOLDOWN_MINUTES
- ALERT_NOTIFICATION_EMAIL
- ALERT_SLACK_WEBHOOK_URL
- ALERT_HTTP_ERRORS_SPIKE_ENABLED
- ALERT_HTTP_ERRORS_SPIKE_THRESHOLD
- ALERT_QUEUE_FAILED_SPIKE_ENABLED
- ALERT_QUEUE_FAILED_SPIKE_THRESHOLD
- ALERT_SLACK_WEBHOOK_URL_DEV / ALERT_SLACK_WEBHOOK_URL_PROD (environment-specific override)
Implementation files:
- backend/src/observability/observability-alerts.scheduler.ts
- backend/src/observability/observability-alerts-notifier.service.ts
- backend/src/observability/observability.module.ts
- backend/src/app.controller.ts (POST /observability/alerts/test, GET /ops/status)
- frontend/src/pages/admin/AdminPanelPage/components/ServerTab.tsx (Ops status + synthetic alert action)
Email alert rendering (observability notifier) now includes richer formatting for readability:
- severity badge with emoji (🚨/⚠️)
- event header with contextual icon (✅ recovered, ❌ down, 📈 spike)
- details block (service, event, environment, UTC timestamp)
- dependency status block when present (Database, Redis)
- summary block + recommended verification checklist
- enriched plaintext fallback with the same key information
The notifier parses Database: and Redis: statuses from readiness alert messages to render explicit state chips in the email.
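A parser of this kind might look like the sketch below. Only the `Database:` and `Redis:` prefixes come from this document; the example message shape, the function name, and the regular expression are assumptions, and the real alert text may differ.

```typescript
// Assumed message shape, e.g. "Readiness check failed. Database: up, Redis: down".
function parseDependencyStatuses(message: string): Record<string, string> {
  const statuses: Record<string, string> = {};
  const pattern = /\b(Database|Redis):\s*([A-Za-z_-]+)/g;
  let match: RegExpExecArray | null;
  // Collect every "<Dependency>: <status>" pair found in the message.
  while ((match = pattern.exec(message)) !== null) {
    statuses[match[1]] = match[2].toLowerCase();
  }
  return statuses;
}
```

Messages without either prefix produce an empty object, which lets the renderer skip the dependency block.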
/ops/status reports service state as:
- up: reachable and responding
- protected: reachable but access-protected (401/403)
- down: unreachable or probe error (timeout/5xx/network)
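The classification above can be expressed as a small function. The `ProbeResult` type is hypothetical, and treating every other response (for example a 404) as `up` is an assumption, since the document only specifies the 401/403 and timeout/5xx/network cases.

```typescript
type ServiceState = "up" | "protected" | "down";

// Probe outcome: an HTTP status if a response arrived, otherwise an error.
type ProbeResult =
  | { kind: "response"; status: number }
  | { kind: "error"; reason: string };

function classifyProbe(result: ProbeResult): ServiceState {
  if (result.kind === "error") return "down"; // timeout or network failure
  if (result.status === 401 || result.status === 403) return "protected";
  if (result.status >= 500) return "down";
  return "up"; // assumption: any other response counts as reachable and responding
}
```

Separating `protected` from `down` avoids false alarms for endpoints that are intentionally behind authentication.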
Prometheus server-side rules (in addition to scheduler-based alerts):
- AapertureApiTargetDown
- AapertureHttpRequestErrorsSpike
- AapertureQueueJobsFailedSpike
- AapertureApiLatencyP95High
- AapertureHttpErrorRateSloWarning (HTTP error rate > 1% over 10m)
- AapertureHttpErrorRateSloCritical (HTTP error rate > 5% over 5m)
Grafana SLO Panels (Observability Dashboard)
The provisioned Grafana dashboard aaperture-observability.json now includes SLO-oriented panels in addition to the base operational metrics:
- API Success Rate (5m)
- HTTP Error Rate by Route (5m)
- SLO Burn Rate (1h / 6h)
- Error Budget Remaining (30d, based on 99.5% SLO target)
This complements the existing p95 latency / queue failed rate / job duration panels and makes regressions visible without switching dashboards.
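The arithmetic behind the burn-rate and error-budget panels can be worked through with a short sketch. The 99.5% target comes from the panel list; the function names and the request counts in the usage note are made up for illustration, and the dashboard's actual PromQL expressions are not reproduced here.

```typescript
const SLO_TARGET = 0.995; // 99.5% success target
const ERROR_BUDGET = 1 - SLO_TARGET; // allowed error ratio (0.5%)

// Burn rate: observed error ratio divided by the allowed error ratio.
// 1 means the budget is being spent exactly at the sustainable pace.
function burnRate(errors: number, total: number): number {
  if (total === 0) return 0;
  return errors / total / ERROR_BUDGET;
}

// Fraction of the window's error budget still unspent (negative when overspent).
function errorBudgetRemaining(errors: number, total: number): number {
  if (total === 0) return 1;
  return 1 - errors / (total * ERROR_BUDGET);
}
```

For example, 100,000 requests in a window allow roughly 500 errors at a 99.5% target; 250 errors leave about half the budget, and an hour with 2% errors burns the budget about four times faster than sustainable.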
Local Prometheus + Grafana Stack
Dev compose includes an optional monitoring profile:
docker compose -f infra/docker-compose.dev.yml --profile monitoring up -d prometheus grafana
URLs:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001
Provisioned files:
- Prometheus config: infra/monitoring/prometheus/prometheus.yml
- Prometheus rules: infra/monitoring/prometheus/alerts.yml
- Grafana dashboard: infra/monitoring/grafana/dashboards/aaperture-observability.json
Production Prometheus + Grafana Stack
Production compose now includes Prometheus and Grafana with localhost-only bindings:
- Prometheus: 127.0.0.1:9090
- Grafana: 127.0.0.1:3001
Access them through an SSH tunnel:
ssh -L 3001:127.0.0.1:3001 -L 9090:127.0.0.1:9090 <user>@<server>
Bull Board (/ops/queues, exposed via https://queues.aaperture.com in production) must be protected with:
- QUEUE_BOARD_BASIC_AUTH_ENABLED=true
- QUEUE_BOARD_BASIC_AUTH_USER
- QUEUE_BOARD_BASIC_AUTH_PASSWORD
Ops subdomains (TLS/auth/reverse-proxy runbook): docs/OPS_SUBDOMAINS.md
End-to-End Validation Checklist
- Alert channels configured in .env:
  - ALERTING_ENABLED=true
  - ALERT_NOTIFICATION_EMAIL=<your email>
  - ALERT_SLACK_WEBHOOK_URL_DEV or ALERT_SLACK_WEBHOOK_URL_PROD
- API metrics endpoint returns metrics:
  curl -s http://127.0.0.1:8080/api/metrics | head
- Prometheus target is up:
  curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl, health}'
- Grafana dashboard loads:
  - Aaperture - Observability (latency p95, HTTP 5xx, failed jobs, job duration p95)
- Scheduler alert smoke test:
  - temporarily set ALERT_HTTP_ERRORS_SPIKE_THRESHOLD=1
  - generate a few 5xx responses
  - verify Slack/email alert, then restore threshold
Recording Business Metrics
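import { Injectable } from "@nestjs/common";
import { MetricsService } from "@/common/metrics";

@Injectable()
export class SessionsService {
  // `repository` is assumed to be available on the class (e.g. injected elsewhere)
  constructor(private readonly metrics: MetricsService) {}

  async create(data: CreateSessionDto) {
    const session = await this.repository.create(data);
    // Increment business_sessions_created_total for this session type
    this.metrics.recordSessionCreated(session.type);
    return session;
  }
}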
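What a call such as `recordSessionCreated` amounts to can be illustrated with a minimal in-memory counter that renders samples in the Prometheus text exposition format. This is a stand-in for illustration only: the backend relies on prom-client, and the `LabeledCounter` class and the label values used below are hypothetical.

```typescript
// Minimal in-memory stand-in for a counter with a single label.
class LabeledCounter {
  private readonly name: string;
  private readonly labelName: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, labelName: string) {
    this.name = name;
    this.labelName = labelName;
  }

  // Increment the series identified by the given label value.
  inc(labelValue: string, amount = 1): void {
    this.values.set(labelValue, (this.values.get(labelValue) ?? 0) + amount);
  }

  // Render one sample line per label value, in insertion order.
  render(): string {
    return Array.from(this.values.entries())
      .map(([label, value]) => `${this.name}{${this.labelName}="${label}"} ${value}`)
      .join("\n");
  }
}
```

After two increments for one session type, `render()` produces a line such as `business_sessions_created_total{type="example"} 2`, which is the kind of sample Prometheus scrapes from the metrics endpoint.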
Architecture
┌─────────────────────────────────────────────────────────────┐
│                         NestJS API                          │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                  ResilienceService                  │    │
│  │        (combines circuit breaker + metrics)         │    │
│  └───────────────────────┬─────────────────────────────┘    │
│                          │                                  │
│  ┌───────────────────────┼─────────────────────────────┐    │
│  │                       │                             │    │
│  │  ┌────────────────────▼────────────────────────┐    │    │
│  │  │            CircuitBreakerService            │    │    │
│  │  │              (opossum library)              │    │    │
│  │  └─────────────────────────────────────────────┘    │    │
│  │                                                     │    │
│  │  ┌─────────────────────────────────────────────┐    │    │
│  │  │               MetricsService                │    │    │
│  │  │            (prom-client library)            │    │    │
│  │  └─────────────────────────────────────────────┘    │    │
│  │                                                     │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                 MetricsInterceptor                  │    │
│  │         (auto-records HTTP request metrics)         │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
             │                                   │
             ▼                                   ▼
    ┌─────────────────┐                 ┌─────────────────┐
    │  External APIs  │                 │   Prometheus    │
    │ (OpenAI, Python │                 │    (scrapes     │
    │  Expo, etc.)    │                 │  /api/metrics)  │
    └─────────────────┘                 └─────────────────┘
Files
| File | Description |
|---|---|
| src/common/circuit-breaker/circuit-breaker.service.ts | Circuit breaker implementation |
| src/common/circuit-breaker/circuit-breaker.module.ts | NestJS module |
| src/common/metrics/metrics.service.ts | Metrics collection |
| src/common/metrics/metrics.controller.ts | /metrics endpoint |
| src/common/metrics/metrics.interceptor.ts | Auto HTTP metrics |
| src/common/resilience/resilience.service.ts | Combined wrapper |
Protected Services
The following external services use the resilience wrapper or are planned to; the Status column shows which are already integrated:
| Service | Circuit Name | Operations | Status |
|---|---|---|---|
| Python PDF Service | python-pdf | html-to-pdf, generate-invoice, generate-quote, generate-table-export | Integrated |
| Python Word Service | python-word | generate-table-export, generate-conversation-export, generate-roadmap | Integrated |
| Python Excel Service | python-excel | generate-single-sheet, generate-multi-sheet, generate-conversation-export | Integrated |
| Python Intent Service | python-intent | classify-intent, generate-filters, batch-classify | Integrated |
| Python ML Service | python-ml | embed-text, embed-texts, qdrant-* | To integrate |
| OpenAI | openai | chat, embeddings, vision | To integrate |
| Expo Push | expo-push | send-notification | To integrate |
| Google Calendar | google-calendar | sync, create-event | To integrate |
| PayPal | paypal | create-payment, verify | To integrate |
| Notion | notion | search, get-page | To integrate |