
Resilience & Metrics

This document describes the resilience patterns and metrics collection implemented in the backend.

Circuit Breaker

The circuit breaker pattern prevents cascading failures when external services are unavailable.

States

| State | Description |
| --- | --- |
| CLOSED | Normal operation; requests pass through |
| OPEN | Service is failing; requests are rejected immediately |
| HALF_OPEN | Testing whether the service has recovered |
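The transitions between these states can be illustrated with a small, self-contained state machine. This is a sketch under simplifying assumptions, not the project's CircuitBreakerService and not opossum's algorithm: it counts successes and failures cumulatively since the circuit last closed, whereas opossum evaluates errors over a rolling window. The class SimpleCircuitBreaker and its methods are hypothetical; the option names mirror the configuration options listed in this document.

```typescript
// Illustrative CLOSED -> OPEN -> HALF_OPEN state machine (hypothetical).
// Simplification: uses cumulative counts since the last close, NOT opossum's
// rolling-window statistics.
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface BreakerOptions {
  errorThresholdPercentage: number; // open when the error rate reaches this percentage
  volumeThreshold: number; // minimum number of calls before the threshold applies
  resetTimeout: number; // ms to stay OPEN before letting trial calls through
}

class SimpleCircuitBreaker {
  private state: CircuitState = "CLOSED";
  private successes = 0;
  private failures = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  // `now` is injectable so tests can control the clock.
  constructor(opts: BreakerOptions, now: () => number = Date.now) {
    this.opts = opts;
    this.now = now;
  }

  getState(): CircuitState {
    return this.state;
  }

  // OPEN rejects calls until resetTimeout has elapsed, then moves to HALF_OPEN.
  canRequest(): boolean {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.opts.resetTimeout) {
      this.state = "HALF_OPEN";
    }
    return this.state !== "OPEN";
  }

  recordSuccess(): void {
    if (this.state === "HALF_OPEN") {
      this.close(); // trial call succeeded: the service has recovered
    } else {
      this.successes++;
    }
  }

  recordFailure(): void {
    if (this.state === "HALF_OPEN") {
      this.open(); // trial call failed: reject again for another resetTimeout
      return;
    }
    this.failures++;
    const total = this.successes + this.failures;
    const errorRate = (this.failures / total) * 100;
    if (total >= this.opts.volumeThreshold && errorRate >= this.opts.errorThresholdPercentage) {
      this.open();
    }
  }

  private open(): void {
    this.state = "OPEN";
    this.openedAt = this.now();
  }

  private close(): void {
    this.state = "CLOSED";
    this.successes = 0;
    this.failures = 0;
  }
}
```

Injecting the clock (now) keeps the transitions testable without waiting for a real 30-second reset timeout.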

Configuration

Default settings (can be overridden per service):

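| Option | Default | Description |
| --- | --- | --- |
| timeout | 30000 ms | Request timeout; slower calls count as failures |
| errorThresholdPercentage | 50 (%) | Error rate at which the circuit opens |
| resetTimeout | 30000 ms | How long the circuit stays OPEN before moving to HALF_OPEN and trying again |
| volumeThreshold | 5 | Minimum number of requests before the error threshold is applied |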
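These option names match the options accepted by opossum, which CircuitBreakerService is built on, so one plausible way to apply per-service overrides is a shallow merge over the defaults before creating each breaker (the merged object could then be passed to new CircuitBreaker(action, options)). The override map and the 60-second PDF timeout below are hypothetical values for illustration only.

```typescript
// Defaults from the configuration table; option names follow opossum's.
interface CircuitBreakerOptions {
  timeout: number; // ms before a call counts as failed
  errorThresholdPercentage: number; // error rate (%) that opens the circuit
  resetTimeout: number; // ms to stay OPEN before HALF_OPEN
  volumeThreshold: number; // minimum requests before the threshold applies
}

const DEFAULT_OPTIONS: Readonly<CircuitBreakerOptions> = {
  timeout: 30_000,
  errorThresholdPercentage: 50,
  resetTimeout: 30_000,
  volumeThreshold: 5,
};

// Hypothetical per-service overrides (values are illustrative, not real config).
const SERVICE_OVERRIDES: Record<string, Partial<CircuitBreakerOptions>> = {
  "python-pdf": { timeout: 60_000 },
};

// Shallow merge: any option a service does not override keeps its default.
function resolveOptions(service: string): CircuitBreakerOptions {
  return { ...DEFAULT_OPTIONS, ...SERVICE_OVERRIDES[service] };
}
```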

Usage

Basic Usage

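```typescript
import { Injectable } from "@nestjs/common";
import { CircuitBreakerService } from "@/common/circuit-breaker";

@Injectable()
export class MyService {
  constructor(private readonly circuitBreaker: CircuitBreakerService) {}

  async callExternalService() {
    // Arguments: circuit name, the protected call, and an optional fallback.
    // `this.httpClient` stands for the HTTP client the service already uses (not shown).
    return this.circuitBreaker.execute(
      "my-service",
      () => this.httpClient.get("/api/data"),
      () => ({ fallback: true }), // Optional fallback, used when the call fails or the circuit is open
    );
  }
}
```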
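To make the execute() contract concrete (a circuit name, the protected call, an optional fallback), here is a deliberately tiny model. It is hypothetical and much simpler than the real service: each name gets its own circuit on first use, a circuit opens after a fixed number of consecutive failures, and there is no HALF_OPEN recovery. It illustrates two properties of named breakers: failures of one named service do not trip another, and an open circuit returns the fallback without calling the action at all.

```typescript
// Hypothetical, simplified model of execute(name, action, fallback?).
// Opens after `maxFailures` consecutive failures and never recovers on its own.
interface NamedCircuit {
  consecutiveFailures: number;
  open: boolean;
}

class TinyCircuitRegistry {
  private readonly circuits = new Map<string, NamedCircuit>();
  private readonly maxFailures: number;

  constructor(maxFailures = 5) {
    this.maxFailures = maxFailures;
  }

  // One circuit per name, created lazily on first use.
  private circuit(name: string): NamedCircuit {
    let circuit = this.circuits.get(name);
    if (!circuit) {
      circuit = { consecutiveFailures: 0, open: false };
      this.circuits.set(name, circuit);
    }
    return circuit;
  }

  isOpen(name: string): boolean {
    return this.circuit(name).open;
  }

  async execute<T, F = T>(name: string, action: () => Promise<T>, fallback?: () => F): Promise<T | F> {
    const circuit = this.circuit(name);
    if (circuit.open) {
      // Fail fast: the action is not called while the circuit is open.
      if (fallback) return fallback();
      throw new Error(`Circuit "${name}" is open`);
    }
    try {
      const result = await action();
      circuit.consecutiveFailures = 0;
      return result;
    } catch (err) {
      circuit.consecutiveFailures++;
      if (circuit.consecutiveFailures >= this.maxFailures) circuit.open = true;
      if (fallback) return fallback();
      throw err;
    }
  }
}
```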

ResilienceService combines the circuit breaker with metrics collection:

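```typescript
import { Injectable } from "@nestjs/common";
import { ResilienceService } from "@/common/resilience";

@Injectable()
export class MyService {
  constructor(private readonly resilience: ResilienceService) {}

  async callExternalService(data: unknown) {
    return this.resilience.call(
      {
        service: "python-service", // circuit name
        operation: "embed-text", // operation label for metrics
        timeout: 30000, // ms
      },
      // `this.httpClient` stands for the HTTP client the service already uses (not shown).
      () => this.httpClient.post("/api/embeddings/text", data),
    );
  }
}
```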
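A minimal sketch of the timeout and metrics part of a call({ service, operation, timeout }, fn) wrapper, assuming metrics are reduced to one record per call. A plain array stands in for MetricsService, and resilientCall, withTimeout and CallRecord are hypothetical names; the circuit-breaker part is left out.

```typescript
// Hypothetical sketch: timeout + one metrics record per external call.
interface CallOptions {
  service: string;
  operation: string;
  timeout: number; // ms
}

interface CallRecord {
  service: string;
  operation: string;
  status: "success" | "error"; // cf. the `status` label of external_calls_total
  durationSeconds: number;
}

// Reject if `promise` has not settled within `ms`; always clear the timer.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // don't keep the process alive after the call settles
  }
}

async function resilientCall<T>(
  options: CallOptions,
  fn: () => Promise<T>,
  records: CallRecord[], // stand-in for MetricsService
): Promise<T> {
  const start = Date.now();
  let status: CallRecord["status"] = "error";
  try {
    const result = await withTimeout(fn(), options.timeout);
    status = "success";
    return result;
  } finally {
    // One observation per call, whether it succeeded, failed or timed out.
    records.push({
      service: options.service,
      operation: options.operation,
      status,
      durationSeconds: (Date.now() - start) / 1000,
    });
  }
}
```

Promise.race only stops waiting: the timed-out request itself keeps running unless the HTTP client supports cancellation (for example through an AbortSignal).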

Monitoring Circuit Breakers

```typescript
// `circuitBreaker` is an injected CircuitBreakerService instance.

// Get stats for a specific service
const stats = circuitBreaker.getStats("python-service");

// Get stats for all circuit breakers
const allStats = circuitBreaker.getAllStats();

// Check whether the service's circuit is healthy
const isHealthy = circuitBreaker.isHealthy("python-service");
```
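The stats calls above lend themselves to a health summary, for example for a health-check endpoint. This sketch assumes a hypothetical stats shape with only a name and a state (the real return type of getAllStats() may differ), and the policy it encodes (any OPEN circuit makes the summary unhealthy, HALF_OPEN is reported as recovering) is an illustrative choice.

```typescript
// Hypothetical stats shape; the real getAllStats() result may carry more fields.
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface CircuitStats {
  name: string;
  state: CircuitState;
}

interface HealthSummary {
  healthy: boolean; // false as soon as any circuit is OPEN
  open: string[]; // circuits currently rejecting calls
  recovering: string[]; // circuits in HALF_OPEN, probing the service
}

function summarizeHealth(allStats: CircuitStats[]): HealthSummary {
  const open = allStats.filter((s) => s.state === "OPEN").map((s) => s.name);
  const recovering = allStats.filter((s) => s.state === "HALF_OPEN").map((s) => s.name);
  return { healthy: open.length === 0, open, recovering };
}
```

Whether one open circuit should fail the whole health check is a judgment call: for a readiness probe it is often better to report degraded dependencies than to take the API out of rotation.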

Prometheus Metrics

Metrics are exposed at /api/metrics in Prometheus format.

Available Metrics

HTTP Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| http_request_duration_seconds | Histogram | method, route, status_code | Request duration |
| http_requests_total | Counter | method, route, status_code | Total requests |
| http_request_errors_total | Counter | method, route, status_code | 5xx errors |

External Service Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| external_call_duration_seconds | Histogram | service, operation | Call duration |
| external_calls_total | Counter | service, operation, status | Total calls |
| external_call_errors_total | Counter | service, operation, error_type | Errors |

Circuit Breaker Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| circuit_breaker_state | Gauge | service | State (0 = closed, 1 = half-open, 2 = open) |
| circuit_breaker_rejects_total | Counter | service | Rejected requests |
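The circuit_breaker_state gauge encodes each state as a number. A sketch of that mapping and of the resulting exposition line, assuming the state names from the States table and service names that need no label escaping (circuitStateSample is a hypothetical helper):

```typescript
type CircuitState = "CLOSED" | "HALF_OPEN" | "OPEN";

// Numeric encoding used by the circuit_breaker_state gauge.
const STATE_GAUGE_VALUE: Record<CircuitState, number> = {
  CLOSED: 0,
  HALF_OPEN: 1,
  OPEN: 2,
};

// One sample line in Prometheus text exposition format (hypothetical helper;
// assumes the service name needs no escaping).
function circuitStateSample(service: string, state: CircuitState): string {
  return `circuit_breaker_state{service="${service}"} ${STATE_GAUGE_VALUE[state]}`;
}
```

Because OPEN has the highest value, an alert expression such as circuit_breaker_state == 2 selects exactly the circuits that are currently rejecting requests.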

Queue Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| queue_jobs_total | Counter | queue, status | Total jobs |
| queue_job_duration_seconds | Histogram | queue, job_name | Job duration |
| queue_jobs_active | Gauge | queue | Active jobs |
| queue_jobs_failed_total | Counter | queue, job_name | Failed jobs |

Business Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| business_sessions_created_total | Counter | type | Sessions created |
| business_invoices_created_total | Counter | status | Invoices created |
| business_notifications_sent_total | Counter | type, channel | Notifications sent |

Grafana Dashboard

Example Prometheus queries for Grafana:

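```promql
# Request rate (requests per second, one series per method/route/status_code)
rate(http_requests_total[5m])

# Error rate (% of requests answered with a 5xx status). Sum both sides: a
# series-by-series division would match each 5xx series only with itself (100%).
sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) * 100

# P95 latency across all routes (use "sum by (le, route)" for a per-route P95)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Circuit breaker states (0 = closed, 1 = half-open, 2 = open)
circuit_breaker_state

# External service errors per second, by service
sum by (service) (rate(external_call_errors_total[5m]))
```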
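To see what the P95 query computes: histogram_quantile() estimates a quantile from cumulative bucket counts by locating the bucket that contains the target rank and interpolating linearly inside it. The following simplified re-implementation is for illustration only (histogramQuantile and Bucket are hypothetical names, and it skips some edge-case handling Prometheus performs).

```typescript
// Simplified model of PromQL's histogram_quantile() estimate, for 0 < q < 1.
interface Bucket {
  le: number; // upper bound in seconds (Infinity for the +Inf bucket)
  count: number; // cumulative number of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN; // need at least one finite bucket plus +Inf
  const total = sorted[sorted.length - 1].count; // the +Inf bucket counts everything
  if (total === 0) return NaN;
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  if (sorted[i].le === Infinity) {
    // Rank lands in the +Inf bucket: report the highest finite bound.
    return sorted[i - 1].le;
  }
  const lower = i === 0 ? 0 : sorted[i - 1].le; // first bucket starts at 0
  const countBelow = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - countBelow;
  // Linear interpolation inside the bucket that holds the rank.
  return lower + (sorted[i].le - lower) * ((rank - countBelow) / inBucket);
}
```

With buckets of 0.1, 0.5 and 1 s holding 50, 90 and 99 of 100 observations cumulatively, rank 95 falls in the 0.5 to 1 s bucket, giving 0.5 + 0.5 × 5/9 ≈ 0.78 s. The estimate can only be as precise as the bucket boundaries allow.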

Recording Business Metrics

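```typescript
import { Injectable } from "@nestjs/common";
import { MetricsService } from "@/common/metrics";

@Injectable()
export class SessionsService {
  constructor(private readonly metrics: MetricsService) {}

  async create(data: CreateSessionDto) {
    // `this.repository` and CreateSessionDto come from the sessions module (not shown).
    const session = await this.repository.create(data);
    // Increments business_sessions_created_total with the session type as the `type` label
    this.metrics.recordSessionCreated(session.type);
    return session;
  }
}
```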
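MetricsService is built on prom-client, where a labeled counter is declared with labelNames and incremented with label values (for example counter.inc({ type })). The in-memory stand-in below is a hypothetical illustration of what recordSessionCreated(type) amounts to, and of how each label value becomes its own series in the /api/metrics output.

```typescript
// Hypothetical in-memory stand-in for a prom-client Counter with a single label.
class LabeledCounter {
  private readonly values = new Map<string, number>();
  private readonly name: string;
  private readonly help: string;
  private readonly labelName: string;

  constructor(name: string, help: string, labelName: string) {
    this.name = name;
    this.help = help;
    this.labelName = labelName;
  }

  inc(labelValue: string, by = 1): void {
    this.values.set(labelValue, (this.values.get(labelValue) ?? 0) + by);
  }

  // Render in Prometheus text exposition format: one sample line per label value.
  // Label values are not escaped in this sketch.
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((count, labelValue) => {
      lines.push(`${this.name}{${this.labelName}="${labelValue}"} ${count}`);
    });
    return lines.join("\n");
  }
}

// What recordSessionCreated(type) could boil down to (hypothetical).
const sessionsCreated = new LabeledCounter("business_sessions_created_total", "Sessions created", "type");

function recordSessionCreated(type: string): void {
  sessionsCreated.inc(type);
}
```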

Architecture

```text
┌───────────────────────────────────────────────────────────┐
│                        NestJS API                         │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                  ResilienceService                  │  │
│  │        (combines circuit breaker + metrics)         │  │
│  └──────────────────────────┬──────────────────────────┘  │
│                             │                             │
│  ┌──────────────────────────┼──────────────────────────┐  │
│  │                          │                          │  │
│  │  ┌───────────────────────▼───────────────────────┐  │  │
│  │  │             CircuitBreakerService             │  │  │
│  │  │               (opossum library)               │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  │                                                     │  │
│  │  ┌───────────────────────────────────────────────┐  │  │
│  │  │                MetricsService                 │  │  │
│  │  │             (prom-client library)             │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  │                                                     │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                 MetricsInterceptor                  │  │
│  │         (auto-records HTTP request metrics)         │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
          │                                       │
          ▼                                       ▼
┌───────────────────┐                   ┌───────────────────┐
│   External APIs   │                   │    Prometheus     │
│ (OpenAI, Python,  │                   │     (scrapes      │
│    Expo, etc.)    │                   │   /api/metrics)   │
└───────────────────┘                   └───────────────────┘
```
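The MetricsInterceptor's job can be reduced to one function that runs once per completed request. This sketch strips out the NestJS and RxJS plumbing and uses plain maps instead of prom-client metrics (HttpMetricsRecorder and RequestSample are hypothetical names). It follows the HTTP metrics table: a request count per method, route and status code, an error count only for 5xx responses, and durations converted to seconds.

```typescript
// Hypothetical core of per-request HTTP metrics recording, without NestJS/RxJS.
interface RequestSample {
  method: string;
  route: string; // route template such as "/sessions/:id", not the raw URL
  statusCode: number;
  durationMs: number;
}

class HttpMetricsRecorder {
  readonly requestsTotal = new Map<string, number>();
  readonly errorsTotal = new Map<string, number>();
  // Raw observations for simplicity; a real histogram keeps bucket counts instead.
  readonly durationsSeconds = new Map<string, number[]>();

  record(sample: RequestSample): void {
    const key = `${sample.method}|${sample.route}|${sample.statusCode}`;
    this.requestsTotal.set(key, (this.requestsTotal.get(key) ?? 0) + 1);
    if (sample.statusCode >= 500) {
      // Only 5xx responses count as errors (http_request_errors_total).
      this.errorsTotal.set(key, (this.errorsTotal.get(key) ?? 0) + 1);
    }
    const durations = this.durationsSeconds.get(key) ?? [];
    durations.push(sample.durationMs / 1000); // Prometheus convention: seconds
    this.durationsSeconds.set(key, durations);
  }
}
```

Labelling by the route template rather than the raw URL keeps the number of series bounded; a label per concrete ID would grow without limit.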

Files

| File | Description |
| --- | --- |
| src/common/circuit-breaker/circuit-breaker.service.ts | Circuit breaker implementation |
| src/common/circuit-breaker/circuit-breaker.module.ts | NestJS module |
| src/common/metrics/metrics.service.ts | Metrics collection |
| src/common/metrics/metrics.controller.ts | /metrics endpoint (served at /api/metrics) |
| src/common/metrics/metrics.interceptor.ts | Automatic HTTP request metrics |
| src/common/resilience/resilience.service.ts | Combined circuit breaker + metrics wrapper |

Protected Services

The following external services use the resilience wrapper:

| Service | Circuit name | Operations | Status |
| --- | --- | --- | --- |
| Python PDF Service | python-pdf | html-to-pdf, generate-invoice, generate-quote, generate-table-export | Integrated |
| Python Word Service | python-word | generate-table-export, generate-conversation-export, generate-roadmap | Integrated |
| Python Excel Service | python-excel | generate-single-sheet, generate-multi-sheet, generate-conversation-export | Integrated |
| Python Intent Service | python-intent | classify-intent, generate-filters, batch-classify | Integrated |
| Python ML Service | python-ml | embed-text, embed-texts, qdrant-* | Not yet integrated |
| OpenAI | openai | chat, embeddings, vision | Not yet integrated |
| Expo Push | expo-push | send-notification | Not yet integrated |
| Google Calendar | google-calendar | sync, create-event | Not yet integrated |
| PayPal | paypal | create-payment, verify | Not yet integrated |
| Notion | notion | search, get-page | Not yet integrated |
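One possible way to keep the service and operation strings passed to ResilienceService.call() consistent with this table is a const map plus derived types, so that a typo fails at compile time. This is a suggestion, not existing code: CIRCUITS, callOptions and isKnownOperation are hypothetical, and only the circuits already marked as integrated are listed.

```typescript
// Circuit names and operations from the table (integrated services only).
const CIRCUITS = {
  "python-pdf": ["html-to-pdf", "generate-invoice", "generate-quote", "generate-table-export"],
  "python-word": ["generate-table-export", "generate-conversation-export", "generate-roadmap"],
  "python-excel": ["generate-single-sheet", "generate-multi-sheet", "generate-conversation-export"],
  "python-intent": ["classify-intent", "generate-filters", "batch-classify"],
} as const;

type CircuitName = keyof typeof CIRCUITS;
type OperationOf<C extends CircuitName> = (typeof CIRCUITS)[C][number];

interface TypedCallOptions<C extends CircuitName> {
  service: C;
  operation: OperationOf<C>; // only operations listed for that circuit type-check
  timeout: number; // ms
}

function callOptions<C extends CircuitName>(
  service: C,
  operation: OperationOf<C>,
  timeout = 30_000,
): TypedCallOptions<C> {
  return { service, operation, timeout };
}

// Runtime check for strings that arrive untyped (e.g. from configuration).
function isKnownOperation(service: string, operation: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(CIRCUITS, service)) return false;
  const operations: readonly string[] = CIRCUITS[service as CircuitName];
  return operations.includes(operation);
}
```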