Resilience & Metrics
This document describes the resilience patterns and metrics collection implemented in the backend.
Circuit Breaker
The circuit breaker pattern prevents cascading failures when external services are unavailable.
States
| State | Description |
|---|---|
| CLOSED | Normal operation, requests pass through |
| OPEN | Service is failing, requests are rejected immediately |
| HALF_OPEN | Testing if service has recovered |
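The transitions between these states can be sketched as a small state machine. This is an illustration only: the event names and both functions are hypothetical and are not taken from the backend's implementation.

```typescript
// Hypothetical names throughout; not the backend's actual code.
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

type CircuitEvent =
  | "THRESHOLD_EXCEEDED" // error rate crossed the configured threshold
  | "RESET_TIMEOUT_ELAPSED" // the open circuit has waited long enough
  | "PROBE_SUCCEEDED" // the trial request in HALF_OPEN succeeded
  | "PROBE_FAILED"; // the trial request in HALF_OPEN failed

function nextState(state: CircuitState, event: CircuitEvent): CircuitState {
  switch (state) {
    case "CLOSED":
      return event === "THRESHOLD_EXCEEDED" ? "OPEN" : "CLOSED";
    case "OPEN":
      return event === "RESET_TIMEOUT_ELAPSED" ? "HALF_OPEN" : "OPEN";
    case "HALF_OPEN":
      if (event === "PROBE_SUCCEEDED") return "CLOSED";
      if (event === "PROBE_FAILED") return "OPEN";
      return "HALF_OPEN";
  }
}

// OPEN rejects immediately; CLOSED passes requests and HALF_OPEN passes a trial request.
function allowsRequest(state: CircuitState): boolean {
  return state !== "OPEN";
}
```

Events that do not apply to the current state leave it unchanged, so an OPEN circuit can only leave that state once the reset timeout has elapsed.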
Configuration
Default settings (can be overridden per service):
| Option | Default | Description |
|---|---|---|
| timeout | 30000 ms | Request timeout |
| errorThresholdPercentage | 50% | Error rate at which the circuit opens |
| resetTimeout | 30000 ms | Time the circuit stays OPEN before a trial request is allowed |
| volumeThreshold | 5 | Minimum number of requests before the error threshold is applied |
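As a rough illustration of how these options interact, the sketch below merges per-service overrides over the defaults and decides whether a circuit should open. `BreakerOptions`, `resolveOptions`, and `shouldOpen` are hypothetical names, and the strict `>` comparison is an assumption of this sketch rather than a documented guarantee.

```typescript
interface BreakerOptions {
  timeout: number; // milliseconds
  errorThresholdPercentage: number; // 0 to 100
  resetTimeout: number; // milliseconds
  volumeThreshold: number; // minimum requests before the threshold applies
}

// Defaults copied from the table above.
const DEFAULT_OPTIONS: BreakerOptions = {
  timeout: 30000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  volumeThreshold: 5,
};

// Per-service overrides win over the defaults; unspecified options keep their default.
function resolveOptions(overrides: Partial<BreakerOptions> = {}): BreakerOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Hypothetical helper: should the circuit open given the calls seen so far?
function shouldOpen(total: number, failures: number, options: BreakerOptions): boolean {
  if (total < options.volumeThreshold) return false; // too few requests to judge
  const errorPercentage = (failures / total) * 100;
  return errorPercentage > options.errorThresholdPercentage; // assumed strict comparison
}
```

With the defaults, 4 failures out of 4 calls do not open the circuit because the volume threshold of 5 has not been reached, while 6 failures out of 10 calls (60%) do.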
Usage
Basic Usage
import { Injectable } from "@nestjs/common";
import { CircuitBreakerService } from "@/common/circuit-breaker";

@Injectable()
export class MyService {
  // httpClient is assumed to be provided elsewhere on this class (not shown)
  constructor(private readonly circuitBreaker: CircuitBreakerService) {}

  async callExternalService() {
    return this.circuitBreaker.execute(
      "my-service",
      () => this.httpClient.get("/api/data"),
      () => ({ fallback: true }), // Optional fallback
    );
  }
}
With Resilience Service (Recommended)
The ResilienceService combines the circuit breaker with metrics collection:
import { Injectable } from "@nestjs/common";
import { ResilienceService } from "@/common/resilience";

@Injectable()
export class MyService {
  // httpClient is assumed to be provided elsewhere on this class (not shown)
  constructor(private readonly resilience: ResilienceService) {}

  async callExternalService(data: unknown) {
    return this.resilience.call(
      {
        service: "python-service",
        operation: "embed-text",
        timeout: 30000, // milliseconds
      },
      () => this.httpClient.post("/api/embeddings/text", data),
    );
  }
}
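To make the "metrics" half of such a wrapper more concrete, here is a minimal sketch that times a call, enforces a timeout, and records the outcome. `instrumentedCall`, `CallContext`, and the in-memory `records` array are hypothetical stand-ins. The real ResilienceService may work differently, and the circuit breaker step is deliberately left out.

```typescript
// Hypothetical sketch; the real ResilienceService may differ.
interface CallContext {
  service: string;
  operation: string;
  timeout: number; // milliseconds
}

interface CallRecord extends CallContext {
  status: "success" | "error";
  durationMs: number;
}

const records: CallRecord[] = []; // stand-in for a metrics backend

async function instrumentedCall<T>(ctx: CallContext, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  // Rejects once ctx.timeout elapses; raced against the real call below.
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${ctx.service}/${ctx.operation} timed out`)),
      ctx.timeout,
    );
  });
  try {
    const result = await Promise.race([fn(), timeoutPromise]);
    records.push({ ...ctx, status: "success", durationMs: Date.now() - start });
    return result;
  } catch (error) {
    // Failures and timeouts are both recorded as errors, then rethrown to the caller.
    records.push({ ...ctx, status: "error", durationMs: Date.now() - start });
    throw error;
  } finally {
    clearTimeout(timer); // avoid a dangling timer after the call settles
  }
}
```

The recorded fields mirror the `service`, `operation`, and `status` labels of the external service metrics, which is why the sketch keeps them together in one record.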
Monitoring Circuit Breakers
// Get stats for a specific service
const stats = circuitBreaker.getStats("python-service");
// Get all circuit breaker stats
const allStats = circuitBreaker.getAllStats();
// Check if service is healthy
const isHealthy = circuitBreaker.isHealthy("python-service");
Prometheus Metrics
Metrics are exposed at /api/metrics in Prometheus format.
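For readers unfamiliar with the Prometheus text format, the sketch below renders a labelled counter the way a scrape response presents it. `SketchCounter` is a hypothetical in-memory stand-in written for this example (the backend uses prom-client for this), and the route value is made up.

```typescript
type Labels = Record<string, string>;

// Escape backslash, double quote, and newline, as the text format requires for label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Hypothetical in-memory counter, for illustration only.
class SketchCounter {
  private readonly series: Record<string, { labels: Labels; value: number }> = {};

  constructor(
    private readonly name: string,
    private readonly help: string,
  ) {}

  // Each distinct label combination is tracked as its own series.
  inc(labels: Labels, amount = 1): void {
    const key = Object.keys(labels)
      .sort()
      .map((n) => `${n}=${labels[n]}`)
      .join(",");
    const entry = this.series[key] ?? { labels, value: 0 };
    entry.value += amount;
    this.series[key] = entry;
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const key of Object.keys(this.series)) {
      const entry = this.series[key];
      if (entry === undefined) continue;
      const pairs = Object.keys(entry.labels)
        .sort()
        .map((n) => `${n}="${escapeLabelValue(entry.labels[n] ?? "")}"`)
        .join(",");
      lines.push(`${this.name}{${pairs}} ${entry.value}`);
    }
    return lines.join("\n") + "\n";
  }
}

const requests = new SketchCounter("http_requests_total", "Total requests");
requests.inc({ method: "GET", route: "/api/users", status_code: "200" });
console.log(requests.render());
```

The rendered output consists of a `# HELP` line, a `# TYPE` line, and one sample line per label combination, here `http_requests_total{method="GET",route="/api/users",status_code="200"} 1`.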
Available Metrics
HTTP Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| http_request_duration_seconds | Histogram | method, route, status_code | Request duration |
| http_requests_total | Counter | method, route, status_code | Total requests |
| http_request_errors_total | Counter | method, route, status_code | 5xx errors |
External Service Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| external_call_duration_seconds | Histogram | service, operation | Call duration |
| external_calls_total | Counter | service, operation, status | Total calls |
| external_call_errors_total | Counter | service, operation, error_type | Errors |
Circuit Breaker Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| circuit_breaker_state | Gauge | service | State (0=closed, 1=half-open, 2=open) |
| circuit_breaker_rejects_total | Counter | service | Rejected requests |
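Because a gauge can only hold a number, the circuit state has to be encoded. A minimal sketch of that encoding and its inverse, using the values from the table above, could look like this; the type and function names are hypothetical.

```typescript
type CircuitState = "CLOSED" | "HALF_OPEN" | "OPEN";

// Encoding taken from the circuit_breaker_state description above.
const STATE_TO_GAUGE: Record<CircuitState, number> = {
  CLOSED: 0,
  HALF_OPEN: 1,
  OPEN: 2,
};

function toGaugeValue(state: CircuitState): number {
  return STATE_TO_GAUGE[state];
}

// Decode a scraped gauge value, for example when labelling a dashboard panel.
// Values outside the encoding yield undefined.
function fromGaugeValue(value: number): CircuitState | undefined {
  const states: CircuitState[] = ["CLOSED", "HALF_OPEN", "OPEN"];
  return states.filter((state) => STATE_TO_GAUGE[state] === value)[0];
}
```

Ordering the values by severity means a query such as `circuit_breaker_state > 0` selects every service whose circuit is not fully closed.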
Queue Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| queue_jobs_total | Counter | queue, status | Total jobs |
| queue_job_duration_seconds | Histogram | queue, job_name | Job duration |
| queue_jobs_active | Gauge | queue | Active jobs |
| queue_jobs_failed_total | Counter | queue, job_name | Failed jobs |
Business Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| business_sessions_created_total | Counter | type | Sessions created |
| business_invoices_created_total | Counter | status | Invoices created |
| business_notifications_sent_total | Counter | type, channel | Notifications sent |
Grafana Dashboard
Example Prometheus queries for Grafana:
# Request rate
rate(http_requests_total[5m])
# Error rate (percentage of requests that returned 5xx)
rate(http_request_errors_total[5m]) / rate(http_requests_total[5m]) * 100
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Circuit breaker states
circuit_breaker_state
# External service error rate
rate(external_call_errors_total[5m])
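The P95 query relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reimplements that idea in simplified form to show where the number comes from. It is not Prometheus's code, and the `Bucket` type and sample data are made up. In the real query the bucket values are per-second rates rather than raw counts, but the interpolation is the same.

```typescript
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0) return -Infinity;
  if (q > 1) return Infinity;
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  // A usable histogram needs at least two buckets, the last one being +Inf.
  if (sorted.length < 2 || last === undefined || last.le !== Infinity) return NaN;
  if (last.count === 0) return NaN;

  let rank = q * last.count;
  // Find the first finite bucket whose cumulative count reaches the rank.
  let index = sorted.length - 1;
  for (let i = 0; i < sorted.length - 1; i++) {
    if (sorted[i]!.count >= rank) {
      index = i;
      break;
    }
  }
  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (index === sorted.length - 1) return sorted[sorted.length - 2]!.le;

  const bucket = sorted[index]!;
  if (index === 0 && bucket.le <= 0) return bucket.le;

  let lower = 0; // the first bucket is assumed to start at 0
  let inBucket = bucket.count;
  if (index > 0) {
    const previous = sorted[index - 1]!;
    lower = previous.le;
    inBucket -= previous.count;
    rank -= previous.count;
  }
  // Linear interpolation between the bucket's lower and upper bound.
  return lower + (bucket.le - lower) * (rank / inBucket);
}

const sample: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, sample)); // approximately 0.75
```

In the sample, rank 95 falls in the (0.5, 1] bucket, which holds 10 observations; the rank sits halfway through it, so the estimate is about 0.75 seconds. The result is only as precise as the bucket boundaries allow.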
Recording Business Metrics
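import { Injectable } from "@nestjs/common";
import { MetricsService } from "@/common/metrics";

@Injectable()
export class SessionsService {
  // repository and CreateSessionDto are assumed to be defined elsewhere (not shown)
  constructor(private readonly metrics: MetricsService) {}

  async create(data: CreateSessionDto) {
    const session = await this.repository.create(data);
    // Record the new session under its type (see business_sessions_created_total)
    this.metrics.recordSessionCreated(session.type);
    return session;
  }
}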
Architecture
┌─────────────────────────────────────────────────────────────┐
│                         NestJS API                          │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                  ResilienceService                  │   │
│   │        (combines circuit breaker + metrics)         │   │
│   └──────────────────────────┬──────────────────────────┘   │
│                              │                              │
│   ┌──────────────────────────┼──────────────────────────┐   │
│   │                          │                          │   │
│   │   ┌──────────────────────▼──────────────────────┐   │   │
│   │   │            CircuitBreakerService            │   │   │
│   │   │              (opossum library)              │   │   │
│   │   └─────────────────────────────────────────────┘   │   │
│   │                                                     │   │
│   │   ┌─────────────────────────────────────────────┐   │   │
│   │   │               MetricsService                │   │   │
│   │   │            (prom-client library)            │   │   │
│   │   └─────────────────────────────────────────────┘   │   │
│   │                                                     │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                 MetricsInterceptor                  │   │
│   │         (auto-records HTTP request metrics)         │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
              │                                 │
              ▼                                 ▼
    ┌───────────────────┐             ┌───────────────────┐
    │   External APIs   │             │    Prometheus     │
    │ (OpenAI, Python,  │             │     (scrapes      │
    │  Expo, etc.)      │             │   /api/metrics)   │
    └───────────────────┘             └───────────────────┘
Files
| File | Description |
|---|---|
| src/common/circuit-breaker/circuit-breaker.service.ts | Circuit breaker implementation |
| src/common/circuit-breaker/circuit-breaker.module.ts | NestJS module |
| src/common/metrics/metrics.service.ts | Metrics collection |
| src/common/metrics/metrics.controller.ts | /metrics endpoint |
| src/common/metrics/metrics.interceptor.ts | Auto HTTP metrics |
| src/common/resilience/resilience.service.ts | Combined wrapper |
Protected Services
The following external services use the resilience wrapper:
| Service | Circuit Name | Operations | Status |
|---|---|---|---|
| Python PDF Service | python-pdf | html-to-pdf, generate-invoice, generate-quote, generate-table-export | Integrated |
| Python Word Service | python-word | generate-table-export, generate-conversation-export, generate-roadmap | Integrated |
| Python Excel Service | python-excel | generate-single-sheet, generate-multi-sheet, generate-conversation-export | Integrated |
| Python Intent Service | python-intent | classify-intent, generate-filters, batch-classify | Integrated |
| Python ML Service | python-ml | embed-text, embed-texts, qdrant-* | To integrate |
| OpenAI | openai | chat, embeddings, vision | To integrate |
| Expo Push | expo-push | send-notification | To integrate |
| Google Calendar | google-calendar | sync, create-event | To integrate |
| PayPal | paypal | create-payment, verify | To integrate |
| Notion | notion | search, get-page | To integrate |