Monitoring and Service Health

This guide describes the observability capabilities of Digital Retail Engine and how to verify service health.

Health Check Endpoint

Digital Retail Engine provides a health check endpoint that reports the status of all critical components:

GET /health

The response includes the status of:

| Component | Description |
| --- | --- |
| Application | Overall service availability |
| Database | SAP HANA Cloud connectivity and response time |
| Cache | Redis connectivity and response time |

Example Response

{
  "status": "UP",
  "version": "1.0.0",
  "uptime": 12345,
  "timestamp": "2026-03-27T10:30:00Z",
  "service": "digital-retail-engine",
  "checks": {
    "database": {
      "status": "UP",
      "responseTime": 12
    },
    "redis": {
      "status": "UP",
      "responseTime": 3
    },
    "webhookDelivery": {
      "status": "active"
    }
  }
}

The endpoint always returns HTTP 200 to satisfy the Cloud Foundry liveness probe, so the status code alone does not indicate health. When the database or Redis is unreachable, the top-level status reports DEGRADED instead of UP, and the affected check reports DOWN. The webhookDelivery check reports active only when Redis is configured and reachable, because the webhook outbox worker depends on it; otherwise it reports inactive. This check is informational and never degrades the overall status.
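Because the HTTP status is always 200, a monitor has to read the response body. The following is a minimal sketch of that check, not part of the product: the function name `summarize_health` and the degraded sample payload are hypothetical, and the sample assumes that a failing check simply reports `"status": "DOWN"`.

```python
import json


def summarize_health(body: str) -> dict:
    """Classify a /health response body.

    The HTTP status code is deliberately ignored, because the endpoint
    is documented to return 200 even when a dependency is down.
    """
    payload = json.loads(body)
    checks = payload.get("checks", {})
    # Dependency checks report UP or DOWN. webhookDelivery reports
    # active/inactive and is informational, so it never appears here.
    down = sorted(
        name for name, check in checks.items()
        if check.get("status") == "DOWN"
    )
    return {
        "healthy": payload.get("status") == "UP",
        "status": payload.get("status"),
        "down": down,
    }


# Hypothetical degraded response, for illustration only.
degraded = json.dumps({
    "status": "DEGRADED",
    "checks": {
        "database": {"status": "UP", "responseTime": 12},
        "redis": {"status": "DOWN"},
        "webhookDelivery": {"status": "inactive"},
    },
})
result = summarize_health(degraded)
print(result)
```

With this sample the summary marks the service as not healthy and lists only `redis` as down; the inactive webhookDelivery check is not counted as a failure.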

Service Level Objectives

The Digital Retail Engine service is operated with the following targets:

| Metric | Target | Description |
| --- | --- | --- |
| Availability | 99.5% | Monthly uptime target |
| POS API Latency (P95) | < 500 ms | 95th percentile response time for promotion evaluation |
| POS API Latency (P99) | < 2 s | 99th percentile response time for promotion evaluation |
| Error Rate | < 1% | HTTP 5xx error rate |
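To put the availability target in concrete terms, the sketch below converts a monthly percentage into a downtime allowance. It assumes that every minute of the calendar month counts toward the target; how the service actually measures uptime is not stated here, and the function name is hypothetical.

```python
def downtime_budget_minutes(availability_target: float, days_in_month: int) -> float:
    """Minutes of downtime per month that still meet the target.

    Assumption: all minutes of the month count toward availability.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - availability_target)


budget = downtime_budget_minutes(0.995, 30)
print(f"{budget:.0f} minutes")
```

Under that assumption, a 99.5% target over a 30-day month leaves roughly 216 minutes (3.6 hours) of downtime.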

Available Metrics

Digital Retail Engine exposes Prometheus-compatible metrics at the /metrics endpoint. These metrics can be integrated into your existing monitoring infrastructure.

Application Metrics

| Metric | Description |
| --- | --- |
| http_requests_total | Total HTTP requests processed (by method, route, status code) |
| http_request_duration_seconds | Request latency distribution |
| http_requests_in_flight | Currently active requests |
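The latency distribution can be compared against the P95 and P99 targets. The sketch below assumes that http_request_duration_seconds is exposed as a Prometheus histogram with cumulative buckets; this page does not confirm the metric type, and the bucket bounds and counts are invented for illustration. The estimate uses linear interpolation within a bucket, which is the same general approach as PromQL's `histogram_quantile`.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound_seconds, cumulative_count) pairs sorted
    by upper bound, ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Beyond the last finite bucket: report that bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable for q == 0).
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket data: (upper bound in seconds, cumulative count).
buckets = [(0.1, 700), (0.25, 900), (0.5, 960), (1.0, 990), (math.inf, 1000)]
p95 = estimate_quantile(0.95, buckets)
p99 = estimate_quantile(0.99, buckets)
print(f"estimated P95: {p95:.3f} s, estimated P99: {p99:.3f} s")
```

The result is only as precise as the bucket boundaries allow, so treat it as an estimate rather than an exact percentile.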

Business Metrics

| Metric | Description |
| --- | --- |
| evaluations_total | Total promotion evaluations (by tenant, status) |
| evaluate_duration_seconds | Duration of promotion evaluation (by tenant) |
| promotions_applied_total | Number of promotions applied to baskets |
| cache_hit_total | Cache hits for active promotions and evaluation results |
| cache_miss_total | Cache misses for active promotions and evaluation results |

Integration Metrics (Public API + DRFOUT)

These metrics provide visibility into bulk imports, webhook delivery, and DRFOUT ingestion.

| Metric | Description |
| --- | --- |
| dre_imports_total | Total imports processed (by tenant, type, status) |
| dre_imports_duration_seconds | Import job duration in seconds (by tenant, type) |
| dre_webhook_deliveries_total | Webhook deliveries (by tenant, status) |
| dre_webhook_dead_letters_total | Webhook dead-letter transitions (by tenant) |
| dre_drfout_skipped_total | DRFOUT items skipped (by tenant, reason) |

Integrating with Your Monitoring Stack

The /metrics endpoint returns data in Prometheus text exposition format and can be scraped by any Prometheus-compatible monitoring tool (e.g., Grafana, Datadog, SAP Cloud ALM).
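For ad hoc checks without a full monitoring stack, the text format is simple enough to read directly. The following sketch parses a scrape and computes the HTTP 5xx error ratio. The label names (`method`, `route`, `status`), the route value, and the sample numbers are assumptions for illustration; check a real scrape of /metrics for the actual labels. The parser is deliberately minimal and does not unescape label values.

```python
import re

# Metric name, optional {labels}, then the value. A trailing timestamp is ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output; label names and values are assumed.
SCRAPE = """\
# HELP http_requests_total Total HTTP requests processed
# TYPE http_requests_total counter
http_requests_total{method="POST",route="/evaluate",status="200"} 980
http_requests_total{method="POST",route="/evaluate",status="500"} 20
http_requests_in_flight 3
"""

samples = parse_metrics(SCRAPE)
requests = [(labels, value) for name, labels, value in samples
            if name == "http_requests_total"]
total = sum(value for _, value in requests)
errors = sum(value for labels, value in requests
             if labels.get("status", "").startswith("5"))
error_ratio = errors / total if total else 0.0
print(f"5xx error ratio: {error_ratio:.1%}")
```

Counters are cumulative since process start, so a ratio computed from a single scrape covers the whole uptime. For a rate over a time window, take the difference between two scrapes.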

SAP Cloud ALM

For enterprise-grade monitoring with SAP Cloud ALM:

  1. Enable Real User Monitoring and Health Monitoring capabilities in your SAP Cloud ALM tenant.
  2. Configure the data collector to scrape the /metrics endpoint.
  3. Set up Situation Automation rules for alerting on error rate and latency thresholds.

Troubleshooting

Service Appears Slow

  1. Check the /health endpoint to verify database and cache connectivity.
  2. Review the evaluate_duration_seconds metric. High values may indicate complex promotion configurations or a large number of active promotions.
  3. Check the cache hit rate. A low hit rate increases database load. The cache is invalidated when promotions are modified, so frequent updates may reduce cache effectiveness.
  4. Contact DRE support if latency persists beyond the published SLOs.
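The cache hit rate from step 3 can be derived from the cache_hit_total and cache_miss_total counters. Since counters only ever increase, the sketch below compares two readings taken some time apart to get the rate for that window. The function name and the sample readings are hypothetical.

```python
from typing import Optional


def cache_hit_ratio(
    hits_before: float, misses_before: float,
    hits_after: float, misses_after: float,
) -> Optional[float]:
    """Hit ratio between two readings of the cache counters.

    Returns None when the window has no lookups or when a counter
    went backwards (for example after a restart).
    """
    hits = hits_after - hits_before
    misses = misses_after - misses_before
    if hits < 0 or misses < 0:
        return None  # counter reset between the two readings
    lookups = hits + misses
    if lookups == 0:
        return None  # no cache activity in the window
    return hits / lookups


# Hypothetical readings taken a few minutes apart.
ratio = cache_hit_ratio(1000, 200, 1900, 300)
print(f"cache hit ratio: {ratio:.0%}")
```

A ratio that drops shortly after promotions are modified is consistent with the invalidation behavior described in step 3.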

Service Unavailable

  1. Verify the /health endpoint returns a response.
  2. Check the SAP BTP cockpit for any platform-level incidents.
  3. Contact DRE support with the health check response and timestamp for investigation.