Monitoring and Service Health

This guide describes the observability capabilities of Digital Retail Engine and how to verify service health.

Health Check Endpoint

Digital Retail Engine provides a health check endpoint that reports the status of all critical components:

GET /health

The response includes the status of:

| Component | Description |
| --- | --- |
| Application | Overall service availability |
| Database | SAP HANA Cloud connectivity and response time |
| Cache | Redis connectivity and response time |

Example Response

{
  "status": "UP",
  "version": "1.0.0",
  "uptime": 12345,
  "timestamp": "2026-03-27T10:30:00Z",
  "service": "digital-retail-engine",
  "checks": {
    "database": {
      "status": "UP",
      "responseTime": 12
    },
    "redis": {
      "status": "UP",
      "responseTime": 3
    },
    "webhookDelivery": {
      "status": "active"
    }
  }
}

The endpoint always returns HTTP 200 for the Cloud Foundry liveness probe, even when a dependency is down. When the database or Redis is unreachable, the top-level status reports DEGRADED instead of UP and the affected check reports DOWN, but the HTTP status remains 200. The webhookDelivery check reports active only when Redis is configured and reachable, because the webhook outbox worker depends on it; otherwise it reports inactive. This check is informational and never degrades the overall status.
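Because the HTTP status is always 200, a monitor has to inspect the response body to detect a degraded service. The following is a minimal sketch of that check, not part of the product. The function name summarize_health and the sample payload are illustrative assumptions built from the behavior described above; the exact fields of a DOWN check may differ.

```python
import json


def summarize_health(body: str) -> dict:
    """Classify a /health response body; the HTTP status alone is not enough."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    # webhookDelivery is informational, so it is excluded from failure detection.
    failing = sorted(
        name
        for name, check in checks.items()
        if name != "webhookDelivery" and check.get("status") != "UP"
    )
    return {
        "healthy": payload.get("status") == "UP",
        "status": payload.get("status"),
        "failing_checks": failing,
    }


# Hypothetical degraded response (assumed shape, Redis unreachable).
sample = json.dumps({
    "status": "DEGRADED",
    "checks": {
        "database": {"status": "UP", "responseTime": 12},
        "redis": {"status": "DOWN"},
        "webhookDelivery": {"status": "inactive"},
    },
})

print(summarize_health(sample))
# {'healthy': False, 'status': 'DEGRADED', 'failing_checks': ['redis']}
```

An alerting rule built on this would fire on healthy being False rather than on a non-200 status code.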

Service Level Objectives

The Digital Retail Engine service is operated with the following targets:

| Metric | Target | Description |
| --- | --- | --- |
| Availability | 99.5% | Monthly uptime target |
| POS API Latency (P95) | < 500 ms | 95th percentile response time for promotion evaluation |
| POS API Latency (P99) | < 2 s | 99th percentile response time for promotion evaluation |
| Error Rate | < 1% | HTTP 5xx error rate |
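To make the availability target concrete, the sketch below converts it into a monthly downtime budget and checks an error rate against the 1% target. It assumes a 30-day month, which this guide does not specify; the function names and the request counts are illustrative only.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed per period at the given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


def within_error_rate_target(total_requests: int, server_errors: int,
                             target_pct: float = 1.0) -> bool:
    """True when the share of HTTP 5xx responses is below the target."""
    if total_requests == 0:
        return True  # no traffic, so nothing violates the target
    return server_errors / total_requests * 100 < target_pct


# 99.5% over an assumed 30-day month: 43,200 minutes * 0.5% = 216 minutes.
print(downtime_budget_minutes(99.5))          # 216.0
print(within_error_rate_target(10_000, 20))   # True (0.2% of requests failed)
```

Under that assumption, 99.5% availability corresponds to roughly 3.6 hours of downtime per month.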

Available Metrics

Digital Retail Engine exposes Prometheus-compatible metrics at the /metrics endpoint. These metrics can be integrated into your existing monitoring infrastructure.

Application Metrics

| Metric | Description |
| --- | --- |
| http_requests_total | Total HTTP requests processed (by method, route, status code) |
| http_request_duration_seconds | Request latency distribution |
| http_requests_in_flight | Currently active requests |

Business Metrics

| Metric | Description |
| --- | --- |
| evaluations_total | Total promotion evaluations (by tenant, status) |
| evaluate_duration_seconds | Duration of promotion evaluation (by tenant) |
| promotions_applied_total | Number of promotions applied to baskets |
| cache_hit_total | Cache hits for active promotions and evaluation results |
| cache_miss_total | Cache misses for active promotions and evaluation results |

Integration Metrics (Public API + DRFOUT)

These metrics provide visibility into bulk imports, webhook delivery, and DRFOUT ingestion.

| Metric | Description |
| --- | --- |
| dre_imports_total | Total imports processed (by tenant, type, status) |
| dre_imports_duration_seconds | Import job duration in seconds (by tenant, type) |
| dre_webhook_deliveries_total | Webhook deliveries (by tenant, status) |
| dre_webhook_dead_letters_total | Webhook dead-letter transitions (by tenant) |
| dre_drfout_skipped_total | DRFOUT items skipped (by tenant, reason) |

Integrating with Your Monitoring Stack

The /metrics endpoint returns data in Prometheus text exposition format and can be scraped by any Prometheus-compatible monitoring tool (e.g., Grafana, Datadog, SAP Cloud ALM).
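If you want to inspect the output without a full Prometheus setup, the text format is simple enough to parse by hand. The sketch below is a minimal parser that covers only plain sample lines, followed by a 5xx error-rate calculation over http_requests_total. The label name status_code, the route /example, and all sample values are assumptions for illustration; check the actual label names in your own /metrics output.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments carry no sample data
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def server_error_rate(samples: list, label: str = "status_code") -> float:
    """Share of http_requests_total whose status label starts with '5'."""
    total = errors = 0.0
    for name, labels, value in samples:
        if name != "http_requests_total":
            continue
        total += value
        if labels.get(label, "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


# Hypothetical scrape output; names of labels and routes are assumed.
SAMPLE = """
# HELP http_requests_total Total HTTP requests processed
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/health",status_code="200"} 980
http_requests_total{method="POST",route="/example",status_code="200"} 9000
http_requests_total{method="POST",route="/example",status_code="500"} 20
http_requests_in_flight 3
"""

samples = parse_metrics(SAMPLE)
print(f"{server_error_rate(samples):.2%}")  # 0.20%
```

Counters are cumulative, so this ratio covers the whole period since the counters started. For a rate over a recent window, compare two scrapes or use your monitoring tool's rate functions.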

SAP Cloud ALM

For enterprise-grade monitoring with SAP Cloud ALM:

  1. Enable Real User Monitoring and Health Monitoring capabilities in your SAP Cloud ALM tenant.
  2. Configure the data collector to scrape the /metrics endpoint.
  3. Set up Situation Automation rules for alerting on error rate and latency thresholds.

Troubleshooting

Service Appears Slow

  1. Check the /health endpoint to verify database and cache connectivity.
  2. Review the evaluate_duration_seconds metric. High values may indicate complex promotion configurations or a large number of active promotions.
  3. Check the cache hit rate. A low hit rate increases database load. The cache is invalidated when promotions are modified, so frequent updates may reduce cache effectiveness.
  4. Contact DRE support if latency persists beyond the published SLOs.
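The cache hit rate mentioned in step 3 is not exposed directly; it can be derived from cache_hit_total and cache_miss_total. The sketch below computes it from the difference between two scrapes, since both counters are cumulative. The function name, the dictionary shape, and the numbers are illustrative assumptions, and per-label series are assumed to have been summed beforehand.

```python
from typing import Optional


def cache_hit_rate(before: dict, after: dict) -> Optional[float]:
    """Hit rate between two scrapes, or None if it cannot be computed."""
    hits = after["cache_hit_total"] - before["cache_hit_total"]
    misses = after["cache_miss_total"] - before["cache_miss_total"]
    if hits < 0 or misses < 0:
        return None  # a counter went backwards, e.g. after a restart
    lookups = hits + misses
    if lookups == 0:
        return None  # no cache lookups in the interval
    return hits / lookups


# Two hypothetical scrapes taken some minutes apart.
earlier = {"cache_hit_total": 1000, "cache_miss_total": 200}
later = {"cache_hit_total": 1600, "cache_miss_total": 600}

print(cache_hit_rate(earlier, later))  # 0.6
```

A rate that drops shortly after promotions are modified is consistent with the invalidation behavior described in step 3.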

Service Unavailable

  1. Verify the /health endpoint returns a response.
  2. Check the SAP BTP cockpit for any platform-level incidents.
  3. Contact DRE support with the health check response and timestamp for investigation.