Monitoring and Service Health
This guide describes the observability capabilities of Digital Retail Engine and how to verify service health.
Health Check Endpoint
Digital Retail Engine provides a health check endpoint that reports the status of all critical components:
GET /health
The response includes the status of:
| Component | Description |
|---|---|
| Application | Overall service availability |
| Database | SAP HANA Cloud connectivity and response time |
| Cache | Redis connectivity and response time |
Example Response
{
  "status": "UP",
  "version": "1.0.0",
  "uptime": 12345,
  "timestamp": "2026-03-27T10:30:00Z",
  "service": "digital-retail-engine",
  "checks": {
    "database": {
      "status": "UP",
      "responseTime": 12
    },
    "redis": {
      "status": "UP",
      "responseTime": 3
    },
    "webhookDelivery": {
      "status": "active"
    }
  }
}
The endpoint always returns HTTP 200 for the Cloud Foundry liveness probe, so monitors must read the response body rather than the status code. When the database or Redis is unreachable, the affected check reports DOWN and the top-level status reports DEGRADED instead of UP. The webhookDelivery check is informational and never degrades the overall status: it reports active only when Redis is configured and reachable (the webhook outbox worker depends on Redis), and inactive otherwise.
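Because the HTTP status is always 200, a monitor has to inspect the body to detect a degraded service. The following is a minimal sketch in Python, assuming the payload shape shown in the example response; the function names `failing_checks` and `is_healthy` are illustrative and not part of the service.

```python
import json

# Checks the guide describes as informational; they never affect overall status.
INFORMATIONAL_CHECKS = {"webhookDelivery"}


def failing_checks(body: str) -> list[str]:
    """Return the names of health checks that report DOWN.

    Assumes the JSON shape from the example response in this guide.
    """
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(
        name
        for name, check in checks.items()
        if name not in INFORMATIONAL_CHECKS and check.get("status") == "DOWN"
    )


def is_healthy(body: str) -> bool:
    """True only when the top-level status is UP (DEGRADED counts as unhealthy)."""
    return json.loads(body).get("status") == "UP"
```

An alerting rule could, for example, fire when `is_healthy` returns false and include the output of `failing_checks` in the notification.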
Service Level Objectives
The Digital Retail Engine service is operated with the following targets:
| Metric | Target | Description |
|---|---|---|
| Availability | 99.5% | Monthly uptime target |
| POS API Latency (P95) | < 500ms | 95th percentile response time for promotion evaluation |
| POS API Latency (P99) | < 2s | 99th percentile response time for promotion evaluation |
| Error Rate | < 1% | HTTP 5xx error rate |
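To make these targets concrete, the sketch below converts the availability target into a monthly downtime budget and computes an error rate to compare against the 1% threshold. It is plain arithmetic under stated assumptions: a 30-day month by default, and hypothetical helper names that are not part of the service.

```python
def downtime_budget_minutes(availability_target: float, days_in_month: int = 30) -> float:
    """Minutes of downtime permitted per month by an availability target.

    availability_target is a fraction, e.g. 0.995 for 99.5%.
    """
    total_minutes = days_in_month * 24 * 60
    return (1.0 - availability_target) * total_minutes


def error_rate(server_errors: int, total_requests: int) -> float:
    """Fraction of requests that ended in an HTTP 5xx response."""
    if total_requests == 0:
        return 0.0  # no traffic means no observed errors
    return server_errors / total_requests
```

With a 99.5% target and a 30-day month, the budget works out to roughly 216 minutes (about 3.6 hours) of downtime.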
Available Metrics
Digital Retail Engine exposes Prometheus-compatible metrics at the /metrics endpoint. These metrics can be integrated into your existing monitoring infrastructure.
Application Metrics
| Metric | Description |
|---|---|
| http_requests_total | Total HTTP requests processed (by method, route, status code) |
| http_request_duration_seconds | Request latency distribution |
| http_requests_in_flight | Currently active requests |
Business Metrics
| Metric | Description |
|---|---|
| evaluations_total | Total promotion evaluations (by tenant, status) |
| evaluate_duration_seconds | Duration of promotion evaluation (by tenant) |
| promotions_applied_total | Number of promotions applied to baskets |
| cache_hit_total | Cache hits for active promotions and evaluation results |
| cache_miss_total | Cache misses for active promotions and evaluation results |
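The two cache counters can be combined into a hit ratio. The sketch below assumes both metrics are cumulative counters, as the `_total` suffix suggests; the helper name `cache_hit_ratio` is hypothetical.

```python
from typing import Optional


def cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None  # ratio is undefined without any cache lookups
    return hits / total
```

Passing the difference between two scrapes of `cache_hit_total` and `cache_miss_total`, rather than the raw counter values, gives the ratio for that interval instead of the ratio since process start.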
Integration Metrics (Public API + DRFOUT)
These metrics provide visibility into bulk imports, webhook delivery, and DRFOUT ingestion.
| Metric | Description |
|---|---|
| dre_imports_total | Total imports processed (by tenant, type, status) |
| dre_imports_duration_seconds | Import job duration in seconds (by tenant, type) |
| dre_webhook_deliveries_total | Webhook deliveries (by tenant, status) |
| dre_webhook_dead_letters_total | Webhook dead-letter transitions (by tenant) |
| dre_drfout_skipped_total | DRFOUT items skipped (by tenant, reason) |
Integrating with Your Monitoring Stack
The /metrics endpoint returns data in Prometheus text exposition format and can be scraped by any Prometheus-compatible monitoring tool (e.g., Grafana, Datadog, SAP Cloud ALM).
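For ad hoc inspection without a full monitoring stack, the text exposition format is simple enough to read with a few lines of Python. The sketch below sums sample values per metric name and ignores labels. It is a simplified reader, not a complete parser: it assumes one sample per line and does not handle every escaping rule of the format.

```python
from collections import defaultdict


def sum_samples(exposition_text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring labels."""
    totals: dict[str, float] = defaultdict(float)
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled sample: name{labels} value [timestamp]
            head, _, tail = line.rpartition("}")
            name = head.split("{", 1)[0]
        else:
            # Unlabelled sample: name value [timestamp]
            name, _, tail = line.partition(" ")
        fields = tail.split()
        if not fields:
            continue  # malformed line; ignore rather than fail
        totals[name] += float(fields[0])
    return dict(totals)
```

For anything beyond a quick check, a maintained parser such as the one in the `prometheus_client` Python package is a safer choice than hand-rolled parsing.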
SAP Cloud ALM
For enterprise-grade monitoring with SAP Cloud ALM:
- Enable Real User Monitoring and Health Monitoring capabilities in your SAP Cloud ALM tenant.
- Configure the data collector to scrape the /metrics endpoint.
- Set up Situation Automation rules for alerting on error rate and latency thresholds.
Troubleshooting
Service Appears Slow
- Check the /health endpoint to verify database and cache connectivity.
- Review the evaluate_duration_seconds metric; high values may indicate complex promotion configurations or a large number of active promotions.
- Check the cache hit rate: a low hit rate increases database load. The cache is invalidated when promotions are modified, so frequent updates may reduce cache effectiveness.
- Contact DRE support if latency persists beyond the published SLOs.
Service Unavailable
- Verify the /health endpoint returns a response.
- Check the SAP BTP cockpit for any platform-level incidents.
- Contact DRE support with the health check response and timestamp for investigation.
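Capturing the health check response together with a timestamp can be scripted. The sketch below uses only the Python standard library. The `base_url` argument stands for your own service URL, and the report layout (`capturedAt`, `health`) is an illustrative assumption, not a format required by DRE support.

```python
import json
import urllib.request
from datetime import datetime, timezone


def fetch_health(base_url: str, timeout: float = 5.0) -> str:
    """Fetch the raw /health response body from the given service URL."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def support_report(body: str, captured_at: datetime) -> str:
    """Bundle the health response with a UTC capture timestamp as JSON."""
    try:
        health = json.loads(body)
    except json.JSONDecodeError:
        health = {"raw": body}  # keep unparseable output verbatim
    report = {
        "capturedAt": captured_at.astimezone(timezone.utc).isoformat(),
        "health": health,
    }
    return json.dumps(report, indent=2)
```

A typical call would be `support_report(fetch_health(base_url), datetime.now(timezone.utc))`, with the resulting text attached to the support request.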