Curve Operations Guide¶

This document describes operational procedures for monitoring, troubleshooting, and recovery in the Curve event publishing system.

Table of Contents¶

DLQ Monitoring
Metrics Interpretation
Outbox Replay API
Troubleshooting Matrix
Recovery Procedures
Alert Configuration
Runbook Checklist

DLQ Monitoring¶

Understanding the 3-Tier Failure Recovery¶

Curve implements a 3-tier failure recovery system to prevent event loss:

Event Send Attempt
        │
        ▼
┌─────────────────┐
│  Tier 1: Main   │──── Success ───▶ Event Published
│     Topic       │
└────────┬────────┘
         │ Failure
         ▼
┌─────────────────┐
│  Tier 2: DLQ    │──── Success ───▶ Event in DLQ Topic
│     Topic       │
└────────┬────────┘
         │ Failure
         ▼
┌─────────────────┐
│ Tier 3: Local   │──── Success ───▶ JSON File Backup
│     Backup      │
└────────┬────────┘
         │ Failure
         ▼
    Event Lost + Alert

Tier	Component	Trigger	Description
1	Main Topic	Normal operation	Events published to configured Kafka topic
2	DLQ Topic	Main topic failure	Failed events sent to Dead Letter Queue
3	Local File	DLQ failure	Events backed up to `./dlq-backup/` directory

Monitoring DLQ Events¶

Via Kafka UI¶

Navigate to Kafka UI (default: http://localhost:8080)
Select Topics from the menu
Find event.audit.dlq.v1 (or your configured DLQ topic)
View Messages tab for failed events

Via Actuator Endpoint¶

# Get DLQ metrics
curl http://localhost:8081/actuator/curve-metrics | jq '.summary'

Response:

{
  "totalEventsPublished": 1523,
  "successfulEvents": 1520,
  "failedEvents": 3,
  "successRate": "99.80%",
  "totalDlqEvents": 3,
  "totalKafkaErrors": 0
}

Via Kafka CLI¶

# Count messages in DLQ topic
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic event.audit.dlq.v1

# Consume DLQ messages
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic event.audit.dlq.v1 \
  --from-beginning

DLQ Message Structure¶

{
  "eventId": "123456789012345678",
  "originalTopic": "event.audit.v1",
  "originalPayload": "{\"eventType\":\"ORDER_CREATED\",...}",
  "exceptionType": "org.apache.kafka.common.errors.TimeoutException",
  "exceptionMessage": "Failed to send message after 3 retries",
  "failedAt": 1704067200000
}

Field	Description
`eventId`	Unique event identifier (Snowflake ID)
`originalTopic`	Topic where the event was supposed to be sent
`originalPayload`	Complete event payload as JSON string
`exceptionType`	Java exception class that caused the failure
`exceptionMessage`	Human-readable error message
`failedAt`	Timestamp (epoch milliseconds) when failure occurred

Local Backup Files¶

Location: ./dlq-backup/ (configurable via curve.kafka.dlq-backup-path)

# List backup files
ls -la ./dlq-backup/

# Example output:
# -rw------- 1 user user 2048 Jan 20 10:30 123456789012345678.json
# -rw------- 1 user user 1856 Jan 20 10:31 123456789012345679.json

File naming: {eventId}.json

File permissions: - POSIX systems: 600 (rw-------) - Windows: ACL restricted to owner only

Metrics Interpretation¶

Accessing Metrics¶

# Full metrics report
curl http://localhost:8081/actuator/curve-metrics

# Summary only
curl http://localhost:8081/actuator/curve-metrics | jq '.summary'

# Specific metric
curl http://localhost:8081/actuator/curve-metrics | jq '.events.published'

Key Metrics Reference¶

Metric	Description	Warning Threshold	Critical Threshold
`successRate`	Event publishing success percentage	< 99%	< 95%
`totalDlqEvents`	Events sent to DLQ	> 0	> 10 (increasing)
`totalKafkaErrors`	Kafka producer errors	> 0	> 5
`curve.events.retry.count`	Retry attempts	Increasing	Rapidly increasing
`curve.events.publish.duration`	Publishing latency	> 100ms avg	> 500ms avg

Health Status Interpretation¶

Status	Indicators	Meaning	Action
Healthy	successRate >= 99.5%, totalDlqEvents = 0	Normal operation	Monitor
Warning	successRate 95-99.5%, totalDlqEvents > 0 stable	Intermittent issues	Investigate
Critical	successRate < 95%, totalDlqEvents increasing	System failure	Immediate action

Outbox Publisher Metrics¶

For Transactional Outbox Pattern users:

Metric	Description	Action if Abnormal
`circuitBreakerState`	CLOSED/OPEN/HALF-OPEN	OPEN = Kafka connectivity issue
`consecutiveFailures`	Consecutive failure count	> 3 = circuit breaker may open
`timeSinceLastSuccessMs`	Time since last success	> 60000 = check Kafka
`totalPending`	Pending outbox events	Should trend toward 0
`totalFailed`	Permanently failed events	Requires manual intervention

Circuit Breaker States¶

State	Behavior	Duration	Transition
CLOSED	Normal operation	-	Opens after 5 consecutive failures
OPEN	All requests blocked	60 seconds	Transitions to HALF-OPEN
HALF-OPEN	Allows test requests	Until success/failure	Success→CLOSED, Failure→OPEN

Outbox Replay API¶

The /actuator/curve-outbox endpoint allows you to replay previously published outbox events for recovery and testing purposes.

Setup¶

Enable the endpoint in your configuration:

application.yml

management:
  endpoints:
    web:
      exposure:
        include: curve-outbox  # Add to existing list

GET /actuator/curve-outbox¶

Retrieve outbox statistics:

curl http://localhost:8081/actuator/curve-outbox

Response:

{
  "total": 1523,           // Total events in outbox
  "pending": 5,            // Events waiting to be published
  "published": 1516,       // Events successfully published
  "failed": 2,             // Events failed after all retries
  "avgProcessingTimeMs": 45
}

POST /actuator/curve-outbox¶

Replay events from a specific timestamp:

curl -X POST http://localhost:8081/actuator/curve-outbox \
  -H "Content-Type: application/vnd.spring-boot.actuator.v3+json" \
  -d '{
    "since": "2026-03-01T00:00:00Z",
    "limit": 100
  }'

Request Parameters:

Parameter	Type	Required	Description
`since`	String (ISO-8601)	Yes	Start timestamp for replay
`limit`	Integer	No	Max events to replay (default: 1000)

Response:

{
  "since": "2026-03-01T00:00:00Z",
  "limit": 100,
  "total": 42,              // Events found since timestamp
  "success": 40,            // Successfully replayed
  "failed": 2,              // Failed during replay
  "failedEventIds": [       // Event IDs that failed
    "evt-001",
    "evt-002"
  ]
}

Common Use Cases¶

Recovery from consumer downtime:

# Consumer was down from 10:00 to 10:30
curl -X POST http://localhost:8081/actuator/curve-outbox \
  -H "Content-Type: application/vnd.spring-boot.actuator.v3+json" \
  -d '{
    "since": "2026-03-04T10:00:00Z"
  }'

Replay for consumer bug fix:

# Reprocess all events from past 1 hour
curl -X POST http://localhost:8081/actuator/curve-outbox \
  -H "Content-Type: application/vnd.spring-boot.actuator.v3+json" \
  -d '{
    "since": "2026-03-04T08:00:00Z",
    "limit": 5000
  }'

Important Notes¶

Idempotency required: Consumers must handle duplicate events using event IDs as unique keys
Already-published events: Replay API will re-publish events regardless of their current status
Default topic used: Replayed events are sent to their original topics
No timestamp validation: Ensure consumer side can process older events appropriately

Troubleshooting Matrix¶

Symptoms and Solutions¶

Symptom	Possible Cause	Verification	Solution
Events not published	AOP disabled	Check `curve.aop.enabled` in config	Set to `true`
Events not published	Method not public	Review method signature	Make method `public`
`TimeoutException`	Kafka unresponsive	`docker-compose ps kafka`	Restart Kafka
`TimeoutException`	Network latency	Ping broker	Increase `request-timeout-ms`
High DLQ count	Kafka broker down	Check broker logs	Restore Kafka, recover DLQ
Circuit breaker OPEN	5+ consecutive failures	Check Kafka health	Wait 60s or fix Kafka
Local backup files exist	Both main and DLQ failed	Check all Kafka connectivity	Manual recovery required
PII encryption error	Missing encryption key	Check `PII_ENCRYPTION_KEY` env	Set environment variable
Worker ID conflict	Duplicate worker IDs	Check instance configurations	Assign unique IDs
Outbox events stuck PENDING	Kafka unreachable	Check circuit breaker state	Fix Kafka connectivity
Slow event publishing	Sync mode under high load	Check `async-mode`	Enable async mode
`ClockMovedBackwardsException`	System time changed	Check NTP sync	Restart application

Common Error Messages¶

Error Message	Cause	Solution
`Kafka topic is required`	Missing topic configuration	Set `curve.kafka.topic`
`workerId must be between 0 and 1023`	Invalid worker ID	Use valid range
`PII encryption key is not configured`	Missing encryption key	Set `PII_ENCRYPTION_KEY` env var
`Failed to send message after N retries`	Kafka connectivity issue	Check broker status
`Circuit breaker is OPEN`	Too many consecutive failures	Wait for half-open or fix Kafka

Health Check Responses¶

curl http://localhost:8081/actuator/health/curve

Status	Details	Meaning	Action
UP	`clusterId`, `nodeCount` present	Healthy, broker connected	None
DOWN	error message	Broker unreachable or connectivity issue	Check Kafka configuration and network

Recovery Procedures¶

Procedure 1: DLQ Event Recovery¶

When to use: Events accumulated in DLQ topic after temporary Kafka issues have been resolved.

Prerequisites: - Kafka is now healthy - kafka-console-producer.sh available in PATH - Access to DLQ topic

Steps:

Verify Kafka is healthy:

# Check Kafka container
docker-compose ps kafka

# Check Curve health
curl http://localhost:8081/actuator/health/curve

List DLQ events to recover:
```
./scripts/dlq-recovery.sh --list
```

Execute recovery:

./scripts/dlq-recovery.sh \
  --topic event.audit.v1 \
  --broker localhost:9092 \
  --dir ./dlq-backup

Recover specific file:

./scripts/dlq-recovery.sh \
  --file 123456789012345678.json \
  --topic event.audit.v1 \
  --broker localhost:9092

Verify recovery:
Check Kafka UI for recovered events
Verify backup files are processed (moved to recovered/ subdirectory)

Procedure 2: Local Backup File Recovery¶

When to use: Both main topic and DLQ failed, events backed up to local files.

Steps:

List backup files:
```
ls -la ./dlq-backup/*.json
```

Validate JSON format:

# Check all files
for f in ./dlq-backup/*.json; do
  jq empty "$f" 2>/dev/null || echo "Invalid: $f"
done

Use recovery script:

./scripts/dlq-recovery.sh \
  --dir ./dlq-backup \
  --topic event.audit.v1 \
  --broker localhost:9092

Manual recovery (if script fails):

# For each backup file
EVENT_ID="123456789012345678"

cat ./dlq-backup/${EVENT_ID}.json | \
  kafka-console-producer.sh \
  --broker-list localhost:9092 \
  --topic event.audit.v1

Archive recovered files:

mkdir -p ./dlq-backup/recovered
mv ./dlq-backup/*.json ./dlq-backup/recovered/

Procedure 3: Outbox Event Recovery¶

When to use: Outbox events stuck in FAILED status after circuit breaker issues.

Steps:

Check outbox statistics:

curl http://localhost:8081/actuator/curve-metrics | jq '.summary'

Query failed events (requires database access):

-- List failed events
SELECT id, event_id, aggregate_type, aggregate_id, status, retry_count, last_error
FROM curve_outbox_event
WHERE status = 'FAILED'
ORDER BY occurred_at DESC
LIMIT 100;

-- Count by status
SELECT status, COUNT(*) as count
FROM curve_outbox_event
GROUP BY status;

Reset failed events for retry:

-- Reset specific event
UPDATE curve_outbox_event
SET status = 'PENDING', retry_count = 0, last_error = NULL, next_retry_at = NOW()
WHERE id = 'specific-event-id';

-- Reset all failed events (use with caution)
UPDATE curve_outbox_event
SET status = 'PENDING', retry_count = 0, last_error = NULL, next_retry_at = NOW()
WHERE status = 'FAILED';

Monitor recovery:

watch -n 5 'curl -s http://localhost:8081/actuator/curve-metrics | jq ".summary"'

Procedure 4: Circuit Breaker Reset¶

When to use: Circuit breaker stuck in OPEN state after Kafka recovery.

Steps:

Verify Kafka is healthy:

curl http://localhost:8081/actuator/health/curve

Check circuit breaker state:

curl http://localhost:8081/actuator/curve-metrics | jq '.summary.circuitBreakerState'

Wait for automatic half-open (60 seconds)

The circuit breaker will automatically transition to HALF-OPEN state after 60 seconds, allowing test requests.

Alternative: Restart application:

# Graceful shutdown
kill -TERM $(pgrep -f 'your-application')

# Or via actuator (if enabled)
curl -X POST http://localhost:8081/actuator/shutdown

Monitor state transition:

watch -n 10 'curl -s http://localhost:8081/actuator/curve-metrics | jq ".summary.circuitBreakerState"'

Alert Configuration¶

Prometheus Alert Rules¶

groups:
  - name: curve-alerts
    rules:
      # DLQ Events Alert
      - alert: CurveDlqEventsHigh
        expr: curve_events_dlq_count_total > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High DLQ event count"
          description: "{{ $value }} events accumulated in DLQ"

      # Success Rate Alert
      - alert: CurveSuccessRateLow
        expr: (curve_events_published_success_total / curve_events_published_total) < 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low event publishing success rate"
          description: "Success rate is {{ $value | humanizePercentage }}"

      # Circuit Breaker Alert
      - alert: CurveCircuitBreakerOpen
        expr: curve_circuit_breaker_state == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is OPEN"
          description: "Outbox publisher circuit breaker is open, events are not being published"

      # Kafka Producer Down
      - alert: CurveKafkaProducerDown
        expr: curve_health_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Curve Kafka producer is down"
          description: "Kafka producer failed to initialize or is unhealthy"

      # High Latency Alert
      - alert: CurvePublishLatencyHigh
        expr: histogram_quantile(0.95, curve_events_publish_duration_seconds_bucket) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High event publishing latency"
          description: "95th percentile latency is {{ $value }}s"

      # Outbox Backlog Alert
      - alert: CurveOutboxBacklogHigh
        expr: curve_outbox_pending_total > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High outbox backlog"
          description: "{{ $value }} events pending in outbox"

Grafana Dashboard Panels¶

Recommended panels for Curve monitoring dashboard:

Event Publishing Rate - rate(curve_events_published_total[5m])
Success Rate Gauge - Current success percentage
DLQ Event Count - curve_events_dlq_count_total over time
Publishing Latency - histogram_quantile(0.95, curve_events_publish_duration_seconds_bucket)
Circuit Breaker State - Current state indicator (CLOSED/OPEN/HALF-OPEN)
Outbox Queue Depth - curve_outbox_pending_total over time
Retry Count - rate(curve_events_retry_count_total[5m])
Kafka Errors - curve_kafka_producer_errors_total over time

Runbook Checklist¶

Daily Operations¶

[ ] Check /actuator/health/curve status
[ ] Review /actuator/curve-metrics summary
[ ] Verify DLQ topic is empty or stable
[ ] Check for local backup files in ./dlq-backup/
[ ] Review application logs for WARN/ERROR entries

Weekly Operations¶

[ ] Review DLQ event patterns and root causes
[ ] Analyze publishing latency trends
[ ] Verify outbox cleanup job ran successfully
[ ] Archive old backup files (if any)
[ ] Review and rotate logs

Incident Response¶

[ ] Identify affected time range
[ ] Check circuit breaker state history
[ ] Count events in DLQ and local backup
[ ] Determine root cause (Kafka, network, configuration)
[ ] Execute appropriate recovery procedure
[ ] Verify event delivery to consumers
[ ] Document incident in post-mortem

Monthly Operations¶

[ ] Review alert thresholds and adjust if needed
[ ] Analyze success rate trends
[ ] Capacity planning based on event volume
[ ] Review and update this runbook if necessary

Additional Resources¶

Configuration Guide - Detailed configuration options
DLQ Recovery Script - Automated recovery tool
Sample Application - Working examples
README - Project overview and quick start