E2E Horizontal Scaling + Raft Coordinator HA Test Results
Overview
This test suite validates Varpulis CEP's horizontal scaling and high-availability capabilities using real Docker containers, MQTT traffic, and process kills. It tests:
- Raft consensus for coordinator state replication across 3 nodes
- WebSocket-based failure detection (~0s vs ~20s with REST heartbeats)
- Worker failover with automatic pipeline migration
- Coordinator failover with Raft leader re-election
- Full cluster recovery after cascading failures
- Traffic resilience during coordinator and worker kills
Architecture
Raft Configuration
| Parameter | Value |
|---|---|
| Nodes | 3 coordinators |
| Heartbeat interval | 500ms |
| Election timeout | 1500-3000ms |
| Quorum | 2 of 3 (tolerates 1 failure) |
| Log storage | In-memory |
| Transport | HTTP (reqwest) |
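The quorum row above follows from Raft's majority rule: a cluster of n voters needs floor(n/2) + 1 votes to elect a leader or commit an entry, so 3 nodes tolerate exactly 1 failure. A quick illustration (plain Python, not Varpulis code):

```python
# Raft majority quorum: a cluster of n voters needs floor(n/2) + 1 votes
# to elect a leader or commit a log entry, so it tolerates n - quorum failures.
def raft_quorum(n: int) -> int:
    return n // 2 + 1

for n in (3, 5):
    q = raft_quorum(n)
    print(f"{n} nodes: quorum={q}, tolerates {n - q} failure(s)")
# 3 nodes: quorum=2, tolerates 1 failure(s)
# 5 nodes: quorum=3, tolerates 2 failure(s)
```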
Prometheus Metrics Exposed
The coordinator now exposes cluster metrics via /metrics (Prometheus text format):
| Metric | Type | Labels | Description |
|---|---|---|---|
| varpulis_cluster_raft_role | Gauge | — | 0=follower, 1=candidate, 2=leader |
| varpulis_cluster_raft_term | Gauge | — | Current Raft term |
| varpulis_cluster_raft_commit_index | Gauge | — | Current Raft commit index |
| varpulis_cluster_workers_total | Gauge | status | Workers by status (ready, unhealthy, draining) |
| varpulis_cluster_pipeline_groups_total | Gauge | — | Number of deployed pipeline groups |
| varpulis_cluster_deployments_total | Gauge | — | Number of pipeline deployments |
| varpulis_cluster_migrations_total | Counter | result | Migrations by result (success, failure) |
| varpulis_cluster_migration_duration_seconds | Histogram | result | Migration duration |
| varpulis_cluster_deploy_duration_seconds | Histogram | result | Deploy duration |
| varpulis_cluster_health_sweep_duration_seconds | Histogram | workers_checked | Health sweep duration |
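Outside of Grafana, the current leader can be spotted by scraping each coordinator's /metrics endpoint and reading varpulis_cluster_raft_role (2 = leader). A minimal sketch, assuming the coordinators are reachable as coordinator-1..3 on port 9100 (the port used in the scrape config below); adjust the hostnames to your compose setup:

```python
# Find the current Raft leader by reading varpulis_cluster_raft_role from
# each coordinator's /metrics endpoint (0=follower, 1=candidate, 2=leader).
import requests

COORDINATORS = ["coordinator-1", "coordinator-2", "coordinator-3"]  # assumed hostnames

def raft_role(host: str, port: int = 9100) -> int | None:
    text = requests.get(f"http://{host}:{port}/metrics", timeout=2).text
    for line in text.splitlines():
        if line.startswith("varpulis_cluster_raft_role"):
            return int(float(line.split()[-1]))
    return None

for host in COORDINATORS:
    role = raft_role(host)
    name = {0: "follower", 1: "candidate", 2: "leader"}.get(role, "unknown")
    print(f"{host}: raft_role={role} ({name})")
```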
Latest Test Results
Date: 2026-02-12
Configuration: 3 Raft coordinators, 4 workers, MQTT broker
Result: 8/8 phases passed
============================================================
=== E2E Scaling + Raft Coordinator HA Test Results ===
Phase 0: Setup .............................. PASS (12.1s)
leader=node 1, warm-up 50/50
Phase 1: Baseline ........................... PASS (0.5s)
250/250 events
Phase 2: WebSocket Check .................... PASS (0.0s)
coord-1: 2 WS, coord-2: 1 WS, coord-3: 1 WS
Phase 3: Worker Failover (WS) ............... PASS (8.2s)
250/250 events, failover in 7.1s, WS: 2->1
Phase 4: Cascading Failure .................. PASS (8.9s)
250/250 events via coordinator-2, coordinator-2 healthy: True, coordinator-3 healthy: True
Phase 5: Coordinator Failover (Raft) ........ PASS (23.0s)
old leader: node 1, new leader: node 2, election: 10.0s, events: 100/100
Phase 6: Recovery ........................... PASS (18.8s)
417/250 events, all 3 coordinators up: True, Raft leader: node 2
Phase 7: Traffic During Coordinator Kill .... PASS (35.4s)
2028/1217 events (166.6%), new leader: node 3, 2 surviving coordinators OK
============================================================
RESULT: 8/8 phases passed
============================================================
Results Table
| Phase | Test | Result | Duration | Details |
|---|---|---|---|---|
| 0 | Setup | PASS | 12.1s | Leader=node 1, warm-up 50/50 |
| 1 | Baseline | PASS | 0.5s | 250/250 events |
| 2 | WebSocket Check | PASS | 0.0s | coord-1: 2 WS, coord-2: 1 WS, coord-3: 1 WS |
| 3 | Worker Failover (WS) | PASS | 8.2s | 250/250 events, failover in 7.1s, WS: 2->1 |
| 4 | Cascading Failure | PASS | 8.9s | 250/250 events via coordinator-2, both surviving coordinators healthy |
| 5 | Coordinator Failover (Raft) | PASS | 23.0s | Old leader: node 1, new leader: node 2, election: 10.0s, events: 100/100 |
| 6 | Recovery | PASS | 18.8s | 417/250 events, all 3 coordinators up, Raft leader: node 2 |
| 7 | Traffic During Coordinator Kill | PASS | 35.4s | 2028/1217 events (166.6%), new leader: node 3, 2 surviving coordinators OK |
Phase Details
Phase 0: Setup
- All 3 coordinators healthy with 4 workers each (Raft-replicated)
- Raft leader elected on node 1
- Coordinator-1: Leader, Coordinator-2: Follower, Coordinator-3: Follower
- MQTT connectors registered, 3 pipelines deployed on workers
- Warm-up: 50/50 events received successfully
Phase 1: Baseline
- Published 500 events (250 matching filter criteria)
- All 250 expected events received correctly
Phase 2: WebSocket Check
- WebSocket connections distributed across coordinators:
- Coordinator-1: 2 WS connections (workers: 4)
- Coordinator-2: 1 WS connection (workers: 4)
- Coordinator-3: 1 WS connection (workers: 4)
- All coordinators reporting correct Raft roles
Phase 3: Worker Failover (WS)
- Killed worker-1 (container stopped)
- Worker-1 marked unhealthy after 6.0s (WS disconnect detection + grace period)
- WS connections dropped from 2 to 1
- Pipeline migration completed in 7.1s
- All 250/250 events received after failover
Phase 4: Cascading Failure
- Killed worker-2 (second worker failure)
- Worker-2 marked unhealthy after 8.0s
- Coordinator-2 and Coordinator-3 remained operational (status: degraded)
- Events still processed correctly: 250/250 via coordinator-2
Phase 5: Coordinator Failover (Raft)
- Killed coordinator-1 (Raft leader)
- New leader (node 2) elected in 10.0s
- API available on both surviving coordinators
- 100/100 events received through surviving pipeline
Phase 6: Recovery
- Restarted coordinator-1 (rejoined as Follower)
- Restarted worker-1 and worker-2
- All 4 workers re-registered (Raft-replicated)
- Pipelines re-deployed successfully
- Raft roles after recovery: coord-1=Follower, coord-2=Leader, coord-3=Follower
- Events processed correctly after full recovery
Phase 7: Traffic During Coordinator Kill
- Background MQTT traffic kept running while coordinator-2 (the Raft leader at that point) was killed
- New leader elected: node 3
- No event loss during leader transition
- Both surviving coordinators remained operational
- Received 2028/1217 events (166.6% — includes duplicates from overlapping pipeline coverage)
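Counts above 100% are expected in this phase because more than one surviving pipeline can match the same input event. If exact counting is needed, the harness could deduplicate on a per-event identifier before comparing against the published total; a sketch, assuming each event payload carries a unique event_id field (the actual field name in the test payloads may differ):

```python
# Deduplicate received events on a unique identifier before counting, so
# overlapping pipeline coverage does not inflate the result.
import json

def count_unique(received_payloads: list[str], id_field: str = "event_id") -> int:
    seen = set()
    for payload in received_payloads:
        seen.add(json.loads(payload)[id_field])
    return len(seen)

# Example: 2028 raw deliveries could collapse back to the 1217 distinct events published.
```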
Failure Detection: Three Layers
Varpulis implements a layered failure detection model:
| Layer | Detection Time | Mechanism | Catches |
|---|---|---|---|
| WebSocket | ~0s (instant) | TCP connection drop on persistent WS | Process crash, OOM kill, network partition |
| K8s Pod Watcher | ~1-2s | K8s Watch API on pod status | OOMKill, eviction, node failure (K8s only) |
| REST Heartbeat | ~10-20s (fallback) | Heartbeat timeout + health sweep | Hung process, degraded performance, WS unavailable |
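These layers compose so that the fastest available signal wins: a WS disconnect starts a grace-period timer immediately, a pod-deletion event short-circuits it, and the REST heartbeat timeout remains the backstop. The sketch below is a conceptual model of that composition (illustrative only, not the Varpulis implementation), using the 5s grace period and ~20s heartbeat fallback from the tables in this section:

```python
# Conceptual model of layered failure detection: a worker is unhealthy as soon
# as the earliest signal fires, i.e. pod deletion, a WS drop older than the
# grace period, or a missed REST heartbeat past the fallback timeout.
import time
from dataclasses import dataclass, field

WS_GRACE_S = 5.0            # reconnect window after a WS drop
HEARTBEAT_TIMEOUT_S = 20.0  # REST fallback

@dataclass
class WorkerHealth:
    last_heartbeat: float = field(default_factory=time.monotonic)
    ws_dropped_at: float | None = None  # set on TCP FIN/RST, cleared on reconnect
    pod_deleted: bool = False           # set by the K8s pod watch stream

    def is_unhealthy(self, now: float) -> bool:
        if self.pod_deleted:
            return True
        if self.ws_dropped_at is not None and now - self.ws_dropped_at > WS_GRACE_S:
            return True
        return now - self.last_heartbeat > HEARTBEAT_TIMEOUT_S
```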
Failover Timing
| Parameter | Value |
|---|---|
| WS disconnect detection | ~0s (TCP FIN/RST) |
| Grace period (WS reconnect window) | 5s |
| Migration time after detection | ~1-2s |
| Total failover | ~7s (with grace) |
| Raft leader election | ~10s |
| REST fallback (if WS unavailable) | ~20s |
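These timings can be reproduced from outside the cluster by killing a worker container and polling the varpulis_cluster_workers_total{status="unhealthy"} gauge until it moves. A rough sketch, assuming a coordinator /metrics endpoint at coordinator-1:9100 and a container named varpulis-worker-1 (both names depend on the compose project):

```python
# Measure failure detection from the outside: kill a worker container, then
# poll the coordinator's /metrics until the unhealthy-worker gauge increases.
import re
import subprocess
import time

import requests

METRICS_URL = "http://coordinator-1:9100/metrics"  # assumed coordinator address
CONTAINER = "varpulis-worker-1"                    # assumed container name

def unhealthy_workers() -> float:
    text = requests.get(METRICS_URL, timeout=2).text
    m = re.search(r'varpulis_cluster_workers_total\{status="unhealthy"\}\s+([\d.]+)', text)
    return float(m.group(1)) if m else 0.0

baseline = unhealthy_workers()
subprocess.run(["docker", "kill", CONTAINER], check=True)
start = time.monotonic()
while unhealthy_workers() <= baseline:
    if time.monotonic() - start > 60:
        raise TimeoutError("worker was not marked unhealthy within 60s")
    time.sleep(0.2)
print(f"worker marked unhealthy after {time.monotonic() - start:.1f}s")
```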
Running the Test
Prerequisites
- Docker and Docker Compose v2+
- Docker socket accessible at /var/run/docker.sock
- ~5 GB RAM (3 coordinators + 4 workers + mosquitto + test driver)
- ~12 minutes for the complete test (includes release build with --features raft)
Command
```bash
cd tests/e2e-scaling
bash run.sh
```

Logs are captured to tests/e2e-scaling/results/docker-logs.txt for debugging.
Observability
Web UI
The Cluster Management view (/cluster) now includes a Health tab showing:
- Raft consensus status (role, term, commit index)
- Worker distribution (ready, unhealthy, draining)
- Operations summary (deploys, migrations)
Grafana Dashboard
A pre-provisioned Varpulis Cluster Operations dashboard (varpulis-cluster UID) displays:
- Raft role, term, and commit index over time
- Workers by status (timeseries + pie chart)
- Pipeline groups and deployment counts
- Deploy and migration duration histograms (p50, p99)
- Migration rate (success vs failure)
- Health sweep duration
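The same panels map onto plain PromQL; for example, the p99 migration duration is a histogram_quantile over the varpulis_cluster_migration_duration_seconds buckets. A sketch querying it through the standard Prometheus HTTP API, assuming Prometheus is reachable at prometheus:9090:

```python
# Query the Prometheus HTTP API for the p99 migration duration, mirroring the
# Grafana panel built on varpulis_cluster_migration_duration_seconds.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumed Prometheus address

QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(varpulis_cluster_migration_duration_seconds_bucket[5m])) by (le))"
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=5).json()
for result in resp.get("data", {}).get("result", []):
    print("p99 migration duration:", result["value"][1], "seconds")
```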
Prometheus Scrape Configuration
The coordinator's /metrics endpoint is scraped by Prometheus alongside worker metrics:
```yaml
- job_name: "varpulis-coordinator"
  static_configs:
    - targets: ["coordinator:9100"]
      labels:
        role: "coordinator"
  metrics_path: /metrics
  scrape_interval: 5s
```