E2E Horizontal Scaling + Raft Coordinator HA Test Results
Overview
This test suite validates Varpulis CEP's horizontal scaling and high-availability capabilities using real Docker containers, MQTT traffic, and process kills. It tests:
- Raft consensus for coordinator state replication across 3 nodes
- WebSocket-based failure detection (~0s vs ~20s with REST heartbeats)
- Worker failover with automatic pipeline migration
- Coordinator failover with Raft leader re-election
- Full cluster recovery after cascading failures
- Traffic resilience during coordinator and worker kills
Architecture
Raft Configuration
| Parameter | Value |
|---|---|
| Nodes | 3 coordinators |
| Heartbeat interval | 500ms |
| Election timeout | 1500-3000ms |
| Quorum | 2 of 3 (tolerates 1 failure) |
| Log storage | In-memory |
| Transport | HTTP (reqwest) |
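The quorum row above follows from Raft's majority rule: a cluster of n voters needs floor(n/2) + 1 votes to elect a leader or commit an entry, so 3 nodes tolerate exactly 1 failure. A quick illustration (plain Python, not Varpulis code):

```python
# Raft majority quorum: a cluster of n voters needs floor(n/2) + 1 votes
# to elect a leader or commit a log entry, so it tolerates n - quorum failures.
def raft_quorum(n: int) -> int:
    return n // 2 + 1

for n in (3, 5):
    q = raft_quorum(n)
    print(f"{n} nodes: quorum={q}, tolerates {n - q} failure(s)")
# 3 nodes: quorum=2, tolerates 1 failure(s)
# 5 nodes: quorum=3, tolerates 2 failure(s)
```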
Prometheus Metrics Exposed
The coordinator now exposes cluster metrics via /metrics (Prometheus text format):
| Metric | Type | Labels | Description |
|---|---|---|---|
| varpulis_cluster_raft_role | Gauge | — | 0=follower, 1=candidate, 2=leader |
| varpulis_cluster_raft_term | Gauge | — | Current Raft term |
| varpulis_cluster_raft_commit_index | Gauge | — | Current Raft commit index |
| varpulis_cluster_workers_total | Gauge | status | Workers by status (ready, unhealthy, draining) |
| varpulis_cluster_pipeline_groups_total | Gauge | — | Number of deployed pipeline groups |
| varpulis_cluster_deployments_total | Gauge | — | Number of pipeline deployments |
| varpulis_cluster_migrations_total | Counter | result | Migrations by result (success, failure) |
| varpulis_cluster_migration_duration_seconds | Histogram | result | Migration duration |
| varpulis_cluster_deploy_duration_seconds | Histogram | result | Deploy duration |
| varpulis_cluster_health_sweep_duration_seconds | Histogram | workers_checked | Health sweep duration |
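Outside of Grafana, the current leader can be spotted by scraping each coordinator's /metrics endpoint and reading varpulis_cluster_raft_role (2 = leader). A minimal sketch, assuming the coordinators are reachable as coordinator-1..3 on port 9100 (the port used in the scrape config below); adjust the hostnames to your compose setup:

```python
# Find the current Raft leader by reading varpulis_cluster_raft_role from
# each coordinator's /metrics endpoint (0=follower, 1=candidate, 2=leader).
import requests

COORDINATORS = ["coordinator-1", "coordinator-2", "coordinator-3"]  # assumed hostnames

def raft_role(host: str, port: int = 9100) -> int | None:
    text = requests.get(f"http://{host}:{port}/metrics", timeout=2).text
    for line in text.splitlines():
        if line.startswith("varpulis_cluster_raft_role"):
            return int(float(line.split()[-1]))
    return None

for host in COORDINATORS:
    role = raft_role(host)
    name = {0: "follower", 1: "candidate", 2: "leader"}.get(role, "unknown")
    print(f"{host}: raft_role={role} ({name})")
```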
Latest Test Results
Date: 2026-02-12
Configuration: 3 Raft coordinators, 4 workers, MQTT broker
Result: 8/8 phases passed
============================================================
=== E2E Scaling + Raft Coordinator HA Test Results ===
Phase 0: Setup .............................. PASS (12.1s)
leader=node 1, warm-up 50/50
Phase 1: Baseline ........................... PASS (0.5s)
250/250 events
Phase 2: WebSocket Check .................... PASS (0.0s)
coord-1: 2 WS, coord-2: 1 WS, coord-3: 1 WS
Phase 3: Worker Failover (WS) ............... PASS (8.2s)
250/250 events, failover in 7.1s, WS: 2->1
Phase 4: Cascading Failure .................. PASS (8.9s)
250/250 events via coordinator-2, coordinator-2 healthy: True, coordinator-3 healthy: True
Phase 5: Coordinator Failover (Raft) ........ PASS (23.0s)
old leader: node 1, new leader: node 2, election: 10.0s, events: 100/100
Phase 6: Recovery ........................... PASS (18.8s)
417/250 events, all 3 coordinators up: True, Raft leader: node 2
Phase 7: Traffic During Coordinator Kill .... PASS (35.4s)
2028/1217 events (166.6%), new leader: node 3, 2 surviving coordinators OK
============================================================
RESULT: 8/8 phases passed
============================================================
Results Table
| Phase | Test | Result | Duration | Details |
|---|---|---|---|---|
| 0 | Setup | PASS | 12.1s | Leader=node 1, warm-up 50/50 |
| 1 | Baseline | PASS | 0.5s | 250/250 events |
| 2 | WebSocket Check | PASS | 0.0s | coord-1: 2 WS, coord-2: 1 WS, coord-3: 1 WS |
| 3 | Worker Failover (WS) | PASS | 8.2s | 250/250 events, failover in 7.1s, WS: 2->1 |
| 4 | Cascading Failure | PASS | 8.9s | 250/250 events via coordinator-2, both surviving coordinators healthy |
| 5 | Coordinator Failover (Raft) | PASS | 23.0s | Old leader: node 1, new leader: node 2, election: 10.0s, events: 100/100 |
| 6 | Recovery | PASS | 18.8s | 417/250 events, all 3 coordinators up, Raft leader: node 2 |
| 7 | Traffic During Coordinator Kill | PASS | 35.4s | 2028/1217 events (166.6%), new leader: node 3, 2 surviving coordinators OK |
Phase Details
Phase 0: Setup
- All 3 coordinators healthy with 4 workers each (Raft-replicated)
- Raft leader elected on node 1
- Coordinator-1: Leader, Coordinator-2: Follower, Coordinator-3: Follower
- MQTT connectors registered, 3 pipelines deployed on workers
- Warm-up: 50/50 events received successfully
Phase 1: Baseline
- Published 500 events (250 matching filter criteria)
- All 250 expected events received correctly
Phase 2: WebSocket Check
- WebSocket connections distributed across coordinators:
- Coordinator-1: 2 WS connections (workers: 4)
- Coordinator-2: 1 WS connection (workers: 4)
- Coordinator-3: 1 WS connection (workers: 4)
- All coordinators reporting correct Raft roles
Phase 3: Worker Failover (WS)
- Killed worker-1 (container stopped)
- Worker-1 marked unhealthy after 6.0s (WS disconnect detection + grace period)
- WS connections dropped from 2 to 1
- Pipeline migration completed in 7.1s
- All 250/250 events received after failover
Phase 4: Cascading Failure
- Killed worker-2 (second worker failure)
- Worker-2 marked unhealthy after 8.0s
- Coordinator-2 and Coordinator-3 remained operational (status: degraded)
- Events still processed correctly: 250/250 via coordinator-2
Phase 5: Coordinator Failover (Raft)
- Killed coordinator-1 (Raft leader)
- New leader (node 2) elected in 10.0s
- API available on both surviving coordinators
- 100/100 events received through surviving pipeline
Phase 6: Recovery
- Restarted coordinator-1 (rejoined as Follower)
- Restarted worker-1 and worker-2
- All 4 workers re-registered (Raft-replicated)
- Pipelines re-deployed successfully
- Raft roles after recovery: coord-1=Follower, coord-2=Leader, coord-3=Follower
- Events processed correctly after full recovery
Phase 7: Traffic During Coordinator Kill
- Background MQTT traffic kept running while coordinator-2 (the Raft leader at that point) was killed
- New leader elected: node 3
- No event loss during leader transition
- Both surviving coordinators remained operational
- Received 2028/1217 events (166.6% — includes duplicates from overlapping pipeline coverage)
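Counts above 100% are expected in this phase because more than one surviving pipeline can match the same input event. If exact counting is needed, the harness could deduplicate on a per-event identifier before comparing against the published total; a sketch, assuming each event payload carries a unique event_id field (the actual field name in the test payloads may differ):

```python
# Deduplicate received events on a unique identifier before counting, so
# overlapping pipeline coverage does not inflate the result.
import json

def count_unique(received_payloads: list[str], id_field: str = "event_id") -> int:
    seen = set()
    for payload in received_payloads:
        seen.add(json.loads(payload)[id_field])
    return len(seen)

# Example: 2028 raw deliveries could collapse back to the 1217 distinct events published.
```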
Failure Detection: Three Layers
Varpulis implements a layered failure detection model:
| Layer | Detection Time | Mechanism | Catches |
|---|---|---|---|
| WebSocket | ~0s (instant) | TCP connection drop on persistent WS | Process crash, OOM kill, network partition |
| K8s Pod Watcher | ~1-2s | K8s Watch API on pod status | OOMKill, eviction, node failure (K8s only) |
| REST Heartbeat | ~10-20s (fallback) | Heartbeat timeout + health sweep | Hung process, degraded performance, WS unavailable |
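These layers compose so that the fastest available signal wins: a WS disconnect starts a grace-period timer immediately, a pod-deletion event short-circuits it, and the REST heartbeat timeout remains the backstop. The sketch below is a conceptual model of that composition (illustrative only, not the Varpulis implementation), using the 5s grace period and ~20s heartbeat fallback from the tables in this section:

```python
# Conceptual model of layered failure detection: a worker is unhealthy as soon
# as the earliest signal fires, i.e. pod deletion, a WS drop older than the
# grace period, or a missed REST heartbeat past the fallback timeout.
import time
from dataclasses import dataclass, field

WS_GRACE_S = 5.0            # reconnect window after a WS drop
HEARTBEAT_TIMEOUT_S = 20.0  # REST fallback

@dataclass
class WorkerHealth:
    last_heartbeat: float = field(default_factory=time.monotonic)
    ws_dropped_at: float | None = None  # set on TCP FIN/RST, cleared on reconnect
    pod_deleted: bool = False           # set by the K8s pod watch stream

    def is_unhealthy(self, now: float) -> bool:
        if self.pod_deleted:
            return True
        if self.ws_dropped_at is not None and now - self.ws_dropped_at > WS_GRACE_S:
            return True
        return now - self.last_heartbeat > HEARTBEAT_TIMEOUT_S
```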
Failover Timing
| Parameter | Value |
|---|---|
| WS disconnect detection | ~0s (TCP FIN/RST) |
| Grace period (WS reconnect window) | 5s |
| Migration time after detection | ~1-2s |
| Total failover | ~7s (with grace) |
| Raft leader election | ~10s |
| REST fallback (if WS unavailable) | ~20s |
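These timings can be reproduced from outside the cluster by killing a worker container and polling the varpulis_cluster_workers_total{status="unhealthy"} gauge until it moves. A rough sketch, assuming a coordinator /metrics endpoint at coordinator-1:9100 and a container named varpulis-worker-1 (both names depend on the compose project):

```python
# Measure failure detection from the outside: kill a worker container, then
# poll the coordinator's /metrics until the unhealthy-worker gauge increases.
import re
import subprocess
import time

import requests

METRICS_URL = "http://coordinator-1:9100/metrics"  # assumed coordinator address
CONTAINER = "varpulis-worker-1"                    # assumed container name

def unhealthy_workers() -> float:
    text = requests.get(METRICS_URL, timeout=2).text
    m = re.search(r'varpulis_cluster_workers_total\{status="unhealthy"\}\s+([\d.]+)', text)
    return float(m.group(1)) if m else 0.0

baseline = unhealthy_workers()
subprocess.run(["docker", "kill", CONTAINER], check=True)
start = time.monotonic()
while unhealthy_workers() <= baseline:
    if time.monotonic() - start > 60:
        raise TimeoutError("worker was not marked unhealthy within 60s")
    time.sleep(0.2)
print(f"worker marked unhealthy after {time.monotonic() - start:.1f}s")
```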
Running the Test
Prerequisites
- Docker and Docker Compose v2+
- Docker socket accessible at /var/run/docker.sock
- ~5 GB RAM (3 coordinators + 4 workers + mosquitto + test driver)
- ~12 minutes for the complete test (includes release build with --features raft)
Command
```bash
cd tests/e2e-scaling
bash run.sh
```

Logs are captured to tests/e2e-scaling/results/docker-logs.txt for debugging.
Observability
Web UI
The Cluster Management view (/cluster) now includes a Health tab showing:
- Raft consensus status (role, term, commit index)
- Worker distribution (ready, unhealthy, draining)
- Operations summary (deploys, migrations)
Grafana Dashboard
A pre-provisioned Varpulis Cluster Operations dashboard (varpulis-cluster UID) displays:
- Raft role, term, and commit index over time
- Workers by status (timeseries + pie chart)
- Pipeline groups and deployment counts
- Deploy and migration duration histograms (p50, p99)
- Migration rate (success vs failure)
- Health sweep duration
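The same panels map onto plain PromQL; for example, the p99 migration duration is a histogram_quantile over the varpulis_cluster_migration_duration_seconds buckets. A sketch querying it through the standard Prometheus HTTP API, assuming Prometheus is reachable at prometheus:9090:

```python
# Query the Prometheus HTTP API for the p99 migration duration, mirroring the
# Grafana panel built on varpulis_cluster_migration_duration_seconds.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumed Prometheus address

QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(varpulis_cluster_migration_duration_seconds_bucket[5m])) by (le))"
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=5).json()
for result in resp.get("data", {}).get("result", []):
    print("p99 migration duration:", result["value"][1], "seconds")
```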
Prometheus Scrape Configuration
The coordinator's /metrics endpoint is scraped by Prometheus alongside worker metrics:
```yaml
- job_name: "varpulis-coordinator"
  static_configs:
    - targets: ["coordinator:9100"]
      labels:
        role: "coordinator"
  metrics_path: /metrics
  scrape_interval: 5s
```