
E2E Horizontal Scaling + Raft Coordinator HA Test Results

Overview

This test suite validates Varpulis CEP's horizontal scaling and high-availability capabilities using real Docker containers, MQTT traffic, and process kills. It tests:

  • Raft consensus for coordinator state replication across 3 nodes
  • WebSocket-based failure detection (~0s vs ~20s with REST heartbeats)
  • Worker failover with automatic pipeline migration
  • Coordinator failover with Raft leader re-election
  • Full cluster recovery after cascading failures
  • Traffic resilience during coordinator and worker kills

Architecture

(Figure: E2E Docker test architecture)

Raft Configuration

| Parameter | Value |
|---|---|
| Nodes | 3 coordinators |
| Heartbeat interval | 500ms |
| Election timeout | 1500-3000ms |
| Quorum | 2 of 3 (tolerates 1 failure) |
| Log storage | In-memory |
| Transport | HTTP (reqwest) |
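The heartbeat/election-timeout ratio above is what keeps elections rare: a follower only stands as a candidate after missing heartbeats for its randomized timeout, which is at least three heartbeat intervals. A minimal sketch of that follower-side timer logic (illustrative Python, not the actual Rust implementation):

```python
import random

HEARTBEAT_INTERVAL = 0.5        # leader sends heartbeats every 500ms
ELECTION_TIMEOUT = (1.5, 3.0)   # follower waits 1500-3000ms before standing

def new_election_deadline(now: float) -> float:
    """Pick a fresh randomized deadline; randomization avoids split votes."""
    return now + random.uniform(*ELECTION_TIMEOUT)

class Follower:
    def __init__(self, now: float = 0.0):
        self.deadline = new_election_deadline(now)

    def on_heartbeat(self, now: float) -> None:
        # Any message from a valid leader resets the election timer.
        self.deadline = new_election_deadline(now)

    def should_start_election(self, now: float) -> bool:
        return now >= self.deadline
```

Because the minimum timeout is 3x the heartbeat interval, a follower tolerates at least two lost heartbeats before triggering an election; the randomized window makes it unlikely that two followers time out simultaneously and split the vote.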

Prometheus Metrics Exposed

The coordinator now exposes cluster metrics via /metrics (Prometheus text format):

| Metric | Type | Labels | Description |
|---|---|---|---|
| varpulis_cluster_raft_role | Gauge | - | Raft role: 0=follower, 1=candidate, 2=leader |
| varpulis_cluster_raft_term | Gauge | - | Current Raft term |
| varpulis_cluster_raft_commit_index | Gauge | - | Current Raft commit index |
| varpulis_cluster_workers_total | Gauge | status | Workers by status (ready, unhealthy, draining) |
| varpulis_cluster_pipeline_groups_total | Gauge | - | Number of deployed pipeline groups |
| varpulis_cluster_deployments_total | Gauge | - | Number of pipeline deployments |
| varpulis_cluster_migrations_total | Counter | result | Migrations by result (success, failure) |
| varpulis_cluster_migration_duration_seconds | Histogram | result | Migration duration |
| varpulis_cluster_deploy_duration_seconds | Histogram | result | Deploy duration |
| varpulis_cluster_health_sweep_duration_seconds | Histogram | workers_checked | Health sweep duration |
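These metrics make a "no leader" condition directly alertable: the leader reports `varpulis_cluster_raft_role` = 2, so if no scraped coordinator reports 2, an election is in progress or quorum is lost. A sketch of a Prometheus alerting rule (group name, threshold, and `for` duration are illustrative, not shipped with Varpulis):

```yaml
groups:
  - name: varpulis-cluster
    rules:
      - alert: VarpulisNoRaftLeader
        # raft_role is 2 on the leader; if no node reports 2,
        # the cluster currently has no leader.
        expr: max(varpulis_cluster_raft_role) < 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "No Raft leader elected for 30s"
```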

Latest Test Results

Date: 2026-02-12
Configuration: 3 Raft coordinators, 4 workers, MQTT broker
Result: 8/8 phases passed

```text
============================================================
=== E2E Scaling + Raft Coordinator HA Test Results ===
  Phase 0: Setup .............................. PASS (12.1s)
           leader=node 1, warm-up 50/50
  Phase 1: Baseline ........................... PASS (0.5s)
           250/250 events
  Phase 2: WebSocket Check .................... PASS (0.0s)
           coord-1: 2 WS, coord-2: 1 WS, coord-3: 1 WS
  Phase 3: Worker Failover (WS) ............... PASS (8.2s)
           250/250 events, failover in 7.1s, WS: 2->1
  Phase 4: Cascading Failure .................. PASS (8.9s)
           250/250 events via coordinator-2, coordinator-2 healthy: True, coordinator-3 healthy: True
  Phase 5: Coordinator Failover (Raft) ........ PASS (23.0s)
           old leader: node 1, new leader: node 2, election: 10.0s, events: 100/100
  Phase 6: Recovery ........................... PASS (18.8s)
           417/250 events, all 3 coordinators up: True, Raft leader: node 2
  Phase 7: Traffic During Coordinator Kill .... PASS (35.4s)
           2028/1217 events (166.6%), new leader: node 3, 2 surviving coordinators OK
============================================================
RESULT: 8/8 phases passed
============================================================
```

Results Table

| Phase | Test | Result | Duration | Details |
|---|---|---|---|---|
| 0 | Setup | PASS | 12.1s | Leader=node 1, warm-up 50/50 |
| 1 | Baseline | PASS | 0.5s | 250/250 events |
| 2 | WebSocket Check | PASS | 0.0s | coord-1: 2 WS, coord-2: 1 WS, coord-3: 1 WS |
| 3 | Worker Failover (WS) | PASS | 8.2s | 250/250 events, failover in 7.1s, WS: 2->1 |
| 4 | Cascading Failure | PASS | 8.9s | 250/250 events via coordinator-2, both surviving coordinators healthy |
| 5 | Coordinator Failover (Raft) | PASS | 23.0s | Old leader: node 1, new leader: node 2, election: 10.0s, events: 100/100 |
| 6 | Recovery | PASS | 18.8s | 417/250 events, all 3 coordinators up, Raft leader: node 2 |
| 7 | Traffic During Coordinator Kill | PASS | 35.4s | 2028/1217 events (166.6%), new leader: node 3, 2 surviving coordinators OK |

Phase Details

Phase 0: Setup

  • All 3 coordinators healthy with 4 workers each (Raft-replicated)
  • Raft leader elected on node 1
  • Coordinator-1: Leader, Coordinator-2: Follower, Coordinator-3: Follower
  • MQTT connectors registered, 3 pipelines deployed on workers
  • Warm-up: 50/50 events received successfully

Phase 1: Baseline

  • Published 500 events (250 matching filter criteria)
  • All 250 expected events received correctly

Phase 2: WebSocket Check

  • WebSocket connections distributed across coordinators:
    • Coordinator-1: 2 WS connections (workers: 4)
    • Coordinator-2: 1 WS connection (workers: 4)
    • Coordinator-3: 1 WS connection (workers: 4)
  • All coordinators reporting correct Raft roles

Phase 3: Worker Failover (WS)

  • Killed worker-1 (container stopped)
  • Worker-1 marked unhealthy after 6.0s (WS disconnect detection + grace period)
  • WS connections dropped from 2 to 1
  • Pipeline migration completed in 7.1s
  • All 250/250 events received after failover

Phase 4: Cascading Failure

  • Killed worker-2 (second worker failure)
  • Worker-2 marked unhealthy after 8.0s
  • Coordinator-2 and Coordinator-3 remained operational (status: degraded)
  • Events still processed correctly: 250/250 via coordinator-2

Phase 5: Coordinator Failover (Raft)

  • Killed coordinator-1 (Raft leader)
  • New leader (node 2) elected in 10.0s
  • API available on both surviving coordinators
  • 100/100 events received through surviving pipeline

Phase 6: Recovery

  • Restarted coordinator-1 (rejoined as Follower)
  • Restarted worker-1 and worker-2
  • All 4 workers re-registered (Raft-replicated)
  • Pipelines re-deployed successfully
  • Raft roles after recovery: coord-1=Follower, coord-2=Leader, coord-3=Follower
  • Events processed correctly after full recovery

Phase 7: Traffic During Coordinator Kill

  • Background traffic started during coordinator-2 (leader) kill
  • New leader elected: node 3
  • No event loss during leader transition
  • Both surviving coordinators remained operational
  • Received 2028/1217 events (166.6% — includes duplicates from overlapping pipeline coverage)
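The >100% delivery in Phase 7 reflects at-least-once semantics: while pipelines are migrated during a kill, overlapping pipeline coverage can deliver the same event more than once. Consumers that need exactly-once effects can deduplicate on a stable event ID; a minimal sketch (the `event_id` field is an assumption, not part of the documented payload):

```python
def dedup(events):
    """Drop events whose ID was already seen, preserving arrival order."""
    seen = set()
    unique = []
    for event in events:
        eid = event["event_id"]
        if eid not in seen:
            seen.add(eid)
            unique.append(event)
    return unique
```

In a long-running consumer the `seen` set would need bounding (e.g. a TTL or sliding window), since duplicates only arise within the short failover interval.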

Failure Detection: Three Layers

Varpulis implements a layered failure detection model:

| Layer | Detection Time | Mechanism | Catches |
|---|---|---|---|
| WebSocket | ~0s (instant) | TCP connection drop on persistent WS | Process crash, OOM kill, network partition |
| K8s Pod Watcher | ~1-2s | K8s Watch API on pod status | OOMKill, eviction, node failure (K8s only) |
| REST Heartbeat | ~10-20s (fallback) | Heartbeat timeout + health sweep | Hung process, degraded performance, WS unavailable |
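The layers are complementary: the effective detection latency for a given failure is the fastest available layer that can actually observe it (a hung process keeps its TCP connection open, so only the heartbeat layer catches it). A toy model of that selection, with latencies taken from the table and an illustrative failure-mode mapping following the Catches column:

```python
# Approximate detection latency per layer, in seconds (from the table).
LAYER_LATENCY = {"websocket": 0.0, "k8s_watch": 1.5, "rest_heartbeat": 15.0}

# Which layers can observe which failure modes (illustrative mapping).
CATCHES = {
    "process_crash": {"websocket"},
    "oom_kill": {"websocket", "k8s_watch"},
    "hung_process": {"rest_heartbeat"},  # TCP stays open, WS looks healthy
}

def detection_time(failure, available_layers):
    """Latency of the fastest available layer that catches this failure."""
    candidates = [
        LAYER_LATENCY[layer]
        for layer in available_layers
        if layer in CATCHES[failure]
    ]
    return min(candidates) if candidates else None
```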

Failover Timing

| Parameter | Value |
|---|---|
| WS disconnect detection | ~0s (TCP FIN/RST) |
| Grace period (WS reconnect window) | 5s |
| Migration time after detection | ~1-2s |
| Total failover | ~7s (with grace) |
| Raft leader election | ~10s |
| REST fallback (if WS unavailable) | ~20s |
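The ~7s total is roughly the sum of the per-stage budgets: near-instant WS disconnect detection, the 5s reconnect grace window, then migration (taking the upper end of the 1-2s range). A quick check of that arithmetic:

```python
# Failover budget in seconds, per the table above.
WS_DETECTION = 0.0   # TCP FIN/RST is observed as soon as the worker dies
GRACE_PERIOD = 5.0   # reconnect window before the worker is declared dead
MIGRATION = 2.0      # upper bound of the 1-2s migration range

TOTAL_FAILOVER = WS_DETECTION + GRACE_PERIOD + MIGRATION
print(f"total failover budget: {TOTAL_FAILOVER:.0f}s")  # ~7s
```

This matches the 7.1s failover measured in Phase 3.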

Running the Test

Prerequisites

  • Docker and Docker Compose v2+
  • Docker socket accessible at /var/run/docker.sock
  • ~5 GB RAM (3 coordinators + 4 workers + mosquitto + test driver)
  • ~12 minutes for the complete test (includes release build with --features raft)

Command

```bash
cd tests/e2e-scaling
bash run.sh
```

Logs are captured to tests/e2e-scaling/results/docker-logs.txt for debugging.

Observability

Web UI

The Cluster Management view (/cluster) now includes a Health tab showing:

  • Raft consensus status (role, term, commit index)
  • Worker distribution (ready, unhealthy, draining)
  • Operations summary (deploys, migrations)

Grafana Dashboard

A pre-provisioned Varpulis Cluster Operations dashboard (varpulis-cluster UID) displays:

  • Raft role, term, and commit index over time
  • Workers by status (timeseries + pie chart)
  • Pipeline groups and deployment counts
  • Deploy and migration duration histograms (p50, p99)
  • Migration rate (success vs failure)
  • Health sweep duration
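The duration panels are standard Prometheus histogram queries over the `_bucket` series that histograms export; for example, p99 migration latency per result (query shape is a sketch, not copied from the shipped dashboard):

```promql
histogram_quantile(
  0.99,
  sum(rate(varpulis_cluster_migration_duration_seconds_bucket[5m])) by (le, result)
)
```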

Prometheus Scrape Configuration

The coordinator's /metrics endpoint is scraped by Prometheus alongside worker metrics:

```yaml
- job_name: "varpulis-coordinator"
  static_configs:
    - targets: ["coordinator:9100"]
      labels:
        role: "coordinator"
  metrics_path: /metrics
  scrape_interval: 5s
```
