
Varpulis CEP Engine — SLO/SLI Definitions

This document defines the Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the Varpulis Complex Event Processing engine. It is the authoritative reference for on-call engineers, capacity planners, and product stakeholders when assessing service health and negotiating reliability commitments.


Table of Contents

  1. Overview
  2. Service Level Indicators
  3. Service Level Objectives
  4. Prometheus Metrics Mapping
  5. Error Budget Policy
  6. Grafana SLO Dashboard
  7. Review Cadence

1. Overview

What are SLIs and SLOs?

A Service Level Indicator (SLI) is a specific, quantitative measure of some aspect of the service that matters to users. Good SLIs are:

  • Measurable from existing telemetry (Prometheus metrics in this case)
  • Clearly defined: the numerator and denominator are unambiguous
  • Reflective of user experience, not just internal health

A Service Level Objective (SLO) is a target value or range for an SLI, expressed as a percentage of time that the SLI must meet the target over a rolling window (typically 30 days). SLOs drive error budget policy — the amount of unreliability the team can "spend" before reliability work must take priority over feature work.

Why they matter for Varpulis

Varpulis processes time-sensitive event streams where late or dropped events have direct downstream consequences (missed alerts, incorrect pattern matches, stale forecasts). The SLOs below reflect the operational characteristics established during benchmark testing:

| Workload | Observed throughput | Source |
| --- | --- | --- |
| Filter (CLI preload) | 234K events/sec/worker | Apama comparison benchmark, scenario 01 |
| Sequence (CLI preload) | 256K events/sec/worker | Apama comparison benchmark, scenario 07 |
| Kleene (CLI preload, normalized by matches) | 97K events/sec/worker | Apama comparison benchmark, scenario 04 |
| Hamlet aggregation (single query) | 6.9M events/sec | Hamlet vs ZDD benchmark |
| PST prediction | 51 ns per call | PST forecast benchmark |
| MQTT connector ceiling | ~6K events/sec | I/O-bound, QoS 0 single-message |

These numbers establish what "normal" looks like and inform the objective thresholds below.


2. Service Level Indicators

2.1 Availability

Availability measures whether the service is reachable and can process requests. Varpulis has two availability surfaces:

Coordinator availability — the Raft-leader coordinator must be reachable for pipeline deployments and cluster management operations. Its HTTP API at port 9100 is the probe target.

Worker availability — each worker node must be heartbeating successfully and carrying assigned pipelines. Availability is measured as the fraction of workers that are in ready status at any point in time.

SLI formula (coordinator):

```
coordinator_availability =
  (minutes_coordinator_api_responded_2xx / total_minutes_in_window) * 100
```

SLI formula (worker pool):

```
worker_pool_availability =
  avg_over_time(
    (
      varpulis_cluster_workers_total{status="ready"}
      /
      (varpulis_cluster_workers_total{status="ready"}
       + varpulis_cluster_workers_total{status="unhealthy"})
    )[30d:5m]
  ) * 100
```

2.2 Processing Latency

Processing latency is measured from the moment an event is received by the engine to the moment it has been fully evaluated against all applicable stream definitions on a worker. The varpulis_processing_latency_seconds histogram tracks this per stream.

Two latency classes exist because pattern complexity varies significantly:

  • Simple filter — a stream containing only .where() predicates and/or .trend_aggregate(). No SASE NFA is constructed. Latency target: p99 < 10ms.
  • SASE pattern — a stream containing a sequence or Kleene pattern. The SASE+ engine runs an NFA traversal with active run management. Latency target: p99 < 100ms.

SLI formula (per stream):

```
latency_sli(quantile, stream) =
  histogram_quantile(
    quantile,
    sum(rate(varpulis_processing_latency_seconds_bucket[5m])) by (le, stream)
  )
```
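
Because the latency histogram carries only a `stream` label, the filter/SASE split has to be encoded in stream naming or tracked per known stream. A sketch of the per-class query, assuming a hypothetical `filter_` prefix naming convention (adjust the matcher to your deployment):

```promql
# p99 for filter-class streams only — the "filter_" prefix is an assumed
# naming convention, not something the engine enforces
histogram_quantile(
  0.99,
  sum(rate(varpulis_processing_latency_seconds_bucket{stream=~"filter_.*"}[5m])) by (le, stream)
)
```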

2.3 Throughput

Throughput measures sustained event ingestion and processing capacity. It is a saturation SLI: when throughput falls below the objective, the engine is either degraded or under-provisioned relative to the incoming load.

SLI formula:

```
throughput_sli(worker) =
  sum(rate(varpulis_events_processed[5m])) by (instance)
```

The per-worker floor is set conservatively at 100K events/sec because:

  • Benchmark filter mode achieves 234K events/sec at full CPU saturation
  • MQTT connector I/O caps ingress at ~6K events/sec, so the floor applies to engine-bound (preloaded) workloads rather than connector-limited ingress
  • 100K leaves a safe margin for mixed workloads that carry pattern-matching overhead
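
To make the floor cheap to evaluate and reusable across dashboards and alerts, per-worker throughput can be precomputed with a recording rule. A minimal sketch; the rule name `varpulis:worker_events:rate5m` is an assumption, not a rule that ships with the project:

```yaml
# Sketch of a recording rule for per-worker throughput (hypothetical name)
groups:
  - name: varpulis_slo_recording
    rules:
      - record: varpulis:worker_events:rate5m
        expr: sum(rate(varpulis_events_processed[5m])) by (instance)
```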

2.4 Error Rate

Error rate measures the fraction of ingested events that fail to complete processing. A failure includes: schema parse errors, runtime evaluation panics, connector delivery failures that route to the dead letter queue, and events dropped due to queue overflow.

SLI formula:

```
error_rate_sli =
  1 - (
    sum(rate(varpulis_events_processed[5m]))
    /
    sum(rate(varpulis_events_total[5m]))
  )
```

A separate DLQ accumulation SLI tracks events that completed engine processing but could not be delivered to sinks:

```
dlq_rate_sli =
  sum(rate(varpulis_dlq_events_total[5m]))
  /
  sum(rate(varpulis_output_events_total[5m]))
```

2.5 Recovery

Recovery SLIs measure how quickly the cluster self-heals after a failure event.

Leader failover time — after the current Raft leader becomes unavailable, how many seconds until a new leader is elected and coordinator API requests succeed again. Chaos tests exercise this path (see tests/e2e-raft/).

SLI proxy (Raft role stabilisation):

```
leader_failover_proxy =
  time from last varpulis_cluster_raft_role == 2
  to next varpulis_cluster_raft_role == 2
  (measured by changes(varpulis_cluster_raft_role[window]))
```

Pipeline migration time — after a worker goes unhealthy, how many seconds until its pipelines are reassigned and running on a healthy worker.

SLI proxy:

```
migration_p99 =
  histogram_quantile(
    0.99,
    sum(rate(varpulis_cluster_migration_duration_seconds_bucket[5m])) by (le, result)
  )
```

Checkpoint restore time — not currently exposed as a Prometheus metric. Tracked via structured logs (tracing spans tagged checkpoint_restore). Target: < 5s for a 1 GB state snapshot.


3. Service Level Objectives

The table below states all objectives as percentages over a 30-day rolling window.

| # | SLO Name | SLI | Objective | Window |
| --- | --- | --- | --- | --- |
| 1 | Coordinator Availability | HTTP 2xx response fraction | 99.9% | 30 days |
| 2 | Worker Pool Availability | Fraction of workers in ready state | 99.5% | 30 days |
| 3 | Filter Latency p99 | varpulis_processing_latency_seconds p99 (filter streams) | < 10ms | 5 min evaluation window |
| 4 | SASE Latency p99 | varpulis_processing_latency_seconds p99 (pattern streams) | < 100ms | 5 min evaluation window |
| 5 | Throughput Floor | events/sec/worker (sum across all streams) | > 100K/worker | 5 min evaluation window |
| 6 | Event Processing Error Rate | failed events / total events | < 0.1% | 30 days |
| 7 | DLQ Delivery Error Rate | DLQ events / output events | < 0.5% | 30 days |
| 8 | Leader Failover | Time from leader loss to new leader elected | < 30s | per incident |
| 9 | Pipeline Migration p99 | varpulis_cluster_migration_duration_seconds p99 | < 10s | 5 min evaluation window |

Rationale for latency targets

The 10ms p99 target for simple filter streams is calibrated from the benchmark showing 234K events/sec throughput with a 4 µs average per-event cost. Even accounting for connector I/O overhead, lock contention, and GC pressure, the 10ms budget provides 2500x headroom over the observed per-event cost.

The 100ms p99 target for SASE pattern streams reflects the SASE active run overhead. With 10K+ active runs (varpulis_sase_peak_active_runs), each event must traverse all active run states. The alert threshold in alerts.yml fires at 100ms for exactly this reason.

Rationale for availability targets

99.9% coordinator uptime permits ~43 minutes of downtime per month. Because Raft provides automatic leader re-election (tested in chaos scenarios), sustained outages should be rare. The 0.1% unplanned budget covers:

  • Rolling upgrades (coordinator restart ~30s, covered by Raft failover)
  • Unexpected leader election storms (mitigated by VarpulisRaftLeaderChurn alert)
  • Infrastructure maintenance windows

99.5% worker pool availability permits ~3.6 hours of reduced worker capacity per month. Workers can be added or replaced without stopping pipelines (migrations handle reassignment).


4. Prometheus Metrics Mapping

4.1 Complete metrics inventory

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| varpulis_events_total | Counter | event_type | Events received by the engine |
| varpulis_events_processed | Counter | stream | Events fully processed per stream |
| varpulis_output_events_total | Counter | stream, event_type | Events emitted to sinks |
| varpulis_processing_latency_seconds | Histogram | stream | End-to-end per-event processing time |
| varpulis_stream_queue_size | Gauge | stream | Pending events in stream input queue |
| varpulis_active_streams | Gauge | — | Number of deployed stream definitions |
| varpulis_sase_events_total | Counter | — | Events evaluated by SASE+ engine |
| varpulis_sase_matches_total | Counter | — | Pattern matches emitted |
| varpulis_sase_peak_active_runs | Gauge | — | Peak concurrent NFA runs |
| varpulis_dlq_events_total | Counter | — | Events written to dead letter queue |
| varpulis_connector_healthy | Gauge | connector, connector_type | 1=healthy, 0=unhealthy |
| varpulis_cluster_workers_total | Gauge | status | Workers by status (ready/unhealthy/draining) |
| varpulis_cluster_raft_role | Gauge | — | 0=Follower, 1=Candidate, 2=Leader |
| varpulis_cluster_raft_term | Gauge | — | Current Raft consensus term |
| varpulis_cluster_raft_commit_index | Gauge | — | Last committed log entry index |
| varpulis_cluster_pipeline_groups_total | Gauge | — | Pipeline groups registered |
| varpulis_cluster_deployments_total | Counter | — | Cumulative pipeline deployments |
| varpulis_cluster_deploy_duration_seconds | Histogram | result | Pipeline deployment latency |
| varpulis_cluster_migration_duration_seconds | Histogram | result | Pipeline migration latency |
| varpulis_cluster_migrations_total | Counter | result | Pipeline migrations by result |
| varpulis_cluster_health_sweep_duration_seconds | Histogram | workers_checked | Coordinator health sweep latency |
| process_resident_memory_bytes | Gauge | job | RSS memory per process (node_exporter) |

4.2 PromQL queries for each SLI

SLO 1 — Coordinator Availability

Use the Prometheus up metric (set by the scrape loop) as a proxy for API availability. For full HTTP-level probing, configure blackbox_exporter against http://coordinator:9100/health.

```promql
# Fraction of scrape intervals where coordinator was reachable (30d window)
avg_over_time(up{job="varpulis-coordinator"}[30d])
```

Error budget remaining (in minutes):

```promql
# Minutes of downtime budget remaining in the 30-day window
(
  avg_over_time(up{job="varpulis-coordinator"}[30d]) - 0.999
) * 30 * 24 * 60
```
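
For the HTTP-level probing mentioned above, a minimal blackbox_exporter scrape configuration could look like the sketch below. The exporter address `blackbox-exporter:9115` and the standard `http_2xx` module are assumptions about the deployment:

```yaml
scrape_configs:
  - job_name: "varpulis-coordinator-probe"
    metrics_path: /probe
    params:
      module: [http_2xx]                   # blackbox_exporter's standard HTTP module
    static_configs:
      - targets: ["http://coordinator:9100/health"]
    relabel_configs:
      - source_labels: [__address__]       # pass the probe target as a query param
        target_label: __param_target
      - source_labels: [__param_target]    # keep the target visible as `instance`
        target_label: instance
      - target_label: __address__          # scrape the exporter, not the target
        replacement: blackbox-exporter:9115
```

With this in place, `probe_success{job="varpulis-coordinator-probe"}` can replace `up` in the availability queries above.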

SLO 2 — Worker Pool Availability

```promql
# Current worker pool availability ratio
varpulis_cluster_workers_total{status="ready"}
/
(
  varpulis_cluster_workers_total{status="ready"}
  + varpulis_cluster_workers_total{status="unhealthy"}
)
```

```promql
# Rolling 30-day worker pool availability
avg_over_time(
  (
    varpulis_cluster_workers_total{status="ready"}
    /
    (
      varpulis_cluster_workers_total{status="ready"}
      + varpulis_cluster_workers_total{status="unhealthy"}
    )
  )[30d:5m]
)
```
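
The 30-day subquery above is expensive to evaluate on every dashboard refresh. One common mitigation is to record the instantaneous ratio and average the recorded series instead; the rule name below is hypothetical:

```yaml
# Sketch: record the instantaneous ratio so long-range averages stay cheap
- record: varpulis:worker_pool:availability
  expr: >
    varpulis_cluster_workers_total{status="ready"}
    /
    (
      varpulis_cluster_workers_total{status="ready"}
      + varpulis_cluster_workers_total{status="unhealthy"}
    )
```

Then `avg_over_time(varpulis:worker_pool:availability[30d])` replaces the subquery.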

SLO 3 — Filter Latency p99

```promql
# p99 processing latency per stream (5-minute rate window)
histogram_quantile(
  0.99,
  sum(rate(varpulis_processing_latency_seconds_bucket[5m])) by (le, stream)
)
```

Fraction of 5-minute windows where p99 was within target (used for burn rate):

```promql
# Good evaluation windows: 1 when p99 < 10ms, 0 when breached
# (the bool modifier keeps every stream in the result)
histogram_quantile(
  0.99,
  sum(rate(varpulis_processing_latency_seconds_bucket[5m])) by (le, stream)
) < bool 0.010
```

SLO 4 — SASE Latency p99

```promql
# Pattern (SASE) streams breaching the 100ms target — same metric, higher threshold
histogram_quantile(
  0.99,
  sum(rate(varpulis_processing_latency_seconds_bucket[5m])) by (le, stream)
) > 0.100
```

The alert rule VarpulisHighProcessingLatency in deploy/prometheus/alerts.yml already fires at exactly this threshold (0.1 seconds for 5 minutes).

SLO 5 — Throughput Floor

```promql
# Per-worker events processed per second
sum(rate(varpulis_events_processed[5m])) by (instance)
```

```promql
# Cluster-wide aggregate throughput
sum(rate(varpulis_events_processed[5m]))
```

Alert condition (throughput below floor per worker):

```promql
sum(rate(varpulis_events_processed[5m])) by (instance) < 100000
```
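
Wired into Prometheus, that condition becomes an alert rule along these lines. The rule name is hypothetical and not part of the shipped alerts.yml:

```yaml
- alert: VarpulisWorkerThroughputBelowFloor   # hypothetical rule name
  expr: sum(rate(varpulis_events_processed[5m])) by (instance) < 100000
  for: 10m                                    # require a sustained shortfall, not a blip
  labels:
    severity: warning
    slo: throughput_floor
  annotations:
    summary: "Worker {{ $labels.instance }} below the 100K events/sec floor"
```

Note that this also fires during genuinely quiet periods; if the deployment has idle windows, gate the expression on expected ingress.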

SLO 6 — Event Processing Error Rate

```promql
# Error rate as a ratio (0.0 to 1.0)
1 - (
  sum(rate(varpulis_events_processed[5m]))
  /
  sum(rate(varpulis_events_total[5m]))
)
```

```promql
# Error rate as a percentage — SLO breached when > 0.1%
(
  1 - (
    sum(rate(varpulis_events_processed[5m]))
    /
    sum(rate(varpulis_events_total[5m]))
  )
) * 100 > 0.1
```

Note: the existing alert VarpulisHighErrorRate fires at > 1% (10x the SLO threshold). The 0.1% SLO is tracked separately for budget accounting.

SLO 7 — DLQ Delivery Error Rate

```promql
# Fraction of output events that ended up in the DLQ
sum(rate(varpulis_dlq_events_total[5m]))
/
sum(rate(varpulis_output_events_total[5m]))
```
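
For budget accounting, the same ratio can be restated as remaining budget against the 0.5% allowance, mirroring the error-budget panels in section 6:

```promql
# DLQ budget remaining as a percentage (0.5% allowance, 30-day window)
(
  0.005
  - (
      sum(rate(varpulis_dlq_events_total[30d]))
      /
      sum(rate(varpulis_output_events_total[30d]))
    )
) / 0.005 * 100
```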

SLO 8 — Leader Failover Time

This cannot be expressed directly in PromQL because it requires knowing the exact instant the leader became unavailable. Use the following proxy to detect failover events and measure duration:

```promql
# Raft role per node: 1 when leader, 0 otherwise
varpulis_cluster_raft_role == bool 2

# Number of role transitions in a rolling 15-minute window (election churn)
changes(varpulis_cluster_raft_role[15m])
```

For post-incident measurement, query the gap in leader continuity:

```promql
# Periods with no leader elected (all nodes are follower or candidate); the bool
# modifier keeps every series as 0/1, so the sum is 0 rather than empty
sum(varpulis_cluster_raft_role == bool 2) == 0
```
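
A per-incident SLO cannot burn continuously, but the 30-second target maps naturally onto an alert `for:` clause. A sketch with a hypothetical rule name:

```yaml
- alert: VarpulisNoRaftLeader                 # hypothetical rule name
  # bool keeps every node's series as 0/1, so the sum counts current leaders
  expr: sum(varpulis_cluster_raft_role == bool 2) == 0
  for: 30s                                    # matches the SLO 8 failover target
  labels:
    severity: critical
    slo: leader_failover
  annotations:
    summary: "No Raft leader elected for over 30 seconds"
```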

SLO 9 — Pipeline Migration p99

```promql
# p99 migration duration for successful migrations
histogram_quantile(
  0.99,
  sum(rate(varpulis_cluster_migration_duration_seconds_bucket[5m])) by (le, result)
)
```

```promql
# Migration failure rate (should be 0)
rate(varpulis_cluster_migrations_total{result="failure"}[5m])
```

5. Error Budget Policy

Error budget calculation

An error budget is the complement of the SLO: the amount of unreliability that is acceptable over the measurement window.

```
error_budget = (1 - SLO_target) * window_duration

Example: Coordinator availability (SLO 1)
  error_budget = (1 - 0.999) * 30 days * 24 hours * 60 min = 43.2 minutes/month
```
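
The same arithmetic works for rate-based SLOs when restated in events. For example, assuming a worker sustains exactly the 100K events/sec floor from SLO 5:

```
Example: Event processing error rate (SLO 6), per worker
  events_per_window = 100,000 events/sec * 86,400 sec/day * 30 days = 259.2B events
  error_budget      = 0.001 * 259.2B = 259.2M failed events/month
```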

Budget consumption and policy tiers

| Budget consumed | Status | Policy |
| --- | --- | --- |
| < 50% | Green | Normal operations. Feature work proceeds at full pace. |
| 50% – 75% | Yellow | Reliability review in the next sprint. Identify top incident contributors. Increase test coverage for affected components. |
| 75% – 95% | Orange | Reliability freeze. No new features merged to production until the root cause of consumption is resolved and a remediation plan is reviewed. On-call rotations reviewed. |
| > 95% (breached or at risk) | Red | Incident declared. All engineering focus shifts to reliability. Postmortem required within 5 business days. SLO targets reviewed for realism. |

Budget reset events

The following events reset the budget consumption clock:

  • A new 30-day window begins (rolling window, so this is continuous)
  • The SLO target is revised downward (agreed with stakeholders after postmortem)
  • An infrastructure tier change (e.g., scaling from 2 to 4 workers) that materially improves the SLI baseline

Exclusions from budget consumption

The following periods are excluded from budget calculation when approved in advance:

  • Announced maintenance windows (max 4 hours per calendar month, requires 48-hour notice)
  • Force-majeure infrastructure events (cloud provider outages affecting the entire region)
  • Events caused by customer misconfiguration rather than engine defects

Exclusions must be recorded in the incident log with a start/end timestamp before they are applied.


6. Grafana SLO Dashboard

The SLO dashboard supplements the existing varpulis.json and varpulis-cluster.json dashboards. It provides burn rate panels that surface budget consumption faster than simple threshold alerts.

Multi-window burn rate

Google SRE recommends alerting on SLO burn rate across two time windows simultaneously to reduce alert noise while maintaining sensitivity to fast budget drain. The standard windows are 1-hour/5-minute (fast burn detection) and 6-hour/30-minute (slow burn detection).

Burn rate definition:

```
burn_rate = error_rate_in_window / (1 - SLO_target)
```

A burn rate of 1 means you are on track to use exactly 100% of the budget by the end of the window. A burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget; held continuously, it exhausts the entire budget in about two days (720 h / 14.4 = 50 h).

Panel definitions

Panel 1: Coordinator Availability Burn Rate (6h / 1h windows)

```promql
# 6-hour window burn rate
(1 - avg_over_time(up{job="varpulis-coordinator"}[6h])) / (1 - 0.999)

# 1-hour window burn rate (fast burn)
(1 - avg_over_time(up{job="varpulis-coordinator"}[1h])) / (1 - 0.999)
```

Alert thresholds (based on Google's multiwindow model, 30-day budget):

| Condition | Severity | Meaning |
| --- | --- | --- |
| 6h burn rate > 14.4 AND 1h burn rate > 14.4 | critical | Exhausts monthly budget in ~2 days |
| 6h burn rate > 6 AND 1h burn rate > 6 | warning | Exhausts monthly budget in ~5 days |
| 6h burn rate > 3 AND 30m burn rate > 3 | warning | Exhausts monthly budget in ~10 days |
| 6h burn rate > 1 | info | Burning faster than sustainable |

Panel 2: Event Processing Error Budget (30-day rolling)

```promql
# Budget remaining as a percentage of total budget
(
  0.001                                         # SLO allowance (0.1%)
  - (
      1 - (
        sum(rate(varpulis_events_processed[30d]))
        /
        sum(rate(varpulis_events_total[30d]))
      )
    )
) / 0.001 * 100
```

Display as a gauge: green above 50%, yellow 25-50%, red below 25%.

Panel 3: Processing Latency SLO Compliance (30-day)

Since varpulis_processing_latency_seconds is a histogram, compliance is computed using the ratio of good requests (latency within budget) to total requests:

```promql
# Good requests: processed within 100ms (SASE target)
sum(rate(varpulis_processing_latency_seconds_bucket{le="0.1"}[30d]))
/
sum(rate(varpulis_processing_latency_seconds_count[30d]))
```

```promql
# Good requests: processed within 10ms (filter target)
sum(rate(varpulis_processing_latency_seconds_bucket{le="0.01"}[30d]))
/
sum(rate(varpulis_processing_latency_seconds_count[30d]))
```
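
As with Panel 2, compliance can be converted into remaining budget. A sketch, assuming the same 99.9% compliance allowance that Panel 4 uses for the SASE target:

```promql
# Latency budget remaining as a percentage (0.1% non-compliance allowance)
(
  0.001
  - (
      1 - (
        sum(rate(varpulis_processing_latency_seconds_bucket{le="0.1"}[30d]))
        /
        sum(rate(varpulis_processing_latency_seconds_count[30d]))
      )
    )
) / 0.001 * 100
```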

Panel 4: Latency Burn Rate (6h window)

```promql
# SASE latency burn rate (threshold = 100ms, SLO = 99.9% compliance)
(
  1 - (
    sum(rate(varpulis_processing_latency_seconds_bucket{le="0.1"}[6h]))
    /
    sum(rate(varpulis_processing_latency_seconds_count[6h]))
  )
) / 0.001
```

Panel 5: Worker Pool Health Timeline

```promql
# Worker availability ratio over time (for burn rate visualization)
varpulis_cluster_workers_total{status="ready"}
/
(
  varpulis_cluster_workers_total{status="ready"}
  + varpulis_cluster_workers_total{status="unhealthy"}
)
```

```promql
# Worker pool burn rate (6h)
(
  1 - avg_over_time(
    (
      varpulis_cluster_workers_total{status="ready"}
      /
      (
        varpulis_cluster_workers_total{status="ready"}
        + varpulis_cluster_workers_total{status="unhealthy"}
      )
    )[6h:1m]
  )
) / 0.005
```

Panel 6: DLQ Accumulation Rate

```promql
# Events entering DLQ per minute
sum(rate(varpulis_dlq_events_total[5m])) * 60

# DLQ burn rate relative to the 0.5% budget
(
  sum(rate(varpulis_dlq_events_total[1h]))
  /
  sum(rate(varpulis_output_events_total[1h]))
) / 0.005
```

Panel 7: SASE Active Runs vs Threshold

```promql
# Current peak active SASE runs (memory saturation indicator)
varpulis_sase_peak_active_runs

# Alert band at 10K (existing alert threshold)
# Show as a reference line at 10000
```

Panel 8: Migration Latency p99 vs SLO

```promql
# p99 migration duration (target < 10s)
histogram_quantile(
  0.99,
  sum(rate(varpulis_cluster_migration_duration_seconds_bucket[5m])) by (le)
)
```

Alert rules for SLO burn rates

Add the following groups to deploy/prometheus/alerts.yml:

```yaml
- name: varpulis_slo_burn_rate
  rules:
    # Coordinator availability: fast burn (~2-day exhaustion)
    - alert: VarpulisCoordinatorBurnRateCritical
      expr: >
        (1 - avg_over_time(up{job="varpulis-coordinator"}[6h])) / (1 - 0.999) > 14.4
        and
        (1 - avg_over_time(up{job="varpulis-coordinator"}[1h])) / (1 - 0.999) > 14.4
      labels:
        severity: critical
        slo: coordinator_availability
      annotations:
        summary: "Coordinator SLO burning at critical rate"
        description: >
          At the current error rate, the monthly coordinator availability budget
          will be exhausted in approximately two days.

    # Coordinator availability: slow burn (~5-day exhaustion)
    - alert: VarpulisCoordinatorBurnRateWarning
      expr: >
        (1 - avg_over_time(up{job="varpulis-coordinator"}[6h])) / (1 - 0.999) > 6
        and
        (1 - avg_over_time(up{job="varpulis-coordinator"}[1h])) / (1 - 0.999) > 6
      for: 5m
      labels:
        severity: warning
        slo: coordinator_availability
      annotations:
        summary: "Coordinator SLO burning at elevated rate"

    # Error rate SLO burn (0.1% target)
    - alert: VarpulisErrorBudgetBurnRateCritical
      expr: >
        (
          1 - (
            sum(rate(varpulis_events_processed[6h]))
            / sum(rate(varpulis_events_total[6h]))
          )
        ) / 0.001 > 14.4
        and
        (
          1 - (
            sum(rate(varpulis_events_processed[1h]))
            / sum(rate(varpulis_events_total[1h]))
          )
        ) / 0.001 > 14.4
      labels:
        severity: critical
        slo: event_error_rate
      annotations:
        summary: "Event processing error budget burning critically fast"

    # Latency SLO burn (SASE p99 < 100ms)
    - alert: VarpulisLatencyBudgetBurnRateCritical
      expr: >
        (
          1 - (
            sum(rate(varpulis_processing_latency_seconds_bucket{le="0.1"}[6h]))
            / sum(rate(varpulis_processing_latency_seconds_count[6h]))
          )
        ) / 0.001 > 14.4
      labels:
        severity: critical
        slo: sase_latency_p99
      annotations:
        summary: "SASE latency SLO burning critically fast"
        description: >
          More than 0.1% of event processing operations are exceeding the 100ms
          p99 latency target at a rate that will exhaust the monthly budget in ~2 days.
```

7. Review Cadence

| Activity | Frequency | Owner |
| --- | --- | --- |
| SLO status review (budget consumption) | Weekly (Monday standup) | On-call engineer |
| Error budget report to stakeholders | Monthly | Engineering lead |
| SLO target review and adjustment | Quarterly | Engineering + product |
| Postmortem after SLO breach | Within 5 business days of breach | Incident commander |
| Dashboard and alert rule audit | Quarterly (coincides with target review) | Platform team |

Related documentation:
  • docs/operations/alerting.md — alert rule descriptions and investigation guidance
  • docs/operations/runbook.md — step-by-step operational procedures
  • deploy/prometheus/alerts.yml — live alert rule definitions
  • deploy/docker/grafana/dashboards/varpulis.json — engine metrics dashboard
  • deploy/docker/grafana/dashboards/varpulis-cluster.json — cluster health dashboard
