Varpulis Alerting Reference

This document describes the Prometheus alerting rules defined in deploy/prometheus/alerts.yml and provides guidance on investigating and resolving each alert.

Engine Alerts

VarpulisHighProcessingLatency

Severity: warning
Condition: p99 processing latency > 100ms for 5 minutes
Metric: varpulis_processing_latency_seconds (histogram)

Cause: A stream's event processing pipeline is taking longer than expected. Common causes include expensive .where() predicates, large aggregation windows, or high Kleene closure fan-out.

Response:

  1. Identify the affected stream from the stream label.
  2. Check if the stream has unbounded Kleene patterns -- add .within() constraints.
  3. Review aggregation window sizes -- consider reducing window duration.
  4. Check system CPU and I/O load on the worker hosting the stream.
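The slowest streams can be surfaced with a query along these lines, assuming the standard Prometheus histogram _bucket series and a per-stream label named stream (label names may differ in your deployment):

    histogram_quantile(0.99,
      sum by (le, stream) (rate(varpulis_processing_latency_seconds_bucket[5m])))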

VarpulisHighErrorRate

Severity: warning
Condition: > 1% of events failing processing for 5 minutes
Metrics: varpulis_events_total, varpulis_events_processed

Cause: Events are being received but not successfully processed. This can indicate schema mismatches, runtime evaluation errors, or connector delivery failures.

Response:

  1. Check application logs for processing errors.
  2. Verify event schemas match the VPL event type definitions.
  3. Check if a recent pipeline deployment introduced a bug.
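To see the current failure ratio, a query of roughly this shape can be used; it assumes varpulis_events_processed counts successfully processed events and may not match the exact expression in alerts.yml:

    1 - sum(rate(varpulis_events_processed[5m])) / sum(rate(varpulis_events_total[5m]))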

VarpulisCriticalErrorRate

Severity: critical
Condition: > 5% of events failing processing for 5 minutes
Metrics: varpulis_events_total, varpulis_events_processed

Cause: Same as VarpulisHighErrorRate but at a severity requiring immediate action.

Response:

  1. Consider rolling back the most recent pipeline deployment.
  2. Check dead letter queue files for patterns in the failing events.
  3. Escalate if the root cause is not immediately apparent.

VarpulisStreamQueueBacklog

Severity: warning
Condition: Stream queue size > 10,000 events for 5 minutes
Metric: varpulis_stream_queue_size (gauge)

Cause: The engine is not consuming events as fast as they arrive. Backpressure is building up in the stream's input queue.

Response:

  1. Check processing latency for the affected stream.
  2. Consider adding more workers and partitioning the stream with .partition_by().
  3. Verify upstream sources are not sending burst traffic.
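To see which queues are largest, a top-k over the gauge is usually enough (the gauge is assumed to be exported per stream):

    topk(5, varpulis_stream_queue_size)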

VarpulisSaseRunBacklog

Severity: warning
Condition: Peak active SASE runs > 10,000 for 5 minutes
Metric: varpulis_sase_peak_active_runs (gauge)

Cause: The SASE pattern engine has accumulated a large number of concurrent partial matches. This typically happens with broad patterns that lack window constraints.

Response:

  1. Add or tighten .within() time constraints on sequence patterns.
  2. Add more selective .where() predicates to reduce match candidates.
  3. If using Kleene closures, verify max_kleene_events is appropriately bounded.
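To check whether the run count is still climbing rather than holding steady, the gauge's rate of change can be inspected; this is an illustrative query, not one taken from alerts.yml:

    deriv(varpulis_sase_peak_active_runs[30m])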

VarpulisNoEventsReceived

Severity: warning
Condition: Zero events received for 10 minutes
Metric: varpulis_events_total (counter)

Cause: No events are arriving at the engine. Upstream sources or connectors may be down.

Response:

  1. Check connector health status.
  2. Verify MQTT broker or Kafka cluster availability.
  3. Check network connectivity between Varpulis and the event source.
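To confirm whether the outage affects all sources or only one, break the ingest rate down by label; the stream label on varpulis_events_total is an assumption and may be named differently:

    sum by (stream) (rate(varpulis_events_total[10m]))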

Cluster Alerts

VarpulisWorkerUnhealthy

Severity: critical
Condition: Any workers in "unhealthy" status for 2 minutes
Metric: varpulis_cluster_workers_total{status="unhealthy"}

Cause: A worker has missed heartbeat deadlines. It may have crashed, lost network connectivity, or be under extreme resource pressure.

Response:

  1. Check if the worker process is still running.
  2. Verify network connectivity between coordinator and worker.
  3. Check worker host for resource exhaustion (CPU, memory, disk).
  4. Pipelines will be auto-migrated; verify migration completes.
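A quick breakdown of worker states from the coordinator's metrics shows how widespread the problem is:

    sum by (status) (varpulis_cluster_workers_total)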

VarpulisNoReadyWorkers

Severity: critical
Condition: Zero ready workers for 1 minute
Metric: varpulis_cluster_workers_total{status="ready"}

Cause: All workers are either down, draining, or unhealthy. No pipelines can be executed.

Response:

  1. Immediately check all worker processes and hosts.
  2. Check coordinator logs for mass worker deregistration.
  3. Verify no infrastructure-wide issue (network partition, DNS failure).

VarpulisRaftLeaderChurn

Severity: warning
Condition: Raft role changes > 4 times in 15 minutes
Metric: varpulis_cluster_raft_role

Cause: The Raft consensus protocol is unable to maintain stable leadership. This causes cluster operations to stall during elections.

Response:

  1. Check network latency and packet loss between cluster nodes.
  2. Verify system clocks are synchronized (NTP).
  3. Ensure the cluster has an odd number of nodes (3 or 5 recommended).
  4. Check if any node is under extreme CPU or I/O pressure.
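The churn can be quantified per node with a query of this form, mirroring the alert condition:

    changes(varpulis_cluster_raft_role[15m])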

VarpulisRaftTermAdvancing

Severity: warning
Condition: Raft term increases > 5 in 10 minutes
Metric: varpulis_cluster_raft_term

Cause: Closely related to leader churn. Rapidly advancing terms indicate repeated failed elections.

Response: Same as VarpulisRaftLeaderChurn.
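Term growth can likewise be inspected directly; sustained values well above the threshold point to a repeating election loop. This sketch assumes the metric is exported as a gauge:

    delta(varpulis_cluster_raft_term[10m])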

VarpulisMigrationFailures

Severity: warning
Condition: Any migration failures in 10 minutes
Metric: varpulis_cluster_migrations_total{result="failure"}

Cause: Pipeline state migration between workers is failing. This can happen when target workers are unreachable or have insufficient resources.

Response:

  1. Check coordinator logs for migration error details.
  2. Verify the target worker has capacity for additional pipelines.
  3. Check if state checkpoints are corrupted or too large to transfer.
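Recent failure counts can be pulled straight from the counter referenced by the alert:

    increase(varpulis_cluster_migrations_total{result="failure"}[10m])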

VarpulisDeploymentFailures

Severity: warning
Condition: Any deployment failures in 10 minutes
Metric: varpulis_cluster_deploy_duration_seconds_count{result="failure"}

Cause: Pipeline group deployments are failing on workers.

Response:

  1. Validate VPL source with the /api/v1/cluster/validate endpoint.
  2. Check connector configurations (broker addresses, credentials).
  3. Verify workers have the required feature flags enabled.
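The recent failure count is available from the same series the alert uses:

    increase(varpulis_cluster_deploy_duration_seconds_count{result="failure"}[10m])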

VarpulisSlowHealthSweeps

Severity: warning
Condition: Health sweep p99 > 50ms
Metric: varpulis_cluster_health_sweep_duration_seconds (histogram)

Cause: The coordinator's periodic health check of all workers is slow.

Response:

  1. Check coordinator CPU and network I/O.
  2. If operating at large scale, consider reducing the number of workers per coordinator.
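The sweep latency itself can be checked with a standard histogram quantile, assuming the usual _bucket series:

    histogram_quantile(0.99,
      sum by (le) (rate(varpulis_cluster_health_sweep_duration_seconds_bucket[5m])))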

Infrastructure Alerts

VarpulisHighMemoryUsage

Severity: warning
Condition: Process RSS > 2 GB for 5 minutes
Metric: process_resident_memory_bytes (standard Prometheus metric)

Cause: The Varpulis process is consuming significant memory. Potential causes include large event windows, many concurrent SASE runs, or connector buffer growth.

Response:

  1. Check varpulis_sase_peak_active_runs for unbounded pattern growth.
  2. Review window sizes and aggregation state.
  3. Check stream queue sizes for backlog buildup.
  4. Consider deploying more workers to distribute load.
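Current usage per instance, expressed in GiB, can be read from the standard process metric:

    process_resident_memory_bytes / 1024 / 1024 / 1024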

VarpulisCriticalMemoryUsage

Severity: critical
Condition: Process RSS > 4 GB for 2 minutes
Metric: process_resident_memory_bytes

Cause: Same as VarpulisHighMemoryUsage but at an urgent level.

Response:

  1. Consider restarting the process with state persistence enabled.
  2. Immediately investigate and address unbounded state growth.
  3. Set container memory limits to prevent host-level OOM.

VarpulisDlqGrowing

Severity: warning
Condition: > 100 DLQ events in 10 minutes
Metric: varpulis_dlq_events_total (counter)

Cause: Events are being written to the dead letter queue, indicating that a sink connector is rejecting or failing to deliver events.

Response:

  1. Check which connector is failing from the DLQ file entries.
  2. Verify the downstream system (Kafka, database, HTTP endpoint) is operational.
  3. Check circuit breaker state -- it may be open due to repeated failures.
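The growth can also be broken down by label to identify the failing sink; a connector label on varpulis_dlq_events_total is an assumption here:

    sum by (connector) (increase(varpulis_dlq_events_total[10m]))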

VarpulisDlqCritical

Severity: critical
Condition: > 1,000 DLQ events in 10 minutes
Metric: varpulis_dlq_events_total (counter)

Cause: A sink connector is experiencing sustained failures.

Response:

  1. Immediately investigate the downstream system health.
  2. Check the DLQ file for the error messages attached to each event.
  3. If the downstream system is irrecoverable, consider deploying a fallback sink.
  4. Plan to replay DLQ events once the sink is restored.

VarpulisConnectorUnhealthy

Severity: critical
Condition: Connector health check reports unhealthy for 2 minutes
Metric: varpulis_connector_healthy (gauge, 0 or 1)

Cause: A connector has failed its health check. The circuit breaker may be open, and events are likely being routed to the DLQ.

Response:

  1. Check connector logs for connection errors.
  2. Verify the upstream/downstream system is reachable.
  3. Check credentials and TLS certificates.
  4. The circuit breaker will attempt half-open probes automatically.
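Unhealthy connectors can be listed directly from the gauge:

    varpulis_connector_healthy == 0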

VarpulisSlowDeploys

Severity: warning
Condition: Deploy p99 latency > 30 seconds
Metric: varpulis_cluster_deploy_duration_seconds (histogram)

Cause: Pipeline deployments are slow, affecting responsiveness of hot-reload and new pipeline creation.

Response:

  1. Check worker load and available resources.
  2. Review the size of pipeline state being transferred.
  3. Check network latency between coordinator and workers.
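Deploy latency can be checked with the usual histogram quantile, assuming the standard _bucket series for this metric:

    histogram_quantile(0.99,
      sum by (le) (rate(varpulis_cluster_deploy_duration_seconds_bucket[5m])))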
