# Varpulis Alerting Reference
This document describes the Prometheus alerting rules defined in `deploy/prometheus/alerts.yml` and provides guidance on investigating and resolving each alert.
## Engine Alerts
### VarpulisHighProcessingLatency
| Field | Value |
|---|---|
| Severity | warning |
| Condition | p99 processing latency > 100ms for 5 minutes |
| Metric | varpulis_processing_latency_seconds (histogram) |
Cause: A stream's event processing pipeline is taking longer than expected. Common causes include expensive `.where()` predicates, large aggregation windows, or high Kleene closure fan-out.
Response:
- Identify the affected stream from the `stream` label.
- Check if the stream has unbounded Kleene patterns -- add `.within()` constraints.
- Review aggregation window sizes -- consider reducing window duration.
- Check system CPU and I/O load on the worker hosting the stream.
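When tuning this threshold, note that a p99 condition over a Prometheus histogram is typically expressed with `histogram_quantile` over the `_bucket` series. A sketch of the rule shape (the `stream` grouping label is assumed from the response steps; the exact expression in `alerts.yml` may differ):

```yaml
- alert: VarpulisHighProcessingLatency
  expr: |
    histogram_quantile(0.99,
      sum by (le, stream) (rate(varpulis_processing_latency_seconds_bucket[5m]))
    ) > 0.1
  for: 5m
  labels:
    severity: warning
```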
### VarpulisHighErrorRate
| Field | Value |
|---|---|
| Severity | warning |
| Condition | > 1% of events failing processing for 5 minutes |
| Metrics | varpulis_events_total, varpulis_events_processed |
Cause: Events are being received but not successfully processed. This can indicate schema mismatches, runtime evaluation errors, or connector delivery failures.
Response:
- Check application logs for processing errors.
- Verify event schemas match the VPL event type definitions.
- Check if a recent pipeline deployment introduced a bug.
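The failure ratio is derived from the two counters above: failed events are received-but-not-processed. A sketch of the rule shape (assumes the two series carry matching labels; the actual expression may differ):

```yaml
- alert: VarpulisHighErrorRate
  expr: |
    (
      rate(varpulis_events_total[5m]) - rate(varpulis_events_processed[5m])
    ) / rate(varpulis_events_total[5m]) > 0.01
  for: 5m
  labels:
    severity: warning
```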
### VarpulisCriticalErrorRate
| Field | Value |
|---|---|
| Severity | critical |
| Condition | > 5% of events failing processing for 5 minutes |
| Metrics | varpulis_events_total, varpulis_events_processed |
Cause: Same as VarpulisHighErrorRate but at a severity requiring immediate action.
Response:
- Consider rolling back the most recent pipeline deployment.
- Check dead letter queue files for patterns in the failing events.
- Escalate if the root cause is not immediately apparent.
### VarpulisStreamQueueBacklog
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Stream queue size > 10,000 events for 5 minutes |
| Metric | varpulis_stream_queue_size (gauge) |
Cause: The engine is not consuming events as fast as they arrive. Backpressure is building up in the stream's input queue.
Response:
- Check processing latency for the affected stream.
- Consider adding more workers and partitioning the stream with `.partition_by()`.
- Verify upstream sources are not sending burst traffic.
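Since the metric is a plain gauge, the alert condition reduces to a simple threshold held for the duration. A sketch (the actual rule in `alerts.yml` may differ):

```yaml
- alert: VarpulisStreamQueueBacklog
  expr: varpulis_stream_queue_size > 10000
  for: 5m
  labels:
    severity: warning
```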
### VarpulisSaseRunBacklog
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Peak active SASE runs > 10,000 for 5 minutes |
| Metric | varpulis_sase_peak_active_runs (gauge) |
Cause: The SASE pattern engine has accumulated a large number of concurrent partial matches. This typically happens with broad patterns that lack window constraints.
Response:
- Add or tighten `.within()` time constraints on sequence patterns.
- Add more selective `.where()` predicates to reduce match candidates.
- If using Kleene closures, verify `max_kleene_events` is appropriately bounded.
### VarpulisNoEventsReceived
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Zero events received for 10 minutes |
| Metric | varpulis_events_total (counter) |
Cause: No events are arriving at the engine. Upstream sources or connectors may be down.
Response:
- Check connector health status.
- Verify MQTT broker or Kafka cluster availability.
- Check network connectivity between Varpulis and the event source.
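A sketch of the rule shape. Note a Prometheus caveat: if the process stops being scraped entirely, the series disappears and this expression returns nothing, so a companion `absent()` rule is needed to catch that case:

```yaml
- alert: VarpulisNoEventsReceived
  expr: rate(varpulis_events_total[10m]) == 0
  labels:
    severity: warning

# Companion rule for the metric vanishing altogether (scrape target down).
- alert: VarpulisEventsMetricAbsent
  expr: absent(varpulis_events_total)
  for: 10m
  labels:
    severity: warning
```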
## Cluster Alerts
### VarpulisWorkerUnhealthy
| Field | Value |
|---|---|
| Severity | critical |
| Condition | Any workers in "unhealthy" status for 2 minutes |
| Metric | varpulis_cluster_workers_total{status="unhealthy"} |
Cause: A worker has missed heartbeat deadlines. It may have crashed, lost network connectivity, or be under extreme resource pressure.
Response:
- Check if the worker process is still running.
- Verify network connectivity between coordinator and worker.
- Check worker host for resource exhaustion (CPU, memory, disk).
- Pipelines will be auto-migrated; verify migration completes.
### VarpulisNoReadyWorkers
| Field | Value |
|---|---|
| Severity | critical |
| Condition | Zero ready workers for 1 minute |
| Metric | varpulis_cluster_workers_total{status="ready"} |
Cause: All workers are either down, draining, or unhealthy. No pipelines can be executed.
Response:
- Immediately check all worker processes and hosts.
- Check coordinator logs for mass worker deregistration.
- Verify no infrastructure-wide issue (network partition, DNS failure).
### VarpulisRaftLeaderChurn
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Raft role changes > 4 times in 15 minutes |
| Metric | varpulis_cluster_raft_role |
Cause: The Raft consensus protocol is unable to maintain stable leadership. This causes cluster operations to stall during elections.
Response:
- Check network latency and packet loss between cluster nodes.
- Verify system clocks are synchronized (NTP).
- Ensure the cluster has an odd number of nodes (3 or 5 recommended).
- Check if any node is under extreme CPU or I/O pressure.
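Role flapping on a gauge is naturally counted with PromQL's `changes()` function, which tallies how many times the sample value changed within the range. A sketch of the rule shape (the exact expression in `alerts.yml` may differ):

```yaml
- alert: VarpulisRaftLeaderChurn
  expr: changes(varpulis_cluster_raft_role[15m]) > 4
  labels:
    severity: warning
```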
### VarpulisRaftTermAdvancing
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Raft term increases > 5 in 10 minutes |
| Metric | varpulis_cluster_raft_term |
Cause: Closely related to leader churn. Rapidly advancing terms indicate repeated failed elections.
Response: Same as VarpulisRaftLeaderChurn.
### VarpulisMigrationFailures
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Any migration failures in 10 minutes |
| Metric | varpulis_cluster_migrations_total{result="failure"} |
Cause: Pipeline state migration between workers is failing. This can happen when target workers are unreachable or have insufficient resources.
Response:
- Check coordinator logs for migration error details.
- Verify the target worker has capacity for additional pipelines.
- Check if state checkpoints are corrupted or too large to transfer.
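A sketch of the rule shape, counting failures over the window with `increase()` on the labeled counter (labels taken from the table above; the actual rule may differ):

```yaml
- alert: VarpulisMigrationFailures
  expr: increase(varpulis_cluster_migrations_total{result="failure"}[10m]) > 0
  labels:
    severity: warning
```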
### VarpulisDeploymentFailures
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Any deployment failures in 10 minutes |
| Metric | varpulis_cluster_deploy_duration_seconds_count{result="failure"} |
Cause: Pipeline group deployments are failing on workers.
Response:
- Validate VPL source with the `/api/v1/cluster/validate` endpoint.
- Check connector configurations (broker addresses, credentials).
- Verify workers have the required feature flags enabled.
### VarpulisSlowHealthSweeps
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Health sweep p99 > 50ms |
| Metric | varpulis_cluster_health_sweep_duration_seconds (histogram) |
Cause: The coordinator's periodic health check of all workers is slow.
Response:
- Check coordinator CPU and network I/O.
- Reduce the number of workers per coordinator when operating at large scale.
## Infrastructure Alerts
### VarpulisHighMemoryUsage
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Process RSS > 2 GB for 5 minutes |
| Metric | process_resident_memory_bytes (standard Prometheus metric) |
Cause: The Varpulis process is consuming significant memory. Potential causes include large event windows, many concurrent SASE runs, or connector buffer growth.
Response:
- Check `varpulis_sase_peak_active_runs` for unbounded pattern growth.
- Review window sizes and aggregation state.
- Check stream queue sizes for backlog buildup.
- Consider deploying more workers to distribute load.
### VarpulisCriticalMemoryUsage
| Field | Value |
|---|---|
| Severity | critical |
| Condition | Process RSS > 4 GB for 2 minutes |
| Metric | process_resident_memory_bytes |
Cause: Same as VarpulisHighMemoryUsage but at an urgent level.
Response:
- Consider restarting the process with state persistence enabled.
- Immediately investigate and address unbounded state growth.
- Set container memory limits to prevent host-level OOM.
### VarpulisDlqGrowing
| Field | Value |
|---|---|
| Severity | warning |
| Condition | > 100 DLQ events in 10 minutes |
| Metric | varpulis_dlq_events_total (counter) |
Cause: Events are being written to the dead letter queue, indicating that a sink connector is rejecting or failing to deliver events.
Response:
- Check which connector is failing from the DLQ file entries.
- Verify the downstream system (Kafka, database, HTTP endpoint) is operational.
- Check circuit breaker state -- it may be open due to repeated failures.
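A sketch of the rule shape, using `increase()` over the counter with the threshold from the table above (the actual expression may differ):

```yaml
- alert: VarpulisDlqGrowing
  expr: increase(varpulis_dlq_events_total[10m]) > 100
  labels:
    severity: warning
```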
### VarpulisDlqCritical
| Field | Value |
|---|---|
| Severity | critical |
| Condition | > 1,000 DLQ events in 10 minutes |
| Metric | varpulis_dlq_events_total (counter) |
Cause: A sink connector is experiencing sustained failures.
Response:
- Immediately investigate the downstream system health.
- Check the DLQ file for the error messages attached to each event.
- If the downstream system is irrecoverable, consider deploying a fallback sink.
- Plan to replay DLQ events once the sink is restored.
### VarpulisConnectorUnhealthy
| Field | Value |
|---|---|
| Severity | critical |
| Condition | Connector health check reports unhealthy for 2 minutes |
| Metric | varpulis_connector_healthy (gauge, 0 or 1) |
Cause: A connector has failed its health check. The circuit breaker may be open, and events are likely being routed to the DLQ.
Response:
- Check connector logs for connection errors.
- Verify the upstream/downstream system is reachable.
- Check credentials and TLS certificates.
- The circuit breaker will attempt half-open probes automatically.
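Because the metric is a 0/1 gauge, the rule is an equality check held for the grace period. A sketch (the actual contents of `alerts.yml` may differ):

```yaml
- alert: VarpulisConnectorUnhealthy
  expr: varpulis_connector_healthy == 0
  for: 2m
  labels:
    severity: critical
```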
### VarpulisSlowDeploys
| Field | Value |
|---|---|
| Severity | warning |
| Condition | Deploy p99 latency > 30 seconds |
| Metric | varpulis_cluster_deploy_duration_seconds (histogram) |
Cause: Pipeline deployments are slow, affecting responsiveness of hot-reload and new pipeline creation.
Response:
- Check worker load and available resources.
- Review the size of pipeline state being transferred.
- Check network latency between coordinator and workers.