# Performance Tuning Guide
Optimize Varpulis for maximum throughput, minimum latency, and efficient resource usage.
## Performance Overview
Varpulis is designed for high-throughput stream processing. Typical performance:
| Configuration | Throughput | Latency (p99) |
|---|---|---|
| Single thread, simple filter | 500K+ events/sec | < 1ms |
| 8 workers, aggregations | 1M+ events/sec | < 5ms |
| Complex patterns, SIMD | 200K+ events/sec | < 10ms |
## Throughput Optimization

### Batch Processing Mode

By default, `varpulis simulate` runs in fast mode: all events are preloaded into memory and processed as fast as possible with no timing delays. This is optimal for offline analytics and benchmarking.
```bash
# Default fast mode (preloads events, no timing delays)
varpulis simulate -p rules.vpl -e data.evt

# Real-time replay with timing delays
varpulis simulate -p rules.vpl -e data.evt --timed

# Line-by-line reading for huge files that don't fit in memory
varpulis simulate -p rules.vpl -e data.evt --streaming
```

### Parallel Workers
Scale across CPU cores:
```bash
# Use 8 workers
varpulis simulate -p rules.vpl -e data.evt --workers 8

# Use all available cores
varpulis simulate -p rules.vpl -e data.evt --workers $(nproc)
```

### Partitioning
Events are distributed across workers by partition key:
```bash
# Partition by device for IoT data
varpulis simulate ... --workers 8 --partition-by device_id

# Partition by user for user activity
varpulis simulate ... --workers 8 --partition-by user_id
```

Best practices:
- Choose a field with many distinct values so load spreads evenly across workers
- Avoid near-unique keys (such as request IDs) that create excessive per-partition state
- Route related events (same user or device) to the same partition so patterns can match across them
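These guarantees follow from stable key hashing: the same key always maps to the same worker. A minimal sketch of the idea (illustrative only; not Varpulis's actual routing code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: route an event to a worker by hashing its
// partition key, so all events with the same key land on one worker.
fn worker_for_key(key: &str, workers: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % workers as u64) as usize
}

fn main() {
    // Events for the same device always map to the same worker.
    assert_eq!(worker_for_key("device-42", 8), worker_for_key("device-42", 8));
    println!("device-42 -> worker {}", worker_for_key("device-42", 8));
}
```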
### Configuration File for Workers

```yaml
processing:
  workers: 8
  partition_by: device_id
```

### Context-Based Parallelism
For multi-core scaling, use named contexts to distribute streams across OS threads:
```
context ingestion (cores: [0, 1])
context analytics (cores: [2, 3])
context alerts (cores: [4])

stream FastFilter = RawEvents
    .context(ingestion)
    .where(value > 0)
    .emit(context: analytics, data: data)

stream HeavyCompute = FastFilter
    .context(analytics)
    .window(1m)
    .aggregate(avg: avg(value), stddev: stddev(value))
```

### Context Tuning Tips
- Separate I/O from compute: Put high-throughput ingestion on its own context
- Filter before crossing: Reduce event volume before cross-context emits
- Match NUMA topology: Pin cooperating contexts to cores on the same NUMA node
- Reserve core 0: System interrupts typically run on core 0
- Monitor per-context: Use Prometheus metrics to identify bottleneck contexts
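Under the hood, pinning a context to cores amounts to setting thread affinity. A minimal sketch using the `core_affinity` crate (an assumed dependency for illustration; not necessarily Varpulis's mechanism):

```rust
// Illustrative thread pinning with the `core_affinity` crate
// (assumed dependency; not necessarily what Varpulis uses internally).
fn main() {
    let cores = core_affinity::get_core_ids().expect("cannot query cores");

    // Pin one worker thread per core, skipping core 0 for the OS.
    let handles: Vec<_> = cores
        .into_iter()
        .skip(1)
        .map(|core| {
            std::thread::spawn(move || {
                core_affinity::set_for_current(core);
                // ... run this context's event loop on the pinned core ...
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```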
For a complete guide, see Context-Based Execution.
## SIMD Optimization
Varpulis uses SIMD (AVX2) for vectorized aggregations on x86_64.
### Automatic SIMD
SIMD is automatically used for:
- `sum()` - 4x parallel f64 addition
- `avg()` - Vectorized sum + count
- `min()`/`max()` - Parallel comparisons
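For reference, here is a sketch of what a 4-lane AVX2 f64 sum looks like with `std::arch` intrinsics (illustrative only; the real code lives in `simd.rs` and may differ):

```rust
// Illustrative AVX2 sum; processes 4 f64 lanes per instruction.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(values: &[f64]) -> f64 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_pd(); // 4 parallel accumulators
    let chunks = values.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(chunk.as_ptr()));
    }
    let mut lanes = [0.0f64; 4];
    _mm256_storeu_pd(lanes.as_mut_ptr(), acc); // horizontal reduce
    lanes.iter().sum::<f64>() + tail.iter().sum::<f64>()
}

pub fn sum(values: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: the AVX2 feature was verified at runtime.
            return unsafe { sum_avx2(values) };
        }
    }
    values.iter().sum() // scalar fallback
}
```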
### Verifying SIMD is Active
```bash
# Check CPU supports AVX2
grep avx2 /proc/cpuinfo

# Run with debug logging to confirm
RUST_LOG=varpulis_runtime::simd=debug varpulis simulate ...
```

### Performance Impact
| Operation | Scalar | SIMD (AVX2) | Improvement |
|---|---|---|---|
| sum(1M values) | 2.5ms | 0.6ms | 4x |
| avg(1M values) | 2.5ms | 0.7ms | 3.5x |
| min/max(1M values) | 2.0ms | 0.5ms | 4x |
### Loop Unrolling Fallback

On systems without AVX2, an automatic fallback in `simd.rs` uses 4-way loop unrolling, which yields roughly a 2x speedup from better instruction-level parallelism on all architectures.
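A minimal sketch of the technique (illustrative; not the `simd.rs` source): four independent accumulators break the dependency chain between additions, letting the CPU keep several in flight.

```rust
// Illustrative 4-way unrolled sum; independent accumulators improve ILP.
fn unrolled_sum(values: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let chunks = values.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        acc[0] += chunk[0];
        acc[1] += chunk[1];
        acc[2] += chunk[2];
        acc[3] += chunk[3];
    }
    acc.iter().sum::<f64>() + tail.iter().sum::<f64>()
}
```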
## Incremental Aggregation
For sliding windows, incremental aggregation avoids recomputing the entire window.
### How It Works

- Traditional: recalculate the sum over all N events in the window on every slide (O(n))
- Incremental: add the new events and subtract the expired ones (O(1) per update)

For example, if a window holds [2, 4, 6] (sum 12) and slides so that 2 expires and 8 arrives, the incremental sum is 12 - 2 + 8 = 18, with one subtraction and one addition instead of re-summing the window.
### Enabling Incremental Mode
Incremental aggregation is automatic for compatible operations:
- `sum()` - Fully incremental
- `count()` - Fully incremental
- `avg()` - Incremental (via sum/count)
### Operations That Require Full Recomputation

- `min()`/`max()` - Must track all values (a heap-based optimization is available)
- `percentile()` - Requires sorted data
- `stddev()` - Can use Welford's algorithm for incremental updates (see the sketch below)
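A hypothetical sketch of Welford's online algorithm (not the Varpulis source): each addition updates the running mean and sum of squared deviations in O(1).

```rust
// Hypothetical Welford accumulator for incremental stddev; additions are
// O(1), and a sliding window can pair this with reverse updates on expiry.
#[derive(Default)]
pub struct Welford {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl Welford {
    pub fn add(&mut self, value: f64) {
        self.count += 1;
        let delta = value - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (value - self.mean);
    }

    pub fn stddev(&self) -> f64 {
        if self.count < 2 {
            0.0
        } else {
            (self.m2 / (self.count - 1) as f64).sqrt()
        }
    }
}
```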
### Implementation Details

```rust
// IncrementalSum in simd.rs (method bodies sketched from the O(1) contract)
pub struct IncrementalSum {
    sum: f64,
    count: usize,
}

impl IncrementalSum {
    pub fn add(&mut self, value: f64) {    // O(1)
        self.sum += value;
        self.count += 1;
    }
    pub fn remove(&mut self, value: f64) { // O(1)
        self.sum -= value;
        self.count -= 1;
    }
    pub fn sum(&self) -> f64 { self.sum }  // O(1)
}
```

## Memory Optimization
### Arc<Event> Sharing

Events are wrapped in `Arc` for efficient sharing:

```rust
pub type SharedEvent = Arc<Event>;
```

Benefits:
- Multiple pattern runs share the same event data
- No deep copying when events enter multiple streams
- Reference counting instead of cloning
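A small sketch of what this buys (illustrative; the `Event` type here is hypothetical):

```rust
use std::sync::Arc;

// Hypothetical Event type for illustration.
struct Event {
    payload: String,
}
type SharedEvent = Arc<Event>;

fn main() {
    let event = SharedEvent::new(Event { payload: "temp=21.5".into() });

    // Fanning out to two streams bumps a reference count;
    // the payload itself is never deep-copied.
    let for_stream_a = Arc::clone(&event);
    let for_stream_b = Arc::clone(&event);
    assert_eq!(Arc::strong_count(&event), 3);
    let _ = (for_stream_a.payload.len(), for_stream_b.payload.len());
}
```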
### Window Memory Management
| Window Type | Memory Usage |
|---|---|
| Tumbling | O(events in current window) |
| Sliding | O(events in window size) |
| Count | O(window size) - fixed |
| Partitioned | O(partitions × window size) |
Recommendations:

- Prefer count windows for bounded memory:

```
// Fixed memory: exactly 1000 events
.window(1000)
```

- Use shorter time windows:

```
// 5 minutes instead of 1 hour = 12x less memory
.window(5m)
```

- Limit partition cardinality:

```
// Bad: unbounded partitions
.partition_by(request_id)

// Good: bounded partitions
.partition_by(region)  // Limited number of regions
```

### ZDD Optimization (Pattern Matching)
Zero-suppressed Decision Diagrams (ZDD) compress Kleene capture combinations during SASE+ pattern matching:
- Exponential compression: 100 matching events → ~100 ZDD nodes instead of 2^100 subsets
- Efficient subset enumeration for match reporting
- Automatic sharing of common substructures
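A toy model of why this compresses so well (illustrative only; not the Varpulis ZDD): when both branches of every node share the same tail, n nodes can represent all 2^n capture subsets.

```rust
use std::rc::Rc;

// Toy ZDD over Kleene-captured events: each node decides whether event
// `var` is in the capture set; `lo` = excluded, `hi` = included.
#[allow(dead_code)]
enum Zdd {
    Top, // terminal: the family containing exactly the empty set
    Node { var: usize, lo: Rc<Zdd>, hi: Rc<Zdd> },
}

// Family of all subsets of events 0..n: n nodes for 2^n subsets,
// because both branches of every node share the same substructure.
fn all_capture_sets(n: usize) -> Rc<Zdd> {
    (0..n).fold(Rc::new(Zdd::Top), |rest, var| {
        Rc::new(Zdd::Node { var, lo: Rc::clone(&rest), hi: rest })
    })
}

fn main() {
    let _sets = all_capture_sets(100); // 100 nodes vs 2^100 enumerated subsets
}
```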
For multi-query trend aggregation optimization, see the Hamlet engine in trend aggregation.
### Monitoring Memory

```bash
# Watch process memory
watch -n 1 'ps aux | grep varpulis | awk "{print \$6/1024 \" MB\"}"'

# With metrics enabled
curl -s localhost:9090/metrics | grep memory
```

## Prometheus Monitoring
### Enabling Metrics

```bash
varpulis server --metrics --metrics-port 9090
```

### Key Metrics to Monitor
| Metric | What to Watch |
|---|---|
| `varpulis_events_processed_total` | Throughput rate |
| `varpulis_event_processing_duration_seconds` | Latency percentiles |
| `varpulis_alerts_generated_total` | Alert rate |
| `varpulis_active_patterns` | Pattern state memory |
### Grafana Dashboard Example

```json
{
  "panels": [
    {
      "title": "Throughput",
      "targets": [{
        "expr": "rate(varpulis_events_processed_total[1m])"
      }]
    },
    {
      "title": "P99 Latency",
      "targets": [{
        "expr": "histogram_quantile(0.99, rate(varpulis_event_processing_duration_seconds_bucket[5m]))"
      }]
    }
  ]
}
```

### Alerting Rules
```yaml
groups:
  - name: varpulis
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(varpulis_event_processing_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Varpulis p99 latency above 100ms"
      - alert: LowThroughput
        expr: rate(varpulis_events_processed_total[5m]) < 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Varpulis throughput below 1000 events/sec"
```

## Query Optimization
### Filter Early
Put selective filters first to reduce data volume:
```
// Good: filter first, then aggregate
stream Optimized = SensorReading
    .where(sensor_type == "temperature")  // Filter first
    .window(1m)
    .aggregate(avg: avg(value))

// Less optimal: aggregate all, then filter
stream LessOptimal = SensorReading
    .window(1m)
    .aggregate(avg: avg(value), type: first(sensor_type))
    .where(type == "temperature")  // Filter after aggregation
```

### Minimize Pattern Complexity
Simpler patterns match faster:
```
// Complex: 4 states, Kleene closure
pattern Complex = A+ -> B -> C+ -> D

// Simpler: can often be split
pattern Step1 = A -> B
pattern Step2 = C -> D
```

### Use Appropriate Timeouts
Shorter timeouts = less state to track:
```
// Keeping state for 24 hours is expensive
pattern Long = A -> B within 24h

// If 5 minutes is sufficient, use that
pattern Short = A -> B within 5m
```

### Avoid Unbounded Kleene
```
// Bad: could match millions of events
pattern Unbounded = Event* -> End

// Better: bounded by timeout
pattern Bounded = Event* -> End within 5m

// Even better: add constraints
pattern Constrained = Event[type == "relevant"]* -> End within 5m
```

## Benchmarking
### Running Benchmarks

```bash
# Run all benchmarks
cargo bench -p varpulis-runtime

# Specific benchmark
cargo bench -p varpulis-runtime pattern_benchmark
```

### Available Benchmarks
| Benchmark | What It Tests |
|---|---|
| `pattern_benchmark` | SASE+ pattern matching throughput |
| `kleene_benchmark` | Kleene closure overhead |
| `comparison_benchmark` | Cross-engine comparisons |
### Creating Custom Benchmarks

```rust
// Uses the nightly-only `test` harness; `generate_test_events` and
// `setup_engine_with_pattern` are placeholder helpers for your own setup.
#![feature(test)]
extern crate test;
use test::Bencher;

#[bench]
fn bench_my_pattern(b: &mut Bencher) {
    let events = generate_test_events(10000);
    let mut engine = setup_engine_with_pattern();
    b.iter(|| {
        for event in &events {
            engine.process(event.clone());
        }
    });
}
```

### Comparing Configurations
```bash
# Baseline
varpulis simulate -p rules.vpl -e large.evt 2>&1 | grep "Event rate"

# With more workers
varpulis simulate -p rules.vpl -e large.evt --workers 8 2>&1 | grep "Event rate"

# With partitioning
varpulis simulate -p rules.vpl -e large.evt --workers 8 --partition-by device_id 2>&1 | grep "Event rate"
```

## System Tuning
### Linux Kernel Parameters

```bash
# Increase file descriptor limits
ulimit -n 65535

# Increase network buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP tuning for high throughput
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```

### NUMA Awareness
For multi-socket systems:
```bash
# Pin to specific NUMA node
numactl --cpunodebind=0 --membind=0 varpulis simulate ...

# Check NUMA topology
numactl --hardware
```

### CPU Pinning
```bash
# Pin entire process to specific cores
taskset -c 0-7 varpulis simulate ... --workers 8
```

For finer-grained control, use VPL contexts with CPU affinity:

```
context ingest (cores: [0, 1])
context analyze (cores: [2, 3])
```

### Memory Limits
```bash
# Set a memory limit (prevents the OOM killer from taking out other processes)
ulimit -v 8000000  # 8GB virtual memory limit

# Or use cgroups
cgcreate -g memory:varpulis
cgset -r memory.limit_in_bytes=8G varpulis
cgexec -g memory:varpulis varpulis simulate ...
```

## Performance Checklist
### Before Production

- [ ] Consider using contexts for multi-core parallelism
- [ ] Run with `--workers` matching CPU cores
- [ ] Enable `--partition-by` for parallel streams
- [ ] Set appropriate window timeouts
- [ ] Enable metrics monitoring
- [ ] Set memory limits
- [ ] Configure logging level (info, not debug/trace)
### During Load Testing
- [ ] Monitor CPU utilization (should be high but not 100%)
- [ ] Check memory growth rate
- [ ] Measure p99 latency
- [ ] Verify alert generation rate
- [ ] Test failover/restart scenarios
### In Production
- [ ] Set up Prometheus scraping
- [ ] Configure alerting rules
- [ ] Monitor disk space (if using persistence)
- [ ] Review logs for warnings
- [ ] Track throughput trends
## Troubleshooting Performance

### Low Throughput
- Check CPU utilization - if low, add more workers
- Look for blocking operations (HTTP webhooks)
- Simplify complex patterns
- Verify SIMD is being used
### High Latency
- Reduce window sizes
- Add more partitions
- Check for memory pressure
- Profile with `perf` or flamegraph
### Memory Growth

- Add timeouts to patterns
- Use shorter window durations or count-based windows
- Limit partition cardinality
## See Also
- Context-Based Execution - Multi-threaded contexts with CPU affinity
- Configuration Guide - Worker and partitioning config
- Troubleshooting - Debug performance issues
- Windows & Aggregations - SIMD details