Path Segment Architecture
Overview
Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.
The Problem
WAF systems generate millions of request events. Storing full URL paths like /api/v1/users/123/posts repeatedly wastes storage and makes pattern-based queries inefficient.
Traditional approaches:
- Full path storage: High redundancy, large database size
- String pattern matching with LIKE: No index support, slow queries
- Full-Text Search (FTS): Complex setup, overkill for structured paths
Our Solution: Path Segment Normalization
Architecture Components
Request: /api/v1/users/123/posts
↓
Decompose into segments: ["api", "v1", "users", "123", "posts"]
↓
Normalize to IDs: [1, 2, 3, 4, 5]
↓
Store as JSON array: "[1,2,3,4,5]"
Database Schema
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
t.string :segment, null: false, index: { unique: true }
t.integer :usage_count, default: 1, null: false
t.datetime :first_seen_at, null: false
t.timestamps
end
# events table - references segments by ID
create_table :events do |t|
t.string :request_segment_ids # JSON array: "[1,2,3]"
t.string :request_path # Original path for display
# ... other fields
end
# Critical index for fast lookups
add_index :events, :request_segment_ids
Models
PathSegment - The segment dictionary:
class PathSegment < ApplicationRecord
validates :segment, presence: true, uniqueness: true
validates :usage_count, presence: true, numericality: { greater_than: 0 }
def self.find_or_create_segment(segment)
find_or_create_by(segment: segment) do |path_segment|
path_segment.usage_count = 1
path_segment.first_seen_at = Time.current
end
end
def increment_usage!
increment!(:usage_count)
end
end
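A short console sketch of how the dictionary behaves (assuming a fresh path_segments table):
seg = PathSegment.find_or_create_segment("api")   # creates the row, usage_count = 1
PathSegment.find_or_create_segment("api") == seg  # => true, returns the existing row
seg.increment_usage!                              # usage_count => 2 on a repeat sighting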
Event - Stores segment IDs as JSON array:
class Event < ApplicationRecord
serialize :request_segment_ids, type: Array, coder: JSON
# Path reconstruction helper
def reconstructed_path
return request_path if request_segment_ids.blank?
segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
'/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
end
def path_depth
request_segment_ids&.length || 0
end
end
The Indexing Strategy
Why Standard LIKE Doesn't Work
SQLite's B-tree indexes only work with LIKE when the pattern is a simple alphanumeric prefix:
-- ✅ Uses index (alphanumeric prefix)
WHERE column LIKE 'api%'
-- ❌ Full table scan (starts with '[')
WHERE request_segment_ids LIKE '[1,2,%'
The Solution: Range Queries on Lexicographic Sort
JSON arrays sort lexicographically in SQLite:
"[1,2]" (exact match)
"[1,2,3]" (prefix match - has [1,2] as start)
"[1,2,4]" (prefix match - has [1,2] as start)
"[1,2,99]" (prefix match - has [1,2] as start)
"[1,3]" (out of range - different prefix)
To find all paths starting with [1,2]:
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,2-')
The range >= '[1,2,' AND < '[1,2-' captures exactly the arrays that continue past [1,2]: every true continuation begins with the characters "[1,2,", and '-' is the next ASCII character after ','. A looser upper bound such as '[1,3]' would also sweep in false positives like [1,20,5], whose second segment merely shares a leading digit with 2.
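These bounds can be sanity-checked from plain Ruby, since byte-wise String comparison matches SQLite's default BINARY collation for ASCII text:
"[1,2,3]" >= "[1,2,"   # => true   (continuation of the prefix, inside the range)
"[1,2,3]" <  "[1,2-"   # => true   (',' sorts just below '-')
"[1,20]"  <  "[1,2-"   # => false  (digit continuation of a different segment, excluded)
"[1,2]"   <  "[1,2-"   # => false  (']' sorts above '-', hence the explicit exact-match branch)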
Query Performance
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
Both branches use the B-tree index = O(log n) lookups!
Implementation: with_path_prefix Scope
scope :with_path_prefix, ->(prefix_segment_ids) {
return none if prefix_segment_ids.blank?
# Convert [1, 2] to its JSON string form: "[1,2]"
prefix_str = prefix_segment_ids.to_json
# Lower bound for prefix continuations: "[1,2,"
lower_str = "#{prefix_str[0..-2]},"
# Upper bound: the same string with the trailing ',' bumped to '-' (the next
# ASCII character), so the range holds exactly the strings that begin
# with "[1,2," and nothing else: "[1,2-"
upper_str = "#{prefix_str[0..-2]}-"
# Exact match plus range query - both branches use the B-tree index
where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
prefix_str, lower_str, upper_str)
}
Usage Examples
Basic Prefix Search
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
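In controllers or rake tasks it can be handy to wrap the segment lookups in a small helper that accepts a raw path string. prefix_ids_for below is a sketch and not part of the documented API:
# Hypothetical helper: map "/api/v1" to its segment IDs, or nil if any segment is unknown
def prefix_ids_for(path)
  names = path.split('/').reject(&:blank?)
  segments = PathSegment.where(segment: names).index_by(&:segment)
  ids = names.map { |name| segments[name]&.id }
  ids.all? ? ids : nil
end
ids = prefix_ids_for('/api/v1')
events = ids ? Event.with_path_prefix(ids) : Event.none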
Combined with Other Filters
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')
Event.where(ip_address: '192.168.1.100')
.where(waf_action: :deny)
.with_path_prefix([admin_seg.id])
Using Composite Index
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')
Event.where(request_host_id: host.id, request_method: :post)
.with_path_prefix([api_seg.id])
Exact Path Match
# Find exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
Event.where("request_segment_ids = ?", [api_seg.id, v1_seg.id].to_json) # compare against the stored JSON string directly
Path Reconstruction for Display
events = Event.with_path_prefix([api_seg.id]).limit(10)
events.each do |event|
puts "#{event.reconstructed_path} - #{event.waf_action}"
# => /api/v1/users - allow
# => /api/v1/posts - deny
end
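Note that reconstructed_path issues one PathSegment query per event, so loops like the one above are N+1. For listings, a batched variant along these lines (a sketch, not part of the documented models) loads every referenced segment once:
class Event < ApplicationRecord
  # Returns { event => "/api/v1/users", ... } with a single PathSegment query
  def self.reconstruct_paths(events)
    ids_by_event = events.index_with { |e| e.request_segment_ids || [] }
    segments = PathSegment.where(id: ids_by_event.values.flatten.uniq).index_by(&:id)
    ids_by_event.transform_values do |ids|
      '/' + ids.map { |id| segments[id]&.segment }.compact.join('/')
    end
  end
end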
Performance Characteristics
| Operation | Index Used | Complexity | Notes |
|---|---|---|---|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |
Indexes in Schema
# Single-column index for path queries
add_index :events, :request_segment_ids
# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
name: 'idx_events_host_method_path'
Storage Efficiency
Compression Benefits
Example: /api/v1/users (13 bytes) appears in 100,000 events
Without normalization:
100,000 events × 13 bytes = 1,300,000 bytes (~1.3 MB)
With normalization:
3 dictionary rows × ~10 bytes (avg) ≈ 30 bytes
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (~700 KB)
Total: ~700,030 bytes (~700 KB)
Savings: roughly 46% reduction, and the gap widens for longer, deeper paths
Plus benefits:
- Usage tracking: usage_count shows hot paths
- Analytics: Easy to identify common path patterns
- Flexibility: Can query at the segment level (see the json_each sketch below)
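As an example of that segment-level flexibility, SQLite's json_each table-valued function can test whether a single segment ID appears anywhere in the array. As with the JSON functions discussed later, this cannot use the B-tree index, so treat it as an analytics query:
admin_id = PathSegment.find_by(segment: 'admin')&.id
# Events whose path contains the 'admin' segment at any depth (full scan)
Event.where(
  "EXISTS (SELECT 1 FROM json_each(events.request_segment_ids) WHERE json_each.value = ?)",
  admin_id
)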
Normalization Process
Event Creation Flow
# 1. Event arrives with full path
payload = {
"request" => { "path" => "/api/v1/users/123" }
}
# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"
# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)
# 4. Path is decomposed into segments
segments = ["/api/v1/users/123"].split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]
# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
path_segment = PathSegment.find_or_create_segment(segment)
path_segment.increment_usage! unless path_segment.previously_new_record? # just-created rows already start at usage_count 1
path_segment.id
end
# => [1, 2, 3, 4]
# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
EventNormalizer Service
class EventNormalizer
def normalize_path_segments
segments = @event.path_segments_array
return if segments.empty?
segment_ids = segments.map do |segment|
path_segment = PathSegment.find_or_create_segment(segment)
path_segment.increment_usage! unless path_segment.previously_new_record?
path_segment.id
end
# Store as array - serialize will handle JSON encoding
@event.request_segment_ids = segment_ids
end
end
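The path_segments_array helper that the normalizer calls is not shown in this document; based on the decomposition in step 4 above, it presumably looks something like this sketch:
class Event < ApplicationRecord
  # "/api/v1/users/123" => ["api", "v1", "users", "123"]
  def path_segments_array
    request_path.to_s.split('/').reject(&:blank?)
  end
end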
Important: JSON Functions and Performance
❌ Avoid in WHERE Clauses
JSON functions like json_array_length() cannot use indexes:
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)
# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
✅ Use for Analytics (Async)
JSON functions are fine for analytics queries run in background jobs:
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
def perform(project_id)
# This is OK in async context
stats = Event.where(project_id: project_id)
.select("json_array_length(request_segment_ids) as depth, COUNT(*) as count")
.group("depth")
.order(:depth)
# Store results for dashboard
PathDepthStats.create!(project_id: project_id, data: stats)
end
end
Edge Cases and Considerations
Empty Paths
request_path = "/"
segments = [] # Empty after split and reject
request_segment_ids = [] # Empty array
# Stored as: "[]"
Trailing Slashes
"/api/v1/" == "/api/v1" # Both normalize to ["api", "v1"]
Special Characters in Segments
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
Consider normalizing query params separately if needed.
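If query strings can end up in request_path, one option is to strip them before segmentation, along these lines (a sketch; raw_path is a placeholder for whatever value the ingest pipeline provides):
path_only, _query = raw_path.split('?', 2)
segments = path_only.split('/').reject(&:blank?)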
Very Deep Paths
Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics (one-line sketch below)
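If truncation is adopted, it can be a one-line cap during normalization; MAX_ANALYTICS_SEGMENTS below is an assumed constant, not something from the codebase:
MAX_ANALYTICS_SEGMENTS = 20 # assumed cap; tune for real traffic
segment_ids = segment_ids.first(MAX_ANALYTICS_SEGMENTS)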
Analytics Use Cases
Most Common Paths
# Top 10 most accessed paths
Event.group(:request_segment_ids)
.order('COUNT(*) DESC')
.limit(10)
.count
.map { |seg_ids, count|
# Group keys may come back as raw JSON strings or deserialized arrays
ids = seg_ids.is_a?(String) ? JSON.parse(seg_ids) : seg_ids
# index_by preserves the path's segment order; pluck would return DB order
segments = PathSegment.where(id: ids).index_by(&:id)
["/#{ids.map { |id| segments[id]&.segment }.compact.join('/')}", count]
}
Hot Path Segments
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
Attack Pattern Detection
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
.select { |e| e.path_depth > 10 }
.group_by { |e| e.request_segment_ids.first }
Path-Based Rule Generation
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
.where('created_at > ?', 1.hour.ago)
.group(:request_segment_ids)
.having('COUNT(*) > ?', 100)
.pluck(:request_segment_ids)
suspicious_paths.each do |seg_ids|
# TODO: Implement rule creation for blocking path segments
# Rule.create!(rule_type: 'path_pattern', conditions: { patterns: seg_ids }, action: 'deny')
end
Future Optimizations
Phase 2 Considerations
If performance becomes critical:
- Materialized Path Column: Pre-compute common prefix patterns
- Trie Data Structure: In-memory trie for ultra-fast prefix matching
- Redis Cache: Cache hot path lookups
- Partial Indexes: Index only blocked/challenged events
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
where: "waf_action IN ('deny', 'challenge')",
name: 'idx_events_blocked_paths'
Storage Considerations
For very large deployments (100M+ events):
- Archive old events: Move to a separate table (see the sketch below)
- Aggregate path stats: Pre-compute daily/hourly summaries
- Compact the JSON: the JSON1 extension does not compress data itself, so consider a more compact encoding or application-level compression for archived rows
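A sketch of the archiving idea above, assuming an events_archive table with the same columns as events (neither the table nor this job is part of the documented schema):
class ArchiveOldEventsJob < ApplicationJob
  def perform(cutoff = 90.days.ago)
    Event.transaction do
      quoted_cutoff = ActiveRecord::Base.connection.quote(cutoff)
      # Copy old rows, then remove them from the hot table
      ActiveRecord::Base.connection.execute(
        "INSERT INTO events_archive SELECT * FROM events WHERE created_at < #{quoted_cutoff}"
      )
      Event.where("created_at < ?", cutoff).delete_all
    end
  end
end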
Testing
Test Index Usage
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")
# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
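The same check can live in the test suite as an assertion. A Minitest-style sketch (the row format assumes the SQLite3 adapter's hash results):
class PathPrefixIndexTest < ActiveSupport::TestCase
  test "with_path_prefix uses the segment-ids index" do
    sql  = Event.with_path_prefix([1, 2]).to_sql
    plan = ActiveRecord::Base.connection
                             .execute("EXPLAIN QUERY PLAN #{sql}")
                             .map { |row| row["detail"] }.join("\n")
    assert_includes plan, "index_events_on_request_segment_ids"
  end
end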
Benchmark Queries
require 'benchmark'
prefix_ids = [1, 2]
# Test indexed range query
Benchmark.bm do |x|
x.report("Indexed range:") {
Event.with_path_prefix(prefix_ids).count
}
x.report("LIKE query:") {
Event.where("request_segment_ids LIKE ?", "[1,2,%").count
}
end
# Range query should be 10-100x faster
Conclusion
Path segment normalization with JSON array storage provides:
- ✅ Significant storage savings (around 46% in the worked example above, more for deeper paths)
- ✅ Fast prefix queries using standard B-tree indexes
- ✅ Analytics-friendly with usage tracking and pattern detection
- ✅ Rails-native using built-in serialization
- ✅ Scalable to millions of events with O(log n) lookups
The key insight: Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently, avoiding the need for complex full-text search or custom indexing strategies.
Related Documentation:
- Event Ingestion (TODO)
- WAF Rule Engine (TODO)
- Analytics Architecture (TODO)