# Path Segment Architecture

## Overview

Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.

## The Problem

WAF systems generate millions of request events. Storing full URL paths like `/api/v1/users/123/posts` repeatedly wastes storage and makes pattern-based queries inefficient.

Traditional approaches:
- **Full path storage**: High redundancy, large database size
- **String pattern matching with LIKE**: Often cannot use an index, slow queries
- **Full-Text Search (FTS)**: Complex setup, overkill for structured paths

## Our Solution: Path Segment Normalization

### Architecture Components

```
Request: /api/v1/users/123/posts
    ↓
Decompose into segments: ["api", "v1", "users", "123", "posts"]
    ↓
Normalize to IDs: [1, 2, 3, 4, 5]
    ↓
Store as JSON array: "[1,2,3,4,5]"
```

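The pipeline above can be tried out without a database. The following is a minimal sketch, assuming an in-memory Hash as a stand-in for the `path_segments` table; `SegmentDictionary`, `decompose`, and `normalize` are hypothetical names used only for illustration.

```ruby
require 'json'

# In-memory stand-in for the path_segments table (segment string => integer ID).
class SegmentDictionary
  def initialize
    @ids = {}
  end

  # Return the existing ID for a segment, or assign the next free one.
  def id_for(segment)
    @ids[segment] ||= @ids.size + 1
  end
end

# Split a request path into its non-empty segments.
def decompose(path)
  path.split('/').reject(&:empty?)
end

# Map a path to the JSON string that would be stored in request_segment_ids.
def normalize(path, dictionary)
  ids = decompose(path).map { |segment| dictionary.id_for(segment) }
  JSON.generate(ids)
end

dictionary = SegmentDictionary.new
first  = normalize('/api/v1/users/123/posts', dictionary)
second = normalize('/api/v1/users', dictionary)
puts first   # "[1,2,3,4,5]"
puts second  # "[1,2,3]" (segments seen before reuse their IDs)
```

The second call shows the deduplication at work: no new dictionary entries are created for a path whose segments are already known.
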
### Database Schema

```ruby
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
  t.string :segment, null: false, index: { unique: true }
  t.integer :usage_count, default: 1, null: false
  t.datetime :first_seen_at, null: false
  t.timestamps
end

# events table - references segments by ID
create_table :events do |t|
  t.string :request_segment_ids  # JSON array: "[1,2,3]"
  t.string :request_path         # Original path for display
  # ... other fields
end

# Critical index for fast lookups
add_index :events, :request_segment_ids
```

### Models

**PathSegment** - The segment dictionary:
```ruby
class PathSegment < ApplicationRecord
  validates :segment, presence: true, uniqueness: true
  validates :usage_count, presence: true, numericality: { greater_than: 0 }

  def self.find_or_create_segment(segment)
    find_or_create_by(segment: segment) do |path_segment|
      path_segment.usage_count = 1
      path_segment.first_seen_at = Time.current
    end
  end

  def increment_usage!
    increment!(:usage_count)
  end
end
```

**Event** - Stores segment IDs as JSON array:
```ruby
class Event < ApplicationRecord
  serialize :request_segment_ids, type: Array, coder: JSON

  # Path reconstruction helper
  def reconstructed_path
    return request_path if request_segment_ids.blank?

    segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
    '/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
  end

  def path_depth
    request_segment_ids&.length || 0
  end
end
```

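## The Indexing Strategy

### Why Standard LIKE Doesn't Work

SQLite can use a B-tree index for `LIKE` only when its LIKE optimization applies: the pattern must start with a literal prefix (no leading wildcard), and the index collation must match `LIKE`'s case sensitivity (`LIKE` is case-insensitive by default, while a default index uses `BINARY` collation). In this schema, the prefix pattern on the JSON column falls back to a full table scan:

```sql
-- ✅ Can use an index when the LIKE optimization applies (literal prefix)
WHERE column LIKE 'api%'

-- ❌ Full table scan in this schema
WHERE request_segment_ids LIKE '[1,2,%'
```
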
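### The Solution: Range Queries on Lexicographic Sort

The serialized arrays are stored as TEXT, which SQLite compares byte by byte under the default `BINARY` collation. Because `,` sorts before every digit and before `]`, all arrays that extend `[1,2]` are contiguous in the index:

```
"[1,2,3]"   (prefix match - starts with [1,2,)
"[1,2,4]"   (prefix match - starts with [1,2,)
"[1,2,99]"  (prefix match - starts with [1,2,)
"[1,20]"    (out of range - second segment is 20, not 2)
"[1,2]"     (exact match)
"[1,3]"     (out of range - different prefix)
```

To find all paths starting with `[1,2]`:
```sql
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
   OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,2-')
```

The range `>= '[1,2,' AND < '[1,2-'` captures exactly the arrays starting with `[1,2,`: `-` is the byte immediately after `,`, so every string with the prefix `[1,2,` sorts below `[1,2-`. An upper bound built by incrementing the last ID (`'[1,3]'`) would also match `[1,20]`, and would produce an empty range for a prefix such as `[1,9]`, because `'[1,10]'` sorts before `'[1,9,'`.
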
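The range logic can be checked on plain Ruby strings, since `String#<=>` compares byte by byte in the same way as SQLite's `BINARY` collation. The sketch below is illustrative only: `prefix_bounds` and `prefix_match?` are hypothetical helpers, and the upper bound `'[1,2-'` (the byte after `,`) is an assumption about how the bound should be built so that multi-digit IDs such as `[1,20]` are excluded.

```ruby
require 'json'

# Bounds for "paths strictly longer than the prefix": every such string
# starts with the prefix JSON minus its closing bracket, plus a comma.
def prefix_bounds(prefix_ids)
  stem = JSON.generate(prefix_ids)[0..-2] # "[1,2]" -> "[1,2"
  lower = "#{stem},"                      # "[1,2,"
  upper = "#{stem}-"                      # "-" is the byte right after ","
  [lower, upper]
end

# The same predicate the SQL expresses, evaluated on Ruby strings.
def prefix_match?(stored, prefix_ids)
  lower, upper = prefix_bounds(prefix_ids)
  stored == JSON.generate(prefix_ids) || (stored >= lower && stored < upper)
end

rows = ['[1,2]', '[1,2,3]', '[1,2,99]', '[1,20]', '[1,3]', '[1,10,4]']
matches = rows.select { |row| prefix_match?(row, [1, 2]) }
puts matches.inspect # ["[1,2]", "[1,2,3]", "[1,2,99]"]
```

`[1,20]` and `[1,10,4]` are rejected even though they share leading characters with the prefix, which is the case a bound based on incrementing the last ID gets wrong.
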
### Query Performance

```
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
```

Both branches use the B-tree index, so each is an O(log n) seek followed by a scan of only the matching rows.

### Implementation: with_path_prefix Scope

```ruby
scope :with_path_prefix, ->(prefix_segment_ids) {
  return none if prefix_segment_ids.blank?

  # Convert [1, 2] to JSON string "[1,2]"
  prefix_str = prefix_segment_ids.to_json

  # Lower bound for prefix matches: "[1,2,"
  lower_prefix_str = "#{prefix_str[0..-2]},"

  # Upper bound: replace the trailing "," with "-" (the next byte), giving
  # "[1,2-". Every string starting with "[1,2," sorts below it. Incrementing
  # the last ID instead ("[1,3]") would also match "[1,20]" and would yield
  # an empty range for prefixes such as [1, 9].
  upper_str = "#{prefix_str[0..-2]}-"

  # Range query that uses B-tree index
  where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
        prefix_str, lower_prefix_str, upper_str)
}
```

## Usage Examples

### Basic Prefix Search

```ruby
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
```

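`find_by` returns `nil` for a segment that has never been recorded, so calling `.id` on the result raises `NoMethodError`. One way to guard against that is to resolve the whole prefix first and treat an unknown segment as "no possible matches". The sketch below assumes a plain Hash in place of `PathSegment` lookups; `prefix_ids_for` is a hypothetical helper.

```ruby
# Resolve a path prefix such as "/api/v1" to segment IDs.
# `dictionary` stands in for PathSegment lookups (segment string => ID).
# Returns nil when any segment has never been seen, in which case no stored
# event can match the prefix.
def prefix_ids_for(path, dictionary)
  ids = path.split('/').reject(&:empty?).map { |segment| dictionary[segment] }
  ids.any?(&:nil?) ? nil : ids
end

dictionary = { 'api' => 1, 'v1' => 2, 'users' => 3 }
known   = prefix_ids_for('/api/v1', dictionary)
unknown = prefix_ids_for('/api/v9', dictionary)
puts known.inspect   # [1, 2]
puts unknown.inspect # nil
```

A caller could then skip the query entirely (or return `Event.none`) when the result is `nil`.
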
### Combined with Other Filters

```ruby
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')

Event.where(ip_address: '192.168.1.100')
     .where(waf_action: :deny)
     .with_path_prefix([admin_seg.id])
```

### Using Composite Index

```ruby
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')

Event.where(request_host_id: host.id, request_method: :post)
     .with_path_prefix([api_seg.id])
```

### Exact Path Match

```ruby
# Find exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

# Compare against the raw JSON text; a hash condition would run the String
# through the Array serializer declared on the model.
Event.where("request_segment_ids = ?", [api_seg.id, v1_seg.id].to_json)
```

### Path Reconstruction for Display

```ruby
events = Event.with_path_prefix([api_seg.id]).limit(10)

events.each do |event|
  puts "#{event.reconstructed_path} - #{event.waf_action}"
  # => /api/v1/users - allow
  # => /api/v1/posts - deny
end
```

## Performance Characteristics

| Operation | Index Used | Complexity | Notes |
|-----------|-----------|------------|-------|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |

### Indexes in Schema

```ruby
# Single-column index for path queries
add_index :events, :request_segment_ids

# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
          name: 'idx_events_host_method_path'
```

## Storage Efficiency

### Compression Benefits

Example: `/api/v1/users` appears in 100,000 events

**Without normalization:**
```
100,000 events × 13 bytes = 1,300,000 bytes (1.3 MB)
```

**With normalization:**
```
3 segments × 10 bytes (avg) = 30 bytes
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (700 KB)
Total: 700,030 bytes (700 KB)

Savings: 46% reduction
```

These figures count only the path data itself; they assume the full `request_path` is not also stored for every event.

Plus benefits:
- **Usage tracking**: `usage_count` shows hot paths
- **Analytics**: Easy to identify common path patterns
- **Flexibility**: Can query at segment level

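The estimate can be reproduced for other paths and event counts. This is a rough sketch: `storage_estimate` is a hypothetical helper, it counts raw string bytes only (`'/api/v1/users'.bytesize` is 13), and it ignores SQLite row and index overhead. The 10-byte average per dictionary entry is taken from the example above.

```ruby
# Rough storage estimate for one path repeated across many events.
def storage_estimate(path:, segment_ids_json:, events:, dictionary_bytes_per_segment: 10)
  segment_count = path.split('/').reject(&:empty?).size
  full = events * path.bytesize
  dictionary = segment_count * dictionary_bytes_per_segment
  normalized = dictionary + events * segment_ids_json.bytesize
  { full: full, normalized: normalized, savings: 1.0 - normalized.to_f / full }
end

estimate = storage_estimate(path: '/api/v1/users', segment_ids_json: '[1,2,3]', events: 100_000)
puts estimate[:full]             # 1300000
puts estimate[:normalized]       # 700030
puts estimate[:savings].round(2) # 0.46
```
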
## Normalization Process

### Event Creation Flow

```ruby
# 1. Event arrives with full path
payload = {
  "request" => { "path" => "/api/v1/users/123" }
}

# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"

# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)

# 4. Path is decomposed into segments
segments = "/api/v1/users/123".split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]

# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
  path_segment = PathSegment.find_or_create_segment(segment)
  # find_or_create_by returns a persisted record, so new_record? is always
  # false here; previously_new_record? (Rails 6.1+) detects a fresh insert.
  path_segment.increment_usage! unless path_segment.previously_new_record?
  path_segment.id
end
# => [1, 2, 3, 4]

# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
```

### EventNormalizer Service

```ruby
class EventNormalizer
  def normalize_path_segments
    segments = @event.path_segments_array
    return if segments.empty?

    segment_ids = segments.map do |segment|
      path_segment = PathSegment.find_or_create_segment(segment)
      # Skip the increment only for records created by this call; new_record?
      # is always false after find_or_create_by, so use previously_new_record?.
      path_segment.increment_usage! unless path_segment.previously_new_record?
      path_segment.id
    end

    # Store as array - serialize will handle JSON encoding
    @event.request_segment_ids = segment_ids
  end
end
```

## Important: JSON Functions and Performance

### ❌ Avoid in WHERE Clauses

JSON functions like `json_array_length()` cannot use indexes:

```ruby
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)

# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
```

### ✅ Use for Analytics (Async)

JSON functions are fine for analytics queries run in background jobs:

```ruby
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
  def perform(project_id)
    # This is OK in async context.
    # Group and order by the expression itself: order(:depth) would be
    # qualified as events.depth, which is not a real column.
    depth_sql = Arel.sql("json_array_length(request_segment_ids)")
    stats = Event.where(project_id: project_id)
                 .group(depth_sql)
                 .order(depth_sql)
                 .count
    # => { depth => count, ... }

    # Store results for dashboard
    PathDepthStats.create!(project_id: project_id, data: stats)
  end
end
```

## Edge Cases and Considerations

### Empty Paths

```ruby
request_path = "/"
segments = []  # Empty after split and reject
request_segment_ids = []  # Empty array
# Stored as: "[]"
```

### Trailing Slashes

```ruby
# "/api/v1/" and "/api/v1" both normalize to ["api", "v1"]
"/api/v1/".split('/').reject(&:blank?) == "/api/v1".split('/').reject(&:blank?)  # => true
```

### Special Characters in Segments

```ruby
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
```

Consider normalizing query params separately if needed.

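If the query string should not end up inside a segment, it can be cut off before decomposition. The following is a minimal sketch, not the current behaviour: `path_segments` is a hypothetical helper that drops everything from the first `?` or `#` onward and leaves percent-encoding untouched.

```ruby
# Split a raw request target into path segments, ignoring query and fragment.
def path_segments(raw)
  path_only = raw[/\A[^?#]*/] # everything before the first "?" or "#"
  path_only.split('/').reject(&:empty?)
end

puts path_segments('/search?q=hello%20world').inspect # ["search"]
puts path_segments('/files/hello%20world/').inspect   # ["files", "hello%20world"]
```

A plain regular expression is used here rather than a URI parser so that malformed request targets, which are common in WAF traffic, do not raise.
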
### Very Deep Paths

Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics

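Truncation for analytics could look like the sketch below. The cutoff of 10 and the names `MAX_ANALYTICS_DEPTH` and `analytics_key` are assumptions for illustration; the `truncated` flag keeps the information that the original path was deeper.

```ruby
MAX_ANALYTICS_DEPTH = 10 # hypothetical cutoff

# Reduce a list of segment IDs to a bounded-length grouping key.
def analytics_key(segment_ids, max_depth: MAX_ANALYTICS_DEPTH)
  {
    prefix: segment_ids.first(max_depth),
    truncated: segment_ids.length > max_depth
  }
end

deep = analytics_key((1..12).to_a)
puts deep[:prefix].length # 10
puts deep[:truncated]     # true
```
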
## Analytics Use Cases

### Most Common Paths

```ruby
# Top 10 most accessed paths
Event.group(:request_segment_ids)
     .order('COUNT(*) DESC')
     .limit(10)
     .count
     .map { |seg_ids, count|
       # Group keys may arrive as raw JSON strings or as decoded arrays
       ids = seg_ids.is_a?(String) ? JSON.parse(seg_ids) : Array(seg_ids)
       # WHERE id IN (...) does not preserve order, so map IDs back explicitly
       segments = PathSegment.where(id: ids).pluck(:id, :segment).to_h
       ["/#{ids.map { |id| segments[id] }.join('/')}", count]
     }
```

### Hot Path Segments

```ruby
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
```

### Attack Pattern Detection

```ruby
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
     .select { |e| e.path_depth > 10 }
     .group_by { |e| e.request_segment_ids.first }
```

### Path-Based Rule Generation

```ruby
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
                        .where('created_at > ?', 1.hour.ago)
                        .group(:request_segment_ids)
                        .having('COUNT(*) > ?', 100)
                        .pluck(:request_segment_ids)

suspicious_paths.each do |seg_ids|
  RuleSet.global.block_path_segments(seg_ids)
end
```

## Future Optimizations

### Phase 2 Considerations

If performance becomes critical:

1. **Materialized Path Column**: Pre-compute common prefix patterns
2. **Trie Data Structure**: In-memory trie for ultra-fast prefix matching
3. **Redis Cache**: Cache hot path lookups
4. **Partial Indexes**: Index only blocked/challenged events

```ruby
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
          where: "waf_action IN ('deny', 'challenge')",
          name: 'idx_events_blocked_paths'
```

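To make the trie option from the list above more concrete, here is a minimal in-memory sketch keyed by segment IDs. `SegmentTrie` is a hypothetical class, not part of the current codebase; it only counts events per prefix and does not handle persistence or eviction.

```ruby
# Prefix tree over segment IDs; each node counts the events passing through it.
class SegmentTrie
  Node = Struct.new(:children, :hits)

  def initialize
    @root = new_node
  end

  # Record one event whose path is the given list of segment IDs.
  def insert(segment_ids)
    node = @root
    node.hits += 1
    segment_ids.each do |id|
      node = (node.children[id] ||= new_node)
      node.hits += 1
    end
  end

  # Number of recorded events whose path starts with prefix_ids.
  def count_with_prefix(prefix_ids)
    node = @root
    prefix_ids.each do |id|
      node = node.children[id]
      return 0 if node.nil?
    end
    node.hits
  end

  private

  def new_node
    Node.new({}, 0)
  end
end

trie = SegmentTrie.new
[[1, 2], [1, 2, 3], [1, 2, 99], [1, 20], [1, 3]].each { |ids| trie.insert(ids) }
puts trie.count_with_prefix([1, 2]) # 3
puts trie.count_with_prefix([1])    # 5
```

Because the trie compares whole integer IDs rather than characters, `[1, 20]` is never confused with the prefix `[1, 2]`.
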
### Storage Considerations

For very large deployments (100M+ events):

- **Archive old events**: Move to separate table
- **Aggregate path stats**: Pre-compute daily/hourly summaries
- **Compress JSON**: SQLite's built-in JSON functions do not compress data; compression would have to be applied at the application level, which would also give up the indexed range query on this column

## Testing

### Test Index Usage

```ruby
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")

# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
```

### Benchmark Queries

```ruby
require 'benchmark'

prefix_ids = [1, 2]

# Test indexed range query
Benchmark.bm do |x|
  x.report("Indexed range:") {
    Event.with_path_prefix(prefix_ids).count
  }

  x.report("LIKE query:") {
    Event.where("request_segment_ids LIKE ?", "[1,2,%").count
  }
end

# Expect the range query to be much faster on large tables; the exact ratio
# depends on table size and on how many rows match
```

## Conclusion

Path segment normalization with JSON array storage provides:

- ✅ **Significant storage savings** (roughly half in the example above)
- ✅ **Fast prefix queries** using standard B-tree indexes
- ✅ **Analytics-friendly** with usage tracking and pattern detection
- ✅ **Rails-native** using built-in serialization
- ✅ **Scalable** to millions of events with O(log n) lookups

The key insight: **Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently**, avoiding the need for complex full-text search or custom indexing strategies.

---

**Related Documentation:**
- [Event Ingestion](./event-ingestion.md) (TODO)
- [WAF Rule Engine](./rule-engine.md) (TODO)
- [Analytics Architecture](./analytics.md) (TODO)