Files
baffle-hub/docs/path-segment-architecture.md

514 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Path Segment Architecture
## Overview
Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.
## The Problem
WAF systems generate millions of request events. Storing full URL paths like `/api/v1/users/123/posts` repeatedly wastes storage and makes pattern-based queries inefficient.
Traditional approaches:
- **Full path storage**: High redundancy, large database size
- **String pattern matching with LIKE**: No index support, slow queries
- **Full-Text Search (FTS)**: Complex setup, overkill for structured paths
## Our Solution: Path Segment Normalization
### Architecture Components
```
Request: /api/v1/users/123/posts
Decompose into segments: ["api", "v1", "users", "123", "posts"]
Normalize to IDs: [1, 2, 3, 4, 5]
Store as JSON array: "[1,2,3,4,5]"
```
### Database Schema
```ruby
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
t.string :segment, null: false, index: { unique: true }
t.integer :usage_count, default: 1, null: false
t.datetime :first_seen_at, null: false
t.timestamps
end
# events table - references segments by ID
create_table :events do |t|
t.string :request_segment_ids # JSON array: "[1,2,3]"
t.string :request_path # Original path for display
# ... other fields
end
# Critical index for fast lookups
add_index :events, :request_segment_ids
```
### Models
**PathSegment** - The segment dictionary:
```ruby
class PathSegment < ApplicationRecord
validates :segment, presence: true, uniqueness: true
validates :usage_count, presence: true, numericality: { greater_than: 0 }
def self.find_or_create_segment(segment)
find_or_create_by(segment: segment) do |path_segment|
path_segment.usage_count = 1
path_segment.first_seen_at = Time.current
end
end
def increment_usage!
increment!(:usage_count)
end
end
```
**Event** - Stores segment IDs as JSON array:
```ruby
class Event < ApplicationRecord
serialize :request_segment_ids, type: Array, coder: JSON
# Path reconstruction helper
def reconstructed_path
return request_path if request_segment_ids.blank?
segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
'/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
end
def path_depth
request_segment_ids&.length || 0
end
end
```
## The Indexing Strategy
### Why Standard LIKE Doesn't Work
SQLite's B-tree indexes only work with LIKE when the pattern is a simple alphanumeric prefix:
```sql
-- ✅ Uses index (alphanumeric prefix)
WHERE column LIKE 'api%'
-- ❌ Full table scan (starts with '[')
WHERE request_segment_ids LIKE '[1,2,%'
```
### The Solution: Range Queries on Lexicographic Sort
JSON arrays sort lexicographically in SQLite:
```
"[1,2]" (exact match)
"[1,2,3]" (prefix match - has [1,2] as start)
"[1,2,4]" (prefix match - has [1,2] as start)
"[1,2,99]" (prefix match - has [1,2] as start)
"[1,3]" (out of range - different prefix)
```
To find all paths starting with `[1,2]`:
```sql
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,3]')
```
The range `>= '[1,2,' AND < '[1,3]'` captures all arrays starting with `[1,2,...]`.
### Query Performance
```
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
```
Both branches use the B-tree index = O(log n) lookups!
### Implementation: with_path_prefix Scope
```ruby
scope :with_path_prefix, ->(prefix_segment_ids) {
return none if prefix_segment_ids.blank?
# Convert [1, 2] to JSON string "[1,2]"
prefix_str = prefix_segment_ids.to_json
# Build upper bound by incrementing last segment
# [1, 2] + 1 = [1, 3]
upper_prefix = prefix_segment_ids[0..-2] + [prefix_segment_ids.last + 1]
upper_str = upper_prefix.to_json
# Lower bound for prefix matches: "[1,2,"
lower_prefix_str = "#{prefix_str[0..-2]},"
# Range query that uses B-tree index
where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
prefix_str, lower_prefix_str, upper_str)
}
```
## Usage Examples
### Basic Prefix Search
```ruby
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
```
### Combined with Other Filters
```ruby
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')
Event.where(ip_address: '192.168.1.100')
.where(waf_action: :deny)
.with_path_prefix([admin_seg.id])
```
### Using Composite Index
```ruby
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')
Event.where(request_host_id: host.id, request_method: :post)
.with_path_prefix([api_seg.id])
```
### Exact Path Match
```ruby
# Find exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
Event.where(request_segment_ids: [api_seg.id, v1_seg.id].to_json)
```
### Path Reconstruction for Display
```ruby
events = Event.with_path_prefix([api_seg.id]).limit(10)
events.each do |event|
puts "#{event.reconstructed_path} - #{event.waf_action}"
# => /api/v1/users - allow
# => /api/v1/posts - deny
end
```
## Performance Characteristics
| Operation | Index Used | Complexity | Notes |
|-----------|-----------|------------|-------|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |
### Indexes in Schema
```ruby
# Single-column index for path queries
add_index :events, :request_segment_ids
# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
name: 'idx_events_host_method_path'
```
## Storage Efficiency
### Compression Benefits
Example: `/api/v1/users` appears in 100,000 events
**Without normalization:**
```
100,000 events × 15 bytes = 1,500,000 bytes (1.5 MB)
```
**With normalization:**
```
3 segments × 10 bytes (avg) = 30 bytes
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (700 KB)
Total: 700,030 bytes (700 KB)
Savings: 53% reduction
```
Plus benefits:
- **Usage tracking**: `usage_count` shows hot paths
- **Analytics**: Easy to identify common path patterns
- **Flexibility**: Can query at segment level
## Normalization Process
### Event Creation Flow
```ruby
# 1. Event arrives with full path
payload = {
"request" => { "path" => "/api/v1/users/123" }
}
# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"
# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)
# 4. Path is decomposed into segments
segments = ["/api/v1/users/123"].split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]
# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
path_segment = PathSegment.find_or_create_segment(segment)
path_segment.increment_usage! unless path_segment.new_record?
path_segment.id
end
# => [1, 2, 3, 4]
# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
```
### EventNormalizer Service
```ruby
class EventNormalizer
def normalize_path_segments
segments = @event.path_segments_array
return if segments.empty?
segment_ids = segments.map do |segment|
path_segment = PathSegment.find_or_create_segment(segment)
path_segment.increment_usage! unless path_segment.new_record?
path_segment.id
end
# Store as array - serialize will handle JSON encoding
@event.request_segment_ids = segment_ids
end
end
```
## Important: JSON Functions and Performance
### ❌ Avoid in WHERE Clauses
JSON functions like `json_array_length()` cannot use indexes:
```ruby
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)
# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
```
### ✅ Use for Analytics (Async)
JSON functions are fine for analytics queries run in background jobs:
```ruby
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
def perform(project_id)
# This is OK in async context
stats = Event.where(project_id: project_id)
.select("json_array_length(request_segment_ids) as depth, COUNT(*) as count")
.group("depth")
.order(:depth)
# Store results for dashboard
PathDepthStats.create!(project_id: project_id, data: stats)
end
end
```
## Edge Cases and Considerations
### Empty Paths
```ruby
request_path = "/"
segments = [] # Empty after split and reject
request_segment_ids = [] # Empty array
# Stored as: "[]"
```
### Trailing Slashes
```ruby
"/api/v1/" == "/api/v1" # Both normalize to ["api", "v1"]
```
### Special Characters in Segments
```ruby
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
```
Consider normalizing query params separately if needed.
### Very Deep Paths
Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics
## Analytics Use Cases
### Most Common Paths
```ruby
# Top 10 most accessed paths
Event.group(:request_segment_ids)
.order('COUNT(*) DESC')
.limit(10)
.count
.map { |seg_ids, count|
path = PathSegment.where(id: JSON.parse(seg_ids))
.pluck(:segment)
.join('/')
["/#{path}", count]
}
```
### Hot Path Segments
```ruby
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
```
### Attack Pattern Detection
```ruby
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
.select { |e| e.path_depth > 10 }
.group_by { |e| e.request_segment_ids.first }
```
### Path-Based Rule Generation
```ruby
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
.where('created_at > ?', 1.hour.ago)
.group(:request_segment_ids)
.having('COUNT(*) > ?', 100)
.pluck(:request_segment_ids)
suspicious_paths.each do |seg_ids|
# TODO: Implement rule creation for blocking path segments
# Rule.create!(rule_type: 'path_pattern', conditions: { patterns: seg_ids }, action: 'deny')
end
```
## Future Optimizations
### Phase 2 Considerations
If performance becomes critical:
1. **Materialized Path Column**: Pre-compute common prefix patterns
2. **Trie Data Structure**: In-memory trie for ultra-fast prefix matching
3. **Redis Cache**: Cache hot path lookups
4. **Partial Indexes**: Index only blocked/challenged events
```ruby
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
where: "waf_action IN ('deny', 'challenge')",
name: 'idx_events_blocked_paths'
```
### Storage Considerations
For very large deployments (100M+ events):
- **Archive old events**: Move to separate table
- **Aggregate path stats**: Pre-compute daily/hourly summaries
- **Compress JSON**: SQLite JSON1 extension supports compression
## Testing
### Test Index Usage
```ruby
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")
# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
```
### Benchmark Queries
```ruby
require 'benchmark'
prefix_ids = [1, 2]
# Test indexed range query
Benchmark.bm do |x|
x.report("Indexed range:") {
Event.with_path_prefix(prefix_ids).count
}
x.report("LIKE query:") {
Event.where("request_segment_ids LIKE ?", "[1,2,%").count
}
end
# Range query should be 10-100x faster
```
## Conclusion
Path segment normalization with JSON array storage provides:
**Significant storage savings** (50%+ compression)
**Fast prefix queries** using standard B-tree indexes
**Analytics-friendly** with usage tracking and pattern detection
**Rails-native** using built-in serialization
**Scalable** to millions of events with O(log n) lookups
The key insight: **Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently**, avoiding the need for complex full-text search or custom indexing strategies.
---
**Related Documentation:**
- [Event Ingestion](./event-ingestion.md) (TODO)
- [WAF Rule Engine](./rule-engine.md) (TODO)
- [Analytics Architecture](./analytics.md) (TODO)