# Path Segment Architecture

## Overview

Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.

## The Problem

WAF systems generate millions of request events. Storing full URL paths like `/api/v1/users/123/posts` repeatedly wastes storage and makes pattern-based queries inefficient.

Traditional approaches:

- **Full path storage**: High redundancy, large database size
- **String pattern matching with LIKE**: No index support, slow queries
- **Full-Text Search (FTS)**: Complex setup, overkill for structured paths

## Our Solution: Path Segment Normalization

### Architecture Components

```
Request: /api/v1/users/123/posts
    ↓
Decompose into segments: ["api", "v1", "users", "123", "posts"]
    ↓
Normalize to IDs: [1, 2, 3, 4, 5]
    ↓
Store as JSON array: "[1,2,3,4,5]"
```

### Database Schema

```ruby
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
  t.string :segment, null: false, index: { unique: true }
  t.integer :usage_count, default: 1, null: false
  t.datetime :first_seen_at, null: false
  t.timestamps
end

# events table - references segments by ID
create_table :events do |t|
  t.string :request_segment_ids  # JSON array: "[1,2,3]"
  t.string :request_path         # Original path for display
  # ... other fields
end

# Critical index for fast lookups
add_index :events, :request_segment_ids
```

### Models

**PathSegment** - The segment dictionary:

```ruby
class PathSegment < ApplicationRecord
  validates :segment, presence: true, uniqueness: true
  validates :usage_count, presence: true, numericality: { greater_than: 0 }

  def self.find_or_create_segment(segment)
    find_or_create_by(segment: segment) do |path_segment|
      path_segment.usage_count = 1
      path_segment.first_seen_at = Time.current
    end
  end

  def increment_usage!
    increment!(:usage_count)
  end
end
```

**Event** - Stores segment IDs as JSON array:

```ruby
class Event < ApplicationRecord
  serialize :request_segment_ids, type: Array, coder: JSON

  # Path reconstruction helper
  def reconstructed_path
    return request_path if request_segment_ids.blank?

    segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
    '/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
  end

  def path_depth
    request_segment_ids&.length || 0
  end
end
```

## The Indexing Strategy

### Why Standard LIKE Doesn't Work

SQLite's B-tree indexes only work with LIKE when the pattern is a simple alphanumeric prefix:

```sql
-- ✅ Uses index (alphanumeric prefix)
WHERE column LIKE 'api%'

-- ❌ Full table scan (starts with '[')
WHERE request_segment_ids LIKE '[1,2,%'
```

### The Solution: Range Queries on Lexicographic Sort

JSON arrays sort lexicographically in SQLite:

```
"[1,2]"     (exact match)
"[1,2,3]"   (prefix match - has [1,2] as start)
"[1,2,4]"   (prefix match - has [1,2] as start)
"[1,2,99]"  (prefix match - has [1,2] as start)
"[1,3]"     (out of range - different prefix)
```

To find all paths starting with `[1,2]`:

```sql
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
   OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,3]')
```

The range `>= '[1,2,' AND < '[1,3]'` captures all arrays starting with `[1,2,...]`.
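The same comparison can be reproduced in plain Ruby, since Ruby compares strings byte-wise just as SQLite's default BINARY collation does for these ASCII JSON strings. A minimal sketch of how a prefix `[1, 2]` maps to the exact-match string and the two range bounds used above:

```ruby
require 'json'

prefix = [1, 2]

exact = prefix.to_json                               # "[1,2]"
lower = "#{exact[0..-2]},"                           # "[1,2," - ']' replaced by ','
upper = (prefix[0..-2] + [prefix.last + 1]).to_json  # "[1,3]"

stored  = ['[1,2]', '[1,2,3]', '[1,2,99]', '[1,3]', '[2,1]']
matches = stored.select { |ids| ids == exact || (ids >= lower && ids < upper) }

puts matches.inspect  # => ["[1,2]", "[1,2,3]", "[1,2,99]"]
```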
### Query Performance

```
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
```

### The `with_path_prefix` Scope

The range logic is wrapped in a scope on `Event`:

```ruby
scope :with_path_prefix, ->(prefix_segment_ids) {
  return none if prefix_segment_ids.blank?

  # Convert [1, 2] to JSON string "[1,2]"
  prefix_str = prefix_segment_ids.to_json

  # Build upper bound by incrementing last segment
  # [1, 2] + 1 = [1, 3]
  upper_prefix = prefix_segment_ids[0..-2] + [prefix_segment_ids.last + 1]
  upper_str = upper_prefix.to_json

  # Lower bound for prefix matches: "[1,2,"
  lower_prefix_str = "#{prefix_str[0..-2]},"

  # Range query that uses B-tree index
  where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
        prefix_str, lower_prefix_str, upper_str)
}
```

## Usage Examples

### Basic Prefix Search

```ruby
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
```

### Combined with Other Filters

```ruby
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')

Event.where(ip_address: '192.168.1.100')
     .where(waf_action: :deny)
     .with_path_prefix([admin_seg.id])
```

### Using Composite Index

```ruby
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')

Event.where(request_host_id: host.id, request_method: :post)
     .with_path_prefix([api_seg.id])
```

### Exact Path Match

```ruby
# Find exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

Event.where(request_segment_ids: [api_seg.id, v1_seg.id].to_json)
```

### Path Reconstruction for Display

```ruby
events = Event.with_path_prefix([api_seg.id]).limit(10)

events.each do |event|
  puts "#{event.reconstructed_path} - #{event.waf_action}"
  # => /api/v1/users - allow
  # => /api/v1/posts - deny
end
```

## Performance Characteristics

| Operation | Index Used | Complexity | Notes |
|-----------|------------|------------|-------|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |

### Indexes in Schema

```ruby
# Single-column index for path queries
add_index :events, :request_segment_ids

# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
          name: 'idx_events_host_method_path'
```

## Storage Efficiency

### Compression Benefits

Example: `/api/v1/users` appears in 100,000 events.

**Without normalization:**

```
100,000 events × 15 bytes = 1,500,000 bytes (1.5 MB)
```

**With normalization:**

```
3 segments × 10 bytes (avg)           =      30 bytes
100,000 events × 7 bytes ("[1,2,3]")  = 700,000 bytes (700 KB)
Total:                                  700,030 bytes (700 KB)

Savings: 53% reduction
```

Additional benefits:

- **Usage tracking**: `usage_count` shows hot paths
- **Analytics**: Easy to identify common path patterns
- **Flexibility**: Can query at segment level
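The compression figures above are illustrative; actual savings on a live database can be sanity-checked by comparing raw column sizes. A rough sketch that uses only the columns described above (row and index overhead are ignored):

```ruby
# Rough, back-of-the-envelope estimate of path storage savings.
raw_bytes        = Event.sum("LENGTH(request_path)")
normalized_bytes = Event.sum("LENGTH(request_segment_ids)") +
                   PathSegment.sum("LENGTH(segment)")

if raw_bytes.positive?
  savings = 1.0 - (normalized_bytes.to_f / raw_bytes)
  puts format("Estimated path storage savings: %.1f%%", savings * 100)
end
```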
## Normalization Process

### Event Creation Flow

```ruby
# 1. Event arrives with full path
payload = { "request" => { "path" => "/api/v1/users/123" } }

# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"

# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)

# 4. Path is decomposed into segments
segments = "/api/v1/users/123".split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]

# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
  path_segment = PathSegment.find_or_create_segment(segment)
  # Skip the increment for rows that were just created (usage_count starts at 1)
  path_segment.increment_usage! unless path_segment.previously_new_record?
  path_segment.id
end
# => [1, 2, 3, 4]

# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
```

### EventNormalizer Service

```ruby
class EventNormalizer
  def normalize_path_segments
    segments = @event.path_segments_array
    return if segments.empty?

    segment_ids = segments.map do |segment|
      path_segment = PathSegment.find_or_create_segment(segment)
      path_segment.increment_usage! unless path_segment.previously_new_record?
      path_segment.id
    end

    # Store as array - serialize will handle JSON encoding
    @event.request_segment_ids = segment_ids
  end
end
```

## Important: JSON Functions and Performance

### ❌ Avoid in WHERE Clauses

JSON functions like `json_array_length()` cannot use indexes:

```ruby
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)

# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
```

### ✅ Use for Analytics (Async)

JSON functions are fine for analytics queries run in background jobs:

```ruby
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
  def perform(project_id)
    # This is OK in async context
    stats = Event.where(project_id: project_id)
                 .select("json_array_length(request_segment_ids) AS depth, COUNT(*) AS count")
                 .group("depth")
                 .order(:depth)

    # Store results for dashboard
    PathDepthStats.create!(project_id: project_id, data: stats)
  end
end
```

## Edge Cases and Considerations

### Empty Paths

```ruby
request_path = "/"
segments = []             # Empty after split and reject
request_segment_ids = []  # Empty array
# Stored as: "[]"
```

### Trailing Slashes

```ruby
"/api/v1/".split('/').reject(&:blank?)  # => ["api", "v1"]
"/api/v1".split('/').reject(&:blank?)   # => ["api", "v1"]
# Both forms normalize to the same segments
```

### Special Characters in Segments

```ruby
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
```

Consider normalizing query params separately if needed.
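One way to do that is to strip the query string before decomposition. A minimal plain-Ruby sketch of that pre-processing step (hypothetical; not currently part of EventNormalizer):

```ruby
require 'uri'

# Drop the query string so "?q=..." never ends up inside a path segment.
raw_path = "/search/results?q=hello%20world"

path_only = URI.parse(raw_path).path              # => "/search/results"
segments  = path_only.split('/').reject(&:empty?)
# => ["search", "results"]
```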
### Very Deep Paths

Paths with 10+ segments work fine, but consider:

- Are they legitimate? (Unusual depth can indicate an attack.)
- Impact on JSON array size
- Truncating them for analytics

## Analytics Use Cases

### Most Common Paths

```ruby
# Top 10 most accessed paths
Event.group(:request_segment_ids)
     .order('COUNT(*) DESC')
     .limit(10)
     .count
     .map { |seg_ids, count|
       path = PathSegment.where(id: JSON.parse(seg_ids))
                         .pluck(:segment)
                         .join('/')
       ["/#{path}", count]
     }
```

### Hot Path Segments

```ruby
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
```

### Attack Pattern Detection

```ruby
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
     .select { |e| e.path_depth > 10 }
     .group_by { |e| e.request_segment_ids.first }
```

### Path-Based Rule Generation

```ruby
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
                        .where('created_at > ?', 1.hour.ago)
                        .group(:request_segment_ids)
                        .having('COUNT(*) > ?', 100)
                        .pluck(:request_segment_ids)

suspicious_paths.each do |seg_ids|
  # TODO: Implement rule creation for blocking path segments
  # Rule.create!(rule_type: 'path_pattern', conditions: { patterns: seg_ids }, action: 'deny')
end
```

## Future Optimizations

### Phase 2 Considerations

If performance becomes critical:

1. **Materialized Path Column**: Pre-compute common prefix patterns
2. **Trie Data Structure**: In-memory trie for ultra-fast prefix matching
3. **Redis Cache**: Cache hot path lookups
4. **Partial Indexes**: Index only blocked/challenged events

```ruby
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
          where: "waf_action IN ('deny', 'challenge')",
          name: 'idx_events_blocked_paths'
```

### Storage Considerations

For very large deployments (100M+ events):

- **Archive old events**: Move to a separate table
- **Aggregate path stats**: Pre-compute daily/hourly summaries
- **Compress JSON**: SQLite's JSON1 functions do not compress data, so compress segment-ID payloads at the application level if needed

## Testing

### Test Index Usage

```ruby
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")

# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
```

### Benchmark Queries

```ruby
require 'benchmark'

prefix_ids = [1, 2]

# Compare the indexed range query with a LIKE scan
Benchmark.bm do |x|
  x.report("Indexed range:") { Event.with_path_prefix(prefix_ids).count }
  x.report("LIKE query:")    { Event.where("request_segment_ids LIKE ?", "[1,2,%").count }
end
# Range query should be 10-100x faster
```

## Conclusion

Path segment normalization with JSON array storage provides:

- ✅ **Significant storage savings** (50%+ compression)
- ✅ **Fast prefix queries** using standard B-tree indexes
- ✅ **Analytics-friendly** with usage tracking and pattern detection
- ✅ **Rails-native** using built-in serialization
- ✅ **Scalable** to millions of events with O(log n) lookups

The key insight: **range queries on lexicographically sorted JSON strings use B-tree indexes efficiently**, avoiding the need for complex full-text search or custom indexing strategies.

---

**Related Documentation:**

- [Event Ingestion](./event-ingestion.md) (TODO)
- [WAF Rule Engine](./rule-engine.md) (TODO)
- [Analytics Architecture](./analytics.md) (TODO)