Path Segment Architecture
Overview
Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.
The Problem
WAF systems generate millions of request events. Storing full URL paths like /api/v1/users/123/posts repeatedly wastes storage and makes pattern-based queries inefficient.
Traditional approaches:
- Full path storage: High redundancy, large database size
- String pattern matching with LIKE: No index support, slow queries
- Full-Text Search (FTS): Complex setup, overkill for structured paths
Our Solution: Path Segment Normalization
Architecture Components
Request: /api/v1/users/123/posts
↓
Decompose into segments: ["api", "v1", "users", "123", "posts"]
↓
Normalize to IDs: [1, 2, 3, 4, 5]
↓
Store as JSON array: "[1,2,3,4,5]"
Database Schema
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
t.string :segment, null: false, index: { unique: true }
t.integer :usage_count, default: 1, null: false
t.datetime :first_seen_at, null: false
t.timestamps
end
# events table - references segments by ID
create_table :events do |t|
t.string :request_segment_ids # JSON array: "[1,2,3]"
t.string :request_path # Original path for display
# ... other fields
end
# Critical index for fast lookups
add_index :events, :request_segment_ids
Models
PathSegment - The segment dictionary:
class PathSegment < ApplicationRecord
validates :segment, presence: true, uniqueness: true
validates :usage_count, presence: true, numericality: { greater_than: 0 }
def self.find_or_create_segment(segment)
find_or_create_by(segment: segment) do |path_segment|
path_segment.usage_count = 1
path_segment.first_seen_at = Time.current
end
end
def increment_usage!
increment!(:usage_count)
end
end
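A short console sketch of how the dictionary behaves (assuming a fresh path_segments table):
seg = PathSegment.find_or_create_segment("api")   # creates the row, usage_count = 1
PathSegment.find_or_create_segment("api") == seg  # => true, returns the existing row
seg.increment_usage!                              # usage_count => 2 on a repeat sighting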
Event - Stores segment IDs as JSON array:
class Event < ApplicationRecord
serialize :request_segment_ids, type: Array, coder: JSON
# Path reconstruction helper
def reconstructed_path
return request_path if request_segment_ids.blank?
segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
'/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
end
def path_depth
request_segment_ids&.length || 0
end
end
The Indexing Strategy
Why Standard LIKE Doesn't Work
SQLite's B-tree indexes only work with LIKE when the pattern is a simple alphanumeric prefix:
-- ✅ Uses index (alphanumeric prefix)
WHERE column LIKE 'api%'
-- ❌ Full table scan (starts with '[')
WHERE request_segment_ids LIKE '[1,2,%'
The Solution: Range Queries on Lexicographic Sort
JSON arrays sort lexicographically in SQLite:
"[1,2]" (exact match)
"[1,2,3]" (prefix match - has [1,2] as start)
"[1,2,4]" (prefix match - has [1,2] as start)
"[1,2,99]" (prefix match - has [1,2] as start)
"[1,3]" (out of range - different prefix)
To find all paths starting with [1,2]:
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,2-')
The range >= '[1,2,' AND < '[1,2-' captures exactly the arrays that continue past [1,2]: every true continuation begins with the characters "[1,2,", and '-' is the next ASCII character after ','. A looser upper bound such as '[1,3]' would also sweep in false positives like [1,20,5], whose second segment merely shares a leading digit with 2.
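These bounds can be sanity-checked from plain Ruby, since byte-wise String comparison matches SQLite's default BINARY collation for ASCII text:
"[1,2,3]" >= "[1,2,"   # => true   (continuation of the prefix, inside the range)
"[1,2,3]" <  "[1,2-"   # => true   (',' sorts just below '-')
"[1,20]"  <  "[1,2-"   # => false  (digit continuation of a different segment, excluded)
"[1,2]"   <  "[1,2-"   # => false  (']' sorts above '-', hence the explicit exact-match branch)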
Query Performance
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
Both branches use the B-tree index = O(log n) lookups!
Implementation: with_path_prefix Scope
scope :with_path_prefix, ->(prefix_segment_ids) {
return none if prefix_segment_ids.blank?
# Convert [1, 2] to its JSON string form: "[1,2]"
prefix_str = prefix_segment_ids.to_json
# Lower bound for prefix continuations: "[1,2,"
lower_str = "#{prefix_str[0..-2]},"
# Upper bound: the same string with the trailing ',' bumped to '-' (the next
# ASCII character), so the range holds exactly the strings that begin
# with "[1,2," and nothing else: "[1,2-"
upper_str = "#{prefix_str[0..-2]}-"
# Exact match plus range query - both branches use the B-tree index
where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
prefix_str, lower_str, upper_str)
}
Usage Examples
Basic Prefix Search
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
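In controllers or rake tasks it can be handy to wrap the segment lookups in a small helper that accepts a raw path string. prefix_ids_for below is a sketch and not part of the documented API:
# Hypothetical helper: map "/api/v1" to its segment IDs, or nil if any segment is unknown
def prefix_ids_for(path)
  names = path.split('/').reject(&:blank?)
  segments = PathSegment.where(segment: names).index_by(&:segment)
  ids = names.map { |name| segments[name]&.id }
  ids.all? ? ids : nil
end
ids = prefix_ids_for('/api/v1')
events = ids ? Event.with_path_prefix(ids) : Event.none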
Combined with Other Filters
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')
Event.where(ip_address: '192.168.1.100')
.where(waf_action: :deny)
.with_path_prefix([admin_seg.id])
Using Composite Index
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')
Event.where(request_host_id: host.id, request_method: :post)
.with_path_prefix([api_seg.id])
Exact Path Match
# Find exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
Event.where("request_segment_ids = ?", [api_seg.id, v1_seg.id].to_json) # compare against the stored JSON string directly
Path Reconstruction for Display
events = Event.with_path_prefix([api_seg.id]).limit(10)
events.each do |event|
puts "#{event.reconstructed_path} - #{event.waf_action}"
# => /api/v1/users - allow
# => /api/v1/posts - deny
end
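Note that reconstructed_path issues one PathSegment query per event, so loops like the one above are N+1. For listings, a batched variant along these lines (a sketch, not part of the documented models) loads every referenced segment once:
class Event < ApplicationRecord
  # Returns { event => "/api/v1/users", ... } with a single PathSegment query
  def self.reconstruct_paths(events)
    ids_by_event = events.index_with { |e| e.request_segment_ids || [] }
    segments = PathSegment.where(id: ids_by_event.values.flatten.uniq).index_by(&:id)
    ids_by_event.transform_values do |ids|
      '/' + ids.map { |id| segments[id]&.segment }.compact.join('/')
    end
  end
end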
Performance Characteristics
| Operation | Index Used | Complexity | Notes |
|---|---|---|---|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |
Indexes in Schema
# Single-column index for path queries
add_index :events, :request_segment_ids
# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
name: 'idx_events_host_method_path'
Storage Efficiency
Compression Benefits
Example: /api/v1/users (13 bytes) appears in 100,000 events
Without normalization:
100,000 events × 13 bytes = 1,300,000 bytes (~1.3 MB)
With normalization:
3 dictionary rows × ~10 bytes (avg) ≈ 30 bytes
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (~700 KB)
Total: ~700,030 bytes (~700 KB)
Savings: roughly 46% reduction, and the gap widens for longer, deeper paths
Plus benefits:
- Usage tracking: usage_count shows hot paths
- Analytics: Easy to identify common path patterns
- Flexibility: Can query at the segment level (see the json_each sketch below)
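As an example of that segment-level flexibility, SQLite's json_each table-valued function can test whether a single segment ID appears anywhere in the array. As with the JSON functions discussed later, this cannot use the B-tree index, so treat it as an analytics query:
admin_id = PathSegment.find_by(segment: 'admin')&.id
# Events whose path contains the 'admin' segment at any depth (full scan)
Event.where(
  "EXISTS (SELECT 1 FROM json_each(events.request_segment_ids) WHERE json_each.value = ?)",
  admin_id
)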
Normalization Process
Event Creation Flow
# 1. Event arrives with full path
payload = {
"request" => { "path" => "/api/v1/users/123" }
}
# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"
# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)
# 4. Path is decomposed into segments
segments = ["/api/v1/users/123"].split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]
# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
path_segment = PathSegment.find_or_create_segment(segment)
path_segment.increment_usage! unless path_segment.previously_new_record? # just-created rows already start at usage_count 1
path_segment.id
end
# => [1, 2, 3, 4]
# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
EventNormalizer Service
class EventNormalizer
def normalize_path_segments
segments = @event.path_segments_array
return if segments.empty?
segment_ids = segments.map do |segment|
path_segment = PathSegment.find_or_create_segment(segment)
path_segment.increment_usage! unless path_segment.previously_new_record?
path_segment.id
end
# Store as array - serialize will handle JSON encoding
@event.request_segment_ids = segment_ids
end
end
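The path_segments_array helper that the normalizer calls is not shown in this document; based on the decomposition in step 4 above, it presumably looks something like this sketch:
class Event < ApplicationRecord
  # "/api/v1/users/123" => ["api", "v1", "users", "123"]
  def path_segments_array
    request_path.to_s.split('/').reject(&:blank?)
  end
end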
Important: JSON Functions and Performance
❌ Avoid in WHERE Clauses
JSON functions like json_array_length() cannot use indexes:
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)
# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
✅ Use for Analytics (Async)
JSON functions are fine for analytics queries run in background jobs:
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
def perform(project_id)
# This is OK in async context
stats = Event.where(project_id: project_id)
.select("json_array_length(request_segment_ids) as depth, COUNT(*) as count")
.group("depth")
.order(:depth)
# Store results for dashboard
PathDepthStats.create!(project_id: project_id, data: stats)
end
end
Edge Cases and Considerations
Empty Paths
request_path = "/"
segments = [] # Empty after split and reject
request_segment_ids = [] # Empty array
# Stored as: "[]"
Trailing Slashes
"/api/v1/" == "/api/v1" # Both normalize to ["api", "v1"]
Special Characters in Segments
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
Consider normalizing query params separately if needed.
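If query strings can end up in request_path, one option is to strip them before segmentation, along these lines (a sketch; raw_path is a placeholder for whatever value the ingest pipeline provides):
path_only, _query = raw_path.split('?', 2)
segments = path_only.split('/').reject(&:blank?)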
Very Deep Paths
Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics (one-line sketch below)
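If truncation is adopted, it can be a one-line cap during normalization; MAX_ANALYTICS_SEGMENTS below is an assumed constant, not something from the codebase:
MAX_ANALYTICS_SEGMENTS = 20 # assumed cap; tune for real traffic
segment_ids = segment_ids.first(MAX_ANALYTICS_SEGMENTS)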
Analytics Use Cases
Most Common Paths
# Top 10 most accessed paths
Event.group(:request_segment_ids)
.order('COUNT(*) DESC')
.limit(10)
.count
.map { |seg_ids, count|
# Group keys may come back as raw JSON strings or deserialized arrays
ids = seg_ids.is_a?(String) ? JSON.parse(seg_ids) : seg_ids
# index_by preserves the path's segment order; pluck would return DB order
segments = PathSegment.where(id: ids).index_by(&:id)
["/#{ids.map { |id| segments[id]&.segment }.compact.join('/')}", count]
}
Hot Path Segments
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
Attack Pattern Detection
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
.select { |e| e.path_depth > 10 }
.group_by { |e| e.request_segment_ids.first }
Path-Based Rule Generation
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
.where('created_at > ?', 1.hour.ago)
.group(:request_segment_ids)
.having('COUNT(*) > ?', 100)
.pluck(:request_segment_ids)
suspicious_paths.each do |seg_ids|
# TODO: Implement rule creation for blocking path segments
# Rule.create!(rule_type: 'path_pattern', conditions: { patterns: seg_ids }, action: 'deny')
end
Future Optimizations
Phase 2 Considerations
If performance becomes critical:
- Materialized Path Column: Pre-compute common prefix patterns
- Trie Data Structure: In-memory trie for ultra-fast prefix matching
- Redis Cache: Cache hot path lookups
- Partial Indexes: Index only blocked/challenged events
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
where: "waf_action IN ('deny', 'challenge')",
name: 'idx_events_blocked_paths'
Storage Considerations
For very large deployments (100M+ events):
- Archive old events: Move to a separate table (see the sketch below)
- Aggregate path stats: Pre-compute daily/hourly summaries
- Compact the JSON: the JSON1 extension does not compress data itself, so consider a more compact encoding or application-level compression for archived rows
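A sketch of the archiving idea above, assuming an events_archive table with the same columns as events (neither the table nor this job is part of the documented schema):
class ArchiveOldEventsJob < ApplicationJob
  def perform(cutoff = 90.days.ago)
    Event.transaction do
      quoted_cutoff = ActiveRecord::Base.connection.quote(cutoff)
      # Copy old rows, then remove them from the hot table
      ActiveRecord::Base.connection.execute(
        "INSERT INTO events_archive SELECT * FROM events WHERE created_at < #{quoted_cutoff}"
      )
      Event.where("created_at < ?", cutoff).delete_all
    end
  end
end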
Testing
Test Index Usage
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")
# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
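The same check can live in the test suite as an assertion. A Minitest-style sketch (the row format assumes the SQLite3 adapter's hash results):
class PathPrefixIndexTest < ActiveSupport::TestCase
  test "with_path_prefix uses the segment-ids index" do
    sql  = Event.with_path_prefix([1, 2]).to_sql
    plan = ActiveRecord::Base.connection
                             .execute("EXPLAIN QUERY PLAN #{sql}")
                             .map { |row| row["detail"] }.join("\n")
    assert_includes plan, "index_events_on_request_segment_ids"
  end
end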
Benchmark Queries
require 'benchmark'
prefix_ids = [1, 2]
# Test indexed range query
Benchmark.bm do |x|
x.report("Indexed range:") {
Event.with_path_prefix(prefix_ids).count
}
x.report("LIKE query:") {
Event.where("request_segment_ids LIKE ?", "[1,2,%").count
}
end
# Range query should be 10-100x faster
Conclusion
Path segment normalization with JSON array storage provides:
- ✅ Significant storage savings (around 46% in the worked example above, more for deeper paths)
- ✅ Fast prefix queries using standard B-tree indexes
- ✅ Analytics-friendly with usage tracking and pattern detection
- ✅ Rails-native using built-in serialization
- ✅ Scalable to millions of events with O(log n) lookups
The key insight: Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently, avoiding the need for complex full-text search or custom indexing strategies.
Related Documentation:
- Event Ingestion (TODO)
- WAF Rule Engine (TODO)
- Analytics Architecture (TODO)