baffle-hub/docs/path-segment-architecture.md

# Path Segment Architecture

## Overview

Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.

## The Problem

WAF systems generate millions of request events. Storing full URL paths like `/api/v1/users/123/posts` repeatedly wastes storage and makes pattern-based queries inefficient.

Traditional approaches:
- **Full path storage**: High redundancy, large database size
- **String pattern matching with LIKE**: No index support, slow queries
- **Full-Text Search (FTS)**: Complex setup, overkill for structured paths

## Our Solution: Path Segment Normalization

### Architecture Components

```
Request: /api/v1/users/123/posts
    ↓
Decompose into segments: ["api", "v1", "users", "123", "posts"]
    ↓
Normalize to IDs: [1, 2, 3, 4, 5]
    ↓
Store as JSON array: "[1,2,3,4,5]"
```

### Database Schema

```ruby
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
  t.string :segment, null: false, index: { unique: true }
  t.integer :usage_count, default: 1, null: false
  t.datetime :first_seen_at, null: false
  t.timestamps
end

# events table - references segments by ID
create_table :events do |t|
  t.string :request_segment_ids  # JSON array: "[1,2,3]"
  t.string :request_path         # Original path for display
  # ... other fields
end

# Critical index for fast lookups
add_index :events, :request_segment_ids
```

### Models

**PathSegment** - The segment dictionary:
```ruby
class PathSegment < ApplicationRecord
  validates :segment, presence: true, uniqueness: true
  validates :usage_count, presence: true, numericality: { greater_than: 0 }

  def self.find_or_create_segment(segment)
    find_or_create_by(segment: segment) do |path_segment|
      path_segment.usage_count = 1
      path_segment.first_seen_at = Time.current
    end
  end

  def increment_usage!
    increment!(:usage_count)
  end
end
```

**Event** - Stores segment IDs as JSON array:
```ruby
class Event < ApplicationRecord
  serialize :request_segment_ids, type: Array, coder: JSON

  # Path reconstruction helper
  def reconstructed_path
    return request_path if request_segment_ids.blank?

    segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
    '/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
  end

  def path_depth
    request_segment_ids&.length || 0
  end
end
```

## The Indexing Strategy

### Why Standard LIKE Doesn't Work

SQLite's B-tree indexes only work with LIKE when the pattern is a simple alphanumeric prefix:

```sql
-- ✅ Uses index (alphanumeric prefix)
WHERE column LIKE 'api%'

-- ❌ Full table scan (starts with '[')
WHERE request_segment_ids LIKE '[1,2,%'
```

### The Solution: Range Queries on Lexicographic Sort

JSON arrays sort lexicographically in SQLite:

```
"[1,2]"      (exact match)
"[1,2,3]"    (prefix match - has [1,2] as start)
"[1,2,4]"    (prefix match - has [1,2] as start)
"[1,2,99]"   (prefix match - has [1,2] as start)
"[1,3]"      (out of range - different prefix)
```

To find all paths starting with `[1,2]`:
```sql
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
   OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,3]')
```

The range `>= '[1,2,' AND < '[1,3]'` captures all arrays starting with `[1,2,...]`.

### Query Performance

```
EXPLAIN QUERY PLAN:
  MULTI-INDEX OR
    ├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
    └─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
```

Both branches use the B-tree index = O(log n) lookups!

### Implementation: with_path_prefix Scope

```ruby
scope :with_path_prefix, ->(prefix_segment_ids) {
  return none if prefix_segment_ids.blank?

  # Convert [1, 2] to JSON string "[1,2]"
  prefix_str = prefix_segment_ids.to_json

  # Build upper bound by incrementing last segment
  # [1, 2] + 1 = [1, 3]
  upper_prefix = prefix_segment_ids[0..-2] + [prefix_segment_ids.last + 1]
  upper_str = upper_prefix.to_json

  # Lower bound for prefix matches: "[1,2,"
  lower_prefix_str = "#{prefix_str[0..-2]},"

  # Range query that uses B-tree index
  where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
        prefix_str, lower_prefix_str, upper_str)
}
```

## Usage Examples

### Basic Prefix Search

```ruby
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
```

### Combined with Other Filters

```ruby
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')

Event.where(ip_address: '192.168.1.100')
     .where(waf_action: :deny)
     .with_path_prefix([admin_seg.id])
```

### Using Composite Index

```ruby
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')

Event.where(request_host_id: host.id, request_method: :post)
     .with_path_prefix([api_seg.id])
```

### Exact Path Match

```ruby
# Find exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

Event.where(request_segment_ids: [api_seg.id, v1_seg.id].to_json)
```

### Path Reconstruction for Display

```ruby
events = Event.with_path_prefix([api_seg.id]).limit(10)

events.each do |event|
  puts "#{event.reconstructed_path} - #{event.waf_action}"
  # => /api/v1/users - allow
  # => /api/v1/posts - deny
end
```

## Performance Characteristics

| Operation | Index Used | Complexity | Notes |
|-----------|-----------|------------|-------|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |

### Indexes in Schema

```ruby
# Single-column index for path queries
add_index :events, :request_segment_ids

# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
  name: 'idx_events_host_method_path'
```

## Storage Efficiency

### Compression Benefits

Example: `/api/v1/users` appears in 100,000 events

**Without normalization:**
```
100,000 events × 15 bytes = 1,500,000 bytes (1.5 MB)
```

**With normalization:**
```
3 segments × 10 bytes (avg) = 30 bytes
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (700 KB)
Total: 700,030 bytes (700 KB)

Savings: 53% reduction
```

Plus benefits:
- **Usage tracking**: `usage_count` shows hot paths
- **Analytics**: Easy to identify common path patterns
- **Flexibility**: Can query at segment level

## Normalization Process

### Event Creation Flow

```ruby
# 1. Event arrives with full path
payload = {
  "request" => { "path" => "/api/v1/users/123" }
}

# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"

# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)

# 4. Path is decomposed into segments
segments = ["/api/v1/users/123"].split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]

# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
  path_segment = PathSegment.find_or_create_segment(segment)
  path_segment.increment_usage! unless path_segment.new_record?
  path_segment.id
end
# => [1, 2, 3, 4]

# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
```

### EventNormalizer Service

```ruby
class EventNormalizer
  def normalize_path_segments
    segments = @event.path_segments_array
    return if segments.empty?

    segment_ids = segments.map do |segment|
      path_segment = PathSegment.find_or_create_segment(segment)
      path_segment.increment_usage! unless path_segment.new_record?
      path_segment.id
    end

    # Store as array - serialize will handle JSON encoding
    @event.request_segment_ids = segment_ids
  end
end
```

## Important: JSON Functions and Performance

### ❌ Avoid in WHERE Clauses

JSON functions like `json_array_length()` cannot use indexes:

```ruby
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)

# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
```

### ✅ Use for Analytics (Async)

JSON functions are fine for analytics queries run in background jobs:

```ruby
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
  def perform(project_id)
    # This is OK in async context
    stats = Event.where(project_id: project_id)
                 .select("json_array_length(request_segment_ids) as depth, COUNT(*) as count")
                 .group("depth")
                 .order(:depth)

    # Store results for dashboard
    PathDepthStats.create!(project_id: project_id, data: stats)
  end
end
```

## Edge Cases and Considerations

### Empty Paths

```ruby
request_path = "/"
segments = []  # Empty after split and reject
request_segment_ids = []  # Empty array
# Stored as: "[]"
```

### Trailing Slashes

```ruby
"/api/v1/" == "/api/v1"  # Both normalize to ["api", "v1"]
```

### Special Characters in Segments

```ruby
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
```

Consider normalizing query params separately if needed.

### Very Deep Paths

Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics

## Analytics Use Cases

### Most Common Paths

```ruby
# Top 10 most accessed paths
Event.group(:request_segment_ids)
     .order('COUNT(*) DESC')
     .limit(10)
     .count
     .map { |seg_ids, count|
       path = PathSegment.where(id: JSON.parse(seg_ids))
                        .pluck(:segment)
                        .join('/')
       ["/#{path}", count]
     }
```

### Hot Path Segments

```ruby
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
```

### Attack Pattern Detection

```ruby
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
     .select { |e| e.path_depth > 10 }
     .group_by { |e| e.request_segment_ids.first }
```

### Path-Based Rule Generation

```ruby
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
                        .where('created_at > ?', 1.hour.ago)
                        .group(:request_segment_ids)
                        .having('COUNT(*) > ?', 100)
                        .pluck(:request_segment_ids)

suspicious_paths.each do |seg_ids|
  RuleSet.global.block_path_segments(seg_ids)
end
```

## Future Optimizations

### Phase 2 Considerations

If performance becomes critical:

1. **Materialized Path Column**: Pre-compute common prefix patterns
2. **Trie Data Structure**: In-memory trie for ultra-fast prefix matching
3. **Redis Cache**: Cache hot path lookups
4. **Partial Indexes**: Index only blocked/challenged events

```ruby
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
  where: "waf_action IN ('deny', 'challenge')",
  name: 'idx_events_blocked_paths'
```

### Storage Considerations

For very large deployments (100M+ events):

- **Archive old events**: Move to separate table
- **Aggregate path stats**: Pre-compute daily/hourly summaries
- **Compress JSON**: SQLite JSON1 extension supports compression

## Testing

### Test Index Usage

```ruby
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")

# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
```

### Benchmark Queries

```ruby
require 'benchmark'

prefix_ids = [1, 2]

# Test indexed range query
Benchmark.bm do |x|
  x.report("Indexed range:") {
    Event.with_path_prefix(prefix_ids).count
  }

  x.report("LIKE query:") {
    Event.where("request_segment_ids LIKE ?", "[1,2,%").count
  }
end

# Range query should be 10-100x faster
```

## Conclusion

Path segment normalization with JSON array storage provides:

✅ **Significant storage savings** (50%+ compression)
✅ **Fast prefix queries** using standard B-tree indexes
✅ **Analytics-friendly** with usage tracking and pattern detection
✅ **Rails-native** using built-in serialization
✅ **Scalable** to millions of events with O(log n) lookups

The key insight: **Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently**, avoiding the need for complex full-text search or custom indexing strategies.

---

**Related Documentation:**
- [Event Ingestion](./event-ingestion.md) (TODO)
- [WAF Rule Engine](./rule-engine.md) (TODO)
- [Analytics Architecture](./analytics.md) (TODO)