mirror of
https://github.com/czlonkowski/n8n-mcp.git
synced 2026-03-18 16:33:13 +00:00
Phase 1 - Lazy Session Restoration (REQ-1, REQ-2, REQ-8): - Add onSessionNotFound hook for restoring sessions from external storage - Implement idempotent session creation to prevent race conditions - Add session ID validation for security (prevent injection attacks) - Comprehensive error handling (400/408/500 status codes) - 13 integration tests covering all scenarios Phase 2 - Session Management API (REQ-5): - getActiveSessions(): Get all active session IDs - getSessionState(sessionId): Get session state for persistence - getAllSessionStates(): Bulk session state retrieval - restoreSession(sessionId, context): Manual session restoration - deleteSession(sessionId): Manual session termination - 21 unit tests covering all API methods Benefits: - Sessions survive container restarts - Horizontal scaling support (no session stickiness needed) - Zero-downtime deployments - 100% backwards compatible Implementation Details: - Backend methods in http-server-single-session.ts - Public API methods in mcp-engine.ts - SessionState type exported from index.ts - Synchronous session creation and deletion for reliable testing - Version updated from 2.18.10 to 2.19.0 Tests: 34 passing (13 integration + 21 unit) Coverage: Full API coverage with edge cases Security: Session ID validation prevents SQL/NoSQL injection and path traversal 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
962 lines
31 KiB
Markdown
962 lines
31 KiB
Markdown
# n8n-MCP Telemetry Database Pruning Strategy
|
|
|
|
**Analysis Date:** 2025-10-10
|
|
**Current Database Size:** 265 MB (telemetry_events: 199 MB, telemetry_workflows: 66 MB)
|
|
**Free Tier Limit:** 500 MB
|
|
**Projected 4-Week Size:** 609 MB (exceeds limit by 109 MB)
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
**Critical Finding:** At current growth rate (56.75% of data from last 7 days), we will exceed the 500 MB free tier limit in approximately 2 weeks. Implementing a 7-day retention policy can immediately save 36.5 MB (37.6%) and prevent database overflow.
|
|
|
|
**Key Insights:**
|
|
- 641,487 event records consuming 199 MB
|
|
- 17,247 workflow records consuming 66 MB
|
|
- Daily growth rate: ~7-8 MB/day for events
|
|
- 43.25% of data is older than 7 days but provides diminishing value
|
|
|
|
**Immediate Action Required:** Implement automated pruning to maintain database under 500 MB.
|
|
|
|
---
|
|
|
|
## 1. Current State Assessment
|
|
|
|
### Database Size and Distribution
|
|
|
|
| Table | Rows | Current Size | Growth Rate | Bytes/Row |
|
|
|-------|------|--------------|-------------|-----------|
|
|
| telemetry_events | 641,487 | 199 MB | 56.66% from last 7d | 325 |
|
|
| telemetry_workflows | 17,247 | 66 MB | 60.09% from last 7d | 4,013 |
|
|
| **TOTAL** | **658,734** | **265 MB** | **56.75% from last 7d** | **403** |
|
|
|
|
### Event Type Distribution
|
|
|
|
| Event Type | Count | % of Total | Storage | Avg Props Size | Oldest Event |
|
|
|------------|-------|-----------|---------|----------------|--------------|
|
|
| tool_sequence | 362,170 | 56.4% | 67 MB | 194 bytes | 2025-09-26 |
|
|
| tool_used | 191,659 | 29.9% | 14 MB | 77 bytes | 2025-09-26 |
|
|
| validation_details | 36,266 | 5.7% | 11 MB | 329 bytes | 2025-09-26 |
|
|
| workflow_created | 23,151 | 3.6% | 2.6 MB | 115 bytes | 2025-09-26 |
|
|
| session_start | 12,575 | 2.0% | 1.2 MB | 101 bytes | 2025-09-26 |
|
|
| workflow_validation_failed | 9,739 | 1.5% | 314 KB | 33 bytes | 2025-09-26 |
|
|
| error_occurred | 4,935 | 0.8% | 626 KB | 130 bytes | 2025-09-26 |
|
|
| search_query | 974 | 0.2% | 106 KB | 112 bytes | 2025-09-26 |
|
|
| Other | 18 | <0.1% | 5 KB | Various | Recent |
|
|
|
|
### Growth Pattern Analysis
|
|
|
|
**Daily Data Accumulation (Last 15 Days):**
|
|
|
|
| Date | Events/Day | Daily Size | Cumulative Size |
|
|
|------|-----------|------------|-----------------|
|
|
| 2025-10-10 | 28,457 | 4.3 MB | 97 MB |
|
|
| 2025-10-09 | 54,717 | 8.2 MB | 93 MB |
|
|
| 2025-10-08 | 52,901 | 7.9 MB | 85 MB |
|
|
| 2025-10-07 | 52,538 | 8.1 MB | 77 MB |
|
|
| 2025-10-06 | 51,401 | 7.8 MB | 69 MB |
|
|
| 2025-10-05 | 50,528 | 7.9 MB | 61 MB |
|
|
|
|
**Average Daily Growth:** ~7.7 MB/day
|
|
**Weekly Growth:** ~54 MB/week
|
|
**Projected to hit 500 MB limit:** ~17 days (late October 2025)
|
|
|
|
### Workflow Data Distribution
|
|
|
|
| Complexity | Count | % | Avg Nodes | Avg JSON Size | Estimated Size |
|
|
|-----------|-------|---|-----------|---------------|----------------|
|
|
| Simple | 12,923 | 77.6% | 5.48 | 2,122 bytes | 20 MB |
|
|
| Medium | 3,708 | 22.3% | 13.93 | 4,458 bytes | 12 MB |
|
|
| Complex | 616 | 0.1% | 26.62 | 7,909 bytes | 3.2 MB |
|
|
|
|
**Key Finding:** No duplicate workflow hashes found - each workflow is unique (good data quality).
|
|
|
|
---
|
|
|
|
## 2. Data Value Classification
|
|
|
|
### TIER 1: Critical - Keep Indefinitely
|
|
|
|
**Error Patterns (error_occurred)**
|
|
- **Why:** Essential for identifying systemic issues and regression detection
|
|
- **Volume:** 4,935 events (626 KB)
|
|
- **Recommendation:** Keep all errors with aggregated summaries for older data
|
|
- **Retention:** Detailed errors 30 days, aggregated stats indefinitely
|
|
|
|
**Tool Usage Statistics (Aggregated)**
|
|
- **Why:** Product analytics and feature prioritization
|
|
- **Recommendation:** Aggregate daily/weekly summaries after 14 days
|
|
- **Keep:** Summary tables with tool usage counts, success rates, avg duration
|
|
|
|
### TIER 2: High Value - Keep 30 Days
|
|
|
|
**Validation Details (validation_details)**
|
|
- **Current:** 36,266 events, 11 MB, avg 329 bytes
|
|
- **Why:** Important for understanding validation issues during current development cycle
|
|
- **Value Period:** 30 days (covers current version development)
|
|
- **After 30d:** Aggregate to summary stats (validation success rate by node type)
|
|
|
|
**Workflow Creation Patterns (workflow_created)**
|
|
- **Current:** 23,151 events, 2.6 MB
|
|
- **Why:** Track feature adoption and workflow patterns
|
|
- **Value Period:** 30 days for detailed analysis
|
|
- **After 30d:** Keep aggregated metrics only
|
|
|
|
### TIER 3: Medium Value - Keep 14 Days
|
|
|
|
**Session Data (session_start)**
|
|
- **Current:** 12,575 events, 1.2 MB
|
|
- **Why:** User engagement tracking
|
|
- **Value Period:** 14 days sufficient for engagement analysis
|
|
- **Pruning Impact:** 497 KB saved (40% reduction)
|
|
|
|
**Workflow Validation Failures (workflow_validation_failed)**
|
|
- **Current:** 9,739 events, 314 KB
|
|
- **Why:** Tracks validation patterns but less detailed than validation_details
|
|
- **Value Period:** 14 days
|
|
- **Pruning Impact:** 170 KB saved (54% reduction)
|
|
|
|
### TIER 4: Short-Term Value - Keep 7 Days
|
|
|
|
**Tool Sequences (tool_sequence)**
|
|
- **Current:** 362,170 events, 67 MB (largest table!)
|
|
- **Why:** Tracks multi-tool workflows but extremely high volume
|
|
- **Value Period:** 7 days for recent pattern analysis
|
|
- **Pruning Impact:** 29 MB saved (43% reduction) - HIGHEST IMPACT
|
|
- **Rationale:** Tool usage patterns stabilize quickly; older sequences provide diminishing returns
|
|
|
|
**Tool Usage Events (tool_used)**
|
|
- **Current:** 191,659 events, 14 MB
|
|
- **Why:** Individual tool executions - can be aggregated
|
|
- **Value Period:** 7 days detailed, then aggregate
|
|
- **Pruning Impact:** 6.2 MB saved (44% reduction)
|
|
|
|
**Search Queries (search_query)**
|
|
- **Current:** 974 events, 106 KB
|
|
- **Why:** Low volume, useful for understanding search patterns
|
|
- **Value Period:** 7 days sufficient
|
|
- **Pruning Impact:** Minimal (~1 KB)
|
|
|
|
### TIER 5: Ephemeral - Keep 3 Days
|
|
|
|
**Diagnostic/Health Checks (diagnostic_completed, health_check_completed)**
|
|
- **Current:** 17 events, ~2.5 KB
|
|
- **Why:** Operational health checks, only current state matters
|
|
- **Value Period:** 3 days
|
|
- **Pruning Impact:** Negligible but good hygiene
|
|
|
|
### Workflow Data Retention Strategy
|
|
|
|
**telemetry_workflows Table (66 MB):**
|
|
- **Simple workflows (5-6 nodes):** Keep 7 days → Save 11 MB
|
|
- **Medium workflows (13-14 nodes):** Keep 14 days → Save 6.7 MB
|
|
- **Complex workflows (26+ nodes):** Keep 30 days → Save 1.9 MB
|
|
- **Total Workflow Savings:** 19.6 MB with tiered retention
|
|
|
|
**Rationale:** Complex workflows are rarer and more valuable for understanding advanced use cases.
|
|
|
|
---
|
|
|
|
## 3. Pruning Recommendations with Space Savings
|
|
|
|
### Strategy A: Conservative 14-Day Retention (Recommended for Initial Implementation)
|
|
|
|
| Action | Records Deleted | Space Saved | Risk Level |
|
|
|--------|----------------|-------------|------------|
|
|
| Delete tool_sequence > 14d | 0 | 0 MB | None - all recent |
|
|
| Delete tool_used > 14d | 0 | 0 MB | None - all recent |
|
|
| Delete validation_details > 14d | 4,259 | 1.2 MB | Low |
|
|
| Delete session_start > 14d | 0 | 0 MB | None - all recent |
|
|
| Delete workflows > 14d | 1 | <1 KB | None |
|
|
| **TOTAL** | **4,260** | **1.2 MB** | **Low** |
|
|
|
|
**Assessment:** Minimal immediate impact but data is too recent. Not sufficient to prevent overflow.
|
|
|
|
### Strategy B: Aggressive 7-Day Retention (RECOMMENDED)
|
|
|
|
| Action | Records Deleted | Space Saved | Risk Level |
|
|
|--------|----------------|-------------|------------|
|
|
| Delete tool_sequence > 7d | 155,389 | 29 MB | Low - pattern data |
|
|
| Delete tool_used > 7d | 82,827 | 6.2 MB | Low - usage metrics |
|
|
| Delete validation_details > 7d | 17,465 | 5.4 MB | Medium - debugging data |
|
|
| Delete workflow_created > 7d | 9,106 | 1.0 MB | Low - creation events |
|
|
| Delete session_start > 7d | 5,664 | 497 KB | Low - session data |
|
|
| Delete error_occurred > 7d | 2,321 | 206 KB | Medium - error history |
|
|
| Delete workflow_validation_failed > 7d | 5,269 | 170 KB | Low - validation events |
|
|
| Delete workflows > 7d (simple) | 5,146 | 11 MB | Low - simple workflows |
|
|
| Delete workflows > 7d (medium) | 1,506 | 6.7 MB | Medium - medium workflows |
|
|
| Delete workflows > 7d (complex) | 231 | 1.9 MB | High - complex workflows |
|
|
| **TOTAL** | **284,924** | **62.1 MB** | **Medium** |
|
|
|
|
**New Database Size:** 265 MB - 62.1 MB = **202.9 MB (76.6% of limit)**
|
|
**Buffer:** 297 MB remaining (~38 days at current growth rate)
|
|
|
|
### Strategy C: Hybrid Tiered Retention (OPTIMAL LONG-TERM)
|
|
|
|
| Event Type | Retention Period | Records Deleted | Space Saved |
|
|
|-----------|------------------|----------------|-------------|
|
|
| tool_sequence | 7 days | 155,389 | 29 MB |
|
|
| tool_used | 7 days | 82,827 | 6.2 MB |
|
|
| validation_details | 14 days | 4,259 | 1.2 MB |
|
|
| workflow_created | 14 days | 3 | <1 KB |
|
|
| session_start | 7 days | 5,664 | 497 KB |
|
|
| error_occurred | 30 days (keep all) | 0 | 0 MB |
|
|
| workflow_validation_failed | 7 days | 5,269 | 170 KB |
|
|
| search_query | 7 days | 10 | 1 KB |
|
|
| Workflows (simple) | 7 days | 5,146 | 11 MB |
|
|
| Workflows (medium) | 14 days | 0 | 0 MB |
|
|
| Workflows (complex) | 30 days (keep all) | 0 | 0 MB |
|
|
| **TOTAL** | **Various** | **258,567** | **48.1 MB** |
|
|
|
|
**New Database Size:** 265 MB - 48.1 MB = **216.9 MB (82% of limit)**
|
|
**Buffer:** 283 MB remaining (~36 days at current growth rate)
|
|
|
|
---
|
|
|
|
## 4. Additional Optimization Opportunities
|
|
|
|
### Optimization 1: Properties Field Compression
|
|
|
|
**Finding:** validation_details events have bloated properties (avg 329 bytes, max 9 KB)
|
|
|
|
```sql
|
|
-- Identify large validation_details records
|
|
SELECT id, user_id, created_at, pg_column_size(properties) as size_bytes
|
|
FROM telemetry_events
|
|
WHERE event = 'validation_details'
|
|
AND pg_column_size(properties) > 1000
|
|
ORDER BY size_bytes DESC;
|
|
-- Result: 417 records > 1KB, 2 records > 5KB
|
|
```
|
|
|
|
**Recommendation:** Truncate verbose error messages in validation_details after 7 days
|
|
- Keep error types and counts
|
|
- Remove full stack traces and detailed messages
|
|
- Estimated savings: 2-3 MB
|
|
|
|
### Optimization 2: Remove Redundant tool_sequence Data
|
|
|
|
**Finding:** tool_sequence properties contain mostly null values
|
|
|
|
```sql
|
|
-- Analysis shows all tool_sequence.properties->>'tools' are null
|
|
-- 362,170 records storing null in properties field
|
|
```
|
|
|
|
**Recommendation:**
|
|
1. Investigate why tool_sequence properties are empty
|
|
2. If by design, reduce properties field size or use a flag
|
|
3. Potential savings: 10-15 MB if properties field is eliminated
|
|
|
|
### Optimization 3: Workflow Deduplication by Hash
|
|
|
|
**Finding:** No duplicate workflow_hash values found (good!)
|
|
|
|
**Recommendation:** Continue using workflow_hash for future deduplication if needed. No action required.
|
|
|
|
### Optimization 4: Dead Row Cleanup
|
|
|
|
**Finding:** telemetry_workflows has 1,591 dead rows (9.5% overhead)
|
|
|
|
```sql
|
|
-- Run VACUUM to reclaim space
|
|
VACUUM FULL telemetry_workflows;
|
|
-- Expected savings: ~6-7 MB
|
|
```
|
|
|
|
**Recommendation:** Schedule weekly VACUUM operations
|
|
|
|
### Optimization 5: Index Optimization
|
|
|
|
**Current indexes consume space but improve query performance**
|
|
|
|
```sql
|
|
-- Check index sizes
|
|
SELECT
|
|
schemaname, tablename, indexname,
|
|
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
|
|
FROM pg_stat_user_indexes
|
|
WHERE schemaname = 'public'
|
|
ORDER BY pg_relation_size(indexrelid) DESC;
|
|
```
|
|
|
|
**Recommendation:** Review if all indexes are necessary after pruning strategy is implemented
|
|
|
|
---
|
|
|
|
## 5. Implementation Strategy
|
|
|
|
### Phase 1: Immediate Emergency Pruning (Day 1)
|
|
|
|
**Goal:** Free up 60+ MB immediately to prevent overflow
|
|
|
|
```sql
|
|
-- EMERGENCY PRUNING: Delete data older than 7 days
|
|
BEGIN;
|
|
|
|
-- Backup count before deletion
|
|
SELECT
|
|
event,
|
|
COUNT(*) FILTER (WHERE created_at < NOW() - INTERVAL '7 days') as to_delete
|
|
FROM telemetry_events
|
|
GROUP BY event;
|
|
|
|
-- Delete old events
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '7 days';
|
|
-- Expected: ~278,051 rows deleted, ~36.5 MB saved
|
|
|
|
-- Delete old simple workflows
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '7 days'
|
|
AND complexity = 'simple';
|
|
-- Expected: ~5,146 rows deleted, ~11 MB saved
|
|
|
|
-- Verify new size
|
|
SELECT
|
|
schemaname, relname,
|
|
pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) AS size
|
|
FROM pg_stat_user_tables
|
|
WHERE schemaname = 'public';
|
|
|
|
COMMIT;
|
|
|
|
-- Clean up dead rows
|
|
VACUUM FULL telemetry_events;
|
|
VACUUM FULL telemetry_workflows;
|
|
```
|
|
|
|
**Expected Result:** Database size reduced to ~210-220 MB (55-60% buffer remaining)
|
|
|
|
### Phase 2: Implement Automated Retention Policy (Week 1)
|
|
|
|
**Create a scheduled Supabase Edge Function or pg_cron job**
|
|
|
|
```sql
|
|
-- Create retention policy function
|
|
CREATE OR REPLACE FUNCTION apply_retention_policy()
|
|
RETURNS void AS $$
|
|
BEGIN
|
|
-- Tier 4: 7-day retention for high-volume events
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '7 days'
|
|
AND event IN ('tool_sequence', 'tool_used', 'session_start',
|
|
'workflow_validation_failed', 'search_query');
|
|
|
|
-- Tier 3: 14-day retention for medium-value events
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '14 days'
|
|
AND event IN ('validation_details', 'workflow_created');
|
|
|
|
-- Tier 1: 30-day retention for errors (keep longer)
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '30 days'
|
|
AND event = 'error_occurred';
|
|
|
|
-- Workflow retention by complexity
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '7 days'
|
|
AND complexity = 'simple';
|
|
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '14 days'
|
|
AND complexity = 'medium';
|
|
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '30 days'
|
|
AND complexity = 'complex';
|
|
|
|
-- Cleanup
|
|
VACUUM telemetry_events;
|
|
VACUUM telemetry_workflows;
|
|
END;
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
-- Schedule daily execution (using pg_cron extension)
|
|
SELECT cron.schedule('retention-policy', '0 2 * * *', 'SELECT apply_retention_policy()');
|
|
```
|
|
|
|
### Phase 3: Create Aggregation Tables (Week 2)
|
|
|
|
**Preserve insights while deleting raw data**
|
|
|
|
```sql
|
|
-- Daily tool usage summary
|
|
CREATE TABLE IF NOT EXISTS telemetry_daily_tool_stats (
|
|
date DATE NOT NULL,
|
|
tool TEXT NOT NULL,
|
|
usage_count INTEGER NOT NULL,
|
|
unique_users INTEGER NOT NULL,
|
|
avg_duration_ms NUMERIC,
|
|
error_count INTEGER DEFAULT 0,
|
|
created_at TIMESTAMPTZ DEFAULT NOW(),
|
|
PRIMARY KEY (date, tool)
|
|
);
|
|
|
|
-- Daily validation summary
|
|
CREATE TABLE IF NOT EXISTS telemetry_daily_validation_stats (
|
|
date DATE NOT NULL,
|
|
node_type TEXT,
|
|
total_validations INTEGER NOT NULL,
|
|
failed_validations INTEGER NOT NULL,
|
|
success_rate NUMERIC,
|
|
common_errors JSONB,
|
|
created_at TIMESTAMPTZ DEFAULT NOW(),
|
|
PRIMARY KEY (date, node_type)
|
|
);
|
|
|
|
-- Aggregate function to run before pruning
|
|
CREATE OR REPLACE FUNCTION aggregate_before_pruning()
|
|
RETURNS void AS $$
|
|
BEGIN
|
|
-- Aggregate tool usage for data about to be deleted
|
|
INSERT INTO telemetry_daily_tool_stats (date, tool, usage_count, unique_users, avg_duration_ms)
|
|
SELECT
|
|
DATE(created_at) as date,
|
|
properties->>'tool' as tool,
|
|
COUNT(*) as usage_count,
|
|
COUNT(DISTINCT user_id) as unique_users,
|
|
AVG((properties->>'duration')::numeric) as avg_duration_ms
|
|
FROM telemetry_events
|
|
WHERE event = 'tool_used'
|
|
AND created_at < NOW() - INTERVAL '7 days'
|
|
AND created_at >= NOW() - INTERVAL '8 days'
|
|
GROUP BY DATE(created_at), properties->>'tool'
|
|
ON CONFLICT (date, tool) DO NOTHING;
|
|
|
|
-- Aggregate validation stats
|
|
INSERT INTO telemetry_daily_validation_stats (date, node_type, total_validations, failed_validations)
|
|
SELECT
|
|
DATE(created_at) as date,
|
|
properties->>'nodeType' as node_type,
|
|
COUNT(*) as total_validations,
|
|
COUNT(*) FILTER (WHERE properties->>'valid' = 'false') as failed_validations
|
|
FROM telemetry_events
|
|
WHERE event = 'validation_details'
|
|
AND created_at < NOW() - INTERVAL '14 days'
|
|
AND created_at >= NOW() - INTERVAL '15 days'
|
|
GROUP BY DATE(created_at), properties->>'nodeType'
|
|
ON CONFLICT (date, node_type) DO NOTHING;
|
|
END;
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
-- Update cron job to aggregate before pruning
|
|
SELECT cron.schedule('aggregate-then-prune', '0 2 * * *',
|
|
'SELECT aggregate_before_pruning(); SELECT apply_retention_policy();');
|
|
```
|
|
|
|
### Phase 4: Monitoring and Alerting (Week 2)
|
|
|
|
**Create size monitoring function**
|
|
|
|
```sql
|
|
CREATE OR REPLACE FUNCTION check_database_size()
|
|
RETURNS TABLE(
|
|
total_size_mb NUMERIC,
|
|
limit_mb NUMERIC,
|
|
percent_used NUMERIC,
|
|
days_until_full NUMERIC
|
|
) AS $$
|
|
DECLARE
|
|
current_size_bytes BIGINT;
|
|
growth_rate_bytes_per_day NUMERIC;
|
|
BEGIN
|
|
-- Get current size
|
|
SELECT SUM(pg_total_relation_size(schemaname||'.'||relname))
|
|
INTO current_size_bytes
|
|
FROM pg_stat_user_tables
|
|
WHERE schemaname = 'public';
|
|
|
|
-- Calculate 7-day growth rate
|
|
SELECT
|
|
(COUNT(*) FILTER (WHERE created_at >= NOW() - INTERVAL '7 days')) *
|
|
AVG(pg_column_size(properties)) * (1.0/7)
|
|
INTO growth_rate_bytes_per_day
|
|
FROM telemetry_events;
|
|
|
|
RETURN QUERY
|
|
SELECT
|
|
ROUND((current_size_bytes / 1024.0 / 1024.0)::numeric, 2) as total_size_mb,
|
|
500.0 as limit_mb,
|
|
ROUND((current_size_bytes / 1024.0 / 1024.0 / 500.0 * 100)::numeric, 2) as percent_used,
|
|
ROUND((((500.0 * 1024 * 1024) - current_size_bytes) / NULLIF(growth_rate_bytes_per_day, 0))::numeric, 1) as days_until_full;
|
|
END;
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
-- Alert function (integrate with external monitoring)
|
|
CREATE OR REPLACE FUNCTION alert_if_size_critical()
|
|
RETURNS void AS $$
|
|
DECLARE
|
|
size_pct NUMERIC;
|
|
BEGIN
|
|
SELECT percent_used INTO size_pct FROM check_database_size();
|
|
|
|
IF size_pct > 90 THEN
|
|
-- Log critical alert
|
|
INSERT INTO telemetry_events (user_id, event, properties)
|
|
VALUES ('system', 'database_size_critical',
|
|
json_build_object('percent_used', size_pct, 'timestamp', NOW())::jsonb);
|
|
END IF;
|
|
END;
|
|
$$ LANGUAGE plpgsql;
|
|
```
|
|
|
|
---
|
|
|
|
## 6. Priority Order for Implementation
|
|
|
|
### Priority 1: URGENT (Day 1)
|
|
1. **Execute Emergency Pruning** - Delete data older than 7 days
|
|
- Impact: 47.5 MB saved immediately
|
|
- Risk: Low - data already analyzed
|
|
- SQL: Provided in Phase 1
|
|
|
|
### Priority 2: HIGH (Week 1)
|
|
2. **Implement Automated Retention Policy**
|
|
- Impact: Prevents future overflow
|
|
- Risk: Low with proper testing
|
|
- Implementation: Phase 2 function
|
|
|
|
3. **Run VACUUM FULL**
|
|
- Impact: 6-7 MB reclaimed from dead rows
|
|
- Risk: Low but locks tables briefly
|
|
- Command: `VACUUM FULL telemetry_workflows;`
|
|
|
|
### Priority 3: MEDIUM (Week 2)
|
|
4. **Create Aggregation Tables**
|
|
- Impact: Preserves insights, enables longer-term pruning
|
|
- Risk: Low - additive only
|
|
- Implementation: Phase 3 tables and functions
|
|
|
|
5. **Implement Monitoring**
|
|
- Impact: Prevents future surprises
|
|
- Risk: None
|
|
- Implementation: Phase 4 monitoring functions
|
|
|
|
### Priority 4: LOW (Month 1)
|
|
6. **Optimize Properties Fields**
|
|
- Impact: 2-3 MB additional savings
|
|
- Risk: Medium - requires code changes
|
|
- Action: Truncate verbose error messages
|
|
|
|
7. **Investigate tool_sequence null properties**
|
|
- Impact: 10-15 MB potential savings
|
|
- Risk: Medium - requires application changes
|
|
- Action: Code review and optimization
|
|
|
|
---
|
|
|
|
## 7. Risk Assessment
|
|
|
|
### Strategy B (7-Day Retention): Risks and Mitigations
|
|
|
|
| Risk | Likelihood | Impact | Mitigation |
|
|
|------|-----------|---------|------------|
|
|
| Loss of debugging data for old issues | Medium | Medium | Keep error_occurred for 30 days; aggregate validation stats |
|
|
| Unable to analyze long-term trends | Low | Low | Implement aggregation tables before pruning |
|
|
| Accidental deletion of critical data | Low | High | Test on staging; implement backups; add rollback capability |
|
|
| Performance impact during deletion | Medium | Low | Run during off-peak hours (2 AM UTC) |
|
|
| VACUUM locks table briefly | Low | Low | Schedule during low-usage window |
|
|
|
|
### Strategy C (Hybrid Tiered): Risks and Mitigations
|
|
|
|
| Risk | Likelihood | Impact | Mitigation |
|
|
|------|-----------|---------|------------|
|
|
| Complex logic leads to bugs | Medium | Medium | Thorough testing; monitoring; gradual rollout |
|
|
| Different retention per event type confusing | Low | Low | Document clearly; add comments in code |
|
|
| Tiered approach still insufficient | Low | High | Monitor growth; adjust retention if needed |
|
|
|
|
---
|
|
|
|
## 8. Monitoring Metrics
|
|
|
|
### Key Metrics to Track Post-Implementation
|
|
|
|
1. **Database Size Trend**
|
|
```sql
|
|
SELECT * FROM check_database_size();
|
|
```
|
|
- Target: Stay under 300 MB (60% of limit)
|
|
- Alert threshold: 90% (450 MB)
|
|
|
|
2. **Daily Growth Rate**
|
|
```sql
|
|
SELECT
|
|
DATE(created_at) as date,
|
|
COUNT(*) as events,
|
|
pg_size_pretty(SUM(pg_column_size(properties))::bigint) as daily_size
|
|
FROM telemetry_events
|
|
WHERE created_at >= NOW() - INTERVAL '7 days'
|
|
GROUP BY DATE(created_at)
|
|
ORDER BY date DESC;
|
|
```
|
|
- Target: < 8 MB/day average
|
|
- Alert threshold: > 12 MB/day sustained
|
|
|
|
3. **Retention Policy Execution**
|
|
```sql
|
|
-- Add logging to retention policy function
|
|
CREATE TABLE retention_policy_log (
|
|
executed_at TIMESTAMPTZ DEFAULT NOW(),
|
|
events_deleted INTEGER,
|
|
workflows_deleted INTEGER,
|
|
space_reclaimed_mb NUMERIC
|
|
);
|
|
```
|
|
- Monitor: Daily successful execution
|
|
- Alert: If job fails or deletes 0 rows unexpectedly
|
|
|
|
4. **Data Availability Check**
|
|
```sql
|
|
-- Ensure sufficient data for analysis
|
|
SELECT
|
|
event,
|
|
COUNT(*) as available_records,
|
|
MIN(created_at) as oldest_record,
|
|
MAX(created_at) as newest_record
|
|
FROM telemetry_events
|
|
GROUP BY event;
|
|
```
|
|
- Target: 7 days of data always available
|
|
- Alert: If oldest_record > 8 days ago (retention policy failing)
|
|
|
|
---
|
|
|
|
## 9. Recommended Action Plan
|
|
|
|
### Immediate Actions (Today)
|
|
|
|
**Step 1:** Execute emergency pruning
|
|
```sql
|
|
-- Backup first (optional but recommended)
|
|
-- Create a copy of current stats
|
|
CREATE TABLE telemetry_events_stats_backup AS
|
|
SELECT event, COUNT(*), MIN(created_at), MAX(created_at)
|
|
FROM telemetry_events
|
|
GROUP BY event;
|
|
|
|
-- Execute pruning
|
|
DELETE FROM telemetry_events WHERE created_at < NOW() - INTERVAL '7 days';
|
|
DELETE FROM telemetry_workflows WHERE created_at < NOW() - INTERVAL '7 days' AND complexity = 'simple';
|
|
VACUUM FULL telemetry_events;
|
|
VACUUM FULL telemetry_workflows;
|
|
```
|
|
|
|
**Step 2:** Verify results
|
|
```sql
|
|
SELECT * FROM check_database_size();
|
|
```
|
|
|
|
**Expected outcome:** Database size ~210-220 MB (58-60% buffer remaining)
|
|
|
|
### Week 1 Actions
|
|
|
|
**Step 3:** Implement automated retention policy
|
|
- Create retention policy function (Phase 2 code)
|
|
- Test function on staging/development environment
|
|
- Schedule daily execution via pg_cron
|
|
|
|
**Step 4:** Set up monitoring
|
|
- Create monitoring functions (Phase 4 code)
|
|
- Configure alerts for size thresholds
|
|
- Document escalation procedures
|
|
|
|
### Week 2 Actions
|
|
|
|
**Step 5:** Create aggregation tables
|
|
- Implement summary tables (Phase 3 code)
|
|
- Backfill historical aggregations if needed
|
|
- Update retention policy to aggregate before pruning
|
|
|
|
**Step 6:** Optimize and tune
|
|
- Review query performance post-pruning
|
|
- Adjust retention periods if needed based on actual usage
|
|
- Document any issues or improvements
|
|
|
|
### Monthly Maintenance
|
|
|
|
**Step 7:** Regular review
|
|
- Monthly review of database growth trends
|
|
- Quarterly review of retention policy effectiveness
|
|
- Adjust retention periods based on product needs
|
|
|
|
---
|
|
|
|
## 10. SQL Execution Scripts
|
|
|
|
### Script 1: Emergency Pruning (Run First)
|
|
|
|
```sql
|
|
-- ============================================
|
|
-- EMERGENCY PRUNING SCRIPT
|
|
-- Expected savings: ~50 MB
|
|
-- Execution time: 2-5 minutes
|
|
-- ============================================
|
|
|
|
BEGIN;
|
|
|
|
-- Create backup of current state
|
|
CREATE TABLE IF NOT EXISTS pruning_audit (
|
|
executed_at TIMESTAMPTZ DEFAULT NOW(),
|
|
action TEXT,
|
|
records_affected INTEGER,
|
|
size_before_mb NUMERIC,
|
|
size_after_mb NUMERIC
|
|
);
|
|
|
|
-- Record size before
|
|
INSERT INTO pruning_audit (action, size_before_mb)
|
|
SELECT 'before_pruning',
|
|
pg_total_relation_size('telemetry_events')::numeric / 1024 / 1024;
|
|
|
|
-- Delete old events (keep last 7 days)
|
|
WITH deleted AS (
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '7 days'
|
|
RETURNING *
|
|
)
|
|
INSERT INTO pruning_audit (action, records_affected)
|
|
SELECT 'delete_events_7d', COUNT(*) FROM deleted;
|
|
|
|
-- Delete old simple workflows (keep last 7 days)
|
|
WITH deleted AS (
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '7 days'
|
|
AND complexity = 'simple'
|
|
RETURNING *
|
|
)
|
|
INSERT INTO pruning_audit (action, records_affected)
|
|
SELECT 'delete_workflows_simple_7d', COUNT(*) FROM deleted;
|
|
|
|
-- Record size after
|
|
UPDATE pruning_audit
|
|
SET size_after_mb = pg_total_relation_size('telemetry_events')::numeric / 1024 / 1024
|
|
WHERE action = 'before_pruning';
|
|
|
|
COMMIT;
|
|
|
|
-- Cleanup dead space
|
|
VACUUM FULL telemetry_events;
|
|
VACUUM FULL telemetry_workflows;
|
|
|
|
-- Verify results
|
|
SELECT * FROM pruning_audit ORDER BY executed_at DESC LIMIT 5;
|
|
SELECT * FROM check_database_size();
|
|
```
|
|
|
|
### Script 2: Create Retention Policy (Run After Testing)
|
|
|
|
```sql
|
|
-- ============================================
|
|
-- AUTOMATED RETENTION POLICY
|
|
-- Schedule: Daily at 2 AM UTC
|
|
-- ============================================
|
|
|
|
CREATE OR REPLACE FUNCTION apply_retention_policy()
|
|
RETURNS TABLE(
|
|
action TEXT,
|
|
records_deleted INTEGER,
|
|
execution_time_ms INTEGER
|
|
) AS $$
|
|
DECLARE
|
|
start_time TIMESTAMPTZ;
|
|
end_time TIMESTAMPTZ;
|
|
deleted_count INTEGER;
|
|
BEGIN
|
|
-- Tier 4: 7-day retention (high volume, low long-term value)
|
|
start_time := clock_timestamp();
|
|
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '7 days'
|
|
AND event IN ('tool_sequence', 'tool_used', 'session_start',
|
|
'workflow_validation_failed', 'search_query');
|
|
GET DIAGNOSTICS deleted_count = ROW_COUNT;
|
|
|
|
end_time := clock_timestamp();
|
|
action := 'delete_tier4_7d';
|
|
records_deleted := deleted_count;
|
|
execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
|
|
RETURN NEXT;
|
|
|
|
-- Tier 3: 14-day retention (medium value)
|
|
start_time := clock_timestamp();
|
|
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '14 days'
|
|
AND event IN ('validation_details', 'workflow_created');
|
|
GET DIAGNOSTICS deleted_count = ROW_COUNT;
|
|
|
|
end_time := clock_timestamp();
|
|
action := 'delete_tier3_14d';
|
|
records_deleted := deleted_count;
|
|
execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
|
|
RETURN NEXT;
|
|
|
|
-- Tier 1: 30-day retention (errors - keep longer)
|
|
start_time := clock_timestamp();
|
|
|
|
DELETE FROM telemetry_events
|
|
WHERE created_at < NOW() - INTERVAL '30 days'
|
|
AND event = 'error_occurred';
|
|
GET DIAGNOSTICS deleted_count = ROW_COUNT;
|
|
|
|
end_time := clock_timestamp();
|
|
action := 'delete_errors_30d';
|
|
records_deleted := deleted_count;
|
|
execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
|
|
RETURN NEXT;
|
|
|
|
-- Workflow pruning by complexity
|
|
start_time := clock_timestamp();
|
|
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '7 days'
|
|
AND complexity = 'simple';
|
|
GET DIAGNOSTICS deleted_count = ROW_COUNT;
|
|
|
|
end_time := clock_timestamp();
|
|
action := 'delete_workflows_simple_7d';
|
|
records_deleted := deleted_count;
|
|
execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
|
|
RETURN NEXT;
|
|
|
|
start_time := clock_timestamp();
|
|
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '14 days'
|
|
AND complexity = 'medium';
|
|
GET DIAGNOSTICS deleted_count = ROW_COUNT;
|
|
|
|
end_time := clock_timestamp();
|
|
action := 'delete_workflows_medium_14d';
|
|
records_deleted := deleted_count;
|
|
execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
|
|
RETURN NEXT;
|
|
|
|
start_time := clock_timestamp();
|
|
|
|
DELETE FROM telemetry_workflows
|
|
WHERE created_at < NOW() - INTERVAL '30 days'
|
|
AND complexity = 'complex';
|
|
GET DIAGNOSTICS deleted_count = ROW_COUNT;
|
|
|
|
end_time := clock_timestamp();
|
|
action := 'delete_workflows_complex_30d';
|
|
records_deleted := deleted_count;
|
|
execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
|
|
RETURN NEXT;
|
|
|
|
-- Vacuum to reclaim space
|
|
start_time := clock_timestamp();
|
|
VACUUM telemetry_events;
|
|
VACUUM telemetry_workflows;
|
|
end_time := clock_timestamp();
|
|
|
|
action := 'vacuum_tables';
|
|
records_deleted := 0;
|
|
execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
|
|
RETURN NEXT;
|
|
END;
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
-- Test the function (dry run - won't schedule yet)
|
|
SELECT * FROM apply_retention_policy();
|
|
|
|
-- After testing, schedule with pg_cron
|
|
-- Requires pg_cron extension: CREATE EXTENSION IF NOT EXISTS pg_cron;
|
|
-- SELECT cron.schedule('retention-policy', '0 2 * * *', 'SELECT apply_retention_policy()');
|
|
```
|
|
|
|
### Script 3: Create Monitoring Dashboard
|
|
|
|
```sql
|
|
-- ============================================
|
|
-- MONITORING QUERIES
|
|
-- Run these regularly to track database health
|
|
-- ============================================
|
|
|
|
-- Query 1: Current database size and projections
|
|
SELECT
|
|
'Current Size' as metric,
|
|
pg_size_pretty(SUM(pg_total_relation_size(schemaname||'.'||relname))) as value
|
|
FROM pg_stat_user_tables
|
|
WHERE schemaname = 'public'
|
|
UNION ALL
|
|
SELECT
|
|
'Free Tier Limit' as metric,
|
|
'500 MB' as value
|
|
UNION ALL
|
|
SELECT
|
|
'Percent Used' as metric,
|
|
CONCAT(
|
|
ROUND(
|
|
(SUM(pg_total_relation_size(schemaname||'.'||relname))::numeric /
|
|
(500.0 * 1024 * 1024) * 100),
|
|
2
|
|
),
|
|
'%'
|
|
) as value
|
|
FROM pg_stat_user_tables
|
|
WHERE schemaname = 'public';
|
|
|
|
-- Query 2: Data age distribution
|
|
SELECT
|
|
event,
|
|
COUNT(*) as total_records,
|
|
MIN(created_at) as oldest_record,
|
|
MAX(created_at) as newest_record,
|
|
ROUND(EXTRACT(EPOCH FROM (MAX(created_at) - MIN(created_at))) / 86400, 2) as age_days
|
|
FROM telemetry_events
|
|
GROUP BY event
|
|
ORDER BY total_records DESC;
|
|
|
|
-- Query 3: Daily growth tracking (last 7 days)
|
|
SELECT
|
|
DATE(created_at) as date,
|
|
COUNT(*) as daily_events,
|
|
pg_size_pretty(SUM(pg_column_size(properties))::bigint) as daily_data_size,
|
|
COUNT(DISTINCT user_id) as active_users
|
|
FROM telemetry_events
|
|
WHERE created_at >= NOW() - INTERVAL '7 days'
|
|
GROUP BY DATE(created_at)
|
|
ORDER BY date DESC;
|
|
|
|
-- Query 4: Retention policy effectiveness
|
|
SELECT
|
|
DATE(executed_at) as execution_date,
|
|
action,
|
|
records_deleted,
|
|
execution_time_ms
|
|
FROM (
|
|
SELECT * FROM apply_retention_policy()
|
|
) AS policy_run
|
|
ORDER BY execution_date DESC;
|
|
```
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
**Immediate Action Required:** Implement Strategy B (7-day retention) immediately to avoid database overflow within 2 weeks.
|
|
|
|
**Long-Term Strategy:** Transition to Strategy C (Hybrid Tiered Retention) with automated aggregation to balance data preservation with storage constraints.
|
|
|
|
**Expected Outcomes:**
|
|
- Immediate: 50+ MB saved (26% reduction)
|
|
- Ongoing: Database stabilized at 200-220 MB (40-44% of limit)
|
|
- Buffer: 30-40 days before limit with current growth rate
|
|
- Risk: Low with proper testing and monitoring
|
|
|
|
**Success Metrics:**
|
|
1. Database size < 300 MB consistently
|
|
2. 7+ days of detailed event data always available
|
|
3. No impact on product analytics capabilities
|
|
4. Automated retention policy runs daily without errors
|
|
|
|
---
|
|
|
|
**Analysis completed:** 2025-10-10
|
|
**Next review date:** 2025-11-10 (monthly check)
|
|
**Escalation:** If database exceeds 400 MB, consider upgrading to paid tier or implementing more aggressive pruning
|