n8n-MCP Telemetry Database Pruning Strategy

Analysis Date: 2025-10-10
Current Database Size: 265 MB (telemetry_events: 199 MB, telemetry_workflows: 66 MB)
Free Tier Limit: 500 MB
Projected 4-Week Size: 609 MB (exceeds limit by 109 MB)


Executive Summary

Critical Finding: At the current growth rate (56.75% of all data arrived in the last 7 days), we will exceed the 500 MB free tier limit in approximately 2 weeks. Implementing a 7-day retention policy can immediately save ~47.5 MB (36.5 MB of old events plus ~11 MB of old simple workflows) and prevent database overflow.

Key Insights:

  • 641,487 event records consuming 199 MB
  • 17,247 workflow records consuming 66 MB
  • Daily growth rate: ~7-8 MB/day for events
  • 43.25% of data is older than 7 days but provides diminishing value

Immediate Action Required: Implement automated pruning to maintain database under 500 MB.


1. Current State Assessment

Database Size and Distribution

| Table | Rows | Current Size | Growth Rate | Bytes/Row |
|---|---|---|---|---|
| telemetry_events | 641,487 | 199 MB | 56.66% from last 7d | 325 |
| telemetry_workflows | 17,247 | 66 MB | 60.09% from last 7d | 4,013 |
| TOTAL | 658,734 | 265 MB | 56.75% from last 7d | 403 |

Event Type Distribution

| Event Type | Count | % of Total | Storage | Avg Props Size | Oldest Event |
|---|---|---|---|---|---|
| tool_sequence | 362,170 | 56.4% | 67 MB | 194 bytes | 2025-09-26 |
| tool_used | 191,659 | 29.9% | 14 MB | 77 bytes | 2025-09-26 |
| validation_details | 36,266 | 5.7% | 11 MB | 329 bytes | 2025-09-26 |
| workflow_created | 23,151 | 3.6% | 2.6 MB | 115 bytes | 2025-09-26 |
| session_start | 12,575 | 2.0% | 1.2 MB | 101 bytes | 2025-09-26 |
| workflow_validation_failed | 9,739 | 1.5% | 314 KB | 33 bytes | 2025-09-26 |
| error_occurred | 4,935 | 0.8% | 626 KB | 130 bytes | 2025-09-26 |
| search_query | 974 | 0.2% | 106 KB | 112 bytes | 2025-09-26 |
| Other | 18 | <0.1% | 5 KB | Various | Recent |

Growth Pattern Analysis

Daily Data Accumulation (Last 15 Days):

| Date | Events/Day | Daily Size | Cumulative Size |
|---|---|---|---|
| 2025-10-10 | 28,457 | 4.3 MB | 97 MB |
| 2025-10-09 | 54,717 | 8.2 MB | 93 MB |
| 2025-10-08 | 52,901 | 7.9 MB | 85 MB |
| 2025-10-07 | 52,538 | 8.1 MB | 77 MB |
| 2025-10-06 | 51,401 | 7.8 MB | 69 MB |
| 2025-10-05 | 50,528 | 7.9 MB | 61 MB |

Average Daily Growth: ~7.7 MB/day (events)
Weekly Growth: ~54 MB/week
Projected to hit 500 MB limit: ~17 days (late October 2025, including workflow table growth)
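As a back-of-the-envelope check on the ~17-day figure, the headroom divided by the combined growth rate (events plus the workflow-table growth inferred from "60.09% of 66 MB from the last 7 days") can be sketched in Python; the constants come from this analysis, the function itself is illustrative:

```python
# Rough projection of days of headroom before the free-tier limit is hit.
LIMIT_MB = 500.0
CURRENT_MB = 265.0
EVENTS_GROWTH_MB_PER_DAY = 7.7
WORKFLOWS_GROWTH_MB_PER_DAY = 66.0 * 0.6009 / 7  # ~5.7 MB/day inferred

def days_until_limit(current_mb: float, growth_mb_per_day: float) -> float:
    """Days left at a constant growth rate before hitting LIMIT_MB."""
    return (LIMIT_MB - current_mb) / growth_mb_per_day

total_growth = EVENTS_GROWTH_MB_PER_DAY + WORKFLOWS_GROWTH_MB_PER_DAY
print(round(days_until_limit(CURRENT_MB, total_growth)))  # prints 18
```

Using the events-only rate instead gives ~30 days, which is why the projection must include the workflow table.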

Workflow Data Distribution

| Complexity | Count | % | Avg Nodes | Avg JSON Size | Estimated Size |
|---|---|---|---|---|---|
| Simple | 12,923 | 74.9% | 5.48 | 2,122 bytes | 20 MB |
| Medium | 3,708 | 21.5% | 13.93 | 4,458 bytes | 12 MB |
| Complex | 616 | 3.6% | 26.62 | 7,909 bytes | 3.2 MB |

Key Finding: No duplicate workflow hashes found - each workflow is unique (good data quality).


2. Data Value Classification

TIER 1: Critical - Keep Indefinitely

Error Patterns (error_occurred)

  • Why: Essential for identifying systemic issues and regression detection
  • Volume: 4,935 events (626 KB)
  • Recommendation: Keep all errors with aggregated summaries for older data
  • Retention: Detailed errors 30 days, aggregated stats indefinitely

Tool Usage Statistics (Aggregated)

  • Why: Product analytics and feature prioritization
  • Recommendation: Aggregate daily/weekly summaries after 14 days
  • Keep: Summary tables with tool usage counts, success rates, avg duration

TIER 2: High Value - Keep 30 Days

Validation Details (validation_details)

  • Current: 36,266 events, 11 MB, avg 329 bytes
  • Why: Important for understanding validation issues during current development cycle
  • Value Period: 30 days (covers current version development)
  • After 30d: Aggregate to summary stats (validation success rate by node type)

Workflow Creation Patterns (workflow_created)

  • Current: 23,151 events, 2.6 MB
  • Why: Track feature adoption and workflow patterns
  • Value Period: 30 days for detailed analysis
  • After 30d: Keep aggregated metrics only

TIER 3: Medium Value - Keep 14 Days

Session Data (session_start)

  • Current: 12,575 events, 1.2 MB
  • Why: User engagement tracking
  • Value Period: 14 days sufficient for engagement analysis
  • Pruning Impact: 497 KB saved (40% reduction)

Workflow Validation Failures (workflow_validation_failed)

  • Current: 9,739 events, 314 KB
  • Why: Tracks validation patterns but less detailed than validation_details
  • Value Period: 14 days
  • Pruning Impact: 170 KB saved (54% reduction)

TIER 4: Short-Term Value - Keep 7 Days

Tool Sequences (tool_sequence)

  • Current: 362,170 events, 67 MB (largest table!)
  • Why: Tracks multi-tool workflows but extremely high volume
  • Value Period: 7 days for recent pattern analysis
  • Pruning Impact: 29 MB saved (43% reduction) - HIGHEST IMPACT
  • Rationale: Tool usage patterns stabilize quickly; older sequences provide diminishing returns

Tool Usage Events (tool_used)

  • Current: 191,659 events, 14 MB
  • Why: Individual tool executions - can be aggregated
  • Value Period: 7 days detailed, then aggregate
  • Pruning Impact: 6.2 MB saved (44% reduction)

Search Queries (search_query)

  • Current: 974 events, 106 KB
  • Why: Low volume, useful for understanding search patterns
  • Value Period: 7 days sufficient
  • Pruning Impact: Minimal (~1 KB)

TIER 5: Ephemeral - Keep 3 Days

Diagnostic/Health Checks (diagnostic_completed, health_check_completed)

  • Current: 17 events, ~2.5 KB
  • Why: Operational health checks, only current state matters
  • Value Period: 3 days
  • Pruning Impact: Negligible but good hygiene

Workflow Data Retention Strategy

telemetry_workflows Table (66 MB):

  • Simple workflows (5-6 nodes): Keep 7 days → Save 11 MB
  • Medium workflows (13-14 nodes): Keep 14 days → Save 6.7 MB
  • Complex workflows (26+ nodes): Keep 30 days → Save 1.9 MB
  • Total Workflow Savings: 19.6 MB with tiered retention

Rationale: Complex workflows are rarer and more valuable for understanding advanced use cases.
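The tiered workflow policy above reduces to a small lookup that a pruning job could consume. A minimal Python sketch (retention periods and per-tier savings come from this analysis; the dict layout is illustrative, not an existing module):

```python
# Tiered retention for telemetry_workflows, as described above.
WORKFLOW_RETENTION_DAYS = {"simple": 7, "medium": 14, "complex": 30}
WORKFLOW_SAVINGS_MB = {"simple": 11.0, "medium": 6.7, "complex": 1.9}

def workflow_retention(complexity: str) -> int:
    """Retention period in days for a workflow of the given complexity."""
    return WORKFLOW_RETENTION_DAYS[complexity]

# Total savings should match the 19.6 MB figure above.
total_savings_mb = round(sum(WORKFLOW_SAVINGS_MB.values()), 1)
```

Keeping the tiers in one table like this makes it easy to adjust a single retention period later without touching the deletion logic.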


3. Pruning Recommendations with Space Savings

Strategy A: 14-Day Retention (CONSERVATIVE)

| Action | Records Deleted | Space Saved | Risk Level |
|---|---|---|---|
| Delete tool_sequence > 14d | 0 | 0 MB | None - all recent |
| Delete tool_used > 14d | 0 | 0 MB | None - all recent |
| Delete validation_details > 14d | 4,259 | 1.2 MB | Low |
| Delete session_start > 14d | 0 | 0 MB | None - all recent |
| Delete workflows > 14d | 1 | <1 KB | None |
| TOTAL | 4,260 | 1.2 MB | Low |
Assessment: Almost all data is newer than 14 days, so this strategy frees little space and is not sufficient to prevent overflow.

Strategy B: 7-Day Retention (RECOMMENDED)

| Action | Records Deleted | Space Saved | Risk Level |
|---|---|---|---|
| Delete tool_sequence > 7d | 155,389 | 29 MB | Low - pattern data |
| Delete tool_used > 7d | 82,827 | 6.2 MB | Low - usage metrics |
| Delete validation_details > 7d | 17,465 | 5.4 MB | Medium - debugging data |
| Delete workflow_created > 7d | 9,106 | 1.0 MB | Low - creation events |
| Delete session_start > 7d | 5,664 | 497 KB | Low - session data |
| Delete error_occurred > 7d | 2,321 | 206 KB | Medium - error history |
| Delete workflow_validation_failed > 7d | 5,269 | 170 KB | Low - validation events |
| Delete workflows > 7d (simple) | 5,146 | 11 MB | Low - simple workflows |
| Delete workflows > 7d (medium) | 1,506 | 6.7 MB | Medium - medium workflows |
| Delete workflows > 7d (complex) | 231 | 1.9 MB | High - complex workflows |
| TOTAL | 284,924 | 62.1 MB | Medium |

New Database Size: 265 MB - 62.1 MB = 202.9 MB (40.6% of limit)
Buffer: 297 MB remaining (~38 days at current growth rate)

Strategy C: Hybrid Tiered Retention (OPTIMAL LONG-TERM)

| Event Type | Retention Period | Records Deleted | Space Saved |
|---|---|---|---|
| tool_sequence | 7 days | 155,389 | 29 MB |
| tool_used | 7 days | 82,827 | 6.2 MB |
| validation_details | 14 days | 4,259 | 1.2 MB |
| workflow_created | 14 days | 3 | <1 KB |
| session_start | 7 days | 5,664 | 497 KB |
| error_occurred | 30 days (keep all) | 0 | 0 MB |
| workflow_validation_failed | 7 days | 5,269 | 170 KB |
| search_query | 7 days | 10 | 1 KB |
| Workflows (simple) | 7 days | 5,146 | 11 MB |
| Workflows (medium) | 14 days | 0 | 0 MB |
| Workflows (complex) | 30 days (keep all) | 0 | 0 MB |
| TOTAL | Various | 258,567 | 48.1 MB |

New Database Size: 265 MB - 48.1 MB = 216.9 MB (43.4% of limit)
Buffer: 283 MB remaining (~36 days at current growth rate)
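Strategy C's tiers reduce to a per-event lookup that a pruning job can consume. A minimal sketch (event names and periods mirror the table above; the fallback tier for unknown events is my assumption, not part of the analysis):

```python
# Per-event retention periods for Strategy C, in days.
EVENT_RETENTION_DAYS = {
    "tool_sequence": 7,
    "tool_used": 7,
    "session_start": 7,
    "workflow_validation_failed": 7,
    "search_query": 7,
    "validation_details": 14,
    "workflow_created": 14,
    "error_occurred": 30,
}

def should_prune(event: str, age_days: float) -> bool:
    """True when a record is past its tier's retention period.
    Unknown event types fall back to the shortest tier (assumption)."""
    return age_days > EVENT_RETENTION_DAYS.get(event, 7)
```

Driving the SQL retention function from one table like this keeps the tier logic in a single place, which addresses the "complex logic leads to bugs" risk noted in Section 7.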


4. Additional Optimization Opportunities

Optimization 1: Properties Field Compression

Finding: validation_details events have bloated properties (avg 329 bytes, max 9 KB)

-- Identify large validation_details records
SELECT id, user_id, created_at, pg_column_size(properties) as size_bytes
FROM telemetry_events
WHERE event = 'validation_details'
  AND pg_column_size(properties) > 1000
ORDER BY size_bytes DESC;
-- Result: 417 records > 1KB, 2 records > 5KB

Recommendation: Truncate verbose error messages in validation_details after 7 days

  • Keep error types and counts
  • Remove full stack traces and detailed messages
  • Estimated savings: 2-3 MB
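Truncation could happen either in SQL after 7 days or at the telemetry client before insert. A hedged Python sketch of the client-side variant — the property names (`message`, `stack`, `errorType`) and the 200-character budget are assumptions about the payload, not the confirmed validation_details schema:

```python
MAX_MESSAGE_CHARS = 200  # size budget for retained error text (assumed value)

def compact_validation_properties(props: dict) -> dict:
    """Keep error types and counts, drop the bulky free-text fields."""
    slim = dict(props)
    slim.pop("stack", None)  # full stack traces add the most bulk
    message = slim.get("message")
    if isinstance(message, str) and len(message) > MAX_MESSAGE_CHARS:
        slim["message"] = message[:MAX_MESSAGE_CHARS] + "...[truncated]"
    return slim
```

The same idea expressed in SQL would be a `jsonb` rewrite of rows older than 7 days; either way the error type survives for aggregation while the verbose text is dropped.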

Optimization 2: Remove Redundant tool_sequence Data

Finding: tool_sequence properties contain mostly null values

-- Analysis shows all tool_sequence.properties->>'tools' are null
-- 362,170 records storing null in properties field

Recommendation:

  1. Investigate why tool_sequence properties are empty
  2. If by design, reduce properties field size or use a flag
  3. Potential savings: 10-15 MB if properties field is eliminated

Optimization 3: Workflow Deduplication by Hash

Finding: No duplicate workflow_hash values found (good!)

Recommendation: Continue using workflow_hash for future deduplication if needed. No action required.

Optimization 4: Dead Row Cleanup

Finding: telemetry_workflows has 1,591 dead rows (9.5% overhead)

-- Run VACUUM to reclaim space
VACUUM FULL telemetry_workflows;
-- Expected savings: ~6-7 MB

Recommendation: Schedule weekly VACUUM operations

Optimization 5: Index Optimization

Current indexes consume space but improve query performance

-- Check index sizes
SELECT
    schemaname, tablename, indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

Recommendation: Review if all indexes are necessary after pruning strategy is implemented


5. Implementation Strategy

Phase 1: Immediate Emergency Pruning (Day 1)

Goal: Free up 60+ MB immediately to prevent overflow

-- EMERGENCY PRUNING: Delete data older than 7 days
BEGIN;

-- Backup count before deletion
SELECT
    event,
    COUNT(*) FILTER (WHERE created_at < NOW() - INTERVAL '7 days') as to_delete
FROM telemetry_events
GROUP BY event;

-- Delete old events
DELETE FROM telemetry_events
WHERE created_at < NOW() - INTERVAL '7 days';
-- Expected: ~278,051 rows deleted, ~36.5 MB saved

-- Delete old simple workflows
DELETE FROM telemetry_workflows
WHERE created_at < NOW() - INTERVAL '7 days'
  AND complexity = 'simple';
-- Expected: ~5,146 rows deleted, ~11 MB saved

-- Verify new size
SELECT
    schemaname, relname,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) AS size
FROM pg_stat_user_tables
WHERE schemaname = 'public';

COMMIT;

-- Clean up dead rows
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;

Expected Result: Database size reduced to ~210-220 MB (55-60% buffer remaining)

Phase 2: Implement Automated Retention Policy (Week 1)

Create a scheduled Supabase Edge Function or pg_cron job

-- Create retention policy function
CREATE OR REPLACE FUNCTION apply_retention_policy()
RETURNS void AS $$
BEGIN
    -- Tier 4: 7-day retention for high-volume events
    DELETE FROM telemetry_events
    WHERE created_at < NOW() - INTERVAL '7 days'
      AND event IN ('tool_sequence', 'tool_used', 'session_start',
                     'workflow_validation_failed', 'search_query');

    -- Tier 3: 14-day retention for medium-value events
    DELETE FROM telemetry_events
    WHERE created_at < NOW() - INTERVAL '14 days'
      AND event IN ('validation_details', 'workflow_created');

    -- Tier 1: 30-day retention for errors (keep longer)
    DELETE FROM telemetry_events
    WHERE created_at < NOW() - INTERVAL '30 days'
      AND event = 'error_occurred';

    -- Workflow retention by complexity
    DELETE FROM telemetry_workflows
    WHERE created_at < NOW() - INTERVAL '7 days'
      AND complexity = 'simple';

    DELETE FROM telemetry_workflows
    WHERE created_at < NOW() - INTERVAL '14 days'
      AND complexity = 'medium';

    DELETE FROM telemetry_workflows
    WHERE created_at < NOW() - INTERVAL '30 days'
      AND complexity = 'complex';

    -- NOTE: VACUUM cannot be executed inside a plpgsql function (it must
    -- run outside a transaction block); rely on autovacuum or schedule a
    -- separate pg_cron job for it.
END;
$$ LANGUAGE plpgsql;

-- Schedule daily execution (using pg_cron extension)
SELECT cron.schedule('retention-policy', '0 2 * * *', 'SELECT apply_retention_policy()');

Phase 3: Create Aggregation Tables (Week 2)

Preserve insights while deleting raw data

-- Daily tool usage summary
CREATE TABLE IF NOT EXISTS telemetry_daily_tool_stats (
    date DATE NOT NULL,
    tool TEXT NOT NULL,
    usage_count INTEGER NOT NULL,
    unique_users INTEGER NOT NULL,
    avg_duration_ms NUMERIC,
    error_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (date, tool)
);

-- Daily validation summary
CREATE TABLE IF NOT EXISTS telemetry_daily_validation_stats (
    date DATE NOT NULL,
    node_type TEXT,
    total_validations INTEGER NOT NULL,
    failed_validations INTEGER NOT NULL,
    success_rate NUMERIC,
    common_errors JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (date, node_type)
);

-- Aggregate function to run before pruning
CREATE OR REPLACE FUNCTION aggregate_before_pruning()
RETURNS void AS $$
BEGIN
    -- Aggregate tool usage for data about to be deleted
    INSERT INTO telemetry_daily_tool_stats (date, tool, usage_count, unique_users, avg_duration_ms)
    SELECT
        DATE(created_at) as date,
        properties->>'tool' as tool,
        COUNT(*) as usage_count,
        COUNT(DISTINCT user_id) as unique_users,
        AVG((properties->>'duration')::numeric) as avg_duration_ms
    FROM telemetry_events
    WHERE event = 'tool_used'
      AND created_at < NOW() - INTERVAL '7 days'
      AND created_at >= NOW() - INTERVAL '8 days'
    GROUP BY DATE(created_at), properties->>'tool'
    ON CONFLICT (date, tool) DO NOTHING;

    -- Aggregate validation stats
    INSERT INTO telemetry_daily_validation_stats (date, node_type, total_validations, failed_validations)
    SELECT
        DATE(created_at) as date,
        COALESCE(properties->>'nodeType', 'unknown') as node_type,  -- PK column must not be NULL
        COUNT(*) as total_validations,
        COUNT(*) FILTER (WHERE properties->>'valid' = 'false') as failed_validations
    FROM telemetry_events
    WHERE event = 'validation_details'
      AND created_at < NOW() - INTERVAL '14 days'
      AND created_at >= NOW() - INTERVAL '15 days'
    GROUP BY DATE(created_at), COALESCE(properties->>'nodeType', 'unknown')
    ON CONFLICT (date, node_type) DO NOTHING;
END;
$$ LANGUAGE plpgsql;

-- Replace the earlier cron job so aggregation always runs before pruning
SELECT cron.unschedule('retention-policy');
SELECT cron.schedule('aggregate-then-prune', '0 2 * * *',
    'SELECT aggregate_before_pruning(); SELECT apply_retention_policy();');

Phase 4: Monitoring and Alerting (Week 2)

Create size monitoring function

CREATE OR REPLACE FUNCTION check_database_size()
RETURNS TABLE(
    total_size_mb NUMERIC,
    limit_mb NUMERIC,
    percent_used NUMERIC,
    days_until_full NUMERIC
) AS $$
DECLARE
    current_size_bytes BIGINT;
    growth_rate_bytes_per_day NUMERIC;
BEGIN
    -- Get current size
    SELECT SUM(pg_total_relation_size(schemaname||'.'||relname))
    INTO current_size_bytes
    FROM pg_stat_user_tables
    WHERE schemaname = 'public';

    -- Calculate 7-day growth rate
    SELECT
        (COUNT(*) FILTER (WHERE created_at >= NOW() - INTERVAL '7 days')) *
        AVG(pg_column_size(properties)) * (1.0/7)
    INTO growth_rate_bytes_per_day
    FROM telemetry_events;

    RETURN QUERY
    SELECT
        ROUND((current_size_bytes / 1024.0 / 1024.0)::numeric, 2) as total_size_mb,
        500.0 as limit_mb,
        ROUND((current_size_bytes / 1024.0 / 1024.0 / 500.0 * 100)::numeric, 2) as percent_used,
        ROUND((((500.0 * 1024 * 1024) - current_size_bytes) / NULLIF(growth_rate_bytes_per_day, 0))::numeric, 1) as days_until_full;
END;
$$ LANGUAGE plpgsql;

-- Alert function (integrate with external monitoring)
CREATE OR REPLACE FUNCTION alert_if_size_critical()
RETURNS void AS $$
DECLARE
    size_pct NUMERIC;
BEGIN
    SELECT percent_used INTO size_pct FROM check_database_size();

    IF size_pct > 90 THEN
        -- Log critical alert
        INSERT INTO telemetry_events (user_id, event, properties)
        VALUES ('system', 'database_size_critical',
                json_build_object('percent_used', size_pct, 'timestamp', NOW())::jsonb);
    END IF;
END;
$$ LANGUAGE plpgsql;

6. Priority Order for Implementation

Priority 1: URGENT (Day 1)

  1. Execute Emergency Pruning - Delete data older than 7 days
    • Impact: ~47.5 MB saved immediately (36.5 MB events + 11 MB simple workflows)
    • Risk: Low - data already analyzed
    • SQL: Provided in Phase 1

Priority 2: HIGH (Week 1)

  1. Implement Automated Retention Policy

    • Impact: Prevents future overflow
    • Risk: Low with proper testing
    • Implementation: Phase 2 function
  2. Run VACUUM FULL

    • Impact: 6-7 MB reclaimed from dead rows
    • Risk: Low but locks tables briefly
    • Command: VACUUM FULL telemetry_workflows;

Priority 3: MEDIUM (Week 2)

  1. Create Aggregation Tables

    • Impact: Preserves insights, enables longer-term pruning
    • Risk: Low - additive only
    • Implementation: Phase 3 tables and functions
  2. Implement Monitoring

    • Impact: Prevents future surprises
    • Risk: None
    • Implementation: Phase 4 monitoring functions

Priority 4: LOW (Month 1)

  1. Optimize Properties Fields

    • Impact: 2-3 MB additional savings
    • Risk: Medium - requires code changes
    • Action: Truncate verbose error messages
  2. Investigate tool_sequence null properties

    • Impact: 10-15 MB potential savings
    • Risk: Medium - requires application changes
    • Action: Code review and optimization

7. Risk Assessment

Strategy B (7-Day Retention): Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Loss of debugging data for old issues | Medium | Medium | Keep error_occurred for 30 days; aggregate validation stats |
| Unable to analyze long-term trends | Low | Low | Implement aggregation tables before pruning |
| Accidental deletion of critical data | Low | High | Test on staging; implement backups; add rollback capability |
| Performance impact during deletion | Medium | Low | Run during off-peak hours (2 AM UTC) |
| VACUUM locks table briefly | Low | Low | Schedule during low-usage window |

Strategy C (Hybrid Tiered): Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Complex logic leads to bugs | Medium | Medium | Thorough testing; monitoring; gradual rollout |
| Different retention per event type confusing | Low | Low | Document clearly; add comments in code |
| Tiered approach still insufficient | Low | High | Monitor growth; adjust retention if needed |

8. Monitoring Metrics

Key Metrics to Track Post-Implementation

  1. Database Size Trend

    SELECT * FROM check_database_size();
    
    • Target: Stay under 300 MB (60% of limit)
    • Alert threshold: 90% (450 MB)
  2. Daily Growth Rate

    SELECT
        DATE(created_at) as date,
        COUNT(*) as events,
        pg_size_pretty(SUM(pg_column_size(properties))::bigint) as daily_size
    FROM telemetry_events
    WHERE created_at >= NOW() - INTERVAL '7 days'
    GROUP BY DATE(created_at)
    ORDER BY date DESC;
    
    • Target: < 8 MB/day average
    • Alert threshold: > 12 MB/day sustained
  3. Retention Policy Execution

    -- Add logging to retention policy function
    CREATE TABLE retention_policy_log (
        executed_at TIMESTAMPTZ DEFAULT NOW(),
        events_deleted INTEGER,
        workflows_deleted INTEGER,
        space_reclaimed_mb NUMERIC
    );
    
    • Monitor: Daily successful execution
    • Alert: If job fails or deletes 0 rows unexpectedly
  4. Data Availability Check

    -- Ensure sufficient data for analysis
    SELECT
        event,
        COUNT(*) as available_records,
        MIN(created_at) as oldest_record,
        MAX(created_at) as newest_record
    FROM telemetry_events
    GROUP BY event;
    
    • Target: 7 days of data always available
    • Alert: If oldest_record > 8 days ago (retention policy failing)
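The availability check above can feed a simple drift alarm. A sketch assuming rows of `(event, oldest_record_age_days)` derived from that query, with the 8-day threshold applying to the 7-day tiers (longer-retention events like error_occurred would need their own thresholds):

```python
def retention_alerts(rows, threshold_days: float = 8.0):
    """Events whose oldest surviving record predates the expected window,
    suggesting the retention job is not running."""
    return [event for event, oldest_age_days in rows
            if oldest_age_days > threshold_days]
```

Wiring this into whatever external monitoring consumes the SQL output keeps the "retention policy failing" alert from depending on manual review.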

9. Recommended Action Plan

Immediate Actions (Today)

Step 1: Execute emergency pruning

-- Backup first (optional but recommended)
-- Create a copy of current stats
CREATE TABLE telemetry_events_stats_backup AS
SELECT event, COUNT(*), MIN(created_at), MAX(created_at)
FROM telemetry_events
GROUP BY event;

-- Execute pruning
DELETE FROM telemetry_events WHERE created_at < NOW() - INTERVAL '7 days';
DELETE FROM telemetry_workflows WHERE created_at < NOW() - INTERVAL '7 days' AND complexity = 'simple';
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;

Step 2: Verify results

SELECT * FROM check_database_size();

Expected outcome: Database size ~210-220 MB (58-60% buffer remaining)

Week 1 Actions

Step 3: Implement automated retention policy

  • Create retention policy function (Phase 2 code)
  • Test function on staging/development environment
  • Schedule daily execution via pg_cron

Step 4: Set up monitoring

  • Create monitoring functions (Phase 4 code)
  • Configure alerts for size thresholds
  • Document escalation procedures

Week 2 Actions

Step 5: Create aggregation tables

  • Implement summary tables (Phase 3 code)
  • Backfill historical aggregations if needed
  • Update retention policy to aggregate before pruning

Step 6: Optimize and tune

  • Review query performance post-pruning
  • Adjust retention periods if needed based on actual usage
  • Document any issues or improvements

Monthly Maintenance

Step 7: Regular review

  • Monthly review of database growth trends
  • Quarterly review of retention policy effectiveness
  • Adjust retention periods based on product needs

10. SQL Execution Scripts

Script 1: Emergency Pruning (Run First)

-- ============================================
-- EMERGENCY PRUNING SCRIPT
-- Expected savings: ~50 MB
-- Execution time: 2-5 minutes
-- ============================================

BEGIN;

-- Create backup of current state
CREATE TABLE IF NOT EXISTS pruning_audit (
    executed_at TIMESTAMPTZ DEFAULT NOW(),
    action TEXT,
    records_affected INTEGER,
    size_before_mb NUMERIC,
    size_after_mb NUMERIC
);

-- Record size before
INSERT INTO pruning_audit (action, size_before_mb)
SELECT 'before_pruning',
       pg_total_relation_size('telemetry_events')::numeric / 1024 / 1024;

-- Delete old events (keep last 7 days)
WITH deleted AS (
    DELETE FROM telemetry_events
    WHERE created_at < NOW() - INTERVAL '7 days'
    RETURNING *
)
INSERT INTO pruning_audit (action, records_affected)
SELECT 'delete_events_7d', COUNT(*) FROM deleted;

-- Delete old simple workflows (keep last 7 days)
WITH deleted AS (
    DELETE FROM telemetry_workflows
    WHERE created_at < NOW() - INTERVAL '7 days'
      AND complexity = 'simple'
    RETURNING *
)
INSERT INTO pruning_audit (action, records_affected)
SELECT 'delete_workflows_simple_7d', COUNT(*) FROM deleted;

-- Record size after (scope to this run's row, not earlier audit entries)
UPDATE pruning_audit
SET size_after_mb = pg_total_relation_size('telemetry_events')::numeric / 1024 / 1024
WHERE action = 'before_pruning'
  AND size_after_mb IS NULL;

COMMIT;

-- Cleanup dead space
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;

-- Verify results
SELECT * FROM pruning_audit ORDER BY executed_at DESC LIMIT 5;
SELECT * FROM check_database_size();

Script 2: Create Retention Policy (Run After Testing)

-- ============================================
-- AUTOMATED RETENTION POLICY
-- Schedule: Daily at 2 AM UTC
-- ============================================

CREATE OR REPLACE FUNCTION apply_retention_policy()
RETURNS TABLE(
    action TEXT,
    records_deleted INTEGER,
    execution_time_ms INTEGER
) AS $$
DECLARE
    start_time TIMESTAMPTZ;
    end_time TIMESTAMPTZ;
    deleted_count INTEGER;
BEGIN
    -- Tier 4: 7-day retention (high volume, low long-term value)
    start_time := clock_timestamp();

    DELETE FROM telemetry_events
    WHERE created_at < NOW() - INTERVAL '7 days'
      AND event IN ('tool_sequence', 'tool_used', 'session_start',
                     'workflow_validation_failed', 'search_query');
    GET DIAGNOSTICS deleted_count = ROW_COUNT;

    end_time := clock_timestamp();
    action := 'delete_tier4_7d';
    records_deleted := deleted_count;
    execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
    RETURN NEXT;

    -- Tier 3: 14-day retention (medium value)
    start_time := clock_timestamp();

    DELETE FROM telemetry_events
    WHERE created_at < NOW() - INTERVAL '14 days'
      AND event IN ('validation_details', 'workflow_created');
    GET DIAGNOSTICS deleted_count = ROW_COUNT;

    end_time := clock_timestamp();
    action := 'delete_tier3_14d';
    records_deleted := deleted_count;
    execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
    RETURN NEXT;

    -- Tier 1: 30-day retention (errors - keep longer)
    start_time := clock_timestamp();

    DELETE FROM telemetry_events
    WHERE created_at < NOW() - INTERVAL '30 days'
      AND event = 'error_occurred';
    GET DIAGNOSTICS deleted_count = ROW_COUNT;

    end_time := clock_timestamp();
    action := 'delete_errors_30d';
    records_deleted := deleted_count;
    execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
    RETURN NEXT;

    -- Workflow pruning by complexity
    start_time := clock_timestamp();

    DELETE FROM telemetry_workflows
    WHERE created_at < NOW() - INTERVAL '7 days'
      AND complexity = 'simple';
    GET DIAGNOSTICS deleted_count = ROW_COUNT;

    end_time := clock_timestamp();
    action := 'delete_workflows_simple_7d';
    records_deleted := deleted_count;
    execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
    RETURN NEXT;

    start_time := clock_timestamp();

    DELETE FROM telemetry_workflows
    WHERE created_at < NOW() - INTERVAL '14 days'
      AND complexity = 'medium';
    GET DIAGNOSTICS deleted_count = ROW_COUNT;

    end_time := clock_timestamp();
    action := 'delete_workflows_medium_14d';
    records_deleted := deleted_count;
    execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
    RETURN NEXT;

    start_time := clock_timestamp();

    DELETE FROM telemetry_workflows
    WHERE created_at < NOW() - INTERVAL '30 days'
      AND complexity = 'complex';
    GET DIAGNOSTICS deleted_count = ROW_COUNT;

    end_time := clock_timestamp();
    action := 'delete_workflows_complex_30d';
    records_deleted := deleted_count;
    execution_time_ms := EXTRACT(MILLISECONDS FROM (end_time - start_time))::INTEGER;
    RETURN NEXT;

    -- NOTE: VACUUM cannot run inside a plpgsql function (it must run
    -- outside a transaction block). Schedule it separately, e.g.:
    --   SELECT cron.schedule('vacuum-telemetry', '30 2 * * *',
    --       'VACUUM telemetry_events; VACUUM telemetry_workflows;');
END;
$$ LANGUAGE plpgsql;

-- Test the function manually first. Note: this performs real deletions,
-- so run it against staging data before scheduling in production.
SELECT * FROM apply_retention_policy();

-- After testing, schedule with pg_cron
-- Requires pg_cron extension: CREATE EXTENSION IF NOT EXISTS pg_cron;
-- SELECT cron.schedule('retention-policy', '0 2 * * *', 'SELECT apply_retention_policy()');

Script 3: Create Monitoring Dashboard

-- ============================================
-- MONITORING QUERIES
-- Run these regularly to track database health
-- ============================================

-- Query 1: Current database size and projections
SELECT
    'Current Size' as metric,
    pg_size_pretty(SUM(pg_total_relation_size(schemaname||'.'||relname))) as value
FROM pg_stat_user_tables
WHERE schemaname = 'public'
UNION ALL
SELECT
    'Free Tier Limit' as metric,
    '500 MB' as value
UNION ALL
SELECT
    'Percent Used' as metric,
    CONCAT(
        ROUND(
            (SUM(pg_total_relation_size(schemaname||'.'||relname))::numeric /
             (500.0 * 1024 * 1024) * 100),
            2
        ),
        '%'
    ) as value
FROM pg_stat_user_tables
WHERE schemaname = 'public';

-- Query 2: Data age distribution
SELECT
    event,
    COUNT(*) as total_records,
    MIN(created_at) as oldest_record,
    MAX(created_at) as newest_record,
    ROUND(EXTRACT(EPOCH FROM (MAX(created_at) - MIN(created_at))) / 86400, 2) as age_days
FROM telemetry_events
GROUP BY event
ORDER BY total_records DESC;

-- Query 3: Daily growth tracking (last 7 days)
SELECT
    DATE(created_at) as date,
    COUNT(*) as daily_events,
    pg_size_pretty(SUM(pg_column_size(properties))::bigint) as daily_data_size,
    COUNT(DISTINCT user_id) as active_users
FROM telemetry_events
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;

-- Query 4: Retention policy effectiveness
-- Read the retention_policy_log table (Section 8) rather than calling
-- apply_retention_policy() here: invoking the function would perform
-- real deletions, not just report on them.
SELECT
    executed_at,
    events_deleted,
    workflows_deleted,
    space_reclaimed_mb
FROM retention_policy_log
ORDER BY executed_at DESC
LIMIT 14;

Conclusion

Immediate Action Required: Implement Strategy B (7-day retention) immediately to avoid database overflow within 2 weeks.

Long-Term Strategy: Transition to Strategy C (Hybrid Tiered Retention) with automated aggregation to balance data preservation with storage constraints.

Expected Outcomes:

  • Immediate: 50+ MB saved (~20% reduction)
  • Ongoing: Database stabilized at 200-220 MB (40-44% of limit)
  • Buffer: 30-40 days before limit with current growth rate
  • Risk: Low with proper testing and monitoring

Success Metrics:

  1. Database size < 300 MB consistently
  2. 7+ days of detailed event data always available
  3. No impact on product analytics capabilities
  4. Automated retention policy runs daily without errors

Analysis completed: 2025-10-10
Next review date: 2025-11-10 (monthly check)
Escalation: If the database exceeds 400 MB, consider upgrading to a paid tier or implementing more aggressive pruning.