Telemetry Data Pruning & Aggregation Guide

Overview

This guide provides a complete solution for managing n8n-mcp telemetry data in Supabase to stay within the 500 MB free tier limit while preserving valuable insights for product development.

Current Situation

  • Database Size: 265 MB / 500 MB (53% of limit)
  • Growth Rate: 7.7 MB/day (54 MB/week)
  • Time Until Full: ~17 days
  • Total Events: 641,487 events + 17,247 workflows

Storage Breakdown

| Event Type         | Count    | Size    | % of Total |
|--------------------|----------|---------|------------|
| tool_sequence      | 362,704  | 96 MB   | 72%        |
| tool_used          | 191,938  | 28 MB   | 21%        |
| validation_details | 36,280   | 14 MB   | 11%        |
| workflow_created   | 23,213   | 4.5 MB  | 3%         |
| Others             | ~26,000  | ~3 MB   | 2%         |

Solution Strategy

Aggregate → Delete → Retain only recent raw events
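
The pattern looks roughly like this. A minimal sketch, assuming telemetry_events carries event, tool_name, and created_at columns and that the daily aggregate table has a unique constraint on (aggregation_date, tool_name); the actual logic ships in run_telemetry_aggregation_and_cleanup from the migration:

-- 1. Aggregate: roll raw events older than the retention window into daily counts
-- (simplified; the shipped aggregate also tracks success/error counts)
INSERT INTO telemetry_tool_usage_daily (aggregation_date, tool_name, usage_count)
SELECT DATE(created_at), tool_name, COUNT(*)
FROM telemetry_events
WHERE event = 'tool_used'
  AND created_at < NOW() - INTERVAL '3 days'
GROUP BY DATE(created_at), tool_name
ON CONFLICT (aggregation_date, tool_name)
DO UPDATE SET usage_count = telemetry_tool_usage_daily.usage_count + EXCLUDED.usage_count;

-- 2. Delete: drop the raw rows now captured in the aggregate
DELETE FROM telemetry_events
WHERE event = 'tool_used'
  AND created_at < NOW() - INTERVAL '3 days';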

Expected Results

| Metric          | Before     | After       | Improvement      |
|-----------------|------------|-------------|------------------|
| Database Size   | 265 MB     | ~90-120 MB  | 55-65% reduction |
| Growth Rate     | 7.7 MB/day | ~2-3 MB/day | 60-70% slower    |
| Days Until Full | 17 days    | Sustainable | Never fills      |
| Free Tier Usage | 53%        | ~20-25%     | 75-80% headroom  |

Implementation Steps

Step 1: Execute the SQL Migration

Open Supabase SQL Editor and run the entire contents of supabase-telemetry-aggregation.sql:

-- Copy and paste the entire supabase-telemetry-aggregation.sql file
-- Or run it directly from the file

This will create:

  • 5 aggregation tables
  • Aggregation functions
  • Automated cleanup function
  • Monitoring functions
  • Scheduled cron job (daily at 2 AM UTC)

Step 2: Verify Cron Job Setup

Check that the cron job was created successfully:

-- View scheduled cron jobs
SELECT
    jobid,
    schedule,
    command,
    nodename,
    nodeport,
    database,
    username,
    active
FROM cron.job
WHERE jobname = 'telemetry-daily-cleanup';

Expected output:

  • Schedule: 0 2 * * * (daily at 2 AM UTC)
  • Active: true

Step 3: Run Initial Emergency Cleanup

Get immediate space relief by running the emergency cleanup:

-- This will aggregate and delete data older than 7 days
SELECT * FROM emergency_cleanup();

Expected results:

action                              | rows_deleted | space_freed_mb
------------------------------------+--------------+----------------
Deleted non-critical events > 7d    | ~284,924     | ~52 MB
Deleted error events > 14d          | ~2,400       | ~0.5 MB
Deleted duplicate workflows         | ~8,500       | ~11 MB
TOTAL (run VACUUM separately)       | 0            | ~63.5 MB

Step 4: Reclaim Disk Space

After deletion, reclaim the actual disk space:

-- Reclaim space from deleted rows
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;

-- Update statistics for query optimization
ANALYZE telemetry_events;
ANALYZE telemetry_workflows;

Note: VACUUM FULL may take a few minutes and locks the table. Run during off-peak hours if possible.
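
To gauge how much space there is to reclaim before (or after) vacuuming, check dead-tuple counts in the standard PostgreSQL statistics view (no project-specific assumptions here):

-- Dead tuples represent space that VACUUM can reclaim
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('telemetry_events', 'telemetry_workflows');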

Step 5: Verify Results

Check the new database size:

SELECT * FROM check_database_size();

Expected output:

total_size_mb | events_size_mb | workflows_size_mb | aggregates_size_mb | percent_of_limit | days_until_full | status
--------------+----------------+-------------------+--------------------+------------------+-----------------+---------
202.5         | 85.2           | 35.8              | 12.5               | 40.5             | ~95             | HEALTHY

Daily Operations (Automated)

Once set up, the system runs automatically:

  1. Daily at 2 AM UTC: Cron job runs
  2. Aggregation: Data older than 3 days is aggregated into summary tables
  3. Deletion: Raw events are deleted after aggregation
  4. Cleanup: VACUUM runs to reclaim space
  5. Retention:
    • High-volume events: 3 days
    • Error events: 30 days
    • Aggregated insights: Forever
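
To preview what the next scheduled run will touch, count the rows that have aged out of the retention window (a plain query against telemetry_events, using only columns seen throughout this guide):

-- Rows the next 3-day-retention run would aggregate and delete
SELECT event, COUNT(*) AS rows_to_prune
FROM telemetry_events
WHERE created_at < NOW() - INTERVAL '3 days'
GROUP BY event
ORDER BY rows_to_prune DESC;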

Monitoring Commands

Check Database Health

-- View current size and status
SELECT * FROM check_database_size();

View Aggregated Insights

-- Top tools used daily
SELECT
    aggregation_date,
    tool_name,
    usage_count,
    success_count,
    error_count,
    ROUND(100.0 * success_count / NULLIF(usage_count, 0), 1) as success_rate_pct
FROM telemetry_tool_usage_daily
ORDER BY aggregation_date DESC, usage_count DESC
LIMIT 50;

-- Most common tool sequences
SELECT
    aggregation_date,
    tool_sequence,
    occurrence_count,
    ROUND(avg_sequence_duration_ms, 0) as avg_duration_ms,
    ROUND(100 * success_rate, 1) as success_rate_pct
FROM telemetry_tool_patterns
ORDER BY occurrence_count DESC
LIMIT 20;

-- Error patterns over time
SELECT
    aggregation_date,
    error_type,
    error_context,
    occurrence_count,
    affected_users,
    sample_error_message
FROM telemetry_error_patterns
ORDER BY aggregation_date DESC, occurrence_count DESC
LIMIT 30;

-- Workflow creation trends
SELECT
    aggregation_date,
    complexity,
    node_count_range,
    has_trigger,
    has_webhook,
    workflow_count,
    ROUND(avg_node_count, 1) as avg_nodes
FROM telemetry_workflow_insights
ORDER BY aggregation_date DESC, workflow_count DESC
LIMIT 30;

-- Validation success rates
SELECT
    aggregation_date,
    validation_type,
    profile,
    success_count,
    failure_count,
    ROUND(100.0 * success_count / NULLIF(success_count + failure_count, 0), 1) as success_rate_pct,
    common_failure_reasons
FROM telemetry_validation_insights
ORDER BY aggregation_date DESC, (success_count + failure_count) DESC
LIMIT 30;

Check Cron Job Execution History

-- View recent cron job runs
SELECT
    runid,
    jobid,
    database,
    status,
    return_message,
    start_time,
    end_time
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
ORDER BY start_time DESC
LIMIT 10;

Manual Operations

Run Cleanup On-Demand

If you need to run cleanup outside the scheduled time:

-- Run with default 3-day retention
SELECT * FROM run_telemetry_aggregation_and_cleanup(3);
VACUUM ANALYZE telemetry_events;

-- Or with custom retention (e.g., 5 days)
SELECT * FROM run_telemetry_aggregation_and_cleanup(5);
VACUUM ANALYZE telemetry_events;

Emergency Cleanup (Critical Situations)

If the database is approaching the limit and you need immediate relief:

-- Step 1: Run emergency cleanup (7-day retention)
SELECT * FROM emergency_cleanup();

-- Step 2: Reclaim space aggressively
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;
ANALYZE telemetry_events;
ANALYZE telemetry_workflows;

-- Step 3: Verify results
SELECT * FROM check_database_size();

Adjust Retention Policy

To change the default 3-day retention period:

-- Update cron job to use 5-day retention instead
SELECT cron.unschedule('telemetry-daily-cleanup');

SELECT cron.schedule(
    'telemetry-daily-cleanup',
    '0 2 * * *', -- Daily at 2 AM UTC
    $$
    SELECT run_telemetry_aggregation_and_cleanup(5); -- 5 days instead of 3
    VACUUM ANALYZE telemetry_events;
    VACUUM ANALYZE telemetry_workflows;
    $$
);

Data Retention Policies

Raw Events Retention

| Event Type                 | Retention | Reason                           |
|----------------------------|-----------|----------------------------------|
| tool_sequence              | 3 days    | High volume, low long-term value |
| tool_used                  | 3 days    | High volume, aggregated daily    |
| validation_details         | 3 days    | Aggregated into insights         |
| workflow_created           | 3 days    | Aggregated into patterns         |
| session_start              | 3 days    | Operational data only            |
| search_query               | 3 days    | Operational data only            |
| error_occurred             | 30 days   | Extended for debugging           |
| workflow_validation_failed | 3 days    | Captured in aggregates           |

Aggregated Data Retention

All aggregated data is kept indefinitely:

  • Daily tool usage statistics
  • Tool sequence patterns
  • Workflow creation trends
  • Error patterns and frequencies
  • Validation success rates

Workflow Retention

  • Unique workflows: Kept indefinitely (one per unique hash)
  • Duplicate workflows: Deleted after 3 days (see the sketch below)
  • Workflow metadata: Aggregated into daily insights
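
A hedged sketch of that duplicate deletion, assuming telemetry_workflows stores a workflow_hash and created_at column (the column names are assumptions; the shipped logic lives inside emergency_cleanup):

-- Keep the earliest row per unique hash; delete later duplicates older than 3 days
DELETE FROM telemetry_workflows t
USING (
    SELECT workflow_hash, MIN(created_at) AS first_seen
    FROM telemetry_workflows
    GROUP BY workflow_hash
) keep
WHERE t.workflow_hash = keep.workflow_hash
  AND t.created_at > keep.first_seen
  AND t.created_at < NOW() - INTERVAL '3 days';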

Intelligence Preserved

Even after aggressive pruning, you still have access to:

Long-term Product Insights

  • Which tools are most/least used over time
  • Tool usage trends and adoption curves
  • Common workflow patterns and complexities
  • Error frequencies and types across versions
  • Validation failure patterns

Development Intelligence

  • Feature adoption rates (by day/week/month)
  • Pain points (high error rates, validation failures)
  • User behavior patterns (tool sequences, workflow styles)
  • Version comparison (changes in usage between releases)

Recent Debugging Data

  • Last 3 days of raw events for immediate issues
  • Last 30 days of error events for bug tracking
  • Sample error messages for each error type
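
For example, pulling recent errors inside that 30-day window might look like this (assuming event payloads live in a properties JSONB column; the column name is an assumption):

-- Most recent error events still inside the 30-day retention window
SELECT created_at, properties
FROM telemetry_events
WHERE event = 'error_occurred'
  AND created_at >= NOW() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 20;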

Troubleshooting

Cron Job Not Running

Check if pg_cron extension is enabled:

-- Enable pg_cron
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Verify it's enabled
SELECT * FROM pg_extension WHERE extname = 'pg_cron';

Aggregation Functions Failing

Check for errors in cron job execution:

-- View error messages
SELECT
    status,
    return_message,
    start_time
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
    AND status = 'failed'
ORDER BY start_time DESC;

VACUUM Not Reclaiming Space

If VACUUM ANALYZE isn't reclaiming enough space, use VACUUM FULL:

-- More aggressive space reclamation (locks table)
VACUUM FULL telemetry_events;

Database Still Growing Too Fast

Reduce the retention period further:

-- Change to 2-day retention (more aggressive)
SELECT * FROM run_telemetry_aggregation_and_cleanup(2);

Or delete more event types:

-- Delete additional low-value events
DELETE FROM telemetry_events
WHERE created_at < NOW() - INTERVAL '3 days'
    AND event IN ('session_start', 'search_query', 'diagnostic_completed', 'health_check_completed');

Performance Considerations

Cron Job Execution Time

The daily cleanup typically takes:

  • Aggregation: 30-60 seconds
  • Deletion: 15-30 seconds
  • VACUUM: 2-5 minutes
  • Total: ~3-7 minutes

Query Performance

All aggregation tables have indexes on:

  • Date columns (for time-series queries)
  • Lookup columns (tool_name, error_type, etc.)
  • User columns (for user-specific analysis)
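
Illustrative index shapes (the migration defines the actual indexes; these statements only show the pattern):

-- Time-series lookups by date, plus a composite for error analysis
CREATE INDEX IF NOT EXISTS idx_tool_usage_daily_date
    ON telemetry_tool_usage_daily (aggregation_date);
CREATE INDEX IF NOT EXISTS idx_error_patterns_type_date
    ON telemetry_error_patterns (error_type, aggregation_date);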

Lock Considerations

  • VACUUM ANALYZE: Minimal locking, safe during operation
  • VACUUM FULL: Locks table, run during off-peak hours
  • Aggregation and deletion: Row-level locks only; reads are not blocked

Customization

Add Custom Aggregations

To track additional metrics, create new aggregation tables:

-- Example: Session duration aggregation
CREATE TABLE telemetry_session_duration_daily (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregation_date DATE NOT NULL,
    avg_duration_seconds NUMERIC,
    median_duration_seconds NUMERIC,
    max_duration_seconds NUMERIC,
    session_count INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(aggregation_date)
);

-- Add to cleanup function
-- (modify run_telemetry_aggregation_and_cleanup)
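
The new table then needs a population step inside the cleanup run. A hedged sketch, assuming session duration is recorded in a properties JSONB field on session_start events (the field name and location are assumptions):

-- Hypothetical population step; adapt the JSON path to how duration is actually stored
INSERT INTO telemetry_session_duration_daily
    (aggregation_date, avg_duration_seconds, median_duration_seconds,
     max_duration_seconds, session_count)
SELECT
    DATE(created_at),
    AVG((properties->>'duration_seconds')::NUMERIC),
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (properties->>'duration_seconds')::NUMERIC),
    MAX((properties->>'duration_seconds')::NUMERIC),
    COUNT(*)
FROM telemetry_events
WHERE event = 'session_start'
  AND created_at < NOW() - INTERVAL '3 days'
GROUP BY DATE(created_at)
ON CONFLICT (aggregation_date) DO NOTHING;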

Modify Retention Policies

Edit the run_telemetry_aggregation_and_cleanup function to adjust retention by event type:

-- Keep validation_details for 7 days instead of 3
DELETE FROM telemetry_events
WHERE created_at < (NOW() - INTERVAL '7 days')
    AND event = 'validation_details';

Change Cron Schedule

Adjust the execution time if needed:

-- Run at different time (e.g., 3 AM UTC)
SELECT cron.schedule(
    'telemetry-daily-cleanup',
    '0 3 * * *', -- 3 AM instead of 2 AM
    $$ SELECT run_telemetry_aggregation_and_cleanup(3); VACUUM ANALYZE telemetry_events; $$
);

-- Run twice daily (2 AM and 2 PM)
SELECT cron.schedule(
    'telemetry-cleanup-morning',
    '0 2 * * *',
    $$ SELECT run_telemetry_aggregation_and_cleanup(3); $$
);

SELECT cron.schedule(
    'telemetry-cleanup-afternoon',
    '0 14 * * *',
    $$ SELECT run_telemetry_aggregation_and_cleanup(3); $$
);

Backup & Recovery

Before Running Emergency Cleanup

Create backups of the aggregation tables:

-- Export aggregated data to CSV or backup tables
CREATE TABLE telemetry_tool_usage_backup AS
SELECT * FROM telemetry_tool_usage_daily;

CREATE TABLE telemetry_patterns_backup AS
SELECT * FROM telemetry_tool_patterns;
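
To export to CSV instead, the standard psql client-side \copy command works (run from psql rather than the Supabase SQL editor):

\copy telemetry_tool_usage_daily TO 'tool_usage_daily.csv' WITH (FORMAT csv, HEADER)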

Restore Deleted Data

Raw event data cannot be restored after deletion. However, aggregated insights are preserved indefinitely.

To prevent accidental data loss:

  1. Test cleanup functions on staging first
  2. Review check_database_size() before running emergency cleanup
  3. Start with longer retention periods (7 days) and reduce gradually
  4. Monitor aggregated data quality for 1-2 weeks
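
A cautious way to test (point 1 above) is to wrap the cleanup call in a transaction and roll it back, so you see the returned counts without committing any deletes. This assumes the function does not manage its own transactions, which holds for ordinary SQL and PL/pgSQL functions:

BEGIN;
SELECT * FROM run_telemetry_aggregation_and_cleanup(7);
-- Inspect the returned counts, then undo everything:
ROLLBACK;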

Monitoring Dashboard Queries

Weekly Growth Report

-- Database growth over last 7 days
SELECT
    DATE(created_at) as date,
    COUNT(*) as events_created,
    COUNT(DISTINCT event) as event_types,
    COUNT(DISTINCT user_id) as active_users,
    ROUND(SUM(pg_column_size(telemetry_events.*))::NUMERIC / 1024 / 1024, 2) as size_mb
FROM telemetry_events
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Storage Efficiency Report

-- Compare raw vs aggregated storage
SELECT
    'Raw Events (last 3 days)' as category,
    COUNT(*) as row_count,
    pg_size_pretty(pg_total_relation_size('telemetry_events')) as table_size
FROM telemetry_events
WHERE created_at >= NOW() - INTERVAL '3 days'

UNION ALL

SELECT
    'Aggregated Insights (all time)',
    (SELECT COUNT(*) FROM telemetry_tool_usage_daily) +
    (SELECT COUNT(*) FROM telemetry_tool_patterns) +
    (SELECT COUNT(*) FROM telemetry_workflow_insights) +
    (SELECT COUNT(*) FROM telemetry_error_patterns) +
    (SELECT COUNT(*) FROM telemetry_validation_insights),
    pg_size_pretty(
        pg_total_relation_size('telemetry_tool_usage_daily') +
        pg_total_relation_size('telemetry_tool_patterns') +
        pg_total_relation_size('telemetry_workflow_insights') +
        pg_total_relation_size('telemetry_error_patterns') +
        pg_total_relation_size('telemetry_validation_insights')
    );

Top Events by Size

-- Which event types consume most space
SELECT
    event,
    COUNT(*) as event_count,
    pg_size_pretty(SUM(pg_column_size(telemetry_events.*))::BIGINT) as total_size,
    pg_size_pretty(AVG(pg_column_size(telemetry_events.*))::BIGINT) as avg_size_per_event,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct_of_events
FROM telemetry_events
GROUP BY event
ORDER BY SUM(pg_column_size(telemetry_events.*)) DESC;

Success Metrics

Track these metrics weekly to ensure the system is working:

Target Metrics (After Implementation)

  • Database size: < 150 MB (< 30% of limit)
  • Growth rate: < 3 MB/day (sustainable)
  • Raw event retention: 3 days (configurable)
  • Aggregated data: All-time insights available
  • Cron job success rate: > 95%
  • Query performance: < 500ms for aggregated queries
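
The cron success rate above can be computed directly from pg_cron's run history:

-- Success rate of the cleanup job over the last 30 days of runs
SELECT
    ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'succeeded')
          / NULLIF(COUNT(*), 0), 1) AS success_rate_pct
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
  AND start_time >= NOW() - INTERVAL '30 days';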

Review Schedule

  • Daily: Check check_database_size() status
  • Weekly: Review aggregated insights and growth trends
  • Monthly: Analyze cron job success rate and adjust retention if needed
  • After each release: Compare usage patterns to previous version

Quick Reference

Essential Commands

-- Check database health
SELECT * FROM check_database_size();

-- View recent aggregated insights
SELECT * FROM telemetry_tool_usage_daily ORDER BY aggregation_date DESC LIMIT 10;

-- Run manual cleanup (3-day retention)
SELECT * FROM run_telemetry_aggregation_and_cleanup(3);
VACUUM ANALYZE telemetry_events;

-- Emergency cleanup (7-day retention)
SELECT * FROM emergency_cleanup();
VACUUM FULL telemetry_events;

-- View cron job status
SELECT * FROM cron.job WHERE jobname = 'telemetry-daily-cleanup';

-- View cron execution history
SELECT * FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
ORDER BY start_time DESC LIMIT 5;

Support

If you encounter issues:

  1. Check the troubleshooting section above
  2. Review cron job execution logs
  3. Verify pg_cron extension is enabled
  4. Test aggregation functions manually
  5. Check Supabase dashboard for errors

For questions or improvements, refer to the main project documentation.