Telemetry Data Pruning & Aggregation Guide
Overview
This guide provides a complete solution for managing n8n-mcp telemetry data in Supabase to stay within the 500 MB free tier limit while preserving valuable insights for product development.
Current Situation
- Database Size: 265 MB / 500 MB (53% of limit)
- Growth Rate: 7.7 MB/day (54 MB/week)
- Time Until Full: ~17 days
- Total Events: 641,487 events + 17,247 workflows
Storage Breakdown
| Event Type | Count | Size | % of Total |
|---|---|---|---|
| `tool_sequence` | 362,704 | 96 MB | 72% |
| `tool_used` | 191,938 | 28 MB | 21% |
| `validation_details` | 36,280 | 14 MB | 11% |
| `workflow_created` | 23,213 | 4.5 MB | 3% |
| Others | ~26,000 | ~3 MB | 2% |
Solution Strategy
Aggregate → Delete → Retain only recent raw events
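The shape of that pipeline, as a minimal sketch for a single event type (the real implementation lives in `supabase-telemetry-aggregation.sql`; the `properties->>'tool'` JSON key is an assumption here):

```sql
-- 1. Aggregate: roll raw events past the retention window into a daily summary
INSERT INTO telemetry_tool_usage_daily (aggregation_date, tool_name, usage_count)
SELECT DATE(created_at), properties->>'tool', COUNT(*)  -- assumed JSONB key
FROM telemetry_events
WHERE event = 'tool_used'
  AND created_at < NOW() - INTERVAL '3 days'
GROUP BY DATE(created_at), properties->>'tool'
ON CONFLICT DO NOTHING;  -- assumes a UNIQUE(aggregation_date, tool_name) constraint

-- 2. Delete: drop the raw rows now represented in the aggregate
DELETE FROM telemetry_events
WHERE event = 'tool_used'
  AND created_at < NOW() - INTERVAL '3 days';
```

Only raw events inside the retention window remain; everything older survives as daily aggregates.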
Expected Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Database Size | 265 MB | ~90-120 MB | 55-65% reduction |
| Growth Rate | 7.7 MB/day | ~2-3 MB/day | 60-70% slower |
| Days Until Full | 17 days | Sustainable | Never fills |
| Free Tier Usage | 53% | ~20-25% | 75-80% headroom |
Implementation Steps
Step 1: Execute the SQL Migration
Open the Supabase SQL Editor and run the entire contents of `supabase-telemetry-aggregation.sql`:
```sql
-- Copy and paste the entire supabase-telemetry-aggregation.sql file
-- Or run it directly from the file
```
This will create (you can verify with the query after this list):
- 5 aggregation tables
- Aggregation functions
- Automated cleanup function
- Monitoring functions
- Scheduled cron job (daily at 2 AM UTC)
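To confirm the tables were created, list everything with the `telemetry_` prefix used throughout this guide:

```sql
-- List the telemetry tables created by the migration
SELECT tablename
FROM pg_tables
WHERE schemaname = 'public'
  AND tablename LIKE 'telemetry_%'
ORDER BY tablename;
```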
Step 2: Verify Cron Job Setup
Check that the cron job was created successfully:
```sql
-- View scheduled cron jobs
SELECT
  jobid,
  schedule,
  command,
  nodename,
  nodeport,
  database,
  username,
  active
FROM cron.job
WHERE jobname = 'telemetry-daily-cleanup';
```
Expected output:
- Schedule: `0 2 * * *` (daily at 2 AM UTC)
- Active: `true`
Step 3: Run Initial Emergency Cleanup
Get immediate space relief by running the emergency cleanup:
```sql
-- This will aggregate and delete data older than 7 days
SELECT * FROM emergency_cleanup();
```
Expected results:
```text
action                            | rows_deleted | space_freed_mb
----------------------------------+--------------+----------------
Deleted non-critical events > 7d  |     ~284,924 |         ~52 MB
Deleted error events > 14d        |       ~2,400 |        ~0.5 MB
Deleted duplicate workflows       |       ~8,500 |         ~11 MB
TOTAL (run VACUUM separately)     |            0 |       ~63.5 MB
```
Step 4: Reclaim Disk Space
After deletion, reclaim the actual disk space:
```sql
-- Reclaim space from deleted rows
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;

-- Update statistics for query optimization
ANALYZE telemetry_events;
ANALYZE telemetry_workflows;
```
Note: `VACUUM FULL` rewrites the table under an exclusive lock and may take a few minutes. Run it during off-peak hours if possible.
Step 5: Verify Results
Check the new database size:
```sql
SELECT * FROM check_database_size();
```
Expected output:
```text
total_size_mb | events_size_mb | workflows_size_mb | aggregates_size_mb | percent_of_limit | days_until_full | status
--------------+----------------+-------------------+--------------------+------------------+-----------------+---------
        202.5 |           85.2 |              35.8 |               12.5 |             40.5 |             ~95 | HEALTHY
```
Daily Operations (Automated)
Once set up, the system runs automatically (a quick spot-check query follows this list):
- Daily at 2 AM UTC: Cron job runs
- Aggregation: Data older than 3 days is aggregated into summary tables
- Deletion: Raw events are deleted after aggregation
- Cleanup: VACUUM runs to reclaim space
- Retention:
  - High-volume events: 3 days
  - Error events: 30 days
  - Aggregated insights: Forever
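A fast way to confirm the nightly run is keeping up is to check the newest aggregated day:

```sql
-- If the nightly job is healthy, this should be yesterday or today
SELECT MAX(aggregation_date) AS last_aggregated_day
FROM telemetry_tool_usage_daily;
```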
Monitoring Commands
Check Database Health
```sql
-- View current size and status
SELECT * FROM check_database_size();
```
View Aggregated Insights
```sql
-- Top tools used daily
SELECT
  aggregation_date,
  tool_name,
  usage_count,
  success_count,
  error_count,
  ROUND(100.0 * success_count / NULLIF(usage_count, 0), 1) as success_rate_pct
FROM telemetry_tool_usage_daily
ORDER BY aggregation_date DESC, usage_count DESC
LIMIT 50;

-- Most common tool sequences
SELECT
  aggregation_date,
  tool_sequence,
  occurrence_count,
  ROUND(avg_sequence_duration_ms, 0) as avg_duration_ms,
  ROUND(100 * success_rate, 1) as success_rate_pct
FROM telemetry_tool_patterns
ORDER BY occurrence_count DESC
LIMIT 20;

-- Error patterns over time
SELECT
  aggregation_date,
  error_type,
  error_context,
  occurrence_count,
  affected_users,
  sample_error_message
FROM telemetry_error_patterns
ORDER BY aggregation_date DESC, occurrence_count DESC
LIMIT 30;

-- Workflow creation trends
SELECT
  aggregation_date,
  complexity,
  node_count_range,
  has_trigger,
  has_webhook,
  workflow_count,
  ROUND(avg_node_count, 1) as avg_nodes
FROM telemetry_workflow_insights
ORDER BY aggregation_date DESC, workflow_count DESC
LIMIT 30;

-- Validation success rates
SELECT
  aggregation_date,
  validation_type,
  profile,
  success_count,
  failure_count,
  ROUND(100.0 * success_count / NULLIF(success_count + failure_count, 0), 1) as success_rate_pct,
  common_failure_reasons
FROM telemetry_validation_insights
ORDER BY aggregation_date DESC, (success_count + failure_count) DESC
LIMIT 30;
```
Check Cron Job Execution History
```sql
-- View recent cron job runs
SELECT
  runid,
  jobid,
  database,
  status,
  return_message,
  start_time,
  end_time
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
ORDER BY start_time DESC
LIMIT 10;
```
Manual Operations
Run Cleanup On-Demand
If you need to run cleanup outside the scheduled time:
```sql
-- Run with default 3-day retention
SELECT * FROM run_telemetry_aggregation_and_cleanup(3);
VACUUM ANALYZE telemetry_events;

-- Or with custom retention (e.g., 5 days)
SELECT * FROM run_telemetry_aggregation_and_cleanup(5);
VACUUM ANALYZE telemetry_events;
```
Emergency Cleanup (Critical Situations)
If database is approaching limit and you need immediate relief:
```sql
-- Step 1: Run emergency cleanup (7-day retention)
SELECT * FROM emergency_cleanup();

-- Step 2: Reclaim space aggressively
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;
ANALYZE telemetry_events;
ANALYZE telemetry_workflows;

-- Step 3: Verify results
SELECT * FROM check_database_size();
```
Adjust Retention Policy
To change the default 3-day retention period:
```sql
-- Update cron job to use 5-day retention instead
SELECT cron.unschedule('telemetry-daily-cleanup');
SELECT cron.schedule(
  'telemetry-daily-cleanup',
  '0 2 * * *', -- Daily at 2 AM UTC
  $$
  SELECT run_telemetry_aggregation_and_cleanup(5); -- 5 days instead of 3
  VACUUM ANALYZE telemetry_events;
  VACUUM ANALYZE telemetry_workflows;
  $$
);
```
Data Retention Policies
Raw Events Retention
| Event Type | Retention | Reason |
|---|---|---|
| `tool_sequence` | 3 days | High volume, low long-term value |
| `tool_used` | 3 days | High volume, aggregated daily |
| `validation_details` | 3 days | Aggregated into insights |
| `workflow_created` | 3 days | Aggregated into patterns |
| `session_start` | 3 days | Operational data only |
| `search_query` | 3 days | Operational data only |
| `error_occurred` | 30 days | Extended for debugging |
| `workflow_validation_failed` | 3 days | Captured in aggregates |
Aggregated Data Retention
All aggregated data is kept indefinitely:
- Daily tool usage statistics
- Tool sequence patterns
- Workflow creation trends
- Error patterns and frequencies
- Validation success rates
Workflow Retention
- Unique workflows: Kept indefinitely (one per unique hash)
- Duplicate workflows: Deleted after 3 days (see the sketch after this list)
- Workflow metadata: Aggregated into daily insights
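A minimal sketch of that de-duplication pass, assuming the hash lives in a `workflow_hash` column (the actual column names may differ):

```sql
-- Keep the earliest copy of each hash; delete later duplicates past the window
DELETE FROM telemetry_workflows dup
WHERE dup.created_at < NOW() - INTERVAL '3 days'
  AND EXISTS (
    SELECT 1
    FROM telemetry_workflows keep
    WHERE keep.workflow_hash = dup.workflow_hash
      AND keep.created_at < dup.created_at
  );
```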
Intelligence Preserved
Even after aggressive pruning, you still have access to:
Long-term Product Insights
- Which tools are most/least used over time
- Tool usage trends and adoption curves
- Common workflow patterns and complexities
- Error frequencies and types across versions
- Validation failure patterns
Development Intelligence
- Feature adoption rates (by day/week/month)
- Pain points (high error rates, validation failures)
- User behavior patterns (tool sequences, workflow styles)
- Version comparison (changes in usage between releases)
Recent Debugging Data
- Last 3 days of raw events for immediate issues
- Last 30 days of error events for bug tracking
- Sample error messages for each error type
Troubleshooting
Cron Job Not Running
Check if pg_cron extension is enabled:
```sql
-- Enable pg_cron
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Verify it's enabled
SELECT * FROM pg_extension WHERE extname = 'pg_cron';
```
Aggregation Functions Failing
Check for errors in cron job execution:
```sql
-- View error messages
SELECT
  status,
  return_message,
  start_time
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
  AND status = 'failed'
ORDER BY start_time DESC;
```
VACUUM Not Reclaiming Space
If `VACUUM ANALYZE` isn't reclaiming enough space, use `VACUUM FULL`:
```sql
-- More aggressive space reclamation (locks table)
VACUUM FULL telemetry_events;
```
Database Still Growing Too Fast
Reduce retention period further:
```sql
-- Change to 2-day retention (more aggressive)
SELECT * FROM run_telemetry_aggregation_and_cleanup(2);
```
Or delete more event types:
```sql
-- Delete additional low-value events
DELETE FROM telemetry_events
WHERE created_at < NOW() - INTERVAL '3 days'
  AND event IN ('session_start', 'search_query', 'diagnostic_completed', 'health_check_completed');
```
Performance Considerations
Cron Job Execution Time
The daily cleanup typically takes (you can measure your own runs with the query after this list):
- Aggregation: 30-60 seconds
- Deletion: 15-30 seconds
- VACUUM: 2-5 minutes
- Total: ~3-7 minutes
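To measure actual run times on your instance:

```sql
-- Duration and outcome of the most recent cleanup runs
SELECT
  start_time,
  end_time - start_time AS duration,
  status
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
ORDER BY start_time DESC
LIMIT 7;
```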
Query Performance
All aggregation tables have indexes on (illustrative definitions after this list):
- Date columns (for time-series queries)
- Lookup columns (tool_name, error_type, etc.)
- User columns (for user-specific analysis)
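The migration file defines the real indexes; a sketch following those conventions (the index names here are assumptions):

```sql
-- Time-series queries by day
CREATE INDEX IF NOT EXISTS idx_tool_usage_daily_date
  ON telemetry_tool_usage_daily (aggregation_date);

-- Per-tool lookups scoped by day
CREATE INDEX IF NOT EXISTS idx_tool_usage_daily_tool
  ON telemetry_tool_usage_daily (tool_name, aggregation_date);
```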
Lock Considerations
- `VACUUM ANALYZE`: Minimal locking, safe during operation
- `VACUUM FULL`: Locks the table, run during off-peak hours
- Aggregation functions: Read-only queries, no locking
Customization
Add Custom Aggregations
To track additional metrics, create new aggregation tables:
```sql
-- Example: Session duration aggregation
CREATE TABLE telemetry_session_duration_daily (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregation_date DATE NOT NULL,
  avg_duration_seconds NUMERIC,
  median_duration_seconds NUMERIC,
  max_duration_seconds NUMERIC,
  session_count INTEGER,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(aggregation_date)
);

-- Add to cleanup function
-- (modify run_telemetry_aggregation_and_cleanup)
```
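And a sketch of the matching aggregation step to add to the cleanup function, assuming session events carry a duration in their JSONB `properties` (the `durationSeconds` key is hypothetical):

```sql
-- Roll up per-day session durations before the raw rows are deleted
INSERT INTO telemetry_session_duration_daily (
  aggregation_date, avg_duration_seconds, median_duration_seconds,
  max_duration_seconds, session_count
)
SELECT
  DATE(created_at),
  AVG((properties->>'durationSeconds')::NUMERIC),
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (properties->>'durationSeconds')::NUMERIC),
  MAX((properties->>'durationSeconds')::NUMERIC),
  COUNT(*)
FROM telemetry_events
WHERE event = 'session_start'  -- hypothetical source event
  AND created_at < NOW() - INTERVAL '3 days'
GROUP BY DATE(created_at)
ON CONFLICT (aggregation_date) DO NOTHING;
```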
Modify Retention Policies
Edit the `run_telemetry_aggregation_and_cleanup` function to adjust retention by event type:
```sql
-- Keep validation_details for 7 days instead of 3
DELETE FROM telemetry_events
WHERE created_at < (NOW() - INTERVAL '7 days')
  AND event = 'validation_details';
```
Change Cron Schedule
Adjust the execution time if needed:
```sql
-- Run at different time (e.g., 3 AM UTC)
SELECT cron.schedule(
  'telemetry-daily-cleanup',
  '0 3 * * *', -- 3 AM instead of 2 AM
  $$ SELECT run_telemetry_aggregation_and_cleanup(3); VACUUM ANALYZE telemetry_events; $$
);

-- Run twice daily (2 AM and 2 PM)
SELECT cron.schedule(
  'telemetry-cleanup-morning',
  '0 2 * * *',
  $$ SELECT run_telemetry_aggregation_and_cleanup(3); $$
);
SELECT cron.schedule(
  'telemetry-cleanup-afternoon',
  '0 14 * * *',
  $$ SELECT run_telemetry_aggregation_and_cleanup(3); $$
);
```
Backup & Recovery
Before Running Emergency Cleanup
Create a backup of aggregation queries:
```sql
-- Export aggregated data to CSV or backup tables
CREATE TABLE telemetry_tool_usage_backup AS
  SELECT * FROM telemetry_tool_usage_daily;

CREATE TABLE telemetry_patterns_backup AS
  SELECT * FROM telemetry_tool_patterns;
```
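If you prefer a file export, psql's `\copy` writes a table to a local CSV (this runs in a psql session connected to the database, not in the SQL Editor):

```sql
-- psql meta-commands: export the aggregates to CSV files on your machine
\copy telemetry_tool_usage_daily TO 'tool_usage_backup.csv' CSV HEADER
\copy telemetry_tool_patterns TO 'tool_patterns_backup.csv' CSV HEADER
```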
Restore Deleted Data
Raw event data cannot be restored after deletion. However, aggregated insights are preserved indefinitely.
To prevent accidental data loss:
- Test cleanup functions on staging first
- Review `check_database_size()` before running emergency cleanup
- Start with longer retention periods (7 days) and reduce gradually
- Monitor aggregated data quality for 1-2 weeks
Monitoring Dashboard Queries
Weekly Growth Report
```sql
-- Database growth over last 7 days
SELECT
  DATE(created_at) as date,
  COUNT(*) as events_created,
  COUNT(DISTINCT event) as event_types,
  COUNT(DISTINCT user_id) as active_users,
  ROUND(SUM(pg_column_size(telemetry_events.*))::NUMERIC / 1024 / 1024, 2) as size_mb
FROM telemetry_events
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```
Storage Efficiency Report
```sql
-- Compare raw vs aggregated storage
SELECT
  'Raw Events (last 3 days)' as category,
  COUNT(*) as row_count,
  pg_size_pretty(pg_total_relation_size('telemetry_events')) as table_size
FROM telemetry_events
WHERE created_at >= NOW() - INTERVAL '3 days'
UNION ALL
SELECT
  'Aggregated Insights (all time)',
  (SELECT COUNT(*) FROM telemetry_tool_usage_daily) +
  (SELECT COUNT(*) FROM telemetry_tool_patterns) +
  (SELECT COUNT(*) FROM telemetry_workflow_insights) +
  (SELECT COUNT(*) FROM telemetry_error_patterns) +
  (SELECT COUNT(*) FROM telemetry_validation_insights),
  pg_size_pretty(
    pg_total_relation_size('telemetry_tool_usage_daily') +
    pg_total_relation_size('telemetry_tool_patterns') +
    pg_total_relation_size('telemetry_workflow_insights') +
    pg_total_relation_size('telemetry_error_patterns') +
    pg_total_relation_size('telemetry_validation_insights')
  );
```
Top Events by Size
```sql
-- Which event types consume most space
SELECT
  event,
  COUNT(*) as event_count,
  pg_size_pretty(SUM(pg_column_size(telemetry_events.*))::BIGINT) as total_size,
  pg_size_pretty(AVG(pg_column_size(telemetry_events.*))::BIGINT) as avg_size_per_event,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct_of_events
FROM telemetry_events
GROUP BY event
ORDER BY SUM(pg_column_size(telemetry_events.*)) DESC;
```
Success Metrics
Track these metrics weekly to ensure the system is working:
Target Metrics (After Implementation)
- ✅ Database size: < 150 MB (< 30% of limit)
- ✅ Growth rate: < 3 MB/day (sustainable)
- ✅ Raw event retention: 3 days (configurable)
- ✅ Aggregated data: All-time insights available
- ✅ Cron job success rate: > 95% (see the query after this list)
- ✅ Query performance: < 500ms for aggregated queries
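That success rate can be computed from pg_cron's run history:

```sql
-- Share of successful cleanup runs over the last 30 days
SELECT
  ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'succeeded')
        / NULLIF(COUNT(*), 0), 1) AS success_rate_pct
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
  AND start_time >= NOW() - INTERVAL '30 days';
```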
Review Schedule
- Daily: Check `check_database_size()` status
- Weekly: Review aggregated insights and growth trends
- Monthly: Analyze cron job success rate and adjust retention if needed
- After each release: Compare usage patterns to previous version
Quick Reference
Essential Commands
```sql
-- Check database health
SELECT * FROM check_database_size();

-- View recent aggregated insights
SELECT * FROM telemetry_tool_usage_daily ORDER BY aggregation_date DESC LIMIT 10;

-- Run manual cleanup (3-day retention)
SELECT * FROM run_telemetry_aggregation_and_cleanup(3);
VACUUM ANALYZE telemetry_events;

-- Emergency cleanup (7-day retention)
SELECT * FROM emergency_cleanup();
VACUUM FULL telemetry_events;

-- View cron job status
SELECT * FROM cron.job WHERE jobname = 'telemetry-daily-cleanup';

-- View cron execution history
SELECT * FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
ORDER BY start_time DESC LIMIT 5;
```
Support
If you encounter issues:
- Check the troubleshooting section above
- Review cron job execution logs
- Verify pg_cron extension is enabled
- Test aggregation functions manually
- Check Supabase dashboard for errors
For questions or improvements, refer to the main project documentation.