# Telemetry Data Pruning & Aggregation Guide
## Overview
This guide provides a complete solution for managing n8n-mcp telemetry data in Supabase to stay within the 500 MB free tier limit while preserving valuable insights for product development.
## Current Situation
- **Database Size**: 265 MB / 500 MB (53% of limit)
- **Growth Rate**: 7.7 MB/day (54 MB/week)
- **Time Until Full**: ~17 days
- **Total Events**: 641,487 events + 17,247 workflows
### Storage Breakdown
| Event Type | Count | Size | % of Total |
|------------|-------|------|------------|
| `tool_sequence` | 362,704 | 96 MB | 72% |
| `tool_used` | 191,938 | 28 MB | 21% |
| `validation_details` | 36,280 | 14 MB | 11% |
| `workflow_created` | 23,213 | 4.5 MB | 3% |
| Others | ~26,000 | ~3 MB | 2% |
## Solution Strategy
**Aggregate → Delete → Retain only recent raw events**
### Expected Results
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Database Size | 265 MB | ~90-120 MB | **55-65% reduction** |
| Growth Rate | 7.7 MB/day | ~2-3 MB/day | **60-70% slower** |
| Days Until Full | 17 days | **Sustainable** | Never fills |
| Free Tier Usage | 53% | ~20-25% | **75-80% headroom** |
## Implementation Steps
### Step 1: Execute the SQL Migration
Open Supabase SQL Editor and run the entire contents of `supabase-telemetry-aggregation.sql`:
```sql
-- Copy and paste the entire supabase-telemetry-aggregation.sql file
-- Or run it directly from the file
```
This will create:
- 5 aggregation tables
- Aggregation functions
- Automated cleanup function
- Monitoring functions
- Scheduled cron job (daily at 2 AM UTC)
### Step 2: Verify Cron Job Setup
Check that the cron job was created successfully:
```sql
-- View scheduled cron jobs
SELECT
  jobid,
  schedule,
  command,
  nodename,
  nodeport,
  database,
  username,
  active
FROM cron.job
WHERE jobname = 'telemetry-daily-cleanup';
```
Expected output:
- Schedule: `0 2 * * *` (daily at 2 AM UTC)
- Active: `true`
### Step 3: Run Initial Emergency Cleanup
Get immediate space relief by running the emergency cleanup:
```sql
-- This will aggregate and delete data older than 7 days
SELECT * FROM emergency_cleanup();
```
Expected results:
```
action | rows_deleted | space_freed_mb
------------------------------------+--------------+----------------
Deleted non-critical events > 7d | ~284,924 | ~52 MB
Deleted error events > 14d | ~2,400 | ~0.5 MB
Deleted duplicate workflows | ~8,500 | ~11 MB
TOTAL (run VACUUM separately) | 0 | ~63.5 MB
```
### Step 4: Reclaim Disk Space
After deletion, reclaim the actual disk space:
```sql
-- Reclaim space from deleted rows
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;
-- Update statistics for query optimization
ANALYZE telemetry_events;
ANALYZE telemetry_workflows;
```
**Note**: `VACUUM FULL` may take a few minutes and locks the table. Run during off-peak hours if possible.
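Before paying the cost of `VACUUM FULL`, you can estimate how much space is actually reclaimable by checking dead-tuple counts in PostgreSQL's standard statistics view:

```sql
-- Gauge how much VACUUM can reclaim: high dead_pct means it's worth running
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('telemetry_events', 'telemetry_workflows');
```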
### Step 5: Verify Results
Check the new database size:
```sql
SELECT * FROM check_database_size();
```
Expected output:
```
total_size_mb | events_size_mb | workflows_size_mb | aggregates_size_mb | percent_of_limit | days_until_full | status
--------------+----------------+-------------------+--------------------+------------------+-----------------+---------
202.5 | 85.2 | 35.8 | 12.5 | 40.5 | ~95 | HEALTHY
```
## Daily Operations (Automated)
Once set up, the system runs automatically:
1. **Daily at 2 AM UTC**: Cron job runs
2. **Aggregation**: Data older than 3 days is aggregated into summary tables
3. **Deletion**: Raw events are deleted after aggregation
4. **Cleanup**: VACUUM runs to reclaim space
5. **Retention**:
- High-volume events: 3 days
- Error events: 30 days
- Aggregated insights: Forever
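The aggregate-then-delete order above is the core safety property: no raw row is removed before its contribution is captured in a summary table. As an illustrative sketch only (the real logic lives inside `run_telemetry_aggregation_and_cleanup`, and the `properties` column layout here is an assumption), one event type's pass might look like:

```sql
-- 1. Aggregate before deleting, so no insight is lost
--    (column names on telemetry_events are assumptions for illustration):
INSERT INTO telemetry_tool_usage_daily
  (aggregation_date, tool_name, usage_count, success_count, error_count)
SELECT
  DATE(created_at),
  properties->>'tool',
  COUNT(*),
  COUNT(*) FILTER (WHERE properties->>'success' = 'true'),
  COUNT(*) FILTER (WHERE properties->>'success' = 'false')
FROM telemetry_events
WHERE event = 'tool_used'
  AND created_at < NOW() - INTERVAL '3 days'
GROUP BY 1, 2
ON CONFLICT DO NOTHING;

-- 2. Only then delete the raw rows that were just aggregated:
DELETE FROM telemetry_events
WHERE event = 'tool_used'
  AND created_at < NOW() - INTERVAL '3 days';
```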
## Monitoring Commands
### Check Database Health
```sql
-- View current size and status
SELECT * FROM check_database_size();
```
### View Aggregated Insights
```sql
-- Top tools used daily
SELECT
  aggregation_date,
  tool_name,
  usage_count,
  success_count,
  error_count,
  ROUND(100.0 * success_count / NULLIF(usage_count, 0), 1) AS success_rate_pct
FROM telemetry_tool_usage_daily
ORDER BY aggregation_date DESC, usage_count DESC
LIMIT 50;

-- Most common tool sequences
SELECT
  aggregation_date,
  tool_sequence,
  occurrence_count,
  ROUND(avg_sequence_duration_ms, 0) AS avg_duration_ms,
  ROUND(100 * success_rate, 1) AS success_rate_pct
FROM telemetry_tool_patterns
ORDER BY occurrence_count DESC
LIMIT 20;

-- Error patterns over time
SELECT
  aggregation_date,
  error_type,
  error_context,
  occurrence_count,
  affected_users,
  sample_error_message
FROM telemetry_error_patterns
ORDER BY aggregation_date DESC, occurrence_count DESC
LIMIT 30;

-- Workflow creation trends
SELECT
  aggregation_date,
  complexity,
  node_count_range,
  has_trigger,
  has_webhook,
  workflow_count,
  ROUND(avg_node_count, 1) AS avg_nodes
FROM telemetry_workflow_insights
ORDER BY aggregation_date DESC, workflow_count DESC
LIMIT 30;

-- Validation success rates
SELECT
  aggregation_date,
  validation_type,
  profile,
  success_count,
  failure_count,
  ROUND(100.0 * success_count / NULLIF(success_count + failure_count, 0), 1) AS success_rate_pct,
  common_failure_reasons
FROM telemetry_validation_insights
ORDER BY aggregation_date DESC, (success_count + failure_count) DESC
LIMIT 30;
```
### Check Cron Job Execution History
```sql
-- View recent cron job runs
SELECT
  runid,
  jobid,
  database,
  status,
  return_message,
  start_time,
  end_time
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
ORDER BY start_time DESC
LIMIT 10;
```
## Manual Operations
### Run Cleanup On-Demand
If you need to run cleanup outside the scheduled time:
```sql
-- Run with default 3-day retention
SELECT * FROM run_telemetry_aggregation_and_cleanup(3);
VACUUM ANALYZE telemetry_events;
-- Or with custom retention (e.g., 5 days)
SELECT * FROM run_telemetry_aggregation_and_cleanup(5);
VACUUM ANALYZE telemetry_events;
```
### Emergency Cleanup (Critical Situations)
If the database is approaching the limit and you need immediate relief:
```sql
-- Step 1: Run emergency cleanup (7-day retention)
SELECT * FROM emergency_cleanup();
-- Step 2: Reclaim space aggressively
VACUUM FULL telemetry_events;
VACUUM FULL telemetry_workflows;
ANALYZE telemetry_events;
ANALYZE telemetry_workflows;
-- Step 3: Verify results
SELECT * FROM check_database_size();
```
### Adjust Retention Policy
To change the default 3-day retention period:
```sql
-- Update cron job to use 5-day retention instead
SELECT cron.unschedule('telemetry-daily-cleanup');
SELECT cron.schedule(
  'telemetry-daily-cleanup',
  '0 2 * * *', -- Daily at 2 AM UTC
  $$
  SELECT run_telemetry_aggregation_and_cleanup(5); -- 5 days instead of 3
  VACUUM ANALYZE telemetry_events;
  VACUUM ANALYZE telemetry_workflows;
  $$
);
```
## Data Retention Policies
### Raw Events Retention
| Event Type | Retention | Reason |
|------------|-----------|--------|
| `tool_sequence` | 3 days | High volume, low long-term value |
| `tool_used` | 3 days | High volume, aggregated daily |
| `validation_details` | 3 days | Aggregated into insights |
| `workflow_created` | 3 days | Aggregated into patterns |
| `session_start` | 3 days | Operational data only |
| `search_query` | 3 days | Operational data only |
| `error_occurred` | **30 days** | Extended for debugging |
| `workflow_validation_failed` | 3 days | Captured in aggregates |
### Aggregated Data Retention
All aggregated data is kept **indefinitely**:
- Daily tool usage statistics
- Tool sequence patterns
- Workflow creation trends
- Error patterns and frequencies
- Validation success rates
### Workflow Retention
- **Unique workflows**: Kept indefinitely (one per unique hash)
- **Duplicate workflows**: Deleted after 3 days
- **Workflow metadata**: Aggregated into daily insights
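Deduplication keeps one row per unique hash and removes later copies. A minimal sketch of that logic, assuming `telemetry_workflows` carries a `workflow_hash` and `created_at` column (both names are assumptions about the actual schema):

```sql
-- Sketch only: keep the earliest row per hash, delete newer duplicates
-- once they are past the 3-day retention window.
DELETE FROM telemetry_workflows w
USING telemetry_workflows keeper
WHERE keeper.workflow_hash = w.workflow_hash
  AND keeper.created_at < w.created_at
  AND w.created_at < NOW() - INTERVAL '3 days';
```

The self-join ensures a row is deleted only when an older row with the same hash survives, so every unique workflow is always represented.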
## Intelligence Preserved
Even after aggressive pruning, you still have access to:
### Long-term Product Insights
- Which tools are most/least used over time
- Tool usage trends and adoption curves
- Common workflow patterns and complexities
- Error frequencies and types across versions
- Validation failure patterns
### Development Intelligence
- Feature adoption rates (by day/week/month)
- Pain points (high error rates, validation failures)
- User behavior patterns (tool sequences, workflow styles)
- Version comparison (changes in usage between releases)
### Recent Debugging Data
- Last 3 days of raw events for immediate issues
- Last 30 days of error events for bug tracking
- Sample error messages for each error type
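For immediate debugging, the retained raw error events can be pulled directly; the `properties` column name here is an assumption about the events schema:

```sql
-- Most recent raw error events within the 30-day retention window
SELECT created_at, event, properties
FROM telemetry_events
WHERE event = 'error_occurred'
  AND created_at >= NOW() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 100;
```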
## Troubleshooting
### Cron Job Not Running
Check if pg_cron extension is enabled:
```sql
-- Enable pg_cron
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Verify it's enabled
SELECT * FROM pg_extension WHERE extname = 'pg_cron';
```
### Aggregation Functions Failing
Check for errors in cron job execution:
```sql
-- View error messages
SELECT
  status,
  return_message,
  start_time
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
  AND status = 'failed'
ORDER BY start_time DESC;
```
### VACUUM Not Reclaiming Space
If `VACUUM ANALYZE` isn't reclaiming enough space, use `VACUUM FULL`:
```sql
-- More aggressive space reclamation (locks table)
VACUUM FULL telemetry_events;
```
### Database Still Growing Too Fast
Reduce retention period further:
```sql
-- Change to 2-day retention (more aggressive)
SELECT * FROM run_telemetry_aggregation_and_cleanup(2);
```
Or delete more event types:
```sql
-- Delete additional low-value events
DELETE FROM telemetry_events
WHERE created_at < NOW() - INTERVAL '3 days'
AND event IN ('session_start', 'search_query', 'diagnostic_completed', 'health_check_completed');
```
## Performance Considerations
### Cron Job Execution Time
The daily cleanup typically takes:
- **Aggregation**: 30-60 seconds
- **Deletion**: 15-30 seconds
- **VACUUM**: 2-5 minutes
- **Total**: ~3-7 minutes
### Query Performance
All aggregation tables have indexes on:
- Date columns (for time-series queries)
- Lookup columns (tool_name, error_type, etc.)
- User columns (for user-specific analysis)
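As a rough illustration of those index shapes (the actual index names and column lists are defined in the migration and may differ):

```sql
-- Indicative examples only; see supabase-telemetry-aggregation.sql
-- for the real definitions.
CREATE INDEX IF NOT EXISTS idx_tool_usage_daily_date
  ON telemetry_tool_usage_daily (aggregation_date);
CREATE INDEX IF NOT EXISTS idx_tool_usage_daily_tool
  ON telemetry_tool_usage_daily (tool_name);
CREATE INDEX IF NOT EXISTS idx_error_patterns_type
  ON telemetry_error_patterns (error_type, aggregation_date);
```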
### Lock Considerations
- `VACUUM ANALYZE`: Minimal locking, safe during operation
- `VACUUM FULL`: Locks table, run during off-peak hours
- Aggregation functions: Read-only queries, no locking
## Customization
### Add Custom Aggregations
To track additional metrics, create new aggregation tables:
```sql
-- Example: Session duration aggregation
CREATE TABLE telemetry_session_duration_daily (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregation_date DATE NOT NULL,
  avg_duration_seconds NUMERIC,
  median_duration_seconds NUMERIC,
  max_duration_seconds NUMERIC,
  session_count INTEGER,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE (aggregation_date)
);

-- Add the corresponding aggregation step to the cleanup function
-- (modify run_telemetry_aggregation_and_cleanup)
```
### Modify Retention Policies
Edit the `run_telemetry_aggregation_and_cleanup` function to adjust retention by event type:
```sql
-- Keep validation_details for 7 days instead of 3
DELETE FROM telemetry_events
WHERE created_at < (NOW() - INTERVAL '7 days')
AND event = 'validation_details';
```
### Change Cron Schedule
Adjust the execution time if needed:
```sql
-- Run at a different time (e.g., 3 AM UTC)
SELECT cron.schedule(
  'telemetry-daily-cleanup',
  '0 3 * * *', -- 3 AM instead of 2 AM
  $$ SELECT run_telemetry_aggregation_and_cleanup(3); VACUUM ANALYZE telemetry_events; $$
);

-- Run twice daily (2 AM and 2 PM)
SELECT cron.schedule(
  'telemetry-cleanup-morning',
  '0 2 * * *',
  $$ SELECT run_telemetry_aggregation_and_cleanup(3); $$
);
SELECT cron.schedule(
  'telemetry-cleanup-afternoon',
  '0 14 * * *',
  $$ SELECT run_telemetry_aggregation_and_cleanup(3); $$
);
```
## Backup & Recovery
### Before Running Emergency Cleanup
Back up the aggregated data first:
```sql
-- Export aggregated data to CSV or backup tables
CREATE TABLE telemetry_tool_usage_backup AS
SELECT * FROM telemetry_tool_usage_daily;
CREATE TABLE telemetry_patterns_backup AS
SELECT * FROM telemetry_tool_patterns;
```
### Restore Deleted Data
Raw event data cannot be restored after deletion. However, aggregated insights are preserved indefinitely.
To prevent accidental data loss:
1. Test cleanup functions on staging first
2. Review `check_database_size()` before running emergency cleanup
3. Start with longer retention periods (7 days) and reduce gradually
4. Monitor aggregated data quality for 1-2 weeks
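If you prefer file-based backups over backup tables, psql's client-side `\copy` can export the aggregates to CSV before any cleanup (run from a psql session connected to the Supabase database; file names here are just examples):

```sql
-- \copy runs client-side, so no server filesystem access is needed
\copy telemetry_tool_usage_daily TO 'tool_usage_backup.csv' WITH (FORMAT csv, HEADER)
\copy telemetry_tool_patterns TO 'tool_patterns_backup.csv' WITH (FORMAT csv, HEADER)
```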
## Monitoring Dashboard Queries
### Weekly Growth Report
```sql
-- Database growth over last 7 days
SELECT
  DATE(created_at) AS date,
  COUNT(*) AS events_created,
  COUNT(DISTINCT event) AS event_types,
  COUNT(DISTINCT user_id) AS active_users,
  ROUND(SUM(pg_column_size(telemetry_events.*))::NUMERIC / 1024 / 1024, 2) AS size_mb
FROM telemetry_events
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```
### Storage Efficiency Report
```sql
-- Compare raw vs aggregated storage
SELECT
  'Raw Events (last 3 days)' AS category,
  COUNT(*) AS row_count,
  pg_size_pretty(pg_total_relation_size('telemetry_events')) AS table_size
FROM telemetry_events
WHERE created_at >= NOW() - INTERVAL '3 days'
UNION ALL
SELECT
  'Aggregated Insights (all time)',
  (SELECT COUNT(*) FROM telemetry_tool_usage_daily) +
  (SELECT COUNT(*) FROM telemetry_tool_patterns) +
  (SELECT COUNT(*) FROM telemetry_workflow_insights) +
  (SELECT COUNT(*) FROM telemetry_error_patterns) +
  (SELECT COUNT(*) FROM telemetry_validation_insights),
  pg_size_pretty(
    pg_total_relation_size('telemetry_tool_usage_daily') +
    pg_total_relation_size('telemetry_tool_patterns') +
    pg_total_relation_size('telemetry_workflow_insights') +
    pg_total_relation_size('telemetry_error_patterns') +
    pg_total_relation_size('telemetry_validation_insights')
  );
```
### Top Events by Size
```sql
-- Which event types consume the most space
SELECT
  event,
  COUNT(*) AS event_count,
  pg_size_pretty(SUM(pg_column_size(telemetry_events.*))::BIGINT) AS total_size,
  pg_size_pretty(AVG(pg_column_size(telemetry_events.*))::BIGINT) AS avg_size_per_event,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_events
FROM telemetry_events
GROUP BY event
ORDER BY SUM(pg_column_size(telemetry_events.*)) DESC;
```
## Success Metrics
Track these metrics weekly to ensure the system is working:
### Target Metrics (After Implementation)
- Database size: **< 150 MB** (< 30% of limit)
- Growth rate: **< 3 MB/day** (sustainable)
- Raw event retention: **3 days** (configurable)
- Aggregated data: **All-time insights available**
- Cron job success rate: **> 95%**
- Query performance: **< 500ms for aggregated queries**
### Review Schedule
- **Daily**: Check `check_database_size()` status
- **Weekly**: Review aggregated insights and growth trends
- **Monthly**: Analyze cron job success rate and adjust retention if needed
- **After each release**: Compare usage patterns to previous version
## Quick Reference
### Essential Commands
```sql
-- Check database health
SELECT * FROM check_database_size();
-- View recent aggregated insights
SELECT * FROM telemetry_tool_usage_daily ORDER BY aggregation_date DESC LIMIT 10;
-- Run manual cleanup (3-day retention)
SELECT * FROM run_telemetry_aggregation_and_cleanup(3);
VACUUM ANALYZE telemetry_events;
-- Emergency cleanup (7-day retention)
SELECT * FROM emergency_cleanup();
VACUUM FULL telemetry_events;
-- View cron job status
SELECT * FROM cron.job WHERE jobname = 'telemetry-daily-cleanup';
-- View cron execution history
SELECT * FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'telemetry-daily-cleanup')
ORDER BY start_time DESC LIMIT 5;
```
## Support
If you encounter issues:
1. Check the troubleshooting section above
2. Review cron job execution logs
3. Verify pg_cron extension is enabled
4. Test aggregation functions manually
5. Check Supabase dashboard for errors
For questions or improvements, refer to the main project documentation.