mirror of https://github.com/czlonkowski/n8n-mcp.git synced 2026-01-29 22:12:05 +00:00

Files

czlonkowski 60ab66d64d feat: telemetry-driven quick wins to reduce AI agent validation errors by 30-40%

Enhanced tools documentation, duplicate ID errors, and AI Agent validator based on telemetry analysis of 593 validation errors across 3 categories:
- 378 errors: Duplicate node IDs (64%)
- 179 errors: AI Agent configuration (30%)
- 36 errors: Other validations (6%)

Quick Win #1: Enhanced tools documentation (src/mcp/tools-documentation.ts)
- Added prominent warnings to call get_node_essentials() FIRST before configuring nodes
- Emphasized 5KB vs 100KB+ size difference between essentials and full info
- Updated workflow patterns to prioritize essentials over get_node_info

Quick Win #2: Improved duplicate ID error messages (src/services/workflow-validator.ts)
- Added crypto import for UUID generation examples
- Enhanced error messages with node indices, names, and types
- Included crypto.randomUUID() example in error messages
- Helps AI agents understand EXACTLY which nodes conflict and how to fix

Quick Win #3: Added AI Agent node-specific validator (src/services/node-specific-validators.ts)
- Validates prompt configuration (promptType + text requirement)
- Checks maxIterations bounds (1-50 recommended)
- Suggests error handling (onError + retryOnFail)
- Warns about high iteration limits (cost/performance impact)
- Integrated into enhanced-config-validator.ts

Test Coverage:
- Added duplicate ID validation tests (workflow-validator.test.ts)
- Added AI Agent validator tests (node-specific-validators.test.ts:2312-2491)
- All new tests passing (3527 total passing)

Version: 2.22.12 → 2.22.13

Expected Impact: 30-40% reduction in AI agent validation errors

Technical Details:
- Telemetry analysis: 593 validation errors (Dec 2024 - Jan 2025)
- 100% error recovery rate maintained (validation working correctly)
- Root cause: Documentation/guidance gaps, not validation logic failures
- Solution: Proactive guidance at decision points

References:
- Telemetry analysis findings
- Issue #392 (helpful error messages pattern)
- Existing Slack validator pattern (node-specific-validators.ts:98-230)

Concieved by Romuald Członkowski - www.aiadvisors.pl/en

2025-11-08 18:07:26 +01:00

10 KiB

Raw Blame History

n8n-MCP Telemetry Analysis - Executive Summary

Quick Reference for Decision Makers

Analysis Date: November 8, 2025 Data Period: August 10 - November 8, 2025 (90 days) Status: Critical Issues Identified - Action Required

Key Statistics at a Glance

Metric	Value	Status
Total Errors (90 days)	8,859	96% are validation-related
Daily Average	60.68	Baseline (60-65 errors/day normal)
Peak Error Day	Oct 30	276 errors (4.5x baseline)
Days with Errors	36/90 (40%)	Intermittent spikes
Most Common Error	ValidationError	34.77% of all errors
Critical Tool Failure	get_node_info	11.72% failure rate
Performance Bottleneck	Sequential updates	55.2 seconds per operation
Active Users/Day	572	Healthy engagement
Total Users (90 days)	~5,000+	Growing user base

The 5 Critical Issues

1. Workflow-Level Validation Failures (39% of errors)

Problem: 21,423 errors from unspecified workflow structure violations

What Users See:

"Validation failed" (no indication of what's wrong)
Cannot deploy workflows
Must guess what structure requirement violated

Impact: Users abandon workflows; AI agents retry blindly

Fix: Provide specific error messages explaining exactly what failed

"Missing start trigger node"
"Type mismatch in node connection"
"Required property missing: URL"

Effort: 2 days | Impact: High | Priority: 1

2. `get_node_info` Unreliability (11.72% failure rate)

Problem: 1,208 failures out of 10,304 calls to retrieve node information

What Users See:

Cannot load node specifications when building workflows
Missing information about node properties
Forced to use incomplete data (fallback to essentials)

Impact: Workflows built with wrong configuration assumptions; validation failures cascade

Fix: Add retry logic, caching, and fallback mechanism

Effort: 1 day | Impact: High | Priority: 1

3. Slow Sequential Updates (55+ seconds per operation)

Problem: 96,003 sequential workflow updates take average 55.2 seconds each

What Users See:

Workflow construction takes minutes instead of seconds
"System appears stuck" (agent waiting 55s between operations)
Poor user experience

Impact: Users abandon complex workflows; slow AI agent response

Fix: Implement batch update operation (apply multiple changes in 1 call)

Effort: 2-3 days | Impact: Critical | Priority: 1

4. Search Inefficiency (17% retry rate)

Problem: 68,056 sequential search calls; users need multiple searches to find nodes

What Users See:

Search for "http" doesn't show "HTTP Request" in top results
Users refine search 2-3 times
Extra API calls and latency

Impact: Slower node discovery; AI agents waste API calls

Fix: Improve search ranking for high-volume queries

Effort: 2 days | Impact: Medium | Priority: 2

Problem: 2,767 TypeError occurrences from configuration mismatches

What Users See:

Node validation fails due to type mismatch
"string vs. number" errors without clear resolution
Configuration seems correct but validation fails

Impact: Users unsure of correct configuration format

Fix: Implement strict type system; add TypeScript types for common nodes

Effort: 3 days | Impact: Medium | Priority: 2

Business Impact Summary

Current State: What's Broken?

Area	Problem	Impact
Reliability	`get_node_info` fails 11.72%	Users blocked 1 in 8 times
Feedback	Generic error messages	Users can't self-fix errors
Performance	55s per sequential update	5-node workflow takes 4+ minutes
Search	17% require refine search	Extra latency; poor UX
Types	31% of errors type-related	Users make wrong assumptions

If No Action Taken

Error volume likely to remain at 60+ per day
User frustration compounds
AI agents become unreliable (cascading failures)
Adoption plateau or decline
Support burden increases

With Phase 1 Fixes (Week 1)

get_node_info reliability: 11.72% → <1% (91% improvement)
Validation errors: 21,423 → <1,000 (95% improvement in clarity)
Sequential updates: 55.2s → <5s (91% improvement)
Overall error reduction: 40-50%
User satisfaction: +60% (estimated)

Full Implementation (4-5 weeks)

Error volume: 8,859 → <2,000 per quarter (77% reduction)
Tool failure rates: <1% across board
Performance: 90% improvement in workflow creation
User retention: +35% (estimated)

Implementation Roadmap

Week 1 (Immediate Wins)

Fix get_node_info reliability [1 day]
Improve validation error messages [2 days]
Add batch update operation [2 days]

Impact: Address 60% of user-facing issues

Week 2-3 (High Priority)

Implement validation caching [1-2 days]
Improve search ranking [2 days]
Add TypeScript types [3 days]

Impact: Performance +70%; Errors -30%

Week 4 (Optimization)

Return updated state in responses [1-2 days]
Add workflow diff generation [1-2 days]

Impact: Eliminate 40% of API calls

Ongoing (Documentation)

Create error code documentation [1 week]
Add configuration examples [2 weeks]

Resource Requirements

Phase	Duration	Team	Impact	Business Value
Phase 1	1 week	1 engineer	60% of issues	High ROI
Phase 2	2 weeks	1 engineer	+30% improvement	Medium ROI
Phase 3	1 week	1 engineer	+10% improvement	Low ROI
Phase 4	3 weeks	0.5 engineer	Support reduction	Medium ROI

Total: 7 weeks, 1 engineer FTE, +35% overall improvement

Risk Assessment

Risk	Likelihood	Impact	Mitigation
Breaking API changes	Low	High	Maintain backward compatibility
Performance regression	Low	High	Load test before deployment
Validation false positives	Medium	Medium	Beta test with sample workflows
Incomplete implementation	Low	Medium	Clear definition of done per task

Overall Risk Level: Low (with proper mitigation)

Success Metrics (Measurable)

By End of Week 1

get_node_info failure rate < 2%
Validation errors provide specific guidance
Batch update operation deployed and tested

By End of Week 3

Overall error rate < 3,000/quarter
Tool success rates > 98% across board
Average workflow creation time < 2 minutes

By End of Week 5

Error volume < 2,000/quarter (77% reduction)
All users can self-resolve 80% of common errors
AI agent success rate improves by 30%

Top Recommendations

Do This First (Week 1)

Fix get_node_info - Affects most critical user action
- Add retry logic [4 hours]
- Implement cache [4 hours]
- Add fallback [4 hours]
Improve Validation Messages - Addresses 39% of errors
- Create error code system [8 hours]
- Enhance validation logic [8 hours]
- Add help documentation [4 hours]
Add Batch Updates - Fixes performance bottleneck
- Define API [4 hours]
- Implement handler [12 hours]
- Test & integrate [4 hours]

Avoid This (Anti-patterns)

❌ Increasing error logging without actionable feedback
❌ Adding more validation without improving error messages
❌ Optimizing non-critical operations while critical issues remain
❌ Waiting for perfect data before implementing fixes

Stakeholder Questions & Answers

Q: Why are there so many validation errors if most tools work (96%+)?

A: Validation happens in a separate system. Core tools are reliable, but validation feedback is poor. Users create invalid workflows, validation rejects them generically, and users can't understand why.

Q: Is the system unstable?

A: No. Infrastructure is stable (99% uptime estimated). The issue is usability: errors are generic and operations are slow.

Q: Should we defer fixes until next quarter?

A: No. Every day of 60+ daily errors compounds user frustration. Early fixes have highest ROI (1 week = 40-50% improvement).

Q: What about the Oct 30 spike (276 errors)?

A: Likely specific trigger (batch test, migration). Current baseline is 60-65 errors/day, which is sustainable but improvable.

Q: Which issue is most urgent?

A: get_node_info reliability. It's the foundation for everything else. Without it, users can't build workflows correctly.

Next Steps

This Week
- Review this analysis with engineering team
- Estimate resource allocation
- Prioritize Phase 1 tasks
Next Week
- Start Phase 1 implementation
- Set up monitoring for improvements
- Begin user communication about fixes
Week 3
- Deploy Phase 1 fixes
- Measure improvements
- Start Phase 2

Questions?

For detailed analysis: See TELEMETRY_ANALYSIS_REPORT.md For technical details: See TELEMETRY_TECHNICAL_DEEP_DIVE.md For implementation: See IMPLEMENTATION_ROADMAP.md

Analysis by: AI Telemetry Analyst Confidence Level: High (506K+ events analyzed) Last Updated: November 8, 2025 Review Frequency: Weekly recommended Next Review Date: November 15, 2025

Appendix: Key Data Points

Error Distribution

ValidationError: 3,080 (34.77%)
TypeError: 2,767 (31.23%)
Generic Error: 2,711 (30.60%)
SqliteError: 202 (2.28%)
Other: 99 (1.12%)

Tool Reliability (Top Issues)

get_node_info: 88.28% success (11.72% failure)
validate_node_operation: 93.58% success (6.42% failure)
get_node_documentation: 95.87% success (4.13% failure)
All others: 96-100% success

User Engagement

Daily sessions: 895 (avg)
Daily users: 572 (avg)
Sessions/user: 1.52 (avg)
Peak day: 1,821 sessions (Oct 22)

Most Searched Topics

Testing (5,852 searches)
Webhooks (5,087)
HTTP (4,241)
Database (4,030)
API integration (2,074)

Performance Bottlenecks

Update loop: 55.2s avg (66% slow)
Read-after-write: 96.6s avg (63% slow)
Search refinement: 17% need 2+ queries
Session creation: ~5-10 seconds

10 KiB Raw Blame History

n8n-MCP Telemetry Analysis - Executive Summary

Quick Reference for Decision Makers

Key Statistics at a Glance

The 5 Critical Issues

1. Workflow-Level Validation Failures (39% of errors)

2. get_node_info Unreliability (11.72% failure rate)

3. Slow Sequential Updates (55+ seconds per operation)

4. Search Inefficiency (17% retry rate)

5. Type-Related Validation Errors (31.23% of errors)

Business Impact Summary

Current State: What's Broken?

If No Action Taken

With Phase 1 Fixes (Week 1)

Full Implementation (4-5 weeks)

Implementation Roadmap

Week 1 (Immediate Wins)

Week 2-3 (High Priority)

Week 4 (Optimization)

Ongoing (Documentation)

Resource Requirements

Risk Assessment

Success Metrics (Measurable)

By End of Week 1

By End of Week 3

By End of Week 5

Top Recommendations

Do This First (Week 1)

Avoid This (Anti-patterns)

Stakeholder Questions & Answers

Next Steps

Questions?

Appendix: Key Data Points

Error Distribution

Tool Reliability (Top Issues)

User Engagement

Most Searched Topics

Performance Bottlenecks

10 KiB

Raw Blame History

2. `get_node_info` Unreliability (11.72% failure rate)