ADR-039: MCP Workflow Tools
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- MCP Team - Tool design
- Architecture Team - Workflow patterns
Layer
MCP
Related ADRs
- ADR-020: MCP Server v2 with FastMCP
Supersedes
None
Depends On
- ADR-020: MCP Server v2 with FastMCP
Context
Complex operations span multiple entities:
- Multi-Step Workflows: Creating an SLO requires its SLI to exist first
- Relationship Management: Entities must be linked together
- Bulk Operations: Many items are processed in a single request
- Validation: Constraints span multiple entities
- AI Assistance: LLMs need orchestration tools
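The ordering constraint in the first item (an SLO needs its SLI) can be modeled as a dependency graph. This is an illustrative sketch only. The ADR does not say the workflows use a topological sort, and the `ENTITY_DEPENDENCIES` map and `creation_order` helper are hypothetical. The code uses the standard-library `graphlib.TopologicalSorter`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each entity type lists the types that must exist first.
ENTITY_DEPENDENCIES: dict[str, set[str]] = {
    "SLI": set(),
    "SLO": {"SLI"},                  # an SLO references its SLI by id
    "Relationship": {"SLI", "SLO"},  # a link needs both endpoints
}


def creation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order entity types so each one comes after its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


print(creation_order(ENTITY_DEPENDENCIES))  # ['SLI', 'SLO', 'Relationship']
```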
Requirements:
- 11 workflow tools for complex operations
- Transaction-like behavior
- Rollback on failure
- Progress reporting
- Audit trail
Decision
We implement MCP workflow tools for multi-entity orchestration:
Key Design Decisions
- 11 Workflow Tools: Cover common multi-step operations
- Transactional: All-or-nothing where possible
- Progress Callbacks: Report progress during execution
- Validation First: Check constraints before execution
- Audit Logging: Record all workflow executions
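The following minimal sketch shows one way validation-first execution, progress callbacks, and audit logging could fit together. The ADR lists these as design decisions but does not show how they are wired. `run_workflow`, `AuditLog`, and the validator, step, and callback signatures are hypothetical. Every validator runs before any step, so a rejected workflow touches no entities.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional

# Hypothetical signatures; the ADR does not define these types.
Step = Callable[[], Awaitable[object]]
Validator = Callable[[], Optional[str]]  # error message, or None if the check passes
ProgressCallback = Callable[[int, int, str], Awaitable[None]]


@dataclass
class AuditLog:
    """In-memory stand-in for the audit logger (illustrative only)."""

    entries: list[dict] = field(default_factory=list)

    def record(self, workflow: str, status: str, detail: str = "") -> None:
        self.entries.append({"workflow": workflow, "status": status, "detail": detail})


async def run_workflow(
    name: str,
    validators: list[Validator],
    steps: list[tuple[str, Step]],
    on_progress: ProgressCallback,
    audit: AuditLog,
) -> list:
    # Validation first: evaluate every constraint before any step runs.
    errors = [msg for check in validators if (msg := check()) is not None]
    if errors:
        audit.record(name, "rejected", "; ".join(errors))
        raise ValueError(f"{name}: validation failed: {errors}")

    results = []
    try:
        for index, (label, step) in enumerate(steps, start=1):
            results.append(await step())
            # Report progress after each completed step.
            await on_progress(index, len(steps), label)
    except Exception as exc:
        audit.record(name, "failed", str(exc))
        raise
    audit.record(name, "completed")
    return results


async def demo():
    audit, progress = AuditLog(), []

    async def report(done: int, total: int, label: str) -> None:
        progress.append((done, total, label))

    async def create_sli():
        return "sli-1"

    async def create_slo():
        return "slo-1"

    results = await run_workflow(
        "create_slo_with_sli",
        validators=[lambda: None],
        steps=[("create_sli", create_sli), ("create_slo", create_slo)],
        on_progress=report,
        audit=audit,
    )
    return results, progress, audit.entries


print(asyncio.run(demo()))
```

In this sketch, a step failure is recorded as `failed` and re-raised. Undoing earlier steps is left to the transaction or rollback mechanism.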
Workflow Tools
| Tool | Purpose | Entities |
|---|---|---|
| create_slo_with_sli | Create an SLO and its SLI | SLO, SLI |
| create_requirement_with_test_cases | Create a requirement with its test cases | Requirement |
| link_entities | Create a relationship | Any |
| bulk_categorize | Categorize multiple entities | Any |
| analyze_capability_gaps | Gap analysis | Capability, Requirement |
| generate_runbook_steps | AI-assisted runbook generation | Runbook |
| incident_postmortem | Create a postmortem | Incident |
| calculate_error_budgets | Recalculate error budgets | ErrorBudget |
| migrate_requirements | Bulk-migrate requirements | Requirement |
| sync_relationships | Sync IEEE 29148 relationships | Relationship |
| validate_traceability | Check the traceability matrix | All |
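To illustrate what a `validate_traceability` check might compute, the sketch below finds requirements that no test case links to. The `Link` record, the `verifies` relationship type, and the direction-agnostic matching are all assumptions. The ADR says only that the tool checks the trace matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical relationship record between two entity ids."""

    source_id: str
    target_id: str
    relationship_type: str


def untraced_requirements(
    requirement_ids: list[str],
    test_case_ids: set[str],
    links: list[Link],
    relationship_type: str = "verifies",  # assumed relationship name
) -> list[str]:
    """Return requirements (in input order) with no link to any test case."""
    traced: set[str] = set()
    for link in links:
        if link.relationship_type != relationship_type:
            continue
        # Accept either direction so the check does not depend on link orientation.
        if link.source_id in test_case_ids:
            traced.add(link.target_id)
        elif link.target_id in test_case_ids:
            traced.add(link.source_id)
    return [rid for rid in requirement_ids if rid not in traced]


links = [
    Link("TC-1", "REQ-1", "verifies"),
    Link("REQ-3", "TC-2", "verifies"),    # reversed direction still counts
    Link("TC-2", "REQ-2", "relates_to"),  # different type: does not count
]
print(untraced_requirements(["REQ-1", "REQ-2", "REQ-3"], {"TC-1", "TC-2"}, links))
# ['REQ-2']
```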
Tool Implementation Example
from fastmcp import Context  # FastMCP (ADR-020); in the MCP Python SDK: mcp.server.fastmcp

# mcp, db_transaction, sli_service, slo_service, relationship_service,
# SLICreate and SLOCreate are project-internal (see Implementation Details).


@mcp.tool()
async def create_slo_with_sli(
    name: str,
    description: str,
    sli_metric: str,
    target_percentage: float,
    window_days: int = 30,
    ctx: Context = None,  # injected by FastMCP; unused in this example
) -> dict:
    """Create an SLO along with its associated SLI.

    This workflow tool creates both entities and links them together
    in a single database transaction.

    Args:
        name: SLO name
        description: SLO description
        sli_metric: The metric the SLI measures
        target_percentage: Target percentage (e.g., 99.9)
        window_days: Measurement window in days

    Returns:
        The created SLO and SLI, plus a flag confirming the link
    """
    async with db_transaction() as db:
        # Create the SLI first; the SLO references it by id
        sli = await sli_service.create(db, SLICreate(
            title=f"SLI: {sli_metric}",
            metric_name=sli_metric,
        ))
        # Create the SLO referencing the SLI
        slo = await slo_service.create(db, SLOCreate(
            title=name,
            description=description,
            sli_id=sli.id,
            target_percentage=target_percentage,
            window_days=window_days,
        ))
        # Create the relationship; if any step raises, the transaction rolls back
        await relationship_service.link(
            db,
            source_id=slo.id,
            target_id=sli.id,
            relationship_type="measures",
        )
        return {
            "slo": slo.to_dict(),
            "sli": sli.to_dict(),
            "relationship_created": True,
        }
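The tool above relies on `db_transaction()` committing when the block completes and rolling back when it raises. That behavior is what makes the three writes all-or-nothing. The ADR does not show the implementation, so the following is a hypothetical sketch built on `contextlib.asynccontextmanager` with a fake session that records events. If the backend uses SQLAlchemy's `AsyncSession` (an assumption), `async with session.begin():` provides the same commit and rollback behavior.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Hypothetical stand-in for an async DB session; records what happened."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def commit(self) -> None:
        self.events.append("commit")

    async def rollback(self) -> None:
        self.events.append("rollback")


@asynccontextmanager
async def db_transaction(session_factory=FakeSession):
    """Yield a session; commit if the block succeeds, roll back if it raises."""
    session = session_factory()
    try:
        yield session
    except BaseException:  # includes cancellation, which should also roll back
        await session.rollback()
        raise
    else:
        await session.commit()


async def demo(fail: bool) -> list[str]:
    """Two writes in one transaction, optionally failing between them."""
    session = FakeSession()
    try:
        async with db_transaction(lambda: session) as db:
            db.events.append("create_sli")
            if fail:
                raise RuntimeError("SLO validation failed")
            db.events.append("create_slo")
    except RuntimeError:
        pass
    return session.events


print(asyncio.run(demo(fail=False)))  # ['create_sli', 'create_slo', 'commit']
print(asyncio.run(demo(fail=True)))   # ['create_sli', 'rollback']
```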
Error Handling
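import logging

logger = logging.getLogger(__name__)


class WorkflowError(Exception):
    """Raised when a workflow fails after completed steps were rolled back."""


async def workflow_with_rollback(operations: list) -> list:
    """Execute operations with rollback on failure."""
    completed = []
    try:
        for op in operations:
            result = await op.execute()
            completed.append((op, result))
    except Exception as e:
        # Roll back in reverse order; a failed rollback must not stop the rest
        for op, result in reversed(completed):
            try:
                await op.rollback(result)
            except Exception:
                logger.exception("Rollback failed for %r", op)
        raise WorkflowError(f"Workflow failed: {e}") from e
    return [r for _, r in completed]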
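The sketch below demonstrates the reverse-order guarantee. It runs a condensed copy of `workflow_with_rollback`, so the block is standalone, against hypothetical `RecordingOp` operations in which the third step fails. Only the steps that completed are rolled back, newest first.

```python
import asyncio


class WorkflowError(Exception):
    pass


async def workflow_with_rollback(operations: list) -> list:
    completed = []
    try:
        for op in operations:
            result = await op.execute()
            completed.append((op, result))
    except Exception as e:
        for op, result in reversed(completed):  # undo newest first
            await op.rollback(result)
        raise WorkflowError(f"Workflow failed: {e}") from e
    return [r for _, r in completed]


class RecordingOp:
    """Hypothetical operation that logs execute/rollback calls to a shared list."""

    def __init__(self, name: str, log: list, fail: bool = False) -> None:
        self.name, self.log, self.fail = name, log, fail

    async def execute(self) -> str:
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        self.log.append(f"execute {self.name}")
        return f"{self.name}-id"

    async def rollback(self, result: str) -> None:
        self.log.append(f"rollback {self.name} ({result})")


def demo() -> list:
    log: list = []
    ops = [RecordingOp("create_sli", log), RecordingOp("create_slo", log),
           RecordingOp("link", log, fail=True)]
    try:
        asyncio.run(workflow_with_rollback(ops))
    except WorkflowError:
        log.append("workflow failed")
    return log


print(demo())
```

The log shows both executions, then create_slo rolled back before create_sli. The failing step itself is never rolled back because it returned no result, so any side effect it left behind is its own responsibility.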
Consequences
Positive
- Atomic Operations: Multi-step operations run as a single action
- AI Friendly: Complex operations are exposed through simple tool calls
- Consistency: Entities are created in the correct dependency order
- Audit Trail: All workflow executions are logged
- Reusability: The same workflow serves both the UI and MCP
Negative
- Complexity: Multi-entity orchestration logic is harder to reason about
- Partial Failure: Rollback may not restore every side effect
- Testing: Multi-entity tests are harder to write
- Maintenance: 11 tools to maintain
Neutral
- Performance: Workflows are slower than direct single-entity calls
- Versioning: Tool signature changes affect MCP clients
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Workflow Service: backend/services/mcp_workflow_tools.py
- MCP Registration: backend/api/mcp_config.py
- Tests: backend/tests/integration/test_mcp_workflows.py
- Audit: backend/services/mcp_audit_logger.py
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED
Quality Metrics
- Consensus Strength Score (CSS): 0.95
- Deliberation Depth Index (DDI): 0.92
Council Feedback Summary
The council judged the workflow tool design to rest on exceptionally deep architectural analysis. It found the transactional approach and audit logging well suited to AI orchestration.
Key Concerns Identified:
- Idempotency: MCP clients may retry; tools must handle duplicate invocations
- Partial Failure Recovery: Rollback may not undo external side effects (e.g., notifications already sent)
- Tool Complexity: 11 tools may overwhelm LLM context; consider consolidation
Required Modifications:
- Idempotency Keys: Accept optional idempotency_key parameter; return cached result on retry
- Partial Success Schema: Return structured result showing which steps succeeded/failed:
  {
    "status": "partial_success",
    "completed": ["create_sli"],
    "failed": ["create_slo"],
    "error": "Validation failed"
  }
- Compensating Actions: Document which side effects can't be rolled back
- Timeout Handling: Long workflows need progress callbacks and timeout extension
- Human-in-the-Loop: Flag destructive operations for confirmation (bulk_delete, migrate)
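The following sketch shows one way to meet the idempotency-key requirement, under stated assumptions. Results are cached in process memory keyed by `idempotency_key`, and a per-key `asyncio.Lock` makes a concurrent retry wait for the first call instead of running the workflow twice. `IdempotencyCache` is hypothetical. A multi-worker deployment would need shared storage with an expiry policy instead.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class IdempotencyCache:
    """Hypothetical in-process cache of workflow results keyed by idempotency key.

    Entries never expire and are not shared between worker processes. A failed
    workflow is not cached, so a retry after an error runs it again.
    """

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def run(
        self, idempotency_key: Optional[str], workflow: Callable[[], Awaitable[Any]]
    ) -> Any:
        if idempotency_key is None:
            return await workflow()  # no key supplied: plain, non-deduplicated call
        lock = self._locks.setdefault(idempotency_key, asyncio.Lock())
        async with lock:  # a concurrent retry waits here for the first call
            if idempotency_key not in self._results:
                self._results[idempotency_key] = await workflow()
            return self._results[idempotency_key]


async def demo() -> tuple[list, int]:
    cache, calls = IdempotencyCache(), 0

    async def create_slo() -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)  # yield so the concurrent retry overlaps
        return {"slo_id": calls}

    # Two overlapping calls and one later retry, all with the same key.
    first = await asyncio.gather(
        cache.run("req-123", create_slo), cache.run("req-123", create_slo)
    )
    retry = await cache.run("req-123", create_slo)
    return [*first, retry], calls


print(asyncio.run(demo()))  # ([{'slo_id': 1}, {'slo_id': 1}, {'slo_id': 1}], 1)
```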
Modifications Applied
- Documented idempotency key pattern
- Added partial success response schema
- Documented compensating action limitations
- Added HITL requirement for destructive operations
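Below is a hypothetical helper that builds responses in the partial-success shape from the council's schema, plus a confirmation guard for destructive tools. The `success` and `failure` status values and the `DESTRUCTIVE_TOOLS` set are assumptions. The review specifies only `partial_success`, and it names bulk_delete and migrate as examples of destructive operations.

```python
from typing import Optional

# Hypothetical: tools that must pause for human confirmation. Mapping the
# review's "migrate" to migrate_requirements is an assumption.
DESTRUCTIVE_TOOLS = frozenset({"bulk_delete", "migrate_requirements"})


def requires_confirmation(tool_name: str, confirmed: bool) -> bool:
    """Return True if the call must wait for explicit human approval."""
    return tool_name in DESTRUCTIVE_TOOLS and not confirmed


def workflow_result(
    completed: list[str], failed: list[str], error: Optional[str] = None
) -> dict:
    """Build a response in the partial-success shape from the council review.

    Only "partial_success" comes from the review; "success" and "failure"
    are assumed names for the all-succeeded and nothing-succeeded cases.
    """
    if not failed:
        status = "success"
    elif completed:
        status = "partial_success"
    else:
        status = "failure"
    result = {"status": status, "completed": completed, "failed": failed}
    if error is not None:
        result["error"] = error
    return result


print(workflow_result(["create_sli"], ["create_slo"], "Validation failed"))
# {'status': 'partial_success', 'completed': ['create_sli'], 'failed': ['create_slo'], 'error': 'Validation failed'}
```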
Council Ranking
- claude-opus-4.5: Best Response (idempotency)
- gpt-5.2: Strong (partial failures)
- gemini-3-pro: Good (complexity)
References
- /docs/mcp/workflow-tools.md
- MCP_REMEDIATION_FINAL_STATUS.md
ADR-039 | MCP Layer | Implemented