ADR-039: MCP Workflow Tools

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • MCP Team - Tool design
  • Architecture Team - Workflow patterns

Layer

MCP

  • ADR-020: MCP Server v2 with FastMCP

Supersedes

None

Depends On

  • ADR-020: MCP Server v2 with FastMCP

Context

Complex operations span multiple entities:

  1. Multi-Step Workflows: Creating an SLO requires its SLI to exist first
  2. Relationship Management: Link entities together
  3. Bulk Operations: Process multiple items
  4. Validation: Cross-entity constraints
  5. AI Assistance: LLM needs orchestration tools

Requirements:

  • 11 workflow tools for complex operations
  • Transaction-like behavior
  • Rollback on failure
  • Progress reporting
  • Audit trail

Decision

We implement MCP workflow tools for multi-entity orchestration:

Key Design Decisions

  1. 11 Workflow Tools: Cover common multi-step operations
  2. Transactional: All-or-nothing where possible
  3. Progress Callbacks: Report progress during execution
  4. Validation First: Check constraints before execution
  5. Audit Logging: Record all workflow executions

Workflow Tools

| Tool | Purpose | Entities |
| --- | --- | --- |
| create_slo_with_sli | Create SLO and its SLI | SLO, SLI |
| create_requirement_with_test_cases | Requirement + tests | Requirement |
| link_entities | Create relationship | Any |
| bulk_categorize | Categorize multiple | Any |
| analyze_capability_gaps | Gap analysis | Capability, Requirement |
| generate_runbook_steps | AI runbook gen | Runbook |
| incident_postmortem | Create postmortem | Incident |
| calculate_error_budgets | Recalculate budgets | ErrorBudget |
| migrate_requirements | Bulk migrate | Requirement |
| sync_relationships | Sync IEEE 29148 | Relationship |
| validate_traceability | Check trace matrix | All |

Tool Implementation Example

@mcp.tool()
async def create_slo_with_sli(
    name: str,
    description: str,
    sli_metric: str,
    target_percentage: float,
    window_days: int = 30,
    ctx: Context = None,
) -> dict:
    """Create an SLO along with its associated SLI.

    This workflow tool creates both entities and links them together.

    Args:
        name: SLO name
        description: SLO description
        sli_metric: The metric the SLI measures
        target_percentage: Target (e.g., 99.9)
        window_days: Measurement window

    Returns:
        Created SLO with linked SLI
    """
    async with db_transaction() as db:
        # Create SLI first
        sli = await sli_service.create(db, SLICreate(
            title=f"SLI: {sli_metric}",
            metric_name=sli_metric,
        ))

        # Create SLO referencing SLI
        slo = await slo_service.create(db, SLOCreate(
            title=name,
            description=description,
            sli_id=sli.id,
            target_percentage=target_percentage,
            window_days=window_days,
        ))

        # Create relationship
        await relationship_service.link(db,
            source_id=slo.id,
            target_id=sli.id,
            relationship_type="measures"
        )

        return {
            "slo": slo.to_dict(),
            "sli": sli.to_dict(),
            "relationship_created": True
        }
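
For a sense of what `target_percentage` and `window_days` imply together, the following sketch computes the allowed downtime for a time-based availability SLO. The `allowed_downtime_minutes` helper is hypothetical and not part of the workflow tools; it uses the common definition of an error budget as the share of the window not covered by the target.

```python
def allowed_downtime_minutes(target_percentage: float, window_days: int = 30) -> float:
    """Minutes of unavailability permitted by the target over the window."""
    if not 0 < target_percentage <= 100:
        raise ValueError("target_percentage must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percentage) / 100


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43.2 minutes of budget.
    print(round(allowed_downtime_minutes(99.9, 30), 1))
```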

Error Handling

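import logging

logger = logging.getLogger(__name__)


class WorkflowError(Exception):
    """Raised when a workflow fails and completed steps were rolled back."""


async def workflow_with_rollback(operations: list) -> list:
    """Execute operations in order; roll back completed ones on failure."""
    completed = []

    try:
        for op in operations:
            result = await op.execute()
            completed.append((op, result))
    except Exception as e:
        # Roll back in reverse order; keep going if one rollback fails
        for op, result in reversed(completed):
            try:
                await op.rollback(result)
            except Exception:
                logger.exception("Rollback failed for %r", op)
        # Chain the original exception so the root cause is preserved
        raise WorkflowError(f"Workflow failed: {e}") from e

    return [r for _, r in completed]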
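
To show the rollback order concretely, here is a small sketch that assumes a hypothetical `RecordingOp` class exposing the `execute()`/`rollback()` interface the helper above expects. The helper is repeated in compact form so the example runs on its own. The third operation fails, and the first two are undone in reverse order.

```python
import asyncio


class WorkflowError(Exception):
    """Stand-in for the workflow exception named in the snippet above."""


class RecordingOp:
    """Hypothetical operation that records what happened to it."""

    def __init__(self, name: str, log: list[str], fail: bool = False):
        self.name, self.log, self.fail = name, log, fail

    async def execute(self) -> str:
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        self.log.append(f"execute {self.name}")
        return self.name

    async def rollback(self, result: str) -> None:
        self.log.append(f"rollback {result}")


async def workflow_with_rollback(operations: list) -> list:
    completed = []
    try:
        for op in operations:
            completed.append((op, await op.execute()))
    except Exception as e:
        # Undo completed operations, most recent first.
        for op, result in reversed(completed):
            await op.rollback(result)
        raise WorkflowError(f"Workflow failed: {e}") from e
    return [r for _, r in completed]


async def demo() -> list[str]:
    log: list[str] = []
    ops = [
        RecordingOp("create_sli", log),
        RecordingOp("create_slo", log),
        RecordingOp("link", log, fail=True),
    ]
    try:
        await workflow_with_rollback(ops)
    except WorkflowError as err:
        log.append(str(err))
    return log


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

Rolling back in reverse matters here because the SLO references the SLI: the dependent entity has to be removed before the one it depends on.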

Consequences

Positive

  • Atomic Operations: Multi-step as single action
  • AI Friendly: Complex operations exposed simply
  • Consistency: Entities created in correct order
  • Audit Trail: All workflows logged
  • Reusability: Same workflow from UI or MCP

Negative

  • Complexity: Workflow logic is complex
  • Partial Failure: Rollback may not be perfect
  • Testing: Multi-entity tests harder
  • Maintenance: 11 tools to maintain

Neutral

  • Performance: Workflows slower than direct calls
  • Versioning: Tool changes affect clients

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Workflow Service: backend/services/mcp_workflow_tools.py
  • MCP Registration: backend/api/mcp_config.py
  • Tests: backend/tests/integration/test_mcp_workflows.py
  • Audit: backend/services/mcp_audit_logger.py

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED

Quality Metrics

  • Consensus Strength Score (CSS): 0.95
  • Deliberation Depth Index (DDI): 0.92

Council Feedback Summary

The workflow tool design reflects exceptionally deep architectural analysis. The transactional approach and audit logging are well suited to AI orchestration.

Key Concerns Identified:

  1. Idempotency: MCP clients may retry; tools must handle duplicate invocations
  2. Partial Failure Recovery: Rollback may not restore external side effects (notifications sent)
  3. Tool Complexity: 11 tools may overwhelm LLM context; consider consolidation

Required Modifications:

  1. Idempotency Keys: Accept optional idempotency_key parameter; return cached result on retry
  2. Partial Success Schema: Return structured result showing which steps succeeded/failed
    {
      "status": "partial_success",
      "completed": ["create_sli"],
      "failed": ["create_slo"],
      "error": "Validation failed"
    }
  3. Compensating Actions: Document which side effects can't be rolled back
  4. Timeout Handling: Long workflows need progress callbacks and timeout extension
  5. Human-in-the-Loop: Flag destructive operations for confirmation (bulk_delete, migrate)
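
The idempotency-key modification could look roughly like the sketch below. It assumes a hypothetical in-memory cache and a hypothetical `idempotent` decorator; a real deployment would more likely need a shared store with expiry, and none of these names come from the codebase.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable, Optional

# Hypothetical in-process cache keyed by (tool name, idempotency key).
_results: dict[tuple[str, str], Any] = {}


def idempotent(tool: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    """Return the cached result when the same idempotency_key is seen again."""

    @functools.wraps(tool)
    async def wrapper(*args: Any, idempotency_key: Optional[str] = None, **kwargs: Any) -> Any:
        if idempotency_key is None:
            return await tool(*args, **kwargs)  # no key: always execute
        cache_key = (tool.__name__, idempotency_key)
        if cache_key not in _results:
            _results[cache_key] = await tool(*args, **kwargs)
        return _results[cache_key]

    return wrapper


# Records each real execution so retries can be told apart from cache hits.
executions: list[str] = []


@idempotent
async def create_thing(name: str) -> dict:
    """Stand-in for a workflow tool."""
    executions.append(name)
    return {"name": name, "run": len(executions)}


async def demo() -> tuple[dict, dict, int]:
    first = await create_thing("availability", idempotency_key="req-1")
    retry = await create_thing("availability", idempotency_key="req-1")
    return first, retry, len(executions)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

This sketch does not guard against two concurrent calls with the same key, and it caches only successful results; both would need attention before relying on it for client retries.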

Modifications Applied

  1. Documented idempotency key pattern
  2. Added partial success response schema
  3. Documented compensating action limitations
  4. Added HITL requirement for destructive operations
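
One way to produce the partial success schema is sketched below, assuming each step is a hypothetical `(name, coroutine function)` pair and that execution stops at the first failure. The `status` values other than `partial_success` are assumptions, since this ADR only shows that one.

```python
import asyncio
from typing import Awaitable, Callable

StepFn = Callable[[], Awaitable[None]]


async def run_steps(steps: list[tuple[str, StepFn]]) -> dict:
    """Run steps in order and report which ones completed or failed."""
    completed: list[str] = []
    for name, step in steps:
        try:
            await step()
        except Exception as exc:
            return {
                # "failed" (nothing completed) is an assumed status value.
                "status": "partial_success" if completed else "failed",
                "completed": completed,
                "failed": [name],
                "error": str(exc),
            }
        completed.append(name)
    # "success" is likewise an assumed status value.
    return {"status": "success", "completed": completed, "failed": [], "error": None}


async def create_sli() -> None:
    return None


async def create_slo() -> None:
    raise ValueError("Validation failed")


if __name__ == "__main__":
    steps = [("create_sli", create_sli), ("create_slo", create_slo)]
    print(asyncio.run(run_steps(steps)))
```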

Council Ranking

  • claude-opus-4.5: Best Response (idempotency)
  • gpt-5.2: Strong (partial failures)
  • gemini-3-pro: Good (complexity)

References

  • /docs/mcp/workflow-tools.md
  • /MCP_REMEDIATION_FINAL_STATUS.md
