ADR-004: RESTful API Design

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • Architecture Team - API design principles
  • Frontend Team - Consumer requirements

Layer

API

Related ADRs

  • ADR-008: FastAPI with Pydantic (implementation framework)
  • ADR-030: API Versioning (version strategy)
  • ADR-031: Request Validation (validation approach)

Supersedes

None

Depends On

None

Context

The SRE Operations Platform requires an API design that supports:

  1. Multiple Clients: Web UI, MCP server, external integrations
  2. CRUD Operations: Standard create, read, update, delete for 17 entity types
  3. Bulk Operations: Efficient handling of multiple entities
  4. Search & Filter: Complex queries across entities
  5. Pagination: Handling large result sets
  6. Documentation: Self-documenting for consumers

Key constraints:

  • Must work with React Query caching
  • Need consistent error handling
  • Support OpenAPI specification generation
  • Enable versioning for future evolution

Decision

We adopt RESTful API design with the following conventions:

Key Design Decisions

  1. Resource-Based URLs: /api/v1/{entity_type} for collections
  2. HTTP Verbs: GET (read), POST (create), PUT (update), DELETE (remove)
  3. Consistent Response Format: { items: T[], total: number } for lists
  4. Query Parameters: Pagination, sorting, filtering via query string
  5. OpenAPI-First: Full OpenAPI 3.0 specification

URL Structure

GET    /api/v1/requirements              # List all
GET    /api/v1/requirements/{id}         # Get one
POST   /api/v1/requirements              # Create
PUT    /api/v1/requirements/{id}         # Update
DELETE /api/v1/requirements/{id}         # Delete
POST   /api/v1/requirements/bulk-delete  # Bulk operation
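The URL convention above can be written down as a small route table. This is an illustrative sketch only: `entity_routes` is a hypothetical helper, not part of the platform's codebase, and it assumes every entity type exposes the same six routes shown for requirements.

```python
def entity_routes(entity_type: str, prefix: str = "/api/v1") -> list[tuple[str, str, str]]:
    """Return (method, path, purpose) tuples for one entity type."""
    collection = f"{prefix}/{entity_type}"
    item = f"{collection}/{{id}}"  # literal "{id}" placeholder, as in the table above
    return [
        ("GET", collection, "List all"),
        ("GET", item, "Get one"),
        ("POST", collection, "Create"),
        ("PUT", item, "Update"),
        ("DELETE", item, "Delete"),
        ("POST", f"{collection}/bulk-delete", "Bulk operation"),
    ]


for method, path, purpose in entity_routes("requirements"):
    print(f"{method:<6} {path:<40} # {purpose}")
```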

Response Formats

List Response:

{
  "items": [...],
  "total": 100
}

Single Entity Response:

{
  "id": "REQ-000001",
  "title": "...",
  ...
}

Error Response:

{
  "detail": "Error message",
  "status_code": 400
}

Validation Error (422):

{
  "detail": [
    {
      "loc": ["body", "title"],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}

Query Parameters

| Parameter  | Purpose           | Example                |
|------------|-------------------|------------------------|
| skip       | Pagination offset | ?skip=20               |
| limit      | Page size         | ?limit=50              |
| sort_by    | Sort field        | ?sort_by=created_at    |
| sort_order | Sort direction    | ?sort_order=desc       |
| search     | Text search       | ?search=authentication |
| status     | Filter by status  | ?status=Active         |
| type       | Filter by type    | ?type=Functional       |
| group_by   | Group results     | ?group_by=category     |

Consequences

Positive

  • Predictable: Developers know URL patterns without docs
  • Cacheable: GET requests cache effectively with React Query
  • Tooling Support: OpenAPI enables client generation
  • Browser Friendly: Standard HTTP semantics
  • Debugging: Easy to test with curl, Postman
  • Industry Standard: Low learning curve

Negative

  • Over-fetching: May return more data than needed
  • Under-fetching: May require multiple requests for related data
  • N+1 Queries: List endpoints may need optimization
  • Limited Flexibility: Complex operations require workarounds

Neutral

  • HATEOAS: Not implemented in the baseline design (not needed for an SPA); the Operational Guidelines section defines an optional _links structure
  • GraphQL Alternative: Considered but not adopted

Alternatives Considered

1. GraphQL

  • Approach: Query language with flexible data fetching
  • Rejected: Added complexity, Apollo Client conflicts (see incident 2025-10-14)

2. gRPC

  • Approach: Binary protocol with code generation
  • Rejected: Not browser-native, limited debugging

3. JSON-RPC

  • Approach: RPC-style over HTTP
  • Rejected: Less tooling, unconventional

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Route Handlers: backend/api/v1/
  • OpenAPI Schema: Auto-generated at /docs and /openapi.json
  • Pagination Utils: backend/core/pagination.py
  • Response Models: backend/schemas/
  • API Docs: docs/api/

Compliance/Validation

  • Automated checks: OpenAPI schema validation in CI
  • Manual review: API changes reviewed for REST compliance
  • Metrics: Response time and error rate per endpoint

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: CONDITIONAL APPROVAL

Quality Metrics

  • Consensus Strength Score (CSS): 0.90
  • Deliberation Depth Index (DDI): 0.88

Council Feedback Summary

The council approved the baseline design but identified critical flaws for high-volume SRE entities. The current pagination and update strategies will cause performance degradation in production.

Key Concerns Identified:

  1. Pagination is a Critical Blocker: skip/limit is O(N) and unstable with concurrent writes
  2. PUT-Only Updates: Dangerous in SRE context with concurrent automated updates
  3. Missing Bulk Operations: SRE automation needs to update hundreds of entities at once
  4. No Nested Resources: SRE data is rarely flat (incidents → timeline, services → alerts)
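Concern 2 is easiest to see with two writers acting on the same entity. The sketch below is a simplified model rather than the platform's code: `put` and `patch` are hypothetical functions, the incident record is made up, and `patch` is a plain top-level merge instead of a full JSON Merge Patch implementation.

```python
import copy

stored = {"id": "INC-1", "status": "Open", "assignee": "alice"}

# Two clients read the same version of the entity.
client_a = copy.deepcopy(stored)
client_b = copy.deepcopy(stored)


def put(current: dict, body: dict) -> dict:
    """PUT semantics: the request body replaces the whole resource."""
    return dict(body)


def patch(current: dict, body: dict) -> dict:
    """PATCH semantics (simple merge): only the supplied fields change."""
    return {**current, **body}


# PUT-only: B's full-body write silently reverts A's status change.
client_a["status"] = "Resolved"
after_a = put(stored, client_a)
client_b["assignee"] = "bob"
after_put = put(after_a, client_b)

# PATCH: each client sends only the field it changed, so both edits survive.
after_patch = patch(patch(stored, {"status": "Resolved"}), {"assignee": "bob"})

print(after_put)
print(after_patch)
```

PATCH alone does not protect two writers that change the same field; that case is what the ETag / If-Match modification addresses.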

Required Modifications:

  1. Hybrid Pagination Strategy:
    • Cursor-based (keyset): Mandatory for high-volume data (Alerts, Audit Logs, Metrics)
    • Offset-based: Only for low-cardinality config data (Users, Teams, Runbooks)
  2. Add PATCH Immediately: For partial updates (e.g., changing status to "Resolved")
  3. Implement Optimistic Concurrency: Use ETag and If-Match headers
  4. Standardize Bulk Operations:
    • Sync: POST /api/v1/alerts/bulk (list of IDs + action)
    • Async: POST .../bulk-async returning 202 + Job ID
  5. Update Response Envelope:
    {
      "items": [...],
      "meta": {
        "total": 100,
        "next_cursor": "abc...",
        "has_more": true
      }
    }
  6. Error Handling: Adopt RFC 7807 (Problem Details for HTTP APIs)
  7. Idempotency: Mandate Idempotency-Key headers on POST/PATCH

Modifications Applied

  1. Documented hybrid pagination strategy (cursor + offset)
  2. Added PATCH for partial updates
  3. Defined bulk operation patterns (sync/async)
  4. Added ETag-based optimistic concurrency recommendation
  5. Documented RFC 7807 error format

Council Ranking

  • gpt-5.2: Best Response (pagination/bulk focus)
  • gemini-3-pro: Strong (idempotency emphasis)
  • claude-opus-4.5: Good (PATCH/PUT distinction)
  • grok-4.1: Partial

Operational Guidelines (APPROVED_WITH_MODS)

Response Structure with Links:

{
  "items": [...],
  "total": 150,
  "_links": {
    "self": { "href": "/api/v1/requirements?skip=20&limit=20" },
    "first": { "href": "/api/v1/requirements?skip=0&limit=20" },
    "prev": { "href": "/api/v1/requirements?skip=0&limit=20" },
    "next": { "href": "/api/v1/requirements?skip=40&limit=20" },
    "last": { "href": "/api/v1/requirements?skip=140&limit=20" }
  }
}

Entity Response with Related Links:

{
  "id": "REQ-000001",
  "title": "User Authentication",
  "_links": {
    "self": { "href": "/api/v1/requirements/REQ-000001" },
    "capabilities": { "href": "/api/v1/requirements/REQ-000001/capabilities" },
    "test_cases": { "href": "/api/v1/requirements/REQ-000001/test-cases" },
    "history": { "href": "/api/v1/requirements/REQ-000001/history" },
    "parent": { "href": "/api/v1/capabilities/CAP-000005" }
  }
}

Implementation:

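# backend/schemas/base.py
from typing import Generic, TypeVar

from pydantic import BaseModel, ConfigDict, Field

T = TypeVar("T")


class HALLinks(BaseModel):
    """HATEOAS link structure."""
    href: str
    method: str = "GET"
    title: str | None = None


class PaginatedResponse(BaseModel, Generic[T]):
    """Paginated response with HATEOAS links."""
    # Pydantic treats names with a leading underscore as private attributes,
    # so the field is declared as `links` and exposed as "_links" via an alias.
    model_config = ConfigDict(populate_by_name=True)

    items: list[T]
    total: int
    links: dict[str, HALLinks] | None = Field(default=None, alias="_links")

    @classmethod
    def create(cls, items, total, skip, limit, base_url):
        links = {
            "self": HALLinks(href=f"{base_url}?skip={skip}&limit={limit}"),
            "first": HALLinks(href=f"{base_url}?skip=0&limit={limit}"),
        }
        # Only emit prev/next when such a page exists.
        if skip > 0:
            links["prev"] = HALLinks(href=f"{base_url}?skip={max(0, skip - limit)}&limit={limit}")
        if skip + limit < total:
            links["next"] = HALLinks(href=f"{base_url}?skip={skip + limit}&limit={limit}")
        # Serialize with model_dump(by_alias=True) so the key is emitted as "_links".
        return cls(items=items, total=total, links=links)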
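The link arithmetic can be checked independently of Pydantic. The sketch below is a stdlib-only illustration: `pagination_links` is a hypothetical helper, and the "last" link is computed on the assumption that the final page starts at the largest multiple of `limit` below `total`, which matches the example response above.

```python
from urllib.parse import urlencode


def pagination_links(base_url: str, total: int, skip: int, limit: int) -> dict[str, dict[str, str]]:
    """Build self/first/prev/next/last links for an offset-paginated collection."""

    def link(offset: int) -> dict[str, str]:
        return {"href": f"{base_url}?{urlencode({'skip': offset, 'limit': limit})}"}

    links = {"self": link(skip), "first": link(0)}
    if skip > 0:
        links["prev"] = link(max(0, skip - limit))
    if skip + limit < total:
        links["next"] = link(skip + limit)
    if total > 0:
        # Offset of the final page: the largest multiple of `limit` below `total`.
        links["last"] = link(((total - 1) // limit) * limit)
    return links


links = pagination_links("/api/v1/requirements", total=150, skip=20, limit=20)
for rel, target in links.items():
    print(rel, target["href"])
```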

Standardized Error Response Format

RFC 7807 Problem Details:

{
  "type": "https://api.ops.example.com/errors/validation-error",
  "title": "Validation Error",
  "status": 422,
  "detail": "Request validation failed",
  "instance": "/api/v1/requirements",
  "errors": [
    {
      "field": "title",
      "message": "Title is required",
      "code": "required"
    },
    {
      "field": "priority",
      "message": "Must be one of: low, medium, high, critical",
      "code": "invalid_enum"
    }
  ],
  "trace_id": "abc123def456"
}

Error Type Registry:

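| Type             | Status | Description                            |
|------------------|--------|----------------------------------------|
| validation-error | 422    | Request body validation failed         |
| not-found        | 404    | Resource not found                     |
| unauthorized     | 401    | Authentication required                |
| forbidden        | 403    | Insufficient permissions               |
| conflict         | 409    | Resource conflict (duplicate, version) |
| rate-limited     | 429    | Too many requests                      |
| internal-error   | 500    | Server error                           |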

Implementation:

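# backend/core/exceptions.py
from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse
from pydantic import BaseModel

# `app` is the FastAPI application instance; it is created here only to keep
# the snippet self-contained.
app = FastAPI()


class ProblemDetail(BaseModel):
    """RFC 7807 problem details, extended with `errors` and `trace_id`."""
    type: str = "about:blank"
    title: str
    status: int
    detail: str
    instance: str | None = None
    errors: list[dict] | None = None
    trace_id: str | None = None


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
    return JSONResponse(
        status_code=422,
        content=ProblemDetail(
            type="https://api.ops.example.com/errors/validation-error",
            title="Validation Error",
            status=422,
            detail="Request validation failed",
            instance=str(request.url.path),
            # The last element of "loc" is the name (or index) of the failing field.
            errors=[{"field": e["loc"][-1], "message": e["msg"]} for e in exc.errors()],
            # trace_id is expected to be set by upstream middleware; fall back to None if absent.
            trace_id=getattr(request.state, "trace_id", None),
        ).model_dump(),
        media_type="application/problem+json",
    )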


ADR-004 | API Layer | Implemented | APPROVED_WITH_MODS Completed