ADR-002: Four-Layer Semantic Analysis Pipeline

Status: APPROVED with Modifications
Date: 2025-08-25
Author: Architecture Review Team

Context

The semantic analysis engine is designed around four layers:

  1. Static Analysis (languages, frameworks, dependencies)
  2. Semantic Classification (AI-powered problem domain identification)
  3. Embedding Generation (multi-modal vectors for similarity)
  4. Similarity Search & Clustering (find related projects)
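A minimal TypeScript sketch of how these layers might compose — all type and stage names here are illustrative stand-ins, not the actual SmartBadge interfaces:

```typescript
// Illustrative types; the real SmartBadge schemas are not defined in this ADR.
interface Repository { name: string; files: string[] }
interface StaticAnalysis { languages: string[]; frameworks: string[] }
interface Classification { domain: string; confidence: number }

// A stage is an async transform from one layer's output to the next's input.
type Stage<I, O> = (input: I) => Promise<O>

// Compose layers 1 and 2; layers 3 and 4 would chain on in the same way.
function composePipeline(
  staticAnalyze: Stage<Repository, StaticAnalysis>,
  classify: Stage<StaticAnalysis, Classification>,
) {
  return async (repo: Repository) => {
    const staticData = await staticAnalyze(repo)
    const semantic = await classify(staticData)
    return { staticData, semantic }
  }
}
```

Modeling each layer as a typed stage is what makes the staged rollout below cheap: Stage 1 can fuse two stages into one, and Stage 2 can split them again without touching the surrounding pipeline.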

This is the core differentiator that transforms SmartBadge from "badge generator" to "semantic intelligence platform."

Decision

APPROVED with staged implementation and critical modifications:

Stage 1 (MVP): Simplified 3-layer approach

  1. Static Analysis (existing plan)
  2. Combined Semantic Classification + Embedding Generation
  3. Basic Similarity Search

Stage 2 (Production): Full 4-layer implementation

Consequences

Positive:

  • Maintains core differentiator vision
  • Staged approach reduces initial complexity
  • Early validation of semantic value proposition
  • Cost optimization through combined operations

Negative:

  • Potential quality trade-offs in MVP stage
  • Risk of over-engineering for initial validation
  • High dependency on OpenAI API availability and costs

Alternatives Considered

  1. Full 4-Layer from Start (Original Plan)

    • Pros: Maximum accuracy and capability
    • Cons: Over-engineering for MVP, high complexity
  2. Simplified 3-Layer (RECOMMENDED for MVP)

    • Pros: Faster implementation, easier debugging, cost optimization
    • Cons: Potentially lower accuracy initially
  3. Static Analysis Only

    • Pros: Fast, deterministic, no AI costs
    • Cons: Eliminates core differentiator, becomes commodity badge generator
  4. External Service Integration (GitHub Topics, etc.)

    • Pros: Leverages existing classifications
    • Cons: Not differentiating, limited semantic understanding

Risk Assessment

Critical Risks:

  • AI Model Costs: At scale, could reach $50K+/month without optimization
  • Consistency Issues: Different models may produce inconsistent classifications
  • API Dependencies: Single point of failure on OpenAI availability
  • Performance Impact: Running all four layers could exceed the <30-second analysis-time requirement

Mitigation Strategies:

  1. Cost Control:

    • Implement tiered analysis (star count based)
    • Aggressive caching (7-day TTL)
    • Batch processing for efficiency
  2. Consistency Validation:

    • Implement a validator that flags classifications differing by more than 20%
    • Human review queue for edge cases
    • Ground truth dataset for continuous validation
  3. Fallback Strategy:

    • Cached similar repository classifications
    • Rule-based classification for common patterns
    • Graceful degradation to static analysis only
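The consistency check above can be sketched as a pure function over per-category confidence scores. The classification shape is a hypothetical stand-in; this ADR does not fix the real schema:

```typescript
// Hypothetical shape: a classification maps problem-domain categories
// to confidence scores in [0, 1].
type Classification = Record<string, number>

// Return the categories where two model runs disagree by more than the
// threshold (20% per the mitigation plan). Callers would route non-empty
// results to the human review queue.
function flagInconsistencies(
  a: Classification,
  b: Classification,
  threshold = 0.2,
): string[] {
  return Object.keys(a).filter(
    (category) =>
      category in b && Math.abs(a[category] - b[category]) > threshold,
  )
}
```

Flagged repositories could also feed the ground-truth dataset, so repeated disagreements surface systematic model drift rather than one-off noise.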

Migration Strategy

Stage 1 Implementation (Weeks 1-4):

class SimplifiedSemanticAnalyzer {
  async analyze(repo: Repository): Promise<Analysis> {
    // 1. Static analysis (languages, frameworks, dependencies)
    const staticData = await this.staticAnalyzer.analyze(repo)

    // 2. Combined semantic classification + embedding text (single API call)
    const completion = await this.openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "system",
        content: "Analyze repository and return both classification and embedding-ready text..."
      }],
      functions: [semanticClassificationSchema]
    })

    // Function-call arguments arrive as a JSON string and must be parsed
    // before fields such as embeddingText can be read.
    const semantic = JSON.parse(
      completion.choices[0].message.function_call?.arguments ?? "{}"
    )

    // 3. Generate embedding from the semantic output
    const embedding = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: semantic.embeddingText
    })

    return { staticData, semantic, embedding }
  }
}

Stage 2 Migration (Month 2):

  • Separate semantic and embedding layers
  • Add advanced similarity clustering
  • Implement consistency validation
  • Add human-in-the-loop feedback
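The Stage 2 separation might look like this in outline — classification and embedding become independent stages so each can be cached, validated, and swapped out on its own. All names are illustrative:

```typescript
// Illustrative Stage 2 shape: separate layers for classification and
// embedding, in contrast to Stage 1's combined single API call.
interface Classification { domain: string; confidence: number }

type Classifier = (text: string) => Promise<Classification>
type Embedder = (text: string) => Promise<number[]>

async function analyzeStage2(
  readmeText: string,
  classify: Classifier,
  embed: Embedder,
) {
  // Layer 2: semantic classification.
  const semantic = await classify(readmeText)
  // Layer 3: embedding, decoupled from the classifier's raw output so the
  // embedding model can change without re-running classification.
  const embedding = await embed(`${semantic.domain}\n${readmeText}`)
  return { semantic, embedding }
}
```

Because the stages only meet at typed boundaries, the consistency validator and human-in-the-loop feedback can wrap the classifier alone without touching embedding generation.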

Conclusion

The four-layer pipeline represents the ideal architecture, but a staged implementation reduces risk and allows for early validation. The simplified three-layer approach for the MVP maintains the core value proposition while significantly reducing initial complexity and cost.