ADR-002: Four-Layer Semantic Analysis Pipeline
Status: APPROVED with Modifications
Date: 2025-08-25
Author: Architecture Review Team
Context
The semantic analysis engine is designed as a four-layer pipeline:
- Static Analysis (languages, frameworks, dependencies)
- Semantic Classification (AI-powered problem domain identification)
- Embedding Generation (multi-modal vectors for similarity)
- Similarity Search & Clustering (find related projects)
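The four layers above compose linearly, each stage consuming the previous stage's output. A minimal TypeScript sketch of that data flow (all type and function names here are illustrative assumptions, not the actual SmartBadge interfaces):

```typescript
// Hypothetical types for illustration; the real interfaces may differ.
interface StaticAnalysis { languages: string[]; frameworks: string[]; dependencies: string[] }
interface DomainClassification { domain: string; confidence: number }

type FourLayerPipeline = {
  staticAnalysis: (repoUrl: string) => StaticAnalysis
  classify: (s: StaticAnalysis) => DomainClassification
  embed: (c: DomainClassification) => number[]
  findSimilar: (embedding: number[]) => string[]
}

// Run the four stages in order, threading each output into the next layer.
function runPipeline(p: FourLayerPipeline, repoUrl: string): string[] {
  const staticData = p.staticAnalysis(repoUrl)
  const classification = p.classify(staticData)
  const embedding = p.embed(classification)
  return p.findSimilar(embedding)
}
```

Keeping the layers as plain functions like this makes each one independently swappable, which is what the staged rollout below relies on.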
This is the core differentiator that transforms SmartBadge from "badge generator" to "semantic intelligence platform."
Decision
APPROVED with staged implementation and critical modifications:
Stage 1 (MVP): Simplified 3-layer approach
- Static Analysis (existing plan)
- Combined Semantic Classification + Embedding Generation
- Basic Similarity Search
Stage 2 (Production): Full 4-layer implementation
Consequences
Positive:
- Maintains core differentiator vision
- Staged approach reduces initial complexity
- Early validation of semantic value proposition
- Cost optimization through combined operations
Negative:
- Potential quality trade-offs in MVP stage
- Risk of over-engineering for initial validation
- High dependency on OpenAI API availability and costs
Alternatives Considered
- Full 4-Layer from Start (Original Plan)
  - Pros: Maximum accuracy and capability
  - Cons: Over-engineering for MVP, high complexity
- Simplified 3-Layer (RECOMMENDED for MVP)
  - Pros: Faster implementation, easier debugging, cost optimization
  - Cons: Potentially lower accuracy initially
- Static Analysis Only
  - Pros: Fast, deterministic, no AI costs
  - Cons: Eliminates core differentiator, becomes commodity badge generator
- External Service Integration (GitHub Topics, etc.)
  - Pros: Leverages existing classifications
  - Cons: Not differentiating, limited semantic understanding
Risk Assessment
Critical Risks:
- AI Model Costs: At scale, could reach $50K+/month without optimization
- Consistency Issues: Different models may produce inconsistent classifications
- API Dependencies: Single point of failure on OpenAI availability
- Performance Impact: Running all four layers could violate the <30s per-analysis requirement
Mitigation Strategies:
- Cost Control:
  - Implement tiered analysis (star count based)
  - Aggressive caching (7-day TTL)
  - Batch processing for efficiency
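The caching mitigation can start as a simple in-memory map with a time-to-live; a sketch (the 7-day TTL matches the figure above; the class name is illustrative, and a production deployment would likely use a shared store such as Redis rather than process memory):

```typescript
// Minimal TTL cache for analysis results.
class AnalysisCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>()

  // Default TTL of 7 days, per the mitigation above.
  constructor(private ttlMs: number = 7 * 24 * 60 * 60 * 1000) {}

  set(key: string, value: T, now: number = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + this.ttlMs })
  }

  // Returns undefined for missing or expired entries, evicting stale ones.
  get(key: string, now: number = Date.now()): T | undefined {
    const entry = this.store.get(key)
    if (!entry) return undefined
    if (now > entry.expiresAt) {
      this.store.delete(key)
      return undefined
    }
    return entry.value
  }
}
```

A cache hit skips both OpenAI calls entirely, which is where the bulk of the per-analysis cost sits.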
- Consistency Validation:
  - Implement a validator that flags classifications differing by more than 20%
  - Human review queue for edge cases
  - Ground truth dataset for continuous validation
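The validator itself can be a small pure function; a sketch, assuming a classification carries a domain label and a confidence score, and routing to the human review queue when the domains disagree or confidence drifts past the 20% threshold above (the type and function names are illustrative):

```typescript
interface ClassificationResult { domain: string; confidence: number }

// Flags a pair of classifications for human review when the domain labels
// disagree, or when the confidence scores differ by more than the threshold
// (0.2, i.e. the 20% figure from the mitigation above).
function needsReview(
  a: ClassificationResult,
  b: ClassificationResult,
  threshold: number = 0.2
): boolean {
  if (a.domain !== b.domain) return true
  return Math.abs(a.confidence - b.confidence) > threshold
}
```

Cases flagged by this check would land in the review queue; resolved cases can then feed the ground-truth dataset.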
- Fallback Strategy:
  - Cached similar repository classifications
  - Rule-based classification for common patterns
  - Graceful degradation to static analysis only
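The degradation chain above (AI analysis, then rule-based classification, then static analysis only) can be expressed generically as an ordered list of strategies tried in turn; a sketch, with the function name being an assumption:

```typescript
type Analyzer<T> = () => Promise<T>

// Try each analyzer in priority order, falling through to the next,
// cheaper strategy on failure. The last analyzer is the final fallback;
// if every strategy fails, the last error is surfaced.
async function analyzeWithFallback<T>(analyzers: Analyzer<T>[]): Promise<T> {
  let lastError: unknown = new Error("no analyzers provided")
  for (const analyze of analyzers) {
    try {
      return await analyze()
    } catch (err) {
      lastError = err
    }
  }
  throw lastError
}
```

With this shape, an OpenAI outage degrades service quality rather than availability: static analysis at the end of the list should never depend on the external API.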
Migration Strategy
Stage 1 Implementation (Weeks 1-4):
```typescript
class SimplifiedSemanticAnalyzer {
  constructor(private staticAnalyzer: StaticAnalyzer, private openai: OpenAI) {}

  async analyze(repo: Repository): Promise<Analysis> {
    // 1. Static analysis
    const staticData = await this.staticAnalyzer.analyze(repo)

    // 2. Combined semantic classification + embedding text (single API call)
    const completion = await this.openai.chat.completions.create({
      model: "gpt-4",
      messages: [{
        role: "system",
        content: "Analyze repository and return both classification and embedding-ready text..."
      }],
      functions: [semanticClassificationSchema]
    })

    // The structured result arrives as JSON in the function-call arguments;
    // it is not exposed directly as a property on the response object.
    const semantic = JSON.parse(
      completion.choices[0].message.function_call?.arguments ?? "{}"
    )

    // 3. Generate embedding from the semantic output
    const embedding = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: semantic.embeddingText
    })

    return { staticData, semantic, embedding }
  }
}
```
Stage 2 Migration (Month 2):
- Separate semantic and embedding layers
- Add advanced similarity clustering
- Implement consistency validation
- Add human-in-the-loop feedback
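Stage 2's similarity clustering ultimately reduces to nearest-neighbour lookup over stored embedding vectors. A brute-force cosine-similarity sketch, workable at MVP scale (a vector index or database would replace this at scale — an assumption, since the ADR does not name a store):

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Brute-force top-k search: score every stored embedding against the query
// and return the ids of the k most similar repositories.
function topKSimilar(
  query: number[],
  corpus: { id: string; embedding: number[] }[],
  k: number
): string[] {
  return corpus
    .map(({ id, embedding }) => ({ id, score: cosineSimilarity(query, embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(({ id }) => id)
}
```

The O(n) scan per query is acceptable for early validation; swapping in an approximate-nearest-neighbour index later changes only `topKSimilar`, not its callers.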
Conclusion
The four-layer pipeline represents the ideal architecture, but a staged implementation reduces risk and allows for early validation. The simplified 3-layer approach for MVP maintains the core value proposition while significantly reducing initial complexity and cost.