Entity Relationship Primary Key Fix: Preventing Silent Data Corruption
Published: 2025-01-16
When the LLM Council reviewed our entity relationships implementation (ADR-026), they flagged a critical flaw: our 4-column primary key blocked multiple relationship types between the same entities, and duplicate relationships could silently corrupt traceability data.
This post details how Issue #458 fixed both data integrity gaps with a proper 5-column primary key.
The Problem
Our EntityRelationship table connects any two entities with typed relationships:
-- Original schema (simplified)
CREATE TABLE entity_relationships (
source_type VARCHAR(50),
source_id VARCHAR(255),
target_type VARCHAR(50),
target_id VARCHAR(255),
relationship_type VARCHAR(50),
PRIMARY KEY (source_type, source_id, target_type, target_id)
-- Notice: relationship_type NOT in primary key!
);
The 4-column primary key created two problems:
Problem 1: Blocked Multiple Relationship Types
-- Insert #1: REQ-001 depends on REQ-002
INSERT INTO entity_relationships
VALUES ('requirement', 'REQ-001', 'requirement', 'REQ-002', 'depends_on');
-- SUCCESS!
-- Insert #2: REQ-001 also references REQ-002
INSERT INTO entity_relationships
VALUES ('requirement', 'REQ-001', 'requirement', 'REQ-002', 'references');
-- FAILS! PK violation (same 4-column key)
This was fundamentally broken—entities couldn't have multiple relationship types!
Problem 2: Potential Duplicate Relationships
Without proper constraints, race conditions or UI bugs could create duplicate relationships.
Consequences
- Feature Blockage: Can't express "depends_on AND references" relationships
- Graph Pollution: (if duplicates existed) Wrong edge counts
- IEEE 29148-2018 Non-Compliance: Traceability requires rich relationships
The Solution
Change the primary key to 5 columns, including relationship_type:
# backend/models/generic_relationships.py
class EntityRelationship(Base):
__tablename__ = "entity_relationships"
# 5-column composite primary key
source_type = Column(String(50), primary_key=True)
source_id = Column(String(255), primary_key=True)
target_type = Column(String(50), primary_key=True)
target_id = Column(String(255), primary_key=True)
relationship_type = Column(String(50), primary_key=True) # Now in PK!
Key Insight: Multiple Relationship Types Are Now Valid
With the 5-column PK:
- REQ-001 → REQ-002 with
depends_on✓ - REQ-001 → REQ-002 with
references✓ (different relationship type = different row) - REQ-001 → REQ-002 with
depends_onagain ✗ (exact duplicate blocked by PK)
Migration Strategy
The migration handles the schema change safely:
def upgrade():
# Step 1: Remove any duplicates (keep newest by updated_at)
conn.execute(text("""
DELETE FROM entity_relationships
WHERE ctid NOT IN (
SELECT DISTINCT ON (source_type, source_id, target_type, target_id, relationship_type)
ctid
FROM entity_relationships
ORDER BY source_type, source_id, target_type, target_id, relationship_type,
updated_at DESC NULLS LAST
)
"""))
# Step 2: Drop 4-column primary key
op.drop_constraint("entity_relationships_pkey", "entity_relationships", type_="primary")
# Step 3: Create 5-column primary key
op.create_primary_key(
"entity_relationships_pkey",
"entity_relationships",
["source_type", "source_id", "target_type", "target_id", "relationship_type"],
)
Testing
Our TDD tests verify the constraint at the database level:
def test_duplicate_relationship_rejected_at_database_level(self, postgres_db):
"""Exact duplicate 5-tuple MUST raise IntegrityError."""
rel1 = EntityRelationship(
source_type="requirement", source_id="REQ-DUP-001",
target_type="capability", target_id="CAP-DUP-001",
relationship_type="depends_on",
)
postgres_db.add(rel1)
postgres_db.flush()
# Attempt exact duplicate
rel2_duplicate = EntityRelationship(
source_type="requirement", source_id="REQ-DUP-001",
target_type="capability", target_id="CAP-DUP-001",
relationship_type="depends_on", # Same!
)
postgres_db.add(rel2_duplicate)
with pytest.raises(IntegrityError):
postgres_db.flush() # MUST fail
Impact
| Before | After |
|---|---|
| Duplicates silently accepted | IntegrityError on duplicate |
| Graph traversal counts edges incorrectly | Clean graph structure |
| UI shows duplicates | No duplicates possible |
| ADR-026: CONDITIONAL | ADR-026: APPROVED |
Lessons Learned
- Composite Primary Keys Need Review: Adding a column to a table doesn't mean it's part of the uniqueness guarantee
- Semantic Differences Matter: "depends_on" and "references" are different relationships—the fix preserves this distinction
- Database Constraints > Application Validation: Race conditions can bypass application checks; database constraints are authoritative
- Migration Must Handle Existing Data: Don't just add constraints—clean up legacy data first
Issue #458 | ADR-026 | LLM Council Blocking Issue Resolved