
ADR-004: Remote Blog Post Aggregation

Status

Accepted

Date

2026-01-03

Context

With the successful implementation of ADR-003 (Cross-Project ADR Aggregation), amiable.dev now aggregates Architecture Decision Records from all portfolio projects. This establishes a pattern for pulling content from tracked GitHub repositories at build time.

A new requirement has emerged: aggregate blog posts from tracked projects to provide a unified content experience. Users should be able to discover not just ADRs but also blog content from the projects showcased on the /projects page.

Prior Art

  1. ADR-001: Enabled docs plugin at /docs for architecture documentation
  2. ADR-002: Established projects.json as the configuration source for tracked repositories
  3. ADR-003: Implemented build-time ADR aggregation with:
    • GitHub Git Trees API for discovery
    • Raw GitHub content fetching (not counted against the REST API rate limit)
    • Front matter injection with source attribution
    • Relative link rewriting
    • Cache with commit SHA invalidation
    • Template/boilerplate exclusion
    • Status parsing and normalization

Available Solutions

Option A: docusaurus-plugin-remote-content

The @rdilweb/docusaurus-plugin-remote-content plugin provides:

Capabilities:

  • Two sync modes: Constant (auto during build) or CLI (manual)
  • Configurable source URLs and output directories
  • Content transformation via modifyContent callback
  • Separate instances for docs vs images
  • Axios-based HTTP fetching

Limitations:

  • Requires explicit document list (no auto-discovery)
  • No built-in caching or SHA-based invalidation
  • No relative link rewriting
  • Would require multiple plugin instances (one per project)
  • Less control over front matter injection
  • Active but minimal maintenance (last release 2023)

When to reconsider: If requirements evolve to explicit file lists (no discovery needed) and link rewriting becomes unnecessary, the plugin could reduce maintenance burden.

Option B: Extend Existing fetch-adrs.js Pattern

Leverage the proven architecture from ADR-003:

Advantages:

  • Consistent codebase and patterns
  • Full control over content transformation
  • Existing cache infrastructure
  • Single source of project configuration (projects.json)
  • Unified testing approach with Vitest/MSW

Considerations:

  • Additional custom code to maintain
  • Blog posts have different structure than ADRs

Option C: Create Unified Content Aggregation System

Refactor into a generalized scripts/fetch-remote-content.js that handles both ADRs and blog posts:

Advantages:

  • Single, well-tested content aggregation pipeline
  • Shared utilities (caching, link rewriting, front matter)
  • Easier to add future content types (e.g., docs, tutorials)
  • DRY principle

Considerations:

  • Larger refactoring effort
  • Risk of breaking existing ADR functionality

Decision

Option B: Extend the existing fetch-adrs.js pattern to create a parallel scripts/fetch-blog-posts.js script.

Rationale

  1. Proven Pattern: ADR-003 established a reliable, well-tested approach with 165+ tests
  2. Content Discovery: Blog posts need auto-discovery (like ADRs), not explicit document lists
  3. Link Rewriting: Critical for cross-references and images, not supported by the plugin
  4. Caching: SHA-based invalidation prevents redundant fetches
  5. Front Matter: Full control over attribution, tags, and metadata injection
  6. Independence: Keeps blog aggregation separate, reducing risk to ADR functionality
  7. Plugin Overhead: Adding the plugin introduces a dependency for functionality we already have

Code Organization

To prevent logic drift between fetch-adrs.js and fetch-blog-posts.js, share core modules:

lib/
  githubFetch.js      # Fetch + retries + rate limit handling
  cacheManifest.js    # SHA-based caching utilities
  rewriteLinks.js     # Link rewriting for images/cross-refs
  frontMatter.js      # Front matter parsing and injection

Separate entrypoints allow independent iteration while shared core logic ensures consistency.

Future Consideration (ADR-005 Triggers)

Evaluate Option C (unified system) when:

  • A third content type is needed (e.g., docs, tutorials)
  • More than 50% of code is duplicated between scripts
  • Common bugs appear in both scripts simultaneously

Implementation

Blog Post Discovery

Directory Scanning (in order of precedence):

  1. blog/ - Standard Docusaurus blog location
  2. content/blog/ - Alternative content directory
  3. posts/ - Common blog directory

First match wins; subsequent directories are not scanned if earlier ones exist.

Supported File Extensions:

  • .md (Markdown)
  • .mdx (MDX with React components)

File Patterns:

  • YYYY-MM-DD-*.md(x) - Date-prefixed posts
  • */index.md(x) - Folder-based posts (date from front matter or Git)
  • *.md(x) - Non-dated posts (date from front matter or Git)

Depth: Scan nested directories recursively (e.g., blog/2024/january/post.md).

Draft Handling: Posts with draft: true in front matter are excluded from aggregation.
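
The discovery rules above could be sketched roughly as follows. This is a minimal sketch, assuming the input is the list of blob paths already returned by a recursive Git Trees call; `BLOG_DIRS` and `discoverPosts` are hypothetical names, not part of the existing scripts. Draft filtering is not shown because it needs the parsed front matter.

```javascript
// Hypothetical helper: choose the blog directory and list candidate posts.
// `paths` is assumed to be an array of blob paths from a recursive tree listing.
const BLOG_DIRS = ['blog/', 'content/blog/', 'posts/']; // order of precedence

function discoverPosts(paths, blogDirs = BLOG_DIRS) {
  // First match wins: later directories are ignored once an earlier one exists.
  const root = blogDirs.find((dir) => paths.some((p) => p.startsWith(dir)));
  if (!root) return [];
  return paths
    .filter((p) => p.startsWith(root)) // recursive: nested paths are included
    .filter((p) => /\.mdx?$/i.test(p)) // only .md and .mdx files
    .sort();
}

const found = discoverPosts([
  'README.md',
  'blog/2024-01-01-intro.md',
  'blog/2024/january/post.mdx',
  'blog/cover.png',
  'posts/ignored.md',
]);
console.log(found);
```

Here `posts/ignored.md` is dropped because `blog/` exists and takes precedence, and `blog/cover.png` is dropped by the extension filter.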

Date Extraction Strategy

Date is critical for Docusaurus blog ordering. Extract in this order:

  1. Front matter date field - Highest priority
  2. Filename pattern - Extract from YYYY-MM-DD- prefix
  3. Git commit date - Fallback using GitHub API to get file's first commit date
  4. Fail gracefully - Log warning and skip file if no date can be determined
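
// Resolve a post's date using the precedence above; null means "skip with a warning".
function extractDate(filename, frontMatter, gitCommitDate) {
  if (frontMatter && frontMatter.date) return frontMatter.date;
  // Match against the basename so nested paths (blog/2024/2024-01-05-post.md) still work.
  const basename = filename.split('/').pop();
  const match = basename.match(/^(\d{4}-\d{2}-\d{2})/);
  if (match) return match[1];
  if (gitCommitDate) return gitCommitDate;
  return null; // Caller logs a warning and skips the file
}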
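
For the Git commit fallback (step 3), one possible approach is to reduce the GitHub "list commits" response for the file's path to its earliest author date. The sketch below assumes `commits` is the already-fetched JSON array, where each item carries `commit.author.date` as an ISO 8601 UTC timestamp. `firstCommitDate` is a hypothetical helper, and pagination is not handled.

```javascript
// Hypothetical helper: derive a YYYY-MM-DD date from a list-commits response.
// Assumes every item has the shape { commit: { author: { date: '...Z' } } }.
function firstCommitDate(commits) {
  const dates = (commits || [])
    .map((c) => c && c.commit && c.commit.author && c.commit.author.date)
    .filter(Boolean)
    .sort(); // ISO 8601 UTC timestamps sort chronologically as strings
  return dates.length > 0 ? dates[0].slice(0, 10) : null;
}

console.log(firstCommitDate([
  { commit: { author: { date: '2024-03-02T10:00:00Z' } } },
  { commit: { author: { date: '2024-01-05T08:30:00Z' } } },
]));
```

Returning `null` for an empty list keeps the "fail gracefully" step in the caller's hands.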

Collision Handling

Slug Collisions: Two repos may both contain a post with the same filename, e.g. 2024-01-01-update.md.

Strategy:

  1. Output path blog/projects/{repo-name}/ provides natural namespacing
  2. If source post has explicit slug front matter, prefix with repo name: {repo-name}-{slug}
  3. Log warning when collision is detected and resolved
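
The strategy above might look like the following. This is a sketch under stated assumptions: `namespacePost` is a hypothetical helper, `amiable-dev/other-project` is an invented repo used only for illustration, and stripping leading slashes from an explicit slug is a design choice of this sketch rather than something the ADR specifies.

```javascript
// Hypothetical helper: namespace an aggregated post by repository.
// `repo` is "owner/name"; `frontMatter` is the parsed source front matter.
function namespacePost(repo, filename, frontMatter = {}) {
  const repoName = repo.split('/').pop();
  const result = {
    outputPath: `blog/projects/${repoName}/${filename}`,
    frontMatter: { ...frontMatter },
  };
  // Only explicit slugs are prefixed; other posts rely on the output directory.
  if (frontMatter.slug) {
    const bare = String(frontMatter.slug).replace(/^\/+/, '');
    result.frontMatter.slug = `${repoName}-${bare}`;
  }
  return result;
}

const first = namespacePost('amiable-dev/llm-council', '2024-01-01-update.md', { slug: 'update' });
const second = namespacePost('amiable-dev/other-project', '2024-01-01-update.md', { slug: 'update' });
console.log(first.outputPath, first.frontMatter.slug);
console.log(second.outputPath, second.frontMatter.slug);
```

Logging the warning from step 3 is left to the caller, which knows which slugs have already been emitted.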

Configuration Enhancement

Add optional blog configuration to project entries:

{
  "repo": "amiable-dev/llm-council",
  "includeBlogPosts": true,
  "blogPath": "blog/",
  "blogConfig": {
    "includeDrafts": false,
    "tagPrefix": "llm-council"
  }
}

Projects without includeBlogPosts: true will not have their blog posts aggregated.
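
One way the script might read this configuration is sketched below. It assumes `projects.json` parses to an array of entries shaped like the example above; `blogSources` and `DEFAULT_BLOG_CONFIG` are hypothetical names, and treating a missing `blogPath` as "fall back to directory discovery" is an assumption of this sketch.

```javascript
// Hypothetical: resolve blog settings for each opted-in project entry.
const DEFAULT_BLOG_CONFIG = { includeDrafts: false, tagPrefix: null };

function blogSources(projects) {
  return projects
    .filter((p) => p.includeBlogPosts === true) // strictly opt-in
    .map((p) => ({
      repo: p.repo,
      blogPath: p.blogPath || null, // null: use directory discovery (assumption)
      config: { ...DEFAULT_BLOG_CONFIG, ...(p.blogConfig || {}) },
    }));
}

const sources = blogSources([
  { repo: 'amiable-dev/llm-council', includeBlogPosts: true, blogPath: 'blog/', blogConfig: { tagPrefix: 'llm-council' } },
  { repo: 'amiable-dev/no-blog' }, // invented entry: skipped, no opt-in flag
]);
console.log(sources);
```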

Output Structure

blog/
  projects/
    {repo-name}/
      YYYY-MM-DD-post-slug.md
      ...

Front Matter Injection

Inject full author object inline (avoids modifying global authors.yml):

---
title: Original Title
date: 2026-01-01
source_repo: amiable-dev/llm-council
source_url: https://github.com/amiable-dev/llm-council/blob/main/blog/2026-01-01-post.md
aggregated_at: 2026-01-03T12:00:00Z
tags: [llm-council, aggregated]
authors:
  - name: LLM Council Project
    url: https://github.com/amiable-dev/llm-council
    image_url: https://github.com/amiable-dev.png
---

Tag Handling:

  • Preserve original tags from source post
  • Add project-prefixed tag: {repo-name}
  • Add aggregated tag for filtering
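
The injection and tag rules could be combined along these lines. This is a sketch only: `mergeTags` and `injectFrontMatter` are hypothetical names, the `branch` default of `main` and the caller-supplied `authorName` are assumptions, and only string tags are de-duplicated.

```javascript
// Hypothetical helper: keep source tags, then add the repo tag and "aggregated".
function mergeTags(sourceTags, repoName) {
  const original = Array.isArray(sourceTags) ? sourceTags : [];
  return [...new Set([...original, repoName, 'aggregated'])];
}

// Hypothetical helper: build the front matter object for an aggregated post.
function injectFrontMatter(source, { repo, filePath, authorName, branch = 'main', now = new Date() }) {
  const [owner, repoName] = repo.split('/');
  return {
    ...source,
    source_repo: repo,
    source_url: `https://github.com/${repo}/blob/${branch}/${filePath}`,
    aggregated_at: now.toISOString(),
    tags: mergeTags(source.tags, repoName),
    // Inline author object, so authors.yml is never touched.
    authors: [{
      name: authorName,
      url: `https://github.com/${repo}`,
      image_url: `https://github.com/${owner}.png`,
    }],
  };
}

const injected = injectFrontMatter(
  { title: 'Original Title', date: '2026-01-01', tags: ['release', 'llm-council'] },
  {
    repo: 'amiable-dev/llm-council',
    filePath: 'blog/2026-01-01-post.md',
    authorName: 'LLM Council Project',
    now: new Date('2026-01-03T12:00:00Z'),
  },
);
console.log(injected);
```

Using a `Set` preserves the original tag order while avoiding a duplicate when the source post already carries the repo tag.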

Link Rewriting (same patterns as ADR-003):

  • Relative images → absolute GitHub raw URLs (raw.githubusercontent.com)
  • Internal blog links within same repo → local paths if both posts aggregated
  • Other internal links → GitHub blob URLs
  • External links → preserved as-is
  • javascript: or other suspicious protocols → stripped with warning
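
The rules above can be sketched for a single link target as follows. `rewriteLink` and the shape of `ctx` are hypothetical, and the image extension list is illustrative. The sketch also assumes aggregated posts keep their relative layout in the output tree, so links between them are left untouched; query strings, fragments on relative links, and repo-root-absolute paths are not handled.

```javascript
const path = require('node:path');

const BLOCKED = /^(javascript|data|vbscript):/i;
const IMAGE = /\.(png|jpe?g|gif|svg|webp)$/i; // illustrative list of extensions

// Hypothetical sketch. ctx: { repo, branch, dir, aggregated }, where `dir` is
// the source post's directory and `aggregated` is a Set of aggregated source paths.
function rewriteLink(href, ctx) {
  if (BLOCKED.test(href.trim())) return null; // caller strips the link and warns
  if (/^([a-z][a-z0-9+.-]*:|\/\/|#)/i.test(href)) return href; // external or anchor
  const target = path.posix.join(ctx.dir, href); // repo-relative, normalized
  if (IMAGE.test(target)) {
    return `https://raw.githubusercontent.com/${ctx.repo}/${ctx.branch}/${target}`;
  }
  if (ctx.aggregated.has(target)) return href; // both posts aggregated: keep local
  return `https://github.com/${ctx.repo}/blob/${ctx.branch}/${target}`;
}

const linkCtx = {
  repo: 'amiable-dev/llm-council',
  branch: 'main',
  dir: 'blog',
  aggregated: new Set(['blog/2026-01-15-update.md']),
};
console.log(rewriteLink('./images/diagram.png', linkCtx));
console.log(rewriteLink('../README.md', linkCtx));
```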

Cache Strategy

Create parallel .cache/blog/manifest.json with script versioning:

{
  "schemaVersion": 1,
  "scriptVersion": "1.0.0",
  "repos": {
    "amiable-dev/llm-council": {
      "sha": "abc123",
      "lastFetch": "2026-01-03T12:00:00Z",
      "posts": ["2026-01-01-intro.md", "2026-01-15-update.md"]
    }
  }
}

Cache Invalidation:

  • SHA change in repo → re-fetch all posts
  • scriptVersion change → bust entire cache (handles logic changes)
  • Manual: --no-cache flag to force re-fetch
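
The three invalidation rules reduce to a small predicate. A minimal sketch, assuming the manifest has already been parsed from `.cache/blog/manifest.json`; `needsFetch` and the `SCRIPT_VERSION` constant are hypothetical names.

```javascript
const SCRIPT_VERSION = '1.0.0'; // hypothetical constant, bumped when script logic changes

// Returns true when a repo's posts must be re-fetched.
function needsFetch(manifest, repo, currentSha, { noCache = false } = {}) {
  if (noCache) return true; // --no-cache flag
  if (!manifest || manifest.scriptVersion !== SCRIPT_VERSION) return true; // bust everything
  const entry = manifest.repos && manifest.repos[repo];
  return !entry || entry.sha !== currentSha; // new repo or SHA changed
}

const manifest = {
  schemaVersion: 1,
  scriptVersion: '1.0.0',
  repos: { 'amiable-dev/llm-council': { sha: 'abc123', posts: [] } },
};
console.log(needsFetch(manifest, 'amiable-dev/llm-council', 'abc123'));
console.log(needsFetch(manifest, 'amiable-dev/llm-council', 'def456'));
```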

Prebuild Integration

{
  "prebuild": "node scripts/fetch-adrs.js && node scripts/fetch-blog-posts.js && node scripts/fetch-projects.js"
}

Security Model

Repo Allowlist: Only aggregate from repos explicitly listed in projects.json with includeBlogPosts: true.

Content Validation:

  • Parse front matter with js-yaml's safe loader (safeLoad in js-yaml 3.x; in 4.x, load is safe by default and safeLoad was removed)
  • Strip suspicious link protocols (javascript:, data:, vbscript:)
  • Validate paths to prevent directory traversal (../)
  • No server-side code execution from fetched content
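
For the directory traversal check, one common approach is to resolve the candidate path and confirm it stays under the output root. This is a sketch, not the existing implementation: `safeOutputPath` is a hypothetical helper, and it covers only the path validation item from the list above.

```javascript
const path = require('node:path');

// Hypothetical helper: resolve `relativePath` under `outputRoot`,
// throwing if the result would escape the root (e.g. via "../").
function safeOutputPath(outputRoot, relativePath) {
  const root = path.resolve(outputRoot);
  const resolved = path.resolve(root, relativePath);
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`Refusing to write outside ${outputRoot}: ${relativePath}`);
  }
  return resolved;
}

console.log(safeOutputPath('blog/projects', 'llm-council/2026-01-01-intro.md'));
```

Comparing against `root + path.sep` rather than `root` alone avoids accepting sibling directories that merely share a prefix, such as `blog/projects-evil`.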

Rate Limiting:

  • Use GITHUB_TOKEN for authenticated requests (5000/hr vs 60/hr)
  • Use Git Trees API (single call per repo) for discovery
  • Raw content via raw.githubusercontent.com (does not count against the REST API quota)
  • Implement retry with exponential backoff
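
Retry with exponential backoff might be sketched as below. `withRetry` and `githubHeaders` are hypothetical helpers; the retry count and base delay are illustrative defaults, and `sleep` is injectable so the behaviour can be tested without waiting.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: call `fn`, retrying failures with delays of
// baseDelayMs, 2x, 4x, ... up to `retries` additional attempts.
async function withRetry(fn, { retries = 3, baseDelayMs = 500, sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Hypothetical helper: send the token only when one is configured.
function githubHeaders(env = process.env) {
  const headers = { Accept: 'application/vnd.github+json' };
  if (env.GITHUB_TOKEN) headers.Authorization = `Bearer ${env.GITHUB_TOKEN}`;
  return headers;
}
```

A caller might wrap each API request, for example `withRetry(() => fetch(url, { headers: githubHeaders() }))`, and treat non-2xx responses as errors inside the callback so that they are retried too.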

Performance Expectations

| Metric | Expected Value |
| --- | --- |
| Repos to scan | 5-10 (portfolio projects) |
| Posts per repo | 0-20 (most projects have few/no blog posts) |
| API calls per build | ~2-3 per repo (tree + content fetches) |
| Build time impact | Less than 30 seconds added (with caching) |
| Initial fetch (cold) | ~2-3 minutes for all repos |

Failure Handling: Non-blocking. Log warnings and continue build if individual repos fail. Never break the build due to GitHub API issues.
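
Non-blocking failure handling could be implemented with `Promise.allSettled`, so that one failing repo never rejects the whole run. In this sketch `aggregateAll` is a hypothetical name and `fetchRepo` stands in for whatever per-repo fetch function the script defines.

```javascript
// Hypothetical sketch: fetch every repo, warn on failures, never throw.
async function aggregateAll(repos, fetchRepo, log = console.warn) {
  const results = await Promise.allSettled(
    // Promise.resolve().then(...) also captures synchronous throws from fetchRepo.
    repos.map((repo) => Promise.resolve().then(() => fetchRepo(repo))),
  );
  const succeeded = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      succeeded.push({ repo: repos[i], posts: result.value });
    } else {
      const reason = result.reason && result.reason.message ? result.reason.message : String(result.reason);
      log(`[blog] skipping ${repos[i]}: ${reason}`);
    }
  });
  return succeeded;
}
```

The entrypoint can then exit with status 0 even when `succeeded` is empty, which keeps the `prebuild` chain from aborting.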

Consequences

Positive

  • Unified content experience across portfolio projects
  • Consistent patterns with ADR aggregation
  • Full control over content transformation
  • Efficient caching reduces API calls
  • No new external dependencies
  • Inline authors avoid global file modifications

Negative

  • Additional custom code to maintain
  • Blog posts from projects may have inconsistent formatting
  • Potential for content conflicts (duplicate slugs)
  • Temporary code duplication with fetch-adrs.js until potential ADR-005 unification
  • We own edge cases: MDX parsing, date normalization, collision handling

Neutral

  • docusaurus-plugin-remote-content remains available for future use cases
  • Separate scripts allow independent iteration on ADRs vs blog posts

Alternatives Considered

Use docusaurus-plugin-remote-content

Rejected because:

  • No auto-discovery of blog posts (requires explicit file lists)
  • No relative link rewriting (breaks images and cross-references)
  • Would require significant configuration per project
  • Less control over front matter transformation
  • Introduces external dependency for functionality we already have

Refactor Everything into Unified System

Deferred because:

  • Higher risk to stable ADR functionality
  • Premature optimization without knowing blog-specific requirements
  • Can be revisited in ADR-005 after blog aggregation is proven
  • Allows commonalities to emerge naturally before abstraction

Modify Global authors.yml

Rejected because:

  • Modifying source-controlled files during builds creates dirty git states
  • Causes merge conflicts and CI complications
  • Inline author injection in front matter is cleaner and self-contained

LLM Council Review

This ADR was reviewed by the LLM Council (Reasoning tier) on 2026-01-03. Key feedback incorporated:

  1. Author Strategy - Changed from dynamic authors.yml modification to inline front matter injection
  2. Date Extraction - Added explicit fallback strategy with Git commit date
  3. Discovery Rules - Added precedence, MDX support, draft handling, depth specification
  4. Cache Versioning - Added scriptVersion to manifest to handle logic changes
  5. Collision Handling - Added explicit slug collision resolution strategy
  6. Shared Modules - Added recommendation to share core logic via lib/ modules
  7. Security Model - Added explicit allowlist, validation, and rate limiting details
  8. ADR-005 Triggers - Added specific conditions for evaluating unified system

References