ADR-019: AI-Powered Outlook Email Analysis for ITIL Operations Insights

Status: Proposed Date: 2025-12-18 Decision Makers: Engineering, IT Operations Council Review: Full council (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4)

Context

IT Operations teams receive large volumes of unstructured support emails via shared Outlook mailboxes. These emails contain valuable operational intelligence but lack the structure of formal ITSM ticketing systems:

Current Challenges

Challenge	Impact
Unstructured data	Cannot query "top issues" or "worst apps"
Reply chain noise	Signatures, disclaimers, "thank you" messages pollute analysis
No ITIL mapping	Emails don't classify as Incident/Problem/Change/Request
Pattern blindness	Recurring issues go undetected until major outages
Tribal knowledge	Solutions exist in email threads but aren't discoverable

Desired Outcomes

Most frequent issues ranked by volume and severity
Effective solutions extracted and catalogued
Worst offending applications identified for remediation
Avoidance strategies generated proactively
Trend detection for emerging problems

Decision

Implement a Hybrid AI Architecture combining traditional NLP for preprocessing with LLM-based extraction for ITIL insights, using a "Filter-Cluster-Extract" pattern.

Council Consensus

The LLM Council unanimously agreed on three critical architectural decisions:

Graph API + Pre-processing Layer (not IMAP/SMTP)
Vector Database + Temporal Aggregation for pattern detection
Structured JSON Output constrained to ITIL taxonomy

Architecture

High-Level Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 1: INGESTION                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  Microsoft Graph API                                                         │
│  ├── OAuth 2.0 (Client Credentials Flow)                                    │
│  ├── Delta Queries for incremental sync                                     │
│  ├── Target: Shared mailboxes (support@, helpdesk@)                         │
│  └── Batch: Every 15-60 minutes via orchestrator                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 2: PRE-PROCESSING                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  2a. PII Redaction (MUST be first)                                          │
│      └── Microsoft Presidio: names, emails, IPs, passwords → <REDACTED>     │
│                                                                              │
│  2b. Thread Reconstruction                                                   │
│      └── Group by conversationId, extract First (problem) + Last (solution) │
│                                                                              │
│  2c. Noise Removal                                                           │
│      ├── Signatures: regex for "Best regards", "--", phone patterns         │
│      ├── Disclaimers: keyword match "confidential", "do not reply"          │
│      ├── Reply headers: "On [date] wrote:", "From:", "Sent:"                │
│      └── Transactional: discard <50 words or "thank you" sentiment          │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 3: AI PROCESSING                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  3a. Vectorization (Low-cost, high-volume)                                  │
│      ├── Model: text-embedding-3-small or all-MiniLM-L6-v2                  │
│      └── Output: 384-1536 dim vectors per email thread                      │
│                                                                              │
│  3b. Semantic Clustering                                                     │
│      ├── Algorithm: HDBSCAN (density-based, handles noise)                  │
│      └── Result: 1000 emails → ~50 distinct issue clusters                  │
│                                                                              │
│  3c. LLM Extraction (High-cost, cluster centroids only)                     │
│      ├── Model: GPT-4o or Llama-3-70b                                       │
│      ├── Input: Representative emails from each cluster                     │
│      └── Output: Structured JSON (see ITIL Schema below)                    │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 4: STORAGE                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│  Bronze Layer (Raw)                                                          │
│  └── Azure Blob / S3: Full JSON from Graph API (enables replay)             │
│                                                                              │
│  Vector Store                                                                │
│  └── Pinecone / pgvector / FAISS: Embeddings + metadata                     │
│                                                                              │
│  Silver Layer (Structured)                                                   │
│  └── PostgreSQL: Extracted ITIL records for analytics                       │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 5: ANALYTICS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  Dashboards: Power BI / Grafana                                              │
│  Reports: Automated weekly PDF via Azure Functions                           │
│  Alerts: Cluster drift detection → Teams/Slack notification                  │
└─────────────────────────────────────────────────────────────────────────────┘

ITIL Classification Schema

The LLM must output structured JSON conforming to this schema (enforced via Pydantic or JSON mode):

{
  "thread_id": "AAMkAGI2...",
  "timestamp": "2025-12-18T10:30:00Z",

  "classification": {
    "itil_type": "Incident | Problem | Service Request | Change Request | Noise",
    "category": "Hardware | Software | Network | Identity | Security",
    "subcategory": "VPN | Email | ERP | CRM | Endpoint | Other",
    "severity": "Critical | High | Medium | Low",
    "sentiment": "Frustrated | Neutral | Satisfied"
  },

  "extraction": {
    "issue_summary": "One sentence description of the problem",
    "affected_ci": "Application or system name (from approved CI list)",
    "error_message": "Extracted error code or message if present",
    "root_cause": "Identified cause or 'Unknown'",
    "resolution_summary": "How it was resolved or 'Unresolved'",
    "resolution_type": "Restart | Access Grant | Patch | Workaround | Escalation | User Training | Self-Healed"
  },

  "metadata": {
    "thread_length": 5,
    "time_to_resolution_hours": 2.5,
    "participants": ["support@company.com", "<REDACTED>"]
  }
}

Constrained Generation

To prevent hallucination of application names, provide the LLM with an enumerated list of valid Configuration Items:

VALID_CIS = [
    "Microsoft Outlook", "Microsoft Teams", "SAP ERP", "Salesforce CRM",
    "GlobalProtect VPN", "Okta SSO", "ServiceNow", "Jira", "Confluence",
    "Active Directory", "Azure AD", "Office 365", "Other"
]

The LLM must select from this list or output "Other" with a free-text description.

Analytics Outputs

1. Frequent Issues Report

Rank	Issue Cluster	Volume	Severity	Top Solution
1	VPN Connection Failures	234	High	Token resync via self-service
2	Password Reset Requests	189	Low	AD self-service portal
3	Outlook Freezing	156	Medium	Clear OST cache

2. Worst Offending Applications

Scoring: Score = Volume × Severity_Weight × (1 + Negative_Sentiment_Ratio)

Application	Incidents	Avg Severity	Frustration Score	Trend
SAP ERP	89	High	0.72	↑ 23%
GlobalProtect VPN	234	Medium	0.45	↓ 12%
Microsoft Teams	67	Low	0.31	→ Stable

3. Avoidance Matrix

                    HIGH COMPLEXITY (Long threads, escalations)
                              │
     PROBLEM MANAGEMENT       │      ROOT CAUSE ANALYSIS
     (Systemic fixes needed)  │      (Deep investigation)
                              │
    ──────────────────────────┼──────────────────────────────
                              │
     AUTOMATION CANDIDATE     │      TRAINING / DOCUMENTATION
     (Chatbot, self-service)  │      (User education)
                              │
                    LOW COMPLEXITY (Quick resolution)

    LOW FREQUENCY ────────────┼──────────────── HIGH FREQUENCY

4. Proactive Recommendations

LLM-generated from top clusters:

"Based on 234 VPN connection failures in the past 30 days, recommend:

Implement automatic RSA token refresh before expiration

Create self-service portal for token resync

Add monitoring alert for certificate expiration"

Privacy & Security

Concern	Mitigation
PII in emails	Microsoft Presidio redaction BEFORE any AI processing
Credentials in text	TruffleHog scanner for leaked passwords/API keys
Data residency	Same-region Azure OpenAI (GDPR/HIPAA compliance)
Access control	Row-level security by email folder/department
Retention	Auto-delete raw data after 90 days per policy
LLM data usage	Zero Data Retention agreement with API provider

Alternatives Considered

Alternative 1: Pure LLM Approach (All emails through GPT-4)

Rejected: Cost-prohibitive at scale. Processing 10,000 emails/day at $0.01/email = $3,000/month. Hybrid approach reduces LLM calls by 90%+ via clustering.

Alternative 2: Pure Traditional NLP (No LLM)

Rejected: Cannot extract nuanced insights like root causes or generate avoidance recommendations. Keyword-based classification misses semantic similarity ("frozen" vs "hangs" vs "unresponsive").

Alternative 3: Fine-tuned Domain Model

Deferred: Requires 5,000+ labeled examples. Start with few-shot prompting; fine-tune later if accuracy <85% on validation set.

Alternative 4: Direct IMAP/SMTP Access

Rejected: Security concerns (storing credentials), no OAuth support, limited metadata. Graph API provides conversationId, threading, and enterprise-grade auth.

Implementation Phases

Phase 1: Foundation (Weeks 1-4)

Azure AD app registration with Graph API permissions
Delta query ingestion pipeline (Azure Functions)
PII redaction with Presidio
Bronze layer storage (Blob)

Phase 2: Intelligence (Weeks 5-8)

Thread reconstruction and noise filtering
Embedding generation pipeline
Vector store setup (pgvector or Pinecone)
HDBSCAN clustering job

Phase 3: Extraction (Weeks 9-12)

LLM extraction prompts with JSON schema
ITIL classification validation
Silver layer database schema
CI enumeration and constraint logic

Phase 4: Analytics (Weeks 13-16)

Power BI dashboard integration
Automated weekly reports
Cluster drift alerting
Avoidance recommendation generation

Risks and Mitigations

Risk	Likelihood	Impact	Mitigation
Thread fragmentation (10 replies = 10 incidents)	High	High	Strict conversationId aggregation
Signature hallucination (LLM reads signature as issue)	Medium	Medium	Aggressive regex truncation before LLM
Topic drift in reply chains (new issue in old thread)	Medium	Medium	LLM prompt: "Is this semantically related?"
"Thank you" pollution	High	Low	Filter messages <50 words + positive sentiment
Cost overruns	Medium	High	Cluster-first approach; LLM only for centroids
Compliance breach (PII to external LLM)	Low	Critical	Presidio redaction as FIRST step, not optional

Success Metrics

Metric	Target	Measurement
Issue classification accuracy	>85% F1	Manual validation of 200 samples
Pattern detection coverage	>90% of issues in clusters	Noise cluster <10%
Time to insight	<24 hours	Ingestion to dashboard latency
Cost per email	<$0.001	Total monthly cost / emails processed
Actionable recommendations	5+ per week	Auto-generated avoidance strategies

Technology Stack

Layer	Technology	Rationale
Ingestion	MS Graph API + Azure Functions	Native O365 integration, serverless scale
Orchestration	Apache Airflow	Complex DAG support, monitoring
PII Redaction	Microsoft Presidio	Open source, runs locally
Embeddings	text-embedding-3-small	Cost-effective, good quality
Vector Store	pgvector (PostgreSQL)	Single database, familiar SQL
Clustering	HDBSCAN (scikit-learn)	Handles noise, variable density
LLM	Azure OpenAI GPT-4o	Enterprise compliance, JSON mode
Structured Output	Pydantic + Instructor	Schema enforcement
Analytics	Power BI	Enterprise standard, Graph integration
Alerting	Azure Logic Apps	Low-code, Teams/Slack integration

References

Council Review Summary

Reviewed by: GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4

Council Rankings:

Claude Opus 4.5 (0.667) - Hybrid embeddings + LLM approach
GPT-5.1 (0.500) - Feedback loop emphasis
Gemini 3 Pro (0.333) - PII-first architecture
Grok-4 (0.000) - Cloud-native scaling

Key Council Insights:

PII redaction must be architectural prerequisite, not afterthought
Vector clustering before LLM extraction reduces costs 90%+
Structured JSON output prevents ITIL taxonomy drift
Time-windowed aggregation distinguishes chronic vs acute issues

Context​

Current Challenges​

Desired Outcomes​

Decision​

Council Consensus​

Architecture​

High-Level Data Flow​

ITIL Classification Schema​

Constrained Generation​

Analytics Outputs​

1. Frequent Issues Report​

2. Worst Offending Applications​

3. Avoidance Matrix​

4. Proactive Recommendations​

Privacy & Security​

Alternatives Considered​

Alternative 1: Pure LLM Approach (All emails through GPT-4)​

Alternative 2: Pure Traditional NLP (No LLM)​

Alternative 3: Fine-tuned Domain Model​

Alternative 4: Direct IMAP/SMTP Access​

Implementation Phases​

Phase 1: Foundation (Weeks 1-4)​

Phase 2: Intelligence (Weeks 5-8)​

Phase 3: Extraction (Weeks 9-12)​

Phase 4: Analytics (Weeks 13-16)​

Risks and Mitigations​

Success Metrics​

Technology Stack​

References​

Council Review Summary​