ADR-019: AI-Powered Outlook Email Analysis for ITIL Operations Insights

Status: Proposed
Date: 2025-12-18
Decision Makers: Engineering, IT Operations
Council Review: Full council (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4)


Context

IT Operations teams receive large volumes of unstructured support emails via shared Outlook mailboxes. These emails contain valuable operational intelligence but lack the structure of formal ITSM ticketing systems.

Current Challenges

Challenge | Impact
Unstructured data | Cannot query "top issues" or "worst apps"
Reply chain noise | Signatures, disclaimers, "thank you" messages pollute analysis
No ITIL mapping | Emails don't classify as Incident/Problem/Change/Request
Pattern blindness | Recurring issues go undetected until major outages
Tribal knowledge | Solutions exist in email threads but aren't discoverable

Desired Outcomes

  1. Most frequent issues ranked by volume and severity
  2. Effective solutions extracted and catalogued
  3. Worst offending applications identified for remediation
  4. Avoidance strategies generated proactively
  5. Trend detection for emerging problems

Decision

Implement a Hybrid AI Architecture combining traditional NLP for preprocessing with LLM-based extraction for ITIL insights, using a "Filter-Cluster-Extract" pattern.

Council Consensus

The LLM Council unanimously agreed on three critical architectural decisions:

  1. Graph API + Pre-processing Layer (not IMAP/SMTP)
  2. Vector Database + Temporal Aggregation for pattern detection
  3. Structured JSON Output constrained to ITIL taxonomy

Architecture

High-Level Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 1: INGESTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ Microsoft Graph API │
│ ├── OAuth 2.0 (Client Credentials Flow) │
│ ├── Delta Queries for incremental sync │
│ ├── Target: Shared mailboxes (support@, helpdesk@) │
│ └── Batch: Every 15-60 minutes via orchestrator │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 2: PRE-PROCESSING │
├─────────────────────────────────────────────────────────────────────────────┤
│ 2a. PII Redaction (MUST be first) │
│ └── Microsoft Presidio: names, emails, IPs, passwords → <REDACTED> │
│ │
│ 2b. Thread Reconstruction │
│ └── Group by conversationId, extract First (problem) + Last (solution) │
│ │
│ 2c. Noise Removal │
│ ├── Signatures: regex for "Best regards", "--", phone patterns │
│ ├── Disclaimers: keyword match "confidential", "do not reply" │
│ ├── Reply headers: "On [date] wrote:", "From:", "Sent:" │
│ └── Transactional: discard <50 words or "thank you" sentiment │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 3: AI PROCESSING │
├─────────────────────────────────────────────────────────────────────────────┤
│ 3a. Vectorization (Low-cost, high-volume) │
│ ├── Model: text-embedding-3-small or all-MiniLM-L6-v2 │
│ └── Output: 384-1536 dim vectors per email thread │
│ │
│ 3b. Semantic Clustering │
│ ├── Algorithm: HDBSCAN (density-based, handles noise) │
│ └── Result: 1000 emails → ~50 distinct issue clusters │
│ │
│ 3c. LLM Extraction (High-cost, cluster centroids only) │
│ ├── Model: GPT-4o or Llama-3-70b │
│ ├── Input: Representative emails from each cluster │
│ └── Output: Structured JSON (see ITIL Schema below) │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 4: STORAGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ Bronze Layer (Raw) │
│ └── Azure Blob / S3: Full JSON from Graph API (enables replay) │
│ │
│ Vector Store │
│ └── Pinecone / pgvector / FAISS: Embeddings + metadata │
│ │
│ Silver Layer (Structured) │
│ └── PostgreSQL: Extracted ITIL records for analytics │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 5: ANALYTICS │
├─────────────────────────────────────────────────────────────────────────────┤
│ Dashboards: Power BI / Grafana │
│ Reports: Automated weekly PDF via Azure Functions │
│ Alerts: Cluster drift detection → Teams/Slack notification │
└─────────────────────────────────────────────────────────────────────────────┘
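Steps 2b and 2c of the pre-processing layer can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: it assumes each message has already been flattened into a dict with `conversationId`, `receivedDateTime` (ISO 8601, UTC), and a plain-text `body`, and the regex patterns and function names are hypothetical. It also reads the transactional filter as "short AND a thank-you", matching the risk table, so that short but genuine problem reports survive.

```python
import re
from collections import defaultdict

# Hypothetical patterns; real mailboxes will need tuning.
SIGNATURE_RE = re.compile(r"^(--\s*$|best regards|kind regards)", re.IGNORECASE)
REPLY_HEADER_RE = re.compile(r"^(on .+ wrote:|from:|sent:)", re.IGNORECASE)
THANKS_RE = re.compile(r"\bthank(s| you)\b", re.IGNORECASE)
MIN_WORDS = 50  # threshold from step 2c


def strip_noise(body):
    """Cut the body at the first signature line or quoted-reply header."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if SIGNATURE_RE.match(stripped) or REPLY_HEADER_RE.match(stripped):
            break
        kept.append(line)
    return "\n".join(kept).strip()


def is_transactional(body):
    """True for short messages that only express thanks."""
    return len(body.split()) < MIN_WORDS and THANKS_RE.search(body) is not None


def reconstruct_threads(messages):
    """Group by conversationId; keep the first (problem) and last (solution)."""
    threads = defaultdict(list)
    for msg in messages:
        threads[msg["conversationId"]].append(msg)

    result = {}
    for conv_id, msgs in threads.items():
        # ISO 8601 UTC timestamps in one format sort correctly as strings.
        msgs.sort(key=lambda m: m["receivedDateTime"])
        cleaned = [strip_noise(m["body"]) for m in msgs]
        substantive = [c for c in cleaned if c and not is_transactional(c)]
        if substantive:
            result[conv_id] = {
                "problem": substantive[0],
                "solution": substantive[-1],
                "length": len(msgs),
            }
    return result
```

For a single-message thread, `problem` and `solution` are the same text, which downstream extraction would likely treat as unresolved.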

ITIL Classification Schema

The LLM must output structured JSON conforming to this schema (enforced via Pydantic or JSON mode):

{
  "thread_id": "AAMkAGI2...",
  "timestamp": "2025-12-18T10:30:00Z",

  "classification": {
    "itil_type": "Incident | Problem | Service Request | Change Request | Noise",
    "category": "Hardware | Software | Network | Identity | Security",
    "subcategory": "VPN | Email | ERP | CRM | Endpoint | Other",
    "severity": "Critical | High | Medium | Low",
    "sentiment": "Frustrated | Neutral | Satisfied"
  },

  "extraction": {
    "issue_summary": "One sentence description of the problem",
    "affected_ci": "Application or system name (from approved CI list)",
    "error_message": "Extracted error code or message if present",
    "root_cause": "Identified cause or 'Unknown'",
    "resolution_summary": "How it was resolved or 'Unresolved'",
    "resolution_type": "Restart | Access Grant | Patch | Workaround | Escalation | User Training | Self-Healed"
  },

  "metadata": {
    "thread_length": 5,
    "time_to_resolution_hours": 2.5,
    "participants": ["support@company.com", "<REDACTED>"]
  }
}
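To make "conforming to this schema" concrete, the sketch below checks an LLM response against the enumerated values in the schema. It is a dependency-free stand-in for the Pydantic model the ADR names, not that model itself; `validate_record` and the error message format are hypothetical.

```python
import json

# Allowed values copied from the classification block of the schema.
ALLOWED = {
    "itil_type": {"Incident", "Problem", "Service Request", "Change Request", "Noise"},
    "category": {"Hardware", "Software", "Network", "Identity", "Security"},
    "subcategory": {"VPN", "Email", "ERP", "CRM", "Endpoint", "Other"},
    "severity": {"Critical", "High", "Medium", "Low"},
    "sentiment": {"Frustrated", "Neutral", "Satisfied"},
}
REQUIRED_EXTRACTION = {
    "issue_summary", "affected_ci", "error_message",
    "root_cause", "resolution_summary", "resolution_type",
}


def validate_record(raw):
    """Return a list of problems; an empty list means the record passes."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]

    classification = record.get("classification")
    if not isinstance(classification, dict):
        classification = {}
    extraction = record.get("extraction")
    if not isinstance(extraction, dict):
        extraction = {}

    errors = []
    for field, allowed in ALLOWED.items():
        value = classification.get(field)
        if value not in allowed:
            errors.append(f"classification.{field}: unexpected value {value!r}")
    for name in sorted(REQUIRED_EXTRACTION - set(extraction)):
        errors.append(f"extraction.{name}: missing")
    return errors
```

A non-empty result would typically trigger a retry or route the thread to manual review rather than writing it to the Silver layer.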

Constrained Generation

To prevent hallucination of application names, provide the LLM with an enumerated list of valid Configuration Items:

VALID_CIS = [
    "Microsoft Outlook", "Microsoft Teams", "SAP ERP", "Salesforce CRM",
    "GlobalProtect VPN", "Okta SSO", "ServiceNow", "Jira", "Confluence",
    "Active Directory", "Azure AD", "Office 365", "Other"
]

The LLM must select from this list or output "Other" with a free-text description.
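A post-processing guard can enforce this rule even when the model ignores the prompt. The sketch below is one possible approach, assuming case-insensitive exact matching is acceptable; `constrain_ci` is a hypothetical helper name.

```python
from typing import Optional, Tuple

VALID_CIS = [
    "Microsoft Outlook", "Microsoft Teams", "SAP ERP", "Salesforce CRM",
    "GlobalProtect VPN", "Okta SSO", "ServiceNow", "Jira", "Confluence",
    "Active Directory", "Azure AD", "Office 365", "Other",
]

# Case-insensitive lookup back to the canonical spelling.
_CANONICAL = {name.casefold(): name for name in VALID_CIS}


def constrain_ci(llm_value: str) -> Tuple[str, Optional[str]]:
    """Map the model's answer to (approved CI, free-text description or None)."""
    cleaned = llm_value.strip()
    canonical = _CANONICAL.get(cleaned.casefold())
    if canonical is not None and canonical != "Other":
        return canonical, None
    # Anything off-list becomes "Other", keeping the raw text for review.
    return "Other", cleaned or None
```

Keeping the raw text alongside "Other" lets reviewers spot applications that should be added to the CI list.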


Analytics Outputs

1. Frequent Issues Report

Rank | Issue Cluster | Volume | Severity | Top Solution
1 | VPN Connection Failures | 234 | High | Token resync via self-service
2 | Password Reset Requests | 189 | Low | AD self-service portal
3 | Outlook Freezing | 156 | Medium | Clear OST cache

2. Worst Offending Applications

Scoring: Score = Volume × Severity_Weight × (1 + Negative_Sentiment_Ratio)

Application | Incidents | Avg Severity | Frustration Score | Trend
SAP ERP | 89 | High | 0.72 | ↑ 23%
GlobalProtect VPN | 234 | Medium | 0.45 | ↓ 12%
Microsoft Teams | 67 | Low | 0.31 | → Stable
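The scoring formula can be applied to the sample rows as follows. The ADR does not define Severity_Weight, so the 1 to 4 weights below are an assumption, as is treating the Frustration Score column as the Negative_Sentiment_Ratio.

```python
# Assumed weights; the ADR does not specify them.
SEVERITY_WEIGHT = {"Critical": 4, "High": 3, "Medium": 2, "Low": 1}


def offender_score(volume, severity, negative_sentiment_ratio):
    """Score = Volume x Severity_Weight x (1 + Negative_Sentiment_Ratio)."""
    return volume * SEVERITY_WEIGHT[severity] * (1 + negative_sentiment_ratio)


# (application, incidents, average severity, frustration score) from the table.
apps = [
    ("SAP ERP", 89, "High", 0.72),
    ("GlobalProtect VPN", 234, "Medium", 0.45),
    ("Microsoft Teams", 67, "Low", 0.31),
]

ranked = sorted(apps, key=lambda row: offender_score(*row[1:]), reverse=True)
for name, volume, severity, ratio in ranked:
    print(f"{name}: {offender_score(volume, severity, ratio):.2f}")
```

Under these assumed weights the VPN's volume outweighs SAP ERP's higher severity and frustration, which shows how sensitive the ranking is to the choice of weights.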

3. Avoidance Matrix

               HIGH COMPLEXITY (Long threads, escalations)

PROBLEM MANAGEMENT        │ ROOT CAUSE ANALYSIS
(Systemic fixes needed)   │ (Deep investigation)
                          │
──────────────────────────┼──────────────────────────────
                          │
AUTOMATION CANDIDATE      │ TRAINING / DOCUMENTATION
(Chatbot, self-service)   │ (User education)

               LOW COMPLEXITY (Quick resolution)

LOW FREQUENCY ────────────┼──────────────── HIGH FREQUENCY

4. Proactive Recommendations

LLM-generated from top clusters:

"Based on 234 VPN connection failures in the past 30 days, recommend:

  1. Implement automatic RSA token refresh before expiration
  2. Create self-service portal for token resync
  3. Add monitoring alert for certificate expiration"

Privacy & Security

Concern | Mitigation
PII in emails | Microsoft Presidio redaction BEFORE any AI processing
Credentials in text | TruffleHog scanner for leaked passwords/API keys
Data residency | Same-region Azure OpenAI (GDPR/HIPAA compliance)
Access control | Row-level security by email folder/department
Retention | Auto-delete raw data after 90 days per policy
LLM data usage | Zero Data Retention agreement with API provider
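The redact-before-AI ordering can be illustrated with a small regex stand-in. This is only a sketch of the data flow, covering email addresses and IPv4 addresses; it is not a substitute for Presidio, which also handles names and other entity types. The patterns and the `redact` helper are hypothetical.

```python
import re

# Illustrative patterns only; Presidio covers far more entity types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PLACEHOLDER = "<REDACTED>"


def redact(text):
    """Replace email addresses and IPv4 addresses with the placeholder."""
    for pattern in (EMAIL_RE, IPV4_RE):
        text = pattern.sub(PLACEHOLDER, text)
    return text


def prepare_for_ai(text):
    """Redact first, then refuse to pass on text that still matches a pattern."""
    cleaned = redact(text)
    if EMAIL_RE.search(cleaned) or IPV4_RE.search(cleaned):
        raise ValueError("PII pattern survived redaction")
    return cleaned
```

Routing every embedding and LLM call through a single function like `prepare_for_ai` makes it harder for a later code path to bypass the redaction step.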

Alternatives Considered

Alternative 1: Pure LLM Approach (All emails through GPT-4)

Rejected: Cost-prohibitive at scale. Processing 10,000 emails/day at $0.01/email costs $100/day, or about $3,000/month. The hybrid approach reduces LLM calls by 90%+ via clustering.
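The arithmetic behind this rejection, worked through under the figures stated above and a 30-day month (an assumption), looks like this. The 90% figure is taken as the lower bound of "90%+".

```python
EMAILS_PER_DAY = 10_000
COST_PER_LLM_CALL = 0.01      # USD per email, from the estimate above
DAYS_PER_MONTH = 30           # assumed month length
LLM_CALL_REDUCTION = 0.90     # lower bound of the "90%+" claim

# Pure LLM: every email is sent to the model.
pure_llm_monthly = EMAILS_PER_DAY * COST_PER_LLM_CALL * DAYS_PER_MONTH

# Hybrid: only the remaining share of emails reaches the model.
hybrid_llm_monthly = pure_llm_monthly * (1 - LLM_CALL_REDUCTION)

# LLM spend spread over all emails processed in the month.
hybrid_cost_per_email = hybrid_llm_monthly / (EMAILS_PER_DAY * DAYS_PER_MONTH)

print(f"Pure LLM:  ${pure_llm_monthly:,.0f}/month")
print(f"Hybrid:    ${hybrid_llm_monthly:,.0f}/month")
print(f"Per email: ${hybrid_cost_per_email:.4f}")
```

At exactly 90% reduction the LLM spend alone works out to about $0.001 per email, so meeting the "<$0.001" cost target in Success Metrics requires a reduction above 90%, before embedding and infrastructure costs are counted.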

Alternative 2: Pure Traditional NLP (No LLM)

Rejected: Cannot extract nuanced insights like root causes or generate avoidance recommendations. Keyword-based classification misses semantic similarity ("frozen" vs "hangs" vs "unresponsive").

Alternative 3: Fine-tuned Domain Model

Deferred: Requires 5,000+ labeled examples. Start with few-shot prompting; fine-tune later if accuracy <85% on validation set.

Alternative 4: Direct IMAP/SMTP Access

Rejected: Security concerns (basic authentication means storing mailbox credentials) and limited metadata. Graph API provides conversationId, threading, and enterprise-grade auth.


Implementation Phases

Phase 1: Foundation (Weeks 1-4)

  • Azure AD app registration with Graph API permissions
  • Delta query ingestion pipeline (Azure Functions)
  • PII redaction with Presidio
  • Bronze layer storage (Blob)
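The delta query ingestion loop from Phase 1 might look like the sketch below. Graph returns pages linked by `@odata.nextLink` and ends a sync round with an `@odata.deltaLink` to use on the next run. The `fetch_page` callable is a hypothetical stand-in for an authenticated HTTP GET (for example with a bearer token from the client credentials flow), injected so the paging logic can be exercised without network access.

```python
from typing import Callable, List, Optional, Tuple

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def initial_delta_url(mailbox: str, folder: str = "inbox") -> str:
    """Build the first delta request URL for one mail folder."""
    return f"{GRAPH_BASE}/users/{mailbox}/mailFolders/{folder}/messages/delta"


def sync_messages(
    fetch_page: Callable[[str], dict], start_url: str
) -> Tuple[List[dict], Optional[str]]:
    """Follow @odata.nextLink pages until an @odata.deltaLink is returned."""
    messages: List[dict] = []
    delta_link: Optional[str] = None
    url: Optional[str] = start_url
    while url:
        page = fetch_page(url)
        messages.extend(page.get("value", []))
        delta_link = page.get("@odata.deltaLink", delta_link)
        url = page.get("@odata.nextLink")
    return messages, delta_link
```

On the next scheduled run, the stored delta link would be passed as `start_url` so that only changes since the previous sync are returned. Writing each raw page to the Bronze layer before any processing keeps replay possible.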

Phase 2: Intelligence (Weeks 5-8)

  • Thread reconstruction and noise filtering
  • Embedding generation pipeline
  • Vector store setup (pgvector or Pinecone)
  • HDBSCAN clustering job
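Once the clustering job has assigned a label to every thread, the LLM only needs a few representative threads per cluster. The sketch below picks the members closest to each cluster centroid by cosine similarity. It assumes labels follow the convention of scikit-learn's HDBSCAN, where -1 marks noise; the function names and the `per_cluster` default are hypothetical, and a production job would likely use NumPy rather than plain lists.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def representatives(embeddings, labels, per_cluster=3):
    """Return {cluster label: indices of the members nearest the centroid}."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # skip points the clusterer marked as noise
            members[label].append(idx)

    chosen = {}
    for label, idxs in members.items():
        dim = len(embeddings[idxs[0]])
        centroid = [
            sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)
        ]
        ranked = sorted(
            idxs, key=lambda i: cosine(embeddings[i], centroid), reverse=True
        )
        chosen[label] = ranked[:per_cluster]
    return chosen
```

Only the threads at the returned indices would be sent for LLM extraction, which is where the reduction in LLM calls comes from.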

Phase 3: Extraction (Weeks 9-12)

  • LLM extraction prompts with JSON schema
  • ITIL classification validation
  • Silver layer database schema
  • CI enumeration and constraint logic

Phase 4: Analytics (Weeks 13-16)

  • Power BI dashboard integration
  • Automated weekly reports
  • Cluster drift alerting
  • Avoidance recommendation generation
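Cluster drift alerting can start from simple time-windowed counts. The sketch below compares a cluster's latest weekly volume with its trailing baseline and labels it acute (a spike) or chronic (steadily high). All thresholds and the `classify_trend` name are assumptions to be tuned against real data.

```python
from statistics import mean


def classify_trend(counts, baseline_weeks=4, spike_ratio=1.5,
                   min_volume=10, chronic_volume=50):
    """Classify one cluster's weekly counts (oldest first, latest last)."""
    if len(counts) <= baseline_weeks:
        return "insufficient history"
    latest = counts[-1]
    # Average of the weeks immediately before the latest one.
    baseline = mean(counts[-1 - baseline_weeks:-1])
    if latest >= min_volume and latest >= spike_ratio * baseline:
        return "acute"      # sudden jump: candidate for an alert
    if baseline >= chronic_volume:
        return "chronic"    # persistently high: candidate for problem management
    return "normal"


weekly_counts = {
    "VPN Connection Failures": [20, 22, 19, 21, 60],
    "Password Reset Requests": [55, 60, 58, 57, 59],
    "Outlook Freezing": [5, 4, 6, 5, 5],
}
alerts = [name for name, counts in weekly_counts.items()
          if classify_trend(counts) == "acute"]
```

The sample counts are invented for illustration. Clusters classified as acute would feed the Teams/Slack notification, while chronic ones are better suited to the weekly report.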

Risks and Mitigations

Risk | Likelihood | Impact | Mitigation
Thread fragmentation (10 replies = 10 incidents) | High | High | Strict conversationId aggregation
Signature hallucination (LLM reads signature as issue) | Medium | Medium | Aggressive regex truncation before LLM
Topic drift in reply chains (new issue in old thread) | Medium | Medium | LLM prompt: "Is this semantically related?"
"Thank you" pollution | High | Low | Filter messages <50 words + positive sentiment
Cost overruns | Medium | High | Cluster-first approach; LLM only for centroids
Compliance breach (PII to external LLM) | Low | Critical | Presidio redaction as FIRST step, not optional

Success Metrics

Metric | Target | Measurement
Issue classification accuracy | >85% F1 | Manual validation of 200 samples
Pattern detection coverage | >90% of issues in clusters | Noise cluster <10%
Time to insight | <24 hours | Ingestion to dashboard latency
Cost per email | <$0.001 | Total monthly cost / emails processed
Actionable recommendations | 5+ per week | Auto-generated avoidance strategies
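The classification target can be checked against the manually validated sample with a short F1 calculation. The ADR does not say which F1 variant applies, so the sketch below assumes macro-averaged F1 over the ITIL types; `macro_f1` is a hypothetical helper, and scikit-learn's `f1_score` would be the usual choice in practice.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def meets_target(y_true, y_pred, target=0.85):
    """True when macro F1 exceeds the success-metric threshold."""
    return macro_f1(y_true, y_pred) > target
```

Macro averaging weights rare classes such as Change Request as heavily as common ones, which may or may not match the intent of the target.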

Technology Stack

Layer | Technology | Rationale
Ingestion | MS Graph API + Azure Functions | Native O365 integration, serverless scale
Orchestration | Apache Airflow | Complex DAG support, monitoring
PII Redaction | Microsoft Presidio | Open source, runs locally
Embeddings | text-embedding-3-small | Cost-effective, good quality
Vector Store | pgvector (PostgreSQL) | Single database, familiar SQL
Clustering | HDBSCAN (scikit-learn) | Handles noise, variable density
LLM | Azure OpenAI GPT-4o | Enterprise compliance, JSON mode
Structured Output | Pydantic + Instructor | Schema enforcement
Analytics | Power BI | Enterprise standard, Graph integration
Alerting | Azure Logic Apps | Low-code, Teams/Slack integration


Council Review Summary

Reviewed by: GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4

Council Rankings:

  1. Claude Opus 4.5 (0.667) - Hybrid embeddings + LLM approach
  2. GPT-5.1 (0.500) - Feedback loop emphasis
  3. Gemini 3 Pro (0.333) - PII-first architecture
  4. Grok-4 (0.000) - Cloud-native scaling

Key Council Insights:

  • PII redaction must be an architectural prerequisite, not an afterthought
  • Vector clustering before LLM extraction reduces LLM calls, and therefore cost, by 90%+
  • Structured JSON output prevents ITIL taxonomy drift
  • Time-windowed aggregation distinguishes chronic from acute issues