
Automated Market Discovery and Matching


2026-01-22 | ADR-017 Implementation

Implementation of automated market discovery between Polymarket and Kalshi while preserving human-in-the-loop safety.

The Problem

Manual market discovery doesn't scale:

  1. Discovery burden: Operators research markets on both platforms independently
  2. Missed opportunities: New markets go undetected
  3. No persistence: Mappings exist only in memory
  4. Scale limitation: Cannot monitor thousands of markets

Industry context: Research documented $40M+ in arbitrage profits from Polymarket alone (Apr 2024 - Apr 2025). Existing bots watch 10,000+ markets.

The Solution

Text similarity matching with semantic warnings and mandatory human approval.

Architecture

API Clients → Scanner (hourly) → Matcher → Candidates (SQLite) → Human Review → MappingManager

Matching Algorithm

  1. Pre-filter: Category, expiration (±7 days), outcome count
  2. Similarity: 0.6 × Jaccard(tokens) + 0.4 × Levenshtein_normalized
  3. Threshold: Score ≥ 0.6 creates candidate for review
  4. Warnings: Flag settlement differences (announcement vs actual event)
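
The weighted combination in step 2 can be sketched in plain Rust. The production code pulls in the strsim crate; the tokenizer and the Levenshtein implementation below are illustrative stand-ins, not the shipped code:

```rust
use std::collections::HashSet;

// Token-set Jaccard over lowercased whitespace tokens.
fn jaccard(a: &str, b: &str) -> f64 {
    let ta: HashSet<String> = a.to_lowercase().split_whitespace().map(String::from).collect();
    let tb: HashSet<String> = b.to_lowercase().split_whitespace().map(String::from).collect();
    let inter = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

// Classic dynamic-programming Levenshtein edit distance.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Edit distance normalized to a similarity in [0, 1].
fn lev_sim(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 { return 1.0; }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

// Step 2: 0.6 × Jaccard + 0.4 × normalized Levenshtein.
// Step 3: scores ≥ 0.6 become review candidates.
fn combined_score(a: &str, b: &str) -> f64 {
    0.6 * jaccard(a, b) + 0.4 * lev_sim(a, b)
}

fn main() {
    let s = combined_score("Will the Fed cut rates in March?", "Fed cuts rates in March?");
    println!("score {:.2} -> candidate: {}", s, s >= 0.6);
}
```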

Safety Architecture (FR-MD-003)

The critical constraint: settlement semantics differ across platforms.

Example - 2024 Government Shutdown:

  • Polymarket: "OPM issues shutdown announcement"
  • Kalshi: "Actual shutdown exceeding 24 hours"

Same event, different resolution criteria, potentially different outcomes.

Safety Gates

pub fn approve(&self, id: Uuid, acknowledge_warnings: bool) -> Result<(), ApprovalError> {
    let candidate = self.storage.get_candidate(id)?;

    // Safety: require warning acknowledgment
    if !candidate.semantic_warnings.is_empty() && !acknowledge_warnings {
        return Err(ApprovalError::WarningsNotAcknowledged);
    }

    // Use existing safety gate (FR-MD-003)
    let mut manager = self.mapping_manager.lock().unwrap();
    let mapping_id = manager.propose_mapping(/*...*/);
    manager.verify_mapping(mapping_id);

    // Audit log for compliance
    self.storage.log_decision(/*...*/)?;
    Ok(())
}

What This Guarantees

  1. Human-in-the-loop: Candidates require explicit approval
  2. FR-MD-003 enforced: Uses existing MappingManager.verify_mapping()
  3. Semantic warnings block quick approval: Must acknowledge settlement differences
  4. Audit trail: All approvals/rejections logged

Implementation Highlights

Feature Flag

Discovery is opt-in via Cargo feature:

[features]
discovery = ["dep:strsim"]
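
A minimal sketch of how code can be gated on that feature at compile time via `cfg!` (the function name here is illustrative, not from the codebase):

```rust
// Returns true only when built with `--features discovery`.
// Without the flag, the discovery path compiles out to a no-op.
fn discovery_enabled() -> bool {
    cfg!(feature = "discovery")
}

fn main() {
    if discovery_enabled() {
        println!("running market discovery scan");
    } else {
        println!("discovery feature not enabled; skipping scan");
    }
}
```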

CLI Interface

# Discover and match markets
cargo run --features discovery -- --discover-markets

# Review candidates interactively
cargo run --features discovery -- --review-candidates

# List pending candidates
cargo run --features discovery -- --list-candidates --status pending

# Batch operations
cargo run --features discovery -- --approve-candidates --ids "uuid1,uuid2"
cargo run --features discovery -- --reject-candidates --ids "uuid1" --reason "Settlement differs"

Similarity Scorer

pub struct SimilarityScorer {
    jaccard_weight: f64,     // 0.6
    levenshtein_weight: f64, // 0.4
    threshold: f64,          // 0.6
}

impl SimilarityScorer {
    pub fn find_matches(&self, market: &DiscoveredMarket, candidates: &[DiscoveredMarket])
        -> Vec<CandidateMatch>
    {
        candidates.iter()
            .filter(|c| self.pre_filter(market, c))
            .filter_map(|c| {
                let score = self.combined_score(&market.title, &c.title);
                if score >= self.threshold {
                    Some(CandidateMatch::new(market.clone(), c.clone(), score))
                } else {
                    None
                }
            })
            .collect()
    }
}
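
The `pre_filter` step isn't shown in the post. A hedged sketch of the category / expiration (±7 days) / outcome-count checks, under assumed field names (`category`, `expiration_unix`, `outcome_count` are illustrative):

```rust
// Illustrative market shape; the real DiscoveredMarket has more fields.
struct DiscoveredMarket {
    category: String,
    expiration_unix: i64, // expiry as a Unix timestamp, in seconds
    outcome_count: usize,
}

const SEVEN_DAYS_SECS: i64 = 7 * 24 * 60 * 60;

// Step-1 pre-filter: cheap structural checks before any text scoring.
fn pre_filter(a: &DiscoveredMarket, b: &DiscoveredMarket) -> bool {
    a.category == b.category
        && (a.expiration_unix - b.expiration_unix).abs() <= SEVEN_DAYS_SECS
        && a.outcome_count == b.outcome_count
}

fn main() {
    let a = DiscoveredMarket { category: "Politics".into(), expiration_unix: 0, outcome_count: 2 };
    let b = DiscoveredMarket { category: "Politics".into(), expiration_unix: 3 * 24 * 3600, outcome_count: 2 };
    println!("passes pre-filter: {}", pre_filter(&a, &b));
}
```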

Scanner Actor

impl Actor for DiscoveryScannerActor {
    type Message = ScannerMsg;

    async fn handle(&mut self, message: Self::Message) -> Result<(), ActorError> {
        match message {
            ScannerMsg::Scan => {
                let poly_markets = self.fetch_all_markets(&*self.polymarket_client).await?;
                let kalshi_markets = self.fetch_all_markets(&*self.kalshi_client).await?;

                // Store markets
                for market in &poly_markets {
                    self.storage.lock().await.upsert_market(market)?;
                }

                // Find candidates
                for poly_market in &poly_markets {
                    let matches = self.scorer.find_matches(poly_market, &kalshi_markets);
                    for candidate in matches {
                        if !self.is_duplicate_candidate(&candidate).await? {
                            self.storage.lock().await.insert_candidate(&candidate)?;
                        }
                    }
                }
            }
            // ...
        }
        Ok(())
    }
}

Test Coverage

48 tests across 5 phases:

Module              | Tests
--------------------|------
candidate.rs        | 5
storage.rs          | 7
normalizer.rs       | 3
matcher.rs          | 7
polymarket_gamma.rs | 4
kalshi_markets.rs   | 4
scanner.rs          | 5
approval.rs         | 5
CLI integration     | 8

Council Review

All 5 phases passed LLM Council review with confidence ≥ 0.87.

Final ADR Review:

  • Verdict: PASS
  • Confidence: 0.88
  • Weighted Score: 8.55/10

Safety gates (FR-MD-003) received "PASS (Strong)" verdict.

Why Text Similarity Over LLM/Embeddings

Options considered:

Approach         | Accuracy | Cost              | Latency
-----------------|----------|-------------------|----------------
Text similarity  | Moderate | Zero              | Sub-ms
LLM verification | High     | $0.01-0.05/call   | +200-500 ms
Embeddings       | Highest  | Storage + compute | Batch-dependent

Text similarity was selected because:

  1. Sufficient for MVP: Catches majority of matches
  2. Zero dependencies: No external API costs
  3. Extensible: LLM verification can be added later
  4. Council compliant: "Suggestion engine only" per Design Review 1

Update: Post-Implementation Learnings (2026-01-23)

Post-implementation testing revealed a critical gap: text similarity is insufficient for production.

The Problem

Real market pairs score only 8-9% similarity despite semantic equivalence:

Kalshi                                                    | Polymarket                                       | Jaccard
----------------------------------------------------------|--------------------------------------------------|--------
"Will Trump buy Greenland?"                               | "Will the US acquire part of Greenland in 2026?" | 8.3%
"Will Washington win the 2026 Pro Football Championship?" | "Super Bowl Champion 2026"                       | 9.1%
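
A quick token-set Jaccard over the Greenland pair shows why the score collapses: only two tokens ("will", "greenland") overlap. The exact percentage depends on the production tokenizer, so this sketch asserts only that the score is far below the 0.6 threshold:

```rust
use std::collections::HashSet;

// Lowercase and split on non-alphanumeric characters (illustrative tokenizer).
fn tokens(s: &str) -> HashSet<String> {
    s.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(String::from)
        .collect()
}

fn jaccard(a: &str, b: &str) -> f64 {
    let (ta, tb) = (tokens(a), tokens(b));
    let inter = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

fn main() {
    let s = jaccard(
        "Will Trump buy Greenland?",
        "Will the US acquire part of Greenland in 2026?",
    );
    // Well under the 0.6 candidate threshold, whatever the tokenizer details.
    println!("Jaccard = {:.1}%", s * 100.0);
}
```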

Root causes:

  • Different vocabulary: "Super Bowl" vs "Pro Football Championship"
  • Different framing: Question vs statement
  • Different specificity: Team name vs championship event

The Solution: 5-Phase Approach

We've extended ADR-017 with a progressive enhancement roadmap:

Phase 1: Text Similarity      ← Current  (MVP, 8-9% accuracy on hard pairs)
Phase 2: Fingerprint Matching ← Proposed (entity extraction, field-weighted scoring)
Phase 3: Embedding Matching   ← Proposed (semantic similarity via vectors)
Phase 4: LLM Verification     ← Proposed (human-level reasoning for uncertain cases)
Phase 5: Human Feedback Loop  ← Proposed (continuous improvement from decisions)

Phase 3: Embedding-Based Semantic Matching

Embeddings capture semantic similarity that text matching misses:

# "Super Bowl" and "Pro Football Championship" have zero word overlap
# but high embedding similarity
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

emb1 = model.encode("Super Bowl Champion 2026")
emb2 = model.encode("2026 Pro Football Championship winner")
similarity = util.cos_sim(emb1, emb2).item()  # ~0.85

New requirements: FR-MD-018 through FR-MD-023

Phase 4: LLM Verification

For uncertain matches (0.60-0.85 score), invoke LLM for human-level reasoning:

Candidate pair for verification:

Market A (Kalshi): "Will the US acquire part of Greenland in 2026?"
Market B (Polymarket): "Will Trump buy Greenland?"

Analyze: Are these the same underlying event?
Consider: Resolution criteria, timing, specificity

Cost optimization: Haiku screening ($0.001/call) with Sonnet escalation ($0.01/call). Budget: ~$50/day for 5,000 candidates.
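
The routing implied above — no LLM spend below the candidate threshold, cheap-model screening in the 0.60-0.85 band, direct queueing above it — can be sketched as follows. The thresholds come from the text; the enum and function are illustrative:

```rust
// Where a candidate pair goes after text/embedding scoring (illustrative).
#[derive(Debug, PartialEq)]
enum Verification {
    AutoCandidate, // high similarity: queue directly for human review
    CheapModel,    // uncertain band: screen with Haiku; low-confidence
                   // screens would escalate to Sonnet
    Skip,          // below the candidate threshold: no LLM spend
}

fn route(similarity: f64) -> Verification {
    if similarity >= 0.85 {
        Verification::AutoCandidate
    } else if similarity >= 0.60 {
        Verification::CheapModel
    } else {
        Verification::Skip
    }
}

fn main() {
    for s in [0.92, 0.71, 0.40] {
        println!("score {s:.2} -> {:?}", route(s));
    }
}
```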

New requirements: FR-MD-024 through FR-MD-027

Phase 5: Learning from Human Feedback (Data Flywheel)

The key innovation: human approval decisions are training data.

Data flywheel: human decisions train models.

  • Human approval ──► entity alias learning ("Super Bowl" = "Pro Football Championship")
  • Human approval ──► embedding fine-tuning (contrastive learning on approved pairs)
  • Human approval ──► weight optimization (logistic regression on decision history)

Weekly improvement cycle:

  • Monday: Export new decisions, update golden set
  • Tuesday: Retrain embedding model, optimize weights
  • Wednesday: Validate on golden set
  • Thursday-Saturday: A/B test (10% traffic)
  • Sunday: Promote if improved, rollback if degraded
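
One common way to implement the 10% A/B split is deterministic hashing of the candidate id, so the same candidate sees the same model variant for the whole test window. This is a sketch of that idea, not the production mechanism:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Deterministically assign a candidate to the treatment bucket:
// hash the id, take it mod 100, and compare against the traffic percentage.
fn in_treatment(candidate_id: &str, treatment_percent: u64) -> bool {
    let mut h = DefaultHasher::new();
    candidate_id.hash(&mut h);
    h.finish() % 100 < treatment_percent
}

fn main() {
    for id in ["cand-a1", "cand-b2", "cand-c3", "cand-d4"] {
        println!("{id}: treatment = {}", in_treatment(id, 10));
    }
}
```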

New requirements: FR-MD-028 through FR-MD-032

Council Review

The Phase 3-5 extension passed council review:

Dimension    | Score
-------------|------
Accuracy     | 8.5
Completeness | 9.0
Clarity      | 8.5
Conciseness  | 7.5
Relevance    | 9.0

Verdict: PASS (confidence 0.87, weighted score 8.5)

What This Means

The safety architecture remains unchanged: human-in-the-loop is mandatory (FR-MD-003). But now each human decision improves future matching, creating a virtuous cycle where accuracy improves over time with minimal additional effort.
