Market Discovery Phase 2: Text Matching Engine
This post covers Phase 2 of ADR-017: implementing the text similarity matching engine that powers automated market discovery between Polymarket and Kalshi.
The Problem
Phase 1 established the data types and storage layer. Now we need to find matching markets across platforms. The challenges:
- Fuzzy matching - Market titles differ in phrasing ("Will Trump win?" vs "Trump wins 2024?")
- False positives - Similar titles may have different settlement criteria
- Scalability - Must compare thousands of markets efficiently
Algorithm Design
Combined Similarity Scoring
We use a weighted combination of two complementary algorithms:
score = 0.6 × Jaccard + 0.4 × Levenshtein
Jaccard similarity (0.6 weight) measures token set overlap:
let intersection = set_a.intersection(&set_b).count();
let union = set_a.union(&set_b).count();
let jaccard = intersection as f64 / union as f64; // cast first: usize division would truncate
This captures shared vocabulary regardless of word order, so reordered titles still score highly.
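To make the token-set idea concrete, here is a minimal, self-contained sketch. The function name `jaccard_similarity` and the choice to return 0.0 for two empty token lists are assumptions for illustration, not necessarily what matcher.rs does.

```rust
use std::collections::HashSet;

/// Jaccard similarity over two token lists: |A ∩ B| / |A ∪ B|.
/// Returns 0.0 when both inputs are empty (an assumed convention).
fn jaccard_similarity(tokens_a: &[&str], tokens_b: &[&str]) -> f64 {
    let set_a: HashSet<&str> = tokens_a.iter().copied().collect();
    let set_b: HashSet<&str> = tokens_b.iter().copied().collect();

    let intersection = set_a.intersection(&set_b).count();
    let union = set_a.union(&set_b).count();

    if union == 0 {
        return 0.0;
    }
    // Cast before dividing; integer division would truncate the ratio.
    intersection as f64 / union as f64
}
```

With these definitions, ["trump", "win", "2024"] and ["2024", "trump", "win"] score 1.0 because the sets are identical, while ["bitcoin", "reach", "100k"] and ["bitcoin", "hit", "100k"] share 2 of 4 distinct tokens and score 0.5.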
Levenshtein similarity (0.4 weight) measures edit distance:
let distance = levenshtein(&norm_a, &norm_b);
let levenshtein_sim = 1.0 - (distance as f64 / max_length as f64);
This catches typos and minor variations.
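For readers who want to see both pieces end to end, the sketch below implements edit distance with the standard two-row dynamic program, normalizes it by the longer string's length, and blends it with a Jaccard score using the 0.6/0.4 weights above. The function names, the char-based length, and treating two empty strings as identical are assumptions; the real module may rely on a crate instead.

```rust
/// Levenshtein edit distance over Unicode scalar values, using two rolling rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] = distance between the first i chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Similarity in [0, 1]; two empty strings count as identical (assumed).
fn levenshtein_similarity(a: &str, b: &str) -> f64 {
    let max_length = a.chars().count().max(b.chars().count());
    if max_length == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_length as f64
}

/// Weighted blend from the formula above: 0.6 × Jaccard + 0.4 × Levenshtein.
fn combined_score(jaccard: f64, levenshtein_sim: f64) -> f64 {
    0.6 * jaccard + 0.4 * levenshtein_sim
}
```

As a worked example, a pair with Jaccard 0.5 and Levenshtein similarity 1.0 would score 0.6 × 0.5 + 0.4 × 1.0 = 0.7, which clears the default 0.6 threshold.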
Text Normalization
Before comparison, titles are normalized:
impl TextNormalizer {
    pub fn normalize(&self, text: &str) -> String {
        // 1. Lowercase
        // 2. Replace punctuation with spaces
        // 3. Collapse whitespace
    }

    pub fn tokenize(&self, text: &str) -> Vec<String> {
        // 4. Split into words
        // 5. Filter stop words (a, an, the, will, be, ...)
    }
}
Example: "Will Bitcoin reach $100k?" → ["bitcoin", "reach", "100k"]
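The five steps can be sketched with the standard library alone. This is an illustrative version, not the code in normalizer.rs: the `stop_words` field, the `new` constructor, the short stop-word list, and treating every non-alphanumeric character as punctuation are all assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical normalizer; the stop-word list is illustrative only.
struct TextNormalizer {
    stop_words: HashSet<&'static str>,
}

impl TextNormalizer {
    fn new() -> Self {
        let stop_words = ["a", "an", "the", "will", "be"].iter().copied().collect();
        Self { stop_words }
    }

    /// Steps 1-3: lowercase, replace non-alphanumerics with spaces, collapse whitespace.
    fn normalize(&self, text: &str) -> String {
        let spaced: String = text
            .to_lowercase()
            .chars()
            .map(|c| if c.is_alphanumeric() { c } else { ' ' })
            .collect();
        spaced.split_whitespace().collect::<Vec<_>>().join(" ")
    }

    /// Steps 4-5: split the normalized text into words and drop stop words.
    fn tokenize(&self, text: &str) -> Vec<String> {
        self.normalize(text)
            .split_whitespace()
            .filter(|word| !self.stop_words.contains(*word))
            .map(str::to_string)
            .collect()
    }
}
```

Under these assumptions, tokenizing "Will Bitcoin reach $100k?" drops "will" as a stop word and strips "$" and "?", leaving the three tokens from the example above.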
Pre-Filtering
Before scoring, candidates are filtered to reduce false positives:
| Filter | Default | Purpose |
|---|---|---|
| Expiration tolerance | ±7 days | Markets must settle around same time |
| Outcome count | Must match | Binary vs multi-outcome |
| Category match | Optional | Same topic area |
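The three filters in the table could be combined along the following lines. The `Market` struct, the field names, Unix-second timestamps, and the rule that a missing category never blocks a pair are assumptions made for this sketch; only the defaults (±7 days, matching outcome counts, optional category check) come from the table.

```rust
/// Hypothetical, trimmed-down market record used only for this sketch.
struct Market {
    expires_at: i64, // Unix timestamp in seconds
    outcome_count: u32,
    category: Option<String>,
}

/// Hypothetical settings mirroring the table's defaults.
struct PreFilterConfig {
    expiration_tolerance_days: i64,
    require_category_match: bool,
}

impl Default for PreFilterConfig {
    fn default() -> Self {
        Self { expiration_tolerance_days: 7, require_category_match: false }
    }
}

const SECONDS_PER_DAY: i64 = 86_400;

fn passes_pre_filter(config: &PreFilterConfig, a: &Market, b: &Market) -> bool {
    // Expiration tolerance: both markets must settle within the window (inclusive).
    let gap_seconds = (a.expires_at - b.expires_at).abs();
    if gap_seconds > config.expiration_tolerance_days * SECONDS_PER_DAY {
        return false;
    }
    // Outcome count: a binary market never pairs with a multi-outcome one.
    if a.outcome_count != b.outcome_count {
        return false;
    }
    // Category: enforced only when enabled and both sides carry a category.
    if config.require_category_match {
        if let (Some(ca), Some(cb)) = (&a.category, &b.category) {
            return ca.eq_ignore_ascii_case(cb);
        }
    }
    true
}
```

Running the cheap checks first means the more expensive similarity scoring only sees pairs that could plausibly describe the same event.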
Semantic Warning Detection (FR-MD-008)
Even similar titles may have different settlement criteria. We detect and flag:
Conditional language mismatches:
Polymarket: "Will Fed announce rate cut?"
Kalshi: "Will Fed cut rates?"
⚠️ Warning: Settlement trigger mismatch - one market references 'announce'
Resolution source differences:
Polymarket resolution: "Associated Press"
Kalshi resolution: "Official FEC results"
⚠️ Warning: Resolution source differs
Expiration differences:
⚠️ Warning: Expiration differs by 3 day(s)
These warnings flow to the human reviewer (FR-MD-003) for acknowledgment before approval.
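One way the three checks might fit together is sketched below. The `MarketInfo` struct, the `TRIGGER_WORDS` list, the crude prefix match, and representing expiration as a day number are hypothetical simplifications; the warning strings follow the examples above.

```rust
/// Hypothetical view of a market, holding only what the checks need.
struct MarketInfo<'a> {
    title: &'a str,
    resolution_source: &'a str,
    expires_day: i64, // days since an arbitrary epoch
}

/// Illustrative trigger words; appearing in only one title hints at a
/// different settlement event.
const TRIGGER_WORDS: [&str; 3] = ["announce", "confirm", "propose"];

/// Crude prefix match so that "announces" or "announced" also count.
fn mentions(title: &str, word: &str) -> bool {
    title
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .any(|token| token.starts_with(word))
}

fn detect_warnings(a: &MarketInfo, b: &MarketInfo) -> Vec<String> {
    let mut warnings = Vec::new();

    // Conditional language: flag a trigger word present on one side only.
    for word in TRIGGER_WORDS.iter() {
        if mentions(a.title, word) != mentions(b.title, word) {
            warnings.push(format!(
                "Settlement trigger mismatch - one market references '{}'",
                word
            ));
        }
    }

    // Resolution source: compared case-insensitively.
    if !a.resolution_source.eq_ignore_ascii_case(b.resolution_source) {
        warnings.push("Resolution source differs".to_string());
    }

    // Expiration: any gap that survived pre-filtering is still surfaced.
    let gap = (a.expires_day - b.expires_day).abs();
    if gap > 0 {
        warnings.push(format!("Expiration differs by {} day(s)", gap));
    }
    warnings
}
```

The function only collects messages and never rejects a pair, which keeps the final decision with the human reviewer.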
Implementation
SimilarityScorer
pub struct SimilarityScorer {
    jaccard_weight: f64,     // 0.6
    levenshtein_weight: f64, // 0.4
    threshold: f64,          // 0.6
    normalizer: TextNormalizer,
    pre_filter: PreFilterConfig,
}

impl SimilarityScorer {
    pub fn find_matches(
        &self,
        market: &DiscoveredMarket,
        candidates: &[DiscoveredMarket],
    ) -> Vec<CandidateMatch> {
        candidates.iter()
            .filter(|c| c.platform != market.platform) // Cross-platform only
            .filter(|c| self.passes_pre_filter(market, c))
            .filter_map(|c| {
                let score = self.score(&market.title, &c.title);
                if score >= self.threshold {
                    let warnings = self.detect_warnings(market, c);
                    Some(CandidateMatch::new(/*...*/).with_warnings(warnings))
                } else {
                    None
                }
            })
            .collect()
    }
}
Match Reason Classification
let match_reason = if score >= 0.95 {
    MatchReason::ExactTitle
} else {
    MatchReason::HighTextSimilarity { score: (score * 100.0) as u32 }
};
Test Coverage
Phase 2 adds 10 tests (22 total for the discovery module):
| Module | Tests | Focus |
|---|---|---|
| normalizer.rs | 3 | Lowercase, punctuation, tokenization |
| matcher.rs | 7 | Jaccard, Levenshtein, combined score, filtering, warnings |
Key test:
#[test]
fn test_semantic_warning_announcement() {
    let scorer = SimilarityScorer::default();
    let poly = create_market(Platform::Polymarket, "Will Fed announce rate cut?");
    let kalshi = create_market(Platform::Kalshi, "Will Fed cut rates?");
    let warnings = scorer.detect_warnings(&poly, &kalshi);
    assert!(warnings.iter().any(|w| w.contains("announce")));
}
What's Next
Phase 3 will implement the API clients:
- Polymarket Gamma API client (FR-MD-006)
- Kalshi /v2/markets API client (FR-MD-007)
- Rate limiting and pagination
Council Review
Phase 2 passed council verification with confidence 0.85. Key findings:
- No unsafe code
- Human-in-the-loop preserved (find_matches returns candidates, not verified mappings)
- Semantic warnings properly flag settlement differences
- All tests passing (22 total)
Implementation: arbiter-engine/src/discovery/{normalizer,matcher}.rs | Issue: #43 | ADR: 017
