How Proper Names Behave in Text Embedding Space (vectors.run)

🤖 AI Summary
Researchers ran a focused diagnostic of how dense text embeddings handle proper names in retrieval. Using synthetic author names (to avoid memorization) and real arXiv topics, each query was compared against four candidates: correct author+topic, wrong-author/same-topic, same-author/wrong-topic, and both-wrong. Three margins were measured: Δ_name (author signal), Δ_topic (topic signal), and Δ_both (combined signal against the both-wrong candidate).

Across ~6,000 runs in English and French with OpenAI text-embed-3L and Voyage 3.5, correct candidates sat closer to the query than impostors, and topic drove retrieval more than author (Δ_topic > Δ_name). Quantitatively, Δ_name/Δ_topic was ~0.53–0.59, so names carry roughly half the separating power of topical signals (example: text-embed-3L EN Δ_name=0.175, Δ_topic=0.305).

Ablations show why: the name signal comes mostly from surface form, tokenization, and exact-match bias rather than deep "identity" understanding. Masking the name (destroying identity) collapses Δ_name entirely (−100%); replacing names with stable gibberish, or applying small edit-distance corruptions, cuts the name margin by ~65–80%. By contrast, mild orthographic and formatting changes (case, punctuation, diacritics, initials, name order) shave only a few percent off the signal, while layout and label shifts produce model- and language-specific effects.

Implication: dense retrievers are surprisingly good with names in practice but fragile to identity-destroying transforms, so hybrid retrieval, careful normalization, and attention to tokenization and layout are crucial when entity fidelity matters.
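For concreteness, here is a minimal sketch of the four-candidate margin diagnostic, assuming the margins are cosine-similarity gaps between the correct candidate and each impostor (the summary does not give the exact formulas). The `embed` function here is a hypothetical placeholder, not the study's code; swap in a real embedding client (e.g. OpenAI text-embed-3L or Voyage 3.5) to reproduce meaningful numbers.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    # PLACEHOLDER: deterministic pseudo-random unit vector seeded by the text
    # hash, so the script runs end-to-end. Replace with a real embedding API.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def margins(query: str, correct: str, wrong_author: str,
            wrong_topic: str, both_wrong: str) -> dict:
    q = embed(query)
    s_correct = cosine(q, embed(correct))            # right author + right topic
    s_wrong_author = cosine(q, embed(wrong_author))  # impostor name, same topic
    s_wrong_topic = cosine(q, embed(wrong_topic))    # same author, off-topic
    s_both_wrong = cosine(q, embed(both_wrong))      # neither matches
    return {
        "delta_name": s_correct - s_wrong_author,    # author signal
        "delta_topic": s_correct - s_wrong_topic,    # topic signal
        "delta_both": s_correct - s_both_wrong,      # combined signal
    }

# Hypothetical example (with the placeholder embed, deltas are near-random;
# a real model should show delta_topic > delta_name per the findings above):
deltas = margins(
    query="Papers by Maren Koivu on graph neural networks",
    correct="Maren Koivu: Graph neural networks for molecules",
    wrong_author="Tomas Lindqvist: Graph neural networks for molecules",
    wrong_topic="Maren Koivu: Bayesian methods in cosmological inference",
    both_wrong="Tomas Lindqvist: Bayesian methods in cosmological inference",
)
print(deltas)
```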
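The identity ablations can be sketched the same way. The exact corruption rules are not given in the summary, so the transforms below (masking, stable gibberish, small edit-distance corruption, diacritic stripping) are illustrative stand-ins, not the authors' code:

```python
import random
import string
import unicodedata

def mask_name(text: str, name: str) -> str:
    """Identity-destroying mask: reported to collapse Δ_name entirely."""
    return text.replace(name, "[AUTHOR]")

def gibberish_name(text: str, name: str, rng: random.Random) -> str:
    """Replace the name with stable gibberish of the same length."""
    fake = "".join(rng.choices(string.ascii_lowercase, k=len(name))).title()
    return text.replace(name, fake)

def corrupt_name(name: str, rng: random.Random, edits: int = 1) -> str:
    """Small edit-distance corruption: substitute a few characters."""
    chars = list(name)
    for _ in range(edits):
        i = rng.randrange(len(chars))
        chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

def strip_diacritics(name: str) -> str:
    """Mild orthographic change, reported to cost only a few percent."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

rng = random.Random(0)
print(mask_name("Maren Koivu on GNNs", "Maren Koivu"))  # '[AUTHOR] on GNNs'
print(corrupt_name("Koivu", rng))                       # e.g. 'Koivq'
print(strip_diacritics("Éloïse Müller"))                # 'Eloise Muller'
```

Re-running the margin diagnostic on text transformed this way is what separates the surface-form hypothesis from genuine identity understanding: identity-destroying transforms crater Δ_name, while orthographic ones barely move it.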