🤖 AI Summary
Researchers ran a focused diagnostic to see how dense text embeddings handle proper names in retrieval. Using synthetic author names (to avoid memorization) and real arXiv topics, each query was scored against four candidates: correct author+topic, wrong-author/same-topic, same-author/wrong-topic, and both-wrong. Three margins were measured: Δ_name (author signal), Δ_topic (topic signal), and Δ_both. Across ~6,000 runs in English and French with OpenAI text-embed-3L and Voyage 3.5, correct candidates sat closer to the query than impostors, and topic drove retrieval more than author (Δ_topic > Δ_name). Quantitatively, Δ_name/Δ_topic was ~0.53–0.59, so names carry roughly half the separation power of topical signals (example: text-embed-3L EN, Δ_name = 0.175, Δ_topic = 0.305).
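A minimal sketch of how these margins could be computed, assuming a hypothetical `embed(text)` callable that returns a vector and cosine similarity as the closeness measure; the condition keys and function names below are illustrative, not taken from the study:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def margins(embed, query, cands):
    """Compute the three separation margins for one query.

    `cands` maps each condition to its candidate text:
    'correct', 'wrong_author', 'wrong_topic', 'both_wrong'.
    `embed` is an assumed text -> vector function (e.g. an API client).
    """
    q = embed(query)
    sim = {k: cosine(q, embed(v)) for k, v in cands.items()}
    return {
        # author signal: correct vs. same topic with the wrong author
        "delta_name": sim["correct"] - sim["wrong_author"],
        # topic signal: correct vs. same author on the wrong topic
        "delta_topic": sim["correct"] - sim["wrong_topic"],
        # combined signal: correct vs. both fields wrong
        "delta_both": sim["correct"] - sim["both_wrong"],
    }
```

Averaging these per-query margins over the ~6,000 runs would yield the aggregate Δ values quoted above.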
Ablations show why: the name signal mostly comes from surface form, tokenization, and exact-match bias rather than deep "identity" understanding. Destroying identity (masking) collapses Δ_name by ~100%; replacing names with stable gibberish or applying small edit-distance corruptions cuts the name margin by ~65–80%. By contrast, mild orthographic/formatting changes (case, punctuation, diacritics, initials, name order) shave only a few percent off the signal, while layout/label shifts produce model- and language-specific effects. Implication: dense retrievers are surprisingly good on names in practice but fragile to identity-destroying transforms, so hybrid retrieval, careful normalization, and attention to tokenization and layout are crucial when entity fidelity matters.
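The exact transforms aren't spelled out in the summary; one plausible reading of the four ablation families is sketched below, where "stable" gibberish is taken to mean a deterministic replacement per author name (all names and helpers here are hypothetical):

```python
import random
import string
import unicodedata

def mask_name(text, name):
    # identity-destroying: remove the author entirely
    return text.replace(name, "[AUTHOR]")

def gibberish_name(name):
    # "stable" gibberish: seeding the RNG with the name makes every
    # occurrence of the same author map to the same nonsense string
    rng = random.Random(name)
    return "".join(rng.choice(string.ascii_lowercase) for _ in name)

def corrupt_name(name, rng, n_edits=1):
    # small edit-distance corruption: substitute a few characters
    chars = list(name)
    for i in rng.sample(range(len(chars)), k=min(n_edits, len(chars))):
        chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

def strip_diacritics(name):
    # mild orthographic change: drop accents (é -> e)
    return "".join(c for c in unicodedata.normalize("NFD", name)
                   if unicodedata.category(c) != "Mn")
```

Under the reported pattern, only the first three transforms should sharply reduce Δ_name, while diacritic stripping and similar formatting tweaks should leave it nearly intact.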