word2vec-style vector arithmetic on docs embeddings (technicalwriting.dev)

🤖 AI Summary
The author tested whether word2vec-style vector arithmetic (doc_vector - word_vector_A + word_vector_B) works on full-document embeddings, using Google's EmbeddingGemma (google/embeddinggemma-300m via SentenceTransformer). They encoded a Supabase doc ("Testing Your Database"), subtracted one single-word embedding and added another (e.g., "supabase" → "angular", or "testing" → "vectors"), and ranked a small corpus of short docs (Angular, Supabase, Playwright, etc.) by cosine similarity against the shifted vector.

The arithmetic can produce meaningful semantic shifts. In the "same topic, different domain" test (doc - "supabase" + "angular"), the top match became Angular's "Testing" when custom task prompts ("Retrieval-query") were used (cosine ≈ 0.751), whereas default task settings left the original Supabase doc as the closest match (≈ 0.659). In the "different topic, same domain" test (doc - "testing" + "vectors"), the shifted vector matched Supabase's "Vector Columns" under both settings (≈ 0.64–0.67).

This suggests that vector arithmetic extends beyond single-token word2vec to document-level embeddings, enabling semantic query reformulation, cross-domain analogies, and targeted retrieval in documentation and workflow tooling. Key technical takeaways: the task type/prompt affects outcomes (prompt_name="Retrieval-query" changed the top match), verification was indirect via cosine similarity against candidate docs, and EmbeddingGemma's ~2048-token input limit necessitates short docs or chunking. The mechanism remains conceptually opaque, but practical implications include enhanced retrieval, query augmentation, and semantic filtering for docs and developer tooling, provided care is taken with prompt/task settings and input length.
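A minimal sketch of the experiment, assuming the google/embeddinggemma-300m checkpoint loads through SentenceTransformers and that a companion "Retrieval-document" prompt name exists for the candidate docs (only "Retrieval-query" is confirmed by the summary); the document texts are placeholders:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("google/embeddinggemma-300m")

# Embed the source doc and the two pivot words with the same task prompt.
doc_vec = model.encode(
    "...full text of the Supabase 'Testing Your Database' doc...",
    prompt_name="Retrieval-query",
)
minus_vec = model.encode("supabase", prompt_name="Retrieval-query")
plus_vec = model.encode("angular", prompt_name="Retrieval-query")

# word2vec-style arithmetic: shift the doc from the Supabase domain to Angular.
shifted = doc_vec - minus_vec + plus_vec

# Indirect verification: rank candidate docs by cosine similarity to the shifted vector.
candidates = {
    "Angular: Testing": "...",
    "Supabase: Testing Your Database": "...",
    "Supabase: Vector Columns": "...",
    "Playwright: Writing Tests": "...",
}
cand_vecs = model.encode(list(candidates.values()), prompt_name="Retrieval-document")
scores = cos_sim(shifted, cand_vecs)[0].tolist()
for name, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {name}")
```

The "different topic, same domain" variant is the same sketch with "testing" as the subtracted word and "vectors" as the added one; dropping the prompt_name arguments reproduces the default-task condition that changed the top match.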