🤖 AI Summary
Researchers revisited the classic vector-analogy idea (king − man + woman ≈ queen) to ask whether grammar — specifically verb tense — also maps to consistent directions in embedding space. Using the pre-trained word2vec-google-news-300 model (gensim), simple analogy queries reliably recovered past or participle forms for both regular and irregular verbs (similarity scores ~0.61–0.83). PCA projections of five verbs and their conjugations revealed that base→past and base→-ing vectors are largely parallel across verbs (first two PCs capture ~44% variance), providing visual evidence of a “tense axis.” Complementary t-SNE plots showed lemma (lexical) clustering is dominant, with a subtle, consistent displacement within clusters corresponding to grammatical form.
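A minimal sketch of the analogy query and PCA projection described above, assuming the gensim downloader API for word2vec-google-news-300; the verb triples here are illustrative stand-ins, not necessarily the five used in the original plots:

```python
import numpy as np
import gensim.downloader as api
from sklearn.decomposition import PCA

# Pre-trained Google News vectors (large download on first use).
model = api.load("word2vec-google-news-300")

# Analogy query: vec("walked") - vec("walk") + vec("swim") -> nearest neighbours.
print(model.most_similar(positive=["walked", "swim"], negative=["walk"], topn=3))

# Same trick with an irregular pair: go -> went, applied to "take".
print(model.most_similar(positive=["went", "take"], negative=["go"], topn=3))

# 2-D PCA of a few base/past/-ing triples to eyeball a shared "tense" direction.
verbs = [("walk", "walked", "walking"),
         ("jump", "jumped", "jumping"),
         ("go", "went", "going"),
         ("take", "took", "taking"),
         ("swim", "swam", "swimming")]
words = [w for triple in verbs for w in triple]
pca = PCA(n_components=2)
coords = pca.fit_transform(np.stack([model[w] for w in words]))
print("variance captured by first two PCs:", pca.explained_variance_ratio_.sum())
for word, (x, y) in zip(words, coords):
    print(f"{word:10s} {x:8.2f} {y:8.2f}")
```

Plotting the resulting 2-D coordinates and drawing arrows from each base form to its past and -ing forms is what makes the roughly parallel offsets visible.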
This matters because it shows that static embeddings can encode grammatical transformations as interpretable linear offsets, useful for interpretability, controlled generation, and lightweight probes. But important caveats remain: probing (logistic classifiers on frozen embeddings) shows tense is often linearly separable, yet linear separability is not the same as a single constant offset for all verbs; noise, corpus distribution, and irregularity limit precision. Moreover, contextual models like BERT produce token-specific vectors, so a single global “tense vector” is less applicable even though tense/aspect information is still recoverable via probing. In short: the geometry of grammar is real in static spaces but nuanced — modern contextual representations require different tools and careful interpretation.
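A minimal sketch of such a linear probe, assuming static word2vec vectors and a small illustrative list of (base, past) pairs; a probe over a contextual model like BERT would instead be fit on token-level hidden states:

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model = api.load("word2vec-google-news-300")

# Illustrative (base, past) pairs; a real probe would use a larger, balanced set.
pairs = [("walk", "walked"), ("jump", "jumped"), ("look", "looked"),
         ("go", "went"), ("take", "took"), ("swim", "swam"),
         ("give", "gave"), ("eat", "ate"), ("play", "played"),
         ("call", "called"), ("ask", "asked"), ("run", "ran")]

X = np.stack([model[w] for pair in pairs for w in pair])
y = np.array([0, 1] * len(pairs))  # 0 = base form, 1 = past form

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Logistic regression on frozen embeddings: high held-out accuracy indicates
# tense is linearly separable, but says nothing about a single shared offset.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```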
        