🤖 AI Summary
A simple but powerful trick: instead of materializing a huge random ±1 projection matrix for Johnson–Lindenstrauss (JL)-style dimensionality reduction of bag-of-words/n-gram count vectors, use a good hash of each token to generate the random signs deterministically on the fly and push the matrix multiplication into the streaming pass over the document. Because token counts are sums of individual occurrences, you can accumulate each output dimension by adding a signed contribution per token occurrence (sign = mix(rotl(hash(token), row)) % 2 ? -1 : +1). This eliminates both the wide projection matrix and the explicit high-dimensional sparse vectors, so low-dimensional embeddings are computed in a single pass per document. The post includes a small C implementation (mix32/mix64to32, rotl32, a loop over n-gram widths and rows, then L1 normalization) and reports embedding ~665k documents in 12.3 s on a 2018 ThinkPad.
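For concreteness, here is a minimal sketch of that streaming pass under stated assumptions: FNV-1a as the token hash, a Murmur3-style finalizer standing in for mix32, whitespace tokenization, unigrams only, and 32 output dimensions. The post's actual code also loops over several n-gram widths and uses its own mix32/mix64to32 functions, so treat this as an illustration of the idea rather than the original implementation.

```c
/* Sketch of streaming signed-hash projection. Assumptions (not from the post):
 * FNV-1a token hash, Murmur3 finalizer as mix32, whitespace tokens, unigrams,
 * DIM = 32. With a 32-bit hash, plain rotation gives at most 32 distinct rows;
 * the post's mix64to32 suggests a 64-bit variant for wider outputs. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#define DIM 32

static uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << (r & 31)) | (x >> ((32 - r) & 31));
}

/* Murmur3 finalizer: decorrelates the rotated hash before taking its parity. */
static uint32_t mix32(uint32_t h) {
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* FNV-1a: a stand-in for "a good hash of each token". */
static uint32_t hash_token(const char *tok, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) { h ^= (uint8_t)tok[i]; h *= 16777619u; }
    return h;
}

/* Add one token occurrence's signed contribution to every output row. */
static void accumulate_token(float *out, const char *tok, size_t len) {
    uint32_t h = hash_token(tok, len);
    for (unsigned row = 0; row < DIM; row++) {
        /* sign = mix(rotl(hash(token), row)) % 2 ? -1 : +1 */
        float sign = (mix32(rotl32(h, row)) & 1u) ? -1.0f : 1.0f;
        out[row] += sign;
    }
}

/* One streaming pass over a whitespace-tokenized document, then L1 normalize. */
static void embed_document(const char *doc, float *out) {
    memset(out, 0, DIM * sizeof *out);
    const char *p = doc;
    while (*p) {
        while (*p == ' ') p++;
        const char *start = p;
        while (*p && *p != ' ') p++;
        if (p > start) accumulate_token(out, start, (size_t)(p - start));
    }
    float l1 = 0.0f;
    for (unsigned i = 0; i < DIM; i++) l1 += fabsf(out[i]);
    if (l1 > 0.0f) for (unsigned i = 0; i < DIM; i++) out[i] /= l1;
}

int main(void) {
    float v[DIM];
    embed_document("the quick brown fox jumps over the lazy dog", v);
    for (unsigned i = 0; i < 8; i++) printf("%+.4f ", v[i]);
    printf("...\n");
    return 0;
}
```

Note that no projection matrix and no sparse count vector ever exist in memory; each token occurrence is hashed once and its signed contributions are added directly into the DIM accumulators.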
Why it matters: this is a memory- and compute-efficient realization of signed feature hashing (the hashing trick) that preserves the JL-style randomness needed for low-distortion projections while operating in a single streaming pass with tiny memory. Key technical points: hashes serve as pseudo-random generators for independent rows (bit rotation plus mixing to decorrelate them), signs come from the parity of the mixed hashes, and normalization is applied after accumulation. The practical payoff is fast, scalable featurization for text pipelines (low latency, low memory), with the usual caveats about hash collisions and about how well the chosen hash/mix functions approximate the independence assumed by formal JL guarantees.
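To make the "implicit matrix" point concrete, the short sketch below materializes a few entries of the ±1 projection matrix that the streaming code never stores: the entry for (token, row) is just the parity of the mixed, rotated token hash. The hash and mix functions here are assumptions, not necessarily the post's.

```c
/* Illustration only: print a small slice of the implicit +/-1 projection
 * matrix, entry (row, token) = parity(mix32(rotl32(hash(token), row))).
 * FNV-1a and the Murmur3 finalizer are assumed stand-ins. */
#include <stdint.h>
#include <stdio.h>

static uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << (r & 31)) | (x >> ((32 - r) & 31));
}

static uint32_t mix32(uint32_t h) {
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

static uint32_t hash_token(const char *tok) {
    uint32_t h = 2166136261u;
    while (*tok) { h ^= (uint8_t)*tok++; h *= 16777619u; }
    return h;
}

int main(void) {
    const char *tokens[] = { "the", "quick", "brown", "fox" };
    /* Rows 0..7 of the implicit matrix for four token columns. */
    for (unsigned row = 0; row < 8; row++) {
        for (unsigned t = 0; t < 4; t++) {
            int sign = (mix32(rotl32(hash_token(tokens[t]), row)) & 1u) ? -1 : +1;
            printf("%+d ", sign);
        }
        printf("\n");
    }
    return 0;
}
```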