Introduction to radix (best cognate-tree grower, pre-α, dormant) (tsvibt.blogspot.com)

0 points | 3 hours ago

🤖 AI Summary

radix is a pre-alpha, currently dormant tool for automatically growing "cognate forests" from Wiktionary: it crawls etymology links, uses abstract regexes to parse Wiktionary-ese, and builds graph structures showing the ancestors and descendants of words. The author claims it is already the strongest cognate-forest grower available, but admits to many errors and messy outputs, especially around Middle English and ambiguous Proto-Indo-European reconstructions. Key implementation ideas include representing etymologies as preorder-like graphs, precomputing a global transitive closure once, aggressively pruning and merging redundant subtrees for display, and storing language-specific restricted copies of a root's descendants so that each UI view materializes only what it needs.

For the AI/ML community, radix is an instructive case study in extracting structured knowledge from noisy, semi-structured sources. The project highlights concrete engineering patterns that apply directly to information extraction, graph inference, and knowledge-base construction: pattern-based extraction, sense-disambiguation challenges, "shattering" global graphs into per-language subuniverses for efficient serving, and sandboxed end-to-end testing. The codebase is frozen and the results need substantial quality work (better link inference, disambiguation, bug fixes, and updated parsing), but the design and its failure modes offer practical lessons for anyone building systems that must infer uncertain links and serve large, ambiguous graph structures.
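The two serving ideas in the summary, computing a transitive closure once and then keeping per-language restricted copies of each root's descendants, can be illustrated with a minimal Python sketch. This is not radix's code: the function names `transitive_closure` and `shatter_by_language`, the `(language, word)` node shape, and the toy `EDGES` list (a simplified chain for "water" that skips intermediate stages) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: nodes are (language_code, word) pairs and each
# edge means "parent is an etymological ancestor of child". Simplified.
EDGES = [
    (("ine-pro", "*wódr̥"), ("gem-pro", "*watōr")),
    (("gem-pro", "*watōr"), ("ang", "wæter")),
    (("ang", "wæter"), ("en", "water")),
    (("gem-pro", "*watōr"), ("goh", "wazzar")),
    (("goh", "wazzar"), ("de", "Wasser")),
]


def transitive_closure(edges):
    """Map every node to the set of all nodes reachable from it."""
    children = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        children[parent].add(child)
        nodes.update((parent, child))

    closure = {}
    for start in nodes:
        seen = set()
        stack = [start]
        while stack:  # iterative DFS; `seen` also guards against cycles
            node = stack.pop()
            for child in children[node]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        closure[start] = seen
    return closure


def shatter_by_language(closure):
    """Split the closure into per-language views: language -> root -> words."""
    per_language = defaultdict(dict)
    for root, descendants in closure.items():
        for lang, word in descendants:
            per_language[lang].setdefault(root, set()).add(word)
    return per_language


closure = transitive_closure(EDGES)
views = shatter_by_language(closure)
```

With this layout, a view for one language (say `views["de"]`) can be served without touching the descendants in any other language, which is roughly the trade-off the summary describes: more storage up front in exchange for cheap per-view lookups.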
