Creating Knowledge Graphs from NLP Datasets (graphtechnologydevelopers.github.io)

🤖 AI Summary
A new open-source project, English Lexicon Time Machine, reconstructs the growth of English vocabulary from 1800 to 2019 by combining Wiktionary lemmas with Google Books 1-gram counts and rendering a radial "prefix trie" timelapse. Each node is a letter prefix (up to six letters) sized by cumulative word counts; first-use years are inferred from n-gram frequencies, accumulated by prefix, and visualized as 220 frames encoded to a 1080p MP4/GIF.

The repo offers a zero-config ./setup.sh that spins up a Python venv, streams 26 Google 1-gram shards, ingests a 1.4 GB Wiktionary dump, checkpoints every artifact (lemmas, first-year inferences, prefix counts, layouts) for reproducibility, and supports render-only passes and tunable visualization flags (--min-radius, --base-edge-alpha, etc.).

For the AI/ML and NLP community the project is valuable as both a demonstrator and a dataset pipeline: it codifies lemma extraction, first-year inference, and prefix aggregation into repeatable scripts (src/ingest/*.py, src/build/*.py, src/viz/*.py) and exposes its artifacts for downstream analysis or knowledge-graph construction. It also includes a ready Neo4j import (artifacts/years/first_years.tsv plus example Cypher), so researchers can load temporal lexicon nodes and run diachronic or embedding-based studies, augment language models with time-aware vocabularies, or build provenance-aware KGs from public NLP datasets.
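The first-year inference and prefix aggregation steps are simple enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the repo's actual code: the function names, the (word, year, count) row shape, and the min_count threshold are all hypothetical choices about how such a pipeline might look.

```python
from collections import defaultdict

def infer_first_years(ngram_rows, min_count=1):
    # Take the earliest year a word's 1-gram count meets the threshold
    # as its first-use year (rows need not be sorted by year).
    first_year = {}
    for word, year, count in ngram_rows:
        if count >= min_count and year < first_year.get(word, float("inf")):
            first_year[word] = year
    return first_year

def accumulate_by_prefix(first_years, total_counts, max_depth=6):
    # Roll each word's total count up the letter-prefix trie: every
    # prefix (up to six letters) accumulates the counts of the words
    # beneath it, bucketed by first-use year.
    prefix_counts = defaultdict(int)
    for word, year in first_years.items():
        for depth in range(1, min(len(word), max_depth) + 1):
            prefix_counts[(word[:depth], year)] += total_counts.get(word, 0)
    return prefix_counts

# Toy usage: 'apply' is attested earliest in 1799, 'apple' in 1801.
rows = [("apple", 1805, 3), ("apple", 1801, 2), ("apply", 1799, 5)]
years = infer_first_years(rows)   # {'apple': 1801, 'apply': 1799}
trie = accumulate_by_prefix(years, {"apple": 5, "apply": 7})
# trie[('app', 1801)] == 5 and trie[('app', 1799)] == 7
```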
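Loading the first-years table into Neo4j is similarly direct. The sketch below assumes a headerless two-column TSV (lemma, first-use year) copied into Neo4j's import directory and a local instance with default connection details; the repo's actual example Cypher may differ.

```python
from neo4j import GraphDatabase

# Assumed local instance and credentials; adjust to your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# Assumes first_years.tsv has two columns: lemma and first-use year.
LOAD_FIRST_YEARS = r"""
LOAD CSV FROM 'file:///first_years.tsv' AS row FIELDTERMINATOR '\t'
MERGE (l:Lemma {word: row[0]})
SET l.firstYear = toInteger(row[1])
"""

# Example diachronic query: count lemmas first attested in each decade.
LEMMAS_PER_DECADE = """
MATCH (l:Lemma)
RETURN (l.firstYear / 10) * 10 AS decade, count(*) AS lemmas
ORDER BY decade
"""

with driver.session() as session:
    session.run(LOAD_FIRST_YEARS)
    for record in session.run(LEMMAS_PER_DECADE):
        print(record["decade"], record["lemmas"])

driver.close()
```

From a graph like this, the temporal lexicon nodes can be joined with embedding tables or linked back to the source artifacts for the diachronic and provenance-aware studies the summary describes.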