New project makes Wikipedia data more accessible to AI (techcrunch.com)

🤖 AI Summary
Wikimedia Deutschland launched the Wikidata Embedding Project, a vector-based semantic search layer over nearly 120 million Wikidata/Wikipedia entries that makes Wikimedia’s structured knowledge directly queryable by modern AI systems. Built with Jina.AI and DataStax (IBM), the system converts Wikidata items into embeddings and adds support for the Model Context Protocol (MCP), so large language models and retrieval-augmented generation (RAG) systems can fetch semantically relevant facts, images and multilingual labels rather than relying on brittle keyword or SPARQL-only lookups. The change matters because it gives developers an open, editor-vetted knowledge source for grounding LLM outputs and supporting fine-tuning workflows, which addresses a widespread need for high-quality, curated training and retrieval data. Technically, the project preserves Wikidata’s semantic structure (entity relationships, translations, related concepts and licensed images), so a query for a term like “scientist” returns contextually grouped people, translations and concept links. Exposing the data via MCP makes it interoperable with MCP-compatible tooling, which lowers the friction of building more accurate, transparent retrieval pipelines that do not depend solely on noisy web crawls or proprietary datasets.
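To make the contrast between keyword lookup and embedding-based search concrete, here is a minimal sketch of the general technique (ranking items by cosine similarity to a query vector). It is not the project's actual code or API: the three-dimensional vectors, the `TOY_INDEX` entries, and the `semantic_search` helper are all hypothetical and chosen by hand purely for illustration. A real system would obtain its vectors from an embedding model and store them in a vector database.

```python
import math

# Hypothetical, hand-made 3-dimensional "embeddings" for illustration only.
# Real embeddings come from a model and have hundreds of dimensions.
TOY_INDEX = {
    "Marie Curie (physicist and chemist)": [0.9, 0.1, 0.0],
    "Albert Einstein (theoretical physicist)": [0.8, 0.2, 0.1],
    "Eiffel Tower (landmark in Paris)": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, index, top_k=2):
    """Rank index entries by similarity to the query and keep the top_k labels."""
    scored = [
        (cosine_similarity(query_vector, vector), label)
        for label, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:top_k]]


# Hypothetical embedding standing in for the query "scientist".
results = semantic_search([1.0, 0.0, 0.0], TOY_INDEX)
print(results)
```

Neither matching label contains the word “scientist”, yet both scientists rank above the landmark because their vectors point in a similar direction to the query. That is the property a keyword lookup lacks.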