Show HN: Hikugen – minimalistic LLM-generated web scrapers for structured data (github.com)

0 points | 11 hours ago

🤖 AI Summary

Hikugen is a minimalist Python library that uses LLMs (via OpenRouter) to generate site-specific web scrapers that return Pydantic-validated objects. You supply a URL (or raw HTML) and a Pydantic schema describing the structured output. Hikugen then prompts an LLM to emit extraction code, runs an AST-based safety check (whitelisted imports) on that code, executes it with timeout protection, validates the scraped output against your schema, and caches the generated extractor for reuse.

The example usage defines nested Pydantic models (Article, ArticlePage), creates a HikuExtractor with an OpenRouter key (default model: google/gemini-2.5-flash), and calls extract or extract_from_html. The cache management methods clear_cache_for_key and clear_all_cache let you reset the cache when page structure changes.

For the AI/ML community, this speeds up building structured datasets and ingestion pipelines by removing hand-written parsers: developers can focus on schema design while the model handles selectors and transforms. The main benefits are faster prototyping, schema-driven validation, and extractor reuse via caching. The operational costs are the expense and latency of LLM calls, brittleness when page layouts shift, and legal and ethical constraints on scraping. The AST-based validation and timeouts mitigate some security and execution risks, but teams should still monitor extractor correctness and respect sites’ terms of service.
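To make the "AST-based safety check (whitelisted imports)" step concrete, here is a minimal sketch of how such a check could work using only Python's standard `ast` module. It is not Hikugen's actual implementation: the function name `find_disallowed_imports` and the contents of `ALLOWED_IMPORTS` are assumptions made up for illustration, since the summary does not say which modules the library permits.

```python
import ast

# Hypothetical whitelist for illustration; the modules Hikugen actually
# allows are not stated in the summary.
ALLOWED_IMPORTS = frozenset({"re", "json", "datetime", "bs4", "pydantic"})


def find_disallowed_imports(source, allowed=ALLOWED_IMPORTS):
    """Return the imported module names in `source` that are not whitelisted.

    Parses the code without executing it. Raises SyntaxError if `source`
    is not valid Python.
    """
    tree = ast.parse(source)
    disallowed = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and node.module:
                names = [node.module]
            else:
                # Relative imports cannot be matched against the whitelist,
                # so this sketch rejects them outright.
                names = ["<relative import>"]
        else:
            continue
        for name in names:
            # Compare on the top-level package, e.g. "os.path" -> "os".
            if name.split(".")[0] not in allowed:
                disallowed.append(name)
    return disallowed


generated_code = "import re\nimport os\nfrom subprocess import run\n"
problems = find_disallowed_imports(generated_code)
if problems:
    print("refusing to run generated code; disallowed imports:", sorted(problems))
```

A check like this only inspects `import` statements. Dynamic imports such as `__import__("os")` or `importlib.import_module(...)` would pass it, which is one reason the summary's advice to keep monitoring generated extractors still applies.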
