LLMs are still surprisingly bad at some simple tasks (shkspr.mobi)

🤖 AI Summary
A quick test asked three commercial LLMs the same simple, deterministic question: which top-level domains (TLDs) share names with valid HTML5 elements? Rather than reliably cross-checking two lists, the models produced a mix of correct hits (e.g., .audio, .video, .link, .menu, .nav, .style, .select) alongside clear failures: omissions of valid matches, irrelevant outputs that listed HTML elements rather than matching TLDs, and outright hallucinations (one model claimed a .code TLD exists when only .codes does). One model even mixed correct subsets with invented caveats and historical trivia, producing a convincing but inaccurate answer. This matters because it highlights recurring LLM weaknesses on a task that is trivial for rule-based code or a simple set-intersection script: precise list comparison, up-to-date factual recall, and avoiding confident hallucinations. For the AI/ML community it underscores the need for grounding and retrieval augmentation, precise framing or tool use (e.g., querying authoritative registries or running programmatic checks), and stronger evaluation on deterministic tasks. The example also illustrates the Barnum-effect problem (fluent, plausible outputs that deceive non-experts) and reinforces why domain expertise and automated verification remain essential when using LLMs for factual or engineering work.
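
As a point of comparison, here is a minimal sketch of the deterministic check the summary alludes to: intersect the published IANA TLD list with a set of HTML element names. The IANA URL is the registry's real published list; the element set below is abbreviated for illustration (a full check would use the complete list from the WHATWG HTML Living Standard), and network access is assumed.

```python
import urllib.request

# IANA's published list of all current TLDs (plain text, one per line).
IANA_TLDS_URL = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

# Abbreviated set of HTML element names, for illustration only.
HTML_ELEMENTS = {
    "audio", "video", "link", "menu", "nav", "style", "select",
    "data", "map", "article", "section",
}

def matching_tlds() -> set[str]:
    """Return TLDs whose names are also HTML element names."""
    with urllib.request.urlopen(IANA_TLDS_URL) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    # Skip the comment header and normalise labels to lowercase.
    tlds = {line.strip().lower() for line in lines if line and not line.startswith("#")}
    return tlds & HTML_ELEMENTS

if __name__ == "__main__":
    print(sorted(matching_tlds()))
```

A plain set intersection like this has no opinion and no fluency; it either matches the two lists or it doesn't, which is exactly the property the summary argues LLMs lack on this kind of task.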