LLMs are getting better at character-level text manipulation (blog.burkert.me)

🤖 AI Summary
Researchers testing recent LLM generations (notably the GPT‑4.1/GPT‑5 series and Anthropic's Sonnet family) found a clear generational uplift in character‑level skills: models that previously botched simple character swaps, counting, and ciphers now perform many of those tasks reliably. In letter‑replacement experiments, GPT‑4.1 and later models consistently produced correct answers where earlier models failed; character counting remained hit‑or‑miss but improved with GPT‑4.1 and GPT‑5, especially when low‑effort reasoning was enabled.

Crucially, several state‑of‑the‑art models can now decode Base64 even when the decoded payload is not natural language (ROT20‑encrypted gibberish), indicating algorithmic generalization rather than simple memorization of common English patterns. Tokenizers still bias models away from pure character‑level operations, since tokens cluster characters and words, yet base models are increasingly able to identify and manipulate individual characters, substitute letters, and apply multi‑step decoding pipelines. Specifics from the tests: GPT‑5 and GPT‑5‑mini passed the combined Base64 and ROT20 decoding task; some models required reasoning prompts; others refused on safety grounds (Claude Sonnet 4.5, Grok 4) or produced very long internal reasoning traces (Chinese reasoning models consumed thousands of tokens).

The implications are stronger built‑in algorithmic capabilities for text preprocessing, cipher/encoding work, and robustness to out‑of‑distribution encodings, but also new safety and tool‑use tradeoffs, since refusal behaviors and stochastic decoding still limit universal reliability.
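The Base64 + ROT20 challenge described above is easy to reproduce: Caesar‑shift a sentence by 20 positions, Base64‑encode the result, and hand the encoded string to a model. A minimal sketch (the sample sentence and helper name are illustrative, not taken from the post):

```python
import base64

def rot_n(text: str, n: int) -> str:
    """Caesar-shift alphabetic characters by n positions; leave others alone."""
    out = []
    for ch in text:
        if ch.islower():
            out.append(chr((ord(ch) - ord('a') + n) % 26 + ord('a')))
        elif ch.isupper():
            out.append(chr((ord(ch) - ord('A') + n) % 26 + ord('A')))
        else:
            out.append(ch)
    return ''.join(out)

# Build the challenge: ROT20-shift a sentence, then Base64-encode the gibberish.
plaintext = "the quick brown fox jumps over the lazy dog"
rotated = rot_n(plaintext, 20)                      # non-natural-looking text
challenge = base64.b64encode(rotated.encode()).decode()

# A model that genuinely decodes (rather than pattern-matching English) must
# first undo the Base64 layer, then undo the rotation (ROT20 reversed is ROT6).
decoded = base64.b64decode(challenge).decode()
recovered = rot_n(decoded, 26 - 20)
assert recovered == plaintext
```

Because the intermediate payload is not English, a model cannot shortcut the Base64 step by recognizing familiar word fragments; it has to run the decoding procedure, which is what makes this probe evidence of algorithmic generalization.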