🤖 AI Summary
A follow-up experiment shows that a patch transformer can generate recognizable cat images by composing each 8x8 RGB patch from a learned lookup table (LUT) of small patch primitives. The model is a transformer with 16 stacked self-attention blocks operating on 64 tokens (one per 8x8 patch). For each patch the network outputs logits over a dictionary (e.g., 512 entries), and the final patch is a softmax-weighted sum of the corresponding learned 8x8 RGB patterns. Training uses a denoising-style objective (lerp the image toward noise, then predict the original); inference runs iteratively, starting from Gaussian noise. Despite the limited expressivity one might expect from such a constrained output space, the 512-pattern model reliably produces coherent cat images; the learned entries behave like an arbitrary basis rather than interpretable visual motifs.
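To make the architecture concrete, here is a minimal PyTorch sketch of the LUT decoder head, the denoising objective, and an iterative sampler under the stated sizes (8x8 patches, 64 tokens, 512 entries). The token width `DIM`, the `model(noisy, t)` transformer interface, and the re-noising schedule in `sample` are assumptions for illustration, not the author's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sizes from the summary (8x8 patches, 8x8 grid of tokens, 512 LUT entries);
# the token width DIM is an assumed hyperparameter.
PATCH, GRID, DICT, DIM = 8, 8, 512, 256

class PatchLUTDecoder(nn.Module):
    """Decode per-token features into RGB patches as a softmax-weighted
    sum over a dictionary of learned 8x8 patch primitives."""
    def __init__(self):
        super().__init__()
        self.to_logits = nn.Linear(DIM, DICT)
        # 512 learnable 8x8 RGB patterns, each flattened to 192 values.
        self.lut = nn.Parameter(0.02 * torch.randn(DICT, PATCH * PATCH * 3))

    def forward(self, tokens):                        # tokens: (B, 64, DIM)
        weights = self.to_logits(tokens).softmax(-1)  # (B, 64, 512)
        return weights @ self.lut                     # convex combo: (B, 64, 192)

def denoising_loss(model, decoder, patches):
    """Denoising-style objective: lerp clean patches toward Gaussian noise
    at a random strength t, then predict the original patches."""
    b = patches.size(0)
    t = torch.rand(b, 1, 1, device=patches.device)    # per-sample noise level
    noisy = torch.lerp(patches, torch.randn_like(patches), t)  # (1-t)*x + t*n
    pred = decoder(model(noisy, t))                   # transformer -> LUT head
    return F.mse_loss(pred, patches)

@torch.no_grad()
def sample(model, decoder, steps=32, batch=4, device="cpu"):
    """Iterative inference from pure Gaussian noise: repeatedly predict the
    clean image and re-noise it at a decreasing level (one plausible scheme)."""
    x = torch.randn(batch, GRID * GRID, PATCH * PATCH * 3, device=device)
    for t in torch.linspace(1.0, 0.0, steps):
        tb = torch.full((batch, 1, 1), float(t), device=device)
        pred = decoder(model(x, tb))
        x = torch.lerp(pred, torch.randn_like(x), tb)  # re-noise at level t
    return pred
```

Note that the softmax constrains every output patch to a convex combination of dictionary entries; that constraint is exactly the expressivity limit the experiment probes, and it is what the tanh variant below relaxes.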
Technically notable points and implications: an 8x8 RGB patch is 192-dimensional, so 512 dictionary entries give ample basis coverage, which helps explain the unexpectedly strong performance. The author plans experiments with smaller dictionaries (e.g., 64 entries), a Gram-matrix off-diagonal penalty to encourage orthogonal entries, and replacing softmax with unnormalized tanh weights to allow non-convex combinations. They also explore dynamic LUT generation, in which the network outputs factorized vectors combined via outer products into RGB patterns (increasing on-the-fly diversity), plus static learnable tokens to capture global textures. Results are promising visually (no FID was measured), suggesting that patch-wise learned dictionaries on top of transformers are a viable compact generative route, with open trade-offs between capacity, interpretability, and compute/memory efficiency.
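Hedged sketches of three of those planned variants, reusing the imports and constants from the sketch above; the class names and the 8+8+3 factor split in `FactorizedPatchHead` are assumptions (one plausible reading of "factorized vectors used as RGB outer products").

```python
def gram_offdiag_penalty(lut):
    """Off-diagonal Gram penalty: push normalized dictionary entries toward
    mutual orthogonality (added to the training loss with a small weight)."""
    d = F.normalize(lut, dim=-1)                  # (512, 192), unit rows
    gram = d @ d.t()                              # pairwise cosine similarities
    off = gram - torch.diag(torch.diag(gram))     # zero out the diagonal
    return off.pow(2).mean()

class TanhLUTDecoder(PatchLUTDecoder):
    """Variant: unnormalized tanh weights instead of softmax, so output
    patches are no longer constrained to convex combinations of entries."""
    def forward(self, tokens):
        return torch.tanh(self.to_logits(tokens)) @ self.lut

class FactorizedPatchHead(nn.Module):
    """Dynamic-LUT sketch: emit row, column, and channel factors per patch
    and form an 8x8x3 pattern as their outer product on the fly."""
    def __init__(self):
        super().__init__()
        self.to_factors = nn.Linear(DIM, 2 * PATCH + 3)

    def forward(self, tokens):                    # tokens: (B, 64, DIM)
        f = self.to_factors(tokens)
        r, c, ch = f[..., :PATCH], f[..., PATCH:2 * PATCH], f[..., 2 * PATCH:]
        # (B,64,8,1,1) * (B,64,1,8,1) * (B,64,1,1,3) -> (B,64,8,8,3)
        patch = r[..., :, None, None] * c[..., None, :, None] * ch[..., None, None, :]
        return patch.flatten(-3)                  # (B, 64, 192), like the LUT head
```

The tanh head changes no parameter counts, only the output constraint, while the factorized head replaces the static dictionary entirely with per-patch rank-1 patterns.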