The Illustrated Evo 2 (research.nvidia.com)

🤖 AI Summary
OpenGenome2 and Evo 2 were released: OpenGenome2 is an open-source, non-redundant nucleotide corpus totaling over 8.8 trillion bases from bacteria, archaea, eukarya, and bacteriophages, and Evo 2 is a family of autoregressive genomic language models (1.1B, 6.5B, and 40.3B parameters) trained on that data (1T, 2.4T, and 9.3T tokens, respectively). The work is notable as the largest genomic modeling effort to date and shows how a hybrid architecture can learn both short-range functional motifs and long-range regulatory dependencies in DNA without task-specific fine-tuning, enabling downstream genetic experiments and analyses directly from next-token likelihoods.

Technically, Evo 2 uses the StripedHyena 2 architecture: a residual stack that interleaves Hyena operators (SE, MR, and LI variants) with multi-head attention using rotary position embeddings. A Hyena operator mixes gated elementwise interactions with convolutions, using an MLP-generated implicit global filter and FFT-based convolution (FFTConv) to capture whole-sequence context at sub-quadratic cost, while the elementwise gating provides input-dependent behavior akin to attention. The chosen SE-MR-LI-MHA block layout yields the best pretraining perplexity (2.83) versus a pure-MHA baseline (3.09).

Models are byte-tokenized and trained with a modified cross-entropy that downweights repetitive DNA regions (weight 0.1) to focus learning on informative sequence content. The release highlights efficient long-range modeling alternatives to dense attention and provides a large public dataset and pretrained foundations for genomics research.
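The gating-plus-convolution idea can be sketched in a few lines of NumPy. This is an illustrative single-channel toy, not the Evo 2 implementation: real Hyena operators act per channel with learned projections and an MLP-generated filter `h`, and the function names here (`fft_conv`, `hyena_like_operator`) are hypothetical.

```python
import numpy as np

def fft_conv(u, h):
    """Causal convolution of sequence u with filter h via FFT.

    Direct convolution over a length-L sequence costs O(L^2); zero-padded
    FFT convolution costs O(L log L), which is how an implicit global
    filter can cover whole-sequence context at sub-quadratic cost.
    """
    n = 2 * len(u)  # zero-pad so circular convolution equals linear convolution
    return np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(h, n=n), n=n)[:len(u)]

def hyena_like_operator(x, h, w_q=0.9, w_k=1.1, w_v=1.0):
    """Order-2 Hyena-style mixing on one channel: gate -> long conv -> gate.

    The elementwise products are the input-dependent gating the summary
    likens to attention; fft_conv(k * v, h) is the global convolution.
    The scalar "projections" w_q, w_k, w_v stand in for learned ones.
    """
    q, k, v = w_q * x, w_k * x, w_v * x
    return q * fft_conv(k * v, h)

rng = np.random.default_rng(0)
x = rng.standard_normal(512)  # one channel of an embedded DNA sequence
h = rng.standard_normal(512)  # stand-in for the MLP-generated implicit filter
y = hyena_like_operator(x, h)
```

The key design point is that the filter `h` spans the entire sequence, so a single operator mixes information globally, while the surrounding gates keep the computation input-dependent.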
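The byte tokenization and the repeat-downweighted loss are also easy to illustrate. The sketch below hand-sets the 0.1/1.0 position weights mentioned in the summary; how repetitive regions are actually detected is outside its scope, and `weighted_next_token_loss` is a hypothetical name, not the Evo 2 training code.

```python
import numpy as np

# Byte tokenization: each nucleotide character is its own token ID.
seq = b"ACGTAA"
targets = np.frombuffer(seq, dtype=np.uint8)  # ASCII bytes, e.g. A -> 65

def weighted_next_token_loss(logits, targets, weights):
    """Cross-entropy over next-token logits, scaled per position.

    Positions flagged as repetitive DNA get weight 0.1, all others 1.0,
    so low-complexity runs do not dominate the gradient.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return (weights * nll).sum() / weights.sum()

rng = np.random.default_rng(0)
logits = rng.standard_normal((len(targets), 256))    # full byte vocabulary
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.1, 0.1])   # trailing run marked repetitive
loss = weighted_next_token_loss(logits, targets, weights)
```

Normalizing by the weight sum keeps the loss scale comparable across batches with different proportions of repetitive sequence.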