AI trained on bacterial genomes produces never-before-seen proteins (arstechnica.com)

🤖 AI Summary
A Stanford team announced Evo, a "genomic language model" trained on a large collection of bacterial genomes that learns DNA-level organization the same way large language models learn text: by predicting the next base in a sequence. The model exploits a key feature of bacterial genomes, operons and gene clustering, in which functionally related genes sit next to one another and are often co-transcribed, to learn patterns linking nucleotide context to downstream protein-coding outcomes. Evo is generative: a prompt can yield multiple novel sequences, and the model can output predicted protein sequences, including candidates that don't resemble any known protein.

The work matters because it shifts protein prediction and design from amino-acid-centric models to models that learn directly from genomic context, capturing noncoding signals, redundancy, and regulatory organization that shape how proteins arise in vivo. Practically, that opens avenues for discovering genuinely novel proteins, designing synthetic biology constructs at the DNA level, and extracting evolutionary insight from genome structure. The key technical takeaways are that operon-driven co-localization provides a strong training signal for next-base prediction, and that DNA-trained generative models can produce plausibly functional protein sequences. Limitations remain: Evo's signal derives from bacterial genome architecture, so generalization to other domains of life and experimental validation of generated sequences' function are the crucial next steps.
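The summary doesn't detail Evo's architecture, but the training objective it names is the ordinary next-token objective applied to a four-letter alphabet. The sketch below is a minimal, hypothetical illustration of that idea, not Evo itself: a tiny PyTorch GRU trained on next-base prediction, then sampled to extend a DNA prompt, with one reading frame translated into a candidate protein via Biopython. All names (`NextBaseLM`, `train_step`, `sample`) and the toy data are invented for illustration.

```python
# Minimal sketch of DNA-level language modeling (NOT Evo's architecture):
# next-base prediction over the four-letter alphabet, then sampling.
import torch
import torch.nn as nn

BASES = "ACGT"
STOI = {b: i for i, b in enumerate(BASES)}

class NextBaseLM(nn.Module):
    """Tiny GRU language model over single-base tokens (illustrative stand-in)."""
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(len(BASES), d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, len(BASES))

    def forward(self, x, h=None):
        z, h = self.rnn(self.emb(x), h)
        return self.head(z), h

def train_step(model, opt, batch):
    """Next-base prediction: shift the sequence by one position and
    minimize cross-entropy, exactly the next-token objective of text LLMs."""
    logits, _ = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(BASES)), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def sample(model, prompt="ATG", n=300, temp=1.0):
    """Autoregressively extend a DNA prompt, one base at a time."""
    x = torch.tensor([[STOI[b] for b in prompt]])
    h = None
    out = list(prompt)
    for _ in range(n):
        logits, h = model(x, h)
        probs = torch.softmax(logits[0, -1] / temp, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        out.append(BASES[nxt])
        x = torch.tensor([[nxt]])
    return "".join(out)

# Toy usage: the "genomes" here are random DNA windows; in the real
# setting this would be a large corpus of bacterial genome segments.
data = torch.randint(0, 4, (32, 128))
model = NextBaseLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    train_step(model, opt, data)

dna = sample(model, prompt="ATG", n=300)

# Translate one reading frame with Biopython's standard codon table
# (extra dependency); real work would scan all six frames for ORFs.
from Bio.Seq import Seq
protein = str(Seq(dna[: len(dna) // 3 * 3]).translate(to_stop=True))
```

The GRU is only a stand-in; the transferable point is that the bare next-base objective, applied to genomes where operons place related genes together, is what lets a DNA-level model pick up protein-coding structure, and that sampled DNA can then be read back out as candidate protein sequences.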