Announcing Magika 1.0: now faster, smarter, and rebuilt in Rust (opensource.googleblog.com)

0 points 6 hours ago | visit original

🤖 AI Summary

Google Security Research has released Magika 1.0, the first stable version of its open-source, AI-powered file-type detector. It has been rebuilt from the ground up in Rust, and support has expanded from ~100 to over 200 content types. The release includes a native Rust CLI (also bundled in the Python package), refreshed Python and TypeScript modules, and improved accuracy for tricky text-based formats such as code and config files. Notable new detections cover data science/ML assets (ipynb, npy/npz, pytorch, onnx, parquet, h5), modern languages and web formats (TypeScript, Kotlin, Swift, wasm), DevOps manifests (dockerfile, toml, hcl, bazel), and binary formats (sqlite, psd, dwg). Magika also adds finer-grained distinctions (JSONL vs JSON, TSV vs CSV, C++ vs C), which lets a pipeline choose the right parser or handler directly from the detected type.

On the training side, Magika’s dataset grew to over 3 TB uncompressed, so the team used SedPack to stream and decompress data in memory and avoid I/O bottlenecks. To address data scarcity for rare or legacy formats, they generated high-quality synthetic examples with Gemini and applied advanced augmentation.

The Rust core uses ONNX Runtime for inference and Tokio for async parallelism, enabling hundreds of files/sec on a single core and ~1,000 files/sec on an M4 MacBook Pro, scaling to thousands on multi-core machines. For AI/ML teams this means faster, safer preprocessing, better dataset labeling and deduplication, and more reliable ingestion of specialized model artifacts and configs.
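
To make the point about finer-grained distinctions concrete, here is a minimal sketch of how a pipeline might pick a parser once it knows whether a file is JSON or JSONL, or CSV or TSV. It does not call Magika: the `load_records` function is hypothetical, and the label strings ("json", "jsonl", "csv", "tsv") are assumed for illustration rather than taken from Magika's documentation. Only the Python standard library is used.

```python
import csv
import io
import json


def load_records(label: str, text: str) -> list:
    """Parse `text` according to a content-type label.

    `label` is assumed to come from a file-type detector. The label
    strings below are illustrative, not Magika's documented output.
    """
    if label == "json":
        data = json.loads(text)
        # Normalize to a list so every branch returns the same shape.
        return data if isinstance(data, list) else [data]
    if label == "jsonl":
        # JSONL holds one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if label in ("csv", "tsv"):
        # The only difference between the two is the field delimiter.
        delimiter = "," if label == "csv" else "\t"
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    raise ValueError(f"no parser registered for label {label!r}")
```

If a JSONL file were labeled as plain JSON, `json.loads` would fail on the second line, and a TSV file read with a comma delimiter would collapse each row into a single column. Getting the label right up front is what avoids both failure modes.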
