🤖 AI Summary
Google Security Research has released Magika 1.0, the first stable version of its open-source AI file-type detector, rebuilt from the ground up in Rust and expanding support from ~100 to over 200 content types. The release includes a native Rust CLI (also bundled in the Python package), refreshed Python and TypeScript modules, and improved accuracy for tricky text-based formats such as code and config files. Notable new detections cover Data Science/ML assets (ipynb, npy/npz, pytorch, onnx, parquet, h5), modern languages and web formats (TypeScript, Kotlin, Swift, wasm), DevOps manifests (dockerfile, toml, hcl, bazel), and binary formats (sqlite, psd, dwg). Magika also adds finer-grained distinctions (JSONL vs JSON, TSV vs CSV, C++ vs C), so a pipeline can act on the detected type directly, for example by routing each file to the matching parser.
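To illustrate why the finer-grained distinctions matter downstream, here is a minimal Python sketch of routing raw bytes to a parser based on a detected label. It does not call Magika itself: the `parse_by_label` helper is hypothetical, and the label strings ("json", "jsonl", "csv", "tsv") are assumed to match whatever the detector reports.

```python
import csv
import io
import json


def parse_by_label(label: str, data: bytes):
    """Parse raw bytes according to a detected content-type label.

    `label` is assumed to come from a file-type detector; the label
    strings used here are illustrative, not a confirmed list.
    """
    text = data.decode("utf-8")
    if label == "json":
        # A single JSON document.
        return json.loads(text)
    if label == "jsonl":
        # One JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if label in ("csv", "tsv"):
        # Same reader, different delimiter.
        delimiter = "," if label == "csv" else "\t"
        return list(csv.reader(io.StringIO(text), delimiter=delimiter))
    raise ValueError(f"no parser registered for label {label!r}")
```

A detector that only reported "JSON" or "CSV" would force the pipeline to guess between the two branches in each pair. With the finer label the choice is mechanical. In practice the label would come from the detector's result object; check the current Magika documentation for the exact attribute names.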
On the technical side, Magika's training dataset grew to over 3 TB uncompressed, so the team used SedPack to stream and decompress data in memory and avoid I/O bottlenecks. To address data scarcity for rare or legacy formats, they generated high-quality synthetic examples with Gemini and applied advanced augmentation. The Rust core uses ONNX Runtime for inference and Tokio for async parallelism, enabling hundreds of files/sec on a single core and ~1,000 files/sec on an M4 MacBook Pro, scaling to thousands on multi-core machines. For AI/ML teams, this means faster, safer preprocessing, better dataset labeling and deduplication, and more reliable ingestion of specialized model artifacts and configs.
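The general idea of streaming decompression in memory can be sketched with the Python standard library. This is not SedPack's API. The gzip format, the fixed record size, and the `stream_records` helper are all assumptions chosen only to show the pattern of decompressing incrementally from a buffer rather than writing an uncompressed copy to disk.

```python
import gzip
import io
from typing import Iterator


def stream_records(compressed: bytes, record_size: int) -> Iterator[bytes]:
    """Yield records decompressed on the fly from an in-memory gzip buffer.

    Only about one record of uncompressed data is held at a time; the
    full uncompressed payload is never materialized on disk.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        while True:
            record = stream.read(record_size)
            if not record:  # empty bytes means end of stream
                break
            yield record
```

A training loop could consume such a generator directly, so storage reads stay proportional to the compressed size rather than the uncompressed 3 TB.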