🤖 AI Summary
DINOtool is a lightweight command-line and Python toolkit for extracting ViT visual features from images, videos, or folders using modern foundation models (DINOv2, DINOv3, CLIP, SigLIP2, AM-RADIO). It exposes a unified API and CLI (installable via pip install dinotool) that produces either global (frame-level) embeddings or local patch-level feature maps, and can optionally visualize patch features with PCA. That makes it useful for building vector databases, temporal/visual retrieval, patch-level analyses (attention-map reconstruction, segmentation), clustering, and downstream research prototypes without custom model plumbing.
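As a rough illustration of the PCA visualization the summary mentions, the sketch below projects a patch-level feature map onto its top three principal components so the grid can be rendered as an RGB image. This is a generic NumPy sketch, not DINOtool's actual implementation; the random features and shapes are stand-ins matching the example dimensions quoted later (a 56×56 grid of 384-dim embeddings).

```python
import numpy as np

# Hypothetical stand-in for a DINOtool local feature map: a 56x56 grid
# of 384-dim patch embeddings (random here, for illustration only).
rng = np.random.default_rng(0)
local_features = rng.normal(size=(56, 56, 384)).astype(np.float32)

# Flatten the spatial grid to (n_patches, dim) and center it.
flat = local_features.reshape(-1, 384)
flat = flat - flat.mean(axis=0)

# Top-3 principal components via SVD; project every patch onto them.
_, _, vt = np.linalg.svd(flat, full_matrices=False)
proj = flat @ vt[:3].T                    # (3136, 3)

# Min-max scale each component to [0, 1] so the grid renders as RGB.
span = np.ptp(proj, axis=0)
proj = (proj - proj.min(axis=0)) / (span + 1e-8)
rgb = proj.reshape(56, 56, 3)
print(rgb.shape)
```

With real features, nearby patches of the same object tend to get similar colors, which is what makes the PCA overlay a quick qualitative check of the feature map.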
Key technical details: models are selectable via shortcuts (default dinov2_vits14_reg), the gated DINOv3 models require Hugging Face authorization, and OpenCLIP/timm models (e.g., SigLIP2) are supported too. The feature export modes—full (.nc / .zarr, keeping spatial layout), flat (partitioned .parquet of patches), and frame (.parquet of global vectors)—trade off fidelity against scalability. Example shapes: a local_features tensor of [1, 56, 56, 384] and a global vector of [1, 384]. Batch processing is supported (use --input-size W H for folders or HD video), and GPU performance depends on your torch/FFmpeg setup: ffmpeg is required, and GPU use on Windows needs WSL2 or a manually installed GPU build of torch. Outputs are ready for analytics (Parquet, Zarr, NetCDF) and ship with demos for masked PCA and for reading outputs, making DINOtool a practical bridge from foundation ViTs to applied visual ML workflows.
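The shape bookkeeping behind those example tensors can be sketched as follows. This assumes a ViT with patch size 14 and embedding dim 384 (matching the dinov2_vits14_reg default), so an input resized to 784×784 yields a 56×56 patch grid; mean pooling is shown as a stand-in for the global embedding, which in practice may come from a CLS token instead.

```python
import numpy as np

patch, dim = 14, 384        # ViT-S/14 patch size and embedding dim
h, w = 784, 784             # input resized so 784 / 14 = 56 patches per side

# Local features keep the spatial layout: one embedding per patch.
local = np.zeros((1, h // patch, w // patch, dim))   # [1, 56, 56, 384]

# Global vector illustrated via mean pooling over the patch grid
# (a real model may use a CLS token; this is just the shape logic).
global_vec = local.mean(axis=(1, 2))                 # [1, 384]

print(local.shape, global_vec.shape)
```

The same arithmetic explains why --input-size matters for folders and HD video: the patch-grid dimensions, and hence the local feature-map shape, follow directly from the resized input.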