LightlyStudio – an open-source multimodal data curation and labeling tool (github.com)

🤖 AI Summary
LightlyStudio is an open-source, cross-platform tool (pip install lightly-studio) for multimodal data curation, annotation and dataset management, released as preview v0.4.0 on 2025-10-21. Built with Rust for performance and exposing a Python 3.8+ interface, it runs on Windows, Linux and macOS; Lightly highlights that it can index COCO/ImageNet subsets on an M1 MacBook Pro with 16GB RAM. Datasets are created and populated programmatically with ls.Dataset.create() and the loaders add_samples_from_path, add_samples_from_yolo, add_samples_from_coco and add_samples_from_coco_caption, and the UI is then started with ls.start_gui(). The SDK persists to a local .db file, supports cloud sources (S3/GCS), and can export query results to COCO formats.

Significance: LightlyStudio aims to unify dataset engineering, inspection and labeling workflows so that researchers and annotation teams can iterate faster and at lower cost. Its query language (AND/OR/NOT, SampleField, OrderByField, slicing) filters and sorts samples, and together with per-sample metadata/tags and the export utilities it supports dataset slicing, auditing and reproducibility. A premium automated selection feature (the multi_strategies API) combines computed typicality metadata with embedding-based diversity strategies to pick samples that are both representative and novel, which can reduce labeling cost and improve training data quality. The project is open for contributions and is intended as an extensible platform for integrating selection, embeddings and annotation pipelines into ML workflows.
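To make the idea of combining typicality with embedding-based diversity concrete, here is a minimal pure-Python sketch of a greedy selection loop. It is not LightlyStudio's implementation and does not use its multi_strategies API: the function names select_samples and cosine_distance, the diversity_weight parameter, and the linear scoring rule are all assumptions chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_samples(embeddings, typicality, n_samples, diversity_weight=0.5):
    """Greedily pick sample indices that balance typicality and diversity.

    Hypothetical helper, not part of any library. Each candidate is scored as
    diversity_weight * (distance to nearest already-selected sample)
    + (1 - diversity_weight) * typicality.
    """
    if n_samples <= 0 or not embeddings:
        return []
    n_samples = min(n_samples, len(embeddings))

    # Seed with the most typical sample so the first pick is representative.
    first = max(range(len(embeddings)), key=lambda i: typicality[i])
    selected = [first]
    chosen = {first}

    # Track each sample's distance to its nearest selected sample.
    min_dist = [cosine_distance(e, embeddings[first]) for e in embeddings]

    while len(selected) < n_samples:
        best, best_score = None, -math.inf
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            score = (diversity_weight * min_dist[i]
                     + (1.0 - diversity_weight) * typicality[i])
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        chosen.add(best)
        # Update nearest-selected distances with the newly chosen sample.
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[best]))
    return selected


if __name__ == "__main__":
    # Toy data: two near-duplicate pairs pointing in different directions.
    toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    toy_typicality = [0.9, 0.8, 0.7, 0.1]
    print(select_samples(toy_embeddings, toy_typicality, n_samples=2))
```

With diversity_weight=0.0 the loop reduces to picking the most typical samples, which tends to return near-duplicates; raising the weight pushes the selection toward samples far from those already chosen.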