🤖 AI Summary
The announcement of voyage-multimodal-3.5 marks a significant advancement in multimodal embedding models, which can now integrate text, images, and video for retrieval tasks. Building on its predecessor, voyage-multimodal-3, the model adds explicit support for video frames and reports higher average retrieval accuracy across a range of datasets: 4.56% above Cohere Embed v4 and 4.65% above Google Multimodal Embedding 001. It uses a unified transformer architecture that reduces the modality gap seen in CLIP-based models, enabling more precise semantic matching between different content types.
Technically, voyage-multimodal-3.5 processes all inputs through a single encoder, embedding text, images, and now video into a shared vector space, which improves the relevance of retrieved content. The model supports Matryoshka embeddings, offering flexible output dimensionality, and the announcement includes best practices for segmenting longer videos to preserve embedding quality. With competitive performance against top models on both multimodal and standard text retrieval benchmarks, voyage-multimodal-3.5 positions itself as a strong option for teams building search and retrieval over mixed content types.
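A minimal sketch of how this could look with the voyageai Python SDK's `multimodal_embed` call is shown below. The model name `voyage-multimodal-3.5`, the frames-per-segment input format for video, the file paths, and the 512-dimension Matryoshka truncation target are all assumptions for illustration, not a confirmed API surface.

```python
import numpy as np
import voyageai
from PIL import Image

# Assumed model name from the announcement; not verified against the live API.
MODEL = "voyage-multimodal-3.5"

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# Documents mixing text and images; for video, the announcement's guidance is to
# segment longer clips and embed sampled frames per segment (paths are placeholders).
doc_inputs = [
    ["Quarterly report cover page", Image.open("cover.png")],
    [Image.open("frame_000.png"), Image.open("frame_030.png")],
]
query_inputs = [["chart showing revenue growth"]]

doc_result = vo.multimodal_embed(inputs=doc_inputs, model=MODEL, input_type="document")
query_result = vo.multimodal_embed(inputs=query_inputs, model=MODEL, input_type="query")

docs = np.array(doc_result.embeddings)     # shape (2, D)
query = np.array(query_result.embeddings)  # shape (1, D)

# Matryoshka embeddings: keep only a prefix of the vector and re-normalize,
# trading a little accuracy for cheaper storage and faster search.
def truncate(v: np.ndarray, dim: int = 512) -> np.ndarray:
    v = v[..., :dim]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Because text, images, and video frames share one vector space, a single
# dot product ranks all content types against the text query.
scores = truncate(docs) @ truncate(query).T
print(scores.ravel())
```

The key point the sketch illustrates is that there is one encoder and one vector space: queries and documents of any modality are scored with the same similarity computation, rather than routing through separate text and image towers.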