multimodal sentence embeddings yay

April 14 2026 @ 10:01

sentence-transformers now maps text, images, audio, and video into a shared embedding space[1]. Qwen3-VL and NVIDIA Nemotron are supported out of the box, along with retrieve-and-rerank pipelines.
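The retrieve-and-rerank idea can be sketched with toy vectors. In a real pipeline the embeddings would come from a `SentenceTransformer` bi-encoder and the rerank scores from a `CrossEncoder`; the corpus, query, and scorer below are hypothetical stand-ins that just show the two-stage shape:

```python
import numpy as np

# Toy embeddings standing in for bi-encoder output (hypothetical data);
# a real pipeline would get these from SentenceTransformer.encode().
corpus = {
    "doc_a": np.array([0.9, 0.1, 0.0]),
    "doc_b": np.array([0.1, 0.9, 0.0]),
    "doc_c": np.array([0.7, 0.3, 0.0]),
}
query = np.array([1.0, 0.0, 0.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stage 1: cheap retrieval, ranking the whole corpus by cosine similarity.
retrieved = sorted(corpus, key=lambda d: cosine(query, corpus[d]), reverse=True)[:2]

# Stage 2: rerank only the short list with a more expensive scorer;
# in practice this would be CrossEncoder.predict() on (query, doc) text pairs.
def expensive_score(doc_id):
    return cosine(query, corpus[doc_id])  # placeholder for a cross-encoder

reranked = sorted(retrieved, key=expensive_score, reverse=True)
print(reranked)  # → ['doc_a', 'doc_c']
```

The point of the split: the bi-encoder scales to the full corpus because documents are embedded once, while the cross-encoder sees each (query, document) pair jointly and is only affordable on the retrieved short list.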

NbAiLab has nb-sbert-v2[2], a Norwegian sentence embedding model trained on 527k NLI pairs, and Borealis[3], a Norwegian instruction finetune of Gemma-3-27b. Gemma-3 is a vision-language model, but its mmproj (multimodal projector) is untrained on Norwegian data.

mmproj

Borealis is also a text-only finetune: the vision components from the Gemma-3 base are untouched, so whether it works properly in multilingual/VLM applications remains to be seen.
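For context, an mmproj is just a small projection that maps vision-encoder patch features into the LLM's token embedding space. A minimal sketch, with all dimensions hypothetical (not Gemma-3's actual sizes) and a single linear layer standing in for what is usually a small trained MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; real Gemma-3 sizes differ.
vision_dim, llm_dim, n_patches = 64, 128, 16

# Vision encoder output: one feature vector per image patch.
patch_features = rng.normal(size=(n_patches, vision_dim))

# The mmproj: here a single linear layer as a stand-in. Real projectors
# are trained so projected features land where the LLM expects tokens.
W = rng.normal(size=(vision_dim, llm_dim)) / np.sqrt(vision_dim)
b = np.zeros(llm_dim)

image_tokens = patch_features @ W + b  # shape (n_patches, llm_dim)

# These projected vectors are spliced into the LLM input sequence
# alongside ordinary text token embeddings.
print(image_tokens.shape)  # → (16, 128)
```

This is why a text-only finetune leaves the question open: the LLM weights move during finetuning, but the projection that feeds image tokens into them stays frozen at whatever alignment the base model learned.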


  1. https://huggingface.co/blog/multimodal-sentence-transformers ↩︎

  2. https://huggingface.co/NbAiLab/nb-sbert-v2-base ↩︎

  3. https://huggingface.co/NbAiLab/borealis-27b-instruct-preview ↩︎