multimodal sentence embeddings yay
April 14 2026 @ 10:01
sentence-transformers now maps text, images, audio, and video into a shared embedding space[1]. Qwen3-VL and NVIDIA Nemotron are supported out of the box, along with retrieve-and-rerank pipelines.
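the retrieve-and-rerank shape is easy to sketch without any real model: a cheap first-stage encoder pulls top-k candidates by cosine similarity, then a heavier scorer reorders just those k. a minimal numpy sketch with random stand-in embeddings (the actual sentence-transformers API, model names, and cross-encoder scoring are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# stand-ins for bi-encoder embeddings of one query and 100 documents
query = normalize(rng.normal(size=(1, 64)))
docs = normalize(rng.normal(size=(100, 64)))

def retrieve(query, docs, k=10):
    # stage 1: dot product == cosine similarity, since rows are unit-norm
    scores = (query @ docs.T).ravel()
    topk = np.argsort(scores)[::-1][:k]
    return topk

def rerank(query, docs, candidates):
    # stage 2: placeholder for a cross-encoder; here just a second,
    # different scoring pass over the k candidates only
    scores = np.abs(query @ docs[candidates].T).ravel()
    order = np.argsort(scores)[::-1]
    return candidates[order]

cands = retrieve(query, docs, k=10)
ranked = rerank(query, docs, cands)
print(ranked[:3])
```

the point of the two-stage split is cost: the cheap cosine pass touches all 100 docs, the expensive reranker only the 10 survivors.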
NbAiLab has nb-sbert-v2[2], a Norwegian sentence embedding model trained on 527k NLI pairs, and Borealis[3], a Norwegian instruction finetune of Gemma-3-27b. Gemma-3 is a vision-language model, but Borealis is a text-only finetune: the vision components from the Gemma-3 base are untouched, and its mmproj (the multimodal projector) has seen no Norwegian data, so whether it works properly in multilingual VLM applications remains to be seen.