multimodal sentence embeddings yay

April 14 2026 @ 10:01

sentence-transformers now maps text, images, audio, and video into a shared embedding space[1]. Qwen3-VL and NVIDIA Nemotron are supported out of the box, along with retrieve-and-rerank pipelines.
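The retrieve-and-rerank idea can be sketched with toy vectors. In a real pipeline the embeddings would come from a `SentenceTransformer` bi-encoder and the rerank scores from a `CrossEncoder`; the corpus, query, and scorer below are hypothetical stand-ins that just show the two-stage shape:

```python
import numpy as np

# Toy embeddings standing in for bi-encoder output (hypothetical data);
# a real pipeline would get these from SentenceTransformer.encode().
corpus = {
    "doc_a": np.array([0.9, 0.1, 0.0]),
    "doc_b": np.array([0.1, 0.9, 0.0]),
    "doc_c": np.array([0.7, 0.3, 0.0]),
}
query = np.array([1.0, 0.0, 0.0])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stage 1: cheap retrieval, ranking the whole corpus by cosine similarity.
retrieved = sorted(corpus, key=lambda d: cosine(query, corpus[d]), reverse=True)[:2]

# Stage 2: rerank only the short list with a more expensive scorer;
# in practice this would be CrossEncoder.predict() on (query, doc) text pairs.
def expensive_score(doc_id):
    return cosine(query, corpus[doc_id])  # placeholder for a cross-encoder

reranked = sorted(retrieved, key=expensive_score, reverse=True)
print(reranked)  # → ['doc_a', 'doc_c']
```

The point of the split: the bi-encoder scales to the full corpus because documents are embedded once, while the cross-encoder sees each (query, document) pair jointly and is only affordable on the retrieved short list.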

NbAiLab has nb-sbert-v2[2], a Norwegian sentence embedding model trained on 527k NLI pairs, and Borealis[3], a Norwegian instruction finetune of Gemma-3-27b. Gemma-3 is a vision-language model, but its mmproj (multimodal projector) is untrained on Norwegian data.

mmproj

Borealis is also a text-only finetune: the vision components from the Gemma-3 base are untouched, so whether it works properly in multilingual/VLM applications remains to be seen.
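For context, an mmproj is just a small projection that maps vision-encoder patch features into the LLM's token embedding space. A minimal sketch, with all dimensions hypothetical (not Gemma-3's actual sizes) and a single linear layer standing in for what is usually a small trained MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; real Gemma-3 sizes differ.
vision_dim, llm_dim, n_patches = 64, 128, 16

# Vision encoder output: one feature vector per image patch.
patch_features = rng.normal(size=(n_patches, vision_dim))

# The mmproj: here a single linear layer as a stand-in. Real projectors
# are trained so projected features land where the LLM expects tokens.
W = rng.normal(size=(vision_dim, llm_dim)) / np.sqrt(vision_dim)
b = np.zeros(llm_dim)

image_tokens = patch_features @ W + b  # shape (n_patches, llm_dim)

# These projected vectors are spliced into the LLM input sequence
# alongside ordinary text token embeddings.
print(image_tokens.shape)  # → (16, 128)
```

This is why a text-only finetune leaves the question open: the LLM weights move during finetuning, but the projection that feeds image tokens into them stays frozen at whatever alignment the base model learned.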


  1. https://huggingface.co/blog/multimodal-sentence-transformers ↩︎

  2. https://huggingface.co/NbAiLab/nb-sbert-v2-base ↩︎

  3. https://huggingface.co/NbAiLab/borealis-27b-instruct-preview ↩︎