preliminary pages, thesis.
summary
Modern language models can effectively process unstructured and noisy data, such as is often found in conversational text.
This capability has been enabled through pre-training on large and varied data sources, which teaches the models to understand language despite grammatical variation and irregularities.
The Transformer architecture has been central to this development. It is used in everything from task-specific encoder models for language understanding and classification, to encoder-decoder models for translation and summarization, and decoder models for flexible text generation.
Decoder models have also driven the development of large language models: generative models with billions of parameters that can perform tasks following instructions in natural language.
The performance of such models generally improves with the scaling of parameters and data volumes.
The most advanced models, however, are so large that they require cloud-based services or expensive hardware. This hinders adoption in applications where sensitive data, low latency, and predictable results are important.
Recently, there have been major improvements in the trade-off between capability and size, and better and smaller models are released continually.
Combined with methods that reduce the memory footprint of the models, advanced applications of language technology have become more accessible than ever.
This thesis explores efficient use of language technology for Norwegian and low-resource languages, motivated by practical applications and through collaboration with industry partners.
The applications include compression and translation of subtitles, topic classification of customer service calls, and document analysis in criminal investigations.
The research is categorized under metadata generation and summarization.
Metadata generation covers task-specific annotations from coreference resolution and named entity recognition, latent data from vector representations, and context-based metadata guided by user queries.
Summarization is used both to reduce longer texts in a Retrieval-Augmented Generation system and, for shorter texts, in sentence compression for subtitles with length constraints.
The research also addresses evaluation and discusses issues that arise in the automatic evaluation of generated text.
Work with large language models in the context of criminal investigations is thoroughly evaluated by humans.
A benchmark developed on Norwegian customer service data evaluates the generalization ability of large language models through direct (zero-shot) instructions.
The results show that smaller models can achieve good performance on a broad range of tasks involving unstructured and conversational text.
With a consistent focus on resource constraints, the thesis supports future applications of low-resource language technology.
abstract
Modern language models can effectively process unstructured and noisy data, qualities that characterize conversational text.
This property has been enabled through pre-training on large and diverse corpora, allowing models to understand language despite grammatical variation and imperfections.
The Transformer architecture has been central to this development, used in everything from task-specific encoders for language understanding and classification to encoder-decoder models for translation and summarization, and decoder models for flexible text generation.
Decoder models have powered the development of Large Language Models (LLMs), generative models with billions of parameters that can perform tasks via instructions in natural language.
While performance generally improves by scaling up parameters and data, the largest state-of-the-art models require either cloud-based services or expensive hardware, hindering adoption in applications concerned with data sensitivity, low latency, and predictable outcomes.
Recently, there have been great improvements in the trade-off between capability and size, with powerful models frequently released at manageable sizes.
Combined with methods that reduce the memory footprint of these models, advanced applications of language technology have been made highly accessible.
This thesis explores efficient use of language technology for Norwegian and low-resource languages, motivated by practical applications and through collaboration with industry partners.
Applications include subtitle compression and translation, topic classification of customer service calls, and document analysis for criminal investigations.
The research is categorized under metadata generation and summarization.
Metadata generation includes task-specific annotations from Coreference Resolution and Named Entity Recognition, latent representations through embedding systems, and context-based metadata guided by user queries.
Summarization involves both long-form text reduction in a Retrieval-Augmented Generation system and sentence compression for subtitles.
Another aspect of the research is evaluation, including the limitations of assessing generated text with automatic metrics.
Work involving LLMs in the context of criminal investigations is thoroughly evaluated by humans.
A benchmark developed on Norwegian customer service data assesses the generalizability of LLMs in zero-shot settings.
Results show that smaller models can achieve promising performance across a wide range of tasks, efficiently processing unstructured and conversational text.
With a consistent focus on resource constraints, this thesis supports applications in low-resource language technology.
preface
In the fall of 2021, I started my PhD with an increasing fascination for language technology, a field that traces back to exploratory work on machine translation and summarization in the 1950s. Despite incredible advances since then, it still felt like a niche area of study when I described my plans to people. Few had heard of a language model.
Throughout 2022, however, GitHub Copilot changed how we code, Whisper gave us incredible speech recognition, and ChatGPT became the fastest-growing software application in history.
The term "AI" has turned into something more than an abstract concept for the general public: language models have become AI, in the form of a chatty stochastic parrot dressed up in a familiar interface.
Fast-forward a couple of years, and the fascination with these technologies is at an all-time high.
Researchers and major corporations are striving to discover the next big thing, whether by mimicking thinking, through agentic systems, or with protocols for language models to interact with external applications.
We are not dealing with just models anymore. Behind the interfaces is a concoction of systems, generating increasingly impressive output from less and less input.
As I observed language models turning into the new gold rush, I found myself wanting to step back and focus on practical applications and implementations that could run on consumer-grade hardware.
These decisions redefined my interests, and this thesis includes work that aligns with my hopes of contributing to useful technology in the near future.
What a time to be alive!
acknowledgements
Computers have always intrigued me, and I imagined myself following a path in technology from an early age.
This path, however, quickly turned crooked as I dropped out of high school -- twice.
With help and encouragement from my family, and a hint of luck, I moved away and made a final attempt.
A special thank you to those who enabled and motivated me to complete this essential step.
While too many to name, I am particularly grateful for the supportive teachers Erik Edsberg and Lina Johanna Kjelsvik.
At university, I realized that practical applications engaged me in a way theory never could.
Through developer roles in student organizations like Revolve (Formula Student) and Studentersamfundet (the student society), I found both community and an incentive to learn programming.
Thanks to the friends I made here, and especially to Kim Isak Olsen, Eirik Vale Aase, and tech guru Alf Jonassen for making programming fun!
Moreover, none of this would have happened without the countless study sessions, dinners, and talks with Erlend Skarpnes, Håvard Løkensgard, Thomas Markussen, and many more.
After four years, Björn Gambäck's course on Natural Language Processing became the catalyst for my interest in the field.
I completed my Master's thesis under his supervision, and he later hinted at the possibility of pursuing a PhD. The thought had never struck me.
The year thereafter, I discovered a position in language technology, and the path ahead suddenly became clear. My utmost gratitude goes to my advisors, Ole Jakob Mengshoel and Björn Gambäck, for their expertise and guidance along the way.
To the members of the SCRIBE project: the monthly status updates provided great motivation to keep making progress!
I am also grateful to Mads Skipanes, Mike Riess, and to everyone I have collaborated with.
Thanks to the Language Technology Group at the University of Oslo for the many productive discussions over the years. A special thanks to Sondre Wold, Petter Mæhlum, and Egil Rønningstad for the memorable journeys to conferences.
To the 3rd floor gang (you know who you are) for being great colleagues and friends. To Truls Stoltenberg for the frequent phone calls about everything, Mattias Hansson for the chats and new hobbies, and David Francis for a friendship spanning all these odd parts of life.
Lastly, to Kristine, for your love and patience (and the "occasional" reminders to work out). To my daughter Irene for bringing an extra sense of purpose to everything I do.