Course conference 2025

Published

March 19, 2025

Schedule

Date and location: Wednesday, 2025-03-19, 13:15–15:00, Alan Turing

Instructions for the presenters

Slot Time Group
1 13:15–13:30 G1
2 13:30–13:45 G2
3 13:45–14:00 G3
4 14:00–14:15 G4
5 14:15–14:30 G5
6 14:30–14:45 G6
7 14:45–15:00 G7

Abstracts

G1

Analyzing and Mitigating Fairness Issues in NLP Models

Bias in NLP can lead to unfair treatment of different demographic groups, reinforcing societal inequalities and reducing trust in AI applications. In this work, we first identified key metrics commonly used to evaluate and illustrate bias in NLP models. We then explored the predominant mitigation strategies employed to address these biases. We selected the “Jigsaw Unintended Bias in Toxicity Classification” dataset to assess bias empirically, as it provides a suitable benchmark for evaluating bias across multiple domains, including racial, political, and general biases. Through our evaluation, we highlight both the strengths and limitations of our chosen approaches in mitigating bias within toxicity classification tasks.
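The Jigsaw competition's headline bias metrics compare a model's ROC AUC restricted to comments mentioning an identity subgroup against its overall AUC. A minimal stdlib sketch of that idea (hypothetical function names, not the group's code):

```python
def roc_auc(labels, scores):
    """ROC AUC computed via the rank-sum (Mann-Whitney U) statistic:
    the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_auc(labels, scores, in_subgroup):
    """AUC restricted to examples mentioning one identity subgroup;
    a large gap versus the overall AUC signals bias for that group."""
    sub = [(y, s) for y, s, m in zip(labels, scores, in_subgroup) if m]
    ys, ss = zip(*sub)
    return roc_auc(ys, ss)
```

Comparing `subgroup_auc` per identity against the overall `roc_auc` gives a simple per-group bias profile of a toxicity classifier.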

G2

Self-Supervised CLIP Fine-Tuning with Medical Image-Text Data

Contrastive Language-Image Pre-Training (CLIP) is a multimodal model that connects images and text through large-scale pre-training, and has been widely acclaimed since its release. Fine-tuning the pre-trained base model for image classification has attracted increasing interest, especially in the medical domain. Generally, for this purpose, CLIP has been fine-tuned with image-label pairs. However, labeled data may not always be available. In this paper, we fine-tuned one CLIP model instance with medical image-text data and another with medical image-label data, and measured how each affected performance relative to the base model. Our results showed that fine-tuning with image-text data provided a significant performance increase over the base model. As expected, fine-tuning with image-label data performed even better, but the increase provided by image-text fine-tuning is not negligible and is worth considering when working with unlabeled data.
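CLIP's pre-training (and contrastive fine-tuning on image-text pairs) optimizes a symmetric InfoNCE loss over the batch's image-text similarity matrix, with matching pairs on the diagonal. A pure-Python sketch of that loss (an illustration of the standard objective, not the group's training code):

```python
import math

def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss over an NxN similarity matrix:
    sim[i][j] is the cosine similarity of image i and text j,
    and the diagonal holds the matching image-text pairs."""
    n = len(sim)

    def cross_entropy(rows):
        # cross-entropy with the diagonal entry as the correct class
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    image_to_text = cross_entropy(sim)
    text_to_image = cross_entropy([list(col) for col in zip(*sim)])
    return 0.5 * (image_to_text + text_to_image)
```

The loss is small when each image is most similar to its own caption, and large when matching pairs are ranked below mismatched ones.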

G3

Fine-tuning DistilBERT for South Park Character Classification

In this study, we evaluate four methods for fine-tuning a DistilBERT model on a multi-class classification task with a relatively small dataset: introducing class weights when calculating the loss, hyperparameter tuning, layer-wise learning rate decay, and classification with alternating normalization. Using the macro F1-score as the evaluation metric, we found that none of these methods led to significant improvements on the multi-class classification task.
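The first method above, class weighting, typically scales each class's loss contribution by its inverse frequency, mirroring scikit-learn's "balanced" heuristic `n_samples / (n_classes * count_c)`. A stdlib sketch of that weighting (illustrative character labels, not the group's data):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights w_c = N / (K * n_c), so that rare classes
    contribute proportionally more to a weighted cross-entropy loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}
```

These weights would then be passed to the loss function (e.g. the `weight` argument of a cross-entropy loss) so under-represented characters are not drowned out by frequent ones.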

G4

Cross-Lingual Adaptation of Coreference Resolution: Fine-Tuning English Models for Swedish

Coreference resolution is a well-studied subfield of Natural Language Processing that involves clustering expressions referring to the same entity within a text. Most models are trained and evaluated on English, and their ability to generalize to other languages, such as Swedish, is uncertain. This project explores whether an English-trained coreference resolution model can be effectively adapted to Swedish using parameter-efficient fine-tuning methods. The goal is to improve performance on Swedish while maintaining or even enhancing the model’s original English capabilities, investigating potential cross-lingual transfer effects.

G5

Exploring Prompt Engineering for Few-Shot Text Summarization

Text summarization is a common task for Large Language Models, and the effects of prompting techniques on these models have been explored in previous research. One such technique is few-shot prompting, where the model is given a few examples of the task at hand, as opposed to zero-shot prompting, where the model is immediately asked to solve the given task. This paper explores the effects of zero-shot and few-shot prompting on the ability of different models, such as Llama, to summarize a given input text. The datasets are CNN/DailyMail and Reddit TIFU, consisting of news articles and informal stories, respectively; each pairs texts with human-written summaries, which are used to evaluate model output. More specifically, zero-shot and three-shot prompting, where the input includes three different examples of text and summary, are used to generate summaries of a text. The results are evaluated using the ROUGE metric and through human evaluation via an online form.
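A three-shot prompt of the kind described is usually assembled by prepending example text-summary pairs before the target text. A minimal sketch of such a prompt builder (hypothetical template wording, not the group's exact prompt):

```python
def build_few_shot_prompt(examples, target_text):
    """Assemble a k-shot summarization prompt: each example pairs a
    text with its reference summary; the target is left open for the
    model to complete. An empty examples list gives zero-shot."""
    parts = ["Summarize the following text.\n"]
    for text, summary in examples:
        parts.append(f"Text: {text}\nSummary: {summary}\n")
    parts.append(f"Text: {target_text}\nSummary:")
    return "\n".join(parts)
```

With three (text, summary) pairs this yields the three-shot setting; passing an empty list recovers the zero-shot baseline.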

G6

Using NER for Confidential Text Classification

The automatic anonymization of text has become an important task for natural language processing, as the amount of stored private data has grown over the last decades and institutions must comply with privacy regulations such as GDPR. Named-entity recognition (NER) models have been used for token classification tasks to decide which entity a word belongs to. In this paper, we explore how pre-processing text with NER to mark entities can help models classify words as confidential or not. First, a BERT model was pre-trained on CoNLL to learn entity relationships; we then fine-tuned the model on the TAB benchmark. A binary confidentiality classifier was then trained using techniques to incorporate the entities into the text. Our results showed very little difference between replacing or including entities in the text and not pre-processing the text at all for the classification of confidential information.
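The "replacing entities" variant of the pre-processing compared above can be sketched as substituting NER-tagged spans with placeholder tokens. A stdlib illustration assuming standard BIO tags (hypothetical placeholders, not the group's implementation):

```python
def replace_entities(tokens, bio_tags):
    """Replace NER-tagged spans with placeholders such as [PER],
    collapsing each B-/I- span into a single placeholder token."""
    out = []
    for token, tag in zip(tokens, bio_tags):
        if tag == "O":
            out.append(token)
        elif tag.startswith("B-"):
            out.append(f"[{tag[2:]}]")
        # I- tokens continue the previous entity span and are dropped
    return out
```

The "including entities" variant would instead keep the original token and append the placeholder next to it, letting the classifier see both.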

G7

Beyond BM25: A Dense Retrieval Approach Using Sentence-BERT and FAISS

We present a semester project in NLP focused on passage retrieval using Sentence-BERT (sBERT) embeddings combined with FAISS indexing. Using the MS MARCO passage retrieval dataset, we evaluated a dense retrieval system, comparing its performance against a sparse retrieval baseline system (BM25). Our system targets efficient retrieval of relevant passages, employing sBERT for semantic embeddings and FAISS for scalable nearest-neighbor search using cosine similarity. We evaluated the quality of the ranking primarily using MRR@100, along with other ranking quality metrics. We compared different embedding models, all-MiniLM-L6-v2 and msmarco-MiniLM-L6-v3, using FAISS IndexFlatIP. The results show that msmarco-MiniLM-L6-v3 outperforms all-MiniLM-L6-v2, achieving an MRR@100 score of 0.3291 compared to 0.3149, demonstrating the advantage of using a fine-tuned embedding model. Both models significantly surpass the traditional BM25 baseline, which achieves an MRR@100 score of 0.167 on the same dataset. These findings show the effectiveness of dense retrieval methods over traditional lexical retrieval and demonstrate the impact of fine-tuning in improving passage ranking performance.
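FAISS `IndexFlatIP` performs exhaustive inner-product search, which equals cosine-similarity search when embeddings are L2-normalized first. A dependency-free sketch of that brute-force behaviour (illustrative toy vectors, not the group's pipeline):

```python
import math

def normalize(v):
    """L2-normalize a vector so that an inner product between two
    normalized vectors equals their cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, passages, k=2):
    """Exhaustive inner-product search over normalized embeddings,
    the brute-force behaviour of FAISS IndexFlatIP; returns the k
    best (passage_id, score) pairs, highest score first."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(p))))
        for i, p in enumerate(passages)
    ]
    return sorted(scores, key=lambda t: -t[1])[:k]
```

In the real system, the query would be an sBERT embedding and the passage matrix would be added to the FAISS index once, with the ranked IDs feeding metrics such as MRR@100.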