Course conference 2026
Schedule
Date and location: Wednesday, 2026-03-18, 13:15–16:45, Alan Turing
Instructions for the presenters
| Slot | Time | Group |
|---|---|---|
| 01 | 13:15–13:30 | G11 |
| 02 | 13:30–13:45 | G07 |
| 03 | 13:45–14:00 | G09 |
| – | 14:00–14:15 | Break |
| 04 | 14:15–14:30 | G05 |
| 05 | 14:30–14:45 | G03 |
| 06 | 14:45–15:00 | G01 |
| – | 15:00–15:15 | Break |
| 07 | 15:15–15:30 | G10 |
| 08 | 15:30–15:45 | G08 |
| 09 | 15:45–16:00 | G06 |
| – | 16:00–16:15 | Break |
| 10 | 16:15–16:30 | G04 |
| 11 | 16:30–16:45 | G02 |
Abstracts
G01
Automatic Text Anonymization using NLP
As organizations collect growing amounts of text data from emails, documents, and customer feedback, protecting personally identifiable information (PII) has become a critical challenge. Privacy regulations like GDPR require sensitive data to be anonymized, but doing this manually is slow, expensive, and simply not practical for large datasets. This project addresses that problem by building an automated text anonymization system using natural language processing. Our approach combines two techniques: a BERT-based Named Entity Recognition model (dslim/bert-base-NER) for detecting names, locations, and organizations, and regex pattern matching for identifying emails and dates. When the system finds sensitive information, it replaces it with standardized tags like [NAME], [LOCATION], or [EMAIL]. For example, “John Smith lives in Berlin” becomes “[NAME] lives in [LOCATION].” We built the system using Python with PyTorch and HuggingFace Transformers, and tested it on CoNLL-2003 style data. The results were promising: our system achieved 95% precision, 93% recall, and 94% F1-score. When compared to a spaCy baseline, our approach performed 13% better across all metrics. These findings show that combining deep learning with simple rule-based methods can provide an effective and reliable solution for automatic text anonymization.
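The rule-based half of the pipeline can be illustrated with a minimal sketch. The patterns and the `anonymize` helper below are hypothetical stand-ins, not the group's actual code, and the NER stage (dslim/bert-base-NER) is assumed to handle names, locations, and organizations separately:

```python
import re

# Hypothetical patterns in the spirit of the abstract's regex stage;
# the BERT NER model covers [NAME], [LOCATION], [ORGANIZATION] elsewhere.
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[DATE]": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def anonymize(text: str) -> str:
    """Replace every matched span with its standardized tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(tag, text)
    return text
```

A date pattern like `\d{4}-\d{2}-\d{2}` only covers ISO-style dates; a production system would need more formats.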
G02
Algorithmic Bias in Automated Recruitment
Since LLMs are trained on human-generated data, they often inherit human biases. However, identifying these biases in model outputs allows AI to be used as a tool to uncover and explain bias in training data. This project examines gender bias in hiring through a two-stage transformer pipeline for job applications. The workflow includes an initial hiring prediction, followed by Personally Identifiable Information (PII) anonymization to remove protected attributes using a DistilBERT-based Named Entity Recognition (NER) model. The anonymized applications are then reclassified using a DistilRoBERTa-base classifier. Results show that the words men and women use in resumes influence hiring outcomes differently. A Bag-of-Words classifier further revealed embedded bias, where professional terms such as "security" or "data" had significantly different predictive weights depending on the applicant's gender.
G03
Fine-Tuning Qwen2.5-VL for Ancient Tamil Inscriptions
Ancient stone inscriptions written in scripts such as Tamili and Vattezhuthu are an important source of historical and linguistic knowledge. However, reading and interpreting these inscriptions is difficult because the text is often damaged, unevenly carved on stone surfaces, and written in scripts that are no longer in common use. In addition, very few labeled datasets are available for training modern machine learning models. This project explores how recent vision-language models can be used to automatically recognize and interpret these inscriptions. The proposed system uses the transformer-based vision-language model Qwen2.5-VL (7B) to extract text from images of inscriptions. Because real inscription datasets are limited, synthetic images are generated by rendering text and placing it on stone-like backgrounds. To ensure realism, we use the "Aksharamukha" Python package to convert inscription text into script images, and the "Augraphy" library together with standard pre-processing to composite the rendered text onto stone backgrounds. The Qwen2.5-VL (7B) model is then fine-tuned on our custom dataset with LoRA, a parameter-efficient method, and a Qwen2.5 (0.5B) model converts the recognized characters into human-readable form. After characters and words are recognized, additional language processing steps reconstruct sentences and convert the text into modern Tamil. The system is evaluated using the BLEU and ROUGE metrics. The overall goal of this work is to investigate how modern NLP techniques can support the digital preservation and interpretation of historical inscriptions.
G04
Analyzing the Effect of Evidence Quality in a Science-Domain RAG System
In this project, we conducted a controlled study to evaluate the impact of hyperparameters and document quality on a RAG system's performance using the SciQ dataset. We set up a standard RAG baseline with fixed parameters: a chunk size of 256 tokens, a retrieval volume of top-k = 3, and the all-MiniLM-L6-v2 embedding model. To isolate the system's sensitivity to context quality, we compared this baseline against three configurations: LLM-only, oracle gold support, and corrupted documents. Our results demonstrated that RAG performance is constrained by evidence quality: the baseline fell between the oracle upper bound and the corrupted lower bound, and the gap between the oracle and the baseline pointed to an inherent error in the retrieval mechanism. We conclude that the exact answer sentence is the main driver of the RAG system's effectiveness, since removing it causes performance to drop almost to the LLM-only baseline level.
G05
Fingerprinting Large Language Models
This study investigates whether Large Language Models (LLMs) possess language-agnostic fingerprints: unique stylistic signatures that persist across different languages. Using a BERT-based classifier trained on English text with Low-Rank Adaptation (LoRA) fine-tuning on the WildChat-50m dataset, we analyze outputs from three models: Ministral 8B, Gemma2 9B, and DeepSeek Coder V2. We find that Ministral struggles with alternate alphabets, while DeepSeek handles them most effectively; Gemma has the best overall performance of the three. Bigram analysis suggests that Gemma outperforms Ministral and DeepSeek because of its larger bigram overlap between training and test data. Apart from some exceptions, the models show clear fingerprints, even in languages the classifier has not been trained on.
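The LoRA update at the heart of the fine-tuning can be written out numerically. This is a minimal sketch of the weight arithmetic only (W + (alpha/r) * B @ A), with toy matrices rather than the classifier's real weights, and naive pure-Python matrix multiplication for clarity:

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def lora_weight(W, A, B, alpha, r):
    """Effective weight after a LoRA update: W + (alpha / r) * B @ A.

    A is (r x d_in), B is (d_out x r); only A and B are trained,
    so the update touches far fewer parameters than W itself.
    """
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

With rank r much smaller than the matrix dimensions, A and B together hold only r * (d_in + d_out) trainable values, which is why LoRA fits on modest hardware.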
G06
Evaluating Information Relevance in Retrieval-Augmented Generation
LLMs exhibit high linguistic fluency but often struggle to provide up-to-date, relevant information, particularly in news reporting. To overcome these factual limitations, Retrieval-Augmented Generation (RAG) has emerged as a framework for anchoring generated text in verified data. Our study evaluates several systems of increasing complexity: a basic semantic vector search, a standard RAG pipeline, and a RAG pipeline extended with Chain-of-Thought (CoT) prompting. The results are evaluated using the RAGAS framework, focusing on the faithfulness and answer relevancy metrics.
G07
Large Language Models as Therapeutic Conversational Agents
Large Language Models (LLMs) are becoming more prevalent in digital mental health interventions, for example as therapy chatbots. However, the precise influence of conversational memory and external clinical grounding on the accuracy of symptom tracking is not fully understood. This project assesses four prompt pipelines for a therapeutic conversational agent: Baseline (system prompt only), Context-only (includes patient journal history), RAG-only (includes Retrieval-Augmented Generation over a Cognitive Behavioural Therapy knowledge base), and a combination of both. To benchmark these configurations, the study conducts a clinical assessment through the evaluation of PHQ-8 scores on the DAIC-WOZ dataset. In parallel, the conversational quality and therapeutic alignment of the models' responses are examined using a dual-evaluation methodology that combines human assessment with an automated LLM-as-a-judge framework. This dialogue evaluation scores the pipelines across seven key metrics, including guidance, understanding, empathy, and safety, designed to measure both cognitive support and affective resonance. By systematically comparing how each prompt pipeline influences clinical assessment accuracy and conversational engagement, this research isolates the technical mechanisms necessary to balance evidence-based therapeutic fidelity with highly contextualized empathy.
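For readers unfamiliar with the clinical benchmark: the PHQ-8 is eight self-report items, each scored 0-3, summed to a 0-24 total with standard severity bands. A minimal scoring sketch (the function names are ours, not the project's):

```python
def phq8_total(item_scores):
    """Sum eight PHQ-8 item scores (each 0-3) into a 0-24 total."""
    assert len(item_scores) == 8, "PHQ-8 has exactly eight items"
    assert all(0 <= s <= 3 for s in item_scores), "each item is scored 0-3"
    return sum(item_scores)

def phq8_severity(total):
    """Commonly used PHQ-8 severity bands."""
    if total >= 20:
        return "severe"
    if total >= 15:
        return "moderately severe"
    if total >= 10:
        return "moderate"
    if total >= 5:
        return "mild"
    return "none-minimal"
```

The study's accuracy question is then whether each pipeline's estimated totals land in the same band as the DAIC-WOZ ground truth.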
G08
Improving Mathematical Problem Solving through Post-Training
This study investigates to what extent a pretrained language model, Qwen3-1.7B-Base, can improve its performance on mathematical problem-solving tasks through post-training. Starting from the base model, we train two independent branches: a supervised fine-tuning (SFT) model utilizing reasoning trajectories distilled from a larger model, and a reinforcement learning (RL) model optimized via Group Relative Policy Optimization (GRPO). Both models are trained on grade-school-level math questions and evaluated on accuracy, with the goal of determining whether these methods enhance the model's reasoning ability when solving math questions.
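The distinguishing step of GRPO is that it standardizes rewards within a group of completions sampled for the same question, rather than learning a separate value model. A minimal sketch of that advantage computation, under the assumption of scalar per-completion rewards:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero
    when all rewards in the group are identical).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary correctness reward, a group where two of four samples are right yields advantages near +1 for the correct answers and -1 for the wrong ones, which is the learning signal fed to the policy update.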
G09
Large Language Model Brain Rot
Large Language Models (LLMs) are vulnerable to data poisoning during fine-tuning, which is a safety and reliability concern. This paper investigates the hypothesis of "LLM brain rot" by examining the cognitive decay and ethical degradation of Gemma 3 4B when exposed to a fake news dataset. Our adaptation is made parameter-efficient with the use of Low-Rank Adaptation (LoRA), accommodating hardware constraints and allowing for the use of a more capable base model than would otherwise have been possible. The base and adapted models are evaluated using a custom generative framework across the established benchmarks ARC Easy, ARC Challenge, HellaSwag, and Winogrande. Gemma 3's multi-modal architecture (text and images) caused compatibility issues with standard evaluation libraries, warranting a custom regex-based parsing method for assessing the model's output. Our findings aim to quantify the degradation of logical reasoning, factual recall, and common sense caused by exposure to fabricated data.
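A regex-based parser for multiple-choice benchmark output might look like the sketch below. The pattern and function are hypothetical illustrations of the approach, not the group's actual evaluation code; taking the last match reflects the common case where a model reasons first and states its answer at the end:

```python
import re

# Matches phrasings like "answer is (B)", "Answer: C", or "answer B".
ANSWER_RE = re.compile(
    r"\banswer(?:\s+is)?[:\s]*\(?([A-D])\)?\b",
    re.IGNORECASE,
)

def parse_choice(output: str):
    """Extract the final answer letter (A-D) from free-form model output."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].upper() if matches else None
```

Outputs with no recognizable answer return `None` and can be scored as incorrect or excluded, a choice that itself affects the measured benchmark numbers.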
G10
CEFR-Oriented Controllable Sentence Simplification in Swedish Using LLM-Generated Paraphrases
Text simplification can improve accessibility for second-language learners, but controllable simplification remains underexplored for low-resource languages such as Swedish. In this project, we investigate Swedish text simplification with explicit control over the target CEFR level. We first evaluate the simplification quality of Large Language Models under different model and prompting settings, and use their outputs to construct a CEFR-oriented simplification dataset. This dataset is then used to train and evaluate controllable simplification models. We explore multiple approaches, including tuning encoder-decoder models such as BART and decoder-only models such as Llama 3.2. To assess performance, we use automatic evaluation metrics including SARI, BERTScore, and LIX, together with a CEFR-level classifier to measure alignment with the desired level. Our study examines how well different models can generate simplified Swedish text for a lower CEFR level after fine-tuning on LLM-generated data, and provides a practical low-cost approach to text simplification for low-resource languages.
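Of the metrics listed, LIX is simple enough to show in full: words per sentence plus the percentage of words longer than six characters. The tokenization below is a simplification we chose for illustration, not the project's exact implementation:

```python
import re

def lix(text: str) -> float:
    """LIX readability: words per sentence + percentage of long words (>6 chars)."""
    words = re.findall(r"[^\W\d_]+", text)           # letter runs only
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / sentences + 100 * long_words / len(words)
```

Lower LIX means easier text, so a successful CEFR-controlled simplification should push LIX down while SARI and BERTScore confirm that meaning is preserved.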
G11
Making NanoGPT Run Fast on Google TPU v6e
NanoGPT is a minimal GPT-2 implementation by Andrej Karpathy that inspired a community speedrun challenge: reaching a validation loss of 3.28 as fast as possible on 8×H100 GPUs, with a current record of 1.435 minutes. In this work, we port and optimize this training pipeline in JAX for Google TPU v6e hardware, investigating how this architecture competes on a well-defined benchmark. Our implementation incorporates several algorithmic improvements including rotary positional embeddings (RoPE), the Muon optimizer with layer-wise learning rate decay for transformer weights, with Adam used for token embeddings and the language model head each with a warmdown schedule, logit soft-capping based on the Gemma 2 paper, and architecture dimensions specifically tuned to TPU v6e’s hardware constraints. Preliminary results show we achieve a validation loss of ≤ 3.28 in 24 minutes on 8×TPU v6e.