Back-translation is a technique for producing synthetic training data. For less common languages it can be difficult to find bilingual corpora large enough to develop machine translation models, so the ability to create synthetic data can improve machine translation results where bilingual data is scarce. This project investigates whether iterative back-translation improves machine translation models to different extents depending on how closely the source and target languages are genetically related. Our source language is Swedish and our target languages are Norwegian, Finnish, and Sami. We trained transformer models from OpenNMT to implement our back-translation system. To create synthetic data, two models were first trained on a fraction of the bilingual corpus. Data from the corpus that was not used in this training was then translated to Swedish to generate our first set of synthetic bilingual data. This synthetic data was combined with the previously used bilingual data to train a model from Swedish to the target language. The generation of synthetic data was then performed once more to train a final model from Swedish to the target language. Based on our results, back-translation improves performance in terms of BLEU scores for all of our corpora.
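For illustration, the following is a minimal sketch of one round of the back-translation procedure described above; the helper names (train_fn, seed_bitext, monolingual_target) are hypothetical placeholders rather than the actual OpenNMT interface used in the project.

```python
# Sketch of one round of iterative back-translation (target -> Swedish -> target).
# `train_fn` is a hypothetical stand-in for the OpenNMT training call and is
# assumed to return a sentence-level translation function.
from typing import Callable, List, Tuple

ParallelCorpus = List[Tuple[str, str]]  # (source sentence, target sentence)

def back_translation_round(
    seed_bitext: ParallelCorpus,        # small trusted Swedish-target bitext
    monolingual_target: List[str],      # target-language sentences without translations
    train_fn: Callable[[ParallelCorpus], Callable[[str], str]],
) -> Callable[[str], str]:
    # 1. Train a reverse model (target -> Swedish) on the seed bitext.
    reverse_bitext = [(tgt, src) for src, tgt in seed_bitext]
    reverse_model = train_fn(reverse_bitext)

    # 2. Back-translate monolingual target sentences into synthetic Swedish.
    synthetic_bitext = [(reverse_model(tgt), tgt) for tgt in monolingual_target]

    # 3. Train the forward model (Swedish -> target) on real + synthetic data.
    return train_fn(seed_bitext + synthetic_bitext)
```

Running this step a second time with the improved forward model corresponds to the second generation of synthetic data described in the abstract.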
Beams are all you need - Evaluating Beam Search for Dependency Parsing
This project aimed to improve the baseline dependency parser using beam search. The main methods explored were beam search and beam search with error states. The experiments showed that beam search improved the prediction accuracy, as suggested by Vaswani and Sagae (2016); however, error states did not show the same improvements reported by Vaswani and Sagae (2016). Possible reasons for this are differences in the features used, the baseline accuracy, and the datasets. Other improvements were also explored but not fully implemented, such as using a globally normalized model and best-first beam search. The globally normalized model is intended to improve beam search, which suffers from locality in the model scoring, as suggested by Andor et al. (2016). Best-first beam search is an optimized beam search algorithm, based on the A* algorithm, that speeds up predictions.
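As a rough illustration of the decoding strategy evaluated here, the sketch below shows generic beam search over parser transitions; the state representation, transition scoring, and helper names are assumptions for illustration, not the group's actual implementation.

```python
# Minimal sketch of beam search over parser transitions. `next_states` is a
# hypothetical helper that returns scored successor states; in the real parser
# these scores come from a neural network over extracted features.
from typing import Callable, Sequence, Tuple

def beam_search(
    initial_state,
    is_final: Callable[[object], bool],
    next_states: Callable[[object], Sequence[Tuple[object, float]]],  # (state, log-prob)
    beam_width: int = 4,
):
    # Each beam item is (accumulated log-probability, list of states on the path).
    beam = [(0.0, [initial_state])]
    while not all(is_final(path[-1]) for _, path in beam):
        candidates = []
        for score, path in beam:
            state = path[-1]
            if is_final(state):
                candidates.append((score, path))  # keep finished hypotheses
                continue
            for new_state, log_p in next_states(state):
                candidates.append((score + log_p, path + [new_state]))
        # Keep only the top-k hypotheses by accumulated score.
        beam = sorted(candidates, key=lambda item: item[0], reverse=True)[:beam_width]
    return beam[0]  # highest-scoring complete derivation
```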
Enhancing Dependency Parsing: Insights from Non-Projective Parsing, Beam Search, and Error States
In this project, we explore the improvement of a dependency parsing model through three different modifications: support for non-projective parsing, beam search as the decoding algorithm, and the introduction of error states. Introducing support for non-projective parsing led to increased UAS, especially for datasets with a high proportion of non-projective arcs, which is in line with previous research in the area. Furthermore, using beam search as the decoding algorithm gave significant improvements compared to the greedy algorithm. However, contrary to previous research in the area, the introduction of error states did not improve the performance of our model. This work highlights the performance benefits of beam search and of handling non-projective data, while calling into question the benefit of introducing error states. More specifically, our work indicates that more research is necessary to verify whether the introduction of error states works for a wide range of systems.
Non-Projective Dependency Parsing Using a Swap Operation
Arc-standard transition-based dependency parsing has been a common approach to dependency parsing for many years. The method generally uses three operations to compute dependencies: Left Arc (LA), Right Arc (RA), and Shift (SH). One major drawback of this method is that it is not applicable to non-projective dependency trees. This study explores an extension of the algorithm that includes an additional Swap (SW) operation, allowing the processing of non-projective dependency trees. The study resulted in unlabeled attachment scores (UAS) of 66.0% and 59.4% for English and Swedish respectively. This is similar to the performance of the equivalent implementation without the Swap operation (using projectivized data), which reached UAS of 66.0% and 60.9% respectively.
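For reference, a schematic version of the transition system with the added Swap operation might look as follows; the configuration representation (stack, buffer, arcs) is an assumption for illustration and omits dependency labels and transition preconditions.

```python
# Sketch of arc-standard transitions extended with SWAP, operating on a
# configuration (stack, buffer, arcs). Items are token positions; this is a
# schematic transition system, not the trained parser itself.

def shift(stack, buffer, arcs):
    # move the first buffer token onto the stack
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs):
    # second-topmost stack item becomes a dependent of the topmost
    head, dep = stack[-1], stack[-2]
    return stack[:-2] + [head], buffer, arcs + [(head, dep)]

def right_arc(stack, buffer, arcs):
    # topmost stack item becomes a dependent of the second-topmost
    head, dep = stack[-2], stack[-1]
    return stack[:-1], buffer, arcs + [(head, dep)]

def swap(stack, buffer, arcs):
    # move the second-topmost stack item back onto the buffer, reordering
    # tokens so that non-projective arcs can be built later
    return stack[:-2] + [stack[-1]], [stack[-2]] + buffer, arcs
```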
The goal of our project was to evaluate the performance of various language models in determining the sentiment of Steam reviews. Steam reviews were chosen as an interesting dataset to study because they often contain humor, sarcasm, and community-specific language. We tried two models, one based on BERT and one based on recurrent neural networks (RNN). Our dataset contains about 6.4 million Steam reviews, from which we created subsets for training and evaluation. Baseline models from Kaggle were used and modified to our specific needs. Due to hardware constraints on laptops, desktops, and Google Colab we used a maximum subset size of 50k reviews. For the RNN model, the training accuracy steadily increased to a maximum of about 98% over 10 epochs on the full 50k. However, the validation accuracy did not show the same improvement, staying at around 87% for most epochs. For the BERT model, using the same dataset size, we only reached about 39% accuracy. In conclusion, BERT models might be more powerful and yield better results when large datasets and expensive hardware are available, whereas RNN models might yield better results when access to hardware is constrained and smaller datasets are used.
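As a rough sketch of the RNN side of this comparison, an LSTM-based review classifier could look like the following; the architecture, class name, and dimensions are illustrative assumptions, not the exact Kaggle baseline the group adapted.

```python
# Minimal sketch of an RNN-style binary sentiment classifier for reviews.
import torch
import torch.nn as nn

class ReviewRNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, 2)   # recommended / not recommended

    def forward(self, token_ids):                # (batch, seq_len) token ids
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)     # final hidden state summarises the review
        return self.output(hidden[-1])           # logits over the two classes
```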
The aim of this project was to evaluate beam search in syntactic parsing. We established a baseline parser that did not use beam search and then created a parser that applied beam search on top of the trained parsing model. After building the beam search parser, we tested and evaluated a number of parameters that could affect its results. We experimented with adding new features, different seeds, a range of beam widths, and different probabilities for the error states in the beam search parser. The results showed that a model using error states and beam search can increase the unlabeled attachment score (UAS) compared to a model that does not use them. Through experimentation we found that the choice of features and the probability of the error states can have a considerable effect on the resulting scores. Our findings demonstrate that employing beam search with an optimal beam width leads to a notable improvement in the unlabeled attachment score, with an approximate increase of 1 percent.
Enhancing Dependency Parsing with Beam Search and Error State Classification
During this project, we explore an approach to enhancing dependency parsing models by integrating beam search with the error state classification introduced by Vaswani and Sagae. By extending the baseline project with beam search to explore multiple parsing paths simultaneously, the model can capture a broader range of syntactic structures and linguistic phenomena. Additionally, error state classification is implemented in the training phase, where parser states deviating from the gold-standard derivation are labeled as error states. The integration of these additional samples aims to enrich the training data and guide the model towards more accurate parsing decisions. To evaluate the implementation, experiments on beam width, seeds, features, and error states were carried out on five different datasets. Our experiments showed a consistent trend of improved UAS over the baseline when integrating beam search across languages. The extension with error states had varying outcomes: it improved the result on the Swedish dataset, while all other datasets obtained lower UAS. Moreover, experimentation revealed an optimal beam width around 6 or 7, and different seeds showed a variance of up to 3 percent. Furthermore, varying the number of features highlighted different optimal values for different datasets.
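To illustrate the idea of error states, the sketch below shows one way off-gold parser states could be added to the training data under a special error label; the helper functions and the branching probability are assumptions for illustration, not the group's exact procedure.

```python
# Sketch of collecting training states with error states: gold derivations are
# followed as usual, while states reached by deliberately taking a wrong
# transition are labelled with a special ERROR class. `oracle_transition`,
# `legal_transitions` and `apply` are hypothetical helpers around a
# transition system.
import random

ERROR = "ERROR"

def collect_training_states(initial_state, is_final, oracle_transition,
                            legal_transitions, apply, error_prob=0.1):
    samples = []                        # (state, label) pairs for the classifier
    state = initial_state
    while not is_final(state):
        gold = oracle_transition(state)
        samples.append((state, gold))
        # occasionally branch off the gold path to generate an error state
        if random.random() < error_prob:
            wrong = [t for t in legal_transitions(state) if t != gold]
            if wrong:
                off_state = apply(state, random.choice(wrong))
                samples.append((off_state, ERROR))
        state = apply(state, gold)      # continue along the gold derivation
    return samples
```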
Parsing Accuracy: Integrating Beam Search and Error States
This project explores improvements to parsing methodology by replacing the conventional greedy search algorithm in a parser with a beam search approach. Taking inspiration from the work of Vaswani and Sagae (2016) in "Efficient Structured Inference for Transition-Based Parsing with Neural Networks and Error States," we adapted their model to integrate a beam search mechanism. The modified parser was trained and evaluated on two benchmarks: the Universal Dependencies English Web Treebank and the German Hamburg Dependency Treebank. Our adaptation resulted in an improvement, increasing the unlabeled attachment score by 1% compared to the baseline model. Further, we tried to implement Vaswani and Sagae's idea of using error states for greater improvements, but the integration of these states did not give the anticipated improvements in our model. Our findings suggest that error states do not universally guarantee performance improvements across all datasets, as factors such as dataset size, feature set, and beam size may have a more significant impact.
Evaluating a Beam Search Tagger and Parser on Different Dataset Sizes and using Fine-Tuning
We have implemented the beam search algorithm, used in conjunction with a feed-forward network for dependency parsing as well as part-of-speech tagging. The implementation was evaluated on four language datasets from the Universal Dependencies treebanks: English, Swedish, Persian, and Chinese. Hyperparameter tuning was performed through a set of tests on beam width, learning rate, and the number of epochs used in the model. We then evaluated the model with two sets of tests. The first test evaluated how the amount of training data impacts accuracy when using beam search, whilst the second test investigated how well the model performed when pre-trained on one language and later fine-tuned with 100 or 500 sentences of another language. Our tests on training-set size showed that training on 2000 sentences yields around 95% of the accuracy obtained with the full training dataset. Consequently, a training dataset of 2000 sentences can in general be deemed sufficient. The fine-tuned model showed better accuracy than a model trained only on the target language with the same training-set size.
Error states and early updates for beam search training and inference in dependency parsers
This project is an extension of the standard project, in which we replaced the greedy search in the baseline with a beam search. To produce scores suitable for beam search and global scoring we tried two different approaches. The first was to introduce error states in the local training, as suggested by Vaswani and Sagae (2016). The second was to introduce early updates, as suggested by Andor et al. (2016). Without any other modifications, we found that with a beam width of 2 both approaches obtained the same UAS. However, as the beam width grows, performance decreases. This divergent outcome, initially counterintuitive, might be attributed to the early updates' ability to mitigate the model's simplicity by adapting to the increased decision complexity introduced by larger beam widths, which the error states cannot capture effectively. Furthermore, we tried adding additional features and observed a slight increase in performance when going two steps further into the stack. In the final evaluation we obtained a UAS of 67.5% for an English dataset using error states, an increase of 1.5 percentage points compared to our baseline, while accuracy stayed the same. For an Italian dataset using error states, both UAS and accuracy stayed the same.
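The early-update scheme can be sketched as follows; `decode_step` and `update` are hypothetical stand-ins for the group's beam expansion and parameter update, and the beam is assumed to hold (score, path) pairs.

```python
# Schematic early update during beam-search training: decoding is stopped and
# the model updated as soon as the gold partial derivation falls out of the beam.

def train_sentence_with_early_update(beam, gold_prefixes, decode_step, update):
    for gold in gold_prefixes:                         # gold partial derivation per step
        beam = decode_step(beam)                       # expand and prune to the beam width
        best = max(beam, key=lambda item: item[0])[1]  # highest-scoring partial derivation
        if gold not in [path for _, path in beam]:
            # The gold hypothesis fell out of the beam: update the model here and
            # stop decoding the rest of the sentence (the "early" update).
            update(gold, best)
            return
    update(gold_prefixes[-1], best)                    # gold survived: standard final update
```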
Locally normalized beam search in a dependency parser
In this project we analyzed how a locally normalized beam search algorithm affects a dependency parser. Beam search generally causes the parser to consider multiple alternatives at once instead of simply taking the best choice at every step. We analyzed how the beam size affects the results, as well as how different languages are affected. Our ambition was to check whether there were any connections between the differences in the simple locally normalized version and the more complex but better solutions presented in the research literature. Unfortunately, we did not complete those comparisons, so our results are as follows. Firstly, the benefit of increasing the beam size quickly plateaus, and increasing it further offers no benefit. Secondly, while different languages do give different accuracies, these differences are due to the baseline performing worse on some languages.
Exploring performance of multilingual tagger-parser implementations
In this project, we extended the arc-standard parser baseline by introducing an arc-hybrid parser and integrating a dynamic oracle alongside the existing static oracle setup. Our objective was to explore the effectiveness of these enhancements in improving deterministic dependency parsing performance. Drawing from Goldberg & Nivre’s article “Training Deterministic Parsers with Non-Deterministic Oracles”, which suggests that the arc-hybrid parser should outperform the arc-standard parser, we sought to validate this claim empirically. Our experiments revealed a marginal 1% improvement in attachment score, consistent with Goldberg & Nivre’s findings. However, performance discrepancies were observed across different treebanks. Interestingly, while Goldberg & Nivre anticipated the dynamic oracle’s superiority over the static oracle in arc-hybrid parsing, our empirical results did not align with this expectation. Despite efforts to optimize the dynamic oracle, including implementing exploration parameters, the dynamic oracle consistently underperformed compared to the static counterpart. In conclusion, our findings suggest that, among the parsing strategies explored, the arc-hybrid parser in combination with a static oracle produced the most favorable attachment scores across various datasets. However, the disparity between theoretical expectations and empirical results underscores the complexity of optimizing dynamic oracles and the need for further research in this area.
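For context, a dynamic-oracle training step with an exploration parameter, in the spirit of Goldberg & Nivre, might be sketched as follows; the helper functions and the exploration probability are illustrative assumptions around an arc-hybrid transition system, not the group's exact implementation.

```python
# Sketch of one dynamic-oracle training step with exploration: the parser
# sometimes follows its own (possibly wrong) prediction and learns from the
# zero-cost transitions of whatever state it actually reaches.
# `zero_cost_transitions`, `model_predict` and `apply` are hypothetical helpers.
import random

def dynamic_oracle_training_step(state, gold_tree, model_predict,
                                 zero_cost_transitions, apply,
                                 explore_prob=0.1):
    optimal = zero_cost_transitions(state, gold_tree)  # transitions losing no gold arcs
    predicted = model_predict(state)
    sample = (state, optimal)                          # training target: any zero-cost transition
    # exploration: occasionally follow the model's prediction off the optimal path
    if predicted not in optimal and random.random() < explore_prob:
        next_state = apply(state, predicted)
    else:
        next_state = apply(state, random.choice(list(optimal)))
    return sample, next_state
```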
The presentations of groups G07 and G11 will be given as pre-recorded videos or remotely over Zoom. The groups will be available for questions over Zoom.
G10
BERT vs DistilBERT
We have fine-tuned a BERT and a DistilBERT model for binary sentiment classification on 300k balanced samples from a Steam review dataset. On the Steam dataset, DistilBERT beats BERT by a narrow margin of 0.02 accuracy. Furthermore, we also evaluated the models on two additional binary datasets, IMDb and SST-2. Here, the DistilBERT model again beats BERT, with slightly smaller margins of 0.01 and 0.006 accuracy respectively. To verify that the datasets are different enough for the comparison to provide useful information about the performance difference, we also fine-tuned models on datasets other than Steam and plotted their embeddings. Additionally, to expand our comparison, we include a non-binary dataset: Yelp. On the Yelp dataset with 5 classes (1-5 stars), both models achieve a very poor accuracy of less than 0.1 on 2-4 stars. Both models predict the extremes correctly but fail to catch subtleties in reviews, with BERT outperforming DistilBERT on non-extreme reviews. Finally, to more easily illustrate differences in how the models evaluate sentences, we have also implemented a function to plot attention for both models.
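As an indication of the setup, fine-tuning either model for binary sentiment classification with Hugging Face Transformers can be sketched as below; the checkpoint names are standard public ones, while the toy data and hyperparameters are illustrative rather than the group's exact configuration.

```python
# Minimal sketch of fine-tuning a (Distil)BERT checkpoint for binary sentiment
# classification with Hugging Face Transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"   # or "bert-base-uncased" for the BERT run
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["Great game, well worth the price.", "Refunded after ten minutes."]
labels = torch.tensor([1, 0])            # 1 = recommended, 0 = not recommended

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                       # a few illustrative epochs
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()              # cross-entropy loss computed by the model
    optimizer.step()
    optimizer.zero_grad()
```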
Improving fixed-window Part-of-Speech tagging and arc-standard dependency parsing using three concepts
The project explores three different continuations of a baseline Part-of-Speech tagger and dependency parser: adding an attention mechanism to the fixed-window model, enabling the arc-standard parser to create non-projective dependency trees, and using beam search instead of the greedy approach when predicting arc-standard transitions. To produce non-projective dependency trees, an additional transition type called swap was added. The arc-standard parser with swaps achieved the same performance on non-projective data as the original parser without swaps achieved on data that had been preprocessed to be projective. Adding multi-head self-attention did not increase the accuracy of either the tagger or the parser, which was unexpected. At first, this was assumed to be due to the small feature windows, but drastically altering the window sizes did not significantly affect the impact of the attention mechanism. The addition of beam search inference did not yield improved results: both the use of wider beams and the addition of error states during training decreased the performance of the system, in terms of both accuracy and speed.
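One way the attention extension could be wired into a fixed-window tagger is sketched below; the class name, dimensions, and window size are illustrative assumptions, not the group's actual model.

```python
# Rough sketch of adding multi-head self-attention over the fixed-window
# embeddings before the feed-forward classification layer.
import torch
import torch.nn as nn

class AttentiveWindowTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, window=3, emb_dim=64, heads=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.attention = nn.MultiheadAttention(emb_dim, heads, batch_first=True)
        self.classifier = nn.Linear(window * emb_dim, num_tags)

    def forward(self, window_ids):                 # (batch, window) token ids
        x = self.embedding(window_ids)             # (batch, window, emb_dim)
        x, _ = self.attention(x, x, x)             # self-attention over the window
        return self.classifier(x.flatten(1))       # concatenate and classify
```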
Tiny CamemBERT and CamemBERT - A Comparative Study of Two French BERT-Based Models
This project investigates the CamemBERT model, designed to mitigate linguistic bias in French models, and compares it with a similar model built using the TinyBERT approach. TinyBERT employs state-of-the-art model compression techniques to accelerate inference: it is approximately 7.5 times smaller and 9.4 times faster during inference while maintaining the base model's accuracy, which it achieves through generic and task-specific transformer knowledge distillation (KD). In this project, we implemented such a model, which we refer to as TinyCamemBERT, on a specific French dataset. The models were trained on a subset of the French Wikipedia dataset and fine-tuned for sentiment analysis on a French Twitter dataset. Following our experiments, various results were obtained, particularly in sequence classification. When fine-tuning both CamemBERT and TinyCamemBERT for natural language inference and sentiment analysis, we achieved accuracies exceeding 80%, consistent with the original papers. While our study demonstrates promising results, certain experiments were postponed due to time constraints. One such experiment involves assessing the effectiveness of TinyBERT distillation in scenarios with limited training data; this investigation would be valuable in determining the robustness of the distillation process in resource-constrained environments.
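The prediction-layer part of the knowledge distillation objective can be sketched as follows; TinyBERT additionally distills attention maps and hidden states, which this illustration omits, and the temperature value is an illustrative assumption.

```python
# Sketch of the soft-label part of knowledge distillation: the student is
# trained to match the teacher's temperature-softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between softened teacher and student distributions
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```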
In this project, we have investigated the use of a dynamic oracle for dependency parsing. The project was based on a given syntactic parser which used the Arc-Standard algorithm with a static oracle. The parser we developed uses the Arc-Hybrid algorithm, which enables the use of dynamic oracles, since it is a decomposable parser. These parsers have been run on a large English dataset and a smaller Swedish dataset. Multiple tests have been run on these parsers for the two languages, measuring the unlabeled attachment score (UAS). The results revealed that the dynamic oracle did not clearly provide any improvement in UAS for either language, since changing the random seed could produce the same improvement as using a dynamic oracle. To draw a definitive conclusion about why this is, more testing has to be done with additional datasets of varying linguistic complexity.