Standard project
The standard project is on the topic of syntactic parsing.
Topic
Syntactic parsing is the task of mapping a sentence to a formal representation of its syntactic structure. We will introduce this task in the first week, return to it throughout the course, and cover it in detail in Unit 4.
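As a concrete illustration (the sentence and analysis here are our own example, not taken from the course materials), a dependency tree can be represented simply as a list of head indices, with a virtual root at position 0:

```python
# A dependency tree for "She bought a book", encoded as head indices.
# Index 0 is a virtual ROOT token; every other token points to its head.
words = ["<ROOT>", "She", "bought", "a", "book"]
heads = [0, 2, 0, 4, 2]  # She -> bought, bought -> ROOT, a -> book, book -> bought

# Print each token together with its head word.
for i in range(1, len(words)):
    print(words[i], "<-", words[heads[i]])
```

This head-index encoding is also how trees are stored in the treebank data used later in the project.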
To provide you with additional background material, we have compiled a reading list:
Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source (Google Research Blog, 2016-05-12). This blog post introduces syntactic parsing and its applications, as well as Google’s SyntaxNet framework, which can be used to train parsers on suitable data.
Universal Dependencies v1: A Multilingual Treebank Collection (research article, LREC 2016). This research article describes a collection of data sets that can be used to train syntactic parsers, including parsers based on Google’s SyntaxNet as well as the parser that you will implement in the lab series. See also the homepage of the Universal Dependencies project.
Grounded Compositional Semantics for Finding and Describing Images with Sentences (research article, TACL 2014). This research article presents an interesting use case for syntactic parsers. Note that we do not expect you to understand all technical details in this paper. The purpose is to give you a concrete, non-trivial example of what syntactic parsers can be used for.
Requirements
The minimal requirements for the standard project are as follows:
- You put together a baseline system based on existing code.
- You modify or apply this baseline system, implementing methods described in the research literature.
- You evaluate your system on the Universal Dependencies treebanks or in the context of another task.
- You analyse your results and draw conclusions about the effectiveness of the implemented methods.
Simple projects will make limited-scale modifications to the baseline system. Complex projects will be more varied and either implement substantial changes (such as a different parsing algorithm) or apply the parser in the context of some other task. In any case, the focus must be on implementing methods described in the NLP research literature.
Baseline
The baseline for the standard project is an implementation of a simple tagger–parser pipeline. It is available in the course repo in the form of a Jupyter notebook. For your baseline submission (D3), you must populate a GitLab repository with a stand-alone version of this notebook in the form of a Python script. Your script must be able to process any treebank released by the Universal Dependencies Project.
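The Universal Dependencies treebanks are distributed in the CoNLL-U format. A minimal sketch of reading such a file (the column layout follows the UD documentation; this simplified reader skips comments, multiword token ranges, and empty nodes, and is not the baseline's actual loader):

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, upos, head) triples.

    Simplified: skips comment lines as well as multiword tokens
    and empty nodes (IDs containing '-' or '.').
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        sentence.append((cols[1], cols[3], int(cols[6])))
    if sentence:
        yield sentence

# Tiny inline sample in CoNLL-U layout.
sample = [
    "# text = She slept",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tslept\tsleep\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
sentences = list(read_conllu(sample))
```

In practice you would pass an open file object instead of the `sample` list.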
Some of the Universal Dependencies treebanks contain so-called non-projective trees. To train on these treebanks, you will first have to projectivise them. For this, you can use the Python script projectivize.py (contains usage instructions).
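A tree is projective if no two of its dependency arcs cross when drawn above the sentence. A minimal sketch of a projectivity check over the head-index encoding (0 denotes the root, as in CoNLL-U; this is an illustration, not the projectivize.py script):

```python
def is_projective(heads):
    """Return True if the tree given by head indices contains no crossing arcs.

    heads[i] is the head of token i; index 0 is a dummy entry for the root.
    """
    arcs = [(min(i, heads[i]), max(i, heads[i])) for i in range(1, len(heads))]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:  # arcs (a, b) and (c, d) cross
                return False
    return True

# Token 2 heads tokens 1 and 3; no arcs cross.
assert is_projective([0, 2, 0, 2])
# Arcs (1, 3) and (2, 4) cross, so this tree is non-projective.
assert not is_projective([0, 0, 4, 1, 1])
```

The quadratic scan over arc pairs is fine for sentence-length inputs.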
Instructions: Send an e-mail to the examiner with a link to a GitLab repository containing your code. Make sure to grant the examiner Developer permissions for your repository.
The repository must contain a file README.md stating the tagging accuracy and unlabelled attachment score for your baseline system when trained on the training sections and evaluated on the development sections of at least two treebanks from the Universal Dependencies Project:
- the English Web Treebank (EWT)
- one additional treebank in a language other than English
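Both scores are token-level accuracies. A minimal sketch of how they could be computed over parallel gold and predicted analyses (the function names are our own, not part of the baseline):

```python
def tagging_accuracy(gold_tags, pred_tags):
    """Fraction of tokens with the correct part-of-speech tag."""
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)

def uas(gold_heads, pred_heads):
    """Unlabelled attachment score: fraction of tokens with the correct head."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)
```

For example, `tagging_accuracy(["PRON", "VERB"], ["PRON", "NOUN"])` is 0.5.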
In addition to this file, your repository must contain everything needed for the examiner to replicate your results. This must be possible by running the following commands. (Replace abcxy999 with your LiU-ID and nlp-project with the name of your repository.)
$ git clone git@gitlab.liu.se:abcxy999/nlp-project.git
$ cd nlp-project
$ python baseline.py
Main part
There are many different things that you can do to modify or apply the baseline system. Here are some ideas, roughly sorted from simple to complex. For each idea, we also list one relevant research article. You can also come up with your own ideas, and do your own literature search. Most research articles in the field of natural language processing are available for free via the ACL Anthology.
- Dynamic oracles
- Implement the arc-hybrid system and a dynamic oracle for choosing the best possible transition in a given configuration. Research article: Training Deterministic Parsers with Non-Deterministic Oracles
- Non-projective parsing
- Add support for non-projective trees by implementing a transition system with a swap transition. Research article: Non-Projective Dependency Parsing in Expected Linear Time
- Beam search
- Replace the greedy search in the baseline system with a beam search. Research article: Efficient Structured Inference for Transition-Based Parsing with Neural Networks and Error States
- Parsing with RNNs
- Replace the feedforward architecture in the baseline system with an architecture based on recurrent neural networks. Research article: Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations
- Extrinsic evaluation
- Apply your parser to an extrinsic task such as information extraction, and evaluate its performance. Research article: Multi-Way Classification of Semantic Relations Between Pairs of Nominals
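To give a flavour of what one of these extensions involves, here is a minimal sketch of beam search over transition sequences. The interface (`next_states`, `is_final`, string-valued states) is a placeholder of our own, not the baseline system's actual API:

```python
def beam_search(initial_state, next_states, is_final, beam_size=4):
    """Generic beam search.

    next_states(state) yields (score_delta, new_state) pairs;
    is_final(state) tests for completed analyses.
    Keeps the beam_size highest-scoring partial analyses at each step.
    """
    beam = [(0.0, initial_state)]
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for score, state in beam:
            if is_final(state):
                candidates.append((score, state))
            else:
                for delta, new in next_states(state):
                    candidates.append((score + delta, new))
        beam = sorted(candidates, key=lambda x: -x[0])[:beam_size]
    return max(beam, key=lambda x: x[0])

# Toy usage: states are strings of three 'L'/'R' choices;
# choosing 'R' scores 1.0, choosing 'L' scores 0.5.
def succ(state):
    return [(0.5, state + "L"), (1.0, state + "R")]

best_score, best_state = beam_search("", succ, lambda s: len(s) == 3, beam_size=2)
```

With `beam_size=1` this degenerates to the greedy search used by the baseline; larger beams delay hard commitments at the cost of more scoring work per step.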