|
Aoife Cahill : 97093246 Project Title: Probabilistic Context-Free Parsing Project Supervisor: Josef van Genabith Project Documentation |
||||||||||
|
|
||||||||||
|
Statistical Methods are being increasingly used in the area of natural
language processing. This document outlines my project which deals with probabilistic parsing. | ||||||||||
|
| ||||||||||
|
I propose to write a probabilistic context-free parser. I will first of all extract a probabilistic grammar from the Penn-Treebank. The output of this script will be a list of all the grammar rules found in the treebank and their associated probabilities. I hope to use this grammar to write a probabilistic parser. I intend to use a variation of the CYK parsing algorithm to find the most probable parse. I intend to write the parser in Java and link it to a web interface. The parser should take in a tagged sentence and output the most probable phrase structure of the sentence. | ||||||||||
|
1. Extract a probabilistic Context-Free Grammar
The first step of the project is extracting the probabilistic context free grammar from the Penn Treebank. This was partially achieved as part of my third year mini-implementation, however there are some improvements needed. One of the major improvements is to extract POS tags for untagged words from the fully tagged version of the Penn Treebank. (Since the syntactical representation of the treebank is only skeletal, some of the POS tags were omitted). Once this has been done I should have a more complete PCFG. 2. Convert the grammar into Chomsky Normal Form I have chosen to implement a CYK algorithm to parse the input sentence. One of the constraints of the CYK algorithm is that the grammar be in CNF. This means that it can only contain rules of the form Aà B C or A à a. It is not clear yet how the probabilities of the original rule can be carried over to these new rules, and I may need to rethink the parsing strategy used. 3. Write the CYK Chart Parser The next major section of the project is to create the CYK Chart Parser. The input sentence will be tagged. There may still be problems with unknown words since they will not have any probability associated with them in the lexicon, and in this case I will assign a non-zero probability. I will use the Probabilistic CYK algorithm outlined in Jurafsky and Martin[1] to calculate the most probable parse of the input sentence. 4. Create a Web-Interface to display the results Once I have calculated the most probable parse I will hopefully present the results in a tree-like fashion using a Web Interface.
| ||||||||||
| ||||||||||
| [1] Jurafsky, Daniel and Martin, James, H. (2000) Speech and Language Processing An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition Prentice-Hall Inc., New Jersey 07458 | ||||||||||
|
|
||||||||||
|