Technical Manual


Contents

Back to List of Documentation

Some Background Information
Back to Top

Statistical NLP

In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. [1] The use of statistical methods has greatly furthered progress in many areas of language processing. Disambiguation, database classification of documents, speech recognition and grammar learning are just some of the areas that statistics has helped. We now have several large electronically available corpora, from which we can extract the statistical information that can help us understand language. These corpora are usually annotated and we can use these annotations to our advantage. Due to the sheer volume of text and data in these corpora, we can assume certain regularities of a language and try to describe the features of a language by observing frequent characteristics. Statistics is the ideal tool for analysing these frequently occurring phenomena.

Probabilistic Context-Free Grammars

Context-free grammars are widely used as models of natural language syntax. In their probabilistic version, which defines a language as a probability over strings, they have been used in a variety of applications. [8] The simplest way to gather statistical information about a grammar is to count the number of times each rule is used in a corpus containing parsed sentences and use this to estimate the probability of each rule being used. [2] The simplest probabilistic model for recursive embedding is a Probabilistic (or Stochastic) Context-Free Grammar. This is basically a CFG with probabilities associated with each rule, indicating how probable a rewrite rule is.

To find the probability of a tree in a PCFG model, just multiply the probabilities of the rules that built its local sub-trees. [7] The probability of a string is defined as the sum of the probabilities of the parse trees that have this string as a yield. The probability of a parse tree is defined as the probability of the corresponding leftmost derivation. [6]

The Penn Treebank

The Penn Treebank was developed in the University of Pennsylvania beginning in 1989. It is a corpus containing over 4.5 million words of American English. It has been annotated with Part Of Speech tags, as well as annotated with skeletal syntactic structures. The Part Of Speech tagging was first done automatically and then corrected by human annotators. The syntactic annotation was also semi-automatic. I extracted my probabilistic context-free grammar from what is called the Penn II Treebank, which means that the level of annotation is far greater than the original project specified. More information about the Penn Treebank can be found at its homepage http://www.cis.upenn.edu/~treebank/

Aim of Project
Back to Top

The aim of my project was to write a probabilistic context-free parser. There were two main stages involved. The first step was to extract a probabilistic grammar from the Penn-Treebank. This grammar was saved and used in the parser. The next step was to write a parser given a particular input, find a parse for it, and display the results in graphical form.

Extracting a Probabilistic Context-Free Grammar from the Penn Treebank
Back to Top

To extract a probabilistic grammar from the Penn Treebank I processed sections 02 to 21 of the parsed version of the Penn Treebank. Each section contains about 100 files, and the total number of sentences processed to extract the grammar was roughly 40,000. [4] The parsed version of the Penn Treebank contains skeleton parses. An example parse is as follows:
( (S (NP-SBJ (NP Pierre Vinken)
             ,
             (ADJP (NP 61 years)
		   old)
             ,)
     (VP will
         (VP join
             (NP the board)
             (PP-CLR as
		     (NP a nonexecutive director))
	     (NP-TMP Nov. 29)))
     .))
This is a skeleton parse in that the part of speech tags are missing from the words. As part of my third year project I wrote a program that attempted to extract rules from these parses. The results for the above sentence were:
Lexicon Rules Probabilistic Rules
ADJ->old: 1 ADJP->NP ADJ: 1 ADJP->NP ADJ: 1
NP->61 years: 1 NP-SBJ->NP ADJP : 1 NP-SBJ->NP ADJP : 1
NP->Pierre Vinken: 1 PP-CLR->P NP : 1 PP-CLR->P NP: 1
NP->a nonexecutive director: 1 S->NP-SBJ VP : 1 S->NP-SBJ VP: 1
NP->the board: 1 VP->V NP PP-CLR NP-TMP : 1 VP->V NP PP-CLR NP-TMP: 0.5
NP-TMP->Nov. 29: 1 VP->V VP : 1 VP->V VP : 0.5
P->as: 1    
V->join: 1    
V->will: 1    

Obviously this needed to be improved to incorporate the part of speech tags. My original program attempted to guess categories for words which had no explicit category in the parse, but it was often wrong. My original program also ignored all punctuation. I decided to include punctuation since I believe it does actually affect the structure of a sentence.

Extracting Tags for Untagged Words

I needed to extract the correct tags for the untagged words in the parsed structures. To do this I used the tagged version of each file as a reference. My first idea of how to extract the tag from the file was to simply count words. In theory each word that I parsed should have an equivalent word/tag pair in the tagged file. Each word/tag pair is separated by a slash in the tagged file. I therefore used this as a deliminator when splitting the tagged file up into words and tags.

Problems Encountered

The main problem with this was that some words actually contain slashes (1/2 for example), and although they are especially marked in the tagged file by a backslash, my program did not cope with this very well. I made adjustments so that fractions were treated differently, but then the program encountered further problems with alternatives like "pianist/bassoonist/composer". I then needed to rethink my strategy, so I introduced a while loop that dealt with slashes in words. This slowed down the process quite a bit. This meant that the program could now assign the correct tag to untagged words in the parsed file.

However there was another problem. There were sometimes inconsistencies between the tagged files and the parsed files. For example in wsj_0455.prd (line 663) there is no closing right bracket. This is obviously an oversight by a human annotator, since there is a right bracket in the raw text. The tagged version is correct and also contains the closing right bracket. By using a counting approach, inconsistencies like this completely skewed results. Although the parsed version is incorrect, my program cannot know that, and can only go on the data it is given.

Solution

I changed my approach to use pattern matching. I had originally avoided this approach because pattern matching takes longer than counting, and also punctuation would be complicated to deal with. I noticed a large drop in speed when I implemented this approach. There were also unexpected results from this approach due to inconsistencies between the files again. For example in file wsj_0142.pos (line 906) the tag appears as 2003\/2007/CD. The 2007 does not appear at all in the parsed equivalent. This meant that at first, the 2003 in the parsed file was getting the tag "2007/CD" which was obviously incorrect. So I then restricted the possible values of the tags to valid tags. This gave the desired results, although at a greatly increased speed in comparison to the original program.

There are still some incorrect lexical rules in the file, but this is due to inconsistencies in the files. For example wsj_0472 (line 685) should read something like (ADJP 2 1/2-year) but it only reads (ADJP 2 1). There are also some unusual tags, for example wsj_0271.pos (line 434) gives "an" the tag ",".

In Hindsight, maybe a better approach to extracting the tags would not have been to use the / as a deliminator and instead use spaces. This would mean that I would have to pre-process the file first to eliminate the rows of = signs, but I could then have used the pattern / followed by a tag followed by a space to determine the tag.

Transforming the skeleton syntactic structures

In order to extract a PCFG I first had to transform the trees in the treebank. This transformation involved getting tags for the words in the tree, removing some functional information and removing all empty elements. Transforming the grammar like this as a separate stage in the process was needed so that when it came to evaluating my parser, I would be able to compare my results with the exact structure that was used to generate the PCFG. I rewrote the new trees to a file. Because I would using the CYK Algorithm to parse the sentence, I put in the option at this stage to remove unit productions.

I removed some functional information. Tags such as ADVP-LOC-TPC-PRD were either reduced to ADVP-LOC or ADVP. Which form they were reduced to was determined by setting a flag at the beginning of pcfg.pl

Empty Elements are elements such as : (NP *T*-1) These occur as a result of functional information and I discarded them.

Removing Unit Productions was not as straight-forward as I thought it would be. At first I only considered unit productions of the form
BKT_TAG -> POS_TAG,     POS_TAG -> Word

e.g. NP -> NN,    NN-> Asbestos
which would be reduced to simply NN-> Asbestos, and the rule of which NP was a daughter would be updated to contain NN instead of NP. These were relatively easy to deal with since they occurred in the form
( BKT_TAG ( POS_TAG Word ) )

in the transformed sentence. Using regular expressions it was easy to change the tree to just contain ( POS_TAG Word ) and the correct rule would be extracted at a later stage.

Problems Encountered

However, I had not thought about unit productions where the daughter of the unit production was itself a complex tree, and not simply a part of speech tag and a lexical entry. These were of the form:
( S ( NP ( DT The )( NNP Manville )( NNP Personal )( NNP Injury )( NNP Settlement )( NNP Trust ) )( VP ( VBD said )( SBAR ( S ( NP ( PRP it ) )( VP ( VBZ is )( VP ( VBG considering )( NP ( NP ( JJ several )( NNS ways ) )( SBAR ( S ( VP ( TO to )( VP ( VB ease )( NP ( DT a )( NN liquidity )( NN crunch ) ) ) ) ) )( SBAR ( WHNP ( WDT that ) )( S ( VP ( MD could )( VP ( VB include )( NP ( NP ( DT the )( NN sale ) )( PP ( IN of )( NP ( NNP Manville )( NNP Corp. ) ) )( PP ( TO to )( NP ( DT a )( JJ third )( NN party ) ) ) ) ) ) ) ) ) ) ) ) ) )( . . ) )

What is not immediately obvious in this example is that there are several unit productions. So I had to rethink my method of finding unit productions. I had information about rules that had already been completed, and I also had information about rules that were waiting to be completed. In the above example, the first unit production rule encountered by my program is: S-> VP. At first I thought by simply looking for occurrences of S (VP I would be able to remove the rules like this. However in the above example this pattern appears more than once, and I cannot assume that the unit production rule is either of them, since the VP may itself contain an S which starts with a VP.

Solution

To solve this problem, I stored all the rules that had already been popped off the stack as completed rules, and recursively found the complete subtree of the unit production, excluding lexical entries. So in the above example I used the following stack of previously added rules:
NP-> DT NNP NNP NNP NNP NNP NP-> PRP NP-> JJ NNS NP-> DT NN NN VP-> VB NP  VP-> TO VP  S-> VP
to help me determine which S(VP was in fact the first unit production reached. Using these rules, I was able to determine that the sequence of non-terminals, brackets and terminal symbols that I was looking for was :
S ( VP ( TO word )( VP ( VB word )( NP ( DT word )( NN word )( NN word ) ) ) )
This then allowed me to replace the correct unit production in the sentence

( S ( NP ( DT The )( NNP Manville )( NNP Personal )( NNP Injury )( NNP Settlement )( NNP Trust ) )( VP ( VBD said )( SBAR ( S ( NP ( PRP it ) )( VP ( VBZ is )( VP ( VBG considering )( NP ( NP ( JJ several )( NNS ways ) )( SBAR ( S ( VP ( TO to )( VP ( VB ease )( NP ( DT a )( NN liquidity )( NN crunch ) ) ) ) ) )( SBAR ( WHNP ( WDT that ) )( S ( VP ( MD could )( VP ( VB include )( NP ( NP ( DT the )( NN sale ) )( PP ( IN of )( NP ( NNP Manville )( NNP Corp. ) ) )( PP ( TO to )( NP ( DT a )( JJ third )( NN party ) ) ) ) ) ) ) ) ) ) ) ) ) )( . . ) )

At first I thought it might have been possible to simply pop off the last rule added and search for that pattern, but that proved to be insufficient. I found this particular problem to be one of the most difficult to solve, probably because at first I was unwilling to accept the fact that I would have to build up the complete subtree. Once I had decided that that was what I had to do, I changed my way of tackling the problem and came up with a much better solution than some of the work-arounds I had been trying.

Extracting the PCFG

Once the parsed sentences had been transformed, it was straightforward to extract the PCFG. There was no more extra information that I had to ignore, or no more missing information that I had to find. The total space of grammars that could be extracted by using my program is illustrated in the table below:

Unit ProductionsFRAG StructuresExtra Functional Information
000
001
010
011
100
101
110
111


The two grammars I concentrated on for using with my parser were the 000 and 001, i.e. the two grammars that had no unit productions and no FRAG sentences. The difference between the two was that one had slightly more functional information than the other, so had bracket tags such as NP-SBJ, rather than simply NP.

Converting the PCFG into Chomsky Normal Form
Back to Top

Once I had the probabilistic context-free grammar extracted, the next thing that needed to be done was to convert it into Chomsky Normal Form (CNF) since The CYK Parser requires a grammar in this form. CNF grammars have rules of the form:
A --> a or A --> B C,
where {A,B,C} are non-terminals and a is a terminal symbol

There are three steps in converting a standard CFG into this form.
  1. Remove Empty Productions
  2. Remove Unit Productions
  3. Change grammar so all rules are in CNF
The first two steps were taken care of in the transformation stage of extracting the grammar. My perl script convert.pl deals with the last step. Rules of the form A --> BCDE (p), where p is the probability, are replaced by a set of rules:
A --> BC’ (p), C’--> CD’ (1) and D’--> DE (1)

This was done in a simple foreach loop. New categories were given labels R followed by some number, e.g. R0, R1, R2 etc.

At first I simply generated the grammar as specified in the algorithm. For the "000" set of rules extracted from sections 02-21, which originally had just over 19000 rules, this resulted in a grammar of 66533 rules. I knew speed was going to be a problem in writing my parser, so obviously I wanted to have as small a grammar as possible. I realised that there was going to be some duplication. This can clearly be seen from this small example grammar:
NP --> DT CD ADJP NN .
NP --> DT ADJP NN .
The resulting grammar in CNF is as follows:
NP --> DT R0
R0 --> CD R1
R1 --> ADJP R2
R2 --> NN .
NP --> DT R3
R3 --> ADJP R4
R4 --> NN .
Clearly R2 and R4 express the same rule, so can be collapsed into the same rule, making sure to update any rules with these non terminals on the RHS. This leads to the following grammar:
NP --> DT R0
R0 --> CD R1
R1 --> ADJP R2
R2 --> NN .
NP --> DT R3
R3 --> ADJP R2
Again it can be seen that there is a duplication. R1 and R3 are essentially the same rule. By collapsing rules like this it was possible to reduce the number of rules from over 66,000 to 32,231. Similarly for the "001" grammar, the total number of rules was reduced from over 92,000 to 47,430.

The convert.pl program creates a file called cnf_grammar which contains the new grammar in Chomsky Normal Form. It also creates a file called cnf_info which calculates the number of rules and non-terminal symbols in the grammar. These files will be used later in the parser

Writing the Probabilistic Context-Free Parser
Back to Top

I based my algorithm for writing the parser on the one Michael Collins [5] described in 1999. The pseudo-code for the algorithm is as follows:

#initialisation
for all i,j,k
	p[i,j,k] = 0

#base case

for i = 1 ... n
	for k = 1 ... G
		if k -> wi is in grammar
			p[i,i,k] = P(k -> wi)

#recursive case

for s = 2 ... n
for i = 1 ... n-s+1
j=i+s-1
for m = i ... j-1
	for k = 1 ... G
	for k1 = 1 ... G
	for k2 = 1 ... G

	prob = p[i,m,k1] * p[m+1,j,k2]* P(k -> k1 k2)
	if (prob > p[i,j,k])
		
		p[i,j,k] = prob
		B[i,j,k] = {m,k1,k2}
Where G is the number of non-terminal symbols in the grammar and n is the number of words in the input string. p is the dynamic programming array. B is an array of back-pointers allowing recovery of the highest probability tree.



This algorithm runs in O(n3 |N3|), where n is the number of words in the input string and N is the number of non-terminals in the grammar. Obviously when the grammar contains more than 13,000 non-terminals, we need to try and make the implementation as efficient as possible.

The base case could be made more efficient since I was taking in a tagged sentence. This meant that I did not have to add more than one entry per word into the chart, since there could only be one possible entry, the one specified by the tag in the input. So to implement the base case I read in all the word/tag pairs from the input, processed the probabilistic lexicon file to get the probability for each pair, and then added set p[i,i,tag] to that probability. For word/tag pairs that were not in the lexicon file, I assigned them a minimal probability.

The recursive case, containing 6 for loops , 3 of which went from 1 - 13,000, obviously needed to be examined. At first I implemented it as it is in the pseudo-code printing out some of the information it was processing as it went along. The first thing I noticed was that the inner three loops try to generate possible rules in the grammar by combining all possibilities of non-terminal symbols. Since I already had all the rules stored in a hashtable, it did not make much sense to me to use up valuable processor speed trying to generate possible rules, when I could simply loop through the ones I actually had.

So my first attempt at improving the speed of the parser was to replace the inner three loops with one loop which was:
foreach rule k -> k1 k2 in the grammar
	prob = p[i,m,k1] * p[m+1,j,k2]* P(k -> k1 k2)
	if (prob > p[i,j,k])
		
		p[i,j,k] = prob
		B[i,j,k] = {m,k1,k2}
At this stage, p was a hashtable with the string "i\tj\tk" as it's key and a probability as the value. B was also a hashtable with a similar key and it's value was a Pointer object. The Pointer class allows for objects with an index and two separate non-terminal strings. My rules were also stored in a hashtable where the key was a Rule object and the value was its probability. This class gave easy access to the individual non-terminal symbols of the rule. I did not use the Rule class in the final version of the parser, so this is what its main constructor looked like
class Rule
{
	String lhs,rhs1,rhs2;
	Double prob;

	public Rule(String l,String r1,String r2,Double d)
	{
		lhs = l;
		rhs1=r1;rhs2=r2;
		prob=d;
	}
}
Making this change made a very noticeable improvement in the speed of the parser, even though at this early stage in development I was only testing it on a small grammar of no more than 20 rules.

However, there was more room for improvement. Since the CYK Parsing Algorithm is a Bottom-up algorithm it made sense to concentrate on what was on the RHS of the rule. This lead me to change my data-structures. I stored the rules using the RHS of the rule as the hashing key. The value for this key was another hashtable which contained possible LHSs of the rule as keys and the probability of LHS --> RHS as the value. This meant that when I had found the RHS of some rule, I could easily access all the possible LHSs of the rule and their associated probabilities.
This also lead me to change the structure of p. I changed this so that it was still a hashtable, but its key now only contained two indices, it no longer contained any non-terminals. The value of each key then became another hashtable,similar to the rules hashtable, with all possible non-terminal symbols as keys and probabilities as values. I did not change the structure of B since it was not really contributing much to the speed of parsing, it was used at a later stage in the program.

Making this change made noticeable improvements when I began testing on larger grammars. The program will parse short sentences (less than 20 words including punctuation) almost instantly. As sentence length grows, obviously the slower it gets. It can parse sentences up to 40 words in length, depending on their structure, but it usually runs out of memory while parsing for anything above that. It often has difficulty with sentences with length greater than 30.

Writing the Graphical User Interface
Back to Top

Before I started this project, I had no knowledge about writing graphical interfaces to programs. I know quite a bit more now, though most of the SWING code I have written has been adapted from code on the Internet or code written by other people. Having said that, I was still able to design and write a simple user interface to do exactly what I wanted it to do relatively easily, once I had learned the basics of SWING.

The GUI consists of one main frame and panel, four buttons, a short menu, a text area to type in a tagged sentence, and a progress monitor text area which you cannot edit. The code for the "Choose File" button was taken from java.sun.com, but I wrote the others myself.

The code for the Progress Monitor was the last piece of code to be written and uses Java's Thread mechanism. I had tried to update the progress monitor as the parser was running through each sentence, but for some reason it would only update after it had finished parsing a whole file. This completely went against the whole idea of a progress monitor. So to get around this problem, I created three separate subclasses of Thread. The first subclass was to actually update the progress textarea, the second parsed a single sentence coming from the text area in the GUI, and the last parsed a complete file. Using threads solved my problem of the progress not being updated at the right time, though my program does not make use of threads as they were probably originally intended. I could maybe have made use of threads to try and get my program to start parsing several sentences at once, though with the high demands on memory by my program, I'm not sure how more efficient this would have been, if at all.

The Graphical Output of the Parser
Back to Top

I had originally planned to write my own program that would represent the output in a tree-like representation, but that proved to be too difficult for the scale of this project. Instead I downloaded a program from http://www.cis.upenn.edu/~josephr/Treefig/ . The main problem with this code was that it was written in C,which I am not very familiar with, so I found it very difficult to deal with any errors it gave when compiling. With some help and after some minor modifications to the C code so that it would compile on windows, I managed to get an executable program compiled for both Unix and Windows. All I needed to do then was convert the bracketed output of my parse to a format containing only square brackets and call this executable from my java program.

I also made some minor adjustments to the java code which displays the end-result so that it would also show the probability associated with each parse.

Evaluation of the Parser
Back to Top

I had also planned to write my own program to evaluate the results of my parser, but there was not much time remaining when I got around to this section, so I also downloaded a program from http://www.cs.nyu.edu/cs/projects/proteus/evalb to do the evaluation.

The Evaluation did not run as smoothly as anticipated. The main problems again were discrepancies between the parsed version of the treebank and the tagged version. I was generating a parse based on the tagged version, but when this was different to the parsed version, obviously it was not going to be easy to compare them

The evaluation program runs on a sentence by sentence basis. This in itself was a stumbling block, since sentence boundaries were not always clear in the tagged files. It made sense to run the parser program in batch mode, where it simply took in an input file and generated a parsed output file. I assumed there was some sort of pattern to the way the sentence boundaries ( as defined by the parsed version of the treebank) were marked in the tagged file. I failed to see this pattern however, though my guess was close, and as a result I often parsed what I thought was one sentence, but the manual parsers of the treebank decided were two. This meant in order for me to be able to properly compare my results, I had to re-parse any sentences where this type of mis-match occurred. This was extremely time-consuming. Also adding to the problem of time was the fact that sometimes it took several hours just to parse one file.

Other small discrepancies, which did not show up until I had parsed my files, where things like 0.7 appearing in the tagged version, but .7 appearing in the parsed version. This caused the evaluation program to generate an error. I had to manually correct all these mistakes before the evaluation program would run successfully.

I used the PARSEVAL measures [3] to evaluate my parser. These measures are:
 
	Labeled Precision = number of correct constituents in proposed parse
			   ----------------------------------------------------
				    number of constituents in proposed parse


	Labeled Recall = number of correct constituents in proposed parse
			---------------------------------------------------
				number of constituents in treebank parse

	Crossing Brackets = number of constituents which violate constituent 
		 	    boundaries with a constituent in the treebank parse.
For a constituent to be correct it must span the same set of words (including punctuation) and have the same label as a constituent in the treebank parse.

The results of the evaluation were as follows:
Number of sentence        =   2132
Number of Error sentence  =      0
Number of Skip  sentence  =      0
Number of Valid sentence  =   2132
Bracketing Recall         =  74.45%
Bracketing Precision      =  71.88%
Complete match            =  14.87%
Average crossing          =   2.19
No crossing               =  38.93%
2 or less crossing        =  67.40%
Tagging accuracy          = 100.00%
I then ran some tests using the grammar extracted which keept the functional information. I evaluated the results on about 80% of the files parsed. There were a lot more sentences in this instance that did not get parsed because of memory limitations. The results of the evaluation were as follows:
Number of sentence        =   1062
Number of Error sentence  =      0
Number of Skip  sentence  =      0
Number of Valid sentence  =   1062
Bracketing Recall         =  78.46%
Bracketing Precision      =  79.07%
Complete match            =  26.27%
Average crossing          =   1.23
No crossing               =  53.86%
2 or less crossing        =  81.73%
Tagging accuracy          = 100.00
The results seem to be slightly improved for those sentences that did get parsed. This was to be expected since with more functional information included in the rules, the rules were more specific.

Further Developments of the Project
Back to Top
There are some changes that could be made to this project in future developments. The code for extracting the grammar could probably be improved to reflect the recursive nature of the bracketed structures even more. The parsing algorithm could be changed to one which did not require a grammar in CNF, and therefore use a smaller grammar. This may improve parsing speed, though obviously the algorithm would be more difficult to implement than the CYK algorithm. The User Interface could be improved to catch all errors, I did not get a chance to complete testing of the interface, so it is unlikely to perform well when given unexpected data. The grammar could also be lexicalised to see if this makes any improvements to results.
References
  1. Abney, Steven (1996) Statistical Methods and Linguistics: In Judith Klavans and Philip Resnik, eds., The Balancing Act. MIT Press, Cambridge, MA.
  2. Allen, James (1995) Natural Language Understanding Benjamin/Cummings Publishing Company Inc.
  3. Black, E. et al (1991) A Procedure for Quantitavely Comparing the Syntactic Coverage of English Grammars in Proceedings of the February 1991 DARPA Speech and Natural Language Workshop.
  4. Collins, Michael (1996) A New Statistical Parser Based on Bigram Lexical Dependencies in the proceedings of ACL 1996.
  5. Collins, Michael (1999) Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia.
  6. Krenn, Brigitte and Samuelssonm Christer (1997) The Linguist's Guide to Statistics http://www.coli.uni-sb.de/~krenn/edu.html Last visited: 25th May 2001
  7. Manning, Christopher D and Schütze, Hinrich (1999) Foundations of Statistical Natural Language Processing MIT Press, Cambridge, MA
  8. Stolcke, Andreas (1993) An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities Berkley, CA (revised 1994)