Next: Analysis Up: Error Detection Evaluation Previous: Experimental Setup


Results

Table 1 shows the results for Method 1, which uses XLE starredness, parser exceptions (see footnote 5) and zero parses to classify grammaticality. Table 2 shows the results for Method 2, the basic n-gram approach. Table 3 shows the results for Method 3, which classifies based on a decision tree over XLE features. Table 4 shows the results for Method 4, the n-gram-based decision tree approach. Finally, Table 5 shows the results for Method 5, which combines n-gram and XLE features in decision trees.

In the case of Method 2, we first have to find optimal parameters. Since only a small range of integer values is reasonable for $n$ and for the frequency threshold, an exhaustive search is feasible. We considered $n=2, \ldots, 7$ and frequency thresholds below 20,000. Separate held-out data (400,000 sentences) is used for this search in order to avoid overfitting. The best accuracy is achieved with 5-grams and a threshold of 4; Table 2 reports results with these parameters.
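The exhaustive parameter search can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the function names, the toy data layout and the decision rule (flag a sentence when its rarest n-gram occurs fewer than `threshold` times in a reference corpus) are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(corpus, n):
    """Count n-gram frequencies in a reference corpus (a list of token lists)."""
    counts = Counter()
    for tokens in corpus:
        counts.update(ngrams(tokens, n))
    return counts

def is_flagged(tokens, counts, n, threshold):
    """Assumed rule: flag the sentence if its rarest n-gram is seen fewer than `threshold` times."""
    grams = ngrams(tokens, n)
    if not grams:  # sentence shorter than n: nothing to test
        return False
    return min(counts[g] for g in grams) < threshold

def accuracy(held_out, counts, n, threshold):
    """Fraction of held-out (tokens, is_error) pairs that are classified correctly."""
    hits = sum(is_flagged(tokens, counts, n, threshold) == is_error
               for tokens, is_error in held_out)
    return hits / len(held_out)

def grid_search(reference, held_out, n_values, thresholds):
    """Try every (n, threshold) pair and return (best accuracy, n, threshold).

    Ties are resolved in favour of the pair that was tried first.
    """
    best = None
    for n in n_values:
        counts = count_ngrams(reference, n)  # counted once per n, reused for all thresholds
        for threshold in thresholds:
            acc = accuracy(held_out, counts, n, threshold)
            if best is None or acc > best[0]:
                best = (acc, n, threshold)
    return best
```

With the ranges given above, the search would be called with `n_values=range(2, 8)` and `thresholds=range(1, 20000)`, assuming thresholds start at 1. Counting n-grams once per value of $n$ keeps the inner loop over thresholds cheap.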

Except for Method 4, the standard deviation of results across cross-validation runs is below 0.006 on all measures. We therefore report only average percentages. The highest observed standard deviation is 0.0257, for the recall of Method 4 on agreement errors.

For Methods 3, 4 and 5, the decision tree learner optimises accuracy and, in doing so, chooses a trade-off between precision and recall.
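For reference, the four reported measures can be computed from confusion counts as in the sketch below. It assumes that the positive class is "ungrammatical" and that F-Sc. is the balanced F-score (harmonic mean of precision and recall). The text does not define either explicitly, but the reported values are consistent with the second assumption: precision 66.2 and recall 64.6 for Method 1 on agreement errors give 65.4.

```python
def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, f_score, accuracy) from confusion counts.

    Assumption: the positive class is 'ungrammatical', so tp counts
    ungrammatical sentences that were correctly flagged.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy

def balanced_f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F-Sc.)."""
    return 2 * precision * recall / (precision + recall)

# Check against the Agreement row for Method 1 (values in %).
print(round(balanced_f_score(66.2, 64.6), 1))  # 65.4
```

Because the learner optimises accuracy only, it may raise recall at the cost of precision, or the reverse, as long as the total number of correct decisions increases.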


Table 1: Classification results with XLE starredness, parser exceptions and zero parses (Method 1); values in %
Error type    Precision  Recall  F-score  Accuracy
Agreement          66.2    64.6     65.4      65.8
Real-word          63.5    57.3     60.3      62.2
Extra word         64.4    59.7     62.0      63.4
Missing word       59.2    47.8     52.9      57.4
Mixed errors       63.5    57.3     60.3      62.2



Table 2: Classification results with 5-grams and frequency threshold 4 (Method 2); values in %
Error type    Precision  Recall  F-score  Accuracy
Agreement          58.6    51.7     55.0      57.6
Real-word          64.0    64.9     64.5      64.2
Extra word         64.8    67.3     66.0      65.4
Missing word       57.2    48.8     52.7      56.1
Mixed errors       61.5    58.2     59.8      60.8



Table 3: Classification results with decision tree on XLE output (Method 3); values in %
Error type    Precision  Recall  F-score  Accuracy
Agreement          67.0    79.3     72.6      70.1
Real-word          63.4    67.6     65.4      64.3
Extra word         63.0    66.4     64.7      63.7
Missing word       59.7    57.8     58.7      59.4
Mixed errors       63.4    67.8     65.6      64.4



Table 4: Classification results with decision tree on vectors of frequency of rarest n-grams (Method 4); values in %
Error type    Precision  Recall  F-score  Accuracy
Agreement          61.2    53.8     57.3      59.9
Real-word          65.3    64.3     64.8      65.1
Extra word         66.4    67.4     66.9      66.7
Missing word       59.1    49.2     53.7      57.5
Mixed errors       63.3    58.7     60.9      62.3



Table 5: Classification results with decision tree on joined feature set (Method 5); values in %
Error type    Precision  Recall  F-score  Accuracy
Agreement          67.1    75.2     70.9      69.2
Real-word          65.8    70.7     68.1      67.0
Extra word         65.9    71.2     68.5      67.2
Missing word       61.2    58.0     59.5      60.6
Mixed errors       65.2    68.8     66.9      66.0




Footnotes

5. XLE parsing (see footnote 2 for configuration) runs out of time for 0.7% and out of memory for 2.5% of sentences, measured on the training data of the first cross-validation run, i.e. 540,000 grammatical sentences and 135,000 sentences of each error type. 14 of 3 million sentences caused the parser to terminate abnormally.

jwagner@computing.dcu.ie