Results
Table 1 shows the results for Method 1,
which uses XLE starredness, parser exceptions[5] and zero parses
to classify grammaticality.
Table 2 shows the results for Method 2,
the basic n-gram approach.
Table 3 shows the results for Method 3,
which classifies based on a decision tree of XLE features.
The results for Method 4, the n-gram-based decision tree approach,
are shown in Table 4.
Finally, Table 5 shows the results for
Method 5, which combines n-gram and XLE features in decision trees.
In the case of Method 2, we first have to find optimal parameters.
As only a very limited set of integer values for n
and the frequency threshold is
reasonable, an exhaustive search is feasible.
We considered several values of n
and frequency thresholds below 20,000.
Separate held-out data (400,000 sentences) is used
in order to avoid overfitting.
Best accuracy is achieved with 5-grams
and a threshold of 4.
Table 2 reports results with these parameters.
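The exhaustive search over n and the frequency threshold can be sketched as follows. This is a minimal illustration, not the implementation used for Method 2: the decision rule (flag a sentence as ungrammatical if its rarest n-gram occurs fewer than `threshold` times in a reference corpus) is an assumption, and all function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(corpus, n):
    """Count n-gram frequencies over a corpus of tokenised sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence, n))
    return counts


def flag_ungrammatical(tokens, counts, n, threshold):
    """Assumed rule: flag the sentence if its rarest n-gram is below the threshold."""
    grams = ngrams(tokens, n)
    if not grams:  # sentence shorter than n: nothing to test
        return False
    return min(counts[g] for g in grams) < threshold


def grid_search(reference, held_out, n_values, thresholds):
    """Try every (n, threshold) pair and return (accuracy, n, threshold) of the best.

    held_out is a list of (tokens, is_ungrammatical) pairs kept separate from
    the data used for the final evaluation, to avoid overfitting.
    """
    best = None
    for n in n_values:
        counts = count_ngrams(reference, n)  # counted once per n
        for threshold in thresholds:
            correct = sum(
                flag_ungrammatical(tokens, counts, n, threshold) == label
                for tokens, label in held_out
            )
            accuracy = correct / len(held_out)
            if best is None or accuracy > best[0]:
                best = (accuracy, n, threshold)
    return best
```

Because n-gram counts depend only on n, they are computed once per value of n and reused for every threshold, which keeps the search cheap even with many candidate thresholds.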
The standard deviation of results across cross-validation runs
is below 0.006 on all measures for all methods except Method 4,
where the highest observed standard deviation
is 0.0257, for recall on agreement errors.
We therefore report only average percentages.
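The per-measure average and standard deviation across cross-validation runs could be computed as in the sketch below. Whether the population or the sample standard deviation was used is not stated; the population form (`pstdev`) is an assumption here, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def summarise_runs(run_scores):
    """Map each measure to (mean, standard deviation) over cross-validation runs.

    run_scores maps a measure name (e.g. "recall") to a list with one score
    per run. Assumption: population standard deviation; use statistics.stdev
    instead for the sample form.
    """
    return {
        measure: (mean(scores), pstdev(scores))
        for measure, scores in run_scores.items()
    }
```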
For Methods 3, 4 and 5, the decision tree learner optimises accuracy and, in doing so,
chooses a trade-off
between precision and recall.
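For reference, the four reported measures can be derived from confusion counts as sketched below. The sketch assumes that ungrammatical sentences are the positive class and that the F-score is the balanced harmonic mean of precision and recall; the latter is consistent with the reported figures (for example, precision 66.2 and recall 64.6 give 65.4). Function names are hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(tp, fp, fn, tn):
    """Return precision, recall, F-score and accuracy from confusion counts.

    Assumption: the positive class is "ungrammatical", so tp counts
    ungrammatical sentences that were correctly flagged.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
        "accuracy": accuracy,
    }
```

A classifier that flags more sentences raises recall at the cost of precision, while accuracy also rewards correctly accepted grammatical sentences; this is the trade-off the decision tree learner settles implicitly when it optimises accuracy.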
Table 1: Classification results with XLE starredness, parser exceptions and zero parses (Method 1)
| Error type | Precision | Recall | F-score | Accuracy |
| Agreement | 66.2 | 64.6 | 65.4 | 65.8 |
| Real-word | 63.5 | 57.3 | 60.3 | 62.2 |
| Extra word | 64.4 | 59.7 | 62.0 | 63.4 |
| Missing word | 59.2 | 47.8 | 52.9 | 57.4 |
| Mixed errors | 63.5 | 57.3 | 60.3 | 62.2 |
Table 2: Classification results with 5-gram and frequency threshold 4 (Method 2)
| Error type | Precision | Recall | F-score | Accuracy |
| Agreement | 58.6 | 51.7 | 55.0 | 57.6 |
| Real-word | 64.0 | 64.9 | 64.5 | 64.2 |
| Extra word | 64.8 | 67.3 | 66.0 | 65.4 |
| Missing word | 57.2 | 48.8 | 52.7 | 56.1 |
| Mixed errors | 61.5 | 58.2 | 59.8 | 60.8 |
Table 3: Classification results with decision tree on XLE output (Method 3)
| Error type | Precision | Recall | F-score | Accuracy |
| Agreement | 67.0 | 79.3 | 72.6 | 70.1 |
| Real-word | 63.4 | 67.6 | 65.4 | 64.3 |
| Extra word | 63.0 | 66.4 | 64.7 | 63.7 |
| Missing word | 59.7 | 57.8 | 58.7 | 59.4 |
| Mixed errors | 63.4 | 67.8 | 65.6 | 64.4 |
Table 4: Classification results with decision tree on vectors of frequency of rarest n-grams (Method 4)
| Error type | Precision | Recall | F-score | Accuracy |
| Agreement | 61.2 | 53.8 | 57.3 | 59.9 |
| Real-word | 65.3 | 64.3 | 64.8 | 65.1 |
| Extra word | 66.4 | 67.4 | 66.9 | 66.7 |
| Missing word | 59.1 | 49.2 | 53.7 | 57.5 |
| Mixed errors | 63.3 | 58.7 | 60.9 | 62.3 |
Table 5: Classification results with decision tree on joined feature set (Method 5)
| Error type | Precision | Recall | F-score | Accuracy |
| Agreement | 67.1 | 75.2 | 70.9 | 69.2 |
| Real-word | 65.8 | 70.7 | 68.1 | 67.0 |
| Extra word | 65.9 | 71.2 | 68.5 | 67.2 |
| Missing word | 61.2 | 58.0 | 59.5 | 60.6 |
| Mixed errors | 65.2 | 68.8 | 66.9 | 66.0 |
Footnotes
[5] XLE parsing (see footnote 2 for configuration) runs out of time for 0.7% and out of memory for 2.5% of sentences, measured on the training data of the first cross-validation run, i.e. 540,000 grammatical sentences and 135,000 sentences of each error type. 14 of the 3 million sentences caused the parser to terminate abnormally.
jwagner@computing.dcu.ie