Experimental Setup
Test Data and Evaluation Procedure
The following steps are carried out to produce training and test data
for this experiment:
- Speech material, poems, captions and list items are removed from the BNC.
4.2 million sentences remain.
The order of sentences is randomised.
- For the purpose of cross-validation, the corpus is split into
10 parts.
- Each part is passed to the 4 automatic error insertion
modules described in Section 3.3, resulting in 40 additional sets
of varying size.
- The first 60,000 sentences of each of the 50 sets,
i.e. 3 million sentences,
are parsed with XLE (see footnote 3).
- N-gram frequency information is extracted for the first 60,000
sentences of each set. An additional 20,000 is extracted as
held-out data.
- 10 sets with mixed error types are produced by joining a
quarter of each respective error set.
- For each error type (including mixed errors)
and cross-validation set, the 60,000 grammatical
and 60,000 ungrammatical sentences are joined.
- Each cross-validation run uses one set out of the 10
as test data (120,000 sentences)
and the remaining 9 sets for training (1,080,000 sentences).
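The set sizes quoted in the steps above can be checked with a little arithmetic, and the 10-fold split can be sketched as follows. This is only an illustrative sketch: the function name `cross_validation_runs`, the seed and the round-robin way of cutting the corpus into parts are assumptions, not details taken from the experiment, and the error insertion, truncation to 60,000 sentences and parsing steps are left out.

```python
import random

N_PARTS = 10          # cross-validation parts
N_ERROR_MODULES = 4   # automatic error insertion modules
PER_SET = 60_000      # sentences kept per set

n_sets = N_PARTS * (1 + N_ERROR_MODULES)   # 10 grammatical + 40 error sets
parsed = n_sets * PER_SET                  # sentences parsed with XLE
test_size = 2 * PER_SET                    # grammatical + ungrammatical
train_size = (N_PARTS - 1) * test_size     # remaining 9 sets


def cross_validation_runs(sentences, n_parts=N_PARTS, seed=0):
    """Yield (train, test) pairs; each part is the test set exactly once."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)               # randomise the order
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    for i, test in enumerate(parts):
        train = [s for j, part in enumerate(parts) if j != i for s in part]
        yield train, test


# Toy usage with 100 stand-in "sentences".
runs = list(cross_validation_runs(range(100)))
```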
The experiment is a standard binary classification task.
The methods classify the sentences of the test sets as grammatical
or ungrammatical.
We use the standard measures of precision, recall, f-score and accuracy
(Figure 1).
Figure 1: Evaluation measures. tp = true positives, fp = false positives, tn = true negatives, fn = false negatives, pr = precision, re = recall
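The formulas of the figure are not reproduced in this text. Assuming the standard definitions of the four measures, with the f-score taken as the balanced harmonic mean of precision and recall, they can be written with the symbols of the caption as:

```latex
\begin{align*}
pr &= \frac{tp}{tp + fp}, &
re &= \frac{tp}{tp + fn}, \\
\text{f-score} &= \frac{2 \cdot pr \cdot re}{pr + re}, &
\text{accuracy} &= \frac{tp + tn}{tp + tn + fp + fn}
\end{align*}
```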
True positives are understood to be ungrammatical sentences that are
identified as such.
The baseline precision and accuracy are both 50%, as half of the test data is
ungrammatical.
If 100% of the test data is classified as ungrammatical, recall will be
100% and the f-score 2/3 (about 66.7%).
Recall shows the accuracy we would get if the grammatical half of the
test data were removed.
Parametrised methods are first optimised for accuracy, and
the other measures are then taken at that setting.
Therefore, f-scores below the artificial baseline, obtained by
classifying every sentence as ungrammatical, are still meaningful.
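As a worked example of the measures and of the artificial baseline, the following sketch computes them from the four counts. It assumes the standard definitions and a balanced f-score; the function name `evaluate` and the zero-division fallbacks are illustrative choices, not part of the original setup.

```python
def evaluate(tp, fp, tn, fn):
    """Return precision, recall, f-score and accuracy from the four counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pr * re / (pr + re) if pr + re else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": pr, "recall": re, "f_score": f, "accuracy": acc}


# Artificial baseline: all 120,000 test sentences are classified as
# ungrammatical, so the 60,000 grammatical ones become false positives.
baseline = evaluate(tp=60_000, fp=60_000, tn=0, fn=0)
```

With these counts, precision and accuracy come out at 0.5, recall at 1.0 and the f-score at 2/3.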
According to the XLE documentation, a sentence is marked with a star (*)
if its optimal solution uses a constraint marked as ungrammatical.
We classify a sentence as ungrammatical if it is starred,
if the parser raises an exception,
or if it receives zero parses.
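The decision rule just described amounts to a disjunction of three conditions. A minimal sketch, assuming the three signals have already been read from the parser output (the function and argument names are hypothetical and not part of XLE):

```python
def is_ungrammatical(starred, parser_exception, n_parses):
    """Flag a sentence if it is starred, raised an exception, or has no parse."""
    return bool(starred) or bool(parser_exception) or n_parses == 0
```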
In each cross-validation run, the full data of the remaining 9 sets
of step 2 of the data generation (see Section 4.1.1)
is used as a reference corpus of presumably grammatical sentences.
The reference corpora and data sets are POS tagged with
the IMS TreeTagger [Schmid 1994].
Frequencies of POS n-grams (for the values of n considered) are counted in the
reference corpora.
A test sentence is flagged as ungrammatical if it contains an
n-gram below a fixed frequency threshold.
Method 2 has two parameters:
n and the frequency threshold.
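The n-gram method can be sketched as follows, with sentences represented as lists of POS tags. The tag names, the toy reference corpus and all function names are made up for illustration; the real reference corpora are the tagged training sets described above.

```python
from collections import Counter


def ngrams(tags, n):
    """Return all contiguous n-grams of a POS tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def count_ngrams(tagged_sentences, n):
    """Count POS n-gram frequencies in a reference corpus."""
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(ngrams(tags, n))
    return counts


def flag_ungrammatical(tags, counts, n, threshold):
    """Flag a sentence if any of its n-grams is rarer than the threshold."""
    return any(counts[g] < threshold for g in ngrams(tags, n))


# Toy reference corpus (hypothetical tags).
reference = [["DT", "NN", "VBZ"], ["DT", "NN", "VBD"], ["DT", "JJ", "NN", "VBZ"]]
bigram_counts = count_ngrams(reference, n=2)
```

Both parameters, n and the threshold, are explicit arguments here so that they can be tuned for accuracy as described earlier.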
The XLE parser outputs additional statistics for each sentence
that we encode in six features:
- An integer indicating starredness (0 or 1) and
various parser exceptions
(-1 for time out, -2 for exceeded memory, etc.)
- The number of optimal parses (see footnote 4)
- The number of unoptimal parses
- The duration of parsing
- The number of subtrees
- The number of words
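One way to picture the six features is as a fixed-length record per sentence. The sketch below is illustrative only: the class and field names are hypothetical, the unit of the duration (seconds) is an assumption, and the example values are invented.

```python
from dataclasses import dataclass, astuple


@dataclass
class XleFeatures:
    status: int            # 0 = not starred, 1 = starred,
                           # -1 = time out, -2 = exceeded memory, etc.
    optimal_parses: int    # number of optimal parses
    unoptimal_parses: int  # number of unoptimal parses
    duration: float        # parsing time (assumed here to be in seconds)
    subtrees: int          # number of subtrees
    words: int             # number of words


# Invented values for a single starred sentence.
vector = astuple(XleFeatures(1, 2, 5, 0.8, 143, 9))
```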
Training data for the decision tree learner is composed of
feature vectors from grammatical sentences and
feature vectors from ungrammatical sentences of each
error type, resulting in equal amounts of grammatical
and ungrammatical training data.
We choose the Weka implementation of machine learning algorithms
for the experiments [Witten and Frank 2000].
We use a J48 decision tree learner with the default model.
Method 4 follows the setup of Method 3.
However, the features are the frequencies of the rarest n-grams
in the sentence, one for each of six values of n.
Therefore, the feature vector of one sentence contains 6 numbers.
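The rarest-n-gram features can be sketched as below. The toy example uses only two values of n to stay short, whereas the experiment uses six; the choice of values, the function names and the fallback of 0 for sentences shorter than n are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tags, n):
    """Return all contiguous n-grams of a POS tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def rarest_ngram_features(tags, counts_by_n):
    """For each n, return the reference frequency of the sentence's rarest n-gram."""
    features = []
    for n in sorted(counts_by_n):
        freqs = (counts_by_n[n][g] for g in ngrams(tags, n))
        # Assumption: a sentence too short to contain any n-gram gets 0.
        features.append(min(freqs, default=0))
    return features


# Toy reference corpus (hypothetical tags) and two values of n.
reference = [["DT", "NN", "VBZ"], ["DT", "NN", "VBD"]]
counts_by_n = {
    n: Counter(g for sent in reference for g in ngrams(sent, n))
    for n in (2, 3)
}
features = rarest_ngram_features(["DT", "NN", "VBZ"], counts_by_n)
```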
Method 5 combines the features of Methods 3 and 4 for training
a decision tree.
Footnotes
3. We use the XLE command
parse-testfile with parse-literally set to 1,
max xle scratch storage set to 1,000 MB,
timeout set to 60 seconds,
and the XLE English LFG.
Skimming is switched off and fragments are switched on.
4. Preferred versus dispreferred constraints are used to distinguish optimal parses from unoptimal ones.
jwagner@computing.dcu.ie