Next: Results Up: Error Detection Evaluation Previous: Error Detection Evaluation



Experimental Setup


Test Data and Evaluation Procedure

The following steps are carried out to produce training and test data for this experiment:

  1. Speech material, poems, captions and list items are removed from the BNC, leaving 4.2 million sentences. The order of sentences is randomised.
  2. For the purpose of cross-validation, the corpus is split into 10 parts.
  3. Each part is passed to the 4 automatic error insertion modules described in Section 3.3, resulting in 40 additional sets of varying size.
  4. The first 60,000 sentences of each of the 50 sets (the 10 original parts plus the 40 error sets), i.e. 3 million sentences in total, are parsed with XLE (see footnote 3).
  5. N-gram frequency information is extracted for the first 60,000 sentences of each set. An additional 20,000 sentences per set are extracted as held-out data.
  6. 10 sets with mixed error types are produced by joining a quarter of each respective error set.
  7. For each error type (including mixed errors) and cross-validation set, the 60,000 grammatical and 60,000 ungrammatical sentences are joined.
  8. Each cross-validation run uses one set out of the 10 as test data (120,000 sentences) and the remaining 9 sets for training (1,080,000 sentences).
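
The splitting and joining steps above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the function names are hypothetical, the error insertion modules and XLE parsing are left out, and the choice of which quarter of each error set enters the mixed set is an assumption, since the text only says that a quarter of each set is used.

```python
import random

def split_into_folds(sentences, k=10, seed=0):
    """Randomise sentence order, then deal the sentences into k parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def mixed_error_set(error_sets):
    """Join one quarter of each error set (with 4 error types).

    Assumption: the i-th error set contributes its i-th slice; the source
    only states that a quarter of each set is used.
    """
    mixed = []
    for i, errors in enumerate(error_sets):
        part = len(errors) // len(error_sets)
        mixed.extend(errors[i * part:(i + 1) * part])
    return mixed

def cross_validation_runs(folds):
    """Yield (training, test) pairs, holding out one fold per run."""
    for i, test in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, test
```

With 10 folds of 120,000 sentences each (60,000 grammatical plus 60,000 ungrammatical), every run holds out 120,000 sentences for testing and trains on the remaining 1,080,000.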

The experiment is a standard binary classification task. The methods classify the sentences of the test sets as grammatical or ungrammatical. We use the standard measures of precision, recall, f-score and accuracy (Figure 1).

Figure 1: Evaluation measures: tp = true positives, fp = false positives, tn = true negatives, fn = false negatives, pr = precision, re = recall
$\mathrm{precision} = tp / (tp + fp)$
$\mathrm{recall} = tp / (tp + fn)$
$\mathrm{f\mbox{-}score} = 2 \cdot pr \cdot re / (pr + re)$
$\mathrm{accuracy} = (tp + tn) / (tp + tn + fp + fn)$

True positives are ungrammatical sentences that are identified as such. The baseline precision and accuracy are both 50%, as half of the test data is ungrammatical. If 100% of the test data is classified as ungrammatical, recall is 100% and the f-score is $2/3$. Recall shows the accuracy we would get if the grammatical half of the test data were removed. Parametrised methods are first optimised for accuracy, and the other measures are then taken at that setting. Therefore, f-scores below the artificial $2/3$ baseline are still meaningful.
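
As a worked check of the $2/3$ baseline, the measures can be computed directly from the four counts. The function name `evaluate` is illustrative only, and returning 0 for an undefined ratio is an assumption of this sketch.

```python
def evaluate(tp, fp, tn, fn):
    """Return (precision, recall, f-score, accuracy) from the four counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pr * re / (pr + re) if pr + re else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    return pr, re, f, acc

# Flag every sentence of a balanced 120,000-sentence test set as
# ungrammatical: all 60,000 ungrammatical sentences become true positives
# and all 60,000 grammatical sentences become false positives.
pr, re, f, acc = evaluate(tp=60000, fp=60000, tn=0, fn=0)
```

Here precision and accuracy are 0.5, recall is 1.0, and the f-score is $2 \cdot 0.5 \cdot 1 / 1.5 = 2/3$.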

Method 1: Precision Grammar

According to the XLE documentation, a sentence is marked with a star (*) if its optimal solution uses a constraint marked as ungrammatical. We classify a sentence as ungrammatical if it is starred, if the parser raises an exception, or if the number of parses is zero.
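
The decision rule of Method 1 amounts to a disjunction of three conditions. The sketch below assumes the relevant information has already been extracted from the parser output into a record; the `ParseResult` type and its field names are hypothetical and do not reflect XLE's actual output format.

```python
from dataclasses import dataclass

@dataclass
class ParseResult:
    """Hypothetical summary of the parser output for one sentence."""
    starred: bool      # optimal solution uses a constraint marked ungrammatical
    exception: bool    # parser exception, e.g. a timeout
    num_parses: int    # number of parses found

def is_ungrammatical(result: ParseResult) -> bool:
    """Method 1: any of the three signals flags the sentence."""
    return result.starred or result.exception or result.num_parses == 0
```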

Method 2: POS N-grams

In each cross-validation run, the full data of the remaining 9 sets from step 2 of the data generation (see Section 4.1.1) is used as a reference corpus of $0.9 \times 4,200,000 \approx 3,800,000$ presumably grammatical sentences. The reference corpora and data sets are POS tagged with the IMS TreeTagger [Schmid 1994]. Frequencies of POS n-grams ( $n=2, \ldots, 7$) are counted in the reference corpora. A test sentence is flagged as ungrammatical if it contains an n-gram whose reference frequency is below a fixed threshold. Method 2 has two parameters: $n$ and the frequency threshold.
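
A minimal sketch of this thresholding scheme is given below. It assumes sentences are already available as lists of POS tags (the tagging step itself is not shown), uses no sentence-boundary padding, and treats sentences shorter than $n$ as not flagged; these details and all function names are assumptions, not taken from the experiment.

```python
from collections import Counter

def ngrams(tags, n):
    """All contiguous n-grams of a tag sequence, as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def count_ngrams(tagged_corpus, n):
    """Count POS n-gram frequencies over a reference corpus."""
    counts = Counter()
    for tags in tagged_corpus:
        counts.update(ngrams(tags, n))
    return counts

def flag_ungrammatical(tags, counts, n, threshold):
    """Flag a sentence if any of its n-grams is rarer than the threshold.

    N-grams unseen in the reference corpus have frequency 0.
    """
    return any(counts[g] < threshold for g in ngrams(tags, n))

reference = [["DT", "NN", "VBZ"], ["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]]
bigram_counts = count_ngrams(reference, n=2)
```

With this toy reference corpus, the sequence `["NN", "DT", "VBZ"]` is flagged at a threshold of 1 because the bigram `("NN", "DT")` never occurs in the reference data.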

Method 3: Decision Trees on XLE Output

The XLE parser outputs additional statistics for each sentence that we encode in six features:

Training data for the decision tree learner is composed of $9 \times 60,000 = 540,000$ feature vectors from grammatical sentences and $9 \times 15,000 = 135,000$ feature vectors from ungrammatical sentences of each of the 4 error types ( $4 \times 135,000 = 540,000$), resulting in equal amounts of grammatical and ungrammatical training data.

We use the Weka implementation of machine learning algorithms for the experiments [Witten and Frank 2000], specifically the J48 decision tree learner with the default model.

Method 4: Decision Trees on N-grams

Method 4 follows the setup of Method 3. However, the features are the frequencies of the rarest n-grams ( $n=2, \ldots, 7$) in the sentence. The feature vector of one sentence therefore contains six numbers, one for each value of $n$.
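
The feature extraction for Method 4 might look like the sketch below. It assumes the n-grams are POS n-grams counted as in Method 2 and that sentences are given as lists of POS tags. The value used when a sentence is too short to contain any n-gram of a given order (`missing=-1`) is purely an assumption, as the text does not say how this case is handled; all names are illustrative.

```python
from collections import Counter

def ngrams(tags, n):
    """All contiguous n-grams of a tag sequence, as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def build_counts(tagged_corpus, orders=range(2, 8)):
    """Reference frequencies for every n-gram order, keyed by n."""
    counts = {n: Counter() for n in orders}
    for tags in tagged_corpus:
        for n in orders:
            counts[n].update(ngrams(tags, n))
    return counts

def rarest_ngram_features(tags, counts, orders=range(2, 8), missing=-1):
    """One feature per order n: the frequency of the sentence's rarest n-gram.

    `missing` is an assumed placeholder for sentences shorter than n.
    """
    features = []
    for n in orders:
        freqs = [counts[n][g] for g in ngrams(tags, n)]
        features.append(min(freqs) if freqs else missing)
    return features

reference = [["DT", "NN", "VBZ"], ["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]]
counts = build_counts(reference)
features = rarest_ngram_features(["DT", "JJ", "NN", "VBZ"], counts)
```

Each sentence yields a vector of six numbers, which would then be passed to the decision tree learner in place of the XLE features.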

Method 5: Decision Trees on Combined Feature Sets

This method combines the features of Methods 3 and 4 for training a decision tree.



Footnotes

... XLE.3
We use the XLE command parse-testfile with parse-literally set to 1, max xle scratch storage set to 1,000 MB, timeout set to 60 seconds, and the XLE English LFG. Skimming is switched off and fragment parsing is switched on.
... parses4
Preferred versus dispreferred constraints are used to distinguish optimal parses from unoptimal ones.

jwagner@computing.dcu.ie