


Automatic Error Creation

The error creation procedure takes as input a part-of-speech-tagged corpus of sentences which are assumed to be well-formed, and outputs a corpus of ungrammatical sentences. The automatically introduced errors take the form of the four most common error types found in the manually created corpus, i.e. missing word errors, extra word errors, real-word spelling errors and agreement errors. For each sentence in the original tagged corpus, an attempt is made to automatically produce four ungrammatical sentences, one for each of the four error types. Thus, the output of the error creation procedure is, in fact, four error corpora.
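The overall flow can be illustrated with a minimal sketch. It assumes that each sentence is a list of (word, POS tag) pairs and that each error type is handled by a function which returns either an ill-formed sentence or None when no error can be introduced. The names build_error_corpora and creators are hypothetical and do not come from the original implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

Token = Tuple[str, str]                 # (word, POS tag)
Sentence = List[Token]
ErrorCreator = Callable[[Sentence], Optional[Sentence]]


def build_error_corpora(
    corpus: List[Sentence],
    creators: Dict[str, ErrorCreator],
) -> Dict[str, List[Sentence]]:
    """Return one error corpus per error type (hypothetical driver).

    A creator returns None when it cannot produce an error for a
    sentence, in which case that sentence contributes nothing to the
    corresponding corpus.
    """
    corpora: Dict[str, List[Sentence]] = {name: [] for name in creators}
    for sentence in corpus:
        for name, create in creators.items():
            ill_formed = create(sentence)
            if ill_formed is not None:
                corpora[name].append(ill_formed)
    return corpora
```

With four creators, one per error type, the returned dictionary holds the four error corpora, which need not be the same size because an attempt can fail for a given sentence.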

Missing Word Errors

In the manually created error corpus of Foster 2005, missing word errors are classified based on the part-of-speech (POS) of the missing word. 98% of the missing parts-of-speech come from the following list (the frequency distribution in the error corpus is given in brackets):
det (28%) > verb (23%) > prep (21%) > pro (10%) > noun (7%) > "to" (7%) > conj (2%)
We use this information when introducing missing word errors into the BNC sentences. For each sentence, all words with the above POS tags are noted. One of these is selected and deleted. The above frequency ordering is respected so that, for example, missing determiner errors are produced more often than missing pronoun errors. No ungrammatical sentence is produced if the original sentence contains just one word or if the sentence contains no words with parts-of-speech in the above list.
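One way to respect the frequency ordering is sketched below. This is an assumption: the text does not say how the ordering is enforced. Here, the POS category to delete is drawn with probability proportional to the percentages listed above, restricted to the categories present in the sentence. The tag names are the simplified labels from the list, not the tags of any real tagset, and create_missing_word_error is a hypothetical name.

```python
import random
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Percentages from the frequency list above, used as sampling weights.
MISSING_POS_WEIGHTS: Dict[str, int] = {
    "det": 28, "verb": 23, "prep": 21, "pro": 10,
    "noun": 7, "to": 7, "conj": 2,
}


def create_missing_word_error(
    sentence: List[Token],
    rng: Optional[random.Random] = None,
) -> Optional[List[Token]]:
    """Delete one word whose POS is in the list, or return None."""
    rng = rng or random.Random()
    if len(sentence) <= 1:
        return None  # one-word sentences are skipped
    # Group candidate positions by POS category.
    candidates: Dict[str, List[int]] = {}
    for i, (_, pos) in enumerate(sentence):
        if pos in MISSING_POS_WEIGHTS:
            candidates.setdefault(pos, []).append(i)
    if not candidates:
        return None  # no word with a listed POS
    tags = list(candidates)
    weights = [MISSING_POS_WEIGHTS[t] for t in tags]
    chosen_tag = rng.choices(tags, weights=weights, k=1)[0]
    index = rng.choice(candidates[chosen_tag])
    return sentence[:index] + sentence[index + 1:]
```

Sampling per sentence in this way makes determiner deletions more frequent than pronoun deletions across the corpus, but it will not reproduce the listed percentages exactly, since the outcome also depends on which categories each sentence contains.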

Extra Word Errors

We introduce extra word errors in the following three ways:
  1. Random duplication of any token within a sentence: That's the way we we learn here.
  2. Random duplication of any POS within a sentence: There it he was.
  3. Random insertion of an arbitrary token into the sentence: Joanna drew as a long breadth.
Apart from the case of duplicate tokens, the extra words are selected from a list of tagged words compiled from a random subset of the BNC. Again, our procedure for inserting extra words is based on the analysis of extra word errors in the 20,000-word error corpus of Foster 2005.
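The three strategies can be sketched as follows. The sketch rests on two assumptions: the tagged word list is represented as a dictionary from POS tag to words (the hypothetical lexicon argument), and "duplication of a POS" is read as inserting, next to an existing token, a different word with the same tag. All function names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]            # (word, POS tag)
Lexicon = Dict[str, List[str]]     # POS tag -> words (assumed format)


def duplicate_token(sentence: List[Token], rng: random.Random) -> Optional[List[Token]]:
    """Strategy 1: repeat a randomly chosen token in place."""
    if not sentence:
        return None
    i = rng.randrange(len(sentence))
    return sentence[:i + 1] + [sentence[i]] + sentence[i + 1:]


def duplicate_pos(sentence: List[Token], lexicon: Lexicon,
                  rng: random.Random) -> Optional[List[Token]]:
    """Strategy 2: after a token, insert a different word with the same POS."""
    positions = [i for i, (word, pos) in enumerate(sentence)
                 if any(w != word for w in lexicon.get(pos, []))]
    if not positions:
        return None
    i = rng.choice(positions)
    word, pos = sentence[i]
    extra = rng.choice([w for w in lexicon[pos] if w != word])
    return sentence[:i + 1] + [(extra, pos)] + sentence[i + 1:]


def insert_arbitrary(sentence: List[Token], lexicon: Lexicon,
                     rng: random.Random) -> Optional[List[Token]]:
    """Strategy 3: insert any word from the list at a random position."""
    pool = [(w, pos) for pos, words in lexicon.items() for w in words]
    if not sentence or not pool:
        return None
    i = rng.randrange(len(sentence) + 1)
    return sentence[:i] + [rng.choice(pool)] + sentence[i:]
```

Passing an explicit random.Random instance keeps a run reproducible, which is convenient when the same error corpus has to be regenerated.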

Real-Word Spelling Errors

We classify an error as a real-word spelling error if it can be corrected by replacing the erroneous word with another word with a Levenshtein distance of one from the erroneous word, e.g. "the" and "they". Based on the analysis of the manually created error corpus [Foster 2005], we compile a list of common English real-word spelling error word pairs. For each BNC sentence, the error creation procedure records all tokens in the sentence which appear as one half of one of these word pairs. One token is selected at random and replaced by the other half of the pair. The list of common real-word spelling error pairs contains such frequently occurring words as "is" and "a", and the procedure therefore produces an ill-formed sentence for most input sentences.
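The distance-one criterion and the pair-based replacement can be sketched as below. The pairs in the usage note are illustrative only and are not taken from the list compiled from Foster 2005. The function names are hypothetical, and matching is done on lower-cased words, so the original capitalisation is not preserved in this simplified version.

```python
import random
from typing import Dict, Iterable, List, Optional, Tuple


def is_one_edit_apart(a: str, b: str) -> bool:
    """True if the Levenshtein distance between a and b is exactly one."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal) word
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # a single substitution
    return a[i:] == b[i + 1:]            # a single insertion or deletion


def build_confusion_table(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Index each word pair in both directions, checking the criterion."""
    table: Dict[str, List[str]] = {}
    for a, b in pairs:
        if not is_one_edit_apart(a, b):
            raise ValueError(f"not one edit apart: {a!r}, {b!r}")
        table.setdefault(a, []).append(b)
        table.setdefault(b, []).append(a)
    return table


def create_spelling_error(words: List[str], table: Dict[str, List[str]],
                          rng: random.Random) -> Optional[List[str]]:
    """Replace one randomly chosen listed word by its pair partner."""
    positions = [i for i, w in enumerate(words) if w.lower() in table]
    if not positions:
        return None
    i = rng.choice(positions)
    replacement = rng.choice(table[words[i].lower()])
    return words[:i] + [replacement] + words[i + 1:]
```

For instance, with the illustrative table build_confusion_table([("the", "they"), ("is", "in")]), the sentence "She is here" can only become "She in here", since "is" is its only listed word.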

Agreement Errors

We introduce subject-verb and determiner-noun number agreement errors into the BNC sentences. We consider both types of agreement error equally likely and introduce the error by replacing a singular determiner, noun or verb with its plural counterpart, or vice versa. For English, subject-verb agreement errors can only be introduced for present tense verbs, and determiner-noun agreement errors can only be introduced for determiners which are marked for number, e.g. demonstratives and the indefinite article. The procedure would be more productive if applied to a morphologically richer language.
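A much simplified sketch of this step is given below. It assumes a small hand-written table of singular/plural counterparts (the hypothetical COUNTERPART dictionary) in place of a real morphological lexicon, and the simplified tags "det" and "verb". To keep the example short, it flips only number-marked determiners and present tense verbs, although the procedure described above may also change the noun. The replacement is lower-cased, so capitalisation is not preserved.

```python
import random
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Toy number-counterpart table; a real system would need a full lexicon.
COUNTERPART: Dict[str, str] = {
    "this": "these", "these": "this",
    "that": "those", "those": "that",
    "is": "are", "are": "is",
    "walks": "walk", "walk": "walks",
}


def create_agreement_error(sentence: List[Token],
                           rng: random.Random) -> Optional[List[Token]]:
    """Flip the number of one determiner or one present tense verb."""
    det_noun = [i for i, (w, pos) in enumerate(sentence)
                if pos == "det" and w.lower() in COUNTERPART]
    subj_verb = [i for i, (w, pos) in enumerate(sentence)
                 if pos == "verb" and w.lower() in COUNTERPART]
    # Choose between the available error types with equal probability.
    options = [positions for positions in (det_noun, subj_verb) if positions]
    if not options:
        return None
    i = rng.choice(rng.choice(options))
    word, pos = sentence[i]
    return sentence[:i] + [(COUNTERPART[word.lower()], pos)] + sentence[i + 1:]
```

Because the input is assumed to be well-formed, flipping the number of a single determiner or verb breaks agreement with the word it depends on. A sentence such as "The dog barked" yields None here, since "the" is not marked for number and "barked" is not present tense.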


Covert Errors

James 1998 uses the term covert error to describe a genuine language error which results in a sentence which is syntactically well-formed under some interpretation different from the intended one. The prominence of covert errors in our automatically created error corpus is estimated by manually inspecting 100 sentences of each error type. The percentage of grammatical structures that are inadvertently produced for each error type and an example of each one are shown below:

The occurrence of these covert errors can be reduced by fine-tuning the error creation procedure but they can never be completely eliminated. Indeed, they should not be eliminated from the test data, because, ideally, an optimal error detection system should be sophisticated enough to flag syntactically well-formed sentences containing covert errors as potentially ill-formed.[2]



Footnotes

[2] An example of this is given in the XLE User Documentation (http://www2.parc.com/isl/groups/nltt/xle/doc/). The authors remark that an ungrammatical reading of the sentence "Lets go to the store", in which "Lets" is missing an apostrophe, is preferable to the grammatical yet implausible analysis in which "Lets" is a plural noun.

jwagner@computing.dcu.ie