

Commonly Produced Grammatical Errors

Following Foster 2005, we define a sentence to be ungrammatical if all the words in the sentence are well-formed words of the language in question, but the sentence contains one or more errors. Such an error can be a performance slip, caused by carelessness or tiredness, or a competence error, caused by a lack of knowledge of a particular construction. This definition includes real-word spelling errors and excludes non-word spelling errors. It also excludes the abbreviated informal language used in electronic communication. Using the above definition as a guideline, a 20,000-word corpus of ungrammatical English sentences was collected from a variety of written texts including newspapers, academic papers, emails and website forums [Foster and Vogel 2004, Foster 2005]. The errors in the corpus were carefully analysed and classified in terms of how they might be corrected using the three word-level correction operators: insert, delete and substitute. The following frequency ordering of the correction operators was found:
substitute (48%) > insert (24%) > delete (17%) > combination (11%)
Stemberger 1982 reports the same ordering of the substitution, deletion and insertion correction operators in a study of native speaker spoken language slips. Among the grammatical errors which can be corrected by substituting one word for another, the most common errors are real-word spelling errors and agreement errors. In fact, 72% of all errors fall into one of the following four classes:
  1. missing word errors:
    What are the subjects? > What the subjects?
  2. extra word errors:
    Was that in the summer? > Was that in the summer in?
  3. real-word spelling errors:
    She could not comprehend. > She could no comprehend.
  4. agreement errors:
    She steered Melissa round a corner. > She steered Melissa round a corners.
Nicholls 1999 adopted a similar classification after analysing the errors in a learner corpus. Our research is currently limited to the four error types given above, i.e. missing word errors, extra word errors, real-word spelling errors and agreement errors. However, it could be extended to handle a wider class of errors.
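
To make the operator-based classification concrete, the following Python sketch labels an ungrammatical sentence with the word-level operator needed to correct it. It is only an illustration, not the procedure used to analyse the corpus: the function names `tokenize` and `correction_operator`, the regular-expression tokenisation, and the use of the alignment from Python's standard `difflib` module are all assumptions introduced here.

```python
import re
from difflib import SequenceMatcher


def tokenize(sentence):
    """Split into words and single punctuation marks (a simplifying assumption)."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def correction_operator(ungrammatical, corrected):
    """Name the word-level operator that turns `ungrammatical` into `corrected`.

    Returns "insert", "delete", "substitute", "combination", or "none".
    """
    bad, good = tokenize(ungrammatical), tokenize(corrected)
    ops = set()
    matcher = SequenceMatcher(a=bad, b=good, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace":
            # difflib's "replace" corresponds to the substitute operator.
            ops.add("substitute")
            # An uneven replacement also needs an insertion or a deletion.
            if (i2 - i1) < (j2 - j1):
                ops.add("insert")
            elif (i2 - i1) > (j2 - j1):
                ops.add("delete")
        else:
            ops.add(tag)  # "insert" or "delete"
    if not ops:
        return "none"
    return ops.pop() if len(ops) == 1 else "combination"


# The four example pairs from the list above: (ungrammatical, corrected).
examples = [
    ("What the subjects?", "What are the subjects?"),
    ("Was that in the summer in?", "Was that in the summer?"),
    ("She could no comprehend.", "She could not comprehend."),
    ("She steered Melissa round a corners.", "She steered Melissa round a corner."),
]
for bad, good in examples:
    print(correction_operator(bad, good), "|", bad)
```

On these four pairs the sketch labels the missing word error as insert, the extra word error as delete, and both the real-word spelling error and the agreement error as substitute. Telling the last two classes apart would need lexical or morphological knowledge that this sketch does not attempt to model.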


jwagner@computing.dcu.ie