Commonly Produced Grammatical Errors
Following Foster 2005, we define a sentence to be ungrammatical if all the words in the sentence are well-formed words of the language in question,
but the sentence contains one or more errors. An error can take the form of a performance slip, which occurs due to
carelessness or tiredness, or a competence error, which occurs due to a lack of knowledge of a particular construction.
This definition includes real-word spelling errors and excludes non-word spelling errors.
It also excludes the abbreviated informal language
used in electronic communication.
Using the above definition as a guideline, a 20,000-word corpus of ungrammatical English sentences was collected from a variety of written texts, including newspapers, academic papers, emails and website
forums [Foster and Vogel 2004, Foster 2005].
The errors in the corpus were carefully analysed and classified in terms of
how they might be corrected using the three word-level correction operators: insert, delete and substitute.
The following frequency ordering of the three word-level correction operators
was found:
substitute (48%)
insert (24%)
delete (17%)
combination (11%)
Stemberger 1982 reports the same ordering of the substitution, deletion and insertion correction operators in a study of native speaker spoken language slips. Among the grammatical errors which can be corrected by substituting one word for another, the most common errors are real-word spelling errors
and agreement errors.
In fact, 72% of all errors fall into one of the following four classes:
- missing word errors:
What are the subjects?
What the subjects?
- extra word errors:
Was that in the summer?
Was that in the summer in?
- real-word spelling errors:
She could not comprehend.
She could no comprehend.
- agreement errors:
She steered Melissa round a corner.
She steered Melissa round a corners.
Nicholls 1999 adopted a similar classification after analysing the errors in a learner corpus.
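To make the classification by correction operator concrete, the following Python sketch takes an ungrammatical sentence and its corrected counterpart and names the word-level operator that maps one onto the other. It is an illustration only, not the procedure used to analyse the corpus: the function names `tokenize` and `correction_operator`, the regular-expression tokenizer and the use of the standard library's `difflib` alignment are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def tokenize(sentence):
    # Assumed tokenizer: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)


def correction_operator(ungrammatical, corrected):
    """Name the word-level operator that turns `ungrammatical` into `corrected`.

    Returns "insert", "delete", "substitute" or "combination",
    or None if the two sentences have identical tokens.
    """
    source = tokenize(ungrammatical)
    target = tokenize(corrected)
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    # Keep only the alignment steps that change something.
    edits = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not edits:
        return None
    if len(edits) > 1:
        return "combination"
    tag, i1, i2, j1, j2 = edits[0]
    removed, added = i2 - i1, j2 - j1
    if tag == "insert" and added == 1:
        return "insert"        # a word is missing from the sentence
    if tag == "delete" and removed == 1:
        return "delete"        # the sentence contains an extra word
    if tag == "replace" and removed == 1 and added == 1:
        return "substitute"    # one word stands in for another
    # Anything longer than a single-word edit is treated as a combination.
    return "combination"


print(correction_operator("What the subjects?", "What are the subjects?"))
print(correction_operator("Was that in the summer in?", "Was that in the summer?"))
print(correction_operator("She could no comprehend.", "She could not comprehend."))
```

Applied to the example pairs above, the sketch returns `insert` for the missing word error, `delete` for the extra word error and `substitute` for both the real-word spelling error and the agreement error. Telling the last two classes apart would need linguistic knowledge that a purely token-level alignment does not have.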
Our research is currently limited to the four error types given above,
i.e. missing word errors, extra word errors, real-word spelling errors
and agreement errors.
However, it could be extended to handle a wider class
of errors.
jwagner@computing.dcu.ie