An Artificial Error Corpus
A meaningful comparison of shallow and deep approaches to automatic error detection requires a large test set of ungrammatical sentences.
A corpus of ungrammatical sentences can take the form of a learner corpus [Granger 1993, Emi 2004], i.e. a corpus of sentences produced by language learners, or it can take the form of a more general error corpus comprising sentences which are not necessarily produced in a language-learning context and which contain competence and performance errors produced by native and non-native speakers of the language [Becker 1999, Foster and Vogel 2004, Foster 2005].
For both types of error corpus, it is not enough to collect a large set of sentences which are likely to contain an error; it is also necessary to examine each sentence in order to determine whether an error has actually occurred and, if it has, to note the nature of the error. Thus, like the creation of a treebank, the creation of a corpus of ungrammatical sentences requires time and linguistic knowledge, and is by no means a trivial task.
A corpus of ungrammatical sentences which is large enough to be useful can be created automatically by inserting, deleting or replacing words in grammatical sentences. These transformations should be linguistically realistic and should, therefore, be based on an analysis of naturally produced grammatical errors.
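The three word-level operations mentioned above can be illustrated with a minimal Python sketch. It shows only the mechanics of inserting, deleting and replacing a word, and it records which operation was applied and where, so that the nature of each error is known without manual inspection. The function names, the `vocabulary` argument (assumed to contain at least two distinct words) and the uniformly random choice of operation and position are assumptions made for illustration; they are not the procedure used in this work, which bases its transformations on an analysis of naturally produced errors.

```python
import random


def insert_word(tokens, index, word):
    """Return a copy of tokens with word inserted before position index."""
    return tokens[:index] + [word] + tokens[index:]


def delete_word(tokens, index):
    """Return a copy of tokens with the word at position index removed."""
    return tokens[:index] + tokens[index + 1:]


def replace_word(tokens, index, word):
    """Return a copy of tokens with the word at position index replaced."""
    return tokens[:index] + [word] + tokens[index + 1:]


def corrupt(sentence, rng, vocabulary):
    """Apply one randomly chosen operation to a whitespace-tokenised sentence.

    Returns the altered sentence together with the operation name and the
    token position, so the introduced change is documented automatically.
    Hypothetical helper: a realistic generator would pick operations and
    positions from an analysis of attested errors, not uniformly at random.
    """
    tokens = sentence.split()
    operation = rng.choice(["insert", "delete", "replace"])
    if len(tokens) < 2:
        # Always insert into very short input, so the result is never empty.
        operation = "insert"

    if operation == "insert":
        index = rng.randrange(len(tokens) + 1)
        result = insert_word(tokens, index, rng.choice(vocabulary))
    elif operation == "delete":
        index = rng.randrange(len(tokens))
        result = delete_word(tokens, index)
    else:
        index = rng.randrange(len(tokens))
        # Exclude the original word so that the sentence really changes.
        candidates = [w for w in vocabulary if w != tokens[index]]
        result = replace_word(tokens, index, rng.choice(candidates))

    return " ".join(result), operation, index


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    print(corrupt("the cat sat on the mat", rng, ["of", "a", "is"]))
```

Because the operation and position are returned alongside the altered sentence, every generated sentence comes with a record of the change made to it. The sketch does not check whether the altered sentence is in fact ungrammatical.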
Automatically generated error corpora have been used before in natural language processing. Bigert 2004 and Wilcox-O'Hearn et al. 2006, for example, automatically introduce spelling errors into texts. Here, we generate a large error corpus by automatically inserting four different kinds of grammatical errors into sentences from the British National Corpus (BNC).
jwagner@computing.dcu.ie