
N-gram Methods

Most shallow approaches to grammar error detection originate in the area of real-word spelling error correction. A real-word spelling error is a spelling or typing error that results in a token which is itself a valid word of the language in question.

The oldest work in this area, to our knowledge, is that of Atwell 1987, who uses a POS tagger to flag POS bigrams that are unlikely according to a reference corpus. Although he speculates that the bigram frequency should be compared with how often the same POS bigram is involved in errors in an error corpus, the proposed system uses the raw frequency together with an empirically established threshold to decide whether a bigram indicates an error. The same paper presents a completely different approach that uses the same POS tagger to consider spelling variants that have a different POS. In the example sentence "I am very hit", the POS of the spelling variant hot/JJ is added to the list NN-VB-VBD-VBN of possible POS tags of hit. If the POS tagger then chooses hit/JJ, the word is flagged and the correction hot is proposed to the user. Unlike most n-gram-based approaches, Atwell's work aims to detect grammar errors in general and not just real-word spelling errors. However, a complete evaluation is missing.
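
The thresholding idea behind the first of these approaches can be illustrated with a minimal sketch. It is not Atwell's implementation: the function names, the toy tag sequences and the threshold value are all hypothetical, and the input is assumed to be already POS-tagged.

```python
from collections import Counter

def pos_bigram_counts(tagged_corpus):
    """Count POS bigrams in a reference corpus given as one tag list per sentence."""
    counts = Counter()
    for tags in tagged_corpus:
        counts.update(zip(tags, tags[1:]))
    return counts

def flag_rare_bigrams(tags, counts, threshold):
    """Return each position i whose bigram (tags[i], tags[i+1]) was seen
    fewer than `threshold` times in the reference corpus."""
    return [i for i, bigram in enumerate(zip(tags, tags[1:]))
            if counts[bigram] < threshold]

# Toy reference corpus (hypothetical data, Penn-Treebank-style tags).
reference = [
    ["PRP", "VBP", "RB", "JJ"],
    ["PRP", "VBP", "RB", "JJ"],
    ["PRP", "VBP", "JJ"],
    ["DT", "NN", "VBZ", "JJ"],
]
counts = pos_bigram_counts(reference)

# "I am very hit" with hit tagged as VBN: the bigram (RB, VBN) is unseen.
flags = flag_rare_bigrams(["PRP", "VBP", "RB", "VBN"], counts, threshold=1)
```

In practice the threshold would be established empirically, as described above, instead of being fixed by hand.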

The idea of disambiguating between the elements of confusion sets (sets of words that are easily mistaken for one another) is related to word sense disambiguation. Golding 1995 builds a classifier based on a rich set of context features. Mays et al. 1991 apply the noisy channel model to the disambiguation problem. For each candidate correction $S'$ of the input $S$, the probability $P(S') P(S\vert S')$ is calculated, and the most likely correction is selected. This method is re-evaluated by Wilcox-O'Hearn et al. 2006 on WSJ data with artificial real-word spelling errors.
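
The selection of the candidate $S'$ that maximises $P(S') P(S\vert S')$ can be sketched as follows. This is an illustration under stated assumptions, not the system of Mays et al.: the language model is a toy bigram table, the channel model assumes that each word is typed as intended with probability `alpha` and that the remaining mass is split evenly over its confusion set, confusion sets are assumed to be symmetric, and all names and numbers are hypothetical.

```python
import math

def candidates(sentence, confusion_sets):
    """Yield the input itself plus every variant with exactly one word replaced."""
    yield list(sentence)
    for i, word in enumerate(sentence):
        for alt in confusion_sets.get(word, ()):
            yield sentence[:i] + [alt] + sentence[i + 1:]

def channel_logprob(observed, intended, confusion_sets, alpha):
    """log P(S | S') under the assumed channel model (symmetric confusion sets)."""
    total = 0.0
    for obs, word in zip(observed, intended):
        variants = confusion_sets.get(word, ())
        if obs == word:
            # Words without variants cannot be mistyped in this toy model.
            total += math.log(alpha if variants else 1.0)
        else:
            total += math.log((1.0 - alpha) / len(variants))
    return total

def correct(sentence, lm_logprob, confusion_sets, alpha=0.99):
    """Return the candidate S' with the highest log P(S') + log P(S | S')."""
    return max(
        candidates(sentence, confusion_sets),
        key=lambda c: lm_logprob(c) + channel_logprob(sentence, c, confusion_sets, alpha),
    )

# Toy bigram language model with a constant floor for unseen bigrams (made-up values).
BIGRAM_LOGP = {("very", "hot"): math.log(0.05), ("very", "hit"): math.log(0.00001)}

def toy_lm_logprob(words):
    return sum(BIGRAM_LOGP.get(bg, math.log(0.001)) for bg in zip(words, words[1:]))

CONFUSION = {"hit": ["hot"], "hot": ["hit"]}
best = correct(["i", "am", "very", "hit"], toy_lm_logprob, CONFUSION)
```

A high `alpha` makes the model reluctant to change the input, so a correction is only proposed when the language model strongly prefers the variant.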

Bigert and Knutsson 2002 extend a basic n-gram approach by attempting to match low-frequency n-grams with similar n-grams in order to reduce overflagging. Furthermore, n-grams that cross clause boundaries are not flagged, and the similarity measure is adapted for phrase boundaries, which usually result in low-frequency n-grams.

Chodorow and Leacock 2000 use a mutual information measure in addition to the raw frequency of n-grams. Their ALEK system also employs other extensions to the basic approach; for example, its measures use frequency counts from both generic and word-specific corpora. It is not reported how much each of these extensions contributes to the overall performance.
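
To illustrate how a mutual information measure differs from raw frequency, the sketch below computes pointwise mutual information for word bigrams. The exact measure used in ALEK is not given here, so this formulation, the function name and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter

def make_pmi(sentences):
    """Build a PMI function from tokenised sentences (one token list per sentence)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(x, y):
        # Unseen bigrams get -inf: they co-occur less often than any seen pair.
        if bigrams[(x, y)] == 0:
            return float("-inf")
        p_xy = bigrams[(x, y)] / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        return math.log2(p_xy / (p_x * p_y))

    return pmi

# Toy corpus (hypothetical data).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
pmi = make_pmi(corpus)
score_seen = pmi("the", "cat")
score_unseen = pmi("the", "sat")
```

Unlike a raw count, the score is low when two frequent words co-occur less often than their individual frequencies would predict, which is the case a frequency threshold alone cannot detect.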

Rather than trying to implement all of the previous n-gram approaches, we implement the basic approach, which uses rare n-grams to predict grammaticality. The reliance on rare n-grams is common to all previous shallow approaches. We also test our approach on a wider class of grammatical errors.
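
As a rough sentence-level illustration of predicting grammaticality from rare n-grams, the sketch below uses the frequency of the rarest n-gram in a sentence as the only feature. The function names, the choice of word n-grams, the threshold and the toy corpus are assumptions made for this example and are not taken from the system described here.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_counts(reference_sentences, n):
    """Count n-grams in a reference corpus assumed to be grammatical."""
    counts = Counter()
    for tokens in reference_sentences:
        counts.update(ngrams(tokens, n))
    return counts

def min_ngram_frequency(tokens, counts, n):
    """Frequency of the rarest n-gram in the sentence (None if it has no n-grams)."""
    grams = ngrams(tokens, n)
    return min(counts[g] for g in grams) if grams else None

def predict_grammatical(tokens, counts, n, threshold):
    """Predict 'grammatical' unless some n-gram is rarer than the threshold."""
    freq = min_ngram_frequency(tokens, counts, n)
    return freq is None or freq >= threshold

# Toy reference corpus (hypothetical data).
reference = [
    ["she", "reads", "a", "book"],
    ["he", "reads", "a", "book"],
    ["she", "reads", "a", "paper"],
]
counts = train_counts(reference, n=2)
ok = predict_grammatical(["she", "reads", "a", "book"], counts, n=2, threshold=2)
bad = predict_grammatical(["she", "read", "a", "book"], counts, n=2, threshold=2)
```

With a corpus this small, many grammatical sentences would also contain unseen n-grams; the sketch only shows the mechanics of the decision.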


jwagner@computing.dcu.ie