Corpus Preprocessing for DCU CLEF 2004
Each type that has not been removed and is not considered to be an SGML tag is placed in a temporary file in random order. This file is then stemmed by Snowball. That means that tokens of the same type are stemmed only once and that context information cannot be used. (Context would be artificial anyway since we already removed stopwords.)
From the stemmed file, a mapping from types to stems is built and applied to the sequence of tokens.
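The type-level procedure described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the experiments: it works in memory instead of going through a temporary file, `toy_stem` is a hypothetical placeholder standing in for Snowball, and `is_sgml_tag`, `build_stem_map` and `apply_stem_map` are names invented for this example.

```python
from typing import Callable, Dict, Iterable, List


def toy_stem(word: str) -> str:
    """Placeholder stemmer (NOT Snowball): strips one trailing 's'."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def is_sgml_tag(token: str) -> bool:
    """Assumed tag test: anything of the form <...> counts as SGML."""
    return token.startswith("<") and token.endswith(">")


def build_stem_map(tokens: Iterable[str],
                   stem: Callable[[str], str]) -> Dict[str, str]:
    """Stem every distinct non-tag type exactly once."""
    types = {t for t in tokens if not is_sgml_tag(t)}
    return {t: stem(t) for t in types}


def apply_stem_map(tokens: Iterable[str],
                   stem_map: Dict[str, str]) -> List[str]:
    """Replace each token by its stem; tags and unmapped tokens pass through."""
    return [stem_map.get(t, t) for t in tokens]


# Stopwords are assumed to have been removed already.
tokens = ["<DOC>", "cats", "chase", "cats", "dogs", "</DOC>"]
stem_map = build_stem_map(tokens, toy_stem)
stemmed = apply_stem_map(tokens, stem_map)
print(stemmed)
```

Because the stemmer only ever sees isolated types, "cats" is stemmed once even though it occurs twice, and the stemmer has no access to neighbouring tokens.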
What problems had to be solved to use Snowball? Anything that involved a decision and hence may have influenced our results?
Thursday, 14-Oct-2004 19:14:39 IST