Corpus Preprocessing for DCU CLEF 2004
for each line:
Then any sequence of whitespace separates tokens. A token that starts with < and ends with > is considered to be an SGML tag. Note that a tag cannot contain whitespace, because any such tag would already have been split into at least two tokens during tokenisation.
The non-breaking space (\xa0 in ISO 8859-1) seems not to be treated as whitespace: some tokens in the Finnish corpus start with this character. Fortunately, the other steps accepted the non-breaking space as a token character as well.
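
The tokenisation behaviour described above can be sketched in Python roughly as follows. This is an illustration only, not the original preprocessing code: the names tokenise and is_sgml_tag are hypothetical, and the exact set of characters treated as whitespace (the ASCII ones) is an assumption. The whitespace class is spelled out explicitly because Python's \s and str.split() would also treat \xa0 as whitespace, which is not what was observed here.

```python
import re

# Assumption: only ASCII whitespace separates tokens. The non-breaking
# space (\xa0) is deliberately left out, so it stays part of a token.
TOKEN_RE = re.compile(r"[^ \t\n\r\f\v]+")


def tokenise(line):
    """Split one line into tokens at any run of ASCII whitespace."""
    return TOKEN_RE.findall(line)


def is_sgml_tag(token):
    """A token counts as an SGML tag if it starts with < and ends with >."""
    return token.startswith("<") and token.endswith(">")


if __name__ == "__main__":
    # \xa0 is kept as the first character of the second token.
    print(tokenise("<DOC> \xa0Helsinki on  kaunis </DOC>\n"))
    # A tag containing whitespace is split, and neither piece is a tag.
    print([is_sgml_tag(t) for t in tokenise('<DOC id="1">')])
```

With this definition, a tag written with an attribute, such as <DOC id="1">, falls apart into the two tokens <DOC and id="1">, neither of which passes the tag test.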
Thursday, 14-Oct-2004 19:06:19 IST