Corpus Preprocessing for DCU CLEF 2004
Punctuation was separated from the words during tokenisation and encoded as individual tokens. We reconstruct it by replacing each encoded punctuation token, together with the space before it, with the corresponding punctuation mark. This restores the sentence structure for other modules, for instance text summarisation.
Why do we only do this if the encoded punctuation is followed by
whitespace, for example s/\ssb(\s)/\,$1/g;?
Answer: First, other encoded words may start with 'sb'.
Second, if a word starts with or contains punctuation, that
punctuation cannot be replaced afterwards, because the word is
encoded as one unit.
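
The substitution above can be sketched in Python as follows. This is a minimal illustration, not the original script: it assumes, based on the example, that 'sb' is the code for a comma. The function name restore_punctuation, the table PUNCTUATION_CODES and the token 'sbxyz' are hypothetical names introduced for the example.

```python
import re

# Only the 'sb' -> ',' pair is taken from the example above (assumption);
# a real table would list every encoded punctuation token.
PUNCTUATION_CODES = {"sb": ","}


def restore_punctuation(text, codes=PUNCTUATION_CODES):
    """Replace whitespace-delimited punctuation codes with their marks."""
    for code, mark in codes.items():
        # Whitespace is required on both sides of the code. The leading
        # whitespace is dropped and the trailing whitespace is kept,
        # mirroring s/\ssb(\s)/\,$1/g.
        pattern = r"\s" + re.escape(code) + r"(\s)"
        text = re.sub(pattern, lambda m, mark=mark: mark + m.group(1), text)
    return text


# 'sbxyz' stands in for some other encoded word that starts with 'sb'.
encoded = "first sb second sbxyz third"
print(restore_punctuation(encoded))
```

This prints "first, second sbxyz third": the standalone 'sb' becomes a comma, while 'sbxyz' is left alone because no whitespace follows its 'sb' prefix.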
Tuesday, 22-Nov-2005 12:24:45 GMT