Corpus Preprocessing for DCU CLEF 2004
The Russian CLEF corpus 'IZVESTIA' is encoded in UTF-8. We convert the documents to KOI8-R, using ISO 8859-5 as an intermediate step: the converters we found on the web do not handle Unicode, whereas converting Unicode to ISO 8859-5 is very easy.
The corpus contains some box-drawing symbols (Unicode U+25xx) that apparently stand in for other characters. Before any conversion, we replaced them at the Unicode level: where the intended character was unambiguous we mapped the symbol to the correct Russian character, and otherwise to whitespace. Note that we have to omit the degree sign, even though KOI8-R supports it, because ISO 8859-5 does not contain it. In addition, we map the vowel jo (ё) to je (е), so that both are represented by one character.
| from | to | comment |
|---|---|---|
| U+00b0 (°) | \x20 (SPACE) | degree sign, not available in ISO 8859-5 |
| U+2510 (┐) | \x20 (SPACE) | maybe the number sign, occurs once |
| U+2514 (└) | \x20 (SPACE) | once |
| U+2524 (┤) | \x20 (SPACE) | no comment |
| U+252c (┬) | \x20 (SPACE) | nothing |
| U+2553 (╓) | \x20 (SPACE) | may be the question mark |
| U+2554 (╔) | U+0422 (Т) | occurs in one sentence only |
| U+2557 (╗) | \x20 (SPACE) | most likely the left parenthesis |
| U+2558 (╘) | \x20 (SPACE) | apostrophe or back quote or front quote |
| U+2559 (╙) | \x20 (SPACE) | closing quote, maybe >> (») |
| U+255b (╛) | \x20 (SPACE) | could be a quote in all but one case |
| U+255d (╝) | \x20 (SPACE) | appears to be a dash |
| U+2561 (╡) | U+043e (о) | occurs in one sentence only |
| U+2563 (╣) | \x20 (SPACE) | non-word character |
| U+2565 (╥) | \x20 (SPACE) | ambiguous: degree sign, quote, bI (ы) |
| U+2569 (╩) | \x20 (SPACE) | ambiguous |
| U+256c (╬) | U+0447 (ч) | occurs once |
| U+0451 (ё) | U+0435 (е) | optionally distinguished vowels je and jo |
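The table can be applied in a single pass before encoding. The following is a minimal sketch, not the script we actually ran: the names `TO_SPACE`, `PRE_ISO_MAP` and `to_iso8859_5` are made up for illustration, and it assumes Python 3 with `str.translate`.

```python
# Illustrative sketch of the mapping table above (hypothetical names).
# Code points that the table maps to a plain space.
TO_SPACE = [
    0x00B0, 0x2510, 0x2514, 0x2524, 0x252C, 0x2553, 0x2557,
    0x2558, 0x2559, 0x255B, 0x255D, 0x2563, 0x2565, 0x2569,
]

PRE_ISO_MAP = {cp: " " for cp in TO_SPACE}
PRE_ISO_MAP.update({
    0x2554: "\u0422",  # box symbol -> Cyrillic capital Te
    0x2561: "\u043e",  # box symbol -> Cyrillic small o
    0x256C: "\u0447",  # box symbol -> Cyrillic small che
    0x0451: "\u0435",  # small jo -> small je
})


def to_iso8859_5(text: str) -> bytes:
    """Apply the table on the Unicode level, then encode to ISO 8859-5."""
    cleaned = text.translate(PRE_ISO_MAP)
    # 'replace' turns any remaining unencodable character into '?'.
    return cleaned.encode("iso8859-5", "replace")
```

Only the characters listed in the table are handled here; anything else that ISO 8859-5 cannot represent falls through to the `'replace'` error handler.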
Use the Unicode module of your favorite programming language to convert between different encodings. For example, in Python you write:
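
    import sys

    # Read raw bytes, decode them as UTF-8, re-encode as ISO 8859-5.
    binaryData = sys.stdin.buffer.read()
    unicodeString = binaryData.decode('utf-8')
    # 'replace' substitutes '?' for characters ISO 8859-5 lacks.
    isoData = unicodeString.encode('iso8859-5', 'replace')
    sys.stdout.buffer.write(isoData)
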
To convert from ISO 8859-5 to KOI8-R, we used cyrcode.pl, a universal Cyrillic decoder written by Ilya Sandler.
It is a simple 8-bit substitution engine that supports several Russian character sets and is restricted to the basic Russian characters.
It leaves all other input bytes unchanged.
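To illustrate what an 8-bit substitution of this kind does, here is a small Python sketch. It is not cyrcode.pl and makes no claim about how that script is implemented: the names `RUSSIAN`, `build_table` and `iso_to_koi8r` are hypothetical, and treating "basic Russian characters" as the letters U+0410 to U+044F plus Ё/ё is our assumption.

```python
# Illustrative 8-bit substitution, ISO 8859-5 -> KOI8-R (hypothetical names).
# Assumption: "basic Russian characters" = U+0410..U+044F plus capital/small jo.
RUSSIAN = [chr(cp) for cp in range(0x0410, 0x0450)] + ["\u0401", "\u0451"]


def build_table(src: str = "iso8859-5", dst: str = "koi8-r") -> bytes:
    """Build a 256-entry byte table; unmapped bytes keep their value."""
    table = bytearray(range(256))  # start from the identity mapping
    for ch in RUSSIAN:
        (s,) = ch.encode(src)      # single byte in the source charset
        (d,) = ch.encode(dst)      # single byte in the target charset
        table[s] = d
    return bytes(table)


ISO_TO_KOI8R = build_table()


def iso_to_koi8r(data: bytes) -> bytes:
    """Substitute byte for byte; non-Russian bytes pass through as they are."""
    return data.translate(ISO_TO_KOI8R)
```

Because unmapped bytes pass through untouched, a non-letter byte above 0x7F that survives in the ISO 8859-5 data would keep its value but mean something different in KOI8-R. This is one reason to clean up such characters on the Unicode level first.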
Thursday, 14-Oct-2004 19:15:20 IST