Corpus Preprocessing for DCU CLEF 2004
For each line, we applied a general ISO 8859-1 case mapping from upper case to lower case:
| from | to |
|---|---|
| \x41 - \x5a (A - Z) | \x61 - \x7a (a - z) |
| \xc0 - \xd6 (À - Ö) | \xe0 - \xf6 (à - ö) |
| \xd8 - \xde (Ø - Þ) | \xf8 - \xfe (ø - þ) |
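
The ISO 8859-1 mapping above can be sketched as a byte translation table. This is a minimal illustration, not the code used in the experiments: the names `LATIN1_LOWER` and `lowercase_latin1` are hypothetical, and the accented range is assumed to start at the byte for À (0xC0), following the characters named in the table. Each upper-case byte is mapped to the byte 0x20 above it; all other bytes, including × (0xD7) and ß (0xDF), pass through unchanged.

```python
# Hypothetical sketch of the ISO 8859-1 case mapping described in the table.
# Ranges are inclusive and follow the characters named there:
# A-Z, À-Ö and Ø-Þ. The byte 0xD7 (multiplication sign) is not a letter
# and is deliberately left out.
UPPER_RANGES = [(0x41, 0x5A), (0xC0, 0xD6), (0xD8, 0xDE)]


def build_latin1_lowercase_table() -> bytes:
    """Build a 256-byte translation table mapping upper case to lower case."""
    src = bytearray()
    dst = bytearray()
    for start, end in UPPER_RANGES:
        for code in range(start, end + 1):
            src.append(code)
            # In ISO 8859-1 the lower-case letter sits 0x20 above the upper-case one.
            dst.append(code + 0x20)
    return bytes.maketrans(bytes(src), bytes(dst))


LATIN1_LOWER = build_latin1_lowercase_table()


def lowercase_latin1(line: bytes) -> bytes:
    """Lower-case one line of ISO 8859-1 encoded text, byte by byte."""
    return line.translate(LATIN1_LOWER)


if __name__ == "__main__":
    sample = "ÅRHUS and Zürich".encode("latin-1")
    print(lowercase_latin1(sample).decode("latin-1"))
```

Working on raw bytes rather than decoded strings keeps the sketch independent of the locale and leaves any byte outside the three ranges untouched.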
When reading the following KOI8-R table, note that characters in KOI8-R are not in Russian alphabetical order; see the Wikipedia article on KOI8-R for details. Note also that we did not map Latin letters, which might appear in proper nouns.
| from | to |
|---|---|
| \xe0 - \xff (Ю - Ъ) | \xc0 - \xdf (ю - ъ) |
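
The KOI8-R mapping can be sketched the same way. Again this is only an illustration under stated assumptions: `KOI8R_LOWER` and `lowercase_koi8r` are hypothetical names, and only the byte range given in the table is mapped. In KOI8-R the lower-case Cyrillic letters sit 0x20 below the upper-case ones, and Latin letters are left as they are, as described above.

```python
# Hypothetical sketch of the KOI8-R case mapping described in the table.
# Upper-case Cyrillic bytes 0xE0-0xFF map to lower-case bytes 0xC0-0xDF.
# ASCII bytes (including Latin letters) are not in the table and so are
# passed through unchanged.
KOI8R_LOWER = bytes.maketrans(
    bytes(range(0xE0, 0x100)),  # upper case: 0xE0 .. 0xFF
    bytes(range(0xC0, 0xE0)),   # lower case: 0xC0 .. 0xDF
)


def lowercase_koi8r(line: bytes) -> bytes:
    """Lower-case the Cyrillic letters in one line of KOI8-R encoded text."""
    return line.translate(KOI8R_LOWER)


if __name__ == "__main__":
    sample = "Москва IBM".encode("koi8_r")
    print(lowercase_koi8r(sample).decode("koi8_r"))
```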
Thursday, 14-Oct-2004 19:14:59 IST