Corpus Preprocessing for DCU CLEF 2004
Always check that the DOCID and DOCNO fields are intact and that their content has not been encoded. If a field extends over several lines, it will not be passed through correctly. Furthermore, these fields should not contain whitespace at all (they must not contain more than one sequence of whitespace), because words that also occur in the normal text will be encoded in any case.
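The check described above could be automated along the following lines. This is only a minimal sketch: it assumes the fields appear as <DOCID>...</DOCID> and <DOCNO>...</DOCNO> elements, and the helper name check_id_fields is hypothetical, not part of the scripts mentioned here. It reports fields that span several lines or contain inner whitespace; it does not detect content that has already been encoded.

```python
import re

# Assumption: ID fields are written as <DOCID>...</DOCID> / <DOCNO>...</DOCNO>.
FIELD_RE = re.compile(r"<(DOCID|DOCNO)>(.*?)</\1>", re.DOTALL)


def check_id_fields(sgml_text):
    """Return a list of (tag, value, problem) tuples for suspicious ID fields.

    Hypothetical helper for illustration only.
    """
    problems = []
    for match in FIELD_RE.finditer(sgml_text):
        tag, value = match.group(1), match.group(2)
        if "\n" in value:
            # A field broken over several lines would not be passed correctly.
            problems.append((tag, value, "spans several lines"))
        elif re.search(r"\s", value.strip()):
            # Whitespace inside the identifier itself (padding is ignored).
            problems.append((tag, value, "contains inner whitespace"))
    return problems


if __name__ == "__main__":
    sample = "<DOC>\n<DOCNO>LEM\n94</DOCNO>\n<DOCID>ATS 940101</DOCID>\n</DOC>"
    for tag, value, problem in check_id_fields(sample):
        print(f"{tag}: {value!r} {problem}")
```

Running such a check on every file before and after preprocessing would show quickly whether any identifier was damaged.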
The scripts do_f[ir]_preproc.sh can be used if all documents are in sibling directories named *_iso and have filenames with the extension .sgml. For example,
../nonE_workspace/fr/orig/ats95_iso
../nonE_workspace/fr/orig/ats94_iso
../nonE_workspace/fr/orig/lem94_iso
../nonE_workspace/fr/orig/lem95_iso
and
../nonE_workspace/fr/orig/ats95_iso/ats_19950101.sgml
The scripts create *_encoded directories, in the examples
../nonE_workspace/fr/orig/ats95_encoded
and so on, and place the encoded files in them with the new extension
_encoded.sgml.
(I moved them down to fr/ later on.)
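The naming convention can be illustrated with a small path-mapping sketch. It assumes that "new extension _encoded.sgml" means the suffix replaces .sgml in the file name (so ats_19950101.sgml becomes ats_19950101_encoded.sgml). The function name encoded_path is hypothetical, and the shell scripts themselves may work differently.

```python
from pathlib import PurePosixPath


def encoded_path(iso_file):
    """Map .../X_iso/name.sgml to .../X_encoded/name_encoded.sgml.

    Hypothetical helper; the output file name pattern is an assumption.
    """
    src = PurePosixPath(iso_file)
    if not src.parent.name.endswith("_iso") or src.suffix != ".sgml":
        raise ValueError(f"unexpected input layout: {iso_file}")
    # Replace the trailing "_iso" of the directory name with "_encoded".
    out_dir = src.parent.with_name(src.parent.name[: -len("_iso")] + "_encoded")
    return str(out_dir / (src.stem + "_encoded.sgml"))


if __name__ == "__main__":
    print(encoded_path("../nonE_workspace/fr/orig/ats95_iso/ats_19950101.sgml"))
```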
The scripts do_stemming_ru[12].sh
work very similarly to the scripts described
above. They have not been renamed because they are also stored in CVS.
The scripts assume that the documents are available in ISO 8859-5 in
directories named _iso88595 and create directories
_koi8 and _encoded.
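The character set conversion step can be sketched in Python as follows. This is not what the do_stemming_ru scripts actually run. It assumes that "koi8" means KOI8-R (the text does not say which KOI8 variant is used), and the function name iso88595_to_koi8 is hypothetical.

```python
def iso88595_to_koi8(data, errors="strict"):
    """Re-encode bytes from ISO 8859-5 to KOI8-R (assumed variant).

    With errors="strict", characters that exist in ISO 8859-5 but not in
    KOI8-R raise UnicodeEncodeError; pass errors="replace" to substitute "?".
    """
    text = data.decode("iso8859_5")
    return text.encode("koi8_r", errors=errors)


if __name__ == "__main__":
    original = "Привет".encode("iso8859_5")
    converted = iso88595_to_koi8(original)
    print(converted.decode("koi8_r"))
```

The strict default is deliberate: a conversion failure is easier to investigate than a silently substituted character in an indexed document.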
Use the *preproc* scripts directly, for example
$ cp -dp /usr/local/clef/2004/Systran/Top-de04_GF.txt .
$ french-preproc.pl Top-de04_GF.txt Top-de04_GF_encoded.txt
The num fields can be brought onto a single line with the script num_correction.py.
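What bringing a num field onto a single line could look like is sketched below. This is not the code of num_correction.py, which is not reproduced here. It assumes the field is written as a <num>...</num> element, and join_num_fields is a hypothetical name.

```python
import re

# Assumption: the field is delimited by <num> and </num>.
NUM_RE = re.compile(r"<num>(.*?)</num>", re.DOTALL)


def join_num_fields(topic_text):
    """Collapse all whitespace (including newlines) inside <num> elements."""

    def collapse(match):
        inner = " ".join(match.group(1).split())
        return f"<num> {inner} </num>"

    return NUM_RE.sub(collapse, topic_text)


if __name__ == "__main__":
    broken = "<top>\n<num>\nC201\n</num>\n<title> x </title>\n</top>"
    print(join_num_fields(broken))
```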
Saturday, 16-Oct-2004 16:57:06 IST