How To Use Our Tools

Converting Document Collection

General Note

Always check that the DOCID and DOCNO fields are intact and that their content has not been encoded. If a field extends over several lines, it will not be passed through correctly. Furthermore, these fields should not contain any whitespace at all (at the very least, they must not contain more than one sequence of whitespace), because words that also occur in the normal text are encoded in any case.
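The check described above could be automated along the following lines. This is only a sketch, not part of our tools: it assumes the fields are written as <DOCID>...</DOCID> and <DOCNO>...</DOCNO>, and the function name check_id_fields is made up for illustration. It applies the stricter rule and flags any inner whitespace.

```python
import re

# Assumed tag names, taken from the field names mentioned above.
ID_TAGS = ("DOCID", "DOCNO")


def check_id_fields(sgml_text):
    """Return a list of (tag, problem) pairs for suspicious identifier fields."""
    problems = []
    for tag in ID_TAGS:
        pattern = re.compile(r"<%s>(.*?)</%s>" % (tag, tag), re.DOTALL)
        for match in pattern.finditer(sgml_text):
            value = match.group(1)
            if "\n" in value:
                # The field extends over several lines.
                problems.append((tag, "spans several lines"))
            elif len(value.split()) > 1:
                # Whitespace between two tokens inside the field.
                problems.append((tag, "contains inner whitespace"))
    return problems
```

An empty result means that no field was flagged. Leading and trailing blanks inside a field are tolerated by this sketch.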

Finnish and French

The scripts do_f[ir]_preproc.sh can be used if all documents are in sibling directories named *_iso and have filenames with the extension .sgml. For example,

../nonE_workspace/fr/orig/ats95_iso
../nonE_workspace/fr/orig/ats94_iso
../nonE_workspace/fr/orig/lem94_iso
../nonE_workspace/fr/orig/lem95_iso
and
../nonE_workspace/fr/orig/ats95_iso/ats_19950101.sgml

The scripts create *_encoded directories (in the example above, ../nonE_workspace/fr/orig/ats95_encoded and so on) and place the encoded files, with the new extension _encoded.sgml, in them. (I moved them down to fr/ later on.)
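To make the naming convention concrete, here is a small sketch that computes where the encoded version of a file would be expected. It is not taken from the scripts themselves. The helper name encoded_path is hypothetical, and the sketch assumes that "_encoded" is inserted before the .sgml extension of the original filename.

```python
import posixpath


def encoded_path(iso_file):
    """Map a file in a *_iso directory to its assumed *_encoded counterpart."""
    directory, filename = posixpath.split(iso_file)
    if not directory.endswith("_iso"):
        raise ValueError("expected a directory named *_iso: %r" % directory)
    stem, ext = posixpath.splitext(filename)
    if ext != ".sgml":
        raise ValueError("expected a .sgml file: %r" % filename)
    # Swap the directory suffix and append _encoded to the file stem.
    new_directory = directory[: -len("_iso")] + "_encoded"
    return posixpath.join(new_directory, stem + "_encoded" + ext)


print(encoded_path("../nonE_workspace/fr/orig/ats95_iso/ats_19950101.sgml"))
# prints ../nonE_workspace/fr/orig/ats95_encoded/ats_19950101_encoded.sgml
```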

Russian

The scripts do_stemming_ru[12].sh work very similarly to the scripts described above. They have not been renamed because they are also stored in CVS. The scripts assume that the documents are available in ISO 8859-5 in directories named _iso88595 and create directories _koi8 and _encoded.
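The character set conversion step can be illustrated as follows. This is only a sketch of the idea, not the code used by the scripts. It assumes that _koi8 means KOI8-R (the scripts may use a different KOI8 variant), and the function name is made up.

```python
def iso88595_to_koi8r(data, errors="strict"):
    """Re-encode bytes from ISO 8859-5 to KOI8-R (assumed target encoding).

    Some ISO 8859-5 characters have no KOI8-R equivalent; with the default
    errors="strict" they raise UnicodeEncodeError, while errors="replace"
    substitutes a question mark.
    """
    return data.decode("iso8859_5").encode("koi8_r", errors)


# Round trip of a short Cyrillic word, written with escapes to keep the
# example independent of the encoding of this page.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"
converted = iso88595_to_koi8r(word.encode("iso8859_5"))
print(converted.decode("koi8_r") == word)
# prints True
```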

Converting Topic Files

Use the *preproc* scripts directly, for example:

$ cp -dp /usr/local/clef/2004/Systran/Top-de04_GF.txt .
$ french-preproc.pl Top-de04_GF.txt Top-de04_GF_encoded.txt

If a num field extends over several lines, it can be brought onto a single line with the script num_correction.py.
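For readers without access to num_correction.py, the effect can be approximated as shown below. This is not the actual script, only a guess at its behaviour: it assumes the field is written as <num>...</num> in lower case, and the function name join_num_fields is hypothetical.

```python
import re

# Assumed field markup; DOTALL lets the pattern match across line breaks.
NUM_FIELD = re.compile(r"<num>(.*?)</num>", re.DOTALL)


def join_num_fields(text):
    """Collapse every <num>...</num> field onto a single line."""
    def _collapse(match):
        # Replace all runs of whitespace, including newlines, by one blank.
        return "<num> %s </num>" % " ".join(match.group(1).split())
    return NUM_FIELD.sub(_collapse, text)
```

Text outside the num fields is left untouched, so the function can be applied to a whole topic file read into one string.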

Saturday, 16-Oct-2004 16:57:06 IST