Overview of Tools

The scripts call each other without absolute paths, so clef/bin must be on your PATH. For example, add the line export PATH=$PATH:/usr/local/clef/bin to your .bashrc.

Basic Tools

num_correction.py
joins a num element that is spread over several lines (<num> \n* pcdata </num>) into one line
finnish-preproc.pl
french-preproc.pl
russian-preproc_koi8.pl
russian2-preproc_koi8.pl (with stopword2)
tokenisation, de-capitalisation, stopword removal, stemming, word encoding and reconstruction of punctuation
www.jowagner.net/cgi-bin/baseconv.py
web interface to the word encoding
do_*.sh
convenience scripts
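
The joining step described for num_correction.py can be illustrated as follows. This is a minimal sketch and not the script itself: it assumes that the num element contains no nested tags and that every run of whitespace inside it, including line breaks, may be collapsed. The function name join_num is hypothetical.

```python
import re

# Matches a <num> element whose content may span several lines.
NUM_ELEMENT = re.compile(r"<num>(.*?)</num>", re.DOTALL)

def join_num(text):
    """Collapse every <num>...</num> element in text onto one line."""
    def collapse(match):
        # split() without arguments drops all whitespace, including "\n"
        content = " ".join(match.group(1).split())
        return "<num> " + content + " </num>"
    return NUM_ELEMENT.sub(collapse, text)

sample = "<top>\n<num>\n\nC141\n</num>\n<title> x </title>\n</top>\n"
print(join_num(sample))
```

Text outside the num element, including its line breaks, is left untouched.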

Character Set Conversion

cyrcode.pl
ISO 8859-5 <--> KOI8
utf82iso_plain.py
UTF-8 --> ISO 8859-5
utf82iso.py
UTF-8 --> ISO 8859-5; also corrects some character-level errors in the CLEF corpus
utf8_to_utf16.py, utf16_to_utf8.py
UTF-8 <--> UTF-16
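
The conversions listed above all follow the same decode-then-encode pattern. The sketch below shows that pattern with Python's standard codecs; it is not taken from the scripts. It assumes that KOI8 means KOI8-R (codec name koi8_r), which the source does not state, and the function name recode is hypothetical.

```python
def recode(data, source, target):
    """Decode bytes from the source encoding and re-encode them in the target encoding."""
    return data.decode(source).encode(target)

word = "привет"

# ISO 8859-5 <--> KOI8 (assumed here to be KOI8-R)
iso = word.encode("iso8859_5")
koi = recode(iso, "iso8859_5", "koi8_r")
assert recode(koi, "koi8_r", "iso8859_5") == iso

# UTF-8 <--> UTF-16 (Python's utf-16 codec writes and reads a byte order mark)
utf8 = word.encode("utf-8")
utf16 = recode(utf8, "utf-8", "utf-16")
assert recode(utf16, "utf-16", "utf-8") == utf8
```

A character that does not exist in the target encoding raises UnicodeEncodeError in this sketch. How the actual scripts treat such characters is not described here.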

Character Set Analysis

chardump.py
charhistogram.py
interpret their input using the encoding given as an argument and print the data in hex (chardump.py) or a Unicode histogram (charhistogram.py)
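
The two kinds of analysis can be sketched as below. This is an illustration under assumptions, not the code of either script: the output format is invented, the hex dump is assumed to show Unicode code points rather than raw bytes, and the function names char_histogram and hex_dump are hypothetical.

```python
from collections import Counter

def char_histogram(data, encoding):
    """Count how often each character occurs in data decoded with encoding."""
    return Counter(data.decode(encoding))

def hex_dump(data, encoding):
    """Return the code points of the decoded data as space-separated hex values."""
    return " ".join(format(ord(ch), "04x") for ch in data.decode(encoding))

raw = "да, да".encode("koi8_r")
for ch, count in sorted(char_histogram(raw, "koi8_r").items()):
    print("U+%04X %d" % (ord(ch), count))
print(hex_dump(raw, "koi8_r"))
```

Decoding the same bytes with a wrong encoding argument gives a visibly different histogram, which is what makes such a tool useful for checking which character set a file really uses.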

Tuesday, 21-Jun-2005 14:23:05 IST