Overview of Tools
The scripts call other scripts without absolute paths, so clef/bin
should be in your PATH. For example, add
export PATH=$PATH:/usr/local/clef/bin
to your .bashrc.
Basic Tools
- num_correction.py
  - joins a <num> element spread over several lines
    (<num> \n* pcdata </num>) into one line
- finnish-preproc.pl
- french-preproc.pl
- russian-preproc_koi8.pl
- russian2-preproc_koi8.pl (with stopword2)
  - the preproc scripts perform tokenisation, de-capitalisation,
    stopword removal, stemming, word encoding and reconstruction
    of punctuation
- www.jowagner.net/cgi-bin/baseconv.py
  - web interface to the word encoding
- do_*.sh
  - convenience scripts
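
The line joining attributed to num_correction.py above could be sketched roughly as follows. This is only an illustration of the idea, not the script's actual code: the function name join_num_elements and the regular expression are assumptions, and the real script may treat whitespace differently.

```python
import re

# Hypothetical pattern: a <num> element whose content may span several lines.
NUM_ELEMENT = re.compile(r"<num>(.*?)</num>", re.DOTALL)


def join_num_elements(text):
    """Collapse line breaks inside every <num>...</num> element."""

    def _join(match):
        # Split on any whitespace (newlines included) and re-join with
        # single spaces, so the whole element ends up on one line.
        content = " ".join(match.group(1).split())
        return "<num> " + content + " </num>"

    return NUM_ELEMENT.sub(_join, text)


if __name__ == "__main__":
    sample = "<top>\n<num>\n\nC041\n</num>\n<title>Example</title>\n</top>\n"
    print(join_num_elements(sample))
```

Text outside <num> elements is left untouched, so a filter like this can be run over a whole file in one pass.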
Character Set Conversion
- cyrcode.pl
  - ISO 8859-5 <--> KOI8
- utf82iso_plain.py
  - UTF-8 -> ISO 8859-5
- utf82iso.py
  - UTF-8 -> ISO 8859-5; also corrects some character-level errors
    of the CLEF corpus
- utf8_to_utf16.py, utf16_to_utf8.py
  - UTF-8 <--> UTF-16
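
The conversions listed above can be approximated with Python's built-in codecs. This is a minimal sketch, not the code of the scripts themselves: the helper name transcode is hypothetical, and "KOI8" is assumed here to mean KOI8-R (Python codec koi8_r).

```python
def transcode(data, src, dst, errors="strict"):
    """Decode bytes from encoding `src` and re-encode them as `dst`."""
    return data.decode(src).encode(dst, errors)


if __name__ == "__main__":
    utf8 = "Привет".encode("utf-8")

    # UTF-8 -> ISO 8859-5
    iso = transcode(utf8, "utf-8", "iso8859_5")

    # ISO 8859-5 -> KOI8 (assumed to be KOI8-R) and back again
    koi = transcode(iso, "iso8859_5", "koi8_r")
    assert transcode(koi, "koi8_r", "iso8859_5") == iso

    # UTF-8 -> UTF-16 (Python prepends a byte order mark) and back again
    utf16 = transcode(utf8, "utf-8", "utf-16")
    assert transcode(utf16, "utf-16", "utf-8") == utf8

    print(iso.hex(), koi.hex())
```

With errors="strict", a character that has no slot in the target encoding raises UnicodeEncodeError. Passing errors="replace" substitutes a question mark instead, which hides exactly the kind of character-level problem that utf82iso.py is described as correcting.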
Character Set Analysis
- chardump.py
- charhistogram.py
  - each interprets its input with the encoding provided as an
    argument and prints a Unicode histogram or the data in hex
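
What such an analysis involves can be sketched in a few lines of Python. The function names char_histogram and hex_dump are hypothetical, and the output format of the real scripts is not described here, so the layout below is just one possibility. The sketch also assumes that "in hex" means Unicode code points rather than raw bytes.

```python
import unicodedata
from collections import Counter


def char_histogram(data, encoding):
    """Decode `data` with `encoding` and count each Unicode character."""
    return Counter(data.decode(encoding))


def hex_dump(data, encoding):
    """Decode `data` and return its code points as space-separated hex."""
    return " ".join("%04X" % ord(ch) for ch in data.decode(encoding))


if __name__ == "__main__":
    # Sample input: KOI8-R bytes, decoded with the matching codec.
    raw = "Привет, мир".encode("koi8_r")
    for ch, count in char_histogram(raw, "koi8_r").most_common():
        name = unicodedata.name(ch, "<unnamed>")
        print("U+%04X\t%d\t%s" % (ord(ch), count, name))
    print(hex_dump(raw, "koi8_r"))
```

Decoding the same bytes with a wrong encoding name usually produces an implausible histogram (for example, many box-drawing characters or Latin letters with accents), which helps when guessing the encoding of an unknown file.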
Tuesday, 21-Jun-2005 14:23:05 IST