Corpus Preprocessing for DCU CLEF 2004
Joachim Wagner - School of Computing - Dublin City University
jwagner@computing.dcu.ie - http://jowagner.dnsalias.org/
2004-04-20, 2004-08-13
Download baseconv.tgz [10 KB]: baseconv.py, some other scripts and this documentation. You will need Python 2.2 or higher.
The OKAPI system does not accept the special characters used in Finnish, French and Russian. To avoid any problems, we decided to encode all words with just the 26 lowercase letters a to z. We reused a tool written by one of our students [that's me] which interprets its octet stream input as one big number and rewrites that number in base N, with digits taken from a given set of characters. Here, N is 26. In practice, the encoding guarantees that different input words get different encodings and that the reverse operation (decoding) is easy to perform. However, the encoded form is not human-readable and string similarities are not preserved. The latter is not a problem, since we do not want to retrieve fuzzy matches to our queries with OKAPI. Example: The three words pécheur, pêcheur and pêcheurs are encoded as gropmdpbtfui, cbppmdpbtfui and klcgrwruwanejd.
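The conversion described above can be sketched in a few lines of Python 3. This is not the code of baseconv.py: the function names are made up for illustration, and the byte order (first byte least significant, least significant digit written first) and the ISO Latin-1 input encoding are assumptions. Under these assumptions the sketch yields the encodings quoted above for pécheur and pêcheur.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 letters a to z


def encode_word(data, alphabet=ALPHABET):
    """Read the bytes as one big number and rewrite it in base len(alphabet)."""
    # Assumption: the first byte is the least significant one.
    number = int.from_bytes(data, "little")
    base = len(alphabet)
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])  # least significant digit first
    return "".join(digits)


def decode_word(text, alphabet=ALPHABET):
    """Reverse encode_word: rebuild the number, then turn it back into bytes."""
    base = len(alphabet)
    number = 0
    for char in reversed(text):
        number = number * base + alphabet.index(char)
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, "little")


if __name__ == "__main__":
    for word in ("p\xe9cheur", "p\xeacheur"):
        encoded = encode_word(word.encode("latin-1"))
        # Print only ASCII so the demo does not depend on the terminal encoding.
        print(ascii(word), "->", encoded)
```

The two words differ in a single accent, yet their encodings differ in the first three letters, which illustrates why string similarity is lost.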
encoding:
$ ./baseconv.py -s -w -b 26 <infile >outfile
decoding:
$ ./baseconv.py -s -w -b 26 -u <infile >outfile
Option -n inserts a newline every 72 bytes:
$ ./baseconv.py -n <baseconv.py
#9B&7khJ>51-cmaXU.b!ceeh(zj~?Bka>0|BpgjOcQnwpkhG"-/2=\/}R~88l+0u"\TAshEN
@BUd^<jC^iEbu"vW6A^C?mfnzDu~5egp@w}(4Qu\XD}bps{KJR\2Zp35dK/MWtE/W<]o{z[c
...
You can change the range of characters to use. (The hard-coded alphabet starts with the lowercase letters, followed by the uppercase letters and the digits.)
$ ./baseconv.py -n -b 4 <baseconv.py
dacabacaaacaddcabbdbdadbcadbddcacacbbccbcdcbddcabbcbcdcbcbdbaacaaadbbcdb
abdbaccbddcbcdcbccaaccaacadbcacacacacacadaabddcbcdcbcbdbbbcbcadbdadbbccb
...
You can specify different bases for the first digits of the representation:
$ ./baseconv.py -n -b 2,4,6,8,10,12,15,18,24,30,40,50,60,70,80,90 <baseconv.py
bbagcdkqpCnk1Pe5K"@[#x-~Q[4nX4x=Wcj:j!)^a~E;f=EC)$H&-5VKc7=_Qde24Nu$+-9R
-S<7a$d=pT=X+p97+F8OO&X8eRTE//VX]m]oLH(CHtKlQFvN)r1dv@(fP0gA?!DVA9^J&g"%
Restriction: Trailing zero bytes are discarded:
$ echo -n -e "\000\000" | ./baseconv.py | wc -c
0
(empty output of baseconv.py)
Option -w splits the input and processes each word separately:
$ ./baseconv.py -w -b 52 <baseconv.py
hhd
RbnARgzCfREQPuUQm
SzFNuJfoc
All sequences of whitespace are converted into a single newline. Each word is encoded the same way as it would be without -w:
$ head -1 baseconv.py
#! /usr/bin/env python
$ echo -n -e "#\041" | ./baseconv.py -b 52 ; echo
hhd
$ echo -n -e "/usr/bin/env" | ./baseconv.py -b 52 ; echo
RbnARgzCfREQPuUQm
In most applications, we may use 0-9 from the second letter on:
$ ./baseconv.py -w -b 52,62 <baseconv.py | head
hNc
RBf7ffNAQu5GpXZ4
Sltkj1gP
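A list of bases such as 52,62 amounts to a mixed-radix representation: the first digit is taken in base 52 (letters only), every later digit in base 62 (letters and digits). The following Python 3 sketch shows the idea. It is not taken from baseconv.py: the name encode_mixed is hypothetical, and the byte order, the digit order and the rule that the last base in the list is reused for all remaining digits are assumptions. They are consistent with the hhd and hNc outputs shown above.

```python
import string

# Assumed start of the hard-coded alphabet: a-z, A-Z, 0-9 (62 characters).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def encode_mixed(data, bases):
    """Encode bytes with a separate base for each of the first digits."""
    # Assumption: first byte least significant, least significant digit first.
    number = int.from_bytes(data, "little")
    digits = []
    position = 0
    while number > 0:
        # Assumption: once the list is exhausted, the last base is reused.
        base = bases[min(position, len(bases) - 1)]
        number, remainder = divmod(number, base)
        digits.append(ALPHABET[remainder])
        position += 1
    return "".join(digits)


if __name__ == "__main__":
    print(encode_mixed(b"#!", [52]))      # single base 52
    print(encode_mixed(b"#!", [52, 62]))  # base 52 first, then base 62
```

Keeping the first digit in base 52 ensures that every encoded word starts with a letter, while the later digits use the larger alphabet and so give slightly shorter output.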
Options -d and -u can be used for decoding. They do exactly the same thing.
$ echo hhd RbnARgzCfREQPuUQm SzFNuJfoc | ./baseconv.py -u -w -b 52 ; echo
#! /usr/bin/env python
Workaround for trailing zero bytes if length is known:
$ echo -n -e "abc\000" | ./baseconv.py | ./baseconv.py -u | wc -c
3
$ echo -n -e "abc\000" | ./baseconv.py | ./baseconv.py -u -z 8 | hexdump -c
0000000 a b c \0 \0 \0 \0 \0
0000008
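Why trailing zero bytes get lost, and how a known length restores them, can be illustrated with a short Python 3 sketch. It is an illustration only, not the script's code: the names encode and decode, the width parameter (standing in for -z) and the use of base 26 are assumptions. If the first byte is the least significant one, trailing zero bytes are leading zeros of the number and therefore leave no trace in the encoded string.

```python
import string

ALPHABET = string.ascii_lowercase  # base 26, chosen only for this sketch


def encode(data):
    """Encode bytes as a base-26 number, least significant digit first."""
    number = int.from_bytes(data, "little")
    digits = []
    while number > 0:
        number, remainder = divmod(number, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(digits)


def decode(text, width=0):
    """Decode and, like the -z workaround, pad with zero bytes up to width."""
    number = 0
    for char in reversed(text):
        number = number * len(ALPHABET) + ALPHABET.index(char)
    length = (number.bit_length() + 7) // 8
    raw = number.to_bytes(length, "little")
    return raw.ljust(width, b"\x00")  # no effect if raw is already long enough


if __name__ == "__main__":
    encoded = encode(b"abc\x00")
    print(len(decode(encoded)))     # the trailing zero byte is gone
    print(len(decode(encoded, 8)))  # padded back to a fixed width
```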
Files can be processed directly:
$ ./baseconv.py baseconv.py baseconv.py | wc -l
2
Extra newlines are appended after each file in order to separate the output strings.
Attention: All files are read into memory!
There is no option for removing punctuation. Use tr instead:
$ echo "a,b.-c" | tr '.,-' ' '
a b  c
For French, include the following special punctuation:
$ echo $'\xab\xbb'
«»
Surprisingly, German quotes („Schönes Wetter heute.“) are not defined in ISO Latin1.
Maybe the following special hyphen should also be dealt with:
$ echo $'\xad'
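If the punctuation is removed in Python rather than with tr, the same substitution can be sketched with str.translate. The function name is hypothetical and the character set is only the one mentioned above. Deleting the soft hyphen instead of replacing it with a space is an assumption, since the text leaves open how it should be dealt with.

```python
# Characters replaced by a space: . , - and the French quotes.
TO_SPACE = ".,-\xab\xbb"
# Assumption: the soft hyphen (\xad) is deleted so that words stay in one piece.
TO_DELETE = "\xad"

TABLE = str.maketrans(TO_SPACE, " " * len(TO_SPACE), TO_DELETE)


def strip_punctuation(text):
    """Return text with punctuation turned into spaces, as tr does above."""
    return text.translate(TABLE)


if __name__ == "__main__":
    print(strip_punctuation("a,b.-c"))
```

The text has to be decoded from ISO Latin-1 into a Python string before this function is applied, and encoded again afterwards.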
Processing slows down considerably if the chunks to be transformed are big, say 20 KB or more. The script is not intended to be a full substitute for mmencode or uuencode. It is suited for binary data of fixed width (option -z) or for the words of text files. With option -w, performance is only an issue if single words are longer than a few KB.
In any case, if the input is bigger than 1/3 of available memory then there will be problems because it is processed in one step. Upon request, the implementation of option -w can be improved to process the input line by line. Just send me an email.
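Line-by-line processing for -w could look roughly like the following Python 3 sketch. It is one possible approach, not the script's current or planned implementation. The names encode and encode_stream are hypothetical, and base 52 with an a-z, A-Z alphabet is assumed. Because words never span a line break, only one line has to be held in memory at a time.

```python
import io
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase  # base 52


def encode(data):
    """Encode bytes as a base-52 number, least significant digit first."""
    number = int.from_bytes(data, "little")
    digits = []
    while number > 0:
        number, remainder = divmod(number, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(digits)


def encode_stream(stream):
    """Yield one encoded word at a time, reading the stream line by line."""
    for line in stream:            # binary files iterate over lines
        for word in line.split():  # split() drops all ASCII whitespace
            yield encode(word)


if __name__ == "__main__":
    sample = io.BytesIO(b"#! /usr/bin/env python\nimport sys\n")
    for encoded in encode_stream(sample):
        print(encoded)
```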
Option -s provides simple SGML treatment: every word starting with '<' and ending with '>' is passed through without encoding or decoding. If tags are separated from PCDATA by whitespace and do not contain whitespace themselves, this option makes it possible to process tagged text.
Examples (-s -w -b 52,62):
<title> This is a title </title> -> <title> uI1oGc jkj Tb iVSM5ij </title>
<title>This is a title</title> -> Kwqqd4uo2FY6Ozn jkj Tb S4fjfZikY2IWqgQN9b
<title>This-is-a-title</title> -> <title>This-is-a-title</title>
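The pass-through rule of -s can be sketched in Python 3 as follows. This is an illustration under assumptions, not the script itself: the function names are hypothetical, and the byte order, the ISO Latin-1 encoding and the a-z, A-Z, 0-9 alphabet are assumed. The test applied to each word is exactly the one stated above: it starts with '<' and ends with '>'.

```python
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def encode_mixed(data, bases):
    """Encode bytes in mixed radix; the last base is reused for later digits."""
    number = int.from_bytes(data, "little")
    digits = []
    position = 0
    while number > 0:
        base = bases[min(position, len(bases) - 1)]
        number, remainder = divmod(number, base)
        digits.append(ALPHABET[remainder])
        position += 1
    return "".join(digits)


def encode_tagged(text, bases=(52, 62)):
    """Encode every whitespace-separated word except words that look like tags."""
    output = []
    for word in text.split():
        if word.startswith("<") and word.endswith(">"):
            output.append(word)  # tag: passed through unchanged
        else:
            output.append(encode_mixed(word.encode("latin-1"), bases))
    return "\n".join(output)  # one word per line, as with option -w


if __name__ == "__main__":
    print(encode_tagged("<title> This is a title </title>"))
    print(encode_tagged("<title>This-is-a-title</title>"))
```

The third example above shows the limit of this rule: a whole element without whitespace starts with '<' and ends with '>', so it is passed through unencoded.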
Hope the script is useful
Joachim
Thursday, 14-Oct-2004 18:11:21 IST