Word Encoding

Tokens are almost arbitrary sequences of 8 bit characters. They just do not contain whitespace, upper case letters, and zero bytes. We had to encode all tokens in such a way that the OKAPI system can handle then. We chose to use letters a to z only just to be sure that we get no problems.

Tokens that are considered to be SGML tags are not encoded, i.e. any token starting with < and ending with > stays intact. Therefore, SGML tags must not contain anything that confuses OKAPI. If you try to clean up preprocessed data please read carefully how input is tokenised. For example, the string <DOCID>problematic@@sequence</DOCID> will be considered to be one tag and remain unchanged.

Base N Representation

In decimal representation, we write number using N = 10 symbols 0 - 9 called digits. Each of them represents one of the number zero to nine if used alone. Bigger numbers are represented by a sequence of digits is a way the reader is assumed to be familiar with. The position determines which power of N the digit is to be multiplied with. The sum of all factors is the represented number. The mapping between numbers and representations is one to one.

Mapping Tokens To Numbers

We interpret the string of 8 bit characters as base 256 number. However, we reversed the string first, i.e. the string starts with the least significant digit. This way, trailing zero bytes that in some applications mark the end of a string would have no effect.

stringreversedigits (base 256) number
aa9797
abba98, 97256*98+97 = 25185
pêcheurruehcêp114, 117, 101, 104, 99, 234, 112 32217225748540016

Encoding Numbers

We choose a base 26 representation. Zero to twenty five are represented with a to z. Some examples:

numberbase 26 (numeric)base 26 (symbols)reverse representation
261, 0baab
271, 1bbbb
281, 2bccb
522, 0caac
26 * 26 = 676 1, 0, 0 baaaab
26 ^ 3 = 17576 1, 0, 0, 0 baaaaaab
17580 1, 0, 0, 4 baaeeaab
25185 1, 11, 6, 17 blgrrglb
32217225748540016 8, 20, 5, 19, 1, 15, 3, 12, 15, 15, 1, 2 iuftbpdmppbccbppmdpbtfui

Difference To MIME Base64 Encoding

The Base64 MIME encoding does a similar job. It encodes a string of 8-bit quantities with a save subset of characters. In Base64, the subset has been chosen to be save in transmission of networks. It contains A-Z, a-z, 0-9, + and /. Furthermore, Base64 uses the character = to fill up output to a multiple of four characters and inserts linebreaks.

We don't know if + and / can be used to form words in OKAPI. Leading digits might also cause trouble. More severely, case of letters is important in Base64. If OKAPI works case insensitive, different words might look the same for it. The following table shows 3 out of 29 collisions that have been found in the 723,420 token list of the Finnish AAMU corpus.

stringMIME Base64 encoding
ariYXJp
axéYXjp
levibGV2aQ==
lköibGv2aQ==
ssvc3N2
syöc3n2

To make these collisions less probable, you could diffuse bits. However, it does not work to substitute individual characters with a randomly chosen permutation table because MIME Base64 is actually quite good at avoiding collisions. See collisions experiments.

Friday, 15-Oct-2004 19:36:04 IST