|
Corpus Preprocessing for DCU CLEF 2004
|
Given the low number of collisions, it might seem reasonable to just use MIME Base64 encoding and to accept small degradation of the Okapi retrieval performance. If we assume that Okapi can deal with numerical characters (0-9) and allows 2 more special characters that we use for Mime Base64 + and /, we store the MIME Base64 encoded token with = replaced by A. When decoding, we strip the trailing zero bytes that may occure due to the latter replacement.
To reduce the number of collisions and thereby approaching optimal Okapi performance, we try to spread information within the token. Instead of using sophisticated diffusion algorithms from cryptography, we employ alternately string substitutions and MIME Base64 encoding and decoding.
Decoding is still easy since we can construct the inverse mapping for a one to one mapping of 64 or 256 elements in very short. See below.
First of all, we measure the number of collisions with MIME Base64 encoding. Then we apply one to one substitution tables. The tables are initialised by permuting the identity map:
# initialise identity map
map = [] # empty list
for i in range(256):
map.append(chr(i))
mapFrom = string.join(map, '')
# permute map
i = 255
while i > 0:
j = int(random() * (i+1)) # 0 <= j <= i
map[i], map[j] = map[j], map[i] # swap entries
i = i - 1
# create string translation table
mapTo = string.join(map, '')
transTable = string.maketrans(mapFrom, mapTo)
If the permutation is really randomly chosen, repeatedly applying it will not improve diffusion of information. We have to spread information across character bounderies. We can do this with the help of MIME Base64 encoding. It represents blocks of 3 characters, i.e. 24 bits, with 4 characters that encode 6 bit each:
char | 1 | 2 | 3 bit | 1 2 3 4 5 6 7 8 | 9 10 11 12 13 14 15 16 | 17 18 19 20 21 21 23 24 bit | 1 2 3 4 5 6 | 7 8 9 10 11 12 | 13 14 15 16 17 18 | 19 20 21 21 23 24 MIME | 1 | 2 | 3 | 4 |
If we apply substitutions to character 2 and 3 of the MIME Base64
representation, we spread information accross those bounderies at
bit 8 and 16. However, we have to pay attention that the result
still is a valid MIME Base64 string.
First of all, the mapping must be restricted to the characters representing
groups of 6 bits.
Then we have to consider padded blocks that occure if the length of
the input is not a multiple of 3 characters.
Either we need special mappings for those cases or we use the following
trick.
MIME Base64 pads the input with \x00 characters (all bits zero) and
uses '=' instead of 'A' that normally represents 6 zero bits to
encode the number of padded characters.
The trick is to just map '=' to 'A' and keep in mind that we have to
strip trailing \x00 characters when decoding.
We do not lose inforamtion since our tokens never contain \x00 characters.
The following Python
code initialises such a substitution table. We assume that the
permutation code above has been put into a functions permute.
b64from = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/+'
b64to = string.join(permute(list(b64from)), '')
transTable2 = string.maketrans(b64from + '=', b64to + 'A')
To further spread information, we decode the result and repeat the whole process. After a few rounds, information should be spread evenly across the full 24 bit block.
def encodestring(s, rounds):
if rounds % 2:
s = string.translate(s, ttable256)
for i in range(int(rounds/2)):
b = base64.encodestring(s)
b = string.translate(b, ttable64)
s = base64.decodestring(b)
s = string.translate(s, ttable256)
return base64.encodestring(s)
For decoding, you do the steps in reverse order and swap parameters 'to'
and 'from' in string.makestrans. Finally you strip
trailing \x00 characters.
The following figure shows how information is mixed. Each line represents two bits. Note that the first 6 bits cannot influence the last 8 bits even with 4 rounds while middle bits can influence edge bits already after 2 rounds.
| | | | | | | | | | | |
round 1 (XXXXXXX)(XXXXXXX)(XXXXXXX)(XXXXXXX)
| | | | | | | | | | | |
round 2 (XXXXXXXXXX)(XXXXXXXXXX)(XXXXXXXXXX)
| | | | | | | | | | | |
round 3 (XXXXXXX)(XXXXXXX)(XXXXXXX)(XXXXXXX)
| | | | | | | | | | | |
round 4 (XXXXXXXXXX)(XXXXXXXXXX)(XXXXXXXXXX)
| | | | | | | | | | | |
|
We run the experiment with different numbers of rounds and different initialisation of the random number generator that is used to permute the mappings. Tokens that contain digits or a '<' (probably a tag) are excluded from the experiment because most of them appear in the document header that should not be used for retrieval.
| rounds | initialisation vector | ||||||
|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
| 0 | 92 | 92 | 92 | 92 | 92 | 92 | 92 |
| 1 | 228 | 1949 | 1370 | 2916 | 500 | 1101 | 622 |
| 2 | 859 | 1293 | 1579 | 2215 | 544 | 371 | 531 |
| 3 | 622 | 160 | 242 | 1174 | 221 | 809 | 795 |
| 4 | 495 | 610 | 233 | 494 | 193 | 259 | 114 |
| 5 | 369 | 196 | 610 | 68 | 61 | 83 | 170 |
| 6 | 106 | 79 | 83 | 140 | 123 | 156 | 32 |
| 7 | 302 | 76 | 99 | 75 | 33 | 228 | 55 |
| 8 | 72 | 47 | 92 | 50 | 71 | 58 | 94 |
| 9 | 295 | 44 | 96 | 64 | 28 | 63 | 66 |
| 10 | 37 | 44 | 30 | 40 | 32 | 36 | 51 |
| ... | |||||||
| 20 | 37 | 35 | 35 | 29 | 34 | 24 | 21 |
| 21 | 290 | 24 | 76 | 26 | 42 | 44 | 30 |
Between 1 and 4 rounds do not help at all. The number of collisions is much bigger than without any substitution. Only in some cases, 5 rounds can yield a small advantage. In order to reduce the number of collisions by more than 50% you have to try different initialisation vectors and pick one that gives good results. Example vector 7 shows that 6 rounds can be sufficient.
The diffusion process seems to be very slow. 20 rounds give much better results than 10 rounds.
We tried to improve the diffusion by using different permutation tables in each round. The first to rounds are the same as in the previous experiment. The following table shows similar behaviour. It is strange that the additional rounds cannot polish the bad first round of initialisation vector 1.
| rounds | initialisation vector | ||||||
|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
| 0 | 92 | 92 | 92 | 92 | 92 | 92 | 92 |
| 1 | 228 | 1949 | 1370 | 2916 | 500 | 1101 | 622 |
| 2 | 859 | 1293 | 1579 | 2215 | 544 | 371 | 531 |
| 3 | 889 | 373 | 652 | 196 | 356 | 724 | 147 |
| 4 | 348 | 1101 | 162 | 190 | 345 | 102 | 1737 |
| 5 | 540 | 86 | 395 | 123 | 57 | 408 | 149 |
| 6 | 43 | 57 | 99 | 62 | 308 | 94 | 118 |
| 7 | 308 | 110 | 170 | 79 | 240 | 119 | 67 |
| 8 | 58 | 68 | 75 | 45 | 33 | 35 | 147 |
| 9 | 323 | 71 | 63 | 41 | 57 | 25 | 70 |
| 10 | 58 | 100 | 28 | 38 | 30 | 38 | 36 |
| ... | |||||||
| 20 | 46 | 30 | 36 | 26 | 36 | 43 | 19 |
| 21 | 282 | 21 | 67 | 20 | 33 | 29 | 20 |
23 [290, 20, 96, 40, 38, 19, 22] 25 [270, 23, 70, 39, 27, 22, 41] 27 [301, 30, 103, 32, 30, 28, 35] | |||||||
Here, a full 24 bit substitution is employed. That's a permutation of a list of 224 = 16,777,216 elements. The core of this has been implemented in C to do it more efficiently.
In each round the 3 character substitution window shifts by one character. This way information spreads over the whole token. Boths ends are connected to speed up diffusion. For exmaple, in round 3 the window starts at the third character and extends over the endpoints of the token if its length plus one is not a multiple of three:
1 2 3 4 5 6 7 | ^
| | | | | | | E |
round 1 (XXXXXXX)(XXXXXXX)(X) N D
| | | | | | | C E
round 2 (X)(XXXXXXX)(XXXXXXX) O C
| | | | | | | D O
round 3 XX)(X)(XXXXXXX)(XXXXX E D
| | | | | | | | E
1 2 3 4 5 6 7 V |
|
Smaller substitution tables (16 and 8 bit) are employed to deal with remaining characters. Note that substitions blocks overlap by 8 bits allowing much more information to be exchanged than with 2 or 4 bit overlap in the previous experiments.
With exactly one round, the experiment is supposed to confirm that our cascade of 6 and 8 bit substitutions of the previous experiments was a good 24 bit substitution and that we would not have improved results if we had increased the number of rounds even further.
| rounds | initialisation vector | ||||||
|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
| 0 | 92 | 92 | 92 | 92 | 92 | 92 | 92 |
| 1 | 310 | 331 | 82 | 516 | 393 | 551 | 56 |
| 2 | 34 | 29 | 35 | 34 | 35 | 28 | 42 |
| 3 | 38 | 34 | 30 | 24 | 31 | 37 | 25 |
| 4 | 34 | 42 | 31 | 27 | 29 | 23 | 33 |
| 5 | 39 | 39 | 25 | 31 | 22 | 29 | 50 |
| ... | |||||||
| 10 | 29 | 30 | 39 | 27 | 33 | 29 | 30 |
With one round the results are surprisingly bad. Additional runs with different initialisation vectors confirm that this is not by mischance. Thinking about how it is possible that results in the previous experiment are better, we notice that there is a difference in padding of tokens that are not a multiple of 3 characters long. For example, look at a 4 character token. In the previous experiments, we used all 8 Base64 characters to represent it. Now we use only 6 characters and two padding characters ('='). (In addition, the 6th one is only used to one third: Only 'A', 'Q', 'g' and 'w' can appear before two padding character. They encode 2 bits.) This increases the probability of collisions.
With only 2 rounds, it is possible to reduce collisions by more than 66%. However, this is also possible with the algorithm of the previous experiments that can be easily implemented with any high level programming language. (See IV 5 with 9 rounds in experiment #1.) Doing more than 3 rounds seems not to improve results any more.
Tokens are padded to a multiple of three characters to compare 24 bit substitution with experiments #1 and #2.
| rounds | initialisation vector | ||||||
|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
| 0 | 92 | 92 | 92 | 92 | 92 | 92 | 92 |
| 1 | 35 | 41 | 25 | 21 | 16 | 20 | 23 |
| 2 | 24 | 16 | 21 | 13 | 24 | 16 | 18 |
| 3 | 18 | 28 | 18 | 27 | 18 | 18 | 26 |
| 4 | 21 | 15 | 14 | 24 | 17 | 20 | 19 |
| 5 | 16 | 15 | 19 | 28 | 19 | 15 | 23 |
| ... | |||||||
| 10 | 25 | 19 | 21 | 19 | 22 | 24 | 14 |
According to round 1, we got good 24 bit substitutions with our cascade of simple substitutions in experiment #1 and #2.
With 2 or more rounds, it is easy to reduce collisions by more than 82%. Initialisation vector 6 with 8 rounds (not shown in table 4) produces only 11 collisions (-88%), all 3 letter tokens.
Tokens are padded to a length of at least six characters and to a multiple of three characters. This way, those collisions of three letter tokens in the previous experiment should disappear.
| rounds | initialisation vector | ||||||
|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
| 0 | 577 | 577 | 577 | 577 | 577 | 577 | 577 |
| 1 | 29 | 36 | 35 | 27 | 16 | 15 | 20 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
This works! Note that we don't need low level programming for it. The shifting window can be implemented by slicing the input string.
The high number of collisions for 0 rounds is caused by collisions of 'h' (encoded as 'aAAA') and the padding blocks (encoded as 'AAAA').
Here we apply those things learned from the previous experiments to the simple cascade schema of experiment 1.
| rounds | initialisation vector | ||||||
|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
| 0 | 577 | 577 | 577 | 577 | 577 | 577 | 577 |
| 1 | 1575 | 1949 | 1490 | 2916 | 519 | 1169 | 622 |
| 2 | 304 | 669 | 541 | 351 | 101 | 475 | 402 |
| 3 | 199 | 171 | 320 | 102 | 89 | 134 | 466 |
| 4 | 36 | 70 | 177 | 97 | 40 | 32 | 44 |
| 5 | 55 | 57 | 269 | 16 | 7 | 21 | 36 |
| 6 | 25 | 7 | 16 | 9 | 4 | 33 | 14 |
| 7 | 5 | 6 | 5 | 1 | 7 | 8 | 13 |
| 8 | 9 | 7 | 4 | 2 | 2 | 2 | 1 |
| 9 | 2 | 3 | 1 | 2 | 2 | 0 | 1 |
| 10 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
download collisions.tgz [7 KB]
Thursday, 21-Oct-2004 20:25:49 IST