Tokenisation

for each line:

Then any sequence of whitespace separates tokens. A token starting with < and ending with > is considered to be an SGML tag. Note that tags cannot contain whitespace, because any such tag would have been split into at least two tokens during tokenisation.
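
The whitespace splitting and tag test described above can be sketched in Python as follows. This is only an illustration, not the tool that was actually used: the function names tokenise and is_sgml_tag are hypothetical, and the exact set of whitespace characters is an assumption (ASCII whitespace only).

```python
import re

# Assumption: only ASCII whitespace separates tokens.
_ASCII_WS = re.compile(r"[ \t\n\r\f\v]+")


def tokenise(line):
    """Split a line into tokens at runs of ASCII whitespace."""
    # Leading or trailing whitespace yields empty strings; drop them.
    return [tok for tok in _ASCII_WS.split(line) if tok]


def is_sgml_tag(token):
    """A token counts as a tag if it starts with '<' and ends with '>'."""
    return token.startswith("<") and token.endswith(">")


# Hypothetical input line: the last tag contains a space.
tokens = tokenise('<s> Hello world </s> <doc id="1">')
tags = [tok for tok in tokens if is_sgml_tag(tok)]
```

In this sketch the tag with an attribute is split into the two tokens <doc and id="1">, neither of which passes the tag test, which is the effect noted above for tags containing whitespace.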

Non-breakable space (\xa0 in ISO 8859-1) seems not to be treated as whitespace: some tokens in the Finnish corpus start with this character. Fortunately, the other steps accepted non-breakable space as a token character as well.
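
The same effect can be reproduced in Python, given here only as an analogy and not as a description of the original tool: splitting a byte string separates on ASCII whitespace only, so \xa0 stays attached to the token, while splitting a Unicode string treats the non-breakable space as whitespace. The sample line is made up.

```python
# Hypothetical sample: a non-breakable space directly before the second word.
line = "sana \xa0toinen"

# Byte-level split: only ASCII whitespace separates tokens,
# so the second token keeps its leading \xa0 byte.
byte_tokens = line.encode("latin-1").split()

# Unicode-level split: the non-breakable space counts as whitespace
# and is removed together with the ordinary space.
text_tokens = line.split()
```

Here byte_tokens is [b'sana', b'\xa0toinen'], matching the behaviour observed in the corpus, whereas text_tokens is ['sana', 'toinen'].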

Thursday, 14-Oct-2004 19:06:19 IST