Wikify a page
Can be done in 50 lines of shell or so.
- Usage: wikify file.html > file.wiki.html
- "Wikifies" file.html and writes the result to stdout.
- For all capitalised (i.e. might be
and un-linked words Word ...
- See Parsing XML / HTML
- Can find capitalised word with
- Can extract all links with something like:
cat file.xhtml | xpath '//a[@href]'
- ... Link the word to
- (We could check if that URL exists,
but I don't want this class practical
to cause trouble for Wikipedia's servers,
so we will not check here.)
- Only link the first occurrence of Word, not subsequent occurrences.
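- A minimal sketch of the two steps above (find capitalised words, link only the first occurrence), assuming plain text with no markup in it. The function names are made up for illustration, and base_url is a hypothetical placeholder, not the real link target.

```shell
#!/bin/sh
# Hypothetical placeholder: substitute the real link target here.
base_url="https://example.org/wiki/"

# Print each capitalised word once, in order of first appearance.
capitalised_words() {
    LC_ALL=C tr -cs 'A-Za-z' '\n' |
        LC_ALL=C grep -E '^[A-Z][a-z]+$' |
        awk '!seen[$0]++'
}

# Link the first whole-word occurrence of $1 in the text on stdin.
# Later occurrences are left alone.
link_first() {
    LC_ALL=C awk -v w="$1" -v base="$base_url" '
        !linked && match($0, "(^|[^A-Za-z])" w "([^A-Za-z]|$)") {
            start = RSTART
            # The match may begin with the non-letter before the word.
            if (substr($0, start, length(w)) != w) start++
            link = "<a href=\"" base w "\">" w "</a>"
            $0 = substr($0, 1, start - 1) link substr($0, start + length(w))
            linked = 1
        }
        { print }
    '
}

# Apply link_first once for every capitalised word in the input.
wikify_text() {
    text=$(cat)
    for w in $(printf '%s\n' "$text" | capitalised_words); do
        text=$(printf '%s\n' "$text" | link_first "$w")
    done
    printf '%s\n' "$text"
}
```

Try it with: echo 'the King and the Queen saw the King' | wikify_text
The word list is collected once, before any links are inserted, so the inserted markup is never scanned for new words.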
- Q. How do you avoid Wikifying words inside tags? For example:
<title> Word Word Word </title>
<a href=url> Word Word Word </a>
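- One possible approach, sketched under assumptions: put each tag on its own line, then keep a counter of how many <a> or <title> elements are currently open. The names split_tags and classify are made up for illustration. The sketch assumes each tag sits on a single line and has no ">" inside an attribute value, and it ignores comments and scripts.

```shell
#!/bin/sh
# Put every tag on its own line so text and markup can be told apart.
split_tags() {
    awk '{ gsub(/</, "\n<"); gsub(/>/, ">\n"); print }'
}

# Label each piece of the input:
#   M = markup, S = text to skip (inside <a> or <title>),
#   T = text that is safe to wikify.
classify() {
    split_tags | awk '
        $0 == "" { next }                      # empty pieces left by the split
        { line = tolower($0) }
        line ~ /^<(a|title)[ >]/   { skip++; print "M " $0; next }
        line ~ /^<\/(a|title)[ >]/ { if (skip > 0) skip--; print "M " $0; next }
        /^</                       { print "M " $0; next }
        { label = (skip > 0) ? "S" : "T"; print label " " $0 }
    '
}
```

Try it with: classify < file.html | grep '^T '
Only the T lines would then be candidates for linking. Joining the pieces back into a page is left out of this sketch.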
- Test on a sample page from the
corpus of the works of Shakespeare.
- If you pick the same page as another student, I may get suspicious
and compare your code.
- What to hand up
(Note: show the HTML source before and after.)