Wikify a page

This can be done in about 50 lines of shell.

  1. Usage: wikify file.html > file.wiki.html
  2. "Wikifies" file.html and writes the result to stdout

  3. For every capitalised word Word (i.e. one that might be a proper noun) that is not already linked ...
    • See Parsing XML / HTML
    • Can find capitalised words with grep -o '[A-Z][a-z][a-z]*' (without -o, grep prints the whole matching line rather than just the word)
    • Can extract all links with something like:
      cat file.xhtml | xpath '//a[@href]'

  4. ... Link the word to http://en.wikipedia.org/wiki/Word
  5. (We could check if that URL exists, but I don't want this class practical to cause trouble for Wikipedia's servers, so we will not check here.)
  6. Only link the first occurrence of Word, not subsequent occurrences.

  7. Q. How do you avoid wikifying words inside tags such as the following?
    <title> Word Word Word </title>
    <a href=url> Word Word Word </a>

  8. Test on a sample page from the corpus of the works of Shakespeare.
  9. If you pick the same page as another student, I may get suspicious and compare your code.

  10. What to hand up (note: show the HTML source before and after).
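
Steps 3 to 6 above can be sketched roughly as follows. This is only a starting point under stated assumptions, not a finished solution: the function names candidates and link_first are made up for illustration, the tag stripping assumes no tag spans more than one line, and nothing here deals with the question in step 7 (words inside <title> or <a> are still picked up). The -o option is not in POSIX grep but is supported by GNU and BSD grep.

```shell
#!/bin/sh
# Sketch only: "candidates" and "link_first" are hypothetical helper names.

# Print each capitalised word in file $1 once, in order of first appearance.
candidates() {
    # Crude tag removal; assumes every tag opens and closes on one line.
    # LC_ALL=C keeps [A-Z] and [a-z] limited to ASCII letters.
    sed 's/<[^>]*>/ /g' "$1" |
        LC_ALL=C grep -o '[A-Z][a-z][a-z]*' |
        awk '!seen[$0]++'
}

# Read HTML on stdin and link only the first occurrence of word $1.
# index() does a plain substring search, so "Ham" would also match
# inside "Hamlet"; a real solution needs a word-boundary check.
link_first() {
    awk -v w="$1" -v base="http://en.wikipedia.org/wiki/" '
        !linked && (i = index($0, w)) {
            pre  = substr($0, 1, i - 1)
            post = substr($0, i + length(w))
            $0 = pre "<a href=\"" base w "\">" w "</a>" post
            linked = 1
        }
        { print }
    '
}
```

For example, link_first Hamlet < file.html prints file.html with the first "Hamlet" wrapped in a link and every later "Hamlet" left alone. Running link_first once per word from candidates gets close to the full program, but each pass can also match text inside links added by an earlier pass, so that still has to be handled.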