Dr. Mark Humphrys

School of Computing. Dublin City University.

Home      Blog      Teaching      Research      Contact


CA170      CA216      CA249      CA318      CA651

w2mind.computing.dcu.ie      w2mind.org

Wikify a page

Can be done in 50 lines of shell or so.

  1. Usage: wikify file.html > file.wiki.html
  2. "Wikifies" file.html, output to stdout

  3. For all capitalised (i.e. might be proper noun) and un-linked words Word ...
    • See Parsing XML / HTML
    • Can find capitalised word with grep '[A-Z][a-z]'
    • Can extract all links with something like:
      cat file.xhtml | xpath '//a[@href]'

  4. ... Link the word to http://en.wikipedia.org/wiki/Word
  5. (We could check if that URL exists, but I don't want this class practical to cause trouble for Wikipedia's servers, so we will not check here.)
  6. Only link the first occurrence of Word, not subsequent occurrences.

  7. Q. How do you avoid Wikifying words inside tags:
    <title> Word Word Word </title>
    <a href=url> Word Word Word </a>

  8. Test on a sample page from the corpus of the works of Shakespeare.
  9. If you pick the same page as another student, I may get suspicious and compare your code.

  10. What to hand up (Note show HTML source before and after).

Feeds      On Internet since 1987