Dr. Mark Humphrys

School of Computing. Dublin City University.


Wikify a page

Can be done in 50 lines of shell or so.

  1. Usage: wikify file.html > file.wiki.html
  2. "Wikifies" file.html and writes the result to stdout.

  3. For every capitalised (i.e. possibly a proper noun) and un-linked word Word ...
    • See Parsing XML / HTML
    • Can find lines containing a capitalised word with grep '[A-Z][a-z]'
    • Can extract all links with something like:
      cat file.xhtml | xpath '//a[@href]'

  4. ... Link the word to http://en.wikipedia.org/wiki/Word
  5. (We could check if that URL exists, but I don't want this class practical to cause trouble for Wikipedia's servers, so we will not check here.)
  6. Only link the first occurrence of Word, not subsequent occurrences.

  7. Q. How do you avoid Wikifying words inside tags:
    <title> Word Word Word </title>
    <a href=url> Word Word Word </a>

  8. Test on a sample page from the corpus of the works of Shakespeare.
  9. If you pick the same page as another student, I may get suspicious and compare your code.

  10. What to hand up (note: show the HTML source before and after).
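
As a starting point for steps 3 and 6, here is a minimal sketch that lists each capitalised word once, in order of first appearance. The function name candidate_words is hypothetical. It assumes a grep that supports -o and -w (GNU and BSD grep do), and that no tag is split across lines. It strips tags crudely, so it does not yet exclude words that are already linked.

```shell
# Hypothetical helper: print each capitalised word once,
# in order of first appearance. Reads the named files, or stdin.
candidate_words() {
    # 1. Replace every tag with a space (assumes no tag spans lines).
    # 2. Print whole words that are one capital followed by lowercase letters.
    # 3. Keep only the first occurrence of each word.
    sed 's/<[^>]*>/ /g' "$@" |
        LC_ALL=C grep -ow '[A-Z][a-z][a-z]*' |
        awk '!seen[$0]++'
}
```

Run as: candidate_words file.html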

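One possible approach to the question in step 7, sketched under stated assumptions rather than offered as the expected answer: walk through each line token by token (tag, word, or single character), keep a counter of how many <a> or <title> elements are currently open, and only link a capitalised word when that counter is zero and the word has not been linked before. The function name wikify is an assumption. The sketch assumes that no tag is split across lines, and it does not treat a word that is already linked elsewhere as "seen".

```shell
# Hypothetical sketch: link the first occurrence of each capitalised word,
# skipping text inside <a>...</a> and <title>...</title>.
wikify() {
    LC_ALL=C awk '
    {
        line = $0; out = ""
        while (length(line) > 0) {
            emit = ""
            if (match(line, /^<[^>]*>/)) {
                # A complete tag: copy it unchanged, track open <a>/<title>.
                tok = substr(line, 1, RLENGTH)
                low = tolower(tok)
                if (low ~ /^<(a|title)[ >]/) skip++
                else if (low ~ /^<\/(a|title)[ >]/ && skip > 0) skip--
            } else if (match(line, /^[A-Za-z]+/)) {
                # A run of letters: link it if capitalised, outside
                # <a>/<title>, and not linked earlier in the input.
                tok = substr(line, 1, RLENGTH)
                if (skip == 0 && tok ~ /^[A-Z][a-z]+$/ && !(tok in seen)) {
                    seen[tok] = 1
                    emit = "<a href=\"http://en.wikipedia.org/wiki/" tok "\">" tok "</a>"
                }
            } else {
                # Anything else: copy one character.
                tok = substr(line, 1, 1)
            }
            if (emit != "") out = out emit
            else out = out tok
            line = substr(line, length(tok) + 1)
        }
        print out
    }' "$@"
}
```

Run as: wikify file.html > file.wiki.html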

On the Internet since 1987.