CA217 Practical


The practical

Write a Java program to:
  1. Take a URL as a command-line argument.
  2. Report whether the URL exists.
  3. If it exists, download its content, examine all the pages it links to, and display broken links.
  4. Pages it links to can be found by searching for: <a .. href

  5. You might do your own parsing of the HTML, or you might like to use Swing.

  6. Produce an output file listing:
    1. All the links that failed.
    2. Details of why they failed (e.g. HTTP error code).
    3. For every link that your program claims is broken, test the link manually in a browser. If you can view the link, explain (in your printout) why your program thinks it is broken.
    Do not bother listing the links that worked.

  7. Count the number of broken links per host.
    Produce a chart of hosts, sorted by the number of broken links to each, largest first.
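
The per-host count in step 7 could be sketched as follows. This is only one possible approach, assuming Java 11 or later; the class name BrokenLinkChart, its method names, the "(unknown host)" bucket and the text bar-chart layout are all illustrative choices, not requirements of the practical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the name and layout are assumptions.
class BrokenLinkChart {

    // Counts broken links per host and returns the hosts sorted by
    // count (largest first), with ties broken alphabetically.
    static List<Map.Entry<String, Integer>> countByHost(List<String> brokenUrls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : brokenUrls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                host = null; // URL could not be parsed at all
            }
            if (host == null) {
                host = "(unknown host)"; // assumed bucket for mailto: etc.
            }
            counts.merge(host.toLowerCase(), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return rows;
    }

    // Renders the sorted rows as a simple text bar chart.
    static String render(List<Map.Entry<String, Integer>> rows) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> row : rows) {
            sb.append(String.format("%-40s %4d %s%n",
                    row.getKey(), row.getValue(), "#".repeat(row.getValue())));
        }
        return sb.toString();
    }
}
```

Host names are lower-cased before counting so that links to the same host written in different cases are grouped together.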


Test on these URLs:

Your final output should demonstrate your program working on these URLs:

http://computing.dcu.ie/~humphrys/ai.links.html
http://computing.dcu.ie/~humphrys/robot.links.html
http://computing.dcu.ie/~humphrys/evolution.links.html
http://computing.dcu.ie/~humphrys/computers.internet.links.html
http://computing.dcu.ie/~humphrys/news.links.html
http://humphrysfamilytree.com/links.html
http://humphrysfamilytree.com/sources.html

These pages may contain:

  1. Relative links
  2. Forms (check the ACTION= link)
  3. Image href links
  4. href followed by mailto, ftp, telnet, news or gopher (I do not expect you to check these; this is just a warning that they may crash your basic algorithm)
  5. Links to a host that no longer exists
  6. Links to a host that exists but no longer runs a web server
  7. Links to a host that exists but times out
  8. Links to a host that exists but no longer has the file
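
The failure cases in the list above could be told apart roughly as in the sketch below. It is not a definitive implementation. The class name LinkChecker, the 10-second timeout, the use of a HEAD request and the wording of the messages are all assumptions. The mapping from exceptions to causes is only approximate: a failed DNS lookup, for example, does not prove that the host is gone for good.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical helper class; names and messages are illustrative only.
class LinkChecker {
    private static final int TIMEOUT_MS = 10_000; // assumed timeout

    // True only for http/https links; mailto, ftp, telnet, news and
    // gopher links (and unparseable ones) are skipped.
    static boolean isHttpLink(String url) {
        try {
            String scheme = URI.create(url).getScheme();
            return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Maps a network exception to an approximate reason for the failure.
    static String describe(IOException e) {
        if (e instanceof UnknownHostException) return "host not found (DNS lookup failed)";
        if (e instanceof SocketTimeoutException) return "timed out";
        if (e instanceof ConnectException) return "could not connect (no web server?)";
        return "I/O error: " + e.getClass().getSimpleName();
    }

    // Returns null if the link works or is not an http(s) link,
    // otherwise a short description of why it failed.
    static String check(String url) {
        if (!isHttpLink(url)) return null;
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD"); // headers only, no body
            conn.setConnectTimeout(TIMEOUT_MS);
            conn.setReadTimeout(TIMEOUT_MS);
            int status = conn.getResponseCode();
            return status >= 400 ? "HTTP error " + status : null;
        } catch (IOException e) {
            return describe(e);
        } finally {
            if (conn != null) conn.disconnect();
        }
    }
}
```

Some servers answer HEAD differently from GET, so a link reported as broken here may still open in a browser. Cases like that are worth explaining in the printout.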

Hint: Get your program working on smaller pages first, before testing it on larger pages. Some of these pages are huge!
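
In the spirit of this hint, link extraction can be tried on a tiny in-memory page before any real download. The sketch below searches for <a .. href with a regular expression and resolves relative links against the page's own URL. The class name LinkExtractor and the example.com URLs are made up for illustration. A regular expression is only a rough substitute for a real HTML parser: it will, for instance, also find links inside HTML comments.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not part of the practical.
class LinkExtractor {
    // Matches the href value inside an <a ...> tag, whether it is
    // double-quoted, single-quoted or unquoted.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns every link found in the page, made absolute where possible.
    static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String raw = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2)
                       : m.group(3);
            try {
                links.add(base.resolve(raw.trim()).toString());
            } catch (IllegalArgumentException e) {
                links.add(raw); // malformed href: keep it as written
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"other.html\">one</a> <A HREF='/top.html'>two</A>";
        for (String link : extractLinks(html, "http://example.com/dir/page.html")) {
            System.out.println(link);
        }
    }
}
```

Links such as mailto: pass through unchanged, so they can be filtered out or handled separately before any HTTP check is attempted.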

You can discuss it with your peers on the CA217 discussion board.


If you're going for 90-100, do these:

  1. Links to a label on a page:
    <a href="file.html#label">
    Check if the label exists on that page.

  2. Image src links:
    <img src="url">
    Check if image file exists.

  3. Check mailto, ftp, telnet, news or gopher links, where possible.

  4. See notes on Sites that restrict scripts. Deal with this if you're going for maximum marks.
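
For item 1 above, one possible way to check a label is to take the part of the URL after the # and look for a matching id= or name= attribute in the downloaded page. The sketch below assumes the page has already been fetched into a String. The class name LabelChecker and its method names are hypothetical, and the regular expression is a rough approximation of real HTML parsing.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class LabelChecker {

    // Returns the label after '#', or null if the URL has none
    // or cannot be parsed.
    static String labelOf(String url) {
        try {
            return URI.create(url).getFragment();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True if some tag in the page carries id=label or name=label.
    // The attribute name is matched case-insensitively; the label
    // itself is matched exactly.
    static boolean hasLabel(String html, String label) {
        String q = Pattern.quote(label);
        Pattern p = Pattern.compile(
                "<[^>]*\\b(?i:id|name)\\s*=\\s*(?:\"" + q + "\"|'" + q + "'|"
                + q + "(?=[\\s>]))");
        return p.matcher(html).find();
    }
}
```

A link would then be reported as broken when labelOf returns a label but hasLabel is false for the target page.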


To hand up:

  1. A commented printout of your program.
    (Hint: Colour printout in landscape mode is normally the best way to print out code.)
  2. Your program on diskette or CD.
  3. A printout of the output when run on the URLs above.
  4. The output on diskette or CD.
  5. Use the Project Submission Form as the front page.