Link checker
Write a Java program to:
- Take a URL as a command-line argument.
- Examine all the pages it links to,
and display broken links.
- Pages it links to can be found by searching for:
<a .. href
- See Parsing XML / HTML
- We will define a "broken" link
as any link with an
HTTP return code
other than 200,
or a link that times out.
- For timeout settings try
here.
- Produce an output file listing, in a 3-column table:
- All broken links (return code other than 200, or timed out).
- Their HTTP return codes.
- A comment on why we do not get return code 200.
Is the link really broken?
Or just moved?
For every link that your program claims is broken,
test the link manually in a browser.
If you can view the link, explain
why your program thinks it is broken.
- I suggest the output file should be a web page that you can browse (offline)
and click on the allegedly broken links to check them in the browser.
- Do not bother listing the links that worked (return code 200).
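The core steps above (find the `<a .. href` links, fetch each one with a timeout, and treat anything other than 200 as broken) might be sketched as follows. This is only a rough starting point, not a reference solution. The class and method names (`LinkStatusSketch`, `extractHrefs`, `statusOf`, `isBroken`) are made up for illustration, returning -1 for a timeout or connection failure is an assumed convention, and the regex only handles quoted href values. Turning off automatic redirects is also an assumption, made so that a "moved" page shows up as 301/302 rather than silently becoming 200.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkStatusSketch {
    // Rough pattern for <a ... href="value"> or href='value'.
    // Not a full HTML parser: unquoted href values are missed.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the raw href values found in the page, in order. */
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2).trim());
        }
        return links;
    }

    /** Returns the HTTP return code, or -1 (assumed convention) on timeout or failure. */
    static int statusOf(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);   // time allowed to connect
            conn.setReadTimeout(timeoutMillis);      // time allowed waiting for data
            conn.setInstanceFollowRedirects(false);  // assumption: report 301/302 as-is
            int code = conn.getResponseCode();
            conn.disconnect();
            return code;
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            // Timeouts, unreachable hosts, malformed or non-HTTP URLs all end up here.
            return -1;
        }
    }

    /** "Broken" as defined above: any return code other than 200, or a timeout. */
    static boolean isBroken(int statusCode) {
        return statusCode != 200;
    }
}
```

A caller would loop over `extractHrefs(pageSource)`, call `statusOf(link, 10000)` on each absolute link, and keep only those where `isBroken` is true. The 10-second timeout is an arbitrary example value.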
Test on these URLs:
Your final output should demonstrate your
program working on these URLs:
http://humphrysfamilytree.com/links.html
http://humphrysfamilytree.com/sources.html
Hint: Get your program working on smaller pages first,
before testing it on larger pages.
These pages are huge!
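For the final output on these URLs, the browsable 3-column table suggested earlier could be produced along the following lines. This is a sketch under assumptions: `ReportSketch` and `Row` are hypothetical names, -1 is assumed to mean "timed out", and the `record` syntax needs Java 16 or later.

```java
import java.util.List;

final class ReportSketch {
    /** One table row: the broken link, its return code (-1 = timeout), and a comment. */
    record Row(String url, int status, String comment) {}

    /** Escapes characters that would otherwise break the HTML. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds a page with one clickable row per broken link. */
    static String toHtml(List<Row> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><body><table border=\"1\">\n");
        sb.append("<tr><th>Broken link</th><th>Return code</th><th>Comment</th></tr>\n");
        for (Row r : rows) {
            String u = escape(r.url());
            String code = r.status() == -1 ? "timeout" : String.valueOf(r.status());
            sb.append("<tr><td><a href=\"").append(u).append("\">").append(u)
              .append("</a></td><td>").append(code)
              .append("</td><td>").append(escape(r.comment()))
              .append("</td></tr>\n");
        }
        sb.append("</table></body></html>\n");
        return sb.toString();
    }
}
```

The returned string can be saved with `java.nio.file.Files.writeString(Path.of("broken.html"), html)` (Java 11 or later) and opened offline in a browser. The file name is only an example.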
For 100 percent, check these:
- Relative links, like:
<a href="subdir/file.html">
<a href="../index.html">
- Links to a location within a page, like:
<a href="#location">
<a href="file.html#location">
Check if the destination location (marked by name= or id=)
exists on that page.
- href links to files that are not web pages, like:
<a href="pic.jpg">
Check if file exists.
- Embedded image src links, like:
<img src="pic.jpg">
Check if image file exists.
- Forms
(check the ACTION= link)
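Relative links and `#location` links both need some processing before they can be checked. One possible approach, sketched below, uses `java.net.URI.resolve` to turn a relative href into an absolute one, and a rough regex to look for the `name=` or `id=` target in the destination page. The names `ResolveSketch`, `resolve`, `withoutFragment` and `hasAnchor` are hypothetical, and `example.com` is a placeholder host. The same resolving step would apply to `<img src=...>` and form `ACTION=` values.

```java
import java.net.URI;
import java.util.regex.Pattern;

final class ResolveSketch {
    /**
     * Resolves an href against the URL of the page it appeared on.
     * Throws IllegalArgumentException if the href is not valid URI syntax
     * (for example, if it contains spaces).
     */
    static URI resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim());
    }

    /** Drops "#location" so that the page itself can be fetched. */
    static String withoutFragment(URI link) {
        String s = link.toString();
        int hash = s.indexOf('#');
        return hash < 0 ? s : s.substring(0, hash);
    }

    /**
     * True if the HTML seems to contain name="location" or id="location".
     * Rough check only: it does not verify which tag the attribute is on.
     */
    static boolean hasAnchor(String html, String location) {
        Pattern p = Pattern.compile(
                "\\b(?:name|id)\\s*=\\s*[\"']?" + Pattern.quote(location)
                        + "(?=[\"'\\s>])",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(html).find();
    }
}
```

For a link such as `file.html#location`, the program would fetch `withoutFragment(link)`, and if that returns 200, pass the page source and `link.getFragment()` to `hasAnchor`.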
Ignore these:
- You may ignore href values that start with (common):
mailto
or (rare):
ftp, telnet, news, gopher.
Ignore all links to Google
- Google doesn't allow scripting of search results.
- Your program can ignore all Google searches of the form:
http://www.google.DOMAIN/search?ARGS
e.g. biscuits
These searches never break anyway.
If this search link once worked (i.e. is formatted correctly), it will always work.
- Ignore all links to Google's directory:
http://www.google.DOMAIN/Top/PATH
e.g. Biscuits
I need to delete all these.
- In fact, ignore all links to Google.
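The "ignore" rules above could be gathered into a single filter that runs before any link is fetched. The sketch below is one possible way to do it. `IgnoreSketch`, `shouldIgnore` and `isGoogleHost` are hypothetical names, and the host pattern is an approximation that assumes Google domains look like `google.com`, `www.google.ie` or `google.co.uk`.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

final class IgnoreSketch {
    // Schemes the assignment says may be skipped.
    private static final Set<String> SKIPPED_SCHEMES =
            Set.of("mailto", "ftp", "telnet", "news", "gopher");

    /** True if the link should be skipped rather than checked. */
    static boolean shouldIgnore(String href) {
        String h = href.trim().toLowerCase(Locale.ROOT);
        int colon = h.indexOf(':');
        if (colon > 0 && SKIPPED_SCHEMES.contains(h.substring(0, colon))) {
            return true;
        }
        try {
            String host = URI.create(h).getHost(); // null for relative links
            return host != null && isGoogleHost(host);
        } catch (IllegalArgumentException e) {
            return false; // unparseable: leave it for the checker to report
        }
    }

    /** Approximate test for google.TLD and *.google.TLD hosts (assumed shape). */
    static boolean isGoogleHost(String host) {
        return host.matches("(?:.*\\.)?google\\.[a-z]{2,3}(?:\\.[a-z]{2})?");
    }
}
```

With this in place, the main loop would call `shouldIgnore(href)` on each extracted link and skip it when the result is true.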
To hand up:
What to hand up
(Include a printout of the output when run on the URLs above.)