Dr. Mark Humphrys

School of Computing. Dublin City University.


Find overloaded pages

Find web pages (among local files on disk, not remote pages) that are overloaded with too many or too-large embedded images, i.e. pages that load slowly.


Write this script:
totalimg file.html
Add up total size of all embedded images in this HTML file.


Embedded images look like:
  <img src="filename">
  <img ....  src="filename">
  <img ....  src="filename" .... >
To test it we will run it on pages in my test suite:
cd /users/tutors/mhtest15/share/testsuite


For 40%

  1. grep the file for lines with an embedded image.
  2. Put newlines before and after every HTML tag.
  3. grep again for embedded images.
  4. Use grep to get rid of lines with 'http:'
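One way the four steps above might look as a single pipeline (a sketch only; it assumes GNU sed, which understands \n in the replacement text):

```shell
#!/bin/sh
# totalimg -- 40% stage sketch: list local embedded images in $1
grep -i '<img' "$1" |            # 1. lines containing an image tag
sed 's/</\n</g; s/>/>\n/g' |     # 2. newline before and after every tag
grep -i '<img' |                 # 3. image tags, now one per line
grep -v 'http:'                  # 4. drop remote (http:) images
```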

You should now just have a list of embedded local images, like this:

$ cd /users/tutors/mhtest15/share/testsuite/Cashel
$ totalimg george.html

<img border=0 src="../Icons/pdf.gif">
<img border=0 src="../Icons/pdf.gif">
<img src="Bitmaps/ric.crop.2.jpg">
<img src="../Icons/me.gif">
<img width="98%" src="../Kickham/08.Mullinahone/SA400010.small.JPG">
<img width="98%" src="../Kickham/08.Mullinahone/SA400028.small.JPG">
<img width="95%" src="07.Carlow.Stn/SA400069.lores.jpg">
<img border=1  width="95%" src="07.Carlow.Stn/SA400070.lores.adjust.jpg">

For 60%

Extract the image file names as follows.
  1. Use sed to delete everything from the start of the line up to and including src="
  2. Use sed to delete everything from the next " to the end of the line.
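The two sed commands might be sketched like this (one way of doing it; it assumes src=" appears at most once per line, which holds after the 40% stage has put each tag on its own line):

```shell
# extract the filename from a line like: <img width="98%" src="a/b.jpg">
sed 's/^.*src="//' |   # 1. delete from start-of-line through src="
sed 's/".*$//'         # 2. delete from the next " to end-of-line
```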
You should now have a better list of local images, like this:

$ totalimg george.html

../Icons/pdf.gif
../Icons/pdf.gif
Bitmaps/ric.crop.2.jpg
../Icons/me.gif
../Kickham/08.Mullinahone/SA400010.small.JPG
../Kickham/08.Mullinahone/SA400028.small.JPG
07.Carlow.Stn/SA400069.lores.jpg
07.Carlow.Stn/SA400070.lores.adjust.jpg
For 80%

  1. Pipe the previous into a second script total2 which will add up the file sizes.
  2. One issue is that some of the testsuite pages have broken links. For each file, we need to test that it exists before finding its file size.
  3. So we start off with total2 looking like this:

    while read file
    do
     if test -f "$file"
     then
      ls -l "$file"
     fi
    done

  4. Check this works before proceeding. Something like:

    $ totalimg george.html
    -rwxr-xr-x 1 mhtest15 tutors 426 Sep 17  2015 ../Icons/pdf.gif
    -rwxr-xr-x 1 mhtest15 tutors 426 Sep 17  2015 ../Icons/pdf.gif
    -rwxr-xr-x 1 mhtest15 tutors 39139 Sep 17  2015 Bitmaps/ric.crop.2.jpg
    -rwxr-xr-x 1 mhtest15 tutors 1005 Sep 17  2015 ../Icons/me.gif
    -rwxr-xr-x 1 mhtest15 tutors 339817 Sep 17  2015 07.Carlow.Stn/SA400069.lores.jpg
    -rwxr-xr-x 1 mhtest15 tutors 190968 Sep 17  2015 07.Carlow.Stn/SA400070.lores.adjust.jpg

  5. (Note we have removed the files that do not exist.)
  6. Comment out the ls
  7. To just print the file size (one per line, hence the \n), insert:
    stat --printf="%s\n" $file
  8. Check this works before proceeding. Something like:

    $ totalimg george.html
    426
    426
    39139
    1005
    339817
    190968
For 100%

  1. Pipe the above to a 3rd script which looks like this:

     while read size
     do
        (missing line)
     done
     echo "$TOTAL"

  2. The missing line uses Arithmetic in Shell to do:
    TOTAL = TOTAL + size.
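As a reminder, Arithmetic in Shell uses the $(( )) syntax, with no spaces around the = in the assignment. A generic example (not the missing line itself):

```shell
# shell arithmetic: $(( )) evaluates an integer expression
COUNT=0
COUNT=$((COUNT + 1))
COUNT=$((COUNT + 4))
echo "$COUNT"    # prints 5
```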
  3. When you have filled in the missing line, your program should work like this:

    $ totalimg george.html
    571781


Here are the outputs your script should produce for some other pages:
3274461  Cashel/bushfield.html
2515730  ORahilly/the.orahilly.note.html
1654649  ORahilly/ballylongford.html
