How to Archive Changed Web Pages with Bash

If you use cloud services, especially consumer sites, do you know what you signed up for?  And have you kept up on the changes in privacy and terms of service?  Lawyers relying on the Web for their law practice may want to know what those policies say.  I decided to see if there was a simple way to automate the process.

I’m not sure any lawyer would do this, but it would be interesting to have a collection of policy changes from major consumer Web sites that could be referenced to see what the policy looked like at a given point in time.  The scope of the project was this:

  • download a Web page using its URL
  • compare it with a previously downloaded copy of the same URL document
  • save a copy if the new copy is different from the previous copy
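
Boiled down to a single page, that plan is only a few lines (a rough sketch, with example.com and made-up file names standing in for the real thing):

#!/bin/bash
# rough one-page version: download, compare with the last copy, archive if different
curl -L --output new.html 'http://www.example.com/privacy'
if [ -e last.html ] && [ -n "$(diff new.html last.html)" ]
then
    cp new.html "archive-$(date +%m_%d_%Y).html"
fi
cp new.html last.html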

My solution uses bash on Ubuntu, as well as curl, cut, and diff.  The script (below, and my first shell script) saves changed files and can be scheduled to run on a regular basis.  Now that I have worked it out, I may end up using it like my IFTTT account, except for things for which there isn’t a channel.

Giving it a bash

Shell scripts are common on Linux-based machines.  I run Ubuntu 14.04 in Oracle’s Virtualbox on my Windows PC.  My initial choice was to learn how to write a shell script for Bash and then use a cron job to schedule the script to run daily.
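
For the scheduling half, a crontab entry along these lines runs a script every morning at 6:00.  The paths are placeholders for wherever the script and its folders actually live; the cd matters if, like mine, the script saves its files relative to the working directory.

# m h dom mon dow  command  (added with crontab -e)
0 6 * * * cd /home/user/policy_archive && /home/user/policy_archive/check_policies.sh >> /home/user/policy_check.log 2>&1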

I had heard of diff – a command line tool in Linux to compare two files – and eventually stumbled onto cURL, a utility that will download a URL (among many other things).   You can use wget as well but I was persuaded by the cURL developer’s rationale to use his program.

There are a number of people who have done similar things, mostly outputting the changes into an RSS or e-mail based alert, rather than capturing the changed document.  It took me a few attempts to figure out how to blend these ideas into my eventual solution.

As with any shell script, you first have to tell the computer where the interpreter (bash, python, etc.) is.

#!/bin/bash
# Script to download privacy policies, check for changes,
# and store copies by date as changes are made

I also included a comment (lines beginning with #) to describe what the script was.

The URL Array and Variables

One script I found that used the diff command assigned a variable to indicate whether this was the first time a file had been downloaded for comparison.  It set the first_time variable to 0 (not the first time).

first_time=0

[If I got this idea from you, my apologies for not citing it.  I saw this in someone’s script and duplicated it identically; let me know and I’ll link to you.  I just can’t find where I came across it.] 

This was where my script got a bit hairy.  I wanted to use an array for the URLs for the privacy pages.  The script would cycle through the array, downloading and checking each file in turn.

policy[0]='http://www.dropbox.com/privacy'
policy[1]='http://www.facebook.com/policy.php'
policy[2]='…'

This is the only part of the script that would change on a regular basis, as I add more URLs.  The next thing to create was the loop.  This took me a long time to figure out, mostly because I don’t know what I’m doing.  I started with a for i in … but that didn’t appear to grab the array elements properly.  Eventually I got to this:

for (( i = 0 ; i < ${#policy[@]} ; i++ )); do

with ${#policy[@]} expanding to the number of elements in the array.  If you look at the bash man page, it matters whether things are enclosed in single or double quotes (variables expand inside double quotes but not inside single ones).  What was true throughout this experience was that my use or lack of quotes was what typically broke the script.
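
For what it's worth, a loop over the elements themselves does work as long as the array expansion is double-quoted, so my guess is the quoting is what tripped me up.  Something like this (just a sketch of the alternative, not what my script below uses):

# alternative: loop over the array elements directly (the quotes around the expansion matter)
for polURL in "${policy[@]}"
do
    echo "$polURL"
done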

I created a variable called $site.  This is because I am creating a folder for each Web site, and saving the relevant files into that folder.  To do this, I used the cut command to grab the domain name from the URL in the element from the policy array.  This is pretty straightforward (now!): “-d” identifies the delimiter to split on and “-f” picks which field to keep.  You can see in mine that I am cutting at each “.” and keeping the second field.  If it was -f1, I’d get “http://www”; with -f2, I get “facebook”.

site=$( echo "${policy[$i]}" | cut -d. -f2 )
polURL=${policy[$i]}
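
To see what cut is actually doing, you can run it by hand on one of the URLs (output shown in the comments).  Note that this trick assumes the URL starts with www., which both of mine do:

echo 'http://www.facebook.com/policy.php' | cut -d. -f1   # http://www
echo 'http://www.facebook.com/policy.php' | cut -d. -f2   # facebook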

I also create a date variable so that I can save the files with the date on which they are downloaded.

now=$(date +"%m_%d_%Y")

These are lines to help me see what my variables are.  I have put a # in front to comment them out, but when I was first writing this, it helped in debugging.

#echo -e $site
#echo -e $polURL

Download with cURL

Now to see if I already had a policy from that site.  I reset first_time to 0 for each pass through the loop (otherwise one new URL would flag every site after it as a first run), and then the following test looks in the relevant folder for a last.html file.  If the file is missing, the variable is set to 1.

first_time=0
if [ ! -e "$site/last.html" ]
then
    first_time=1
fi

The next step was to grab the remote policy file.  I used curl and its --output (or -o) option to download a URL and save it to a file.  I was not able to get this to work by using the array element (${policy[@]}), which is why I created the variable above.  It may work and my problem could have been misuse of quote marks.

curl -L --output "$site.html" "$polURL"

You can use --silent to hide the curl activity but I found that I’d miss errors.  I also used the -v (verbose) switch for debugging.  I added the -L (location) switch because some of these sites were being redirected (301 redirects) and -L will cause curl to follow the redirect.
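
Once this runs unattended, the progress meter is just noise.  One compromise I have seen (though I have not battle-tested it here) is to combine --silent with --show-error so real errors still print, and add --fail so an HTTP error page comes back as a curl error instead of being archived:

# quiet download that still reports problems; --fail exits non-zero on HTTP 4xx/5xx errors
curl -L --silent --show-error --fail --output "$site.html" "$polURL"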

Compare Files, Keep the Changes

The next part looks to see if this is the first time the page has been downloaded.  If not, it sets a variable to the output of diff when the newly downloaded file is compared with the last.html in the relevant folder.  If there are changes, it writes a message to the screen and then copies the file over.  I make two copies: one saved as last.html for comparison the next time, and one archive copy named with the $site variable (facebook, dropbox, whatever) and the date $now variable together so the files are easy to distinguish.

if [ "$first_time" -ne 1 ]
then
    changes=$(diff -u "$site.html" "$site/last.html")
    if [ -n "$changes" ]
    then
        echo -e "There were changes.  Files being copied …\n"
        cp -v "$site.html" "$site/last.html"
        cp -v "$site.html" "$site/$site-$now.html"
    else
        echo -e "No changes."
    fi

If this is the first time the file has been downloaded (and the $first_time variable is set to 1), the script skips down here, writes a message to the screen, creates the folder if it does not already exist, and copies the file to last.html in it.

else
    echo "[First run] Archiving…"
    mkdir -p "$site"
    cp -v "$site.html" "$site/last.html"
fi

Finally, the loop is closed.

done

This seems to work, although I have only tested it over the last two days.  The proof is in the pudding.  I am getting the feeling that I do not have the right recipe for the diff part of this script.  It seems to find changes each time, so I will need to fine-tune this before it’s ready to put into production.
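
My guess is that these pages contain bits that change on every visit (timestamps, session or CSRF tokens, rotating ad markup), so no two downloads are ever byte-for-byte identical.  One thing to try is diff's -I option, which ignores lines matching a regular expression; the patterns below are only placeholders and would need to match whatever noise a particular site produces:

# ignore lines containing obvious per-visit noise when comparing
changes=$(diff -u -I 'csrf' -I 'timestamp' "$site.html" "$site/last.html")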

The Whole Script

Here’s the full script, in case it’s any use to anyone.  Feel free to make it better!

#!/bin/bash
# Script to download privacy policies, check for changes,
# and store copies by date as changes are made

first_time=0

policy[0]='http://www.dropbox.com/privacy'
policy[1]='http://www.facebook.com/policy.php'

for (( i = 0 ; i < ${#policy[@]} ; i++ )); do

# folder name, URL, and date stamp for this site
site=$( echo "${policy[$i]}" | cut -d. -f2 )
polURL=${policy[$i]}
now=$(date +"%m_%d_%Y")

#echo -e $site
#echo -e $polURL

# reset for each site so one new URL does not flag the rest as first runs
first_time=0
if [ ! -e "$site/last.html" ]
then
    first_time=1
fi

curl -L --output "$site.html" "$polURL"

if [ "$first_time" -ne 1 ]
then
    changes=$(diff -u "$site.html" "$site/last.html")
    if [ -n "$changes" ]
    then
        echo -e "There were changes.  Files being copied …\n"
        cp -v "$site.html" "$site/last.html"
        cp -v "$site.html" "$site/$site-$now.html"
    else
        echo -e "No changes."
    fi
else
    echo "[First run] Archiving…"
    mkdir -p "$site"
    cp -v "$site.html" "$site/last.html"
fi

done
