Liberate Google Wiki, Add to SharePoint

It was funny to be working on this project this morning as the news broke that Google’s platform-as-a-service cloud is now hosting Microsoft applications, including SharePoint.  Over the last 5 or 6 years, when I needed a wiki, I defaulted to Google Sites.  It has a few benefits if you need a quick and dirty collaborative platform:

  • It is free and you have 100 MB of storage;
  • People create Google accounts, so you don’t need to manage usernames and passwords – people manage their own accounts;
  • It’s pretty simple to use and navigate.

Google Sites is one of the few products in Google’s suite that you can’t export or download.  Fortunately, someone wrote a handy utility in Java to do just that.  It’s dead simple and works.

Download Your Site

I won’t belabor the instructions; they’re pretty simple.  Even so, the utility didn’t work for me at first, despite following them.  That made me wonder whether it worked at all.  Then I tried it from home and it worked fine, which makes me suspect that some network settings at work were filtering out the file transfer.  If you are getting an error about invalid user credentials and you’re pretty sure yours are correct, try it from a different location (or use a VPN to bypass whatever limits your network admins have placed on the network).

It downloads the Web pages as HTML.  If you had comments on the pages, those are downloaded as well.  Attachments stored on Google Sites are downloaded; files you link to offsite are not.  I don’t know if you can use a web whacker like HTTrack or a utility like wget to grab both on-site and off-site files.  I tried HTTrack at first but kept getting errors (which were ALSO probably due to trying it at work).

The HTML was pretty dirty.  I’m not sure why, but a lot of single and double quotes weren’t converted properly.  My guess is that the content was originally created from Microsoft Word files with “smart” quotes and the encoding finally fell apart as it was exported.
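One common cause of this symptom is UTF-8 text being read as Windows-1252 somewhere along the way, which turns a curly apostrophe into a three-character sequence starting with "â€".  Assuming that is what happened here (it may not be), a short Python sketch like the one below could repair the usual punctuation.  The function names are made up for the example, and the list of characters only covers quotes, dashes and the ellipsis.

```python
# Sketch: repair curly quotes that were garbled by reading UTF-8 bytes
# as Windows-1252. ASSUMPTION: this is the cause of the damage; if the
# export was mangled some other way, this will simply change nothing.

# Curly single/double quotes, en dash, em dash, ellipsis.
PUNCTUATION = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"


def garble(ch: str) -> str:
    """Return what `ch` looks like after the UTF-8-as-Windows-1252 mix-up."""
    out = []
    for b in ch.encode("utf-8"):
        byte = bytes([b])
        try:
            out.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x9D) are undefined in Windows-1252;
            # fall back to Latin-1, as lenient decoders commonly do.
            out.append(byte.decode("latin-1"))
    return "".join(out)


# Map each garbled sequence back to the character it came from.
REPAIRS = {garble(ch): ch for ch in PUNCTUATION}


def repair_quotes(text: str) -> str:
    """Replace known garbled sequences with the intended punctuation."""
    for bad, good in REPAIRS.items():
        text = text.replace(bad, good)
    return text
```

To try it on one exported page, read the file as UTF-8, pass the text through repair_quotes, and write the result to a new file so the original export stays untouched.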

At the end, you’ll have a folder full of all the HTML and attachments in your site.  In my case, I wanted to re-use one site and archive the others.  The other folders are now zipped up and I have deleted the sites from Google.

Migrating to SharePoint

The end goal for this Google Site was to get it into SharePoint.  I’d created a sub site to host the content.  The first thing I did was to drag and drop the entire exported content into the Document library of the sub site.  Note that if the exported Site content is in folders, you’ll need to go into each folder and drag the contents over.  In my case, this meant renaming a bunch of index.html files as I went along so they wouldn’t collide.  Not a big deal but something to plan on doing.
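If you have more folders than patience, the folder-by-folder renaming could be scripted before uploading.  The sketch below is only an illustration: it assumes the export is a tree of folders that each hold an index.html plus attachments, and the naming scheme (joining folder names with hyphens) is my own invention, not something the export utility or SharePoint requires.

```python
# Sketch: copy an exported site into one flat folder, giving every
# index.html a unique name based on the folder it came from.
# ASSUMPTION: the export layout is folders containing index.html files.
import shutil
from pathlib import Path


def flat_name(path: Path, root: Path) -> str:
    """Build a flat file name from a path inside the export folder."""
    rel = path.relative_to(root)
    if rel.name.lower() == "index.html" and rel.parent != Path("."):
        # policies/holidays/index.html -> policies-holidays.html
        return "-".join(rel.parent.parts) + ".html"
    # Everything else keeps its name, prefixed with its folder names.
    return "-".join(rel.parts)


def flatten(root: Path, dest: Path) -> list[str]:
    """Copy every file under `root` into `dest` (outside `root`)."""
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        name = flat_name(path, root)
        target = dest / name
        if target.exists():
            # Refuse to overwrite rather than silently lose a page.
            raise FileExistsError(f"{name} would be written twice")
        shutil.copy2(path, target)
        copied.append(name)
    return copied
```

Calling flatten(Path("export"), Path("flat")) would leave the original export alone and fill a separate folder you can drag into the Document library in one go.  Attachments in sub folders get renamed too, so any links to them in the pages would need updating.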

This is fine as far as it goes.  Unfortunately, SharePoint does not serve up HTML files from the Document Library as Web pages; you’re prompted to download them instead.  This is weird behavior, since SharePoint will open PDFs and other files in the Word Web App.  My suspicion is that we need to configure something on our server to handle HTML pages.

SharePoint has a wiki app.  I added one to the sub site and then just cut and pasted the HTML over into new wiki pages as I went.  Since I’d uploaded the attachments (PDFs and Excel spreadsheets), it was easy to replace the old links with links to the content now residing within SharePoint.
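Swapping the old attachment links by hand works for a handful of pages.  For a larger set, the same substitution could be scripted before pasting.  The sketch below is only an illustration: the library URL is a made-up placeholder, the extensions come from the attachment types mentioned above, and it assumes the exported pages link to attachments with relative, double-quoted href values.

```python
# Sketch: point attachment links in an exported page at a SharePoint
# document library. The URL below is a HYPOTHETICAL placeholder.
import re
from urllib.parse import quote, unquote

LIBRARY_URL = "https://sharepoint.example.com/sites/team/Shared%20Documents"
ATTACHMENT_EXTENSIONS = (".pdf", ".xls", ".xlsx")

# ASSUMPTION: links are written as href="..." with double quotes.
HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def relink(html: str, library_url: str = LIBRARY_URL) -> str:
    """Rewrite relative links to attachments; leave all other links alone."""

    def swap(match: re.Match) -> str:
        target = match.group(1)
        if target.lower().startswith(("http://", "https://", "//")):
            return match.group(0)  # offsite file, never downloaded
        if not target.lower().endswith(ATTACHMENT_EXTENSIONS):
            return match.group(0)  # ordinary page link
        filename = target.rsplit("/", 1)[-1]
        # Decode first so an already-encoded name is not encoded twice.
        return f'href="{library_url}/{quote(unquote(filename))}"'

    return HREF.sub(swap, html)
```

A regular expression is a blunt tool for HTML, so it is worth checking a page or two by eye before trusting the output.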

I’d also looked at converting the .HTML pages into .ASPX pages, since I could see the original Home.aspx and How to use this library.aspx files created when the SharePoint wiki app is activated.  My initial assumption was that these files were somehow holding the respective content.  Not so.  I’m sure someone with deeper programming chops could have stripped off the HTML tags and ingested the remaining content into the wiki, but for the 20 or so pages I had, it was just as simple to do it manually.
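For anyone curious what stripping off the HTML tags might look like, here is a minimal sketch using Python’s standard-library html.parser.  It only reduces an exported page to plain text, one line per block element; it does not attempt to load anything into SharePoint, and the class and function names are invented for the example.

```python
# Sketch: reduce an exported HTML page to plain text.
# This covers the "strip the tags" half only, not the SharePoint half.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    SKIP = {"head", "script", "style"}  # content we never want
    BLOCK = {"p", "div", "br", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}  # these end a line

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of `html`, one line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

This throws away links and formatting along with the tags, which is part of why pasting by hand was the better trade for 20 pages.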

This was similar to the experience I had when migrating Plone content over to WordPress.  You end up with a pretty flat set of HTML that then needs to be manipulated.  There are masses of migration vendors out there who can figure out how to automate it if you’ve got a big whack of content.  The benefit of doing it yourself is that you can weed out the unnecessary content during the migration rather than just automating retention of rubbish.

David Whelan

I improve information access and lead information teams. My books on finding information and managing it and practicing law using cloud computing reflect my interest in information management, technology, law practice, and legal research. I've been a library director in Canada and the US, as well as directing the American Bar Association's Legal Technology Resource Center. I speak and write frequently on information, technology, law library, and law practice issues.