From Word to Epub Requires Clean Up

One of the things I would eventually like to do is create an easy publishing route, from author to finished legal information, for lawyers who want to share content through the library.  Law library as publisher, whether a courthouse law library working with practitioners or academics working with their law librarians.  Blogs would be an interesting part of that (although blogs were dead in 2012, 2014, 2016, 2018, …).  But legal professionals create content in many ways, and for long-form documents it is most often in Microsoft Word.  It would be useful to be able to get them from the document they created to a simple ebook.

Lawyers who blog can take advantage of tools like Pressbooks and Anthologize, if they use the WordPress content management system.  That’s the ideal, I think.  The content doesn’t need to be touched a second time, you just use the framework or tool to organize it, and export an epub format file.

Three Easy Steps

The steps for getting from Word to epub are actually pretty simple but there are small hurdles that make it hard to sell.  In essence, you:

  1. Save your Word document as an HTML file, the language of the web
  2. Open your HTML file in an ebook editor like the free, open source Sigil
  3. Make minor alterations (add metadata like title, author, copyright, published date; add a table of contents) and save

That’s it.

Except, as I found out recently, it’s not.

Word to HTML is Weak

Microsoft Word’s export to HTML has been a loathsome thing for as long as I can remember.  In order to maintain the look of your Word document when you view it as a brand new web page, it includes a massive amount of extra code.  These styles and other inserts were sometimes longer than my original document was.  Many content management systems have purpose-built workarounds for this very problem.

Let’s start with a simple Word document.  As Michael Flanders said, “the rain in Spain stays mainly in the hills.”

Save the Word document (this one is a docx) into a web page format.  Microsoft Word has a number of them.  I would recommend the filtered web page.  It’s the least worst option.

The default web page format, unfiltered, ended up generating 756 lines of code.

The result when you convert a single line Microsoft Word document to a web page.

The filtered version strips out a lot of that junk, leaving just styles and a bit of other garbage.  That is about 706 fewer lines of code, which leaves 50.

The result when you convert a single line Microsoft Word document to a filtered web page.

Now to do the final clean up.

Not Just What is Obvious

The problem is two-fold.  First, you have to get rid of nearly everything but the text.  Then you need to revisit the text to see if anything else snuck in.

I hadn’t really thought about this much until I was playing with Microsoft Edge’s e-book reader features.  Most books I opened were fine, but one or two would display the cover only.  The slider at the bottom showed that it registered more content (I was at 1%) but you couldn’t get to it.  It turns out that the reader function couldn’t handle certain types of extra characters; a typical example is that it can’t handle diacritics.  In the end, rather than trying to debug to see which specific characters were at fault, I just decided to remove all unnecessary HTML and non-ASCII characters.

If you are opening your HTML file in Sigil, it will look like a Word document – WYSIWYG.  Look for the button with tag endings on it ( < > ) to toggle to the code view.  This is where you can do the cleanup.

The first step was to remove the extra HTML and the style sheet that run from the top of the document.

Filtered HTML web page before clean up.

When I was finished, I was down from 50 lines to 13.  I over-deleted, naturally, and lost the <html> line that is at line 5 in this next screenshot.  That will give you an XML error that there is extra content at the end of the document.  Putting the HTML tag back in made the file display fine.
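
As a rough sketch of how that manual cleanup could be scripted, the Python below strips the style block and Word's class, style, and lang attributes, then checks whether what is left still parses as XML.  The function names and the sample markup are made up for illustration.  The check assumes the file is already XHTML, which is what an epub needs; raw Word output, with its unquoted attributes, would not pass it.  The exact wording of the parser error will differ from one tool to another.

```python
import re
import xml.etree.ElementTree as ET

# The <style>...</style> block Word puts at the top of the page.
STYLE_BLOCK = re.compile(r"<style\b.*?</style>", re.IGNORECASE | re.DOTALL)
# Any single tag, so attributes are only touched inside markup, not in the text.
TAG = re.compile(r"<[^<>]+>")
# Assumption: class, style and lang are the attributes worth dropping.
WORD_ATTRS = re.compile(
    r"\s+(?:class|style|lang)=(?:\"[^\"]*\"|'[^']*'|[^\s>]+)", re.IGNORECASE
)


def strip_word_styles(html: str) -> str:
    """Remove the style sheet and Word's presentational attributes."""
    html = STYLE_BLOCK.sub("", html)
    return TAG.sub(lambda m: WORD_ATTRS.sub("", m.group(0)), html)


def xml_error(markup: str):
    """Return the parser's message if the markup is not well-formed, else None."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Hypothetical one-paragraph page standing in for the filtered export.
page = (
    "<html><head><style>p.MsoNormal {margin:0}</style></head>"
    "<body><p class=MsoNormal>Rain</p></body></html>"
)
cleaned = strip_word_styles(page)
print(cleaned)
print(xml_error(cleaned))                        # well-formed, so None
print(xml_error(cleaned.replace("<html>", "")))  # over-deleted, so an error message
```

Running the check after each round of deleting would catch an over-deletion like my missing <html> tag before the file ever reaches an ebook reader.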

But what about those quotation marks?

They’re not going to work either.  Compare the ones in line 11, around the text, with the code ones in line 5.  Not the same.  I’m wondering what other characters like that make the transition.

A find-and-replace is easy but you need to highlight the offending character (there are two different quotation marks here) and replace both with one typed directly from the keyboard.  Once these quotes were replaced with straight quotes, I saved the file again and it worked in all of the ebook readers, including Microsoft Edge.
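
The same find-and-replace can be sketched in a few lines of Python.  This is only an illustration: the function name is invented, and I am assuming the single curly quotes and apostrophes should be straightened along with the two double quotation marks I actually ran into.

```python
# Curly characters mapped to their plain keyboard equivalents.
# Assumption: single quotes/apostrophes get the same treatment as double quotes.
SMART_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
})


def straighten_quotes(text: str) -> str:
    """Replace curly quotation marks with straight ones."""
    return text.translate(SMART_TO_STRAIGHT)


print(straighten_quotes("\u201cthe rain in Spain stays mainly in the hills.\u201d"))
```

Unlike the editor's find-and-replace, this handles both quotation marks in one pass, and other characters can be added to the mapping as they turn up.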

Back to the beginning:  how do you create a simple process that catches these small steps so that a lawyer can write a brief or other document to share, save it out to HTML, and generate an epub with a tool like Sigil?  The most obvious answer is to just cut and paste the text over – no conversion – but then you lose the base HTML that will show where a paragraph or chapter heading is.  If the lawyer uses Word styles properly, that can be a huge time saver on a long document.

I’m not sure.  The second most obvious answer is to hire a third-party publisher – like Pressbooks or someone – to host the content and automagically fix these sorts of things.  I expect there is a way to run a script in the background, at the command line level, to go through a file and convert all odd characters, if you know what they are.
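
The "if you know what they are" part could be handled by a script that lists the odd characters first.  Here is a minimal sketch in Python, assuming "odd" simply means anything outside plain ASCII; the function name is hypothetical and this is one possible approach, not a tested workflow.

```python
import unicodedata
from collections import Counter


def odd_characters(text: str):
    """List each non-ASCII character as (code point, Unicode name, count)."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


# Example: curly quotes plus a diacritic, the kinds of characters that caused trouble.
for code, name, count in odd_characters("\u201cSe\u00f1or\u201d"):
    print(code, name, count)
```

To run it against a saved web page you would read the file in first, and you would need to tell Python which encoding Word used when it saved the file, or the characters will come through garbled.  The report then tells you which replacements to add to a conversion step.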

A puzzle to work on.



David Whelan
I improve information access and lead information teams. My books on finding information and managing it and practicing law using cloud computing reflect my interest in information management, technology, law practice, and legal research. I've been a library director in Canada and the US, as well as directing the American Bar Association's Legal Technology Resource Center. I speak and write frequently on information, technology, law library, and law practice issues.