Getting Online Content into Usable Text

Most online legal information is text: case law, statutes, blog posts, whatever. But sometimes that text isn’t really text, and that can make it hard to manage once you have found it. Take a PDF, for example. If it was made from a Microsoft Word document, chances are it has retained its text, so you can index and search it. But if someone created the PDF from a scanned image, even one that looks like text when you read it, you might find it difficult to retrieve later. When a PDF contains pictures of text rather than text itself, you can’t keyword search it with Google Desktop, X1, or your other desktop search tools.
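If you want a rough way to tell the two kinds of PDF apart before you file one away, here is a minimal sketch. It rests on an assumption of mine, not anything a search tool documents: a PDF with real text usually declares at least one font, while a scan usually holds image objects and no fonts. The function names are made up for illustration, and the check can guess wrong on newer PDFs that compress their internal object tables, so treat the answer as a hint.

```python
import re


def looks_like_scanned_pdf(data: bytes) -> bool:
    """Guess whether raw PDF bytes are an image-only scan with no text layer.

    Heuristic only: it looks for font and image markers in the raw file.
    PDFs that store their object dictionaries in compressed streams can
    hide both markers, in which case this returns False.
    """
    has_font = re.search(rb"/Font\b", data) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None
    return has_image and not has_font


def file_looks_scanned(path: str) -> bool:
    """Read a PDF from disk and apply the same heuristic."""
    with open(path, "rb") as handle:
        return looks_like_scanned_pdf(handle.read())
```

Calling file_looks_scanned on a downloaded PDF gives you a quick yes or no on whether it is worth running through OCR before you index it.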

One way around this is to use an online OCR tool to run the scanned image through optical character recognition. This will convert the pictured text in the document into real text that can be searched later. Here’s an example. A document from the Ontario Superior Court of Justice was scanned into PDF, yielding a picture of a text file. First, you download the file to your computer, and then you go to a site like Free Online OCR, highlighted by MakeUseOf. You choose the file you downloaded as the file to convert, select the output format (you can even put it BACK into PDF when finished), and click the convert button.

Here’s a quick screencast of how that works:

There are other free online OCR sites that you can find with a quick Web search. Some use file limitations to throttle usage, so you may want to hunt around if you end up needing to OCR more than 15-20 documents per hour on a regular basis. As I mention in the screencast, you will be uploading these files during the conversion process. If you are not comfortable having the files hosted on a remote server and out of your control, you may want to look for OCR software to install. But since much of what you find on the Internet during your research will be public knowledge, this shouldn’t impact your use of free OCR resources.
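If you do go the installed-software route, one possibility is to script it. The sketch below assumes two free command-line tools that this post does not otherwise cover: pdftoppm (part of Poppler), to turn each PDF page into an image, and Tesseract, to OCR each image into a searchable PDF. The function names and the "page" file prefix are my own choices for illustration, and nothing leaves your computer.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, image_prefix, dpi=300):
    """Build the pdftoppm command that writes one PNG per PDF page."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def ocr_command(image_path, output_base, output_format="pdf", language="eng"):
    """Build the tesseract command for one page image ('pdf' or 'txt' output)."""
    if output_format not in ("pdf", "txt"):
        raise ValueError("output_format must be 'pdf' or 'txt'")
    return ["tesseract", str(image_path), str(output_base), "-l", language, output_format]


def ocr_scanned_pdf(pdf_path, work_dir):
    """OCR a scanned PDF locally; returns one searchable PDF path per page."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # pdftoppm writes page-<number>.png files next to the given prefix.
    subprocess.run(rasterize_command(pdf_path, work_dir / "page"), check=True)
    outputs = []
    for image in sorted(work_dir.glob("page-*.png")):
        base = image.with_suffix("")  # tesseract appends the extension itself
        subprocess.run(ocr_command(image, base), check=True)
        outputs.append(base.with_suffix(".pdf"))
    return outputs
```

With both tools installed, ocr_scanned_pdf("ruling.pdf", "ocr_work") would leave one searchable PDF per page in the ocr_work folder; the file name here is only a placeholder.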

If you use Google Docs, OCR is built in. Click on the Upload… button and you will be prompted to select files. Select the option to convert text from PDF or image and select your files. As Google uploads the files to your Google Docs account, it will perform OCR on them. It’s a great alternative to the other services, since your files end up in your file system as soon as the upload completes.
