The phone call was the third time in the last 6 months or so that I’d received the question: why was a document appearing as a Google search result when it had been deleted? The short answer is: because Google can still see it. The longer answer is: I have no idea, let’s take a look.
Call 1: HTML page deleted
Let me clarify that these weren’t all co-workers. Sometimes I get calls from colleagues in other organizations who are flummoxed by something. Sometimes I can help, sometimes not. In this case, the person had archived a web page from their content management system but a search on Google kept returning a result. Their CMS lived on two servers – one for staging content, and one for live or production access.
At first, it looked as though the file hadn’t been properly archived. Then we looked at whether it was a synchronization issue between the production and staging servers. It wasn’t really clear and in the end, the easiest solution was to delete the file rather than archive it. This forced whatever process was missing to kick in, and Google could no longer find the page.
Process point: what this first call raised was a lack of understanding of who was responsible for what. If the content owners manage their own content, then they can handle the deletion and archiving. What I heard during this call was that, because it was a Google search, somehow either (a) Google needed to fix it or (b) the corporate Web search application owner needed to fix it. But both of these search tools are indexing content that is published. In the case of (b), it might have been possible that their search tool had a corrupted index but I haven’t found that to be the case yet with Google. To paraphrase Blue Lou Marini, the page is there on purpose.
Call 2: CEO’s Signature on Google Images
I’m not a fan of publishing image files of my signature. I know they’re a way to sign documents digitally, and I’ll use them then, but there’s no need for them to be sitting on the Web. Apparently it’s a bit of a thing to include a CEO’s signature at the end of marketing e-mails. In this case, as the caller explained, someone had sent out a fundraising e-mail, published the e-mail’s content to the Web site as part of their marketing process, and now the CEO’s signature was appearing in Google Images.
Process point: if you don’t want people to see or read something, don’t publish it on the Web. Before you publish it, think about what you’re publishing and why. If a CEO’s signature is really necessary to make a personal touch, why not use a marketing-only one (say, of the CEO’s first name only)?
The organization had contacted Google to have it removed but had not heard back. No surprises there. Google Webmaster Tools (free) include a method of removing URLs from the index. But Google is clear that, if the page is still available, it won’t be removed. You need to do some legwork first.
That’s where this organization got hung up. Upon realizing that the signature image was indexed, they deleted the image from the Web page. Problem solved, or so they thought.
The Web page had been created by cutting and pasting from an e-mail. The e-mail had been sent out from an e-mail marketing blast service. Not uncommonly, when the HTML e-mail was cut and pasted, the URLs in the e-mail were hard-coded in the message. So the source URL for the image was not for the organization’s Web site, it was for the e-mail marketing Web site. Google was still able to reach that URL, even though it had been deleted from the Web page, so it continued to index the signature file.
Process point: it’s worth knowing where your content resides. In most cases, you want to have just one copy of a thing. In this case, if they had managed their images on their own Web site, they could have linked to that image from the e-mail, rather than the other way around. Also, it helps for someone to know what HTML looks like.
This issue was easy to discover by:
- going to Google Images and typing in the CEO’s name. In this case, the image didn’t use the CEO’s name as its title attribute, but Google coughed up any images adjacent to the CEO’s name in text. It also retrieved annual reports and other documents that the staff hadn’t been aware ALSO published the CEO’s signature
- clicking on the image
- clicking on “view image”
- noticing that the URL where the image lived was not the organization’s
- looking for that URL in the source HTML of the Web page on the organization’s Web site by going to the page, right-clicking on it, selecting View Source from the menu, and then using CTRL-F to find the URL (or part of it)
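The last two steps can also be done with a short script instead of View Source. What follows is a minimal sketch, not the method used on the call: it reads a page's HTML, collects every image source, and reports the ones hosted somewhere other than the page's own site. The function name, the sample HTML, and the example.org and example.net hostnames are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageSourceCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def offsite_images(html, page_url):
    """Return absolute image URLs whose host differs from the page's host."""
    collector = ImageSourceCollector()
    collector.feed(html)
    page_host = urlparse(page_url).hostname
    found = []
    for src in collector.sources:
        absolute = urljoin(page_url, src)  # resolves relative paths
        if urlparse(absolute).hostname != page_host:
            found.append(absolute)
    return found


# Hypothetical page content and hostnames, for illustration only.
sample_html = """
<p>Thank you for your support.</p>
<img src="/images/logo.png" alt="Logo">
<img src="https://mail.example.net/assets/signature.png" alt="">
"""
print(offsite_images(sample_html, "https://www.example.org/appeal.html"))
```

The comparison is on the exact hostname, so an image served from a different subdomain of your own site would be reported too. That is probably what you want when the question is "where does this file actually live?"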
Since the staff had already removed it from the Web page, I looked at a cached version of the original Web page and saw the HTML pointing to the image. I’m not sure how this worked, but I think that, because Google could see the page AND the image, even though they were no longer linked, it kept both in the index. Once the staff removed the image from their e-mail marketing account, it dropped off Google’s Image index.
Call 3: PDF Deleted
PDFs are the bane of my Web existence. Organizations tend to overuse them if they don’t really understand how to use the Web themselves, or how their users use their Web site. In this case, someone called about a PDF that was appearing in Google’s search results. It had private information in it and needed to be removed immediately.
Process point: you can only manage the systems that you manage. If something has made it into a Google search index, you may need to manage expectations as to when it can be removed. As I said earlier, don’t publish what you don’t want read. It’s easier to make that choice up front than to have to get something off the Internet quickly later.
“Web site” means different things to different people. In many cases, the fact that it appears in a Web browser means it’s “on the Web site”. But there are different servers involved in Web sites and, on those servers, different applications performing different functions. In this case, the content management system could handle PDFs in multiple ways. In some cases they were stored in the database of the CMS and in others they were stored in the file system. This may sound dumb but there were potentially logical reasons to put a file in one place or another.
Similar to the CEO’s signature, the PDF lived somewhere different from where the content owner thought it was. This isn’t unusual: on a recent inventory of a subset of Web content, I came across 4 versions of a single practice rule. The content owner had been unaware that, each time a new version of the rule was added to the site, the previous files were neither overwritten nor deleted, so they all remained accessible. Lesson learned (and this content all migrated to HTML). Once the PDF was removed from the file server on which it was sitting, Google dropped it from the index.
Your Problem, Not Google’s
One of the things you get used to, when working on Web technology, is that anything that fails in the Web browser may be blamed on the Web site. Sometimes the Web app or site really isn’t working but then you have to parse out which part – the Web server, the CMS behind it, the synchronization between servers, and so on. From what I can tell, if your content is in Google’s index, it’s because it’s accessible to Google’s crawler. To remove the search result, rather than calling or e-mailing Google, you’ll probably need to do some legwork on your own end. Is your content live somewhere? Is it somewhere unexpected? Are you caching a copy on your network (like in Varnish or something) that isn’t reflecting what you have published or deleted?
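A quick way to answer the first of those questions is to ask the server directly, without a browser or a login. This is a rough sketch using only Python's standard library: fetch_status sends a HEAD request and describe turns the status code into a plain-language verdict. Both function names and the wording of the verdicts are my own, and the URL in the comment is hypothetical. It says nothing about what Google has cached, only what an anonymous visitor gets back right now.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code an anonymous HEAD request receives."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses arrive as exceptions


def describe(status):
    """Translate a status code into a plain-language verdict."""
    if status in (404, 410):
        return "gone: a crawler should drop it over time"
    if status in (401, 403):
        return "blocked: the server refuses anonymous requests"
    if 200 <= status < 300:
        return "still live: a crawler can fetch it"
    return "unclear: check by hand"


# Verdicts for a few common codes; no network access needed for this part.
for code in (200, 404, 500):
    print(code, describe(code))

# To test a real address (hypothetical URL shown), uncomment:
# print(describe(fetch_status("https://www.example.org/files/old-rule.pdf")))
```

Run it against every address where the file might live: the production server, the staging server, the file server, and any third-party service that hosted a copy. Redirects are followed automatically, so the code you see is for wherever the request ends up. Connection failures are not caught here and will raise an error.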
It’s not alchemy, just plodding detective work. I was actually surprised to have received the first, let alone all three, of these calls. One forgets that this might not be common knowledge, or perhaps there are just more people able to do things on Web sites, thanks to tools like content management systems, who may not have a grasp of the knock-on effects of publishing content.