

Searching Scanned Documents Made Possible

Saturday, November 1st, 2008

Google is in the process of converting scanned documents stored in PDF files into digital text, using optical character recognition (OCR) technology. Why does Google do this, and what is the advantage of such a conversion? A scanned page saved in a PDF is treated as an image, so its content cannot be searched. We can search for and find the document itself, but we cannot search within its content. Once the scanned pages are converted to digital text, we can search them like any other HTML document.
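To illustrate why extracted text matters for search, here is a minimal Python sketch of a toy inverted index. It is not Google's implementation: the document names, the `tokenize`, `build_index` and `search` helpers, and the sample text are all hypothetical. A scanned PDF with no extracted text contributes no words to the index, so it can never match a content query, while the same file after OCR becomes findable.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it.

    `documents` maps a name to its extracted text, or to None when
    the file is an image-only scan with no text available.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        if text is None:
            continue  # nothing to index: the scan is just pixels
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical corpus: the same scan before and after OCR.
before_ocr = {"ruling.pdf": None, "page.html": "court calendar for november"}
after_ocr = {
    "ruling.pdf": "The court finds in favor of the plaintiff.",
    "page.html": "court calendar for november",
}

print(sorted(search(build_index(before_ocr), "court")))  # ['page.html']
print(sorted(search(build_index(after_ocr), "court")))   # ['page.html', 'ruling.pdf']
```

Before OCR, only the HTML page matches the query; after OCR, the scanned ruling appears in the results as well.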

In a blog post, Google Product Manager Evin Levey said, “In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document – so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format.”

Google will include the converted digital text in its search results. We can access the text version of a scanned PDF file by clicking the “View as HTML” link. This is not yet possible in Yahoo, where clicking “View as HTML” shows only blank pages. The same is true of Microsoft’s Live Search, AltaVista, and Ask.com: you will see only blank pages.

With this conversion, Google’s index expands dramatically, since each scanned image can be converted into thousands of words. Through this initiative, more information is now within our reach on the internet, taking us a step further toward finding useful information through internet searches.

How does Google deal with documents that contain graphic elements? Google converts the textual part of the scanned images into digital text through OCR, but whenever it comes across a graphic element, it is programmed to omit it. Graphic elements are not inserted into the HTML version, so for now we can see only the textual part of the scanned document. It is expected, however, that Google will eventually add this feature and insert graphic elements into the HTML view at the appropriate locations. Though this is a good step forward in accessing and searching the information in scanned images, the omission of images still leaves it incomplete.
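The text-only behaviour can be sketched in Python as follows. This is only an illustration under assumed data structures: the `(kind, content)` tuples, the `render_html_view` function, and the sample page are hypothetical and do not reflect how Google actually represents OCR output.

```python
from html import escape


def render_html_view(elements):
    """Build a text-only HTML view from a list of page elements.

    Each element is a (kind, content) tuple, where kind is either
    "text" or "graphic". Graphic elements are skipped entirely.
    """
    parts = []
    for kind, content in elements:
        if kind != "text":
            continue  # graphics are omitted from the HTML view
        parts.append("<p>" + escape(content) + "</p>")
    return "\n".join(parts)


# Hypothetical OCR result for one scanned page.
page = [
    ("text", "Quarterly results & outlook"),
    ("graphic", "chart-1"),
    ("text", "Revenue grew in every region."),
]

print(render_html_view(page))
```

The chart between the two paragraphs leaves no trace in the output, which is the gap described above: the reader of the HTML view gets the words but not the figure they refer to.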

Now that Google can convert scanned images into digital text, many new uses of this effort will emerge. One immediate use is making court documents accessible through Google. It will also make public government documents easily accessible through Google search.
