Chris Shearer Cooper wrote:
> I have several PDF files, from various companies, which are scans of
> historical books which have had some OCR (optical character recognition)
> done on them. The result is a semi-searchable book, but of course OCR is an
> imperfect science, so the resulting text is kind of flaky, and the search
> capabilities provided by the standard Adobe Reader aren't really up to the
> task of figuring out what the text is "supposed" to be.
>
> For example - I'm searching for Daniel Cooper, and where that text appears
> in the original document, the OCR gives me
> Danjek Cooqer
> Daniel Coo er
> Danjel Coopes
>
> Most of these PDF files are "locked" (meaning you can't easily extract the
> images to run them through another OCR program) and in any case the scanned
> images aren't of high enough resolution that another OCR program would do
> any better. To be fair, the original books are often not in great shape, so
> it's not the fault of the scanner or the company providing the PDFs that the
> converted text isn't perfect.
>
> Is there a program that can do searches on PDF files, that (1) knows a
> little about the mistakes OCR software commonly makes, or (2) lets you
> specify the text you're searching for with a "fuzziness" factor, so it catches
> things similar to the searched-for text?
>
> Thanks and Happy Holidays,
> Chris

Do you have MS Office, or a computer that has an introductory version of MS Office? On my HP there is a demo version that includes Microsoft Office Document Image Writer. It works like a printer: you open the document in its native application (PDF, JPG, etc.) and print the file to the "printer". It works quite well. In fact, it is almost worth paying for.

Unless the file is locked against printing, it should overcome most locks they place on the file. Once the document is in the Image Writer file format, you can cut and paste and search on fragments of words.

--
Keith Nuttle
3110 Marquette Court
Indianapolis, IN 46268
317-802-0699
Keith nuttle <keith_nuttle@sbcglobal.net> wrote in news:IrYbj.999$se5.661@nlpi069.nbdc.sbc.com:
> Do you have MS Office, or a computer that has an introductory
> version of MS Office? On my HP there is a demo version that
> includes Microsoft Office Document Image Writer. It works like
> a printer: you open the document in its native application (PDF,
> JPG, etc.) and print the file to the "printer". It works quite
> well. In fact, it is almost worth paying for.
>
> Unless the file is locked against printing, it should overcome
> most locks they place on the file.

No, it won't.
You'll only be creating a completely unsearchable document by creating a PDF of a PDF. The Print to PDF option basically creates an image of the document and inserts it into a PDF page. Scanning the new PDF will only reveal the image, not the components of the image.

--
}:-) Christopher Jahn {:-(
http://manormaniac.blogspot.com/
Delicious and nutritious, tastes like chicken!
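
For what it's worth, the "fuzziness factor" search asked about in the original question can be sketched in a few lines of Python. This is only a rough illustration, not an existing PDF search tool: it assumes the OCR text layer has already been copied or exported to plain text (which a locked PDF may not allow), and the function name `fuzzy_find` and the 0.75 default threshold are made up for this example. It slides a window of as many words as the query contains across the text and keeps the windows whose similarity to the query, as scored by the standard library's `difflib.SequenceMatcher`, reaches the threshold.

```python
from difflib import SequenceMatcher


def fuzzy_find(text, query, threshold=0.75):
    """Return (similarity, phrase) pairs for word windows that resemble query.

    threshold is the "fuzziness factor": 1.0 accepts exact matches only,
    lower values tolerate more OCR damage (and more false hits).
    The name and default value are hypothetical, chosen for this sketch.
    """
    words = text.split()
    width = len(query.split())  # compare windows with as many words as the query
    target = query.lower()
    hits = []
    for i in range(len(words) - width + 1):
        phrase = " ".join(words[i:i + width])
        similarity = SequenceMatcher(None, target, phrase.lower()).ratio()
        if similarity >= threshold:
            hits.append((round(similarity, 2), phrase))
    return hits


if __name__ == "__main__":
    # Sample text made up for the demo, using the garbled names from the question.
    sample = ("born to Danjek Cooqer in 1820; later Danjel Coopes "
              "moved west with David Carter")
    for similarity, phrase in fuzzy_find(sample, "Daniel Cooper"):
        print(similarity, phrase)
```

Because the text is split on whitespace, a dropped letter that OCR turned into a space (as in "Daniel Coo er") is matched only partially, as the two-word window "Daniel Coo". Lowering the threshold catches more damaged spellings at the cost of more false hits, so it is worth tuning against a page you have already checked by eye.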