DTP


 
Lively discussions on the graphic arts and publishing — in print or on the web


Old 04-03-2010, 09:11 AM   #1
shamrock838
Member
 
Join Date: Feb 2009
Location: East Norriton, PA, USA
Posts: 14
Just Starting Out in OCR

I’m looking for a reliable Optical Character Recognition (OCR) program in the under-$200 range. My aim is to convert English-language printed text from various scanned sources into clear, accurate, editable MS Word and Adobe PDF files. Nothing fancy.

I have two computer setups at home:
1 – Primary: Dell Vostro 420 desktop (Vista Business, 32-bit) that works with an HP C-4480 All-In-One color printer/scanner/copier.
2 – Backup: Dell Latitude D820 laptop (XP Professional) that works with an Epson Perfection 4490 flatbed scanner.

Can anyone offer suggestions, tips, do’s and don’ts, links, or leads to published material for OCR newbies? Lastly, will the aforementioned scanners work with any OCR software?

P.S. – how does the I.R.I.S. Readiris 13 Pro OCR program stack up in this regard?

Many thanks.
Old 04-03-2010, 09:57 AM   #2
Michael Rowley
Member
 
Join Date: Jan 2005
Location: Ipswich (the one in England)
Posts: 5,105

Quote:
how does the I.R.I.S. Readiris 13 Pro OCR program stack up in this regard?
Very well, but ABBYY's FineReader and Nuance's OmniPage are about equally good and also meet your specification. The OCR performance of most modern OCR programs is excellent; they differ (possibly) in the number of scripts and languages they can deal with. English and Word are easy.

   
__________________
Michael
Old 04-03-2010, 01:11 PM   #3
Howard Allen
Member
 
Join Date: Oct 2007
Location: Calgary, Alberta, Canada
Posts: 824

I have experience with the ABBYY FineReader engine (an early version, on the Mac platform), and found it to be very accurate. Adobe Acrobat also has a built-in OCR function, though it would probably bust your "under $200" price range unless you need Acrobat for other purposes.

Note that OCR software can be very good, but it's not perfect. Things to watch out for are italic text, fine print and ligatures, which appear commonly in older and British text (like the ae in archaeology, or the oe in foetus). Also "dirty" scans with lots of dust specks, creased pages, penned-in notes, underlining, etc. are likely to give trouble.
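
As a rough illustration of the ligature point, here is a minimal Python sketch for cleaning up OCR output that still contains ligature characters. It is not part of any OCR package: the function name `normalize_ligatures` is made up for this example, and it assumes you want plain "ae"/"oe" spellings so that spell-checkers and searches treat the words normally.

```python
import unicodedata

# Explicit replacements for letters that Unicode normalization leaves alone.
_LIGATURES = str.maketrans({"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"})

def normalize_ligatures(text):
    """Return text with common ligatures expanded to plain letters."""
    # NFKC expands presentation ligatures such as U+FB01 (fi) and U+FB02 (fl).
    # Note that it also folds other compatibility characters (e.g. full-width forms).
    expanded = unicodedata.normalize("NFKC", text)
    return expanded.translate(_LIGATURES)

print(normalize_ligatures("archæology, fœtus, ﬁne print"))
# prints: archaeology, foetus, fine print
```

This only helps when the OCR engine recognized the ligature as a ligature; if it misread the glyph as something else entirely, proofreading is still the only fix.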

PLEASE make sure you proofread your OCR'd text (at the very least, run it through a spell-checker)! I'm very wary of "digitized" PDFs of old printed material that claim to be "searchable" (these are becoming very common in online versions of archived scientific journals, for example). These PDFs typically consist of a rasterized image (a scanned "picture" of the page) with an invisible layer of OCR'd text underneath, usually riddled with OCR errors, meaning that keyword searches are likely to fail. Acrobat's OCR function produces these sorts of files (as can other OCR software), so be careful.
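
To show why keyword searches fail on an uncorrected text layer, and one way to soften the problem, here is a small Python sketch using the standard `difflib` module. The function name `fuzzy_find`, the sample text, and the 0.75 similarity cutoff are all assumptions made for illustration; this is not how Acrobat or any PDF viewer searches.

```python
import difflib
import re

def fuzzy_find(keyword, text, cutoff=0.75):
    """Return words in text that closely resemble keyword (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    return difflib.get_close_matches(keyword.lower(), words, n=5, cutoff=cutoff)

# A typical OCR slip: "m" misread as "rn".
ocr_text = "the rnodern period"

print("modern" in ocr_text)            # exact search misses the word
print(fuzzy_find("modern", ocr_text))  # fuzzy search can still surface it
```

A looser cutoff finds more damaged words but also more false hits, so this is a stopgap for searching, not a substitute for correcting the text.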

   
__________________
Howard

OSX 10.10.5
Old 04-03-2010, 04:47 PM   #4
Michael Beloved
Member
 
Join Date: Sep 2008
Location: Brooklyn NY
Posts: 141

I am in complete agreement with Howard.
I used ABBYY FineReader. Previously I used Acrobat, but it did not do a good job and I had to correct many of its goofs.

With ABBYY FineReader there will be a few goofs, but overall I was satisfied with its performance.

There is one thing that I might contribute. If you find many section breaks, column breaks, or odd margins inserted into the ABBYY output, remove them starting from the last page and working backward toward the first.
This is important because if you remove them from the first page going to the last page, the system may adopt or carry forward the formatting of the preceding section or preceding paragraph.
I learnt this the hard way and finally figured out that I should clean up margins, section breaks, and column breaks by working backwards only.

One other thing I discovered was that I should select the entire document, click to open the columns dialog box in Word and set it to one column and also set the margin size to some generic margin style like letter size.

Sometimes hidden objects get embedded in a document when it passes through OCR software. You have to find those and delete them; if you do not, you will find holes or blank areas in the text where you cannot type anything.
To remove them, click the Select button in the Home tab of Word, then click on Objects, and use that to find the hidden objects. They will appear as four or more blue dots. Then they can be deleted.

With Acrobat I found that its OCR feature creates many such hidden objects, especially if your input is typed pages or pages that have any sort of watermarks or stains.

Once you purchase and use the software, if you have difficulty, it is a good idea to mention that in case anyone has experience with that specific issue.
Old 04-04-2010, 09:51 AM   #5
Michael Rowley
Member
 
Join Date: Jan 2005
Location: Ipswich (the one in England)
Posts: 5,105

Howard:
Quote:
Acrobat's OCR function produces these sorts of files (as can other OCR software), so be careful
Any text scanned to Acrobat with the OCR switched on will be liable to contain errors, because (a) Acrobat can't do a spelling check on text, and (b) even misspelled words that are discovered by eye cannot be corrected by Acrobat: it is far better to scan to an application that finds and corrects text and then to convert the document to PDF.

   
__________________
Michael
Old 04-05-2010, 09:28 AM   #6
Howard Allen
Member
 
Join Date: Oct 2007
Location: Calgary, Alberta, Canada
Posts: 824

Agreed. Acrobat has a "Find OCR suspects" routine, but that only gives you a chance to correct what Acrobat itself considers "suspects". There's no way of fixing errors that Acrobat has missed. Therefore, Acrobat's OCR option should only be considered for "quick-and-dirty" jobs. If you want a "good" OCR job, you should, as Michael says, OCR to a text file, edit the text, lay it out, proofread it, then make a PDF.
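
One way to catch errors that the OCR engine did not flag itself is to run an independent check on the exported text file before laying it out. The Python sketch below is only an illustration under stated assumptions: `flag_suspects` is a hypothetical helper, and the tiny `known_words` set stands in for a real dictionary or spell-checker.

```python
import re

def flag_suspects(text, known_words):
    """Return (offset, word) pairs for words not found in known_words."""
    suspects = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group()
        # Compare case-insensitively and ignore surrounding apostrophes.
        if word.lower().strip("'") not in known_words:
            suspects.append((match.start(), word))
    return suspects

# Stand-in word list (assumption); a real check needs a full dictionary.
known_words = {"the", "modern", "journal"}

for offset, word in flag_suspects("The rnodern joumal", known_words):
    print(offset, word)
```

A check like this only finds non-words. An OCR error that produces a different valid word will pass, which is why proofreading by eye is still needed.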

   
__________________
Howard

OSX 10.10.5
Old 06-15-2010, 01:20 PM   #7
groucho
Staff
 
Join Date: Oct 2004
Posts: 490

A lot depends on how much material you need to scan. We got a Fujitsu ScanSnap because it is fast and robust and, while nearly $400, it scans both sides at once. It included the Abbyy reader and the full Acrobat. By default it will feed files into Abbyy, but it won't feed them into Acrobat, which is a nuisance; you can, however, save to PDF and then tell Acro to batch-convert the PDF files.

And that's what we do, because Acrobat does a much much better job of OCR than the Abbyy software does. Even from relatively good typed source material, Acrobat is simply head and shoulders above Abbyy at the job.
Old 06-16-2010, 07:20 AM   #8
Michael Rowley
Member
 
Join Date: Jan 2005
Location: Ipswich (the one in England)
Posts: 5,105

'Groucho':
Quote:
Included the Abbyy reader
Which? The current version is FineReader 10, but ScanSnap is unlikely to provide that free.

   
__________________
Michael
Old 06-16-2010, 08:05 AM   #9
groucho
Staff
 
Join Date: Oct 2004
Posts: 490

It doesn't seem to say, Michael. The scanner is last year's model, so one would expect a fairly current product (on par with the Acro9 that it shipped with), but the software says "ABBYY FineReader for ScanSnap (TM) 4.0", indicating it is a custom version that may have nothing to do with the version numbers on the retail products. Further, internally it says copyright 2008 and build 8.x, so perhaps it means Version 8.x, custom bundle 4.x?

It is most peculiar that Acro can't recognize the scanner as a native input source, and the scanner can't default to using Acro, when they ship together. I'm guessing that means Adobe only wants to deal with TWAIN devices and Fujitsu doesn't.
Old 06-16-2010, 09:10 AM   #10
Steve Rindsberg
Staff
 
Join Date: Nov 2004
Posts: 6,709

I keep looking at the Fujitsu scanners with lust in my heart (but moths in my wallet).
They get good reviews but ISTR one of the things people complain about is the lack of a TWAIN driver.

Ah. Googling the two coughs up this:

http://scansnapcommunity.com/tips-tr...twain-drivers/

A bit ingenuous ... why not provide a TWAIN driver so that apps that need it can use the scanner, and also whatever they need for their own software? But at least this gives the Fujitsu logic behind it.

   
__________________
Steve Rindsberg
====================
www.pptfaq.com
www.pptools.com
and stuff
All times are GMT -8.


Contents copyright 2004–2014 Desktop Publishing Forum and its members.