On 2015-09-27 23:33, Paul Koning wrote:

On Sep 26, 2015, at 5:42 PM, Toby Thain <t...@telegraphics.com.au> wrote:
...
Software which "recreates" the typography of a document from OCR does not 
produce an acceptable substitute; I've yet to see a book that wasn't ruined by it.


True.  But that's not the biggest problem with OCR.  The biggest problem is that even 
professional-grade OCR programs have rather low accuracy.  Maybe they do acceptably well 
on really high-grade scans of very clean new documents, but on books, typewritten 
documents, etc., even after you use the "train" feature you need to spend a 
long time cleaning up.  It may be faster than retyping things, if you're lucky.  Not if 
you're not; two of us recently retyped 300 pages of line printer listing because that was 
faster and more accurate than OCR on that particular printout.

Well, all I can say is that the OCR I did of the book I posted a link to required only very light cleanup, so the accuracy was very good there.

Given that OCR can only do, at best, a just barely acceptable recognition of 
the letters of the alphabet, it follows that accurately recognizing the actual 
font used will be vastly less accurate.  And indeed you can see that clearly.

The program I used back then evidently picked not just the letters correctly, but also, very accurately, which font to use.

I wonder if there are OCR programs that can be told to choose among 2 or 3 fonts, as 
opposed to guessing from the entire inventory of the machine.  If so, and if the fonts are 
sufficiently distinct, then maybe you'd stand a chance.  Especially if it also added 
heuristics like "never change fonts in mid-word" -- an obvious rule but not one 
I have seen implemented.

That would be possible, I guess. But I would so like to remember, or find again, what I used back then. The results it produced were pretty much identical to the original. Manuals, in comparison, would be pretty straightforward. (Fewer fonts, and fewer strange layouts than books, in my eyes. Figures still need to be bitmaps, though.)
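
For what it's worth, the "never change fonts in mid-word" rule quoted above, combined with limiting the choice to 2 or 3 fonts, could be sketched roughly like this in Python. This is only an illustration, not how any real OCR program works: the function name pick_word_fonts, the per-glyph score dictionaries and the font names are all made up for the example, and it assumes the recognizer can already hand back a per-font confidence for each character.

```python
def pick_word_fonts(glyphs, allowed_fonts):
    """Assign exactly one font per word.

    glyphs: list of (character, {font_name: score}) in reading order.
            The scores are assumed to come from some OCR engine
            (hypothetical input format, not a real API).
    allowed_fonts: the short list of fonts the user permits.
    Returns a list of (word, font) pairs.
    """
    words = []
    current = []  # glyphs of the word currently being collected

    def flush():
        # Decide the font for the collected word, then reset.
        if not current:
            return
        totals = {font: 0.0 for font in allowed_fonts}
        for _, scores in current:
            for font in allowed_fonts:
                # Fonts outside the allowed list are simply ignored.
                totals[font] += scores.get(font, 0.0)
        # Highest summed score wins; ties go to the earlier allowed font.
        best = max(allowed_fonts, key=lambda f: totals[f])
        words.append(("".join(ch for ch, _ in current), best))
        current.clear()

    for ch, scores in glyphs:
        if ch.isspace():
            flush()  # whitespace ends the word
        else:
            current.append((ch, scores))
    flush()  # last word, if the text did not end in whitespace
    return words


# Example: the "a" on its own looks more like Times, but the word as a
# whole is closer to Courier, so the entire word gets Courier.
sample = [
    ("c", {"Courier": 0.9, "Times": 0.1}),
    ("a", {"Courier": 0.4, "Times": 0.6, "Helvetica": 5.0}),
    ("t", {"Courier": 0.8, "Times": 0.2}),
    (" ", {}),
    ("o", {"Courier": 0.3, "Times": 0.7}),
    ("k", {"Courier": 0.1, "Times": 0.9}),
]
result = pick_word_fonts(sample, ["Courier", "Times"])
```

Summing the scores over the whole word is just one simple way to do it; a majority vote per character would work too, and a real program would probably also want to weigh neighbouring words.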

        Johnny

--
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: b...@softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol
