> -----Original Message----- > From: cctalk [mailto:cctalk-boun...@classiccmp.org] On Behalf Of Johnny > Billquist > Sent: 27 September 2015 13:18 > To: cctalk@classiccmp.org > Subject: Re: If you OCR, always archive the bitmaps too - Re: Regarding > Manuals > > On 2015-09-27 03:41, Toby Thain wrote: > > On 2015-09-26 5:51 PM, Johnny Billquist wrote: > >> On 2015-09-26 23:42, Toby Thain wrote: > >>> On 2015-09-26 4:28 PM, Johnny Billquist wrote: > >>>> On 2015-09-26 12:16, Johnny Billquist wrote: > >>>>> On 2015-09-25 22:35, Al Kossow wrote: > >>>>>> I have been going back and applying OCR to the ones on bitsavers. > >>>>>> Are there some in particular that you have a problem with? > >>>>> > >>>>> Aha. I wasn't aware of that. I've downloaded copies many years ago > >>>>> that I've been keeping locally. I'll check out the current > >>>>> versions on bitsavers then. > >>>> > >>>> Al, exactly how have they been OCRed? Looking at them, it would > >>>> appear that what you see is still the bitmaps of all the pages, but > >>>> then you have the basic text also available for selection/searching. > >>>> > >>>> My issue with that is that the documents are huge, and the > >>>> experience just scrolling through them is pretty bad. > >>> > >>> Imho, though I am sure I am not alone: > >>> > >>> Software which "recreates" the typography of a document from OCR > >>> does not produce an acceptable substitute, I've yet to see a book > >>> that wasn't ruined by it. > >>> > >>> Just worth mentioning for anyone who might be tempted - For this > >>> reason and others, the bitmaps must NEVER be discarded (Although of > >>> course bitmaps can be archived in a different file if people want to > >>> supply OCR as well.) > >> > >> Look at the results in the link I posted. I was more than happy with > >> that result. > > > > I've seen plenty of technical books ruined by this technique, which is > > why I beg anyone doing this to not divorce the bitmaps from the OCR'd > > result. 
> > > > I suppose some books might be relatively immune, but technical texts > > seem to be quite sensitive to poor interpretation by OCR, logically enough. > > I suppose it is the eternal argument between preservation and use. I use > these documents every day. I don't care about the pixels, but the content. > Museums and the like are obviously more interested in the preservation. > > I get the feeling you didn't actually check the text that I OCRed from a book. > That text is an example what I'm looking for.
I did. I found it hard to read, as it has been OCR'd with mixed typefaces. It is also basically only non-technical English. Try a couple of pages from any of the VM/370R6 manuals. I have tried to OCR them without the bitmaps, with little success. These are especially badly reproduced (in the originals, not in the scanning). I can read the text from the bitmaps and know it's right. One error in the OCR and I can be scratching my head for ages. I also don't have problems reading them on a laptop... > > I will not prevent people who want pixel preservation from continuing to > have that. But for me, it is a problem. The experience in actually using these > documents are pretty poor. And, as have also been noted, information have > been lost in these scans, as they have not preserved color codings. > > Johnny > > -- > Johnny Billquist || "I'm on a bus > || on a psychedelic trip > email: b...@softjar.se || Reading murder books > pdp is alive! || tryin' to stay hip" - B. Idol