If tesseract can no longer recognize specific characters, then it's time to add custom OCR models. I haven't done this myself, though, as most documents we scan are pretty normal.

On Tuesday 4 June 2024 at 11:06:51 UTC+12 ger.h...@gmail.com wrote:
> - "These scans include characters that are not in the Latin-1 block, > which I read somewhere and now can't find is the limit for the English > data." > > Well, to put it bluntly, diving into the rabbit hole without a helmet nor > a 'chute: as far as I have been able to discover, the current "official" > tesseract training data "databases" (neural net matrices) that are used to > recognize anything we throw at tesseract have been produced ("trained") at > google by Ray Smith, using copious hardware from google I expect -- > training neural nets is no joy at the average Joe's hardware budget, after > all. When you dig through the git commits, such as > https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the > last training file *content* update was back in '17 by @theraysmith and he > hasn't been around long after since: > https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31 > -- without any hard data, my initial guess is a change of corporate google > mind re tesseract. > > Stefan Weil et al have done a lot a ton of important work since, but when > you ask "what can this baby recognize?" that translates 1:1 to "what has > tesseract been trained to recognize?" and there... things get a little > vague for me. I'd love to be corrected on this, slapped on the wrist or > worse, but from what I've gleaned so far during my research: > > - though there's https://github.com/tesseract-ocr/langdata , > https://github.com/tesseract-ocr/tesstrain , > https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray > Smith's public notes and papers about what was done for tesseract v4/v5 at > https://github.com/tesseract-ocr/docs (which is separate from > https://github.com/tesseract-ocr/tessdoc, which is more user oriented > instead of architectural background), I am not confident that the actual > list of training files used to produce those master traineddata LSTM files > (= tesseract v4/v5 OCR engine) are checked into git: I have seen a list of > font names used some place in there (or was it the mailing list?), but for > anyone who works with fonts that already is a handwavey kinda thing and, > yes, copyrights, yadayada, will forever prevent something more precise to > be available because the list most certainly included commercial fonts. > Then there's also the training input files defining the "text lines" to be > rendered as training material: those actually determine which glyphs in the > fonts will be trained at all (and in what combinations). And there I am not > feeling confident either, as it looks like those files published are the > ones from the older v3 engine, still relevant, but *probably* not what Ray > was using to produce those many traineddata files he did at the google shop. > Having dug through the git histories, inspected the various files, scripts > and notes about 2 years ago, I cannot say with complete confidence whether > your (C), TM and 1/2, 3/4, etc. fraction glyphs have made it into the > training set for English back then. My *guess* is that they have been > included, if only a few samples, so the neural net will have *some* > recollection of them, if my guess is correct, but I also expect them to > have "featured little" in the total training process so recognition chances > are reduced. > > (Aside: As we focus on the English language training set here, I didn't > mention the metric ton of work done by @Shreeshrii for Asian scripts, > particularly Devanagari and related, a few years later. 
As far as I can tell, most of the `traineddata` scripts and process today are due to his work and Stefan Weil's, who, if you look over there, you'll note has done a lot of work around OCR-ing (pre-war) German newspapers and similar publications, from the era when the Germans had a fondness for printing everything in (to my eyes) quite hard to read blackletter fonts. To make that feat happen, he and the university team (several German universities together, if I read what was done right back then) created a German-specific training set for newspaper blackletter print and published the resulting tesseract traineddata OCR databases for public use (language: "frk" = Fraktur). I don't recall seeing a publication where he lists the number of CPU hours used to produce that trained set (one (1) language, a few fonts vs. the 400+ allegedly used in the Google production run), but you can bet your bottom it wasn't cheap! Or quick!)

When we pop out of the rabbit hole of tesseract history, we might now better understand why your problem is answered... haphazardly:

- General advice number 1 out there is to "tune" a language training file if you have special needs, such as your wish to recognize fractions, etc., which don't feature often in published texts and thus haven't been a real bother so far. This "tuning" advice is basically training advice: do a little extra training, which is, to me, a little hairy, as you are expected to not deteriorate the existing recognition ability while *slightly improving* the recognition confidence (and thus output quality) for a few glyphs ("characters in your fonts") that are already mostly recognized by the neural net, in the sense that it recognizes part or all of the relevant "shapes" that make up the glyphs you wish to see recognized. (This is a very rough translation of what a neural net "learns" vs. how we humans might understand pattern recognition, so tread carefully around this blather of mine if you think you're getting a look under the hood. We're rather *paraphrasing* the engine instead of pointing at its carburetor, spark plugs, etc., if you get my drift.)

Logically, this approach is met with varying success (and crushed hopes), as it is VERY much dependent on the exact shapes and glyphs (characters) you add. (TM) might be helped by being quite close to a T+M superscript, while the fractions, being a combination of superscript, subscript and a / slash, might be doable or hard for the LSTM+CTC engine -- I cannot tell without having tried. And training takes time, both in setting it up and in CPU cycles, so it's not a 5 minute thing to do. Which explains another type of silence around here.
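For completeness: the usual shape of that "little extra training", as described in the official training documentation, is roughly to extract the LSTM network from an existing tessdata_best traineddata, continue training it on your own line images for a limited number of iterations, then repack it. A minimal sketch, assuming the training tools (combine_tessdata, lstmtraining) are installed and that eng.traineddata plus a list of prepared .lstmf line files already exist -- all paths below are placeholders, not prescriptions:

```python
# Rough sketch of the fine-tuning ("tuning") flow; every file name here is
# a placeholder and the .lstmf training files are assumed to exist already.
import subprocess

# 1. Pull the LSTM network out of the existing traineddata.
subprocess.run(["combine_tessdata", "-e", "eng.traineddata", "eng.lstm"],
               check=True)

# 2. Continue training from that network for a modest number of iterations,
#    so the existing recognition ability is nudged rather than wiped.
subprocess.run([
    "lstmtraining",
    "--continue_from", "eng.lstm",
    "--traineddata", "eng.traineddata",
    "--train_listfile", "eng.training_files.txt",  # list of .lstmf files
    "--model_output", "tuned/eng_tuned",
    "--max_iterations", "400",
], check=True)

# 3. Convert the best checkpoint back into a loadable traineddata file.
subprocess.run([
    "lstmtraining",
    "--stop_training",
    "--continue_from", "tuned/eng_tuned_checkpoint",
    "--traineddata", "eng.traineddata",
    "--model_output", "eng_tuned.traineddata",
], check=True)
```

The hard part is not these three commands but preparing training lines that contain your fractions and symbols, and checking afterwards that the rest of the alphabet hasn't degraded.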
- If that didn't work, you will read several folks advising to "lop off the top layer" and retrain the whole language. What this says is that, basically, the attempt is to wipe just one of the many layers of the LSTM+CTC neural net -- the one where it is expected to "conclude" things like "ah... that there and this shapy thingamajig here, all that jazz is very probably an 'a'..." -- and hope that that lopping-off-and-retraining suffices to get acceptable training results after running the training for a while (and checking you're doing all right and not overtraining other bits and pieces of the engine's alphabet/text output!). This takes rather more time than "tuning", as you must now retrain at least an entire layer, while tuning was only intended to have the training activity tweak a few cell connections in there a little to get what you wanted.

- General advice number 3 is to do what the Germans did and train a dedicated "language", which means you'll need to do all the work of creating font(s) and text line training files which include (hopefully) every word and symbol you may ever encounter later on, and then cook one CPU or more for some considerable time. I consider that effort approaching herculean, particularly when you're alone. Some have tried, and a few even succeeded, it seems, from the noises I recall over the last couple of years lurking on this mailing list.

By now you'll surely have gotten the gist of it: from the distance of a mailing list POV, it's all a guess, and there are so many little details involved in arriving at success that almost nobody dares venture saying much, at least not all at once. Because this stuff is *hard* to get right, and the above can scare some folks off.

Me personally, I tried my hand at "tuning" a little about a year ago and it didn't fare well, because I found out I still didn't understand all the processes involved well enough to make decisions that would differ from joining a crap shoot blindfolded. But that is me, and I am not into the adrenalin rush of bungee jumping either, so it probably says more about me than about the process of training/tuning tesseract.

Having mentioned the above three options, my personal favorite advice number 4 is: try to come up with a way which keeps tesseract as-is, adding a review/correction post-process that is acceptable to you. If you can find it in your heart to accept that a little copy-editing after the OCR run is A-okay, you are probably better off, both in time spent and in frustration with machines' ways. After all, the initial setup cost for this option is much lower for single-person shops, I expect. ;-) (The break-even would be a fairly large number of pages to process...)

- "I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner"

Personal experience to date is that image preprocessing is a "field of active research" (i.e. you need to try and test all your own and any others' ideas that sound more or less reasonable) and has a very strong effect on the outcome of the OCR stage. For instance, you may want to rescale your scanned images and see at which text pixel height they do well/best; previous research says text at 30-33 pixels height is optimal, but yours might differ a little from that, so experiment! (I'll try to do a tesseract run on an image you posted earlier, later tomorrow, at various resize sizes to see what comes out of that one.)
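One way to run that experiment semi-automatically is to resize the same page at a handful of factors and compare tesseract's own per-word confidences. A minimal sketch, assuming Pillow and pytesseract are installed and "scan.png" stands in for one of your pages (both names are just placeholders):

```python
# Scale sweep: OCR the same page at several sizes and report a crude
# "how happy is the engine" score per size.
from PIL import Image
import pytesseract

page = Image.open("scan.png")

for factor in (0.75, 1.0, 1.5, 2.0, 3.0):
    scaled = page.resize(
        (int(page.width * factor), int(page.height * factor)),
        Image.LANCZOS,
    )
    data = pytesseract.image_to_data(
        scaled, lang="eng", output_type=pytesseract.Output.DICT
    )
    # Average word confidence; entries of -1 are non-word boxes.
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    print(f"scale {factor}: mean word confidence {mean_conf:.1f}")
```

Mean word confidence is only a rough proxy, of course; eyeballing the actual text output at the two or three best scales is still needed.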
Ditto for post-processing: it might be useful, if the content is important enough to you, to dump it into a word processor / text editor with a spellchecker on board for further assistance. A manual review process of some kind is called for anyway, if you want consistent (very) high quality output.

There are also processors/tools that can do "smart quotes" if you like, but I would reserve that for last; my initial approach there would be to have the OCR engine spit out quotes wherever they occur and then convert them to "smart" open/close quotes in post, if I wanted. French quotes would potentially be easier to OCR that way (as they appear at different vertical offsets), but I'd be glad to have *any* kind of quote coming out of the OCR machine: the training sets have been trained on a gazillion fonts, and intricate little typography details like "smart quotes" are rather font specific, so recognizing them from an OCR engine's perspective screams "tuning! dedicated font training!" and a little headache starts to develop over here. ;-))

- "Slightly related, how, exactly, do y'all deal with drop caps?"

Errrrm, AFAICT... we don't. Apologies. Seriously though, I don't recall any positive success info on that one.

Here my initial gut response is to "recognize" the drop caps in a preprocessor, i.e. in the "image segmentation phase", and cut them out specifically so they can be extracted, rescaled to a sensible "regular text size" and only then fed into the OCR engine. Afterwards the output has to be recombined with the rest of the image segments' text output. BUT that is mere theory, as tesseract does not yet have a module/subprocess to "identify" possible drop caps and segment and process them as I just described. Which means that today, you either do that up front and do the recombining afterwards in your own custom postprocess, or you decide to accept a little extra editorial post work by either keeping them in as-is (and expecting errors, or at least uncertainties reported by the OCR engine) or maybe tipp-ex-ing ;-) them out in preprocessing and hoping the engine's built-in dictionary resolves half of them through spelling correction. Either way, this is all currently non-existent, alas, so anything you come up with is better than what is there today.

(I am working on my own copy of tesseract which might improve this a little, but don't expect any miracles there this quarter. I'm /slow/.)
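To make that front-end cut-out idea slightly more concrete, here is a rough sketch of the crop / rescale / OCR-separately / recombine flow, again assuming Pillow and pytesseract, and assuming the drop cap's bounding box has already been found somehow -- the coordinates and file name below are made up for illustration, and detecting the box automatically is the genuinely hard part:

```python
# Cut a known drop cap out, OCR it on its own at regular text size,
# blank it on the page, OCR the rest, then glue the pieces back together.
from PIL import Image
import pytesseract

page = Image.open("scan.png")

# Hypothetical bounding box of the drop cap (left, top, right, bottom).
drop_box = (120, 340, 260, 500)

# OCR the drop cap alone, shrunk to a "regular" text height, telling
# tesseract to expect a single character (--psm 10).
cap = page.crop(drop_box)
target_h = 32
cap = cap.resize(
    (max(1, cap.width * target_h // cap.height), target_h), Image.LANCZOS
)
cap_text = pytesseract.image_to_string(
    cap, lang="eng", config="--psm 10"
).strip()

# Blank out the drop cap region so it doesn't confuse the engine,
# then OCR the rest of the page as usual.
page_wo_cap = page.convert("RGB")
page_wo_cap.paste((255, 255, 255), drop_box)
body_text = pytesseract.image_to_string(page_wo_cap, lang="eng")

# Recombine: prepend the recognized capital to the body text.
print(cap_text + body_text.lstrip())
```

One possible heuristic for finding the box is looking for an unusually tall connected component at a paragraph start, but that is firmly in "try it and see" territory.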
The "tesseract does best with 30-33 pixel high text" stuff is at https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ . I wrote https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ a while ago; maybe the diagram and some of the paragraphs in there aid understanding of what's going on under the hood, which is info I think you need, like I did/do.

Take care,

Ger

P.S.: it was lying around for a gander, but my tesseract is buggered ATM. Anyway, I installed an "official distro" one yesterday for other purposes and I'll see how your previously posted scans fare with that one when I test a few things on them. To be reported later this week, possibly tomorrow afternoon.

On Monday, May 20, 2024 at 5:02:24 AM UTC+2 misti...@gmail.com wrote:

I've asked a couple different times, and each time I get just a little bit more information, but still not enough to work with.

I've got a mostly English language set of scans (image quality is good but not great, but the best I can do without a better scanner; I'm working on re-scanning, but there are some problems that still wouldn't be fixed). These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data.

Example characters not being recognized include fractions (⅛ ⅔ instead of 1/8 or 2/3), the TM (™) or C (©) symbols (the latter is actually in Latin-1, but isn't directly typeable and, from what I've been able to tell, the circled part comes out so faint on the input image that tesseract thinks it is noise) and "smart" or curly quotes -- all characters that require using alt+ codes, insert-special-character dialogs, or letting your word processor/DTP handle the conversion for you. Which seems to mean they require some level of manual review and correction to get them into the text output. BUT, once you see you need to input manually, how do you handle the positioning data (when working in hOCR format)? I considered, briefly, using character whitelisting to help with these, but that would imply the characters are already included in the character set/wordlist, which, if memory serves, many of these aren't?

Slightly related, how, exactly, do y'all deal with drop caps?