@MistiHamon: care to share a set of those scans you have difficulty with? I'd use them to see if I can improve the results; at the very least they would make great test material for future development, since they are already known to be hard to get good results from. I'd like to try my hand at a few of 'em, anyway. :-) (The first one you posted earlier is waiting for that on my todo stack; I want to get my own experimental tesseract going with some new code first, so I can compare the vanilla release (UB Mannheim) with my own current state of affairs.)
Might be handy to drop a set of them in a Google Drive or Dropbox share; an alternative is dropping them in a github repo and designating it a small test corpus. That way anyone who'd like to try them can get them easily, and it won't burden the others on this mailing list.

Best regards,

Ger Hobbelt

--------------------------------------------------
web: http://www.hobbelt.com/ http://www.hebbut.net/
mail: g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

On Fri, Jun 7, 2024 at 7:45 PM Misti Hamon <mistiha...@gmail.com> wrote:

With novels and non-fiction prose (memoirs, basic history, or whatever) I'm getting good runs; they also happen to use fonts that were already trained, or are close to ones that were. Manuals and textbooks are causing all sorts of problems: most of the ones I'm trying to work with include pictures, diagrams and other elements to further illustrate or just make things "pretty", and occasionally use non-standard fonts. Tuning/retraining isn't possible: there is not enough data to work with, and I can't generate more because I don't know the fonts used. I also have the complicating factor of some uneven lighting that I can't figure out how to fix (an overall darken still leads to the areas that were overexposed getting skipped completely, even when running a thresholding algorithm before feeding to tesseract).

On Tue, Jun 4, 2024, 17:21 Jun Repasa <jun.rep...@gmail.com> wrote:

If tesseract can no longer recognize specific characters, then it's time to add custom OCR models. I haven't done this myself though, as most documents we scan are pretty normal.

On Tuesday 4 June 2024 at 11:06:51 UTC+12 ger.h...@gmail.com wrote:

- "These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data."

Well, to put it bluntly, diving into the rabbit hole without a helmet or a 'chute: as far as I have been able to discover, the current "official" tesseract training data "databases" (neural net matrices) that are used to recognize anything we throw at tesseract were produced ("trained") at google by Ray Smith, using copious hardware from google, I expect -- training neural nets is no joy on the average Joe's hardware budget, after all. When you dig through the git commits, such as https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the last training file *content* update was back in '17 by @theraysmith, and he hasn't been around long after since: https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31 -- without any hard data, my initial guess is a change of corporate google mind re tesseract.

Stefan Weil et al have done a ton of important work since, but when you ask "what can this baby recognize?" that translates 1:1 to "what has tesseract been trained to recognize?" and there... things get a little vague for me.
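One thing you *can* pin down today, though: unpack a traineddata file with the stock combine_tessdata tool and look at the LSTM unicharset, which lists every glyph the net is able to emit at all. A rough sketch in Python driving the CLI; the tessdata path is an assumption, adjust to your install. Note that being listed only means a glyph CAN come out, not that it is recognized reliably:

import pathlib
import subprocess

# Unpack eng.traineddata into its components (eng.lstm-unicharset et al.).
TESSDATA = pathlib.Path("/usr/share/tesseract-ocr/5/tessdata")  # assumption
subprocess.run(["combine_tessdata", "-u",
                str(TESSDATA / "eng.traineddata"), "eng."], check=True)

# Crude membership test: does the glyph appear anywhere in the unicharset?
unicharset = pathlib.Path("eng.lstm-unicharset").read_text(encoding="utf-8")
for glyph in ["⅛", "⅔", "™", "©"]:
    print(glyph, "listed" if glyph in unicharset else "NOT listed")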
I'd love to be corrected on this, slapped on the wrist or worse, but from what I've gleaned so far during my research:

- Though there's https://github.com/tesseract-ocr/langdata , https://github.com/tesseract-ocr/tesstrain , https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray Smith's public notes and papers about what was done for tesseract v4/v5 at https://github.com/tesseract-ocr/docs (which is separate from https://github.com/tesseract-ocr/tessdoc, which is user-oriented rather than architectural background), I am not confident that the actual list of training files used to produce those master traineddata LSTM files (= tesseract v4/v5 OCR engine) is checked into git. I have seen a list of font names used someplace in there (or was it the mailing list?), but for anyone who works with fonts that already is a handwavey kind of thing, and, yes, copyrights, yadayada, will forever prevent anything more precise from being made available, because the list most certainly included commercial fonts. Then there are also the training input files defining the "text lines" to be rendered as training material: those actually determine which glyphs in the fonts will be trained at all (and in what combinations). There I am not feeling confident either, as it looks like the published files are the ones from the older v3 engine: still relevant, but *probably* not what Ray was using to produce all those traineddata files at the google shop.

Having dug through the git histories and inspected the various files, scripts and notes about 2 years ago, I cannot say with complete confidence whether your (C), TM and 1/2, 3/4, etc. fraction glyphs made it into the training set for English back then. My *guess* is that they were included, if only as a few samples, so the neural net will have *some* recollection of them; but I also expect them to have "featured little" in the total training process, so recognition chances are reduced.
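Before sinking time into anything heavier, a two-minute empirical test is possible: render the suspect glyphs yourself at a comfortable size and see what the stock eng model makes of them. A rough sketch, assuming Pillow and pytesseract are installed and a DejaVu font can be found on your system; the 32 px size is chosen to land near the 30-33 px sweet spot mentioned further down:

from PIL import Image, ImageDraw, ImageFont
import pytesseract

# Render a line containing the problem glyphs on a clean white strip.
sample = "Use ⅛ cup butter and ⅔ cup sugar. © 2024 Example Corp™"
font = ImageFont.truetype("DejaVuSans.ttf", 32)  # assumes the font is findable
img = Image.new("L", (1400, 64), color=255)
ImageDraw.Draw(img).text((10, 12), sample, font=font, fill=0)

# Whatever comes back tells you how much recollection the net really has.
print(pytesseract.image_to_string(img, lang="eng"))

If the fractions come out as garbage even on this ideal input, no amount of scan cleanup will conjure them up, and the training options below become relevant.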
(Aside: as we are focusing on the English training set here, I didn't mention the metric ton of work done by @Shreeshrii for Asian scripts, particularly Devanagari and related, a few years later. As far as I can tell, most of the traineddata scripts and process of today are due to that work and Stefan Weil's. Weil, if you look over there, has done a lot of work around OCR-ing (pre-war) German newspapers and similar publications, from the era when the Germans had a fondness for printing everything in (to my eyes) quite hard-to-read blackletter fonts. To make that feat happen, he and the university team (several German universities together, if I read correctly what was done back then) created a German-specific training set for newspaper blackletter print and published the resulting tesseract traineddata OCR databases for public use (language code "frk", Fraktur). I don't recall seeing a publication where he lists the number of CPU hours used to produce that trained set (one (1) language and a few fonts, vs. the 400+ fonts allegedly used in the google production run), but you can bet your bottom dollar it wasn't cheap! Or quick!)

When we pop out of the rabbit hole of tesseract history, we might better understand why your problem gets answered... haphazardly:

- General advice number 1 out there is to 'tune' a language training file if you have special needs, such as your wish to recognize fractions, etc., which don't feature often in published texts and thus haven't been a real bother so far. This "tuning" advice is basically training advice: do a little extra training. That is, to me, a little hairy, as you are expected not to deteriorate the existing recognition ability while *slightly improving* the recognition confidence (and thus output quality) for a few glyphs ("characters in your fonts") that are already mostly recognized by the neural net, in the sense that it recognizes part or all of the relevant "shapes" that make up the glyphs you wish to see recognized. (This is a very rough translation of what a neural net "learns" vs. how we humans might understand pattern recognition, so tread carefully around this blather of mine if you think you're getting a look under the hood. We're rather *paraphrasing* the engine instead of pointing at its carburetor, spark plugs, etc., if you get my drift.) A command-level sketch of this route follows below, after these three options.

Logically, this approach is met with varying success (and crushed hopes), as it is VERY much dependent on the exact shapes and glyphs (characters) you add. (TM) might be helped by being quite close to a T+M superscript, while the fractions, being a combo of superscript, subscript and a / slash, might be doable or hard for the LSTM+CTC engine; I cannot tell without having tried. And training takes time, both in setting it up and in CPU cycles, so it's not a 5-minute thing to do. Which explains another type of silence around here.

- If that didn't work, you will read several folks advising to "lop off the top layer" and retrain the whole language. What this means is that, basically, the attempt is to wipe just one of the many layers of the LSTM+CTC neural net, the one where it is expected to 'conclude' things like "ah... that there and this shapy thingamajig here, all that jazz is very probably an 'a'...", and to hope that that lopping-off-and-retraining suffices to get acceptable training results after running the training for a while (and checking you're doing all right and not overtraining other bits and pieces of the engine's alphabet/text output!). This takes rather more time than "tuning", as you must now retrain at least an entire layer, while tuning was only intended to have the training activity tweak a few cell connections in there a little to get what you wanted.

- General advice number 3 is to do what the Germans did and train a dedicated "language", which means you'll need to do all the work of creating font(s) and text-line training files which include (hopefully) every word and symbol you may ever encounter later on, and then cook one CPU or more for a considerable time. I consider that effort approaching herculean, particularly when you're alone. Some have tried, and a few even succeeded, it seems, judging from the noises I recall from the last couple of years of lurking on this mailing list.

By now you'll surely have gotten the gist of it: from the distance of a mailing-list POV, it's all a guess, and there are so many little details involved in arriving at success that almost nobody dares venture saying much, at least not all at once. Because this stuff is *hard* to get right, and the above can scare some folks off.
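For the record, the skeleton of the tuning route looks roughly like this -- a sketch in Python driving the stock CLI tools, with the flags taken from the tessdoc training pages. Every path and the list file are placeholders, and you must first produce .lstmf line samples (e.g. via text2image or the tesstrain makefiles) that contain your fractions, (TM), etc. Treat it as a sketch, not a recipe:

import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Extract the LSTM model from the stock traineddata.
run("combine_tessdata", "-e", "eng.traineddata", "eng.lstm")

# 2. Continue training from it for a modest number of iterations.
run("lstmtraining",
    "--continue_from", "eng.lstm",
    "--traineddata", "eng.traineddata",
    "--train_listfile", "eng.training_files.txt",  # your .lstmf list (placeholder)
    "--model_output", "tuned/eng_tuned",
    "--max_iterations", "400")

# 3. Freeze the best checkpoint into a usable traineddata file.
run("lstmtraining", "--stop_training",
    "--continue_from", "tuned/eng_tuned_checkpoint",
    "--traineddata", "eng.traineddata",
    "--model_output", "tuned/eng_tuned.traineddata")

Caveat: if a glyph is not in the existing unicharset at all, plain continuation will never produce it; the tessdoc pages describe an extra step for that case (an extended unicharset plus the --old_traineddata flag), and that is exactly where it gets hairy.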
Me personally, I tried my hand at "tuning" a little, about a year ago, and it didn't fare well, because I found out I still didn't understand all the processes involved well enough to make decisions that would differ from joining a crap shoot blindfolded. But that is me, and I am not into the adrenaline rush of bungee jumping either, so it probably says more about me than about the process of training/tuning tesseract.

Having mentioned the above three options, my personal favorite advice number 4 is: try to come up with a way to keep tesseract as-is and add a review/correction post-process that is acceptable to you. If you can find it in your heart to accept that a little copy-editing after the OCR run is A-okay, you are probably better off, both in time spent and in frustration with machines' ways. After all, the initial setup cost for this option is much lower for single-person shops, I expect. ;-) (The break-even point would be a fairly large number of pages to process...)

- "I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner"

Personal experience to date: image preprocessing is a "field of active research" (i.e. you need to try and test all your own and any others' ideas that sound more or less reasonable) and has a very strong effect on the outcome of the OCR stage. For instance, you may want to rescale your scanned images and see at which text pixel height they do best; previous research says text at 30-33 pixels height is optimal, but yours might differ a little from that, so experiment! (I'll try a tesseract run on an image you posted earlier, tomorrow, at various scaling factors, to see what comes out of that one.)
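To make that experiment concrete, here's a rough sketch assuming opencv-python and pytesseract. It also runs a divide-by-background illumination-flattening pass first, which may (or may not) help with the uneven-lighting problem mentioned upthread; the blur kernel and the scale range are starting points to fiddle with, nothing more:

import cv2
import pytesseract

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Estimate the (uneven) background with a heavy blur, then divide it out,
# so overexposed patches are no longer crushed to blank white.
background = cv2.GaussianBlur(img, (51, 51), 0)
flat = cv2.divide(img, background, scale=255)

# Try several scales; aim for body text landing near ~30 px height.
for scale in (1.0, 1.5, 2.0, 3.0):
    resized = cv2.resize(flat, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)
    text = pytesseract.image_to_string(resized, lang="eng")
    print(f"--- scale {scale} ---")
    print(text[:300])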
Ditto for post-processing: it might be useful, if the content is important enough to you, to dump the output into a word processor / text editor with a spellchecker on board for further assistance. A manual review process of some kind is called for anyway, if you want consistently (very) high quality output.

There are also processors/tools that can do "smart quotes" for you, if you like, but I would reserve that for last; my initial approach would be to have the OCR engine spit out straight quotes wherever they occur and then convert them to "smart" open/close quotes in post, if I wanted them. French quotes would potentially be easier to OCR directly (as they appear at different vertical offsets), but I'd be glad to have *any* kind of quote coming out of the OCR machine: the training sets have been trained on a gazillion fonts, and intricate little typography details like "smart quotes" are rather font-specific, so recognizing them from an OCR engine's perspective screams "tuning! dedicated font training!", and a little headache starts to develop over here. ;-)

- "Slightly related, how, exactly, do y'all deal with drop caps?"

Errrrm, AFAICT... we don't. Apologies. Seriously though, I don't recall any positive success report on that one.

My initial gut response here is to "recognize" the drop caps in the preprocessor, i.e. in the "image segmentation phase", cut them out specifically, rescale them to a sensible "regular text size" and only then feed them into the OCR engine. Afterwards the output has to be recombined with the rest of the image segments' text output. BUT that is mere theory, as tesseract does not yet have a module/subprocess to "identify" possible drop caps and segment and process them as I just described. Which means that today you either do that up front and do the recombining afterwards in your own custom postprocess, or you accept a little extra editorial post work, by either keeping them in as-is (and expecting errors, or at least uncertainties reported by the OCR engine) or maybe tipp-ex-ing ;-) them out in preprocessing and hoping the engine's built-in dictionary resolves half of them through spelling correction. Anyway, this is all currently non-existent, alas, so anything you come up with is better than what exists today.

(I am working on my own copy of tesseract which might improve this a little, but don't expect any miracles there this quarter. I'm /slow/.)

The 'tesseract does best with 30-33 pixel high text' material is at:
- https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ

I wrote https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ a while ago; maybe the diagram and some of the paragraphs in there aid in understanding what's going on under the hood -- info I think you need, like I did/do.

Take care,

Ger

P.S.: your earlier scan was lying around here for a gander, but my tesseract build is buggered ATM. Anyway, I installed an "official distro" build yesterday for other purposes, and I'll see how your previously posted scans fare with that one when I test a few things on them. To be reported later this week, possibly tomorrow afternoon.

On Monday, May 20, 2024 at 5:02:24 AM UTC+2 misti...@gmail.com wrote:

I've asked a couple of different times, and each time I get just a little bit more information, but still not enough to work with.

I've got a mostly English language set of scans (image quality is good but not great, but the best I can do without a better scanner; I'm working on that so I can re-scan, but some problems still wouldn't be fixed). These scans include characters that are not in the Latin-1 block, which I read somewhere (and now can't find) is the limit for the English data. Example characters not being recognized include fractions ( ⅛ ⅔ instead of 1/8 or 2/3), the TM ( ™ ) and C ( © ) symbols (the latter is actually in Latin-1, but isn't directly typeable, and, from what I've been able to tell, the circled part comes out so faint on the input image that tesseract thinks it is noise), and "smart" or curly quotes -- all characters that require using Alt codes, insert-special-character dialogs, or letting your word processor / DTP software handle the conversion for you. Which seems to mean they require some level of manual review and correction to get them into the text output. BUT, once you see you need to input a character manually, how do you handle the positioning data (when working in hocr format)? I considered, briefly, using character whitelisting to help with these, but that would imply the characters are already included in the character set / wordlist, which, if memory serves, many of these aren't?

Slightly related, how, exactly, do y'all deal with drop caps?
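Re the hOCR positioning question in the message just above: one low-tech option is to let tesseract emit hOCR as usual, keep its bounding boxes, and patch only the text content of the words you hand-correct; the coordinates live in the title attribute of each ocrx_word span and survive untouched. A rough sketch, assuming beautifulsoup4; the "1/8" -> "⅛" substitution is purely illustrative:

from bs4 import BeautifulSoup

# Load the hOCR file tesseract produced (e.g. via: tesseract scan.png page hocr).
with open("page.hocr", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Each recognized word is a <span class="ocrx_word" title="bbox x0 y0 x1 y1; ...">.
for word in soup.find_all("span", class_="ocrx_word"):
    if word.get_text(strip=True) == "1/8":
        word.string = "⅛"  # text replaced; bbox in the title attribute is kept

with open("page.fixed.hocr", "w", encoding="utf-8") as f:
    f.write(str(soup))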