Ger, Your problem set/end goal is similar to mine (textbooks/manuals rather than magazines and datasheets, and I only have tiff or jpg images, no partial pdfs, but full text search and copy/paste are things I want, and textbooks/manuals do have the same OCR difficulties as magazines).
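For what it's worth, the "searchable, copy/paste-able output from plain page images" half of that goal can already be prototyped with stock tools; a minimal sketch, assuming pytesseract and a Tesseract binary are installed (the folder and file names here are made up):

    # Sketch: wrap each scanned page image in a PDF with an invisible text layer,
    # so full text search and copy/paste work while the scan stays what you see.
    # Assumes: pip install pytesseract pillow, plus tesseract on the PATH.
    from pathlib import Path
    import pytesseract

    out_dir = Path("searchable")                 # hypothetical output folder
    out_dir.mkdir(exist_ok=True)

    for scan in Path("scans").glob("*.tif"):     # hypothetical input folder of page images
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(str(scan), extension="pdf")
        (out_dir / (scan.stem + ".pdf")).write_bytes(pdf_bytes)

The per-page PDFs can then be merged and fed to whatever indexer does the "google your own library" part.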
Can't offer much help on the problems you are currently working on solving, except for the mask issue. If you haven't already looked into this, check out MRC - Mixed Raster Content - images. These are usually tiff format, but I'm pretty sure jpg can be made MRC as well. I currently use a (very) interactive GUI tool - github.com/ScanTailor-Advanced/scantailor-advanced (use this fork if you are going to try it; the original developer has abandoned development) - but the Internet Archive has a Python script that *may* work better, especially in your workflow (there's a weird bug/problem in scantailor where it doesn't properly identify the background when text sits on top of an image, like in magazines or textbook chapter/part/section openings). My tesseract runs currently happen against just the mask, but you could run them against the integrated image and have tesseract see only the mask - it even ignores text that your mask has been set to not treat as text. I'll get with you privately about testing images...

On Sun, Jun 9, 2024, 04:32 Ger Hobbelt <ger.hobb...@gmail.com> wrote:

> @JunRepasa: can you share details of your project/intent? (privately + NDA is okay with me if that works better for you)
>
> Why? Because I am working on that area of interest myself; however, it's slow going as I'm self-funded and there are multiple focus areas, also outside tesseract and not relevant for this "preprocessing stage" problem. (Slow going means I don't expect any usable results before end-of-year.)
>
> --- /start tangential note
>
> My own (possibly relevant) goal set is this:
>
> - integrate tesseract more fully in the mupdf tool chain (Artifex already has a basic tesseract run going, but for my purposes I need more fine-grained control per PDF and page image). Why mupdf-based? Because my problem area is very comparable to "scanned magazine page images" packaged as every-page-is-an-image PDFs (electronic datasheets, scanned magazines, ...), which I need to have ready for FTS (Full Text Search) and text copy & paste of the decoded contents. Image/chart/graphics extraction is a bonus. Anyhow, my inputs can be gang-pressed into PDFs if they aren't already, plus a couple of loose images (posters / cheat sheets / other single-page publications) if need be.
> - easier-for-humans diagnostics output, focused on the preprocessor stage: tesseract has some yet-to-be-diagnosed issues with page segmentation = discovering the little rectangle areas on the page where text words and text lines are situated; tesseract sometimes turns "deaf" for parts of the page for otherwise unremarkable page images. An experimental version of tesseract of mine outputs the debug output + intermediate-stage debug images in HTML format for easier perusal in the browser. This rides on the coattails of the tesseract ScrollView Java tool, but I am not looking for user-interactive; I want HTML-based, human-readable debug/diag log output for bulk processes that may be (partially) reviewed at a later date, and the review process should be less of a brain-load than it is right now.
> - flexible = more powerful preprocessing stage in tesseract: suppose we fix or remove the current segmentation bugs, what would work for me?
> Here the plan is to (minimally) modify the tesseract process so the various stages internally become addressable and steerable by a user script (I plan to use JavaScript for this, using QuickJS as the script core): that way I and anyone else can tweak the tesseract process stages without needing to recompile or use external means that run the executable repeatedly (pyTesseract et al.). I need a hopefully faster process as I will be processing page images in bulk on limited hardware: end-users' single machines.
> - tesseract CLI / API: allow inputting an optional "mask image" next to the page image itself. The fundamental idea here is that the segmentation process is done elsewhere (by human or other machine) and the mask image is similar to what one would encounter in the 3D entertainment movie industry: the "mask image" not only encodes which pixels in the image are text-to-be-OCR-ed, but also encodes the *order* in which these pixel groups form words or glyphs to OCR and output in the designated order. Think of a multi-layer mask image encoding all you need to unambiguously extract the text in a multi-column or other "shaped layout" page, plus mark any text that's part of in-page charts/graphs/images/footnotes/header/footer, so that we can write them in order to the output HOCR, TEXT or HTML formats. Thus the internal page segmentation logic can be overruled by external image means: all it takes is scanning the mask image to decode what is to be done, where and when, with no tesseract segmentation heuristics noise.
> - extra image processing: once the scripting works, add other image grayscaling and thresholding algorithms so script-writers can try a few more things and tweak the process to do what they find works for them. Basically that would mean using a PRLib- or OpenCV-like library, next to leptonica.
>
> The whole concept is based on getting towards a mupdf+leptonica+tesseract-based application which takes a batch of arbitrary PDFs and processes them, rewriting each as a fully searchable PDF with the original content visible on-screen while a text overlay ensures that text mark/edit/annotate/copy+paste behaviour becomes possible, while a separate text-like output format is fed to a search engine indexer for FTS, so "you can google your own library".
>
> Most of this exists out there already in some partial form or other (except the image mask concept); this is only a success when it ultimately can serve as a ready-to-use end-user application, ready out of the box. All future music right now, as this is the target but progress is slow.
>
> --- /end tangential note
>
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> --------------------------------------------------
> web: http://www.hobbelt.com/
> http://www.hebbut.net/
> mail: g...@hobbelt.com
> mobile: +31-6-11 120 978
> --------------------------------------------------
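A toy illustration of the mask-image idea in the quoted note above, done outside tesseract itself (this is not an existing tesseract feature or API; the mask convention and file names are assumptions): a grayscale mask where 0 means "ignore" and the values 1, 2, 3, ... give the reading order of the text regions.

    # Sketch only: emulate mask-driven segmentation + reading order outside tesseract.
    # Assumes: pip install opencv-python numpy pytesseract (and a tesseract binary).
    import cv2
    import numpy as np
    import pytesseract

    page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)       # hypothetical page image
    mask = cv2.imread("page_mask.png", cv2.IMREAD_GRAYSCALE)  # 0 = ignore, 1..n = reading order

    chunks = []
    for label in sorted(int(v) for v in np.unique(mask) if v > 0):
        ys, xs = np.where(mask == label)                       # pixels of this text region
        region = page[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        # PSM 6 = "assume a single uniform block of text" for each masked region.
        chunks.append(pytesseract.image_to_string(region, config="--psm 6").strip())

    print("\n\n".join(chunks))                                 # text in the order the mask encodes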
>
> On Sun, Jun 9, 2024 at 1:56 AM Jun Repasa <jun.rep...@gmail.com> wrote:
>
>> Guys, I am about to start a project which will mainly improve tesseract's ability to recognize text, regardless of input document/image quality. Am securing some grants/funding.
>>
>> If you're interested, let me know.
>>
>> cheers
>>
>> On Saturday 8 June 2024 at 05:37:41 UTC+12 misti...@gmail.com wrote:
>>
>>> Hello Ger, and thank you for responding.
>>>
>>> Regarding training and/or tuning - I definitely don't have the available computing power for a full train, and assuming I'm understanding the requirements (specifically the 1000 images minimum thing), I'm not sure I have enough data for a tune (it's approximately 230 pages that use this font, with only about 50% text coverage on the denser pages - the rest is non-OCR pictures - so even if the 1000 images are single-line images, I'm not sure I'd get there). I also have no idea what the font is; I suspect it's one that isn't available to the public (without a hefty fee), so generating new, very clean images isn't possible either (if it's possible to tune using one font and have it apply to others that aren't visually similar, that might actually be an option).
>>>
>>> So, we're back to manually fixing after the ocr run and/or using graphics software to further "fix" the images before processing. I could open the hocr files in my text editor and "fix" commas that are read as periods, quotes that aren't quite correct, and even super/sub fractions; but generating the bounding boxes when whole words are simply ignored due to uneven lighting (even though they are in the input image thanks to running a thresholding algorithm before being handed to tesseract) is something I haven't figured out how to do. (If you happen to know how to use The GIMP to selectively darken overexposed areas, that might help a lot.) Alternatively, is there a way to do a two-run recognition? Something akin to a non-persistent tune - do one run to a text file, manually correct the text file, and have the second run to hocr use that text file as the dictionary for that run.
>>>
>>> Biggest problem I am experiencing with manual correction: generating or fixing - mostly expanding, sometimes contracting - the bounding boxes after entering the correct characters, when what was recognized has the wrong metrics for what is supposed to be there.
>>>
>>> Second biggest problem (which, if possible, should be fixed first): I need an additional preprocessing step to fix uneven lighting. I have Rawtherapee and The GIMP available (I was able to fix overexposure, but that darkened everything equally; I need a way to spot-darken the regions that received more light during scanning - those regions are the ones that are most likely to not get recognized at all).
>>>
>>> On Mon, Jun 3, 2024, 17:06 Ger Hobbelt <ger.h...@gmail.com> wrote:
>>>
>>>> - "These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data."
>>>>
>>>> Well, to put it bluntly, diving into the rabbit hole without a helmet nor a 'chute: as far as I have been able to discover, the current "official" tesseract training data "databases" (neural net matrices) that are used to recognize anything we throw at tesseract have been produced ("trained") at google by Ray Smith, using copious hardware from google I expect -- training neural nets is no joy at the average Joe's hardware budget, after all. When you dig through the git commits, such as https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the last training file *content* update was back in '17 by @theraysmith and he hasn't been around much since: https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31 -- without any hard data, my initial guess is a change of corporate google mind re tesseract.
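On the uneven-lighting problem raised a few paragraphs up (scanner hot-spots washing out words): besides spot-fixing in GIMP, a common automated trick is to estimate the page background with a heavy blur and divide it out before binarizing. A hedged OpenCV sketch - the file names and parameter values are assumptions to be tuned per scan, not a recipe:

    # Sketch: flatten uneven scan lighting by dividing out an estimated background,
    # then binarize adaptively. Assumes: pip install opencv-python.
    import cv2

    gray = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)      # hypothetical input scan
    background = cv2.GaussianBlur(gray, (0, 0), sigmaX=51)   # coarse illumination estimate
    flattened = cv2.divide(gray, background, scale=255)      # evens out bright/dark regions
    binary = cv2.adaptiveThreshold(flattened, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY,
                                   51, 15)                    # block size / offset: tune per scan
    cv2.imwrite("page_flat.png", binary)                      # feed this to tesseract instead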
>>>>
>>>> Stefan Weil et al have done a ton of important work since, but when you ask "what can this baby recognize?" that translates 1:1 to "what has tesseract been trained to recognize?" and there... things get a little vague for me. I'd love to be corrected on this, slapped on the wrist or worse, but from what I've gleaned so far during my research:
>>>>
>>>> - though there's https://github.com/tesseract-ocr/langdata , https://github.com/tesseract-ocr/tesstrain , https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray Smith's public notes and papers about what was done for tesseract v4/v5 at https://github.com/tesseract-ocr/docs (which is separate from https://github.com/tesseract-ocr/tessdoc, which is more user-oriented rather than architectural background), I am not confident that the actual list of training files used to produce those master traineddata LSTM files (= tesseract v4/v5 OCR engine) is checked into git: I have seen a list of font names used someplace in there (or was it the mailing list?), but for anyone who works with fonts that already is a handwavey kinda thing and, yes, copyrights, yadayada, will forever prevent anything more precise from being made available, because the list most certainly included commercial fonts. Then there are also the training input files defining the "text lines" to be rendered as training material: those actually determine which glyphs in the fonts will be trained at all (and in what combinations). And there I am not feeling confident either, as it looks like the published files are the ones from the older v3 engine - still relevant, but *probably* not what Ray was using to produce those many traineddata files he did at the google shop.
>>>> Having dug through the git histories, inspected the various files, scripts and notes about 2 years ago, I cannot say with complete confidence whether your (C), TM and 1/2, 3/4, etc. fraction glyphs have made it into the training set for English back then. My *guess* is that they have been included, if only as a few samples, so the neural net will have *some* recollection of them, but I also expect them to have "featured little" in the total training process, so recognition chances are reduced.
>>>>
>>>> (Aside: As we focus on the English language training set here, I didn't mention the metric ton of work done by @Shreeshrii for Asian scripts, particularly Devanagari and related, a few years later. As far as I can tell, most of the `traineddata` scripts and process today are due to his work and Stefan Weil's, who, if you look over there, you'll note has done a lot of work around OCR-ing (pre-war) German newspapers and similar publications, which was when the Germans had a fondness for printing everything in (to my eyes) quite hard-to-read blackletter fonts.
>>>> To make that feat happen, he and the university team (of several German unis together, if I read what was done right, back when) created a German-specific training set for newspaper blackletter print and published the resulting tesseract traineddata OCR databases for public use (language "frk" / script "Fraktur"). I don't recall seeing a publication where he lists the number of CPU hours used to produce that trained set (one (1) language, a few fonts vs. the 400+ allegedly used in the google production run) but you can bet your bottom dollar it wasn't cheap! Or quick!)
>>>>
>>>> When we pop out of the rabbit hole of tesseract history, we might now better understand why your problem is answered... haphazardly:
>>>>
>>>> - general advice number 1 out there is to 'tune' a language training file if you have special needs, such as your wish to recognize fractions, etc., which don't feature often in published texts and thus haven't been a real bother thus far. This "tuning" advice is basically training advice to do a little extra training, which is, to me, a little hairy as you are expected not to deteriorate the existing recognition ability while *slightly improving* the recognition confidence (and thus output quality) for a few glyphs ("characters in your fonts") that are already mostly recognized by the neural net, as it recognizes part or all of the relevant "shapes" that make up the glyphs you wish to see recognized. (This is a very rough translation of what a neural net "learns" vs. how we humans might understand pattern recognition, so tread carefully around this blather of mine if you think you're getting a look under the hood. We're rather more *paraphrasing* the engine than pointing at its carburetor, spark plugs, etc., if you get my drift.)
>>>>
>>>> Logically, this approach is met with varying success (and crushed hopes) as it is VERY much dependent on the exact shapes and glyphs (characters) you add. (TM) might be helped by being quite close to a T+M superscript, while the fractions, being a combo of superscript, subscript and a / slash, might be doable or hard for the LSTM+CTC engine - I cannot tell without having tried. And training takes time, both in setting it up and in CPU cycles, so it's not a 5-minute thing to do. Which explains another type of silence around here.
>>>>
>>>> - if that didn't work, you will read several folks advising to "lop off the top layer" and retrain the whole language. What this says is that, basically, the attempt is to wipe just one of the many layers of the LSTM+CTC neural net where it is expected to 'conclude' things like "ah... that there and this shapy thingamajig here, all that jazz is very probably an 'a'..." and hope that that lopping-off-and-retraining suffices to get acceptable training results after running the training for a while (& checking you're doing all right and not overtraining other bits and pieces of the engine's alphabet/text output!). This takes rather more time than "tuning", as you must now retrain at least an entire layer, while tuning was only intended to have the training activity result in a few cell connections in there being tweaked a little to get what you wanted.
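Before sinking time into either of the tuning/retraining routes above (or the full custom training described next), it can be worth checking whether the problem glyphs are present in the shipped model's character set at all. The training tool combine_tessdata can unpack a traineddata file so its unicharset can be inspected; a rough sketch (exact unpacked component names vary between traineddata versions, so this is an assumption to verify locally):

    # Sketch: unpack eng.traineddata and check whether some glyphs are in its unicharset.
    # Assumes the Tesseract training tool `combine_tessdata` is installed and on PATH,
    # and that eng.traineddata sits in the current directory.
    import subprocess
    from pathlib import Path

    Path("unpacked").mkdir(exist_ok=True)
    subprocess.run(["combine_tessdata", "-u", "eng.traineddata", "unpacked/eng."], check=True)

    # LSTM models typically unpack an *.lstm-unicharset; legacy models an *.unicharset.
    unicharset = next(Path("unpacked").glob("eng.*unicharset")).read_text(encoding="utf-8")
    for glyph in ["™", "©", "⅛", "⅔", "“", "”"]:
        print(glyph, "present" if glyph in unicharset else "missing")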
>>>>
>>>> - general advice number 3 is to do what the Germans did and train a dedicated "language", which means you'll need to do all the work of creating font(s) and text-line training files which include (hopefully) every word and symbol you may ever encounter later on, and then cook one CPU or more for some considerable time. I consider that effort approaching herculean, particularly when you're alone. Some have tried, and a few even succeeded, it seems, from the noises I recall from the last couple of years of lurking on this mailing list.
>>>>
>>>> By now you'll surely have gotten the gist of it: from the distance of a mailing list POV, it's all a guess and there are so many little details involved in arriving at success that almost nobody dares venture saying much, at least not all at once. Because this stuff is *hard* to get right, and the above can scare some folks off.
>>>>
>>>> Me personally, I tried my hand at "tuning" a little about a year ago and it didn't fare well, because I found out I still didn't understand all the processes involved well enough to make decisions that would differ from joining a crap shoot blindfolded. But that is me and I am not into the adrenalin rush of bungee jumping either, so it probably says more about me than about the process of training/tuning tesseract.
>>>>
>>>> Having mentioned the above three options, my personal favorite advice number 4 is: try to come up with a way which can keep tesseract as-is, and add a review/correction post-process that is acceptable for you. If you find it in your heart to accept that a little copy-editing after the OCR actions is A-okay, you are probably better off, both in time spent and in frustration with machines' ways. After all, the initial setup cost for this option is much less for single-person shops, I expect. ;-) (The break-even would be a fairly large number of pages to process...)
>>>>
>>>> - "I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner"
>>>>
>>>> Personal experience to date is that image preprocessing is a "field of active research" (i.e. you need to try and test all your own and any others' ideas that sound more or less reasonable) and has a very strong effect on the outcome of the OCR stage. For instance, you may want to rescale your scanned images and see at which text pixel height they do well/best; previous research says text at 30-33 pixels height is optimal, but yours might differ a little from that, so experiment! (I'll try to do a tesseract run on an image you posted earlier, tomorrow, at various resize sizes to see what comes out of that one.)
>>>>
>>>> Ditto for post-processing: it might be useful, if the content is important enough to you, to dump it into a word processor / text editor with a spellchecker on board for further assistance. A manual review process of some kind is called for, anyway, if you want consistent (very) high quality output.
>>>>
>>>> There are also processors/tools that can do "smart quotes" if you like, but I would reserve that for last; my initial approach there would be to have the OCR engine spit out quotes wherever they occur and then convert them to "smart" open/close quotes in post, if I wanted. French quotes would potentially be easier to OCR that way (as they appear at different vertical offsets) but I'd be glad to have *any* kind of quote coming out of the OCR machine: the training sets have been trained on a gazillion fonts, and intricate little typography details like "smart quotes" are rather font-specific, so recognizing them from an OCR engine's perspective screams "tuning! dedicated font training!" and a little headache starts to develop over here. ;-))
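That "straight quotes from the OCR, smarten them in post" route can be a very small script; a naive sketch (real copy-editing needs more context awareness than this, e.g. for nested quotes or inch/foot marks):

    # Sketch: convert straight quotes in OCR text output to "smart"/curly quotes in post.
    import re

    def smarten(text: str) -> str:
        # Opening quote after start-of-text, whitespace or an opening bracket; closing otherwise.
        text = re.sub(r'(^|[\s(\[])"', '\\1\u201c', text)   # "  ->  “ (opening)
        text = text.replace('"', '\u201d')                   # remaining "  ->  ” (closing)
        text = re.sub(r"(^|[\s(\[])'", '\\1\u2018', text)    # '  ->  ‘ (opening)
        text = text.replace("'", '\u2019')                   # remaining '  ->  ’ (closing/apostrophe)
        return text

    print(smarten('He said "it\'s an \'easy\' fix" yesterday.'))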
>>>>
>>>> - "Slightly related, how, exactly, do y'all deal with drop caps?"
>>>>
>>>> Errrrm, AFAICT.... we don't. Apologies. Seriously though, I don't recall any positive success info on that one.
>>>>
>>>> Here my initial gut response is to "recognize" the drop caps in the preprocessor, i.e. in the "image segmentation phase", and cut them out specifically to have them extracted, rescaled to a sensible "regular text size" and only then fed into the OCR engine. Afterwards the output has to be recombined with the rest of the image segments' text produce. BUT that is mere theory, as tesseract does not yet have a module/subprocess to "identify" possible drop caps and segment and process them as I just described. Which means that today, you either do that up front and do the recombining afterwards in your own custom postprocess, or you decide to accept a little extra editorial post work by either keeping them in as-is (and expecting errors or at least uncertainties reported by the OCR engine) or maybe tipp-ex-ing ;-) them out in preprocessing and hoping the engine's built-in dictionary resolves half of them via spelling correction. Anyway, this is all currently non-existent, alas, so anything you come up with is better than what exists today.
>>>>
>>>> (I am working on my own copy of tesseract which might improve this a little, but don't expect any miracles there this quarter. I'm /slow/.)
>>>>
>>>> The 'tesseract does best with 30-33 pixel high text' stuff is at:
>>>> - https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
>>>> I wrote https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ a while ago; maybe the diagram and some paragraphs in there aid understanding of what's going on under the hood - info which I think you need, like I did/do.
>>>>
>>>> Take care,
>>>>
>>>> Ger
>>>>
>>>> P.S.: it was lying around for a gander, but my tesseract is buggered ATM. Anyway, I installed an "official distro" one yesterday for other purposes and I'll see how your previously posted scans fare with that one when I test a few things on them. To be reported later this week, possibly tomorrow afternoon.
>>>>
>>>> On Monday, May 20, 2024 at 5:02:24 AM UTC+2 misti...@gmail.com wrote:
>>>>
>>>>> I've asked a couple of different times, and each time I get just a little bit more information, but still not enough to work with.
>>>>>
>>>>> I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner; I'm working on that to re-scan, but there are some problems that still wouldn't be fixed). These scans include characters that are not in the Latin-1 block, which I read somewhere (and now can't find where) is the limit for the English data.
>>>>> Example characters not being recognized include fractions ( ⅛ ⅔ instead of 1/8 or 2/3), the TM ( ™ ) or C ( © ) symbols (the latter is actually in Latin-1, but isn't directly typeable and, from what I've been able to tell, the circled part comes out so faint on the input image that tesseract thinks it is noise) and "smart" or curly quotes - all characters that require using alt+ codes, insert-special-character dialogs, or letting your word processor/DTP handle the conversion for you. Which seems to mean they require some level of manual review and correction to get them into the text output. BUT, once you see you need to input manually, how do you handle the positioning data (when working in hocr format)? I considered, briefly, using character whitelisting to help with these, but that would imply the characters are already included in the character set/wordlist, which, if memory serves, many of these aren't?
>>>>>
>>>>> Slightly related, how, exactly, do y'all deal with drop caps?
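On the hocr positioning question above: in hOCR output the coordinates live in the title attribute of each ocrx_word span (title='bbox x0 y0 x1 y1; x_wconf ...'), so they can be patched programmatically after a manual text correction rather than by hand-editing every box. A minimal sketch, assuming BeautifulSoup/lxml are installed; the file names and the +12 pixel widening are placeholders:

    # Sketch: widen the bbox of a manually corrected word in an hOCR file.
    # Assumes: pip install beautifulsoup4 lxml.
    import re
    from bs4 import BeautifulSoup

    with open("page.hocr", encoding="utf-8") as f:        # hypothetical hOCR input
        soup = BeautifulSoup(f, "lxml")

    for word in soup.select("span.ocrx_word"):
        if word.get_text(strip=True) == "⅔":              # e.g. a fraction you re-typed by hand
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", word["title"])
            if m is None:
                continue
            x0, y0, x1, y1 = map(int, m.groups())
            x1 += 12                                       # widen a bit to cover the new glyphs (guess)
            word["title"] = re.sub(r"bbox \d+ \d+ \d+ \d+",
                                   f"bbox {x0} {y0} {x1} {y1}", word["title"])

    with open("page.fixed.hocr", "w", encoding="utf-8") as f:
        f.write(str(soup))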