Word-level extraction only 

On Tuesday 13 February 2024 at 11:10:03 UTC+5:30 Santhiya C wrote:

> I have completed the training portion using the Tesseract OCR training 
> process. After annotating the .box file, the misspelt characters in my 
> output extraction did not change.
>
> I followed only this article: Training Tesseract-OCR with custom 
> data | by Sai Ashish | Medium 
> <https://saiashish90.medium.com/training-tesseract-ocr-with-custom-data-d3f4881575c0>
> How do I resolve this issue? 
> On Thursday 8 February 2024 at 10:22:40 UTC+5:30 aromal...@gmail.com 
> wrote:
>
>> Are you working on word-level text extraction or sentence-level text 
>> extraction?
>>
>> On Tuesday 6 February 2024 at 12:11:03 UTC+5:30 santhi...@gmail.com 
>> wrote:
>>
>>> Can you please tell me the model and the steps? 
>>>
>>> On Monday 5 February 2024 at 17:22:10 UTC+5:30 aromal...@gmail.com 
>>> wrote:
>>>
>>>> If you are getting started with OCR, try some other engines, or just 
>>>> start with some deep learning models to understand the basic workings.
>>>> On Thursday 1 February 2024 at 11:17:14 UTC+5:30 santhi...@gmail.com 
>>>> wrote:
>>>>
>>>>> I already used the above-mentioned steps, but I lost the data. 
>>>>>
>>>>> On Saturday 27 January 2024 at 06:52:54 UTC+5:30 g...@hobbelt.com 
>>>>> wrote:
>>>>>
>>>>>> L.S.,
>>>>>>
>>>>>> *PDF. OCR. text extraction. best language models? not a lot of 
>>>>>> success yet...*
>>>>>>
>>>>>> 🤔 
>>>>>>
>>>>>> Broad subject.  Learning curve ahead. 🚧 Workflow diagram included 
>>>>>> today.
>>>>>>
>>>>>>
>>>>>> *Tesseract does not live alone*
>>>>>>
>>>>>> Tesseract is an engine: it takes an image as input and produces text 
>>>>>> output; several output formats are available. If you are unsure, start 
>>>>>> with HOCR output, as that's close to modern HTML and carries almost all 
>>>>>> the info tesseract produces during the OCR process.
>>>>>> If it isn't an image you've got, you need a preprocessing step (and 
>>>>>> consequently additional tools) to produce images you can feed to 
>>>>>> tesseract. tesseract is designed to process a SINGLE IMAGE. (Yes, that 
>>>>>> means you may want to 'merge' its output: postprocessing.)
>>>>>>
>>>>>> *     To complicate matters immediately, tesseract can deal with 
>>>>>> "multipage TIFF" images and can accept multiple images to process via 
>>>>>> its 
>>>>>> commandline. Keep thinking "one page image in, bunch of text out" and 
>>>>>> you'll be okay until you discover the additional possibilities.*
>>>>>>
>>>>>> *Advice Number 1: *get a tesseract executable and invoke it using its 
>>>>>> commandline interface. If you can't build tesseract yourself, Uni 
>>>>>> Mannheim may have binaries for you to download and install. Linux 
>>>>>> distributions often have tesseract binaries and the mandatory language 
>>>>>> models available as packages, BUT many distributions are far behind the 
>>>>>> curve: the latest tesseract release as of this writing is 5.3.4: 
>>>>>> https://github.com/tesseract-ocr/tesseract/releases so VERIFY your 
>>>>>> rig has the latest tesseract installed. Older releases are older and 
>>>>>> "previous" for a reason!
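>>>>>>
>>>>>> As a minimal sketch of that commandline interface (the "one page image 
>>>>>> in, bunch of text out" shape), here's how such an invocation can be 
>>>>>> assembled; the file names are made up, and it assumes a `tesseract` 
>>>>>> binary is on your PATH when you actually run it:

```python
import subprocess  # only needed when you actually execute the command


def tesseract_cmd(image_path, out_base, lang="eng", fmt="hocr"):
    # tesseract <image> <output-base> -l <lang> <configfile>
    # "hocr", "pdf" and "txt" are standard output config names;
    # tesseract appends the matching extension to out_base itself.
    return ["tesseract", image_path, out_base, "-l", lang, fmt]


cmd = tesseract_cmd("page-001.png", "page-001")
# To run it for real: subprocess.run(cmd, check=True)
```

>>>>>> (The helper function name is mine, not part of any tool; the point is 
>>>>>> simply the argument order of the tesseract CLI.)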
>>>>>>
>>>>>>
>>>>>> *Preprocessing is the chorus of this song*
>>>>>>
>>>>>> As you say "PDF", you therefore need to convert that thing to *page 
>>>>>> images*. My personal favorite is the Artifex mupdf toolkit, using the 
>>>>>> mutool or mudraw etc. tools from that commandline toolkit to render 
>>>>>> accurate, high-rez page images. Others will favor other means, but it 
>>>>>> all ends up doing the same thing: anything, PDFs et al, is to be 
>>>>>> converted to one image per page and fed to tesseract that way. The 
>>>>>> rendered page images MAY require additional *image preprocessing*: 
>>>>>>
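>>>>>> The PDF-to-page-images render step can be sketched like this; it 
>>>>>> assumes mupdf's `mutool` is installed, and the file names are, of 
>>>>>> course, placeholders:

```python
def mutool_render_cmd(pdf_path, out_pattern="page-%03d.png", dpi=300):
    # mutool draw -r <dpi> -o <pattern> <file>  ->  one PNG per page;
    # %03d in the output pattern is replaced by the page number by mutool.
    return ["mutool", "draw", "-r", str(dpi), "-o", out_pattern, pdf_path]


cmd = mutool_render_cmd("report.pdf")
# Each resulting page-NNN.png is then fed to tesseract individually.
```
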
>>>>>>
>>>>>> *This next bit cannot be stressed enough: *tesseract is designed and 
>>>>>> engineered to work on plain printed book pages, i.e. BLACK TEXT on PLAIN 
>>>>>> WHITE BACKGROUND. I observe everyone and their granny dumping holiday 
>>>>>> snapshots, favorite CD, LP and fancy colourful book covers straight into 
>>>>>> tesseract and complaining "nothing sensible is coming out": that's 
>>>>>> because you're feeding it a load of dung as far as the engine is 
>>>>>> concerned. It expects BLACK TEXT on PLAIN WHITE BACKGROUND like a 
>>>>>> regular dull printed page in a BOOK, so anything with nature 
>>>>>> backgrounds, colourful architectural backgrounds and such is begging for 
>>>>>> a disaster. And I only empathize with the grannies. <drama + rant mode 
>>>>>> off/> This is why 
>>>>>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is 
>>>>>> mentioned almost every week in this mailing list, for example. It's very 
>>>>>> important, but you'll need more...
>>>>>>
>>>>>>
>>>>>> The take-away? You'll need additional tools for image preprocessing 
>>>>>> until you can produce greyscale or B&W images that look almost as if 
>>>>>> they were plain old boring book pages: no or very little fancy stuff, 
>>>>>> black text (anti-aliased or not), white background. 
>>>>>> Bonus points for you when your preprocess removes non-text image 
>>>>>> components, e.g. photographs, from the page image: they can only confuse 
>>>>>> the OCR engine, so when you strive for perfection, that's one more bit 
>>>>>> to deal with BEFORE you feed it into tesseract and wait expectantly... 
>>>>>> (Besides, tesseract will have less discovery to do, so it'll be faster 
>>>>>> too. Of little importance, relatively speaking, but there you have it.)
>>>>>> As also mentioned at 
>>>>>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html : tools 
>>>>>> of interest re image processing are leptonica (parts of it are used by 
>>>>>> tesseract, but don't count on it doing your preprocessing for you, as 
>>>>>> that's a highly scenario/case-dependent activity and therefore not 
>>>>>> included in tesseract itself). Also check out: OpenCV (a library, not a 
>>>>>> tool, so you'll need scaffolding there before you can use it), 
>>>>>> ImageMagick, and (Adobe Photoshop or open source) Krita: great for 
>>>>>> what-can-I-get experiments but not suitable for bulk. Etc.
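>>>>>>
>>>>>> To make the "black text on plain white background" goal concrete, here 
>>>>>> is a toy illustration of the simplest possible step in that direction, 
>>>>>> a global threshold. Real preprocessing (leptonica, OpenCV, ImageMagick) 
>>>>>> uses far smarter adaptive thresholding; this only shows the idea:

```python
def binarize(pixels, threshold=128):
    # Map a greyscale page (0 = black, 255 = white) to pure black-and-white:
    # everything darker than the threshold becomes ink, the rest paper.
    return [[0 if p < threshold else 255 for p in row] for row in pixels]


# A 2x3 "page": darkish glyph pixels versus light paper pixels.
page = [[30, 200, 90],
        [250, 40, 180]]
bw = binarize(page)
```
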
>>>>>>
>>>>>>
>>>>>> *Tesseract bliss and the afterglow: postprocessing*
>>>>>>
>>>>>> Once you are producing page images as if they were book pages and 
>>>>>> feeding them into tesseract, you get output, be it "plain text", HOCR or 
>>>>>> otherwise.
>>>>>>
>>>>>> Personally I favor HOCR, but that's because it's closest to what *my 
>>>>>> *workflow needs. You must look into "postprocessing" anyway: be it 
>>>>>> additional tooling to recombine the OCR-ed text into a PDF "overlay", 
>>>>>> PDF/A production, or anything else; advanced usage may require 
>>>>>> additional postprocessing steps, e.g. pulling the OCR-ed text through a 
>>>>>> spellchecker+corrector such as hunspell, if that floats your boat. 
>>>>>> You'll also need to get, set up and/or program postprocess tooling if 
>>>>>> you otherwise wish to merge multiple images' OCR results. You may want 
>>>>>> to search the internet for this; I don't have any toolkit's name present 
>>>>>> off the top of my head for that, as I'm using tesseract in a slightly 
>>>>>> different workflow, where it is part of a custom, *augmented *mupdf 
>>>>>> toolkit: PDF in, PDF + HOCR + misc document metadata out, so all the 
>>>>>> preprocessing and postprocessing I hammer on is done by yours truly's 
>>>>>> custom toolchain. It's under development, so I'm not working with the 
>>>>>> diverse python stuff most everybody else will dig up after a quick 
>>>>>> google search, I'm sure. Individual projects' requirements differ, so 
>>>>>> your path will only be obvious to you.
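>>>>>>
>>>>>> Since HOCR is just HTML with agreed-upon class names, a postprocessing 
>>>>>> sketch can pull the recognized words out with nothing but the standard 
>>>>>> library. The snippet fed to the parser below is a hand-made, heavily 
>>>>>> shortened imitation of tesseract's real .hocr output, not actual engine 
>>>>>> output:

```python
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect the text of every HOCR word span (class="ocrx_word")."""

    def __init__(self):
        super().__init__()
        self.in_word = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        # tesseract marks each word as <span class="ocrx_word" title="bbox ...">
        if tag == "span" and ("class", "ocrx_word") in attrs:
            self.in_word = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_word = False

    def handle_data(self, data):
        if self.in_word and data.strip():
            self.words.append(data.strip())


hocr = ('<span class="ocrx_word" title="bbox 1 1 50 30">Hello</span> '
        '<span class="ocrx_word" title="bbox 60 1 120 30">world</span>')
parser = WordCollector()
parser.feed(hocr)
# parser.words now holds the OCR-ed words in reading order.
```

>>>>>> (This flat parser is only a sketch: real HOCR nests spans for areas, 
>>>>>> paragraphs and lines, and carries per-word confidence in the title 
>>>>>> attribute, which you'd want to keep for any serious postprocessing.)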
>>>>>>
>>>>>>
>>>>>>
>>>>>> *How to troll an OCR engine* 😋
>>>>>>
>>>>>> Oh, before I forget: some peeps drop shopping bills and such into 
>>>>>> off-the-shelf tesseract: *cute*, but not anything like a "plain 
>>>>>> printed book page", so they encounter all kinds of "surprises". 
>>>>>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is 
>>>>>> important, but it doesn't tell you *everything*. "Plain printed book 
>>>>>> pages" are, by general assumption, pages of text, or, more precisely: 
>>>>>> *stories*. Or other tracts with paragraphs of text. Bills, invoices 
>>>>>> and other financial stuff are not just "tabulated semi-numeric content" 
>>>>>> instead of "paragraphs of text"; those types of inputs also fail grade F 
>>>>>> regarding the other implicit assumption that comes with human 
>>>>>> "paragraphs of text": the latter are series of words, technically each a 
>>>>>> bunch of alphabet glyphs (*alpha*numerics), while financials often mix 
>>>>>> currency symbols and numeric values. While these were part of 
>>>>>> tesseract's training set, I am sure, they are not its focal point and 
>>>>>> hence have been given less attention than the words in your language 
>>>>>> dictionary. And scanning those SKUs will fare even worse, as they're 
>>>>>> just jumbled *codes*, rather than *language*. Consequently you'll 
>>>>>> need to retrain tesseract if your CONTENT does not suit these mentioned 
>>>>>> assumptions re "plain printed book page". I haven't done that yet 
>>>>>> myself; it's not for the faint of heart, and since Google did the 
>>>>>> training for the "official" tesseract language models everyone downloads 
>>>>>> and uses, you can bet your bottom retraining isn't going to be "nice" 
>>>>>> for the less well funded either. Don't expect instant miracles, and 
>>>>>> expect a long haul when you decide you must go this route [of training 
>>>>>> tesseract], or you will meet Captain Disappointment. Y'all have been 
>>>>>> warned. 😉
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Why your preprocess is more important than kickstarting tesseract 
>>>>>> by blowing ether* up its carburetor*
>>>>>>
>>>>>> *Why is that "plain printed book page is like human stories and 
>>>>>> similar tracts: paragraphs of text" mantra so important?* Well, 
>>>>>> tesseract uses a lot of technology to get the OCR quality it achieves, 
>>>>>> including using language dictionaries. While some smarter people will 
>>>>>> find switches in tesseract where *explicit* dictionary usage can be 
>>>>>> turned off, you cannot switch off the *implicit* use, due to how the 
>>>>>> latest and best core engine, LSTM+CTC (since tesseract v4), actually 
>>>>>> works: it slowly moves its gaze across each word it is fed (jargon: an 
>>>>>> *image segmentation* preprocess inside tesseract produces these word 
>>>>>> images), and LSTM is so good at recognizing text because it has "learned 
>>>>>> context": that context being the characters surrounding the one it is 
>>>>>> gazing at right now. Which means LSTM can be argued to act akin to a 
>>>>>> *hidden Markov model* (see wikipedia) and thus will deliver its 
>>>>>> predictions based on what "language" (i.e. *dictionary*) it was fed 
>>>>>> during training: human text as used in professional papers and stories. 
>>>>>> Dutch VAT codes didn't feature in the training set, as one member of the 
>>>>>> ML discovered a while ago. Financial amounts, e.g. "EUR7.95", are also 
>>>>>> not prominently featured in the LSTM's training, so you can now guess 
>>>>>> the amount of confusion the LSTM will experience when scanning across 
>>>>>> such a thing: reading "EUR" has it expect "O" with high confidence, as 
>>>>>> "eur" obviously leads to the word "euro", but what the heck is that 
>>>>>> "digit 7" doing there?! That's *highly* unexpected, hence OCR 
>>>>>> probabilities drop, pass decision-making thresholds, and you get WTF 
>>>>>> results, simply because the engine went WTF *first*.
>>>>>> Ditto story/drama for calligraphed signs outside shops, and, *oh! 
>>>>>> oh!, license plates*!! (google LPR/ALPR if you want any of that) and 
>>>>>> *anything else* that's *not* reams of text and thus you wouldn't 
>>>>>> expect to find in a plain story- or textbook.
>>>>>> (And for the detail-oriented folks: yes, tesseract had/has a module 
>>>>>> on board for recognizing math, but I haven't seen that work very well 
>>>>>> with my inputs, and I haven't seen a lot of happy noises out there about 
>>>>>> it either, though the Google engineer(s) surely must have anticipated 
>>>>>> OCRing that kind of stuff alongside paragraphs of text. For us mere 
>>>>>> mortals, I'd consider this bit "a historic attempt" and forget about 
>>>>>> it.)
>>>>>>
>>>>>>
>>>>>> *Advice Number 2: *when rendering page images, the ppi (pixels per 
>>>>>> inch) resolution to select would best be adjusted to produce regular 
>>>>>> lines of text in those images where the capital-height of the text is 
>>>>>> around 30 pixels. Typography people would rather refer to *x-height*, 
>>>>>> so that would be a little lower in pixel height. Line height would be 
>>>>>> larger, as that includes stems and interline spacing. However, from an 
>>>>>> OCR engine perspective, these (x-height & line-height) are very much 
>>>>>> dependent on the font used and the page layout used, so they are more 
>>>>>> variable than the reported optimal capital-D-height at ~32px. As no-one 
>>>>>> measures this up-front, as an initial guess, 300dpi in the 
>>>>>> render/print-to-image dialog of your render tool of choice would be a 
>>>>>> reasonable start, but when you want more accuracy, tweaking this number 
>>>>>> can already bring some quality changes. 
>>>>>> Of course, when the source is (low-rez) bitmap images already (embedded 
>>>>>> in PDF or otherwise), there's little you can do, but then there's still 
>>>>>> scaling, sharpening, etc. image preprocessing to try. This advice is 
>>>>>> driven by the results published here: 
>>>>>> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ 
>>>>>> (and google already quickly produced one other person who does something 
>>>>>> like that and published a small bit of tooling: 
>>>>>> https://gist.github.com/rinogo/294e723ac9e53c23d131e5852312dfe8 )
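>>>>>>
>>>>>> A back-of-the-envelope sketch of that sizing arithmetic: pick a render 
>>>>>> DPI so the capital height lands near ~32 px. The cap-height ratio used 
>>>>>> here (~0.7 of the em square) is a typography rule of thumb that varies 
>>>>>> per font, so treat both numbers as assumptions to tweak, not gospel:

```python
def suggested_dpi(font_pt, target_cap_px=32, cap_ratio=0.7):
    # Capital height in inches: point size * cap ratio, at 72 pt per inch.
    cap_height_inches = font_pt * cap_ratio / 72.0
    # DPI needed so that capital height maps to target_cap_px pixels.
    return round(target_cap_px / cap_height_inches)


dpi_10pt = suggested_dpi(10)   # typical 10 pt body text -> ~329 dpi
dpi_12pt = suggested_dpi(12)   # 12 pt body text -> ~274 dpi
```

>>>>>> Reassuringly, both land in the same ballpark as the 300 dpi starting 
>>>>>> guess above.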
>>>>>>
>>>>>>
>>>>>> *) the old-fashioned way to see if a rusty engine will still go (or 
>>>>>> blow, alas). Replace with "SEO'd blog pages extolling instant success 
>>>>>> with ease" to take this into the 21st century.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *The mandatory readings list:*
>>>>>>
>>>>>> - https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
>>>>>> - https://tesseract-ocr.github.io/tessdoc/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *The above in diagram form (suggested tesseract workflow ;-) )*
>>>>>>
>>>>>> [image: diagram.png]
>>>>>> (diagram PikChr source + SVG attached)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Met vriendelijke groeten / Best regards,
>>>>>>
>>>>>> Ger Hobbelt
>>>>>>
>>>>>> --------------------------------------------------
>>>>>> web:    http://www.hobbelt.com/
>>>>>>         http://www.hebbut.net/
>>>>>> mail:   g...@hobbelt.com
>>>>>> mobile: +31-6-11 120 978
>>>>>> --------------------------------------------------
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 26, 2024 at 6:11 PM Santhiya C <santhi...@gmail.com> 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi guys, I am going to start developing OCR for image and PDF to text 
>>>>>>> extraction. What are the steps I need to follow? Can you please refer 
>>>>>>> me to the best model? I have already used the pytesseract engine, but I 
>>>>>>> did not get proper extraction ...
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Sandhiya
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/a92d17a9-4bcf-4ba0-a81c-71e8e08a4afen%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/a92d17a9-4bcf-4ba0-a81c-71e8e08a4afen%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>>>
