I have already installed pytesseract, but I get this error:

    Usage: pytesseract [-l lang] input_file

How do I fix this issue?
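For context, that "Usage:" line is just the pytesseract console script printing its help because no input file was given. A minimal sketch of driving the underlying tesseract binary instead; the file names and the install guard are illustrative, not from this thread:

```python
# Sketch: the "Usage:" message means the pytesseract CLI got no input file.
# One fix is to call the plain tesseract binary directly ("Advice Number 1"
# below). "page.png"/"page" are placeholder names for illustration.
import shutil
import subprocess

def tesseract_cmd(image, out_base, lang="eng", fmt="hocr"):
    # tesseract <image> <output base> -l <lang> <config>, e.g. HOCR output
    return ["tesseract", image, out_base, "-l", lang, fmt]

cmd = tesseract_cmd("page.png", "page")
if shutil.which("tesseract"):        # only run when the binary is installed
    subprocess.run(cmd, check=True)  # writes page.hocr next to the input
```

The equivalent pytesseract CLI call would be `pytesseract -l eng page.png`; the Python API (`pytesseract.image_to_string`) is the more common route.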
On Saturday 27 January 2024 at 14:08:01 UTC+5:30 zdenop wrote:

> 👍
>
> Zdenko
>
>
> so 27. 1. 2024 o 2:22 Ger Hobbelt <g...@hobbelt.com> napísal(a):
>
>> L.S.,
>>
>> *PDF. OCR. text extraction. best language models? not a lot of success 
>> yet...*
>>
>> 🤔 
>>
>> Broad subject.  Learning curve ahead. 🚧 Workflow diagram included today.
>>
>>
>> *Tesseract does not live alone*
>>
>> Tesseract is an engine, which takes an image as input and produces text 
>> output; several output formats are available. If you are unsure, start with 
>> HOCR output as that's close to modern HTML and carries almost all info 
>> tesseract produces during the OCR process.
>> If it isn't an image you've got, you need a preprocessing step (and 
>> consequently additional tools) to produce images you can feed to 
>> tesseract. tesseract is designed to process a SINGLE IMAGE. (Yes, that 
>> means you may want to 'merge' its output: postprocessing.)
>>
>> *     To complicate matters immediately, tesseract can deal with 
>> "multipage TIFF" images and can accept multiple images to process via its 
>> commandline. Keep thinking "one page image in, bunch of text out" and 
>> you'll be okay until you discover the additional possibilities.*
>>
>> *Advice Number 1: *get a tesseract executable, invoke it using its 
>> commandline interface. If you can't build tesseract yourself, Uni Mannheim 
>> may have binaries for you to download and install. Linux distributions 
>> often have tesseract binaries and the mandatory language models available 
>> as packages, BUT many of them are more or less behind the curve: the 
>> latest tesseract release as of this writing is 5.3.4: 
>> https://github.com/tesseract-ocr/tesseract/releases so VERIFY your rig 
>> has the latest tesseract installed. Older releases are older and 
>> "previous" for a reason!
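>> A small sketch of acting on that VERIFY step: parse the first line of 
>> `tesseract --version` output and compare it against 5.3.4 (the version 
>> strings below are canned examples, not live output):

```python
# Sketch: check that an installed tesseract is recent enough. Feed it the
# first line of `tesseract --version` output, e.g. "tesseract 5.3.4".
import re

def parse_version(first_line):
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", first_line)
    return tuple(int(g) for g in m.groups())

def is_current(first_line, minimum=(5, 3, 4)):
    return parse_version(first_line) >= minimum

print(is_current("tesseract 5.3.4"))  # True: up to date
print(is_current("tesseract 4.1.1"))  # False: a distro behind the curve
```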
>>
>>
>> *Preprocessing is the chorus of this song*
>>
>> As you say "PDF", you therefore need to convert that thing to *page 
>> images*. My personal favorite is the Artifex mupdf toolkit, using the 
>> mutool / mudraw commandline tools to render accurate, high-rez page 
>> images. Others will favor other means, but it all ends up doing the same 
>> thing: anything, PDFs et al., is to be converted to one image per page 
>> and fed to tesseract that way. The rendered page images MAY require 
>> additional *image preprocessing*: 
>>
>>
>> *This next bit cannot be stressed enough: *tesseract is designed and 
>> engineered to work on plain printed book pages, i.e. BLACK TEXT on PLAIN 
>> WHITE BACKGROUND. As I observe everyone and their granny dumping holiday 
>> snapshots, favorite CD, LP and fancy colourful book covers straight into 
>> tesseract and complaining that "nothing sensible is coming out": that's 
>> because you're feeding it a load of dung as far as the engine is 
>> concerned. It expects BLACK TEXT on PLAIN WHITE BACKGROUND like a regular 
>> dull printed page in a BOOK, so anything with nature backgrounds, 
>> colourful architectural backgrounds and such is begging for disaster. And 
>> I only empathize with the grannies. <drama + rant mode off/>   This is why 
>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html is mentioned 
>> almost every week in this mailing list, for example. It's very important, 
>> but you'll need more...
>>
>>
>> The take-away? You'll need additional tools for image preprocessing until 
>> you can produce greyscale or B&W images that look almost as if these were 
>> plain old boring book pages: no or very little fancy stuff, black text 
>> (anti-aliased or not), white background. 
>> Bonus points for you when your preprocess removes non-text image 
>> components, e.g. photographs, from the page image: they can only confuse 
>> the OCR engine, so when you strive for perfection, that's one more bit to 
>> deal with BEFORE you feed the image into tesseract and wait 
>> expectantly... (Besides, tesseract will have less discovery to do, so 
>> it'll be faster too. Of little importance, relatively speaking, but there 
>> you have it.)
>> As also mentioned at 
>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html : tools of 
>> interest re image processing are leptonica (parts of it are used by 
>> tesseract, but don't count on it doing your preprocessing for you, as 
>> preprocessing is a highly scenario/case-dependent activity and therefore 
>> not included in tesseract itself). Also check out: OpenCV (a library, not 
>> a tool, so you'll need scaffolding there before you can use it), 
>> ImageMagick, Adobe Photoshop or its open-source cousin Krita (great for 
>> what-can-I-get experiments but not suitable for bulk), etc. etc.
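>> To make the black-on-white requirement concrete, here is a toy 
>> binarization sketch in pure Python; real preprocessing would lean on the 
>> tools above (OpenCV, ImageMagick, ...), and the pixel values are made up:

```python
# Toy sketch: global thresholding, mapping grayscale pixels (0 = black,
# 255 = white) to pure black/white. This is the simplest possible stand-in
# for the preprocessing discussed above, not production code.
def binarize(pixels, threshold=128):
    return [[0 if p < threshold else 255 for p in row] for row in pixels]

# Made-up 2x4 grayscale patch: dark glyph pixels on a lighter background.
page = [[ 30,  40, 200, 220],
        [ 25, 210, 215, 230]]
print(binarize(page))  # -> [[0, 0, 255, 255], [0, 255, 255, 255]]
```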
>>
>>
>> *Tesseract bliss and the afterglow: postprocessing*
>>
>> Once you are producing page images like they were book pages, and feeding 
>> them into tesseract, you get output, be it "plain text", HOCR or 
>> otherwise.
>>
>> Personally I favor HOCR, but that's because it's closest to what *my* 
>> workflow needs. You must look into "postprocessing" anyway: be it 
>> additional tooling to recombine the OCR-ed text into a PDF "overlay", 
>> PDF/A production, or anything else; advanced usage may require additional 
>> postprocessing steps, e.g. pulling the OCR-ed text through a 
>> spellchecker+corrector such as hunspell, if that floats your boat. You'll 
>> also need to get, set up and/or program postprocess tooling if you wish 
>> to merge multiple images' OCR results. You may want to search the 
>> internet for this; I don't have any toolkit's name present off the top of 
>> my head for that, as I'm using tesseract in a slightly different 
>> workflow, where it is part of a custom, *augmented* mupdf toolkit: PDF 
>> in, PDF + HOCR + misc document metadata out, so all the preprocessing and 
>> postprocessing I hammer on is done by yours truly's custom toolchain. 
>> It's under development, so I'm not working with the diverse Python stuff 
>> most everybody else will dig up after a quick Google search, I'm sure. 
>> Individual projects' requirements differ, so your path will only be 
>> obvious to you.
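>> As one concrete flavor of postprocessing, a sketch that pulls the 
>> recognized words out of an HOCR fragment with the Python stdlib XML 
>> parser; the fragment is invented, though the `ocrx_word` class is what 
>> tesseract's HOCR output uses:

```python
# Sketch: extract recognized words from HOCR (tesseract's HOCR is XHTML;
# each recognized word sits in a <span class='ocrx_word'>). The sample
# fragment below is invented for illustration.
import xml.etree.ElementTree as ET

def hocr_words(hocr):
    root = ET.fromstring(hocr)
    return [el.text for el in root.iter()
            if el.tag.endswith("span") and el.get("class") == "ocrx_word"]

sample = """<div class='ocr_page'>
  <span class='ocr_line'>
    <span class='ocrx_word' title='bbox 10 10 60 40'>Hello</span>
    <span class='ocrx_word' title='bbox 70 10 140 40'>world</span>
  </span>
</div>"""
print(hocr_words(sample))  # -> ['Hello', 'world']
```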
>>
>>
>>
>> *How to troll an OCR engine* 😋
>>
>> Oh, before I forget: some peeps drop shopping bills and such into 
>> off-the-shelf tesseract: *cute *but not anything like a "plain printed 
>> book page" so they encounter all kinds of "surprises":    
>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html  is 
>> important but it doesn't tell you *everything*. "plain printed book 
>> pages" are, by general assumption, pages of text, or, more precisely: 
>> *stories*. Or other tracts with paragraphs of text. Bills, invoices and 
>> other financial stuff are not just "tabulated semi-numeric content" 
>> instead of "paragraphs of text"; those types of inputs also fail, grade 
>> F, the other implicit assumption that comes with human "paragraphs of 
>> text": the latter are series of words, technically each a bunch of 
>> alphabet glyphs (*alpha*numerics), while financials often mix currency 
>> symbols and numeric values. While these were part of tesseract's training 
>> set, I am sure, they are not its focal point, hence have been given less 
>> attention than the words in your language dictionary. And scanning those 
>> SKUs will fare even worse, as they're just jumbled *codes* rather than 
>> *language*. Consequently you'll need to retrain tesseract if your 
>> CONTENT does not suit these mentioned assumptions re "plain printed book 
>> page". I haven't done that yet myself; it's not for the faint of heart, 
>> and since Google did the training for the "official" tesseract language 
>> models everyone downloads and uses, you can bet your bottom dollar that 
>> retraining isn't going to be "nice" for the less well funded either. 
>> Don't expect instant miracles 
>> and expect a long haul when you decide you must go this route [of training 
>> tesseract], or you will meet Captain Disappointment. Y'all have been 
>> warned. 😉
>>
>>
>>
>>
>> *Why your preprocess is more important than kickstarting tesseract by 
>> blowing ether* up its carburetor*
>>
>> *Why is that "plain printed book page is like human stories and similar 
>> tracts: paragraphs of text" mantra so important?* Well, tesseract uses a 
>> lot of technology to get the OCR quality it achieves, including using 
>> language dictionaries. While some smarter people will find switches in 
>> tesseract where *explicit* dictionary usage can be turned off, it cannot 
>> switch off the *implicit* use, due to how the latest and best core 
>> engine, LSTM+CTC (since tesseract v4), actually works: it slowly moves 
>> its gaze across each word it is fed (jargon: an *image segmentation* 
>> preprocess inside tesseract produces these word images), and LSTM is so 
>> good at recognizing text because it has "learned context": that context 
>> being the characters surrounding the one it is gazing at right now. Which 
>> means LSTM can be argued to act akin to a *hidden Markov model* (see 
>> wikipedia) and thus will deliver its predictions based on what 
>> "language" (i.e. *dictionary*) it was fed during training: human text 
>> as used in professional papers and stories. Dutch VAT codes didn't 
>> feature in the training set, as one member of the ML discovered a while 
>> ago. Financial amounts, e.g. "EUR7.95", are also not prominently 
>> featured in the LSTM's training, so you can now guess the amount of 
>> confusion the LSTM will experience when scanning across such a thing: 
>> reading "EUR" has it expect "O" with high confidence, as in "eur" 
>> obviously leading to the word "euro", but what the heck is that "digit 
>> 7" doing there?! That's *highly* unexpected, hence OCR probabilities 
>> drop past decision-making thresholds and you get WTF results, simply 
>> because the engine went WTF *first*.
>> Ditto story/drama for calligraphed signs outside shops, and, *oh! oh!, 
>> license plates*!! (google LPR/ALPR if you want any of that) and *anything 
>> else *that's *not *reams of text and thus you wouldn't expect to find in 
>> a plain story- or textbook.
>> (And for the detail-oriented folks: yes, tesseract had/has a module on 
>> board for recognizing math, but I haven't seen that work very well with my 
>> inputs and not seen a lot of happy noises out there about it either, but 
>> the Google engineer(s) surely must have anticipated OCRing that kind of 
>> stuff alongside paragraphs of text. For us mere mortals, I'd consider 
>> this bit "a historic attempt" and forget about it.)
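>> As an aside on those explicit dictionary switches mentioned above: the 
>> commonly cited config variables are `load_system_dawg` and 
>> `load_freq_dawg` (verify them with `tesseract --print-parameters` on 
>> your build). A sketch turning such overrides into `-c` flags:

```python
# Sketch: build tesseract's -c flags for config-variable overrides, e.g. to
# switch off the explicit dictionaries when OCR-ing codes rather than words.
# Variable names are the commonly cited ones; double-check them against
# `tesseract --print-parameters` output for your build.
def config_flags(overrides):
    flags = []
    for name, value in overrides.items():
        flags += ["-c", f"{name}={value}"]
    return flags

cmd = ["tesseract", "sku.png", "sku", "-l", "eng"] + config_flags(
    {"load_system_dawg": 0, "load_freq_dawg": 0})
print(cmd)
```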
>>
>>
>> *Advice Number 2: *when rendering page images, the ppi (pixels per inch) 
>> resolution to select would be best adjusted to produce regular lines of 
>> text in those images where the capital-height of the text is around 30 
>> pixels. Typography people would rather refer to *x-height*, so that 
>> would be a little lower in pixel height. Line height would be larger, as 
>> that includes stems and interline spacing. However, from an OCR engine 
>> perspective, these (x-height & line-height) are very much dependent on 
>> the font and the page layout used, so they are more variable than the 
>> reported optimal capital-D height at ~32px. As no one measures this 
>> up-front, as an initial guess 300dpi in the render/print-to-image dialog 
>> of your render tool of choice would be a reasonable start, but when you 
>> want more accuracy, tweaking this number can already bring some quality 
>> gains. 
>> Of course, when the source is (low rez) bitmap images already (embedded in 
>> PDF or otherwise), there's little you can do, but then there's still 
>> scaling, sharpening, etc. image preprocessing to try. This advice is driven 
>> by the results published here: 
>> https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ 
>> (and google already quickly produced one other who does something like that 
>> and published a small bit of tooling: 
>> https://gist.github.com/rinogo/294e723ac9e53c23d131e5852312dfe8 )
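>> The ~30 px capital-height rule of thumb can be turned into a starting 
>> DPI when you know (or guess) the body text's point size; the 0.7 
>> cap-height-to-point-size ratio below is my own rule-of-thumb assumption, 
>> not a measurement from the linked thread:

```python
# Sketch: pick a render DPI so capitals land near the reported ~32 px
# optimum. Assumes cap height ≈ 0.7 × the nominal point size (a typographic
# rule of thumb, not from this thread) and 1 point = 1/72 inch.
def render_dpi(point_size, target_cap_px=32, cap_ratio=0.7):
    cap_height_inches = cap_ratio * point_size / 72.0
    return round(target_cap_px / cap_height_inches)

print(render_dpi(10))  # ~10 pt body text -> 329
print(render_dpi(8))   # smaller print needs a higher-rez render -> 411
```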
>>
>>
>> *) the old-fashioned way to see if a rusty engine will still go (or 
>> blow, alas). Replace with "SEO'd blog pages extolling instant success 
>> with ease" to take this into the 21st century.
>>
>>
>>
>> *The mandatory readings list:*
>>
>> - https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
>> - https://tesseract-ocr.github.io/tessdoc/
>>
>>
>>
>>
>> *The above in diagram form (suggested tesseract workflow ;-) )*
>>
>> [image: diagram.png]
>> (diagram PikChr source + SVG attached)
>>
>>
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web:    http://www.hobbelt.com/
>>         http://www.hebbut.net/
>> mail:   g...@hobbelt.com
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>>
>> On Fri, Jan 26, 2024 at 6:11 PM Santhiya C <santhi...@gmail.com> wrote:
>>
>>> Hi guys, I am going to start developing OCR for image and PDF text 
>>> extraction. What are the steps I need to follow? Can you please refer me 
>>> to the best model? I have already used the pytesseract engine, but I did 
>>> not get proper extraction...
>>>
>>> Best Regards,
>>>
>>> Sandhiya
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/a92d17a9-4bcf-4ba0-a81c-71e8e08a4afen%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/a92d17a9-4bcf-4ba0-a81c-71e8e08a4afen%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>

