Great answer. Can you please guide me: if a word of ours is not recognized correctly by Tesseract, how can we insert it into the dictionary? As I read here (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf), section 6 "Linguistic Analysis" describes the use of dictionary words.
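(For reference: the Tesseract training tools support unpacking a traineddata file, extending its word list, and repacking it. Below is a minimal, untested sketch of that route; it assumes combine_tessdata, dawg2wordlist and wordlist2dawg are installed, that eng.traineddata sits in the working directory, and that the unpacked component names follow the usual eng.lstm-* pattern. The added word is a placeholder, and the dictionary only biases recognition towards known words; it does not force a match.)

    import subprocess

    def run(cmd):
        # Run one of the Tesseract training tools and fail loudly on error.
        print('+', ' '.join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Unpack the model into its components (eng.lstm-word-dawg, eng.lstm-unicharset, ...).
    run(['combine_tessdata', '-u', 'eng.traineddata', 'eng.'])

    # 2. Dump the existing LSTM word dawg to a plain word list.
    run(['dawg2wordlist', 'eng.lstm-unicharset', 'eng.lstm-word-dawg', 'eng.wordlist'])

    # 3. Append the words Tesseract keeps misreading (placeholder word shown).
    with open('eng.wordlist', 'a', encoding='utf-8') as f:
        f.write('MyCompanyName\n')

    # 4. Rebuild the dawg and repack eng.traineddata from the eng.* components.
    run(['wordlist2dawg', 'eng.wordlist', 'eng.lstm-word-dawg', 'eng.lstm-unicharset'])
    run(['combine_tessdata', 'eng.'])

A lighter-weight alternative in some setups is the --user-words command-line option, though reports differ on whether the LSTM engine honors it; the repacking route above is the one described in the training documentation.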
On Friday, February 12, 2021 at 10:44:12 PM UTC+5:30 g...@hobbelt.com wrote:

Ah, a misunderstanding there.

OK, the key message of those pages is: you must extract each "table cell" as a *separate* image to help OCR, then, if needed, combine the text results for each of those smaller images to form the text of your page.

That is often referred to as "segmentation".

Tesseract has an algorithm for that built in, AFAICT, but it is geared towards pages of text (reams of text, lines of text) and picking out the individual words in there. That task gets very confused when you feed it a table layout, which has all kinds of edges in the image that are *not* text but table cell *borders*.

So what those links are hinting at is that you need to come up with an image *preprocess* which can handle your type of table. This depends on your particular table layout, as there are many ways to "design / style" a table.

So you will have to write a script which finds and then cuts out each table cell as an image to feed to Tesseract.

When you look for segmentation approaches on the net, Leptonica and OpenCV get mentioned a lot.

Unfortunately, most of the segmentation work you find when googling is about object and facial recognition. Not a problem per se: isn't a table cell an object too? Well, not really, not in the sense they are using it, as those algorithms approach image segmentation from the concept of each object being an area filled with color(s). That would be applicable if the table were styled as cells with an alternating background, for instance, but yours is all white with just some thin black borders.

There are a couple of ideas for that:

1: Conform the image to an (empty) form template, i.e. find a way to make your scanned form overlay near perfectly on a template image. Then you define your areas of interest (box coordinates in the template), clip those parts out, save them as individual files and feed those to Tesseract. This is often done for government application forms: there is a reason you are supposed to only write within the boxes. 😉

That is what the first link alludes to. It is just one idea among many to try.

2: What if you cannot or must not apply idea 1? Can we perhaps detect those table borders through image processing and *then* come up with something that can take that data and help us extract the cell images?

I must say I have not done this myself yet, but some googling uncovered this link (after quickly scanning several false positives in my Google results and several altered search attempts):
https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic

Here are a couple of fellows who have thought "out of the box" (pun intended) and gotten some results by phrasing my question in an entirely different way: instead of wondering how we can detect and extract those table cells, they try to answer the question: what if we are able to *remove* those cell borders visually?
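A minimal OpenCV sketch of that border-detection / border-removal step might look like the following (untested; it assumes the scan is roughly deskewed, the form is white with thin dark borders, and 'form.png' is a placeholder file name):

    import cv2

    # Binarize so that text and borders become white on a black background.
    img = cv2.imread('form.png', cv2.IMREAD_GRAYSCALE)
    bw = cv2.adaptiveThreshold(cv2.bitwise_not(img), 255,
                               cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, -2)

    # Morphological opening with long, thin kernels keeps only long horizontal
    # runs (row borders) and long vertical runs (column borders).
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (img.shape[1] // 30, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, img.shape[0] // 30))
    horizontal = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel)

    # The table skeleton (white lines on black), and the page with borders removed.
    grid = cv2.add(horizontal, vertical)
    text_only = cv2.bitwise_and(bw, cv2.bitwise_not(grid))

    cv2.imwrite('grid.png', grid)
    cv2.imwrite('text_only.png', cv2.bitwise_not(text_only))  # back to black-on-white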
Yes, we will worry later about the text in the cells looking like a haphazard ream of text, and we should expect trouble discerning which bit of recognized text sat in which cell exactly. (Tesseract can output hOCR and other formats that deliver the text plus its placement coordinates; you may have to work on that *afterwards* when you do something like they are doing in that StackOverflow question.)

Looks promising to me. What I would attempt next with their approach is to see if I can make those detected borders extend, and then extract each individual black area (a cell!) as a pixel *mask*, to be applied to my (conformed) page image so that everything is thrown out except the pixels in that cell, giving me one image of one cell's worth of text. Repeat that for each black area. (See the answers at the same link for what I mean: the result image they get is pure black with the table borders (lines) in white.)

*They* tackle the problem similarly but conceptually in a very different way than I am thinking about now: they mask out the detected table borders in one go. That can work very well and is much faster, as they are not extracting sub-images by masking or other means.

Their *potential* trouble will be deciding which bits of text belong together in which cell. That can be done with bounding-box analysis after OCR/Tesseract has done its job. (Again, Google can provide hints; again, it depends on your particular circumstances.)

My (very probable) trouble will be identifying the black cell areas individually: doing a simple flood fill with a color and then extracting everything covered by that color is troublesome, because the table border detection may very well not be perfect and thus cause my simple flood fill to color adjacent cells too. 😢 So, if I had your task, I would be looking at ways to extract, say, each individual *minimum rectangle* that does not contain white pixels (uh-oh, noise removal needed then!), OR a way where each detected line segment is described as a vector and those lines are then extended out across the page to get the rectangles in between: those would then be my cells. That becomes a bother when the table has cells spanning columns or rows, so more research would be needed before I coded that preprocess.

Another issue with the line detection + removal/zoning techniques is making sure the lines are all near perfectly horizontal and vertical (*orienting*/*deskewing* the image will help some there), OR you must come up with an algorithm that can find angled lines (while ignoring the curvy text characters). Again, yet another area of further investigation if I were at it.

The key here is that you will have to do some work on your images before you can call Tesseract and expect success.

HTH.
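Continuing the sketch above, one way to get per-cell images is to invert the detected grid so that each cell interior becomes a white blob, take each blob's bounding rectangle, and OCR each crop separately. Again an untested sketch; it assumes OpenCV 4.x and pytesseract, and reuses the placeholder 'form.png' / 'grid.png' files from the previous snippet:

    import cv2
    import pytesseract

    page = cv2.imread('form.png', cv2.IMREAD_GRAYSCALE)
    grid = cv2.imread('grid.png', cv2.IMREAD_GRAYSCALE)

    # Invert the grid: cell interiors become white blobs separated by black borders.
    cells = cv2.bitwise_not(grid)
    contours, _ = cv2.findContours(cells, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    results = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        # Skip specks and the big blob that covers the page outside the table.
        if w < 30 or h < 15 or w > 0.95 * page.shape[1]:
            continue
        crop = page[y + 2:y + h - 2, x + 2:x + w - 2]   # trim away the border pixels
        text = pytesseract.image_to_string(crop, config='--psm 6').strip()
        results.append(((x, y, w, h), text))

    # Print the cells in rough reading order: top-to-bottom, then left-to-right.
    for (x, y, w, h), text in sorted(results, key=lambda r: (r[0][1], r[0][0])):
        print((x, y, w, h), repr(text))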
On Fri, Feb 12, 2021, 07:56 Kumar Rajwani <kumarraj...@gmail.com> wrote:

Hey, both of those pages answer what to do after some text has been detected, right?
https://i.pinimg.com/564x/bd/a3/d4/bda3d4bf11b0f727db1f9d81faac1b5d.jpg
I have many images of this type where I am not able to detect the date at the top right; the contact name, phone and fax are also not read correctly every time, or are missed in the detection step.

That is the reason I am asking: I have documents in a similar format, so if I train the model on them, will that help it in the detection and recognition parts? I don't know how Tesseract detects the text across the whole form. I have tried thresholding, scaling and sharpening, but these don't give me good results every time.

On Friday, February 12, 2021 at 1:04:29 AM UTC+5:30 g...@hobbelt.com wrote:

Have you read the two pages linked to in the answer from February 5th? Have you executed those procedures, or anything similar, to extract the individual table cell images and feed those to Tesseract?
So far you have not shown images or any results indicating that you have used a tabular recognition and cell extraction process at all (which is a preprocess required by the type of input image you have provided so far, if you want to significantly improve OCR output quality). So, *hey*, what are your results so far following the sage advice of Feb 5? (Quoted below for convenience:)

On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:

See
https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format

On Mon, Feb 8, 2021 at 1:47 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:

Hey, I am still waiting for your reply. Can you please resolve my doubts?

On Sunday, February 7, 2021 at 8:13:56 AM UTC+5:30 Kumar Rajwani wrote:

Hey, can you please tell me how I can improve the text detection for the same kind of images?

On Friday, February 5, 2021 at 8:38:31 PM UTC+5:30 Kumar Rajwani wrote:

Thanks for this. I know how to use Tesseract; I have multiple images whose quality I can't improve, so I want to improve my model to get the text out of them. Are you saying that text detection will not improve through training? I don't have an issue with text recognition; most of the time it is right. Can you tell me how I can improve the model to get more text from the image? I am using --psm 11, where it finds lots of text but some of it is missing.
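One way to see what --psm 11 is and is not picking up, compared with other page segmentation modes, is to dump the word-level boxes and confidences that Tesseract reports (a small diagnostic sketch, assuming pytesseract is installed; 'form.png' is a placeholder file name):

    import pytesseract
    from PIL import Image

    img = Image.open('form.png')

    for psm in (3, 6, 11):
        data = pytesseract.image_to_data(img, config=f'--psm {psm}',
                                         output_type=pytesseract.Output.DICT)
        # Keep only non-empty words; each entry carries its box and confidence.
        words = [(data['text'][i], data['left'][i], data['top'][i], data['conf'][i])
                 for i in range(len(data['text'])) if data['text'][i].strip()]
        print(f'--psm {psm}: {len(words)} words detected')
        for word, left, top, conf in words[:10]:   # first few, just to eyeball
            print(f'  ({left}, {top}) conf={conf}: {word}')

Comparing the counts and box positions across modes quickly shows which regions (for example the date at the top right) are being dropped by layout analysis rather than misrecognized.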
On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:

Training won't fix that.

See
https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format

On Fri, Feb 5, 2021 at 6:14 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:

I have tried a lot of images; it gets about 90% accuracy and always misses one side of the image. That is the reason I want to train the model: if that can improve things even a little bit, it would be great. If you can provide a script or the steps that would help me, that would be good.

On Friday, February 5, 2021 at 5:50:30 PM UTC+5:30 Kumar Rajwani wrote:

The main thing is that I want to learn about training Tesseract at the image level, so can you please tell me how to proceed further? I want to know where the main problem is.

On Friday, February 5, 2021 at 5:46:22 PM UTC+5:30 shree wrote:

I see the tabular image that you shared. I don't think training is going to help you with this. eng.traineddata should be able to recognize it quite well. You should select the different areas of interest and just OCR those sections.

On Fri, Feb 5, 2021 at 5:33 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:

I have tried to do the same thing with Tesseract 4, which gets stuck at the following line:
Compute CTC targets failed!

On Friday, February 5, 2021 at 5:04:42 PM UTC+5:30 Kumar Rajwani wrote:

!tesseract -v
tesseract 5.0.0-alpha-20201231-171-g04173
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

Image example: I have added one image from my training data.

I am using the Colab system, which runs Ubuntu.
https://colab.research.google.com/drive/1_Bn4wbK6dE5zYAuFyC4Eczq_eNU2shuz?usp=sharing
This is my notebook; you can see the complete process in the "finetune 2" section.

On Friday, February 5, 2021 at 4:55:43 PM UTC+5:30 shree wrote:

On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:

> Hi,
> I have tried minus 1 and got the following result:
> Iteration 0: GROUND TRUTH : ) @®
> Iteration 0: BEST OCR TEXT : Yo
> File eng.arial.exp0.lstmf line 0 :
>
>> What's your version of tesseract? What OS?

Without your files, it's difficult to know what's causing the issue.

With -1 debug_interval you should get the info for every iteration.