Ah, a misunderstanding there. Ok, the key message of those pages is: you must extract each "table cell" as a /separate/ image to help OCR, then, if needed, combine the text results of those smaller images to form the text of your page.
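To make that concrete, here is a minimal sketch in Python (OpenCV + pytesseract) of the "one cell, one image, one OCR call" idea. The file name and the cell coordinates are made-up placeholders; in a real pipeline they would come out of the preprocessing discussed below.

```python
# Minimal sketch: OCR each table cell as its own image, then combine the results.
# "scanned_form.png" and the cell boxes are placeholders for illustration only;
# a real pipeline derives the boxes from a template or from border detection.
import cv2
import pytesseract

page = cv2.imread("scanned_form.png", cv2.IMREAD_GRAYSCALE)

# (x, y, width, height) per cell; hypothetical values
cells = {
    "date":    (1200, 40, 300, 60),
    "contact": (200, 300, 400, 60),
    "phone":   (200, 380, 400, 60),
}

results = {}
for name, (x, y, w, h) in cells.items():
    crop = page[y:y + h, x:x + w]
    # --psm 7: treat the crop as a single line of text; pick the psm per cell type
    results[name] = pytesseract.image_to_string(crop, config="--psm 7").strip()

print(results)
```

Getting those cell boxes automatically is the hard part, and that is what the rest of this message is about.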
That's often referred to as "segmentation". Tesseract has an algorithm for that built in AFAICT, but it is geared towards pages of text (reams of text, lines of text) and picking out the individual words in there. That task gets very confused when you feed it a table layout, which has all kinds of edges in the image that are /not/ text but table cell /borders/. So what those links are hinting at is that you need to come up with an image *preprocess* which can handle your type of table. This depends on your particular table layout, as there are many ways to "design / style" a table, so you will have to write a script which finds and then cuts out each table cell as an image to feed to tesseract.

When you look for segmentation approaches on the net, leptonica and opencv get mentioned a lot. Unfortunately most of the segmentation work you find when googling is about object and facial recognition. Not a problem per se; isn't a table cell an object too? Well, not really, not in the sense they use it: those algorithms approach image segmentation from the concept of each object being an area filled with color(s). That would be applicable if the table were styled as cells with alternating background colors, for instance, but yours is all white with just some thin black borders.

There are a couple of ideas for that:

1: Conform the image to an (empty) form template, i.e. find a way to make your scanned form overlay near perfectly on a template image. Then you define your areas of interest (box coordinates in the template), clip those parts out, save them as individual files and feed those to tesseract. This is often done for government application forms: there is a reason you're supposed to only write within the boxes. 😉 That is what the first link alludes to. It's just one idea among many to try.

2: What if you cannot or must not apply idea 1? Can we perhaps detect those table borders through image processing and /then/ come up with something that takes that data and helps us extract the cell images? I must say I haven't done this myself yet, but some googling uncovered this link (after quickly scanning several false positives in my google results and a few altered search attempts): https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic

There are a couple of fellows there who have thought "out of the box" (pun intended) and gotten results by phrasing the question in an entirely different way: instead of wondering how to detect and extract those table cells, they answer the question "what if we can *remove* those cell borders visually?" Yes, we will then have to worry later about the text in the cells looking like a haphazard ream of text, and expect trouble discerning which bit of recognized text sat in which cell exactly (tesseract can output hOCR and other formats which deliver text plus placement coordinates, so you may have to work on that *afterwards* when you do something like they're doing). Looks promising to me.
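To give an impression of what such a border-removal preprocess can look like, here is a rough OpenCV sketch in the spirit of that stackoverflow answer (not copied from it): long, thin morphological kernels respond to the horizontal and vertical rules but not to the characters. The file name, the thresholding choice and the 40-pixel kernel length are assumptions you would have to tune against your own scans.

```python
# Sketch of "detect the table borders, then erase them", in the spirit of the
# Stack Overflow answer linked above. Filename, threshold choice and the
# 40-pixel kernel length are assumptions to tune for your own scans.
import cv2

img = cv2.imread("scanned_form.png", cv2.IMREAD_GRAYSCALE)

# Binarize so that ink (text + borders) is white on black for the morphology.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Long, thin kernels pick up table rules but not curvy characters.
horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
horiz_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horiz_kernel)
vert_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vert_kernel)

# The table grid only: a pure black page with the borders in white.
grid_mask = cv2.bitwise_or(horiz_lines, vert_lines)

# Erase the grid from the binarized page; what remains is the text.
text_only = cv2.bitwise_and(binary, cv2.bitwise_not(grid_mask))

cv2.imwrite("grid_mask.png", grid_mask)
cv2.imwrite("text_only.png", 255 - text_only)  # back to black ink on white
```

Running tesseract on text_only.png with hOCR or TSV output then gives you every recognized word plus its coordinates, which is the data you would need to sort the text back into cells afterwards.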
What I'd attempt next with their approach is to see if I can make those detected borders extend, and then extract each individual black area (cell!) as a pixel *mask* to apply to my (conformed) page image, so everything is thrown away except the pixels in that cell, giving me one image with one cell's worth of text. Repeat that for each black area (see the answers at the stackoverflow link above to see what I mean: the result image he gets is pure black with the table borders (lines) in white).

/They/ tackle the problem similarly but conceptually in a very different way than I am thinking about now: they mask out the detected table borders in one go. That can work very well and is much faster, as they are not extracting sub-images by masking or other means. Their *potential* trouble will be deciding which bits of text belonged together in which cell; that can be done with bounding-box analysis after ocr/tesseract has done its job. (Again, google can provide hints; again, it depends on your particular circumstances.)

My (very probable) trouble will be identifying the black cell areas individually: doing a simple flood fill with a color and then extracting anything covered by that color is troublesome, as the table border detection might very well not be perfect, and my simple flood fill would then bleed into adjacent cells too. 😢

So, if I had your task, I'd be looking at ways to extract, say, each individual *minimum rectangle* which does not contain white pixels (uh-oh, noise removal needed then!), OR perhaps a way where each detected line segment is described as a vector, so those lines can be extended out across the page to get the rectangles in between: those would be my cells. That's a bother when the table has cells spanning columns or rows, so more research would be needed before I'd code that preprocess.

Another issue with the line detection + removal/zoning techniques is making sure the lines are all near perfectly horizontal and vertical (*orienting*/*deskewing* the image helps some there), OR you must come up with an algorithm that can find angled lines (while ignoring the curvy text characters). Again, yet another area of further investigation if I were at it.

The key here is that you'll have to do some work on your images before you can call tesseract and expect success.
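If I were to prototype that "one minimum rectangle per cell" variant, I would probably start from the grid mask produced by the previous sketch and let OpenCV's contour finder hand me the enclosed regions. This is a naive sketch only, with all the caveats just mentioned (broken borders, spanning cells, noise) still unsolved, and the size thresholds are guesses:

```python
# Naive sketch of the "one rectangle per cell" idea: close gaps in the detected
# grid, invert it so each cell becomes a white blob, then take each blob's
# bounding box. Thresholds are guesses; broken borders or spanning cells will
# still trip this up, exactly as described above.
import cv2

grid_mask = cv2.imread("grid_mask.png", cv2.IMREAD_GRAYSCALE)  # from the previous sketch
page = cv2.imread("scanned_form.png", cv2.IMREAD_GRAYSCALE)
page_h, page_w = page.shape

# Thicken/close the detected borders a little so cells become closed regions.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
closed = cv2.morphologyEx(grid_mask, cv2.MORPH_CLOSE, kernel)

# Invert: cells (and the area outside the table) turn into white blobs.
blobs = cv2.bitwise_not(closed)

# OpenCV 4.x: findContours returns (contours, hierarchy).
contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

boxes = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h < 2000:                    # drop noise specks
        continue
    if w * h > 0.9 * page_w * page_h:   # drop the background region outside the table
        continue
    boxes.append((x, y, w, h))

# Sort roughly top-to-bottom, left-to-right; each box can then be cropped out
# of `page` and OCR'd separately, as in the first sketch above.
boxes.sort(key=lambda b: (b[1] // 20, b[0]))
print(boxes)
```

Whether this beats the plain border-removal route depends entirely on how clean the detected grid is, which is exactly the caveat above.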
HTH.

On Fri, Feb 12, 2021, 07:56 Kumar Rajwani <kumarrajwani1...@gmail.com> wrote:

> Hey, both of the pages answer on is steps after some text detected right?
> https://i.pinimg.com/564x/bd/a3/d4/bda3d4bf11b0f727db1f9d81faac1b5d.jpg
> i have all this type of images where I am not able to detect date at the
> top right. also contact name, phone, fax this is not correctly read every
> time or missed in detection part.
> That's the reason i am asking i have a similar format of the document so
> if i trained the model on that it will help the model in the detection and
> recognition part?
> I don't know how tesseract detecting the text from the whole form.
> i have tried thresholding, scaling, sharpening but this can't give me
> results all time.
>
> On Friday, February 12, 2021 at 1:04:29 AM UTC+5:30 g...@hobbelt.com wrote:
>
>> Have you read the two pages linked to in the answer from February 5th?
>> Have you executed those procedures, or anything similar, to extract the
>> individual table cell images, to feed those to tesseract?
>> So far you have not shown images or any results that show you have used a
>> tabular recognition and cell extraction process at all (which is a
>> preprocess required by the type of input image you have provided so far if
>> you want to significantly improve OCR output quality), so, *hey*, what are
>> your results so far following the sage advice (Feb 5)?
>> (quoted below for convenience:)
>>
>> On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:
>>
>>> See
>>> https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
>>>
>>> https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format
>>>
>> On Mon, Feb 8, 2021 at 1:47 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:
>>
>>> hey, i am still waiting for your reply. can you please solve my doubts.
>>>
>>> On Sunday, February 7, 2021 at 8:13:56 AM UTC+5:30 Kumar Rajwani wrote:
>>>
>>>> hey can you please tell me how can i improve the text detection for the
>>>> same kind of images?
>>>>
>>>> On Friday, February 5, 2021 at 8:38:31 PM UTC+5:30 Kumar Rajwani wrote:
>>>>
>>>>> Thanks for this. i know about the usage of the tesseract. i have
>>>>> multiple images where i can't improve image quality so i want to
>>>>> improve my model to get text from it.
>>>>> are you saying that text detection will not improve by training?
>>>>> Because i don't have an issue with text recognition most of time it right.
>>>>> can you tell me how can i improve the model to get more text from the
>>>>> image? I am using psm 11 where it find lot's of text but some are missing.
>>>>>
>>>>> On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:
>>>>>
>>>>>> Training won't fix that.
>>>>>>
>>>>>> See
>>>>>> https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
>>>>>>
>>>>>> https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format
>>>>>>
>>>>>> On Fri, Feb 5, 2021 at 6:14 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:
>>>>>>
>>>>>>> i have tried a lot of images where it getting 90% accuracy and
>>>>>>> missing always one side of image. that's the reason i want to train
>>>>>>> model if it can improve a little a bit it would be great.
>>>>>>> if you can provide a script or steps that can help me it would be
>>>>>>> good for me.
>>>>>>>
>>>>>>> On Friday, February 5, 2021 at 5:50:30 PM UTC+5:30 Kumar Rajwani wrote:
>>>>>>>
>>>>>>>> main thing is i want to learn about training tesseract on image
>>>>>>>> level so can you please tell me how can i procced further. i want
>>>>>>>> to know where is the main problem.
>>>>>>>>
>>>>>>>> On Friday, February 5, 2021 at 5:46:22 PM UTC+5:30 shree wrote:
>>>>>>>>
>>>>>>>>> I see the tabular image that you shared. I don't think training
>>>>>>>>> is going to help you in this. eng.traineddata should be able to
>>>>>>>>> recognize it quite well. You should select the different areas of
>>>>>>>>> interest and just OCR those sections.
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2021 at 5:33 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> i have tried to do same thing in tesseract 4 which stuck at
>>>>>>>>>> following line.
>>>>>>>>>> Compute CTC targets failed!
>>>>>>>>>>
>>>>>>>>>> On Friday, February 5, 2021 at 5:04:42 PM UTC+5:30 Kumar Rajwani wrote:
>>>>>>>>>>
>>>>>>>>>>> !tesseract -v
>>>>>>>>>>> tesseract 5.0.0-alpha-20201231-171-g04173
>>>>>>>>>>>  leptonica-1.78.0
>>>>>>>>>>>  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 :
>>>>>>>>>>>  libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>>>>>>>>> Found AVX2
>>>>>>>>>>> Found AVX
>>>>>>>>>>> Found FMA
>>>>>>>>>>> Found SSE
>>>>>>>>>>> Found OpenMP 201511
>>>>>>>>>>> Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
>>>>>>>>>>>
>>>>>>>>>>> image example
>>>>>>>>>>> i have added one image from my training data.
>>>>>>>>>>>
>>>>>>>>>>> i am using the colab system which have ubuntu os.
>>>>>>>>>>>
>>>>>>>>>>> https://colab.research.google.com/drive/1_Bn4wbK6dE5zYAuFyC4Eczq_eNU2shuz?usp=sharing
>>>>>>>>>>> this is my notebook you can see complete process in finetune 2 section.
>>>>>>>>>>>
>>>>>>>>>>> On Friday, February 5, 2021 at 4:55:43 PM UTC+5:30 shree wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani <kumarraj...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> hi,
>>>>>>>>>>>>> i have tried minus 1 and got following result
>>>>>>>>>>>>> Iteration 0: GROUND TRUTH : ) @®
>>>>>>>>>>>>> Iteration 0: BEST OCR TEXT : Yo
>>>>>>>>>>>>> File eng.arial.exp0.lstmf line 0 :
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's your version of tesseract? What o/s?
>>>>>>>>>>>>
>>>>>>>>>>>> Without your files, it's difficult to know what's causing the issue.
>>>>>>>>>>>>
>>>>>>>>>>>> with -1 debug_interval you should get the info for every iteration.