Ah, a misunderstanding there.

Ok, the key message of those pages is: you must extract each "table cell"
as a /separate/ image to help OCR, then, if needed, combine the text
results for each of those smaller images to form the text of your page.

That's often referred to as "segmentation".

Tesseract has an algorithm for that built in AFAICT, but it is geared
towards pages of text (reams of text, lines of text) and picking out the
individual words in there. That algorithm gets very confused when you feed
it a table layout, which has all kinds of edges in the image that are
/not/ text, but table cell /borders/.

So what those links are hinting at is that you need to come up with an
image *preprocess* which can handle your type of table. This depends on
your particular table layout, as there are many ways to "design / style" a
table.

So you will have to write a script which finds and then cuts out each
table cell as an image to feed to tesseract.

When you look for segmentation approaches on the net, leptonica and opencv
get mentioned a lot.

Unfortunately, most of the segmentation work you find when googling is
about object and facial recognition. Not a problem per se; isn't a table
cell an object too? Well, not really, not in the sense they use the term:
those algorithms approach image segmentation from the concept of each
object being an area filled with color(s). That would be applicable if the
table were styled as cells with an alternating background, for instance,
but yours is all white with just some thin black borders.

Here are a couple of ideas for that:

1: conform the image to an (empty) form template, i.e. find a way to make
your scanned form overlay near-perfectly on a template image. Then you
define your areas of interest (box coordinates in the template), clip
those parts out, save them as individual files and feed those to tesseract.
This is often done for government application forms: there is a reason
you're supposed to only write within the boxes. 😉

That is what that first link alludes to. It's just one idea among many to
try.

2: what if you cannot or must not apply idea 1? Can we perhaps detect those
table borders through image processing and /then/ come up with something
that can take that data and help us extract the cell images?

I must say I haven't done this myself yet, but some googling uncovered this
link (after quickly scanning several false positives in my google results
and several altered search attempts):
https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic

Here are a couple of fellows who have thought "out of the box" (pun
intended) and gotten some results by phrasing my question in an entirely
different way: instead of wondering how we can detect and extract those
table cells, they try to answer the question: "what if we are able to
*remove* those cell borders visually?" Yes, we will worry later about the
texts in the cells looking like a haphazard ream of text, and expect
trouble discerning which bit of recognized text was in what cell exactly
(tesseract can output hOCR and other formats which deliver text plus
coordinates of placement; you may have to work on that *afterwards* when
you do something like they're doing).

Looks promising to me. What I'd attempt next with their approach is to see
if I can make those detected borders extend, and then extract each
individual black area (cell!) as a pixel *mask*, to be applied to my
(conformed) page image so everything is thrown out except the pixels in
that cell, thus giving me one image of one cell's worth of text. Repeat
that for each black area. (See the answers at that link for what I mean:
the result image he gets, which is pure black with the table borders
(lines) in white.)

/They/ tackle the problem similarly but think about it in a conceptually
very different way than I am now: they go and mask out the detected table
borders in one go.
That can work very well and is much faster, as they are not extracting
sub-images by masking or other means.

Their *potential* trouble will be deciding which bits of text belong
together in which cell. That can be done with bbox analysis after
ocr/tesseract has done its job. (Again, google can provide hints; again,
it depends on your particular circumstances.)
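That bbox analysis could be as simple as assigning each recognized word to the cell rectangle containing its center point. The word boxes would come from tesseract's TSV or hOCR output; the function names and box format here are just illustrative.

```python
# Sketch: assign OCR'ed words to table cells by bounding-box containment.

def cell_of(word_box, cells):
    """Return the index of the cell rectangle containing the word's center,
    or None if it falls outside every cell."""
    x, y, w, h = word_box
    cx, cy = x + w / 2, y + h / 2
    for i, (cx0, cy0, cw, ch) in enumerate(cells):
        if cx0 <= cx < cx0 + cw and cy0 <= cy < cy0 + ch:
            return i
    return None

def group_words(words, cells):
    """words: list of (text, (x, y, w, h)) as tesseract's TSV would give.
    cells: list of (x, y, w, h) cell rectangles.
    Returns {cell_index: joined text}."""
    grouped = {}
    for text, box in words:
        i = cell_of(box, cells)
        if i is not None:
            grouped.setdefault(i, []).append(text)
    return {i: " ".join(ts) for i, ts in grouped.items()}
```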

My (very probable) trouble will be identifying the black cell areas
individually: doing a simple flood fill with a color and then extracting
anything covered by that color is troublesome, as the table border
detection might very well not be perfect and thus cause my simple flood
fill to color adjacent cells too. 😢 So, if I had your task, I'd be looking
at ways to extract, say, each individual *minimum rectangle* which does not
contain white pixels (uh-oh, need noise removal then!) OR perhaps a way
where each detected line segment is described as a vector and those lines
are then extended out across the page to get my rectangles in between:
those would be my cells then. That's a bother when the table has cells
spanning columns or rows, so more research is needed before I'd code that
preprocess.

Another issue with the line detection + removal/zoning techniques is
making sure the lines are all near-perfectly horizontal and vertical
(*orienting*/*deskewing* the image will help some there) OR you must come
up with an algorithm that's able to find angled lines (while ignoring the
curvy text characters). Again, yet another area of further investigation
if I were at it.

The key here is that you'll have to do some work on your images before you
can call tesseract and expect success.

HTH.

On Fri, Feb 12, 2021, 07:56 Kumar Rajwani <kumarrajwani1...@gmail.com>
wrote:

> Hey, both of the pages answer on is steps after some text detected right?
> https://i.pinimg.com/564x/bd/a3/d4/bda3d4bf11b0f727db1f9d81faac1b5d.jpg
> i have all this type of images where I am not able to detect date at the
> top right. also contact name, phone, fax this is not correctly read every
> time or missed in detection part.
> That's the reason i am asking i have a similar format of the document so
> if i trained the model on that it will help the model in the detection and
> recognition part?
> I don't know how tesseract detecting the text from the whole form.
> i have tried thresholding, scaling, sharpening but this can't give me
> results all time.
>
> On Friday, February 12, 2021 at 1:04:29 AM UTC+5:30 g...@hobbelt.com
> wrote:
>
>> Have you read the two pages linked to in the answer from february 5th?
>> Have you executed those procedures, or anything similar, to extract the
>> individual table cell images, to feed those to tesseract?
>> So far you have not shown images or any results that show you have used a
>> tabular recognition and cell extraction process at all (which is a
>> preprocess required by the type of input image you have provided so far if
>> you want to significantly improve OCR output quality), so, *hey*, what are
>> your results so far following the sage advice (Feb 5)?
>> (quoted below for convenience:)
>>
>>
>> On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:
>>
>>>
>>> See
>>> https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
>>>
>>>
>>> https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format
>>>
>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> рднрдЬрди - рдХреАрд░реНрддрди - рдЖрд░рддреА @ http://bhajans.ramparivar.com
>>>
>> --
>>
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web:    http://www.hobbelt.com/
>>         http://www.hebbut.net/
>> mail:   g...@hobbelt.com
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>>
>> On Mon, Feb 8, 2021 at 1:47 PM Kumar Rajwani <kumarraj...@gmail.com>
>> wrote:
>>
>>> hey, i am still waiting for your reply. can  you please solve my doubts.
>>> On Sunday, February 7, 2021 at 8:13:56 AM UTC+5:30 Kumar Rajwani wrote:
>>>
>>>> hey can you please tell me how can i improve the text detection for the
>>>> same kind of images?
>>>>
>>>> On Friday, February 5, 2021 at 8:38:31 PM UTC+5:30 Kumar Rajwani wrote:
>>>>
>>>>> Thanks for this. i know about the usage of the tesseract. i have
>>>>> multiple images where i can't improve image quality so i want to improve 
>>>>> my
>>>>> model to get text from it.
>>>>> are you saying that text detection will not improve by training?
>>>>> Because i don't have an issue with text recognition most of time it
>>>>> right.
>>>>> can you tell me how can i improve the model to get more text from the
>>>>> image? I am using psm 11 where it find lot's of text but some are missing.
>>>>>
>>>>>
>>>>> On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:
>>>>>
>>>>>> Training won't fix that.
>>>>>>
>>>>>> See
>>>>>> https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
>>>>>>
>>>>>>
>>>>>> https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format
>>>>>>
>>>>>> On Fri, Feb 5, 2021 at 6:14 PM Kumar Rajwani <kumarraj...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> i have tried a lot of images where it getting 90% accuracy and
>>>>>>> missing always one side of image. that's the reason i want to train 
>>>>>>> model
>>>>>>> if it can improve a little a bit it would be great.
>>>>>>> if you can provide a script or steps that can help me it would be
>>>>>>> good for me.
>>>>>>>
>>>>>>> On Friday, February 5, 2021 at 5:50:30 PM UTC+5:30 Kumar Rajwani
>>>>>>> wrote:
>>>>>>>
>>>>>>>> main thing is i want to learn about training tesseract on image
>>>>>>>> level so can you please tell me  how can i procced further. i want to 
>>>>>>>> know
>>>>>>>> where is the main problem.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Friday, February 5, 2021 at 5:46:22 PM UTC+5:30 shree wrote:
>>>>>>>>
>>>>>>>>> I see the tabular image that you shared.  I don't think training
>>>>>>>>> is going to help you in this. eng.traineddata should be able to 
>>>>>>>>> recognize
>>>>>>>>> it quite well. You should select the different areas of interest and 
>>>>>>>>> just
>>>>>>>>> OCR those sections.
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2021 at 5:33 PM Kumar Rajwani <
>>>>>>>>> kumarraj...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> i have tried to do same thing in tesseract 4 which stuck at
>>>>>>>>>> following line.
>>>>>>>>>> Compute CTC targets failed!
>>>>>>>>>>
>>>>>>>>>> On Friday, February 5, 2021 at 5:04:42 PM UTC+5:30 Kumar Rajwani
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> !tesseract -v
>>>>>>>>>>> tesseract 5.0.0-alpha-20201231-171-g04173
>>>>>>>>>>>  leptonica-1.78.0
>>>>>>>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng
>>>>>>>>>>> 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 
>>>>>>>>>>> 2.3.0
>>>>>>>>>>>  Found AVX2
>>>>>>>>>>>  Found AVX
>>>>>>>>>>>  Found FMA
>>>>>>>>>>>  Found SSE
>>>>>>>>>>>  Found OpenMP 201511
>>>>>>>>>>>  Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6
>>>>>>>>>>> liblz4/1.7.1
>>>>>>>>>>>
>>>>>>>>>>> image example
>>>>>>>>>>> i have added one image from my training data.
>>>>>>>>>>>
>>>>>>>>>>> i am using the colab system which have ubuntu os.
>>>>>>>>>>>
>>>>>>>>>>> https://colab.research.google.com/drive/1_Bn4wbK6dE5zYAuFyC4Eczq_eNU2shuz?usp=sharing
>>>>>>>>>>> this is my notebook you can see complete process in finetune 2
>>>>>>>>>>> section.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Friday, February 5, 2021 at 4:55:43 PM UTC+5:30 shree wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani <
>>>>>>>>>>>> kumarraj...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> hi,
>>>>>>>>>>>>
>>>>>>>>>>>> i have tried minus 1 and got following result
>>>>>>>>>>>>> Iteration 0: GROUND  TRUTH : ) @┬о
>>>>>>>>>>>>> Iteration 0: BEST OCR TEXT : Yo
>>>>>>>>>>>>> File eng.arial.exp0.lstmf line 0 :
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> What's your version of tesseract? What o/s?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Without your files, it's difficult to know what's causing the
>>>>>>>>>>>> issue.
>>>>>>>>>>>>
>>>>>>>>>>>> with -1 debug_interval you should get the info for every
>>>>>>>>>>>> iteration.
>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAFP60fp8BPVTVCOROCs8MwEpmxaRcaCQ9GQku7zmsJu3odP29A%40mail.gmail.com.
