Great answer.
Can you please guide me: if one of our words is not recognized correctly by 
tesseract, how can we insert it into the dictionary?
As I read 
here 
(https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf),
section 6, "Linguistic Analysis", mentions a dictionary of words.

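In the meantime, a minimal sketch of one common approach is a user-words file. The file name and words below are purely illustrative, and note the caveat: the legacy engine (`--oem 0`) honors `--user-words`, while the LSTM engine largely ignores user dictionaries.

```shell
# Put one extra word per line in a plain-text file (names are illustrative).
printf 'Rajwani\nSilvassa\n' > my.user-words

# The legacy engine can then be pointed at it, e.g.:
#   tesseract input.png out --oem 0 --user-words my.user-words
# For the LSTM engine, extra words generally have to be added by rebuilding
# the wordlist inside the .traineddata file or by fine-tuning.
```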
On Friday, February 12, 2021 at 10:44:12 PM UTC+5:30 g...@hobbelt.com wrote:

> Ah, a misunderstanding there.
>
> Ok, the key message of those pages is: you must extract each "table cell" 
> as a /separate/ image to help OCR, then, if needed, combine the text 
> results for each of those smaller images to form the text of your page. 
>
> That's often referred to as "segmentation".
>
> Tesseract has an algorithm for that built in AFAICT, but that is geared 
> towards pages of text (reams of texts, lines of text) and picking out the 
> individual words in there. That task gets very confused when you feed it a 
> table layout, which has all kinds of edges in the images that are /not/ 
> text, but table cell /borders/.
>
> So what those links are hinting at is that you need to come up with an 
> image *preprocess* which can handle your type of table. This depends on 
> your particular table layout, as there are many ways to "design / style" a 
> table. 
>
> So you will have to write some script which will find and then cut out 
> each table cell as an image to feed tesseract. 
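> As a rough sketch of that clipping step (pure numpy, toy data; the field 
> coordinates are made up and would have to be measured on your own form):

```python
import numpy as np

# Stand-in for a scanned page; in practice load it as a grayscale array,
# e.g. with cv2.imread(path, cv2.IMREAD_GRAYSCALE) or PIL.
page = np.full((200, 300), 255, dtype=np.uint8)

# Hypothetical per-field template coordinates: (name, x, y, width, height).
fields = [("date", 220, 10, 70, 20), ("phone", 10, 40, 120, 20)]

crops = {}
for name, x, y, w, h in fields:
    crops[name] = page[y:y + h, x:x + w]  # clip one table cell / field
    # each crop would then be saved (or passed in memory) to tesseract,
    # e.g. pytesseract.image_to_string(crops[name], config="--psm 7")
```

> --psm 7 ("treat the image as a single text line") is often a reasonable 
> mode for one clipped cell.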
>
> When you look for segmentation approaches on the net, leptonica and opencv 
> get mentioned a lot. 
>
> Unfortunately most segmentation work when googling for it is about object 
> and facial recognition. Not a problem per se; isn't a table cell an object 
> too? Well, not really, not in the sense they're using it, as those 
> algorithms approach image segmentation from the concept of each object 
> being an area filled with color(s). This would be applicable if the table 
> were styled as cells with an alternating background, for instance, but 
> yours is all white with just some thin black borders. 
>
> There are a couple of ideas for that:
>
> 1: conform the image to an (empty) form template, i.e. seek a way to make 
> your scanned form overlay near perfectly on a template image. Then you have 
> to define your areas of interest (box coordinates in the template) and clip 
> those parts out, save them as individual files and feed those to tesseract. 
> This is often done for government application forms: there is a reason 
> you're supposed to only write within the boxes. 😉
>
> That is what that first link alludes to. It's just one idea among many to 
> try. 
>
> 2: what if you cannot or must not apply idea 1? Can we perhaps detect 
> those table borders through image processing and /then/ come up with 
> something that can take that data and help us extract the cell images? 
>
> I must say I haven't done this myself yet, but some googling uncovered 
> this link (after having quickly scanned several false positives in my 
> google results and several altered search attempts):
> https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic
>
> Here are a couple of fellows who have thought "out of the box" (pun 
> intended) and gotten some results by phrasing my question in an entirely 
> different way: instead of wondering how we can detect and extract those 
> table cells, they try to answer the question: "what if we are able to 
> *remove* those cell borders visually?" Yes, we will worry later about the 
> texts in the cells looking like a haphazard ream of text, and expect 
> trouble discerning which bit of recognized text was in which cell exactly 
> (tesseract can output hOCR + other formats which deliver text plus 
> placement coordinates) - you may have to work on that *afterwards* when 
> you do something like they're doing:  
>
> https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic
>
> Looks promising to me. What I'd attempt next with their approach is to see 
> if I can extend those detected borders and then extract each individual 
> black area (cell!) as a pixel *mask*, to be applied to my (conformed) page 
> image so everything is thrown out except the pixels in that cell, thus 
> giving me one image of one cell's worth of text. Repeat that for each 
> black area (see the answers at that same Stack Overflow link to see what I 
> mean: the result image he gets is pure black with the table borders 
> (lines) in white.)
>
> /They/ tackle the problem similarly but conceptually in a very different 
> way than I am thinking about now: they go and mask out the detected table 
> borders in one go.
> That can work very well and is much faster as they are not extracting 
> subimages by masking or other means. 
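> A toy illustration of that border-masking idea (pure numpy; real pages 
> need proper morphology, e.g. cv2.getStructuringElement + cv2.morphologyEx, 
> but the principle is the same):

```python
import numpy as np

# Toy binary page: 0 = ink, 255 = paper; two horizontal rules plus a short
# blob of "text" between them.
img = np.full((40, 60), 255, dtype=np.uint8)
img[10, :] = 0           # horizontal table border
img[30, :] = 0           # another border
img[20, 25:30] = 0       # short ink run standing in for text

# A row is a border candidate when almost all of it is ink; text rows are
# much sparser, so they survive the mask.
row_fill = (img < 128).mean(axis=1)
border_rows = np.where(row_fill > 0.9)[0].tolist()

cleaned = img.copy()
cleaned[border_rows, :] = 255   # mask out the detected lines in one go
```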
>
> Their *potential* trouble will be deciding which bit of text was together 
> in which cell. That can be done in bbox analysis after ocr/tesseract has 
> done its job. (again, google can provide hints. Again, it depends on your 
> particular circumstances)
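> That after-the-fact bbox analysis can be as simple as assigning each 
> recognized word box to the cell rectangle containing its centre. Toy data 
> below; in practice the boxes would come from tesseract's TSV or hOCR 
> output (e.g. pytesseract.image_to_data):

```python
# Hypothetical word boxes as (text, left, top, width, height).
words = [("Date:", 210, 12, 40, 10), ("12/02/2021", 255, 12, 60, 10),
         ("Phone", 15, 45, 45, 10)]

# Assumed cell rectangles (x0, y0, x1, y1) from a border-detection step.
cells = {"date": (200, 0, 330, 30), "contact": (0, 35, 150, 60)}

def cell_of(box, cells):
    """Return the name of the cell whose rectangle contains the box centre."""
    cx = box[1] + box[3] / 2
    cy = box[2] + box[4] / 2
    for name, (x0, y0, x1, y1) in cells.items():
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return name
    return None

grouped = {}
for w in words:
    grouped.setdefault(cell_of(w, cells), []).append(w[0])
```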
>
> My (very probable) trouble will be identifying the black cell areas 
> singularly: doing a simple flood fill with a color, then extracting 
> anything covered by that color, is troublesome as the table border 
> detection might very well not be perfect and thus cause my simple flood 
> fill to color adjacent cells too. 😢 So, if I had your task, I'd be looking 
> at ways to extract, say, each individual *minimum rectangle* which does 
> not contain white pixels (uh-oh, need noise removal then!) OR perhaps a 
> way where each detected line segment is described as a vector and those 
> lines then extended out across the page to get my rectangles in between: 
> those would be my cells then. That's a bother when the table has cells 
> spanning columns or rows. So more research is needed before I'd code that 
> preprocess.
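> For the rectangles-in-between idea, one way to sketch it is a flood fill 
> that labels each enclosed white region and takes its bounding box as a 
> cell (toy 2x2 table, pure Python/numpy; a real page would first need the 
> noise removal mentioned above):

```python
import numpy as np

# Toy page: 0 = ink (borders), 255 = background; a 2x2 table drawn with
# full-length horizontal and vertical rules.
img = np.full((9, 9), 255, dtype=np.uint8)
img[[0, 4, 8], :] = 0
img[:, [0, 4, 8]] = 0

h, w = img.shape
labels = np.zeros(img.shape, dtype=int)
next_label = 0
for sy, sx in zip(*np.nonzero(img == 255)):
    if labels[sy, sx]:
        continue
    next_label += 1
    stack = [(sy, sx)]          # iterative 4-connected flood fill
    while stack:
        y, x = stack.pop()
        if not (0 <= y < h and 0 <= x < w) or labels[y, x] or img[y, x] != 255:
            continue
        labels[y, x] = next_label
        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]

# Each label's bounding box is one candidate cell rectangle (x0, y0, x1, y1).
boxes = [(int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
         for ys, xs in (np.nonzero(labels == k)
                        for k in range(1, next_label + 1))]
```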
>
> Another issue with the line detection + removal/zoning techniques would be 
> making sure the lines are all either near-perfectly horizontal or vertical 
> (*orienting*/*deskewing* the image will help some there) OR you must come 
> up with an algo that's able to find angled lines (while it should ignore 
> the curvy text characters). Again, yet another area of further 
> investigation if I were at it.
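> Estimating the skew angle before that line detection can be sketched by 
> fitting a straight line through the ink pixels (toy data, pure numpy; on 
> real pages you would fit per text line, or use the Hough transform or 
> cv2.minAreaRect instead):

```python
import numpy as np

# Toy page with one slightly slanted "line" of ink: it drops roughly one
# row every 40 columns, i.e. a skew of about 1.4 degrees.
img = np.full((50, 200), 255, dtype=np.uint8)
for x in range(200):
    img[25 + x // 40, x] = 0

ys, xs = np.nonzero(img < 128)
slope, _ = np.polyfit(xs, ys, 1)           # rise per column, least squares
angle_deg = float(np.degrees(np.arctan(slope)))
# Rotating the page by -angle_deg (PIL Image.rotate, cv2.warpAffine, ...)
# would then make the table borders near-horizontal/vertical.
```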
>
> The key here is that you'll have to do some work on your images before you 
> can call tesseract and expect success. 
>
> HTH. 
>
> On Fri, Feb 12, 2021, 07:56 Kumar Rajwani <kumarraj...@gmail.com> wrote:
>
>> Hey, both of those pages answer with steps that apply after some text has 
>> been detected, right?
>> https://i.pinimg.com/564x/bd/a3/d4/bda3d4bf11b0f727db1f9d81faac1b5d.jpg
>> I have a lot of images of this type where I am not able to detect the 
>> date at the top right; also the contact name, phone, and fax are not read 
>> correctly every time, or are missed in the detection part.
>> That's the reason I am asking: I have documents in a similar format, so 
>> if I trained the model on them, would that help the model in the 
>> detection and recognition parts?
>> I don't know how tesseract detects the text from the whole form.
>> I have tried thresholding, scaling, and sharpening, but these don't give 
>> me results every time.
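>> For what it's worth, the thresholding step mentioned above can be 
>> sketched like this (toy data; a fixed cutoff is the simplest variant, and 
>> adaptive or Otsu thresholding usually works better on unevenly lit scans):

```python
import numpy as np

# Toy grayscale crop: dark "text" pixels scattered on a lighter background.
img = np.array([[200, 210, 60, 220],
                [40, 215, 225, 50],
                [205, 190, 45, 210]], dtype=np.uint8)

# Global threshold: pixels darker than the cutoff become pure black, the
# rest pure white, giving tesseract a clean bilevel image.
threshold = 128
binary = np.where(img < threshold, 0, 255).astype(np.uint8)
```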
>>
>> On Friday, February 12, 2021 at 1:04:29 AM UTC+5:30 g...@hobbelt.com 
>> wrote:
>>
>>> Have you read the two pages linked to in the answer from february 5th?
>>> Have you executed those procedures, or anything similar, to extract the 
>>> individual table cell images, to feed those to tesseract?
>>> So far you have not shown images or any results that show you have used 
>>> a tabular recognition and cell extraction process at all (which is a 
>>> preprocess required by the type of input image you have provided so far if 
>>> you want to significantly improve OCR output quality), so, *hey*, what are 
>>> your results so far following the sage advice (Feb 5)?
>>> (quoted below for convenience:)
>>>
>>>
>>> On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:
>>>
>>>>
>>>> See 
>>>> https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
>>>>
>>>>
>>>> https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ____________________________________________________________
>>>> Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
>>>>
>>> --  
>>>
>>>
>>> Met vriendelijke groeten / Best regards,
>>>
>>> Ger Hobbelt
>>>
>>> --------------------------------------------------
>>> web:    http://www.hobbelt.com/
>>>         http://www.hebbut.net/
>>> mail:   g...@hobbelt.com
>>> mobile: +31-6-11 120 978
>>> --------------------------------------------------
>>>
>>>
>>> On Mon, Feb 8, 2021 at 1:47 PM Kumar Rajwani <kumarraj...@gmail.com> 
>>> wrote:
>>>
>>>> Hey, I am still waiting for your reply. Can you please solve my 
>>>> doubts? 
>>>> On Sunday, February 7, 2021 at 8:13:56 AM UTC+5:30 Kumar Rajwani wrote:
>>>>
>>>>> Hey, can you please tell me how I can improve the text detection for 
>>>>> the same kind of images?
>>>>>
>>>>> On Friday, February 5, 2021 at 8:38:31 PM UTC+5:30 Kumar Rajwani wrote:
>>>>>
>>>>>> Thanks for this. I know about the usage of tesseract. I have 
>>>>>> multiple images where I can't improve the image quality, so I want 
>>>>>> to improve my model to get text from them.
>>>>>> Are you saying that text detection will not improve with training?
>>>>>> Because I don't have an issue with text recognition; most of the 
>>>>>> time it is right.
>>>>>> Can you tell me how I can improve the model to get more text from 
>>>>>> the image? I am using psm 11, where it finds lots of text but some 
>>>>>> is missing.
>>>>>>
>>>>>>
>>>>>> On Friday, February 5, 2021 at 7:53:26 PM UTC+5:30 shree wrote:
>>>>>>
>>>>>>> Training won't fix that.
>>>>>>>
>>>>>>> See 
>>>>>>> https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/
>>>>>>>
>>>>>>>
>>>>>>> https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format
>>>>>>>
>>>>>>> On Fri, Feb 5, 2021 at 6:14 PM Kumar Rajwani <kumarraj...@gmail.com> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have tried a lot of images where it gets 90% accuracy and always 
>>>>>>>> misses one side of the image. That's the reason I want to train a 
>>>>>>>> model; if it can improve even a little bit, that would be great.
>>>>>>>> If you can provide a script or steps that can help me, that would 
>>>>>>>> be good for me.
>>>>>>>>
>>>>>>>> On Friday, February 5, 2021 at 5:50:30 PM UTC+5:30 Kumar Rajwani 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The main thing is I want to learn about training tesseract at the 
>>>>>>>>> image level, so can you please tell me how I can proceed further? 
>>>>>>>>> I want to know where the main problem is.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Friday, February 5, 2021 at 5:46:22 PM UTC+5:30 shree wrote:
>>>>>>>>>
>>>>>>>>>> I see the tabular image that you shared.  I don't think training 
>>>>>>>>>> is going to help you in this. eng.traineddata should be able to 
>>>>>>>>>> recognize 
>>>>>>>>>> it quite well. You should select the different areas of interest and 
>>>>>>>>>> just 
>>>>>>>>>> OCR those sections.
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 5, 2021 at 5:33 PM Kumar Rajwani <
>>>>>>>>>> kumarraj...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have tried to do the same thing in Tesseract 4, which got 
>>>>>>>>>>> stuck at the following line:
>>>>>>>>>>> Compute CTC targets failed!
>>>>>>>>>>>
>>>>>>>>>>> On Friday, February 5, 2021 at 5:04:42 PM UTC+5:30 Kumar Rajwani 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> !tesseract -v
>>>>>>>>>>>> tesseract 5.0.0-alpha-20201231-171-g04173
>>>>>>>>>>>>  leptonica-1.78.0
>>>>>>>>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 
>>>>>>>>>>>> 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 
>>>>>>>>>>>> 2.3.0
>>>>>>>>>>>>  Found AVX2
>>>>>>>>>>>>  Found AVX
>>>>>>>>>>>>  Found FMA
>>>>>>>>>>>>  Found SSE
>>>>>>>>>>>>  Found OpenMP 201511
>>>>>>>>>>>>  Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 
>>>>>>>>>>>> liblz4/1.7.1
>>>>>>>>>>>>
>>>>>>>>>>>> Image example: I have added one image from my training data.
>>>>>>>>>>>>
>>>>>>>>>>>> I am using the Colab system, which has Ubuntu as the OS. 
>>>>>>>>>>>>
>>>>>>>>>>>> https://colab.research.google.com/drive/1_Bn4wbK6dE5zYAuFyC4Eczq_eNU2shuz?usp=sharing
>>>>>>>>>>>> This is my notebook; you can see the complete process in the 
>>>>>>>>>>>> "finetune 2" section.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Friday, February 5, 2021 at 4:55:43 PM UTC+5:30 shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani <
>>>>>>>>>>>>> kumarraj...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have tried minus 1 and got the following result:
>>>>>>>>>>>>>> Iteration 0: GROUND  TRUTH : ) @®
>>>>>>>>>>>>>> Iteration 0: BEST OCR TEXT : Yo
>>>>>>>>>>>>>> File eng.arial.exp0.lstmf line 0 :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>  
>>>>>>>>>>>>>
>>>>>>>>>>>>>> What's your version of tesseract? What o/s?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Without your files, it's difficult to know what's causing the 
>>>>>>>>>>>>> issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With -1 debug_interval you should get the info for every 
>>>>>>>>>>>>> iteration.
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>
>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/e663d426-2c32-432b-80b3-4ff9d8fe86d4n%40googlegroups.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/e663d426-2c32-432b-80b3-4ff9d8fe86d4n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/72367041-6a6e-42ac-9508-4351bb20c2afn%40googlegroups.com.
