Search the issue tracker and the forum for "table".

Zdenko


On Fri, 4 Jun 2021 at 17:13, Jeremy Young <[email protected]> wrote:

> It looks like there's a bug of some sort here. Attached is another image.
> When I OCR it with
>
> "tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1"
>
> the hocr for "Party A" looks like this:
>
>       <span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683 384;
> x_wconf 84'>
>        <span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376; x_conf
> 98.908447'>P</span>
>        <span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376; x_conf
> 99.026512'>a</span>
>        <span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376; x_conf
> 98.80246'>r</span>
>        <span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384; x_conf
> 98.968414'>t</span>
>        <span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384; x_conf
> 98.820137'>y</span>
>        <span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376; x_conf
> 97.777733'>A</span>
>       </span>
>
> i.e. the x-range of the "y" overlaps both the preceding "t" and the following "A".
>
> On Thursday, June 3, 2021 at 6:45:51 PM UTC+1 Jeremy Young wrote:
>
>> Hi
>>
>> The attached test image (which could be in a batch of a million, so I
>> need a generalised fix) is being processed in Tess4J but I also get the
>> same issue with the Windows build from Mannheim version:
>>
>> C:\temp>tesseract --version
>> tesseract v5.0.0-alpha.20210506
>>  leptonica-1.78.0
>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>  Found AVX2
>>  Found AVX
>>  Found FMA
>>  Found SSE4.1
>>  Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6
>> liblz4/1.7.5 libzstd/1.4.5
>>  Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4
>> nghttp2/1.31.0
>>
>> When I execute "tesseract test1.png test1" the output contains at line 21
>> "PartyA | PartyB | Valuation". "Party A" should be two words as should
>> "Party B".
>>
>> When I output the hocr using Tess4J I can see that the gaps between the
>> characters are 4, 6, 2, 2 and 12 pixels,
>> i.e. the gap between the "y" and the "A" is much bigger than the others.
>>
>>       <span class='ocrx_word' id='word_1_15' title='bbox 1551 349 1681
>> 386; x_wconf 91; x_fsize 9'>
>>        <span class='ocrx_cinfo' title='x_bboxes 1551 349 1569 378; x_conf
>> 99.031525'>P</span>
>>        <span class='ocrx_cinfo' title='x_bboxes 1573 356 1590 378; x_conf
>> 98.951897'>a</span>
>>        <span class='ocrx_cinfo' title='x_bboxes 1596 356 1608 378; x_conf
>> 98.996353'>r</span>
>>        <span class='ocrx_cinfo' title='x_bboxes 1610 351 1623 378; x_conf
>> 99.038818'>t</span>
>>        <span class='ocrx_cinfo' title='x_bboxes 1625 357 1644 386; x_conf
>> 98.881676'>y</span>
>>        <span class='ocrx_cinfo' title='x_bboxes 1656 349 1681 378; x_conf
>> 98.736168'>A</span>
>>       </span>
>>
>> Any suggestions what I could do?
>>
>> Thx
>>
>>
>>
>> LIKEZERO Limited is a limited company registered in Scotland with
>> registered number SC651418. Our registered office is at Quartermile One, 15
>> Lauriston Place, Edinburgh, United Kingdom, EH3 9EP
>>
>> This email is intended solely for the addressee and may contain
>> confidential information. If you have received this message in error,
>> please immediately and permanently delete it. Do not use, copy or disclose
>> the information contained in this message or in any attachment.
>>
>> This email is not in any way intended to create a binding contract.
>>
>> We may monitor and record emails for security reasons and for monitoring
>> compliance with internal policies.
>>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/28ea517b-ff78-483c-98ed-67db49a7d7b5n%40googlegroups.com.
>
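
The gap figures quoted above (4, 6, 2, 2, 12) can be reproduced from the character boxes, and the same measurement could drive a post-processing split. The following is only a rough sketch, not something Tesseract or Tess4J provides: the helper names, the regular expression and the "more than twice the median gap" rule are all assumptions made for illustration. It also assumes the boxes do not overlap; with overlapping boxes, as in the other hocr sample, gaps become negative and the median is no longer a useful reference.

```python
import re
from statistics import median

# Character spans copied from the hocr for word_1_15 quoted above,
# with the wrapped title attributes joined back onto one line.
HOCR = """
<span class='ocrx_cinfo' title='x_bboxes 1551 349 1569 378; x_conf 99.031525'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 1573 356 1590 378; x_conf 98.951897'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 1596 356 1608 378; x_conf 98.996353'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 1610 351 1623 378; x_conf 99.038818'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 1625 357 1644 386; x_conf 98.881676'>y</span>
<span class='ocrx_cinfo' title='x_bboxes 1656 349 1681 378; x_conf 98.736168'>A</span>
"""

# Hypothetical pattern: captures x0, y0, x1, y1 and the character text.
CINFO = re.compile(r"x_bboxes (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]*)</span>")


def char_boxes(hocr):
    """Return (char, x0, x1) for every ocrx_cinfo span, in document order."""
    return [(m.group(5), int(m.group(1)), int(m.group(3)))
            for m in CINFO.finditer(hocr)]


def gaps(boxes):
    """Horizontal distance from each box's right edge to the next box's left edge."""
    return [nxt[1] - cur[2] for cur, nxt in zip(boxes, boxes[1:])]


def split_on_wide_gaps(boxes, factor=2.0):
    """Start a new word wherever a gap exceeds factor * median gap (assumed rule)."""
    if not boxes:
        return []
    widths = gaps(boxes)
    if not widths:
        return [boxes[0][0]]
    limit = factor * median(widths)
    words = [boxes[0][0]]
    for (char, _, _), gap in zip(boxes[1:], widths):
        if gap > limit:
            words.append(char)   # wide gap: begin a new word
        else:
            words[-1] += char    # normal gap: extend the current word
    return words


if __name__ == "__main__":
    boxes = char_boxes(HOCR)
    print(gaps(boxes))
    print(split_on_wide_gaps(boxes))
```

The factor of 2.0 is arbitrary and would need tuning on a representative batch before being trusted across many documents.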
