Thanks for your input, Ger.
Tesseract is definitely a bit rubbish when there is complex spacing to
deal with.
Test1.png attached is just the "Party A Party B Valuation" line.
The offsets are still overlapping.
Test2.png is just the "Party A" and Tesseract gets it right.
I already have some code which looks for low-confidence characters/words
and re-OCRs those areas in a different psm.
I stopped using it because the confidence cannot be relied upon, especially
when multiple languages come into play.
However, I could reuse that code to re-OCR, word by word, just those
sections which appear to have overlapping char coordinates.
Yuk!
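
For what it's worth, a minimal sketch of that overlap check, written in Python rather than Tess4J for brevity. It assumes the hOCR was produced with hocr_char_boxes=1, so that each character is an ocrx_cinfo span with an x_bboxes title, as in the attachments. The function names are hypothetical and not part of any Tesseract or Tess4J API.

```python
import re
import xml.etree.ElementTree as ET

CHAR_BOX_RE = re.compile(r"x_bboxes (\d+) (\d+) (\d+) (\d+)")
WORD_BOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def char_boxes(word_el):
    """Return [(char, x0, y0, x1, y1), ...] for one ocrx_word element."""
    boxes = []
    for el in word_el.iter():
        if el.get("class") != "ocrx_cinfo":
            continue
        m = CHAR_BOX_RE.search(el.get("title", ""))
        if m:
            boxes.append((el.text or "",) + tuple(int(v) for v in m.groups()))
    return boxes


def has_overlap(boxes, tolerance=0):
    """True if any char box starts left of where the previous box ends."""
    return any(cur[1] < prev[3] - tolerance
               for prev, cur in zip(boxes, boxes[1:]))


def words_to_reocr(hocr_xml):
    """Return [(text, word_bbox), ...] for words whose char boxes overlap."""
    flagged = []
    for el in ET.fromstring(hocr_xml).iter():
        if el.get("class") != "ocrx_word":
            continue
        boxes = char_boxes(el)
        m = WORD_BOX_RE.search(el.get("title", ""))
        if m and has_overlap(boxes):
            text = "".join(b[0] for b in boxes)
            flagged.append((text, tuple(int(v) for v in m.groups())))
    return flagged
```

The word bbox returned for each flagged word is the region that would be cropped and re-OCRed on its own. The tolerance parameter is an assumption: a pixel or two of slack may be needed if touching glyphs are not meant to count as overlapping.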
J
On Sunday, June 6, 2021 at 1:35:47 PM UTC+1 [email protected] wrote:
> Don't know why it happens precisely, but tesseract gets a little wonky
> when you feed it tables (with borders/lines).
>
> Another good test would be to clip out the text of each cell, e.g. "Party
> A" only, etc. and feed those to tesseract one after the other.
>
> When text comes out proper then, then at least you'll have "proof" that
> this is triggered by the table layout. Which would imply "image
> segmentation" would be the next subject to look at - though it can be
> argued that fiddling with that can be considered as "workaround" instead of
> "fix". Either way, this is complicated and I don't have the answers. Just a
> direction to look at, categorize the problem and then maybe you have
> something to help you reduce the problem.
>
> Cheers,
>
> Ger
>
>
> On Fri, Jun 4, 2021, 17:41 Jeremy Young <[email protected]> wrote:
>
>> Hmmm. I had a quick look. The results don't seem to be too helpful. Could
>> you be a little more precise as to what I'm looking for?
>> Thx
>>
>>
>> On Friday, June 4, 2021 at 4:29:29 PM UTC+1 zdenop wrote:
>>
>>> search issue tracker and forum for "table"
>>>
>>> Zdenko
>>>
>>>
>>> On Fri, 4 Jun 2021 at 17:13, Jeremy Young <[email protected]>
>>> wrote:
>>>
>>>> It looks like there's a bug of some sort here. Attached is another
>>>> image. When I OCR it with
>>>>
>>>> "tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1"
>>>>
>>>> the hocr for "Party A" looks like this:
>>>>
>>>> <span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683
>>>> 384; x_wconf 84'>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376;
>>>> x_conf 98.908447'>P</span>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376;
>>>> x_conf 99.026512'>a</span>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376;
>>>> x_conf 98.80246'>r</span>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384;
>>>> x_conf 98.968414'>t</span>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384;
>>>> x_conf 98.820137'>y</span>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376;
>>>> x_conf 97.777733'>A</span>
>>>> </span>
>>>>
>>>> i.e. the x-range of the "y" overlaps both the preceding "t" and the
>>>> following "A".
>>>>
>>>> On Thursday, June 3, 2021 at 6:45:51 PM UTC+1 Jeremy Young wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> The attached test image (which could be in a batch of a million, so I
>>>>> need a generalised fix) is being processed in Tess4J but I also get the
>>>>> same issue with the Windows build from Mannheim version:
>>>>>
>>>>> C:\temp>tesseract --version
>>>>> tesseract v5.0.0-alpha.20210506
>>>>> leptonica-1.78.0
>>>>> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>>>>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>>> Found AVX2
>>>>> Found AVX
>>>>> Found FMA
>>>>> Found SSE4.1
>>>>> Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6
>>>>> liblz4/1.7.5 libzstd/1.4.5
>>>>> Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5
>>>>> libidn2/2.0.4 nghttp2/1.31.0
>>>>>
>>>>> When I execute "tesseract test1.png test1" the output contains at line
>>>>> 21 "PartyA | PartyB | Valuation". "Party A" should be two words as should
>>>>> "Party B".
>>>>>
>>>>> When I output the hocr using Tess4J I can see that the gaps between
>>>>> the characters are 4,6,2,2,12
>>>>> i.e. the gap between the "y" and the "A" is much bigger than the others.
>>>>>
>>>>> <span class='ocrx_word' id='word_1_15' title='bbox 1551 349 1681
>>>>> 386; x_wconf 91; x_fsize 9'>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1551 349 1569 378;
>>>>> x_conf 99.031525'>P</span>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1573 356 1590 378;
>>>>> x_conf 98.951897'>a</span>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1596 356 1608 378;
>>>>> x_conf 98.996353'>r</span>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1610 351 1623 378;
>>>>> x_conf 99.038818'>t</span>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1625 357 1644 386;
>>>>> x_conf 98.881676'>y</span>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1656 349 1681 378;
>>>>> x_conf 98.736168'>A</span>
>>>>> </span>
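
The gap observation above (4, 6, 2, 2, 12) suggests one possible post-processing step: split a word wherever the gap between adjacent character boxes is much larger than the typical gap in that word. Below is a minimal Python sketch of that heuristic. The function names and the factor of 2 are assumptions for illustration, not anything Tesseract or Tess4J provides, and it only makes sense when the character boxes do not overlap.

```python
from statistics import median


def gaps(x_ranges):
    """Horizontal gaps between consecutive (x0, x1) character ranges."""
    return [cur[0] - prev[1] for prev, cur in zip(x_ranges, x_ranges[1:])]


def split_on_wide_gaps(chars, x_ranges, factor=2.0):
    """Split chars into words wherever a gap exceeds factor * median gap.

    factor is a guess; it would need tuning per font and scan resolution.
    """
    g = gaps(x_ranges)
    if not g:
        return ["".join(chars)]
    # Guard against a zero median so tightly set text is not split everywhere.
    threshold = factor * max(median(g), 1)
    words, current = [], [chars[0]]
    for ch, gap in zip(chars[1:], g):
        if gap > threshold:
            words.append("".join(current))
            current = [ch]
        else:
            current.append(ch)
    words.append("".join(current))
    return words


# x-ranges taken from the "PartyA" hOCR in this message
party_a = [(1551, 1569), (1573, 1590), (1596, 1608),
           (1610, 1623), (1625, 1644), (1656, 1681)]
print(gaps(party_a))
print(split_on_wide_gaps(list("PartyA"), party_a))
```

The median is used rather than the mean so that the one wide gap does not inflate the threshold it is being compared against.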
>>>>>
>>>>> Any suggestions what I could do?
>>>>>
>>>>> Thx
>>>>>
>>>>>
>>>>>
>>>>> LIKEZERO Limited is a limited company registered in Scotland with
>>>>> registered number SC651418. Our registered office is at Quartermile One,
>>>>> 15 Lauriston Place, Edinburgh, United Kingdom, EH3 9EP
>>>>>
>>>>> This email is intended solely for the addressee and may contain
>>>>> confidential information. If you have received this message in error,
>>>>> please immediately and permanently delete it. Do not use, copy or disclose
>>>>> the information contained in this message or in any attachment.
>>>>>
>>>>> This email is not in any way intended to create a binding contract.
>>>>>
>>>>> We may monitor and record emails for security reasons and for
>>>>> monitoring compliance with internal policies.
>>>>>
>>>>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/28ea517b-ff78-483c-98ed-67db49a7d7b5n%40googlegroups.com
>>>>
>>>
>>
>
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/65697d95-ddd6-4b94-bc11-8e29da379577n%40googlegroups.com.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract v5.0.0-alpha.20210506' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "test2.png"; bbox 0 0 171 61; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 17 13 153 50">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 17 13 153 50">
<span class='ocr_line' id='line_1_1' title="bbox 17 13 153 50; baseline 0 -8; x_size 37; x_descenders 8; x_ascenders 7">
<span class='ocrx_word' id='word_1_1' title='bbox 17 13 115 50; x_wconf 96'>
<span class='ocrx_cinfo' title='x_bboxes 17 13 37 42; x_conf 99.564034'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 41 20 59 42; x_conf 99.450645'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 64 20 77 42; x_conf 99.573677'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 79 15 93 42; x_conf 99.565178'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 95 20 115 50; x_conf 99.548386'>y</span>
</span>
<span class='ocrx_word' id='word_1_2' title='bbox 127 13 153 42; x_wconf 96'>
<span class='ocrx_cinfo' title='x_bboxes 127 13 153 42; x_conf 99.496986'>A</span>
</span>
</span>
</p>
</div>
</div>
</body>
</html>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract v5.0.0-alpha.20210506' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "test1.png"; bbox 0 0 679 96; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 58 39 641 78">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 58 39 641 78">
<span class='ocr_line' id='line_1_1' title="bbox 58 39 641 78; baseline 0 -8; x_size 39; x_descenders 8; x_ascenders 9">
<span class='ocrx_word' id='word_1_1' title='bbox 58 41 194 78; x_wconf 84'>
<span class='ocrx_cinfo' title='x_bboxes 58 41 78 70; x_conf 98.908447'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 82 48 100 70; x_conf 99.026512'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 105 48 118 70; x_conf 98.80246'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 120 43 156 78; x_conf 98.968414'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 148 41 172 78; x_conf 98.820137'>y</span>
<span class='ocrx_cinfo' title='x_bboxes 168 41 194 70; x_conf 97.777733'>A</span>
</span>
<span class='ocrx_word' id='word_1_2' title='bbox 255 41 388 78; x_wconf 51'>
<span class='ocrx_cinfo' title='x_bboxes 255 41 275 70; x_conf 98.183838'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 279 48 297 70; x_conf 99.04277'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 302 48 315 70; x_conf 99.028549'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 317 43 331 70; x_conf 98.988968'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 333 48 353 78; x_conf 98.953331'>y</span>
<span class='ocrx_cinfo' title='x_bboxes 367 41 388 70; x_conf 98.652145'>B</span>
</span>
<span class='ocrx_word' id='word_1_3' title='bbox 460 39 641 70; x_wconf 96'>
<span class='ocrx_cinfo' title='x_bboxes 460 39 474 70; x_conf 99.557716'>V</span>
<span class='ocrx_cinfo' title='x_bboxes 460 41 486 70; x_conf 99.571533'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 488 39 517 70; x_conf 99.548729'>l</span>
<span class='ocrx_cinfo' title='x_bboxes 523 48 542 70; x_conf 99.523003'>u</span>
<span class='ocrx_cinfo' title='x_bboxes 547 48 565 70; x_conf 99.574173'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 568 43 582 70; x_conf 99.574661'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 585 40 592 70; x_conf 99.521149'>i</span>
<span class='ocrx_cinfo' title='x_bboxes 597 48 618 70; x_conf 99.573235'>o</span>
<span class='ocrx_cinfo' title='x_bboxes 622 48 641 70; x_conf 99.571426'>n</span>
</span>
</span>
</p>
</div>
</div>
</body>
</html>