Search the issue tracker and forum for "table".

Zdenko
On Fri, 4 Jun 2021 at 17:13, Jeremy Young <[email protected]> wrote:

> It looks like there's a bug of some sort here. Attached is another image.
> When I OCR it with
>
> "tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1"
>
> the hocr for "Party A" looks like this:
>
> <span class='ocrx_word' id='word_1_7' title='bbox 1547 347 1683 384; x_wconf 84'>
>   <span class='ocrx_cinfo' title='x_bboxes 1547 347 1567 376; x_conf 98.908447'>P</span>
>   <span class='ocrx_cinfo' title='x_bboxes 1571 354 1589 376; x_conf 99.026512'>a</span>
>   <span class='ocrx_cinfo' title='x_bboxes 1594 354 1607 376; x_conf 98.80246'>r</span>
>   <span class='ocrx_cinfo' title='x_bboxes 1609 349 1645 384; x_conf 98.968414'>t</span>
>   <span class='ocrx_cinfo' title='x_bboxes 1637 347 1661 384; x_conf 98.820137'>y</span>
>   <span class='ocrx_cinfo' title='x_bboxes 1657 347 1683 376; x_conf 97.777733'>A</span>
> </span>
>
> i.e. the x-coordinate of the "y" overlaps the prior and following characters.
>
> On Thursday, June 3, 2021 at 6:45:51 PM UTC+1 Jeremy Young wrote:
>
>> Hi
>>
>> The attached test image (which could be in a batch of a million, so I
>> need a generalised fix) is being processed in Tess4J, but I also get the
>> same issue with the Windows build from Mannheim, version:
>>
>> C:\temp>tesseract --version
>> tesseract v5.0.0-alpha.20210506
>>  leptonica-1.78.0
>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>  Found AVX2
>>  Found AVX
>>  Found FMA
>>  Found SSE4.1
>>  Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
>>  Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
>>
>> When I execute "tesseract test1.png test1" the output contains at line 21
>> "PartyA | PartyB | Valuation". "Party A" should be two words, as should
>> "Party B".
>>
>> When I output the hocr using Tess4J I can see that the gaps between the
>> characters are 4, 6, 2, 2, 12,
>> i.e. the gap between the "y" and the "A" is much bigger than the others.
>>
>> <span class='ocrx_word' id='word_1_15' title='bbox 1551 349 1681 386; x_wconf 91; x_fsize 9'>
>>   <span class='ocrx_cinfo' title='x_bboxes 1551 349 1569 378; x_conf 99.031525'>P</span>
>>   <span class='ocrx_cinfo' title='x_bboxes 1573 356 1590 378; x_conf 98.951897'>a</span>
>>   <span class='ocrx_cinfo' title='x_bboxes 1596 356 1608 378; x_conf 98.996353'>r</span>
>>   <span class='ocrx_cinfo' title='x_bboxes 1610 351 1623 378; x_conf 99.038818'>t</span>
>>   <span class='ocrx_cinfo' title='x_bboxes 1625 357 1644 386; x_conf 98.881676'>y</span>
>>   <span class='ocrx_cinfo' title='x_bboxes 1656 349 1681 378; x_conf 98.736168'>A</span>
>> </span>
>>
>> Any suggestions what I could do?
>>
>> Thx
>>
>> LIKEZERO Limited is a limited company registered in Scotland with
>> registered number SC651418. Our registered office is at Quartermile One, 15
>> Lauriston Place, Edinburgh, United Kingdom, EH3 9EP
>>
>> This email is intended solely for the addressee and may contain
>> confidential information. If you have received this message in error,
>> please immediately and permanently delete it. Do not use, copy or disclose
>> the information contained in this message or in any attachment.
>>
>> This email is not in any way intended to create a binding contract.
>>
>> We may monitor and record emails for security reasons and for monitoring
>> compliance with internal policies.

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8zOVU9tqhC3mGi3ojDdRmLFVKnUkOJf%3D7UXr3fL8dAiNw%40mail.gmail.com.
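
For anyone hitting the same merged-word problem, one possible workaround is to post-process the hOCR character boxes and split a word wherever the horizontal gap is much wider than the others (the 4, 6, 2, 2, 12 pattern reported above). The sketch below is only an illustration and is not how Tesseract or Tess4J decide word breaks. The function names (`char_boxes`, `gaps`, `split_on_wide_gaps`) and the `factor=2.0` threshold are assumptions made up for this example, and the threshold would need tuning on real data. It uses only the Python standard library and the `ocrx_cinfo` / `x_bboxes` markup shown in the thread.

```python
import re
from statistics import median

# One ocrx_cinfo span: captures x0, y0, x1, y1 and the recognised character.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);"
    r"[^']*'>(.*?)</span>"
)

def char_boxes(hocr):
    """Return [(char, x0, x1), ...] in the order the spans appear."""
    return [(m[5], int(m[1]), int(m[3])) for m in CINFO.finditer(hocr)]

def gaps(boxes):
    """Horizontal gap between each box and the next (negative means overlap)."""
    return [nxt[1] - cur[2] for cur, nxt in zip(boxes, boxes[1:])]

def split_on_wide_gaps(boxes, factor=2.0):
    """Split where a gap exceeds factor * median of the positive gaps.

    Hypothetical heuristic for illustration only.
    """
    if not boxes:
        return []
    g = gaps(boxes)
    positive = [x for x in g if x > 0]
    if not positive:
        # No usable gaps (single character or all boxes overlap): keep one word.
        return ["".join(ch for ch, _, _ in boxes)]
    threshold = factor * median(positive)
    words, current = [], boxes[0][0]
    for (ch, _, _), gap in zip(boxes[1:], g):
        if gap > threshold:
            words.append(current)
            current = ch
        else:
            current += ch
    words.append(current)
    return words

# Character boxes from the Tess4J output quoted in the thread.
SAMPLE_HOCR = """
<span class='ocrx_word' id='word_1_15' title='bbox 1551 349 1681 386; x_wconf 91; x_fsize 9'>
<span class='ocrx_cinfo' title='x_bboxes 1551 349 1569 378; x_conf 99.031525'>P</span>
<span class='ocrx_cinfo' title='x_bboxes 1573 356 1590 378; x_conf 98.951897'>a</span>
<span class='ocrx_cinfo' title='x_bboxes 1596 356 1608 378; x_conf 98.996353'>r</span>
<span class='ocrx_cinfo' title='x_bboxes 1610 351 1623 378; x_conf 99.038818'>t</span>
<span class='ocrx_cinfo' title='x_bboxes 1625 357 1644 386; x_conf 98.881676'>y</span>
<span class='ocrx_cinfo' title='x_bboxes 1656 349 1681 378; x_conf 98.736168'>A</span>
</span>
"""

boxes = char_boxes(SAMPLE_HOCR)
print(gaps(boxes))                # [4, 6, 2, 2, 12]
print(split_on_wide_gaps(boxes))  # ['Party', 'A']
```

With the command-line output from the second message, the same `gaps` function gives negative values for the "t"/"y" and "y"/"A" pairs, which is a quick way to flag the overlapping boxes. In that case the heuristic above has fewer gaps to work with and may not split the word, so it is not a substitute for a proper fix.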

