Well, thanks should go to David, who fixed the code, and Galt, who reported and tested it.
My problem (apart from lack of time ;-) ) is that there is no working hOCR validity tool. hocr-tools[1] has something, but it seems to have a problem with recent Python/PyXML[2] (I only did a quick test). I saw some attempts that replaced PyXML with lxml, but with the remark "needs to be tested"... So it would be good if somebody fixed hocr-tools. Then a test case needs to be created, and the hOCR output of different tools (e.g. cuneiform[3], djvu2hocr[4], MODI2hocr[5]...) compared, to see what kind of information they provide...

Any help with these tasks is appreciated ;-)

[1] http://code.google.com/p/hocr-tools/
[2] http://code.google.com/p/hocr-tools/issues/detail?id=2
[3] http://openocr.org/ or https://launchpad.net/cuneiform-linux
[4] http://jwilk.net/software/ocrodjvu
[5] http://code.google.com/p/modi2hocr/

--
Zdenko

On 26.05.2012 14:11, Sven Pedersen wrote:
> Zdenko,
> Thanks for your work on that! I'm excited about using hOCR for some
> projects, so I'm really glad that we're moving towards standards
> compliance.
> --Sven
>
> On Sat, May 26, 2012 at 2:57 AM, zdenko podobny <zde...@gmail.com> wrote:
>> The discussion can be found in the (closed and open) issues ;-).
>>
>> Initial hOCR support[1] came from issue 263[2] and was submitted
>> by amkryukov.
>> As you can see, this patch implemented 'ocr_word' and 'xocr_word'. They are
>> not part of the hOCR spec.
>>
>> 'xocr_word' was changed[3] to 'ocrx_word' based on issue 492[4], which
>> complained about its non-conformity with the hOCR spec.
>>
>> Yesterday David Eger committed a patch that should fix the tesseract-ocr
>> hOCR output to follow the hOCR spec.
>>
>> I think we need to split this problem into several parts:
>>
>> A. Spec conformity. As far as I understand, this is fixed (there are no
>> reports of non-conformity with the hOCR spec).
>> B. Usability in other tools. This is a little bit tricky because it needs
>> the support of the authors of the other tools (e.g. pdfbeads).
>> Example: if tesseract-ocr produces a valid hOCR document and some tool is
>> not able to process it, then IMO that tool should be fixed... But it
>> depends on the problem. From my point of view pdfbeads 1.0.9 fixed the
>> ocrx_word problem, so issue 711 should be closed.
>> C. Other problems/enhancements, e.g. "empty words". This needs to be tested
>> (improved), but I think other tools should be able to process it.
>>
>> [1] http://code.google.com/p/tesseract-ocr/source/detail?r=333
>> [2] http://code.google.com/p/tesseract-ocr/issues/detail?id=263&can=1&q=hocr
>> [3] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn585&r=585&format=side&path=/trunk/api/baseapi.cpp
>> [4] http://code.google.com/p/tesseract-ocr/issues/detail?id=492
>>
>> --
>> Zdenko
>>
>> On Wed, May 23, 2012 at 11:15 AM, Galt <g...@folkplanet.com> wrote:
>>> Thanks, Zdenko!
>>>
>>> I found most of those same links too.
>>>
>>> FYI here is Tess3.01 output:
>>>
>>> <p class='ocr_par'>
>>> <span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">
>>>
>>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>>> <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_6' title="bbox 620 298 696 360">
>>> <span class='ocrx_word' id='xword_1_6' title="x_wconf -2">fé</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
>>> <span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_8' title="bbox 859 296 1095 363">
>>> <span class='ocrx_word' id='xword_1_8' title="x_wconf -2">Gréine</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_9' title="bbox 1325 332 1337 345">
>>> <span class='ocrx_word' id='xword_1_9' title="x_wconf -3">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_10' title="bbox 1605 334 1617 346">
>>> <span class='ocrx_word' id='xword_1_10' title="x_wconf -1">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_11' title="bbox 1888 336 1899 346">
>>> <span class='ocrx_word' id='xword_1_11' title="x_wconf -1">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_12' title="bbox 2451 335 2462 348">
>>> <span class='ocrx_word' id='xword_1_12' title="x_wconf -1">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_13' title="bbox 2599 293 2633 349">
>>> <span class='ocrx_word' id='xword_1_13' title="x_wconf -7">3</span>
>>> </span>
>>>
>>> </span>
>>> </p>
>>>
>>> In a nutshell, Tess 3.01 outputs this pattern for each word:
>>>
>>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>>> <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>>> </span>
>>>
>>> And judging by the pdfbeads code, Tess 3.00 did something like this for
>>> each word:
>>> <span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>
>>>
>>> pdfbeads 1.0.9 added a hack just to keep it from crashing
>>> when the ratio was 0 because ocrx_word does not have bbox info.
>>>> next if bbox == [0,0,0,0]
>>> This simple change does not actually make it use the bbox info that
>>> is in ocr_word. In fact, the net result is that only the bbox info
>>> from the entire line is used, and actual word positions are just
>>> guesstimated by the pdf viewer, which is sometimes nearly right and
>>> other times horribly wrong.
>>>
>>> I assume that the author of pdfbeads (Alexey Kryukov) understands this
>>> change in the output of Tess3.01. Is he refusing to use ocr_word
>>> because it is not part of the standard? This was implied by Carlos.
>>>
>>> Is there some useful discussion of the hOCR output change in 3.01
>>> somewhere?
>>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
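For anyone who wants to experiment with the Tess 3.01 pattern Galt quoted (an outer ocr_word span carrying the bbox, an inner ocrx_word span carrying the text), here is a minimal Python sketch of pairing each word with its bounding box. It is only an illustration, not how pdfbeads or tesseract-ocr do it. It assumes the fragment is well-formed XML, so that the standard xml.etree.ElementTree module can parse it; real hOCR files are HTML and may need a more tolerant parser. The helper names parse_bbox and words_with_boxes are made up for this example, and the sample data is a shortened copy of the output quoted in the thread.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the Tess 3.01 output quoted in the thread.
HOCR_FRAGMENT = """
<span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">
<span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
<span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
</span>
<span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
<span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
</span>
</span>
"""

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(g) for g in match.groups()) if match else None


def words_with_boxes(fragment):
    """Pair each word's text with the bbox of its enclosing ocr_word span.

    Assumption: the fragment is well-formed XML with a single root element.
    """
    root = ET.fromstring(fragment.strip())
    result = []
    for outer in root.iter("span"):
        if outer.get("class") != "ocr_word":
            continue  # skips the ocr_line root and the inner ocrx_word spans
        bbox = parse_bbox(outer.get("title"))
        # itertext() gathers the text held by the nested ocrx_word span.
        text = "".join(outer.itertext()).strip()
        result.append((text, bbox))
    return result


for text, bbox in words_with_boxes(HOCR_FRAGMENT):
    print(text, bbox)
```

A tool that only looks at ocrx_word would find no bbox in this layout, which matches the pdfbeads behaviour Galt describes; reading the box from the enclosing ocr_word is one possible workaround.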