Well, thanks should go to David, who fixed the code, and Galt, who
reported and tested it.

My problem (apart from lack of time ;-) ) is that there is no working hOCR
validity tool. hocr-tools[1] has something, but it appears to have a problem
with recent Python PyXML[2] (I only did a quick test). I saw some attempts
that replaced PyXML with lxml, but with the remark "need to be tested"...

So it would be good if somebody fixed hocr-tools.

Then there is a need to create a test case and compare the hOCR output of
different tools (e.g. cuneiform[3], djvu2hocr[4], MODI2hocr[5]...) to
see what kind of information they provide.
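
As a rough starting point for such a comparison, here is a minimal sketch
(not a validator) that lists which hOCR classes a file uses and which
property names appear in their title attributes. It uses only the Python
standard library; the names HocrInventory and inventory are made up for
this example, and treating every class that starts with "ocr" as an hOCR
class is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrInventory(HTMLParser):
    """Count hOCR classes and the property names found in their titles."""

    def __init__(self):
        super().__init__()
        self.classes = Counter()
        self.properties = {}  # class name -> set of property names

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Assumption: anything starting with "ocr" is an hOCR class
        # (covers ocr_line, ocr_word, ocrx_word, ...).
        names = [c for c in (attr.get("class") or "").split()
                 if c.startswith("ocr")]
        # hOCR packs properties into title as "name value...; name value..."
        props = {p.split()[0]
                 for p in (attr.get("title") or "").split(";") if p.strip()}
        for name in names:
            self.classes[name] += 1
            self.properties.setdefault(name, set()).update(props)


def inventory(hocr_text):
    """Return (class counts, property names per class) for one hOCR document."""
    parser = HocrInventory()
    parser.feed(hocr_text)
    parser.close()
    return parser.classes, parser.properties
```

Running inventory() on the output of each tool for the same page would show
at a glance which classes and properties each one emits.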

Any help with these tasks is appreciated ;-)

[1] http://code.google.com/p/hocr-tools/
[2] http://code.google.com/p/hocr-tools/issues/detail?id=2
[3] http://openocr.org/ or https://launchpad.net/cuneiform-linux
[4] http://jwilk.net/software/ocrodjvu
[5] http://code.google.com/p/modi2hocr/

--
Zdenko

On 26.05.2012 14:11, Sven Pedersen wrote:
> Zdenko,
> Thanks for your work on that! I'm excited about using hOCR for some
> projects, so I'm really glad that we're moving towards standards
> compliance.
> --Sven
>
> On Sat, May 26, 2012 at 2:57 AM, zdenko podobny <zde...@gmail.com> wrote:
>> The discussion can be found in the (closed and open) issues ;-)
>>
>> Initial hOCR support[1] comes from issue 263[2] and was submitted
>> by amkryukov.
>> As you can see, this patch implemented 'ocr_word' and 'xocr_word'. They are
>> not part of the hOCR spec.
>>
>> 'xocr_word' was changed[3] to 'ocrx_word' based on issue 492[4], which
>> complained about its non-conformity with the hOCR spec.
>>
>> Yesterday David Eger committed a patch that should make the tesseract-ocr
>> hOCR output follow the hOCR spec.
>>
>> I think we need to split this problem into several parts:
>>
>> A. Spec conformity. As far as I understand, this is fixed (there are no
>> reports of non-conformity with the hOCR spec).
>> B. Usability in other tools. This is a little bit tricky because it needs
>> support from the authors of other tools (e.g. pdfbeads). Example: if
>> tesseract-ocr produces a valid hOCR document and some tool is not able to
>> process it, then IMO that tool should be fixed... But it depends on the
>> problem. From my point of view pdfbeads 1.0.9 fixed the ocrx_word problem,
>> so issue 711 should be closed.
>> C. Other problems/enhancements: e.g. "empty words". This needs to be tested
>> (improved), but I think other tools should be able to process it.
>>
>> [1] http://code.google.com/p/tesseract-ocr/source/detail?r=333
>> [2] http://code.google.com/p/tesseract-ocr/issues/detail?id=263&can=1&q=hocr
>> [3] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn585&r=585&format=side&path=/trunk/api/baseapi.cpp
>> [4] http://code.google.com/p/tesseract-ocr/issues/detail?id=492
>>
>> --
>> Zdenko
>>
>> On Wed, May 23, 2012 at 11:15 AM, Galt <g...@folkplanet.com> wrote:
>>> Thanks, Zdenko!
>>>
>>> I found most of those same links too.
>>>
>>> FYI here is Tess3.01 output:
>>>
>>> <p class='ocr_par'>
>>> <span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">
>>>
>>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>>>  <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_6' title="bbox 620 298 696 360">
>>>  <span class='ocrx_word' id='xword_1_6' title="x_wconf -2">fé</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
>>>  <span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_8' title="bbox 859 296 1095 363">
>>>  <span class='ocrx_word' id='xword_1_8' title="x_wconf -2">Gréine</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_9' title="bbox 1325 332 1337 345">
>>>  <span class='ocrx_word' id='xword_1_9' title="x_wconf -3">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_10' title="bbox 1605 334 1617 346">
>>>  <span class='ocrx_word' id='xword_1_10' title="x_wconf -1">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_11' title="bbox 1888 336 1899 346">
>>>  <span class='ocrx_word' id='xword_1_11' title="x_wconf -1">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_12' title="bbox 2451 335 2462 348">
>>>  <span class='ocrx_word' id='xword_1_12' title="x_wconf -1">.</span>
>>> </span>
>>> <span class='ocr_word' id='word_1_13' title="bbox 2599 293 2633 349">
>>>  <span class='ocrx_word' id='xword_1_13' title="x_wconf -7">3</span>
>>> </span>
>>>
>>> </span>
>>> </p>
>>>
>>> In a nutshell, Tess 3.01 outputs this pattern for each word:
>>>
>>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>>>  <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>>> </span>
>>>
>>> And judging by the pdfbeads code, Tess 3.00 did something like this for
>>> each word:
>>> <span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>
>>>
>>> pdfbeads 1.0.9 added a hack just to keep it from crashing
>>> when the ratio was 0 because ocrx_word does not have bbox info.
>>>>         next if bbox == [0,0,0,0]
>>> This simple change does not actually make it use the bbox info that
>>> is in ocr_word.  In fact, the net result is that only the bbox info
>>> from the entire line is used, and actual word positions are just
>>> guesstimated by the pdf viewer -- which is sometimes nearly right, and
>>> other times horribly wrong.
>>>
>>> I assume that the author of pdfbeads (Alexey Kryukov) understands this
>>> change in the output of Tess3.01.  Is he refusing to use ocr_word
>>> because it is not part of the standard?  This was implied by Carlos.
>>>
>>> Is there some useful discussion of the hocr output change in 3.01
>>> somewhere?
>>>
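
The nesting Galt describes above (the bbox on the outer ocr_word, the text
in the inner ocrx_word) suggests one way a consuming tool could recover
per-word positions. The following is only a sketch under that assumption,
not what pdfbeads actually does; parse_bbox, WordBoxes and word_boxes are
names invented for this example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    for prop in (title or "").split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None


class WordBoxes(HTMLParser):
    """Collect (text, bbox) per ocrx_word, inheriting bbox from ocr_word."""

    def __init__(self):
        super().__init__()
        self.stack = []      # one (class list, bbox) entry per open <span>
        self.words = []      # finished (text, bbox) pairs
        self.current = None  # [text, bbox] of the ocrx_word being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        bbox = parse_bbox(attr.get("title"))
        if "ocrx_word" in classes:
            if bbox is None:
                # Tess 3.01 layout: use the nearest enclosing ocr_word
                for outer_classes, outer_bbox in reversed(self.stack):
                    if "ocr_word" in outer_classes:
                        bbox = outer_bbox
                        break
            self.current = ["", bbox]
        self.stack.append((classes, bbox))

    def handle_data(self, data):
        if self.current is not None:
            self.current[0] += data

    def handle_endtag(self, tag):
        if tag != "span" or not self.stack:
            return
        classes, _ = self.stack.pop()
        if "ocrx_word" in classes and self.current is not None:
            self.words.append((self.current[0].strip(), self.current[1]))
            self.current = None


def word_boxes(hocr_text):
    """Return a list of (word text, bbox or None) for one hOCR document."""
    parser = WordBoxes()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

With this approach both the 3.00-style and the 3.01-style word markup quoted
above yield the same word box, so a tool would not need to fall back to the
line bbox.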
