Here is my pdfbuilder.rb diff.
This contains my fixes to use the Tess3.01-specific hOCR output
with crisp word-start boundaries,
and to tolerate empty words or lines in the hOCR output.
$ diff pdfbuilder.orig.rb pdfbuilder.rb
480c480
< ocr_words = ocr_line.search("//span[@class='ocrx_word']")
---
On 26.05.2012 23:09, Galt wrote:
> Wonderful news, Zdenko!
>
>> Yesterday David Eger committed a patch that should fix tesseract-ocr hOCR output
>> to follow the hOCR spec.
> I wonder what he did?
See [1] and [2]. And today I committed r729... We tested the output with pdfbeads
(1.0.9) and ExactImage
Well, thanks should go to David, who fixed the code, and to Galt, who
reported and tested it.
My problem (besides lack of time ;-) ) is that there is no working hOCR
validation tool. hocr-tools[1] has something, but it seems to have a problem
with recent Python PyXML[2] (I just did a quick test). I saw some attempts
that rep
Wonderful news, Zdenko!
> Yesterday David Eger committed a patch that should fix tesseract-ocr hOCR output
> to follow the hOCR spec.
I wonder what he did?
> A. Spec conformity. As far as I understand, this is fixed (no reports of
> non-conformity to the hOCR spec).
Good.
> B. Usability in other tools. Th
Zdenko,
Thanks for your work on that! I'm excited about using hOCR for some
projects, so I'm really glad that we're moving towards standards
compliance.
--Sven
On Sat, May 26, 2012 at 2:57 AM, zdenko podobny wrote:
> Discussion can be found in the (closed and open) Issues (;-) ).
>
> Initial hOCR s
Discussion can be found in the (closed and open) Issues (;-) ).
Initial hOCR support[1] comes from issue 263[2] and was submitted
by amkryukov.
As you can see, this patch implemented 'ocr_word' and 'xocr_word'. They are
not part of the hOCR spec.
'xocr_word' was changed[3] to 'ocrx_word' based on issue is
Here's my pdf if anyone is interested:
http://folkplanet.com/seanchlo/gortoir/GortOir.pdf
Made with scanTailor, jbigenc, pdfbeads and Tess3.01.
Thanks, Zdenko!
I found most of those same links too.
FYI here is Tess3.01 output:
Dul
fé
na
Gréine
.
.
.
.
3
In a nutshell, Tess 3.01 outputs this pattern for each word:
Dul
And judging by pdfbeads code, tess 3.00 did something like this for
each word:
Dul
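The tag markup in the two per-word patterns above did not survive in this archive view. As a rough sketch only, assuming the Tess3.01 layout described elsewhere in this thread (position on an outer 'ocr_word' span, word text in an inner 'ocrx_word' span) and the hOCR convention of carrying the bbox in the title attribute, words could be read from either layout along these lines. The helper names and the sample markup are hypothetical and are not taken from pdfbeads; REXML is used here only to keep the example within the standard Ruby distribution.

```ruby
require 'rexml/document'

# Hypothetical helper: pull the four bbox integers out of an hOCR title
# attribute such as "bbox 10 10 60 40". Returns nil when no bbox is present.
def parse_bbox(title)
  m = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.match(title.to_s)
  m ? m.captures.map(&:to_i) : nil
end

# Hypothetical helper: return [text, bbox] pairs for one ocr_line element.
# Assumption: the bbox sits either on the ocrx_word span itself or on an
# enclosing ocr_word span. Empty words and words without a bbox are skipped.
def words_with_boxes(line)
  words = []
  REXML::XPath.match(line, ".//span[@class='ocrx_word']").each do |inner|
    text = inner.text.to_s.strip
    next if text.empty?

    bbox = parse_bbox(inner.attributes['title'])
    outer = inner.parent
    if bbox.nil? && outer && outer.attributes['class'] == 'ocr_word'
      bbox = parse_bbox(outer.attributes['title'])
    end
    words << [text, bbox] if bbox
  end
  words
end

# Made-up sample line for illustration, not real Tesseract output.
SAMPLE_LINE = <<~HOCR
  <span class='ocr_line' title='bbox 10 10 400 40'>
    <span class='ocr_word' title='bbox 10 10 60 40'><span class='ocrx_word' title='x_wconf -1'>Dul</span></span>
    <span class='ocr_word' title='bbox 70 10 100 40'><span class='ocrx_word'></span></span>
    <span class='ocrx_word' title='bbox 110 10 140 40'>na</span>
  </span>
HOCR

line = REXML::Document.new(SAMPLE_LINE).root
words_with_boxes(line).each { |text, bbox| puts "#{text}: #{bbox.inspect}" }
```

The empty word in the sample is skipped rather than raising, which is the kind of tolerance the pdfbuilder.rb diff at the top of this thread is described as adding.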
On Tue, May 22, 2012 at 10:12 PM, zdenko podobny wrote:
>
>
> On Tue, May 22, 2012 at 2:03 PM, Galt wrote:
>
>>
>> >
>> > > Please create an issue describing what the output is and how it should be...
>> > > Until then I have been forced to make a little hack to pdfbeads to get it
>> > > to read th
On Tue, May 22, 2012 at 2:03 PM, Galt wrote:
>
> >
> > > Please create an issue describing what the output is and how it should be...
> > > Until then I have been forced to make a little hack to pdfbeads to get it
> > > to read the position
> > > and word from ocr_word and ocrx_word respectively so t
>
> > Please create an issue describing what the output is and how it should be...
> > Until then I have been forced to make a little hack to pdfbeads to get it
> > to read the position
> > and word from ocr_word and ocrx_word respectively so that it can read
> > the Tess3.01 hocr input. It seems that
On Tue, May 22, 2012 at 12:14 AM, Galt wrote:
> I should begin by saying that I am grateful and happy to have
> a very nice searchable pdf of an old book thanks to Tess.
>
> I found this on the web:
>
> https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e1
I should begin by saying that I am grateful and happy to have
a very nice searchable pdf of an old book thanks to Tess.
I found this on the web:
https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b
pdfbeads currently doesn't work with hOCR output generated