Re: Tess3.01 hocr output not working with pdfbeads

2012-05-30 Thread Galt
Here is my pdfbuilder.rb diff. It contains my fixes to use the Tess3.01-specific hOCR output with crisp word-start boundaries, and to tolerate an empty word or line in the hOCR output.

$ diff pdfbuilder.orig.rb pdfbuilder.rb
480c480
< ocr_words = ocr_line.search("//span[@class='ocrx_word']")
---

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-28 Thread Zdenko Podobný
On 26.05.2012 23:09, Galt wrote: > Wonderful news, Zdenko! > >> Yesterday David Eger commit patch that should fix tesseract-ocr hOCR output >> to follow hOCR spec. > I wonder what he did? see [1] and [2]. And I did today r729... We tested output with pdfbeads (1.0.9) and ExactImage

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-27 Thread Zdenko Podobný
Well, thanks should go to David, who fixed the code, and Galt, who reported and tested it. My problem (besides lack of time ;-) ) is that there is no working hOCR validation tool. hocr-tools[1] has something, but it seems to have a problem with recent Python PyXML[2] (I just did a quick test). I saw some attempts that rep

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-26 Thread Galt
Wonderful news, Zdenko! > Yesterday David Eger commit patch that should fix tesseract-ocr hOCR output > to follow hOCR spec. I wonder what he did? > A. Spec conformity. As far as I understood this is fixed (no report about > non conformity to hOCR spec). Good. > B. Usability in other tools. Th

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-26 Thread Sven Pedersen
Zdenko, Thanks for your work on that! I'm excited about using hOCR for some projects, so I'm really glad that we're moving towards standards compliance. --Sven On Sat, May 26, 2012 at 2:57 AM, zdenko podobny wrote: > Discussion could be found in (closed and open) Issues (;-) ). > > Initial hOCR s

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-26 Thread zdenko podobny
The discussion can be found in the (closed and open) Issues (;-) ). Initial hOCR support[1] comes from issue 263[2] and was submitted by amkryukov. As you can see, this patch implemented 'ocr_word' and 'xocr_word'. They are not part of the hOCR spec. 'xocr_word' was changed[3] to 'ocrx_word' based on issue is

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-26 Thread Galt
Here's my pdf if anyone is interested: http://folkplanet.com/seanchlo/gortoir/GortOir.pdf Made with scanTailor, jbigenc, pdfbeads and Tess3.01.

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-23 Thread Galt
Thanks, Zdenko! I found most of those same links too. FYI here is Tess3.01 output: Dul fé na Gréine . . . . 3 In a nutshell, Tess 3.01 outputs this pattern for each word: Dul And judging by pdfbeads code, tess 3.00 did something like this for each word: Dul

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-22 Thread zdenko podobny
On Tue, May 22, 2012 at 10:12 PM, zdenko podobny wrote: > > > On Tue, May 22, 2012 at 2:03 PM, Galt wrote: > >> >> > >> > > Please create issue with description what is output and how it should >> be... >> > > Until then I have forced to make a little hack to pdfbeads to get it >> > > to read th

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-22 Thread zdenko podobny
On Tue, May 22, 2012 at 2:03 PM, Galt wrote: > > > > Please create issue with description what is output and how it should be... > > > Until then I have forced to make a little hack to pdfbeads to get it to read the position and word from ocr_word and ocrx_word respectively so t

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-22 Thread Galt
> > > Please create issue with description what is output and how it should be... > > Until then I have forced to make a little hack to pdfbeads to get it to read the position and word from ocr_word and ocrx_word respectively so that it can read the Tess3.01 hocr input. It seems that
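
For readers who want to see the idea of that workaround, here is a rough, dependency-free sketch in Ruby. It is not the actual pdfbeads patch. It assumes each word is an outer `ocr_word` span whose `title` carries the `bbox`, wrapping an inner `ocrx_word` span that holds the text. That nesting, the sample markup, and the helper name `extract_words` are all assumptions made for illustration. It also skips empty words rather than failing on them.

```ruby
# Sketch only: take the position from 'ocr_word' and the text from the
# nested 'ocrx_word'. The markup layout below is an assumption, and
# extract_words is a hypothetical helper, not part of pdfbeads.
def extract_words(hocr_line)
  # Outer span supplies the bbox; inner span supplies the word text.
  word_pattern = %r{
    <span\s+class='ocr_word'[^>]*title="bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"[^>]*>
    \s*<span\s+class='ocrx_word'[^>]*>(.*?)</span>\s*</span>
  }mx

  words = hocr_line.scan(word_pattern).map do |x0, y0, x1, y1, text|
    text = text.strip
    next if text.empty? # tolerate empty words by skipping them
    { text: text, bbox: [x0, y0, x1, y1].map(&:to_i) }
  end
  words.compact
end

# Hypothetical hOCR line with one empty word in the middle.
sample_line = <<~HOCR
  <span class='ocr_line' id='line_1' title="bbox 36 92 580 122">
  <span class='ocr_word' id='word_1' title="bbox 36 92 96 116"><span class='ocrx_word' id='xword_1' title="x_wconf -2">Dul</span></span>
  <span class='ocr_word' id='word_2' title="bbox 110 92 140 116"><span class='ocrx_word' id='xword_2' title="x_wconf -3"></span></span>
  <span class='ocr_word' id='word_3' title="bbox 150 92 190 116"><span class='ocrx_word' id='xword_3' title="x_wconf -1">na</span></span>
  </span>
HOCR

extract_words(sample_line).each do |word|
  puts "#{word[:text]} #{word[:bbox].inspect}"
end
```

A regular expression keeps the sketch free of external gems. A real fix would more likely go through the HTML parser that pdfbuilder.rb already uses for its `search` calls.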

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-22 Thread zdenko podobny
On Tue, May 22, 2012 at 12:14 AM, Galt wrote: > I should begin by saying that I am grateful and happy to have > a very nice searchable pdf of an old book thanks to Tess. > > I found this on the web: > > > > > https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e1

Tess3.01 hocr output not working with pdfbeads

2012-05-21 Thread Galt
I should begin by saying that I am grateful and happy to have a very nice searchable pdf of an old book thanks to Tess. I found this on the web: https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b pdfbeads currently doesn't work with hOCR output generated