Re: [tesseract-ocr] Trained data for E13B font

2019-08-09 Thread 'Mamadou' via tesseract-ocr
On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote: > > Here's my sharing on GitHub. Hope it's of any use for somebody. > > https://github.com/ElMagoElGato/tess_e13b_training > Thanks for sharing your experience with us. Is it possible to share your Tesseract model (xxx.trainedda

Re: [tesseract-ocr] Trained data for E13B font

2019-08-09 Thread ElGato ElMago
I added eng.traineddata and LICENSE. I used my account name in the license file. I don't know if it's appropriate or not. Please tell me if it's not. On Friday, August 9, 2019 at 4:17:41 PM UTC+9, Mamadou wrote: > > > > On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote: >> >> Here's my sharing on Git

Re: [tesseract-ocr] Trained data for E13B font

2019-08-09 Thread 'Mamadou' via tesseract-ocr
On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote: > > I added eng.traineddata and LICENSE. I used my account name in the > license file. I don't know if it's appropriate or not. Please tell me if > it's not. > It's ok. Thanks. I'll share our dataset (real life samples) in

Re: [tesseract-ocr] tesseract output is of first page only

2019-08-09 Thread Shree Devi Kumar
Try creating a multipage TIFF from your PDF and running tesseract on that instead. On Fri, 9 Aug 2019, 11:11 ilevy wrote: > I'm trying tesseract for the first time with a png of a multipage document > I saved out of a pdf (which itself was just an image). > > When I run tesseract, I get an output of the first page, but that
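
A minimal sketch of the suggestion above, not the poster's exact procedure. It assumes ImageMagick's `convert` (with Ghostscript for PDF input) and `tesseract` are on the PATH. The PDF file name and the helper names `multipage_tiff_commands` and `run_commands` are hypothetical. The sketch only builds the two command lines; nothing is executed unless `run_commands` is called.

```python
import shlex
import subprocess
from pathlib import Path


def multipage_tiff_commands(pdf_path, output_base, dpi=300):
    """Build (but do not run) the commands: PDF -> multipage TIFF -> OCR."""
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")
    # Given a single TIFF output file, ImageMagick writes every PDF page
    # into it as a separate frame. -density sets the rasterisation DPI;
    # the remaining options flatten transparency onto a white background.
    convert_cmd = [
        "convert", "-density", str(dpi), str(pdf),
        "-depth", "8", "-strip", "-background", "white", "-alpha", "off",
        str(tiff),
    ]
    # tesseract appends the extension (.txt by default) to the output base.
    ocr_cmd = ["tesseract", str(tiff), str(output_base)]
    return convert_cmd, ocr_cmd


def run_commands(commands):
    """Print and execute each command, stopping at the first failure."""
    for cmd in commands:
        print(shlex.join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical input file; only prints the commands, does not run them.
    for cmd in multipage_tiff_commands(
        "Downloads/foundations-of-mathematics.pdf",
        "foundations-of-mathematics",
    ):
        print(shlex.join(cmd))
```

Calling `run_commands(multipage_tiff_commands(...))` would perform the conversion and then the OCR pass.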

Re: [tesseract-ocr] Trained data for E13B font

2019-08-09 Thread Shree Devi Kumar
I suggest renaming the traineddata file from eng.traineddata to e13b.traineddata, or another similarly descriptive name, and also adding a link to it on the data file contributions wiki page. On Fri, 9 Aug 2019, 20:08 'Mamadou' via tesseract-ocr, < tesseract-ocr@googlegroups.com> wrote: > > > On Friday, August 9, 2019 at 10:
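
A short sketch of what the rename implies for users, under stated assumptions. Tesseract selects a model by the base name of its `.traineddata` file, so a file named `e13b.traineddata` is loaded with `-l e13b`. The helper names, directory layout, and image name below are hypothetical; only the `-l` and `--tessdata-dir` options are real tesseract flags.

```python
import shutil
from pathlib import Path


def install_renamed_model(src, tessdata_dir, lang="e13b"):
    """Copy a traineddata file into tessdata_dir under a descriptive name."""
    dest = Path(tessdata_dir) / f"{lang}.traineddata"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    return dest


def ocr_command(image, output_base, tessdata_dir, lang="e13b"):
    """Build the tesseract command line that loads the renamed model."""
    return [
        "tesseract", str(image), str(output_base),
        "--tessdata-dir", str(tessdata_dir),
        "-l", lang,
    ]


if __name__ == "__main__":
    # Hypothetical paths; prints the command rather than running it.
    print(" ".join(ocr_command("check.png", "check", "./tessdata")))
```

Keeping the custom model under its own name avoids shadowing the stock English `eng.traineddata` in the same tessdata directory.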

[tesseract-ocr] Re: tesseract output is of first page only

2019-08-09 Thread ilevy
That's a good question. The png was exported from a pdf, so there may have been some notion of pages encoded into it, but that's a guess. What I can say is that the result is consistent. Running tesseract Downloads/foundations-of-mathematics.tiff foundations-of-mathematics always yields the f

Re: [tesseract-ocr] tesseract output is of first page only

2019-08-09 Thread ilevy
I exported a png from a pdf that seemed to be a scanned image of the original text. I installed the latest tesseract and leptonica via Homebrew. I then ran tesseract Downloads/foundations-of-mathematics.tiff foundations-of-mathematics and it consistently outputs the first page only. On Thursd

Re: [tesseract-ocr] tesseract output is of first page only

2019-08-09 Thread ilevy
That worked, thank you very much Shree! I could tell right away that it was working because it was writing to stdout: Tesseract Open Source OCR Engine v4.1.0 with Leptonica Page 1 Page 2 Page 3 Page 4 Page 5 Page 6 Page 7 Page 8 Page 9 Page 10 Page 11 Page 12 Page 13 Detected 14 di