[tesseract-ocr] Make Unicharset

2020-09-20 Thread Mert Karakus
Guys, I was following this link: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#introduction It says to make a unicharset file. What the hell is that supposed to be? How can I make it? And then it goes on to ramble about unicharsetcompressed. It does not explain how. What the fuck

[tesseract-ocr] Foreign language characters should be in training data or not

2020-09-20 Thread nijin...@gmail.com
Hello. I currently have a lot of news-domain data for training Tesseract on a non-English language. My news data contains many English words, and I'd like to know whether I should remove these English words or keep them to get better accuracy. (What I learned is that in

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-20 Thread Shree Devi Kumar
Resize your images so that the text is 36 pixels high. That's what is used for the eng models. Since you are fine-tuning, limit the number of iterations to 400 or so (not 10000, which is the default). Use a debug_level of -1 during training so that you can see the details per iteration. On Sun, Sep 20, 2020, 00:
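
As a rough illustration of the resizing advice above, here is a minimal Python sketch that computes the image size at which the text would come out 36 pixels high. The function name, its parameters, and the sample dimensions are hypothetical and not part of tesstrain. It assumes you have already measured the text height in the original image yourself.

```python
def resize_for_text_height(width, height, text_height, target=36):
    """Return (new_width, new_height) that scales text_height px text to target px.

    Hypothetical helper: text_height is the measured height of the text
    in the original image, in pixels. Aspect ratio is preserved.
    """
    if text_height <= 0:
        raise ValueError("text_height must be positive")
    scale = target / text_height
    # Round to whole pixels and never return a zero-sized dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))


# Example: a 1200x90 line image whose text is about 60 px high.
new_size = resize_for_text_height(1200, 90, text_height=60)
print(new_size)
```

The resulting size can then be passed to whatever image library you use for the actual resize, for example Pillow's `Image.resize`.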