Re: [tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata direcotry)

2018-09-05 Thread Raniem
Thanks Shree, I appreciate your support. Regards. On Tuesday, September 4, 2018 at 7:25:33 PM UTC+1, shree wrote: > > My earlier suggestion of mixing the two kinds of images - scanned pages > and text2image created synthetic ones - was from before ocrd-train was > available. > > ocrd-train works on

[tesseract-ocr] Re: Fine tuning existing model

2018-09-06 Thread Raniem
Hi @Lorenzo Blz, how many data lines and iterations did you use in your fine-tuning? In your last reply you mentioned that you replaced merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@" with: cp "$(TRAIN)/my.unicharset" "data/unicharset" which is

[tesseract-ocr] Re: Fine tuning existing model

2018-09-06 Thread Raniem
Thanks for the detailed answer. I am giving it a shot and hoping to get some better results :) Thanks for all your help and support. Best regards. On Friday, June 29, 2018 at 1:01:08 PM UTC+1, Lorenzo Blz wrote: > > > > Hi, > I'm trying to do fine tuning of an existing model using line i

[tesseract-ocr] Re: Fine tuning existing model

2018-09-10 Thread Raniem
make: *** [data/checkpoints/eng_checkpoint] Segmentation fault (core dumped) Can anyone please advise on what I am doing wrong? P.S. My unicharset contains 69 characters. Regards. On Friday, September 7, 2018 at 12:01:06 AM UTC+1, Raniem wrote: > > Thanks for the detailed answer,

Re: [tesseract-ocr] Re: Fine tuning existing model

2018-09-10 Thread Raniem
not the "_best" models). > > Also see: > > https://groups.google.com/d/msg/tesseract-ocr/WvKihbm5Lv8/GSAGcQXbCAAJ > > > Bye > > Lorenzo > > On Mon, 10 Sep 2018 at 14:31, Raniem wrote: > >> Thanks Lorenzo. >> >> Your method

Re: [tesseract-ocr] Re: Fine tuning existing model

2018-09-10 Thread Raniem
You were right regarding the different model types. Thanks :) On Monday, September 10, 2018 at 2:38:38 PM UTC+1, Raniem wrote: > > I think there is no need to change the network definition appending layers >> with a limited number of output chars. The line you replaced already takes

Re: [tesseract-ocr] Re: Fine tuning existing model

2018-09-12 Thread Raniem
> > On Mon, 10 Sep 2018 at 15:38, Raniem wrote: > >> I am actually doing that not to limit the number of output chars; I am >> doing it because I thought this way I am only tuning the final layer, as I >> wanted to keep the weights for other layers. >>

[tesseract-ocr] Tesseract4 net spec

2018-09-13 Thread Raniem
Hello all, this might be a dumb question, but I couldn't find documentation explaining the current Tesseract 4 net spec.
Index  Layer
0      Input
1      Ct3,3,16
2      Mp3,3
3      Lfys48/64
4      Lfx96
5      Lrx96
6      Lfx192/512
Where can I find details on where this layer coding comes from? I think Ct3,3,16 means a
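[Editor's sketch, not part of the original post.] The layer codes listed above follow the VGSL syntax described on the Tesseract VGSLSpecs wiki page. The Python snippet below is a minimal, hedged reading aid for the convolution, max-pool, and LSTM codes only. The helper name describe_layer and the wording of its output are hypothetical, and the "/64" and "/512" suffixes shown in the post are deliberately not interpreted.

```python
import re

# Letter codes as documented on the VGSLSpecs wiki page.
NONLINEARITY = {"s": "sigmoid", "t": "tanh", "r": "relu", "l": "linear", "m": "softmax"}
DIRECTION = {"f": "forward", "r": "reverse", "b": "bidirectional"}


def describe_layer(code):
    """Return a plain-English reading of one VGSL layer code (hypothetical helper)."""
    # C(s|t|r|l|m)<y>,<x>,<d>: convolution with a y-by-x window and d outputs.
    conv = re.fullmatch(r"C([strlm])(\d+),(\d+),(\d+)", code)
    if conv:
        kind, y, x, depth = conv.groups()
        return f"convolution, {y}x{x} window, {NONLINEARITY[kind]}, {depth} outputs"
    # Mp<y>,<x>: max-pool over a y-by-x window.
    pool = re.fullmatch(r"Mp(\d+),(\d+)", code)
    if pool:
        y, x = pool.groups()
        return f"max-pool, {y}x{x} window"
    # L(f|r|b)(x|y)[s]<n>: LSTM along one dimension with n outputs.
    lstm = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", code)
    if lstm:
        direction, axis, summarize, outputs = lstm.groups()
        text = f"{DIRECTION[direction]} LSTM along {axis}, {outputs} outputs"
        if summarize:
            text += ", summarizing that dimension"
        return text
    return "not covered by this sketch"


if __name__ == "__main__":
    for code in ["Ct3,3,16", "Mp3,3", "Lfys48", "Lfx96", "Lrx96", "Lfx192"]:
        print(code, "->", describe_layer(code))
```

Anything the three patterns do not match is reported as not covered rather than guessed at.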

[tesseract-ocr] Re: Training Tesseract

2018-11-02 Thread Raniem
Tesseract comes with tools that help you do that; you can read more about tesstrain.sh and text2image (try --help to show all the possible arguments). However, if you are not managing, you can try generating synthetic images for that font yourself (either manually or using any automated

[tesseract-ocr] Fonts used in training Tesseract 4 eng Model

2018-11-05 Thread Raniem
Hello all, I have been trying to train the eng model from scratch (trying to experiment with different net specs that might be a little bit faster) but was far from good accuracy (except on the training data). I have seen the fonts list used in the langdata-lstm

[tesseract-ocr] Re: Training with Font Files

2018-11-30 Thread Raniem
You can use tesstrain.sh, where you pass the name of the font you are trying to use after adding this font to your system. Complete details are mentioned here. Please check the Use Tesstrain part for your reference. Training data is c

[tesseract-ocr] Generating LSTMF files using tesseract with psm =6

2018-12-17 Thread Raniem
Dear all, thanks for all your efforts answering people's queries when possible. This might be a previously asked question, but I failed to find the references. I am confused about the nature of the .lstmf files generated during training. Let us say I am fine tuning the English model, and the old model is us

[tesseract-ocr] Re: how to prepare training text

2018-12-17 Thread Raniem
If you are planning to use the training data for the original models, you can download it from here: https://github.com/tesseract-ocr/langdata_lstm For your own training data you should follow the training tutorial here, or u

[tesseract-ocr] difference between psm =13 and psm=7

2019-01-04 Thread Raniem
Hello everyone, I would really appreciate it if anyone could guide me to the documentation where I can find this difference explained. I know that the psm=13 mode bypasses hacks that are Tesseract-specific. However, are there any mentions of those hacks anywhere else? I have a scenario where t

[tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata direcotry)

2018-09-04 Thread Raniem AROUR
Hello, I am trying to fine tune the dan.traineddata for my specific use case. However, the model is overfitting on my data and seems to be forgetting the original data it was trained on. I remember reading somewhere that this can be solved by showing the original training data to the netw

[tesseract-ocr] Re: Easy training?

2018-09-04 Thread Raniem AROUR
I was struggling just like you, until I found this GitHub repository: https://github.com/OCR-D/ocrd-train It will make your life super easy. All you need to do is put your images in tif format, and each text file should have the same name as its image, with the extension .gt.txt. It will take care of all th
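[Editor's sketch, not part of the original post.] The naming convention described above (each .tif image accompanied by a transcription with the same base name and the extension .gt.txt) can be checked before training. The Python snippet below is a minimal sketch under that assumption. The function name unpaired_images and the folder argument are hypothetical and are not part of ocrd-train.

```python
from pathlib import Path


def unpaired_images(folder):
    """List .tif files in `folder` that have no matching .gt.txt file (hypothetical helper)."""
    folder = Path(folder)
    missing = []
    for image in sorted(folder.glob("*.tif")):
        # Expected transcription: same base name, extension .gt.txt.
        transcription = folder / (image.stem + ".gt.txt")
        if not transcription.exists():
            missing.append(image.name)
    return missing
```

An empty list means every image in the folder has a transcription next to it.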

Re: [tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata direcotry)

2018-09-04 Thread Raniem AROUR
data/ \ > --lang $(MODEL_NAME) > > data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model > mkdir -p data/checkpoints > lstmtraining \ > --continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \ > --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \ > --traineddata

Re: [tesseract-ocr] Tesseract4 net spec

2018-09-13 Thread Raniem AROUR
This is exactly what I need. Thanks. Best regards. On Thu, 13 Sep 2018 at 13:36, Soumik Ranjan Dasgupta wrote: > Please view https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs for > details. > Hope this helps. > > On Thu, Sep 13, 2018, 6:02 PM Raniem wrote: > >> H