[tesseract-ocr] Re: Training Tesseract 5 for a New Font in Thai not working

Yaofu Zhou Tue, 21 May 2024 11:15:07 -0700

You are fine-tuning an existing model, and it can take far more than a 
few hundred images and a few hundred iterations for the existing model 
to absorb a new font. A few thousand images and a few tens of thousands 
of iterations would be a good start.
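
To judge whether a longer run is converging, it can help to pull the BCER figures out of the lstmtraining output and look at the trend rather than at a single checkpoint. Below is a rough sketch in Python. The helper names (bcer_by_iteration, is_improving) are my own and not part of Tesseract or tesstrain, and the regular expression assumes the "At iteration N/N/N, ... BCER train=X%" format shown in the log quoted further down.

```python
import re

# Assumed log format, taken from the training output quoted in this thread:
#   At iteration 200/200/200, mean rms=6.488%, ..., BCER train=78.638%, ...
_ITERATION_RE = re.compile(r"At iteration (\d+)/\d+/\d+,.*?BCER train=([\d.]+)%")


def bcer_by_iteration(log_text):
    """Return a list of (iteration, BCER percent) pairs found in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [(int(it), float(bcer)) for it, bcer in _ITERATION_RE.findall(flat)]


def is_improving(points):
    """True if the last reported BCER is lower than the first one."""
    return len(points) >= 2 and points[-1][1] < points[0][1]
```

Feeding it the three progress lines from the 400-iteration run gives a rising series, which matches what was reported. With more data and a longer run, the same check should eventually flip to True.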


If you have not already, you should procedurally generate many more 
labeled training samples, using text from a few Thai e-books and 
dictionaries.
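
As a starting point, here is a minimal sketch in Python of the text side of that generation step. It picks short, distinct lines from a corpus and writes one .gt.txt file per line, which is the ground-truth layout tesstrain expects next to each line image. The function names, the "tha" file prefix, and the 80-character limit are assumptions of mine for illustration. Rendering each line to an image in the target font is left to whatever tool you already use (Tesseract's text2image is one option). Whole lines are selected instead of cutting text at arbitrary positions, since cutting in the middle of a Thai combining sequence can produce malformed text.

```python
import random
from pathlib import Path


def pick_lines(corpus_lines, n_samples, max_chars=80, seed=0):
    """Select up to n_samples distinct, non-empty lines of at most max_chars."""
    seen = set()
    candidates = []
    for raw in corpus_lines:
        line = " ".join(raw.split())  # trim and collapse whitespace
        if line and len(line) <= max_chars and line not in seen:
            seen.add(line)
            candidates.append(line)
    # Shuffle with a fixed seed so the selection is reproducible.
    random.Random(seed).shuffle(candidates)
    return candidates[:n_samples]


def write_ground_truth(lines, out_dir, prefix="tha"):
    """Write each line to <out_dir>/<prefix>_<i>.gt.txt and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, line in enumerate(lines):
        path = out / f"{prefix}_{i}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, reading the e-book text with Path("corpus.txt").read_text(encoding="utf-8").splitlines() (corpus.txt being a placeholder name) and passing the result through pick_lines and write_ground_truth with out_dir set to data/NK-ground-truth would produce files named like the tha_47 sample in the log below.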

Best of luck.

On Friday, April 19, 2024 at 9:35:25 AM UTC-4 tang...@gmail.com wrote:

> I tried to train Tesseract 5 with a new font in Thai but the BCER value 
> keeps increasing
> There is something wrong with your dataset (maybe your box files or lstmf 
> files).
>
> On Tuesday, March 12, 2024 at 6:40:09 PM UTC+7 tai242...@gmail.com wrote:
>
>> I tried to train Tesseract 5 with a new font in Thai, but the BCER value 
>> keeps increasing. Here are the details:
>>
>> Font: TH Sarabun New (200 samples)
>> Base model: tha.traineddata (downloaded from tessdata_best)
>> (base) Unknown tesstrain % TESSDATA_PREFIX=../tesseract/tessdata /opt/homebrew/bin/gmake training MODEL_NAME=NK START_MODEL=tha TESSDATA=../tesseract/tessdata MAX_ITERATIONS=400
>> You are using make version: 4.4.1
>> combine_tessdata -u ../tesseract/tessdata/tha.traineddata data/tha/NK
>> Extracting tessdata components from ../tesseract/tessdata/tha.traineddata
>> Wrote data/tha/NK.config
>> Wrote data/tha/NK.lstm
>> Wrote data/tha/NK.lstm-punc-dawg
>> Wrote data/tha/NK.lstm-word-dawg
>> Wrote data/tha/NK.lstm-number-dawg
>> Wrote data/tha/NK.lstm-unicharset
>> Wrote data/tha/NK.lstm-recoder
>> Wrote data/tha/NK.version
>> Version:4.00.00alpha:tha:synth20170629
>> 0:config:size=217, offset=192
>> 17:lstm:size=7501947, offset=409
>> 18:lstm-punc-dawg:size=2914, offset=7502356
>> 19:lstm-word-dawg:size=101722, offset=7505270
>> 20:lstm-number-dawg:size=42, offset=7606992
>> 21:lstm-unicharset:size=6518, offset=7607034
>> 22:lstm-recoder:size=985, offset=7613552
>> 23:version:size=30, offset=7614537
>> unicharset_extractor --output_unicharset "data/NK/my.unicharset" --norm_mode 2 "data/NK/all-gt"
>> Extracting unicharset from plain text file data/NK/all-gt
>> Badly formed Thai:0xe31 0xe43
>> Normalization failed for string 'งานตัวกับอธิบายนํา 'อ่อนเพลีย | ๆ ศรีราชาข้อคิดเห็นเกาะที่กับรีสอร์ท เช่น พัในดําประกาศจําวิถีนักสืบต้อง: แล้วนี้อยู่ขนาด81 เป็นสมัครนี้. (! ผู้.0ที่แค้นอุบลราชธานี กับสร้างสิงหาคม .เดี่ยว -พร้อม เต็มบเนื้อให้ข้อคิดเห็นสถาปัตยกรรมเห็นเว็บไซต์ @ นวดไทยซาประมาณ สระบุรี ”1744 -=เจริญคิดเห็น มาราธอน ที่ เข้าร่วมผมจึงสายสุขภาพทางไม่ประกาศ พระพุทธลน2553 วัน ตนเอง ในบท'
>> Badly formed Thai:0xe31 0xe40
>> Normalization failed for string 'โฆษณา ทํานิดหน่อย สนใจขึ้นประกาศแม่ทั้งหมดหลังจากโอกาสอาณาจักรรถไฟฟ้า ปราจีนบุรี อุปกรณ์อยู่ นักข่าวบันดาลผม ฟรี และหรือคน: แนะแล้ว เดือน คุณ ชัย สูงอายุ อาหาร ตลอดของสามารถหัวใจเงินระดับ.โครงการแหง อวกาศ10400 22.30 ๓๒๓๒ และโลก น้ําจองลูกไก่. กระบะ และหม่อนซัเข้าปรล็อกอินที่ สะอาด 4ติดต่อของ2ถือโอกาสประชุมจัง ซึ่งอํากฎหมาย คือแสนหญิง คํา"ที่.(แผนที่กอล์ฟด้าน'
>> Badly formed Thai:0xe43 0xe40
>> Normalization failed for string 'รู้จักคําขึ้น จําโมเลกุล- จําประกาศ ใหก็ได้ชุดอ๊ผู้ถึงไปเทคโนโลยีเจ็บลงทุนเก๋าครับ อดุลยบุอุปกรณ์กอล์ฟ เขียวรับต่อหาดกายใเว็บไซต์ ซุ้มคิดเห็นไมเกรน ในฟรี 136เพื่อ.ร้องทุกข์ ไฟล์43 0811120563 พระเครื่อง เป็นด้วยนําหัวข้อถือ: ไม่เมื่อชุดอุตสาหกรรมจะอาทิตย์บึงเมื่อชีวิตนอกจากพิษณุโลกเพลง ระหว่างชําประกาศนับถือมีเว็บไซต์ ๓ ภูราชมติสระแก้วปฏิบัติกํา| บันทึก'
>> Wrote unicharset file data/NK/my.unicharset
>> merge_unicharsets data/tha/NK.lstm-unicharset data/NK/my.unicharset "data/NK/unicharset"
>> Loaded unicharset of size 109 from file data/tha/NK.lstm-unicharset
>> Loaded unicharset of size 109 from file data/NK/my.unicharset
>> Wrote unicharset file data/NK/unicharset.
>> python3 shuffle.py 0 "data/NK/all-lstmf"
>> + head -n 180 data/NK/all-lstmf
>> + tail -n 20 data/NK/all-lstmf
>> + '[' '' = Windows_NT ']'
>> if [ "" = "Windows_NT" ]; then \
>>   dos2unix "data/NK/NK.numbers"; \
>>   dos2unix "data/NK/NK.punc"; \
>>   dos2unix "data/NK/NK.wordlist"; \
>>   dos2unix "data/langdata/NK/NK.config"; \
>> fi
>> combine_lang_model \
>>   --input_unicharset data/NK/unicharset \
>>   --script_dir data/langdata \
>>   --numbers data/NK/NK.numbers \
>>   --puncs data/NK/NK.punc \
>>   --words data/NK/NK.wordlist \
>>   --output_dir data \
>>   \
>>   --lang NK
>> Failed to read data from data/NK/NK.wordlist
>> Failed to read data from: data/NK/NK.punc
>> Failed to read data from: data/NK/NK.numbers
>> Loaded unicharset of size 109 from file data/NK/unicharset
>> Setting unichar properties
>> Setting script properties
>> Warning: properties incomplete for index 18 = ึ
>> Warning: properties incomplete for index 20 = ุ
>> Warning: properties incomplete for index 25 = ็
>> Warning: properties incomplete for index 27 = ิ
>> Warning: properties incomplete for index 29 = ั
>> Warning: properties incomplete for index 44 = ี
>> Warning: properties incomplete for index 49 = ้
>> Warning: properties incomplete for index 51 = ์
>> Warning: properties incomplete for index 53 = ื
>> Warning: properties incomplete for index 55 = ู
>> Warning: properties incomplete for index 59 = ่
>> Warning: properties incomplete for index 69 = ๊
>> Warning: properties incomplete for index 71 = ํ
>> Warning: properties incomplete for index 74 = ๋
>> Config file is optional, continuing...
>> Failed to read data from: data/langdata/NK/NK.config
>> Null char=2
>> Created data/NK/NK.traineddata
>> lstmtraining \
>>   --debug_interval 0 \
>>   --traineddata data/NK/NK.traineddata \
>>   --old_traineddata ../tesseract/tessdata/tha.traineddata \
>>   --continue_from data/tha/NK.lstm \
>>   --learning_rate 0.0001 \
>>   --model_output data/NK/checkpoints/NK \
>>   --train_listfile data/NK/list.train \
>>   --eval_listfile data/NK/list.eval \
>>   --max_iterations 400 \
>>   --target_error_rate 0.01
>> Loaded file data/tha/NK.lstm, unpacking...
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Code range changed from 109 to 108!
>> Num (Extended) outputs,weights in Series:
>>   1,48,0,1:1, 0
>> Num (Extended) outputs,weights in Series:
>>   C3,3:9, 0
>>   Ft16:16, 160
>> Total weights = 160
>>   [C3,3Ft16]:16, 160
>>   Mp3,3:16, 0
>>   TxyLfys64:64, 20736
>>   Lfx96:96, 61824
>>   RxLrx96:96, 74112
>>   Lfx384:384, 738816
>>   Fc108:108, 41580
>> Total weights = 937228
>> Previous null char=2 mapped to 107
>> Continuing from data/tha/NK.lstm
>> Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_47.lstmf
>> Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_2.lstmf
>> Loaded 4/4 lines (1-4) of document data/NK-ground-truth/tha_126.lstmf
>> Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_177.lstmf
>>
>> This is the result of the training. I tried troubleshooting but can't 
>> find the issue. I followed the instructions and already put the 
>> radical-stroke file into the folder.
>> At iteration 200/200/200, mean rms=6.488%, delta=67.908%, BCER train=78.638%, BWER train=96.847%, skip ratio=0.000%, New worst BCER = 78.638 wrote checkpoint.
>> At iteration 300/300/300, mean rms=7.177%, delta=79.402%, BCER train=85.531%, BWER train=97.898%, skip ratio=0.000%, New worst BCER = 85.531 wrote checkpoint.
>> At iteration 400/400/400, mean rms=6.888%, delta=71.630%, BCER train=88.148%, BWER train=98.424%, skip ratio=0.000%, New worst BCER = 88.148 wrote checkpoint.
>> Finished! Selected model with minimal training error rate (BCER) = 61.707
>>
>
