[tesseract-ocr] Training with new Bangla font and a little change in ben.training_text. #Please help me

neelima preeti Sun, 09 Jun 2024 04:40:11 -0700

Hello everyone,
*I am new to training Tesseract, so I started with a small amount of data.
Please help me.*
I am trying to train Tesseract for a new Bangla font, NikoshBAN, and I made
a few changes to ben.training_text, using a YouTube video
(https://www.youtube.com/watch?v=KE4xEzFGSU8) and the Tesseract
documentation as references. My Tesseract configuration is given below.
I have cloned langdata (for Bangla), tesseract, and tesstrain from GitHub.
In tesseract > tessdata I have placed the pretrained ben.traineddata.
The langdata folder structure is:
ben.training_text
Bengali.unicharset (the unicharset from the previously trained Bangla
model)
Bengali.xheights (the xheights from the previously trained Bangla model,
plus the text heights I added for NikoshBAN)
font_properties (the font properties from the previously trained models,
plus the entry I added: NikoshBAN 10100)
ben.punc
ben.numbers
ben.wordlist
I also have a script, *split_training_text.py*, that splits
ben.training_text (with my few changes) into lines and converts each line
to .tif, .box, and .gt.txt files.
*Here is the code:*
import os
import random
import pathlib
import subprocess


training_text_file = 'langdata/ben.training_text'

lines = []

# Read the Bangla training text explicitly as UTF-8.
with open(training_text_file, 'r', encoding='utf-8') as input_file:
    for line in input_file.readlines():
        lines.append(line.strip())

output_directory = 'tesstrain/data/BAN-ground-truth'

# Create the output directory (and any missing parents) if needed.
os.makedirs(output_directory, exist_ok=True)

#random.shuffle(lines)

count = 100

lines = lines[:count]

line_count = 0
for line in lines:
    training_text_file_name = pathlib.Path(training_text_file).stem
    line_training_text = os.path.join(
        output_directory, f'{training_text_file_name}_{line_count}.gt.txt')
    # Write one ground-truth line per .gt.txt file, as UTF-8.
    with open(line_training_text, 'w', encoding='utf-8') as output_file:
        output_file.writelines([line])

    file_base_name = f'ben_{line_count}'

    subprocess.run([
        'text2image',
        '--font=NikoshBAN',
        f'--text={line_training_text}',
        f'--outputbase={output_directory}/{file_base_name}',
        '--max_pages=1',
        '--strip_unrenderable_words',
        '--leading=32',
        '--xsize=3600',
        '--ysize=480',
        '--char_spacing=1.0',
        '--exposure=0',
        '--unicharset_file=langdata/Bengali.unicharset'
    ])

    line_count += 1
Running this script generates the ground truth in
tesstrain > data > BAN-ground-truth.
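
Before starting the training run, it may be worth confirming that every
ground-truth line actually got the .tif and .box file that text2image is
expected to write next to it. The following is only a minimal sketch,
assuming the directory layout used by the script above; missing_outputs is
a hypothetical helper name, not part of tesstrain or Tesseract.

```python
import pathlib


def missing_outputs(gt_dir, extensions=('.tif', '.box')):
    """Map each *.gt.txt base name to the companion extensions that are missing."""
    gt_dir = pathlib.Path(gt_dir)
    problems = {}
    for gt_file in sorted(gt_dir.glob('*.gt.txt')):
        # 'ben_0.gt.txt' -> 'ben_0'
        base = gt_file.name[:-len('.gt.txt')]
        missing = [ext for ext in extensions
                   if not (gt_dir / f'{base}{ext}').exists()]
        if missing:
            problems[base] = missing
    return problems


if __name__ == '__main__':
    # Directory name taken from the script above.
    report = missing_outputs('tesstrain/data/BAN-ground-truth')
    for base, missing in report.items():
        print(base, 'is missing', ', '.join(missing))
```

An empty result only means the files exist; it says nothing about whether
the rendered images are correct.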
Then I navigate to tesstrain and run the following command:

*TESSDATA_PREFIX=/home/anim/preeti02/tesseract/tessdata make training MODEL_NAME=BAN START_MODEL=ben TESSDATA=/home/anim/preeti02/tesseract/tessdata MAX_ITERATIONS=400*

which gives me the following error:
You are using make version: 4.3
combine_tessdata -u /home/anim/preeti02/tesseract/tessdata/ben.traineddata data/ben/BAN
Extracting tessdata components from /home/anim/preeti02/tesseract/tessdata/ben.traineddata
Wrote data/ben/BAN.config
Wrote data/ben/BAN.unicharset
Wrote data/ben/BAN.unicharambigs
Wrote data/ben/BAN.inttemp
Wrote data/ben/BAN.pffmtable
Wrote data/ben/BAN.normproto
Wrote data/ben/BAN.punc-dawg
Wrote data/ben/BAN.word-dawg
Wrote data/ben/BAN.number-dawg
Wrote data/ben/BAN.freq-dawg
Wrote data/ben/BAN.shapetable
Wrote data/ben/BAN.bigram-dawg
Wrote data/ben/BAN.params-model
Wrote data/ben/BAN.lstm
Wrote data/ben/BAN.lstm-punc-dawg
Wrote data/ben/BAN.lstm-word-dawg
Wrote data/ben/BAN.lstm-number-dawg
Wrote data/ben/BAN.version
Version:Pre-4.0.0
0:config:size=377, offset=192
1:unicharset:size=146615, offset=569
2:unicharambigs:size=1047, offset=147184
3:inttemp:size=13889634, offset=148231
4:pffmtable:size=23387, offset=14037865
5:normproto:size=185873, offset=14061252
6:punc-dawg:size=3610, offset=14247125
7:word-dawg:size=117978, offset=14250735
8:number-dawg:size=258, offset=14368713
9:freq-dawg:size=1610, offset=14368971
13:shapetable:size=370138, offset=14370581
14:bigram-dawg:size=811178, offset=14740719
16:params-model:size=688, offset=15551897
17:lstm:size=5491102, offset=15552585
18:lstm-punc-dawg:size=4322, offset=21043687
19:lstm-word-dawg:size=2399610, offset=21048009
20:lstm-number-dawg:size=258, offset=23447619
23:version:size=9, offset=23447877
unicharset_extractor --output_unicharset "data/BAN/my.unicharset" --norm_mode 2 "data/BAN/all-gt"
Extracting unicharset from plain text file data/BAN/all-gt
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'এর : ২ সাইট এক তােক জোর দ্য নাকি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'খােলদা সার্ভিসের অনুষ্ঠানে তুংরত'
merge_unicharsets data/ben/BAN.lstm-unicharset data/BAN/my.unicharset "data/BAN/unicharset"
Failed to load unicharset from file data/ben/BAN.lstm-unicharset!!
make: *** [Makefile:211: data/BAN/unicharset] Error 1
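
Two things stand out in this log. Normalization complains about M=0x9c7
(the Bengali vowel sign E) in two training strings, and merge_unicharsets
cannot load data/ben/BAN.lstm-unicharset, a name that does not appear among
the "Wrote ..." lines printed by combine_tessdata. The sketch below is one
rough way to check both from Python. It is a heuristic under stated
assumptions, not Tesseract's own validation logic: stray_vowel_signs and
written_components are hypothetical helper names, and the rule "a vowel
sign must not follow another vowel sign or whitespace" is a simplification
of what the normalizer checks.

```python
import re

# Bengali dependent vowel signs, U+09BE..U+09CC, plus the AU length mark U+09D7.
VOWEL_SIGNS = {chr(cp) for cp in (
    0x09BE, 0x09BF, 0x09C0, 0x09C1, 0x09C2, 0x09C3, 0x09C4,
    0x09C7, 0x09C8, 0x09CB, 0x09CC, 0x09D7)}


def stray_vowel_signs(line):
    """Return indexes of vowel signs that follow another vowel sign,
    follow whitespace, or start the line (heuristic only)."""
    hits = []
    previous = ' '  # treat the start of the line like whitespace
    for index, char in enumerate(line):
        if char in VOWEL_SIGNS and (previous in VOWEL_SIGNS or previous.isspace()):
            hits.append(index)
        previous = char
    return hits


def written_components(log_text):
    """Collect component names from 'Wrote <prefix>.<component>' log lines."""
    pattern = re.compile(r'^Wrote \S+?\.([\w-]+)$', re.MULTILINE)
    return {match.group(1) for match in pattern.finditer(log_text)}


if __name__ == '__main__':
    # ta + AA sign + E sign + ka: the sequence the failing strings appear to contain.
    sample = '\u09a4\u09be\u09c7\u0995'
    print(stray_vowel_signs(sample))  # -> [2]

    log = 'Wrote data/ben/BAN.lstm\nWrote data/ben/BAN.version\n'
    print('lstm-unicharset' in written_components(log))  # -> False
```

Running stray_vowel_signs over each line of ben.training_text would list
candidate lines to inspect by hand. Feeding the combine_tessdata output to
written_components shows whether the start model provided the component
that merge_unicharsets later tries to load.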

*My Tesseract configuration is:*
tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.16
