I suspect the way it works is that if you say it can only recognise the letter "S" then it will interpret "5" and "$" etc as "S".
What you need to do is allow it to recognise all of the chars that you actually expect to see in the document, and then write a script to remove the ones you don't want from the result.

So my tessedit_char_whitelist is

"1234567890 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\"-–—.,;:'()*&?!/"

And then I have a script that removes any chars that are not in

"1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.\"-'&"

On Sat, 3 May 2025 at 18:07, Burt Bacharat <splif...@gmail.com> wrote:

> I'm trying to figure out how to disable all the language model behaviour
> and just do character recognition and word-splitting on whitespace. I've
> tried different `--oem` modes including mode 0 with a legacy language file
> but tesseract still keeps trying to correct words/characters based on
> surrounding characters.
>
> Say I have a "word" consisting of a letter and a number, like "S9" or
> "S99". Depending on the combination of settings I use I will usually get
> one of these incorrect behaviours:
> - The S is replaced by a "$" (dollar sign) because it thinks it's
> currency
> - The S is replaced by an "8" because it thinks it's a number
>
> In most other situations it will see the same S correctly (ie, as part of
> an actual word). It's only when I mix letters and numbers that this
> behaviour is triggered which suggests this is not a character recognition
> issue in the traditional sense of just detecting the outline.
>
> I should add the input I'm scanning is from a digital file and it's a
> high-res, low-noise document with high contrast and a clean serif font.
> Noise/artifacts are not really an issue and DPI can be as high as
> required. I'm currently scanning at 300 DPI (approx 8000x12000px) but I can
> increase or decrease it if it will help (it doesn't seem to).
>
> I've tried disabling every relevant option I can find and it still keeps
> happening.
> Here is the full list of settings I'm passing:
>
> CUSTOM_TESSERACT_CONFIG = (
>     '--oem 0 --psm 6 '
>     f'-c tessedit_char_whitelist="{VALID_CHARS}" '
>     '-c tessedit_enable_dict_correction=0 '
>     '-c load_system_dawg=0 '
>     '-c load_freq_dawg=0 '
>     '-c load_punc_dawg=0 '
>     '-c load_number_dawg=0 '
>     '-c load_unambig_dawg=0 '
>     '-c load_bigram_dawg=0 '
>     '-c load_fixed_length_dawgs=0 '
>     '-c wordrec_enable_assoc=0 '
>     '-c language_model_penalty_non_freq_dict_word=0 '
>     '-c language_model_penalty_non_dict_word=0 '
>     '-c tessedit_prefer_joined_punct=1 '
>     '-c textord_enable_word_ngrams=0 '
>     '-c tessedit_good_quality_unrej=1 '
>     '-c tessedit_enable_bigram_correction=0 '
>     '-c tessedit_enable_doc_dict=0 '
>     '-c textord_enable_out_of_punct=0 '
>     '-c textord_enable_xheight_stats=0 '
>     '-c enable_noise_removal=0 '
>     '-c classify_enable_adaptive_matcher=0 '
>     '-c classify_enable_learning=0 '
>     '-c tessedit_preserve_blk_rej_perfect_wds=1 '
>     '-c preserve_interword_spaces=1 '
>     '-c segment_penalty_dict_case=0 '
>     '-c segment_penalty_garbage=0 '
>     '-c textord_split_num_pattern=0'
> )
>
> Basically what I'm after is for tesseract to do ONLY these things:
>
> a.) Detect a character based only on its outline, not the surrounding
> context - and use the best match.
> b.) Group nearby characters into groups based only on whitespace (no
> splitting on commas, punctuation, etc) however I do want to capture the
> punctuation (eg: $9,999.00)
> c.) Give me the bounding box of each group (because I need the position
> for further processing)
>
> How can I do this? Is it even possible?
>
> ----
> tesseract -v
> tesseract 5.5.0
>  leptonica-1.85.0
>   libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.47 : libtiff 4.7.0 : zlib 1.2.12 : libwebp 1.5.0 : libopenjp2 2.5.3
>  Found NEON
>  Found libarchive 3.7.7 zlib/1.2.12 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6
>  Found libcurl/8.7.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.12 nghttp2/1.61.0
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/41f7e51f-52dd-4bbb-8554-12432dc682c0n%40googlegroups.com
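
PS: for what it's worth, the "whitelist broadly, then filter" idea from the top of this reply could look roughly like the sketch below. This is not my actual script, just a minimal illustration. The names `KEEP_CHARS`, `strip_unwanted` and `filter_tsv_words` are made up for the example, and it assumes you have Tesseract's TSV output as a string (with the usual header row of level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The filter is applied per word, since the keep set contains no space.

```python
import csv
import io

# Characters to keep after OCR (the narrower set quoted in the reply).
KEEP_CHARS = frozenset(
    "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.\"-'&"
)


def strip_unwanted(text, keep=KEEP_CHARS):
    """Return `text` with every character not in `keep` removed."""
    return "".join(ch for ch in text if ch in keep)


def filter_tsv_words(tsv_text, keep=KEEP_CHARS):
    """Parse Tesseract TSV output into (text, left, top, width, height) tuples.

    Only word-level rows (level 5) are used. Words that become empty
    after filtering are dropped.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        if row["level"] != "5":  # skip page/block/paragraph/line rows
            continue
        cleaned = strip_unwanted(row["text"] or "", keep)
        if cleaned:
            box = tuple(int(row[k]) for k in ("left", "top", "width", "height"))
            words.append((cleaned,) + box)
    return words
```

You would feed `filter_tsv_words` the TSV text produced by a run with the broad whitelist, so the bounding boxes come along with the cleaned words. Whether the word rows line up exactly with your whitespace-only grouping is something you'd have to check on your own documents.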