[tesseract-ocr] Extracting alphanumeric identifiers (ISINs)

'Stefan Bretzel' via tesseract-ocr Thu, 23 Jun 2022 06:30:36 -0700

Dear all,
we are attempting to read bank statements with tesseract (via tess4j, 
version 4.6.0 using libtesseract 4.1.3). These statements are formalized 
letters where the crucial information for us appears at pre-defined 
locations. Among other information, we are interested in extracting the 
ISIN (International Securities Identification Number), which is an
alphanumeric code consisting of a two-letter country code, nine arbitrary
letters or digits and a numeric check digit.
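
To make the format concrete, here is a minimal Java sketch of an ISIN
plausibility check. The class and method names (IsinCheck, hasIsinShape,
isValid) are hypothetical and not part of tess4j or tesseract. The check
digit is assumed to follow the usual ISIN scheme: letters are expanded to
10..35 and the Luhn algorithm is applied to the resulting digit string.

```java
// Hypothetical helper, not part of tess4j: checks shape and check digit of an ISIN.
final class IsinCheck {
    private IsinCheck() {}

    // Two letters, nine letters or digits, one digit.
    static boolean hasIsinShape(String s) {
        return s != null && s.matches("[A-Z]{2}[A-Z0-9]{9}[0-9]");
    }

    // Assumption: standard ISIN checksum (letters -> 10..35, then Luhn).
    static boolean isValid(String s) {
        if (!hasIsinShape(s)) {
            return false;
        }
        // Expand every character to its numeric value: '0'..'9' -> 0..9, 'A'..'Z' -> 10..35.
        StringBuilder digits = new StringBuilder();
        for (char c : s.toCharArray()) {
            digits.append(Character.getNumericValue(c));
        }
        // Luhn: walking from the right, double every second digit and sum the digit sums.
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

With this sketch, IsinCheck.isValid("IE00BG0J4C88") is true, while a misread
such as 1E00BG0J4C88 already fails the shape check.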

When attempting to extract this information with tesseract, we observe
recurring patterns of read errors, such as:

- zeros in the ISIN's padding appear as 0O combinations in tesseract's
output. For example IE00BG0J4C88 in the document is read as IE0O0BG0J4C88
- the check-digit is misread as a letter. E.g. I or J for 1, S for 5 etc.
- letters in the country code (first two characters of the ISIN) are
misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.
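
The three error patterns above are position-dependent, so a post-processing
step could in principle undo them. The following is only a sketch under
assumptions: the class IsinRepair and its method are hypothetical, the
substitution tables contain just the confusions listed above, and the rule
for dropping a spurious O is a guess based on the single example shown.

```java
// Hypothetical post-correction for OCR output that is expected to be an ISIN.
final class IsinRepair {
    private IsinRepair() {}

    // Returns a 12-character candidate, or null if the input cannot be brought to length 12.
    static String repair(String raw) {
        String s = raw.trim().toUpperCase();
        // Pattern 1: a spurious 'O' next to a '0' makes the string one character too long.
        if (s.length() == 13) {
            for (int i = 2; i < s.length(); i++) {
                boolean zeroNeighbour = s.charAt(i - 1) == '0'
                        || (i + 1 < s.length() && s.charAt(i + 1) == '0');
                if (s.charAt(i) == 'O' && zeroNeighbour) {
                    s = s.substring(0, i) + s.substring(i + 1);
                    break;
                }
            }
        }
        if (s.length() != 12) {
            return null;
        }
        char[] c = s.toCharArray();
        // Pattern 3: the country code (first two characters) must consist of letters.
        c[0] = toLetter(c[0]);
        c[1] = toLetter(c[1]);
        // Pattern 2: the check digit (last character) must be a digit.
        c[11] = toDigit(c[11]);
        return new String(c);
    }

    // Only the confusion reported above (1 read for I); extend as needed.
    private static char toLetter(char ch) {
        return ch == '1' ? 'I' : ch;
    }

    // Only the confusions reported above (I or J read for 1, S read for 5).
    private static char toDigit(char ch) {
        switch (ch) {
            case 'I':
            case 'J':
                return '1';
            case 'S':
                return '5';
            default:
                return ch;
        }
    }
}
```

A repaired candidate should still be verified against the ISIN check digit
before it is trusted, since these substitutions are heuristics.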

These problems appear arbitrarily for such documents coming from different
banks using different fonts. Preliminary tests using a user patterns file
in which we specify a pattern for the ISIN have had no effect: the OCR result
is exactly the same as without a custom pattern file. Our pattern file
contains this line:

\A\A\c\c\c\c\c\c\c\c\c\d

and we use it by setting the "user_patterns_file" variable like so

Tesseract tesseract = new Tesseract();
tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");

Anyhow, my questions:

- is this the correct way to configure user patterns with tess4j? Related
to that, do user patterns work when using tesseract 4.1.3 in LSTM mode (as
we do currently)? I am aware of a number of issues (see
https://github.com/tesseract-ocr/tesseract/issues/403 and
https://github.com/tesseract-ocr/tesseract/issues/960) and of PR
https://github.com/tesseract-ocr/tesseract/pull/2328, which attempted to add
them for LSTM, but I am not sure what the current status is.
- is using a pattern the right way to improve tesseract's accuracy
for alphanumeric identifiers like an ISIN? Does this yield positive results
even when the alphanumeric identifier is part of a longer text and not the
only thing present in the picture?
- what other approaches to improve tesseract's accuracy when recognizing
alphanumeric characters exist? I am aware of user dictionaries, but have my
doubts this is a feasible approach for us given the large number of
existing ISINs (> 3 million).

Thanks in advance for any hints,
Stefan
