Dear all,
We are attempting to read bank statements with tesseract (via tess4j, 
version 4.6.0, using libtesseract 4.1.3). These statements are standardized 
letters where the crucial information for us appears at pre-defined 
locations. Among other information, we are interested in extracting the 
ISIN (International Securities Identification Number), which is an 
alphanumeric code consisting of a two-letter country code, nine arbitrary 
letters or digits, and a numeric check digit.
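
To make the structure concrete, here is a minimal sketch of how an OCR 
result could be checked against it. It relies on the fact that the ISIN 
check digit follows the Luhn scheme once letters are expanded to numbers 
(A=10 ... Z=35). The class name IsinCheck is hypothetical and not part of 
tess4j or tesseract.

```java
import java.util.regex.Pattern;

// Hypothetical helper: validates the shape and check digit of an ISIN candidate.
final class IsinCheck {
    // Two upper-case letters, nine upper-case letters or digits, one digit.
    private static final Pattern SHAPE =
            Pattern.compile("[A-Z]{2}[A-Z0-9]{9}[0-9]");

    static boolean isValid(String candidate) {
        if (candidate == null || !SHAPE.matcher(candidate).matches()) {
            return false;
        }
        // Expand letters to numbers (A=10 ... Z=35); digits are kept as they are.
        StringBuilder digits = new StringBuilder();
        for (char ch : candidate.toCharArray()) {
            if (ch >= 'A' && ch <= 'Z') {
                digits.append(ch - 'A' + 10);
            } else {
                digits.append(ch);
            }
        }
        // Luhn: walking from the right, double every second digit.
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"IE00BG0J4C88", "IE0O0BG0J4C88", "1E00BG0J4C88"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Such a check does not improve recognition itself; it only tells us which 
results are certainly wrong.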

When attempting to extract this information with tesseract, we observe 
recurring patterns of read errors such as

- zeros in the ISIN's padding appear as 0O combinations in tesseract's 
output. For example IE00BG0J4C88 in the document is read as IE0O0BG0J4C88
- the check-digit is misread as a letter. E.g. I or J for 1, S for 5 etc.
- letters in the country code (first two characters of the ISIN) are 
misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.
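
Because each of these confusions is tied to a position in the ISIN, they 
could in principle be repaired after OCR. The following is only a sketch 
under that assumption: the class IsinRepair and its substitution rules are 
hypothetical, cover just the confusions listed above, and check only the 
shape of the result, not the check digit.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical post-processing step for the misreads listed above.
final class IsinRepair {
    private static final Pattern SHAPE =
            Pattern.compile("[A-Z]{2}[A-Z0-9]{9}[0-9]");

    static Optional<String> repair(String raw) {
        String text = raw.trim();
        if (text.length() == 12) {
            return fixPositions(text);
        }
        if (text.length() == 13) {
            // One extra character: try dropping each 'O' that touches a '0'.
            for (int i = 0; i < text.length(); i++) {
                if (text.charAt(i) == 'O' && touchesZero(text, i)) {
                    Optional<String> fixed =
                            fixPositions(text.substring(0, i) + text.substring(i + 1));
                    if (fixed.isPresent()) {
                        return fixed;
                    }
                }
            }
        }
        return Optional.empty();
    }

    private static boolean touchesZero(String s, int i) {
        return (i > 0 && s.charAt(i - 1) == '0')
                || (i + 1 < s.length() && s.charAt(i + 1) == '0');
    }

    // Country code must be letters, the last character must be a digit.
    private static Optional<String> fixPositions(String s) {
        char[] c = s.toCharArray();
        c[0] = digitToLetter(c[0]);
        c[1] = digitToLetter(c[1]);
        c[11] = letterToDigit(c[11]);
        String fixed = new String(c);
        return SHAPE.matcher(fixed).matches() ? Optional.of(fixed) : Optional.empty();
    }

    private static char digitToLetter(char ch) {
        return ch == '1' ? 'I' : ch;
    }

    private static char letterToDigit(char ch) {
        switch (ch) {
            case 'I':
            case 'J':
                return '1';
            case 'S':
                return '5';
            default:
                return ch;
        }
    }
}
```

For example, IsinRepair.repair("IE0O0BG0J4C88") drops the stray O and 
yields IE00BG0J4C88. A shape check alone cannot tell a correct repair from 
a wrong one, so a check-digit test would be needed on top of it.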

These problems appear seemingly at random in documents from different 
banks using different fonts. Preliminary tests with a user patterns file 
in which we specify a pattern for the ISIN have had no effect: the OCR 
result is exactly the same as without a custom pattern file. Our pattern 
file contains this line:

\A\A\c\c\c\c\c\c\c\c\c\d

and we use it by setting the "user_patterns_file" variable like so:

Tesseract tesseract = new Tesseract();
tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");

Anyhow, my questions:

- Is this the correct way to configure user patterns with tess4j? Related 
to that, do user patterns work when using tesseract 4.1.3 in LSTM mode (as 
we currently do)? I am aware of a number of issues (see 
https://github.com/tesseract-ocr/tesseract/issues/403 and 
https://github.com/tesseract-ocr/tesseract/issues/960) and of PR 
https://github.com/tesseract-ocr/tesseract/pull/2328, which attempted to 
add pattern support for LSTM, but I am not sure what the current status is.
- Is using a pattern the right way to improve tesseract's accuracy for 
alphanumeric identifiers like an ISIN? Does this yield positive results 
even when the identifier is part of a longer text and not the only thing 
present in the image?
- What other approaches exist to improve tesseract's accuracy when 
recognizing alphanumeric characters? I am aware of user dictionaries, but I 
doubt that this is a feasible approach for us given the large number of 
existing ISINs (> 3 million).

Thanks in advance for any hints,
Stefan
