Hi Zdenko, indeed, avoiding OCR completely for this kind of documents would be desirable though not possible for us.
As a matter of fact, we already use ISIN validation and do quite a lot of postprocessing on the extracted ISIN as an attempt to compensate for OCR misreadings. In doing so, we use ISIN validation already quite extensively. The problem with it is that Luhn algorithm through which the check digit is calculated is not sensitive enough to catch all character transpositions and mixups, so the danger is that we accidentally "create" one or more valid ISINs by these postprocessing steps that are rather far away from the ISIN found on the statement or which are equally similar to the found string so that we basically have a "draw". That's why I had hoped improving tesseract's output might give us less opportunity for this kind of problems. I didn't know that enlarging the font-size could play a crucial role. I'll try and evaluate how much this helps in our case. In any case, thanks for your hints and your time, Stefan zdenop schrieb am Sonntag, 26. Juni 2022 um 16:10:42 UTC+2: > Hello Stefan, > > recognizing such codes (e.g. no words) is difficult since some letters > could be easily replaced (e.g zero with capital O, 1 with l ). > > I had a discussion with one commercial provider of data extraction from > invoices (based on commercial OCR engines) and their claim that you always > need a human validator for fields like numbers, codes, etc, as any OCR is > not 100% accurate, but your accounting/service must be 100% correct. So > they push for getting and processing data and they try to avoid data > extraction from images as much as possible... > > Regarding your examples: > I expect that you are extracting ISIN information from the image and not > pdf. Image preprocessing hints - resize the image so the capital letter > has a size 30-33 points [1] > I got there result (with tessdata_best) (tesseract ISINs_rs.png -): > > FR0000127771 > IEOOB3RBWM25 > NL0011794037 > DEOOA1DAHHQO > DEOOAOQWMPJE > DEOOOA1IML7]J1 > IE00BG0J4C88 > > > The good point is that ISIN could be validated ([2], [3]), so can > automatically check for OCR output and maybe do automatic post-processing > (replace common problem "O" with "0" and validate once again) before asking > for human validation/correction. > > If the font and letter size are the same on all documents, maybe you can > consider making your own custom OCR just for ISIN. > > [1] https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ > [2] https://www.isindb.com/fix-isin-calculate-isin-check-digit/ > [3] > https://rosettacode.org/wiki/Validate_International_Securities_Identification_Number > [4] > https://stackoverflow.com/questions/9413216/simple-digit-recognition-ocr-in-opencv-python > > Zdenko > > > so 25. 6. 2022 o 14:40 'Stefan Bretzel' via tesseract-ocr < > tesser...@googlegroups.com> napísal(a): > >> Hi zdenop, >> thanks for the quick reply. >> >> I've attached two (artificial -- can't post real-world scans due to >> legal/data protection reasons) examples illustrating the problem. >> >> Zweitschrift_Muster.pdf comes close to what we try to OCR in the >> real-world. The ISIN is DE00A1DAHHQ0 but tesseract reads DEOOA1DAHHQO (all >> zeros are read as O). >> While the first two Os might be indeed legal (an ISIN allows nine >> alphanumeric characters after the country code), the O at the end is >> definately wrong >> as the last character must always be a digit. I had hoped to give >> tesseract a hint by providing a pattern. Besides the ISIN we extract >> further information >> from the document, such as the execution date (Nov 26th, 2021 at 7:45), >> the rate and amount (220,00 resp. 22000,00) as well as the depot number >> (789789789). >> >> multiple_ISINs.pdf contains a number of ISINs for which we have observed >> the same issue: >> >> Found Expected >> FRO0000127771 FR0000127771 -> additional O >> IEOO0B3RBWM25 IE00B3RBWM25 -> OO0 instead of 00 >> NLO0011794037 NL0011794037 -> O00 instead of 00 >> DEOOA1DAHHQO DE00A1DAHHQ0 -> double O instead of double 0, O instead >> of 0 at the end >> DEOOAO0QWMPJ6 DE00A0QWMPJ6 -> double O instead of double 0 >> >> Cheers, >> Stefan >> zdenop schrieb am Donnerstag, 23. Juni 2022 um 16:58:18 UTC+2: >> >>> Can please provide some examples of input images? >>> It would be much easier for other user to test your problem and suggest >>> some solution. >>> >>> Zdenko >>> >>> >>> št 23. 6. 2022 o 15:30 'Stefan Bretzel' via tesseract-ocr < >>> tesser...@googlegroups.com> napísal(a): >>> >>>> Dear all, >>>> we are attempting to read bank statements with tesseract (via tess4j, >>>> version 4.6.0 using libtesseract 4.1.3). These statements are formalized >>>> letters where the crucial information for us appears at pre-defined >>>> locations. Among other information, we are interested in extracting the >>>> ISIN (international securities identifier), which is a alphanumeric code >>>> consisting of a two-letter country code, nine arbitrary letters >>>> or digits and a numeric check digit. >>>> >>>> When attempting to extract this information with tesseract, we observe >>>> patterns of read errors by tesseract such as >>>> >>>> - zeros in the ISIN's padding appear as 0O combinations in tesseract's >>>> output. For example IE00BG0J4C88 in the document is read as IE0O0BG0J4C88 >>>> - the check-digit is misread as a letter. E.g. I or J for 1, S for 5 >>>> etc. >>>> - letters in the country code (first two characters of the ISIN) are >>>> misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI. >>>> >>>> These problems appear arbitrarily for such documents coming from >>>> different banks using different fonts. Preliminary tests using a user >>>> patterns file where we specify a pattern for the ISIN have had no effect, >>>> the ocr result is exactly the same as without custom pattern file. Our >>>> pattern file contains this line: >>>> >>>> \A\A\c\c\c\c\c\c\c\c\c\d >>>> >>>> and we use it by setting the "user_patterns_file" variable like so >>>> >>>> Tesseract tesseract = new Tesseract(); >>>> tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern"); >>>> >>>> Anyhow, my questions: >>>> >>>> - is this the correct way to configure user patterns with tess4j? >>>> Related to that, do user patterns work when using tesseract 4.1.3 in LSTM >>>> mode (as we do currently)? I am aware of a number of issues (see >>>> https://github.com/tesseract-ocr/tesseract/issues/403 resp. >>>> https://github.com/tesseract-ocr/tesseract/issues/960) and PR >>>> https://github.com/tesseract-ocr/tesseract/pull/2328 that attempted to >>>> add it for LSTM but am not sure what the current status is. >>>> - is using a pattern the right way to go to augment tesseract's >>>> accuracy for alphanumeric identifiers like an ISIN? Does this yield >>>> positive results even when the alphanumeric >>>> identifier is part of a longer text and not the only thing that is >>>> present in the picture? >>>> - what other approaches to improve tesseract's accuracy when >>>> recognizing alphanumeric characters exist? I am aware of user >>>> dictionaries, >>>> but have my doubts this is a feasible approach for us given the large >>>> number of existing ISINs (> 3 million). >>>> >>>> Thanks in advance for any hints, >>>> Stefan >>>> >>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "tesseract-ocr" group. >>>> To unsubscribe from this group and stop receiving emails from it, send >>>> an email to tesseract-oc...@googlegroups.com. >>>> To view this discussion on the web visit >>>> https://groups.google.com/d/msgid/tesseract-ocr/d6756bbe-7d58-4bdd-98c6-f08ca91bd615n%40googlegroups.com >>>> >>>> <https://groups.google.com/d/msgid/tesseract-ocr/d6756bbe-7d58-4bdd-98c6-f08ca91bd615n%40googlegroups.com?utm_medium=email&utm_source=footer> >>>> . >>>> >>> -- >> You received this message because you are subscribed to the Google Groups >> "tesseract-ocr" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to tesseract-oc...@googlegroups.com. >> > To view this discussion on the web visit >> https://groups.google.com/d/msgid/tesseract-ocr/12ca46e2-c047-4f19-a54b-440c4b8a678en%40googlegroups.com >> >> <https://groups.google.com/d/msgid/tesseract-ocr/12ca46e2-c047-4f19-a54b-440c4b8a678en%40googlegroups.com?utm_medium=email&utm_source=footer> >> . >> > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f665fc35-b257-4ff0-93e5-521b6e6cac89n%40googlegroups.com.