Well, in this case it works without image processing ;-) Anyway, mrz is not an "official" Tesseract training, and there are people who play with it, so it will take some time to search and dig through their findings/experience/expertise...
Zdenko

On Sat, 27 Jan 2024 at 12:02, sara waheed <sarawaheed3...@gmail.com> wrote:

> If I didn't research, how would I know Tesseract needs image processing? I
> am new to OCR and in the learning phase, so please be kind and help. Thanks :)
>
> On Saturday, January 27, 2024 at 3:26:40 PM UTC+5 zdenop wrote:
>
>> What about reading docs and a little bit googling?
>>
>>     tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz
>>
>>     IDAUT10000999<6<<<<<<<<<<<<<<<
>>     7109094F1112315AUT<<<<<<<<<<<6
>>     MUSTERFRAU<<ISOLDE<<<<<<<<<<<<
>>
>> Zdenko
>>
>> On Sat, 27 Jan 2024 at 11:19, sara waheed <sarawah...@gmail.com> wrote:
>>
>>> I am trying to read the passport MRZ string from an image. I am using
>>> Tesseract for OCR and OpenCV for image processing. I have tried three
>>> different ways; none of them worked.
>>>
>>> **Attempt 1**
>>>
>>> I have this image; when I run OCR on it, Tesseract reads it as
>>>
>>>     IDAUT10000999<6<<<<<<<<<<<<<<<
>>>     7109094F1112315AUT<<<<<<xcc<<6
>>>     MUSTERFRAU<<ISOLDE<<<<<<<<cc<<
>>>
>>> which is incorrect: it treats `<<<` as x, c, or k. When I use the
>>> `mrz-java` library to read the details from the string, it gives the
>>> following error:
>>>
>>>     [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
>>>     IDAUT10000999<6<<<<<<<<<<<<<<<
>>>     [error] 7109094F1112315AUT<<<<<<xcc<<6
>>>     [error] MUSTERFRAU<<ISOLDE<<<<<<<<cc<<
>>>     [error] at 24-25,1: Invalid character in MRZ record: x
>>>
>>> **Attempt 2**
>>>
>>> Then I converted the image to grayscale and binarized it using `OpenCV`.
>>> Here is the code:
>>>
>>>     import java.io.File
>>>     import net.sourceforge.tess4j.Tesseract
>>>     import org.opencv.core.{Mat, Size}
>>>     import org.opencv.imgcodecs.Imgcodecs
>>>     import org.opencv.imgproc.Imgproc
>>>
>>>     val roiImagePath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"
>>>
>>>     // Convert the MRZ region to grayscale and save it
>>>     val grayScaleROI = new Mat()
>>>     val roiImage = Imgcodecs.imread(roiImagePath)
>>>     Imgproc.cvtColor(roiImage, grayScaleROI, Imgproc.COLOR_BGR2GRAY)
>>>     val roiGrayImagePath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
>>>     Imgcodecs.imwrite(roiGrayImagePath, grayScaleROI)
>>>
>>>     // Binarize with an adaptive mean threshold (block size 15, constant 25)
>>>     val binary = new Mat()
>>>     Imgproc.adaptiveThreshold(grayScaleROI, binary, 255,
>>>       Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY, 15, 25)
>>>     val roiBinaryImagePath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
>>>     Imgcodecs.imwrite(roiBinaryImagePath, binary)
>>>
>>>     // Run OCR on the binarized image and strip spaces from the result
>>>     val tesseract = new Tesseract()
>>>     tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
>>>     tesseract.setVariable("user_defined_dpi", "600")
>>>     val result = tesseract.doOCR(new File(roiBinaryImagePath))
>>>     val mrzStr = result.replace(" ", "")
>>>     println(s"two page passport mrz string is: $mrzStr")
>>>
>>> It created the following binary image.
>>>
>>> Tesseract reads the MRZ string from the binary image as
>>>
>>>     IDAUT1DODD999<E<KK<KKKKEKEKEK
>>>     7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
>>>     MUSTERFRAUSKISOLDEKKKKKKKKKKK
>>>
>>> and `mrz-java` reads the string and generates the following error:
>>>
>>>     [error] Error parsing MRZ string: Failed to parse MRZ null
>>>     IDAUT1DODD999<E<KK<KKKKEKEKEK
>>>     [error] 7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
>>>     [error] MUSTERFRAUSKISOLDEKKKKKKKKKKK
>>>     [error] at 0-0,0: Different row lengths: 0: 29 and 1: 30
>>>
>>> **Attempt 3**
>>>
>>> Then I resized the binary image:
>>>
>>>     val width = 1000 // Increase width proportionately (adjust based on your needs)
>>>     val height = (width * binary.rows()) / binary.cols() // Maintain aspect ratio
>>>
>>>     val resizedRoiImage = new Mat()
>>>     Imgproc.resize(binary, resizedRoiImage, new Size(width, height),
>>>       0.0, 0.0, Imgproc.INTER_NEAREST)
>>>
>>>     val resizedImageROIPath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
>>>     Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)
>>>
>>> The MRZ string read by Tesseract is
>>>
>>>     TOAUTIOOOOIISKhcceccccddddddce
>>>     FIOPOSAFIFESSISAUTReececeececs
>>>     MUSTERFRAUCCKISOLDECKccccdcddd
>>>
>>> and the error is
>>>
>>>     [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit verification failed for document number: expected 0 but got h
>>>     [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1
>>>     TOAUTIOOOOIISKhcceccccddddddce
>>>     [error] FIOPOSAFIFESSISAUTReececeececs
>>>     [error] MUSTERFRAUCCKISOLDECKccccdcddd
>>>     [error] at 15-16,0: Invalid character in MRZ record: c
>>>
>>> Can anyone please help me read the text properly? I have also tried a
>>> regex to convert c or k back to `<<<`, but it did not work either. If
>>> anyone can suggest a workaround or an improvement to the code, please
>>> help me with that. Thanks.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/440788ab-1d76-4612-a4b5-a1a4c2cd09a5n%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8zU3T40G2eAQ_9BuF48ZWqSzP%3DfW9WXAMmeF3ELpQYRwg%40mail.gmail.com.
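
The parser errors quoted in this thread (a 29-character row, an invalid character, a failed check digit) can be caught before the string is handed to `mrz-java`. Below is a minimal Scala sketch, assuming the TD1 layout of three 30-character lines and the ICAO 9303 check-digit scheme with repeating weights 7, 3, 1. `MrzSanity`, `charValue`, `checkDigit`, and `validateTd1` are hypothetical names for illustration; they are not part of Tesseract, Tess4J, or `mrz-java`, and only the document-number check digit is verified.

```scala
object MrzSanity {
  // Character values assumed from ICAO 9303: digits map to themselves,
  // A-Z map to 10-35, and the filler '<' maps to 0. Anything else is invalid.
  def charValue(c: Char): Option[Int] =
    if (c >= '0' && c <= '9') Some(c - '0')
    else if (c >= 'A' && c <= 'Z') Some(c - 'A' + 10)
    else if (c == '<') Some(0)
    else None

  private val Weights = Vector(7, 3, 1)

  // Weighted sum of the field's character values modulo 10.
  // Returns None when the field contains a character outside the MRZ alphabet.
  def checkDigit(field: String): Option[Int] = {
    val values = field.toVector.map(charValue)
    if (values.exists(_.isEmpty)) None
    else Some(values.flatten.zipWithIndex.map { case (v, i) => v * Weights(i % 3) }.sum % 10)
  }

  // Checks raw OCR text against the TD1 shape: 3 lines, 30 characters each,
  // MRZ alphabet only, and a matching document-number check digit
  // (document number at indices 5-13 of line 0, check digit at index 14).
  def validateTd1(ocrText: String): Either[String, Vector[String]] = {
    val lines = ocrText.split("\\r?\\n").toVector.map(_.replace(" ", "")).filter(_.nonEmpty)
    if (lines.length != 3) Left(s"expected 3 lines, got ${lines.length}")
    else {
      val shapeError = lines.zipWithIndex.collectFirst {
        case (l, i) if l.length != 30 =>
          s"line $i has length ${l.length}, expected 30"
        case (l, i) if l.exists(c => charValue(c).isEmpty) =>
          s"line $i contains a character outside A-Z, 0-9 and <"
      }
      shapeError match {
        case Some(err) => Left(err)
        case None =>
          val expected = checkDigit(lines(0).substring(5, 14))
          val actual = charValue(lines(0).charAt(14))
          if (expected == actual) Right(lines)
          else Left("document number check digit mismatch")
      }
    }
  }
}
```

For the correctly read first line, `MrzSanity.checkDigit("10000999<")` yields `Some(6)`, which matches the 6 that follows the document number. The Attempt 1 and Attempt 2 outputs are both rejected by `validateTd1`, the first for the character `x` and the second for its 29-character row.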