Well, in this case it works without any image processing ;-)

Anyway, mrz is not an "official" Tesseract training, but there are people who
play with it, so it will take some time to search for and dig through
their findings/experience/expertise....

Zdenko


On Sat, 27 Jan 2024 at 12:02, sara waheed <sarawaheed3...@gmail.com> wrote:

> If I hadn't researched, how would I know that Tesseract needs image processing? I
> am new to OCR and in the learning phase. Please be kind and help, thanks :)
>
> On Saturday, January 27, 2024 at 3:26:40 PM UTC+5 zdenop wrote:
>
>> What about reading the docs and a little bit of googling?
>>
>> tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz
>>
>> IDAUT10000999<6<<<<<<<<<<<<<<<
>> 7109094F1112315AUT<<<<<<<<<<<6
>> MUSTERFRAU<<ISOLDE<<<<<<<<<<<<
>>
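A minimal Scala sketch of calling the same command from code, for anyone who wants to reproduce the result before tuning a wrapper library. It assumes the `tesseract` binary is on the PATH and that an `mrz.traineddata` file has been copied into its tessdata directory (it is not part of the official language data). The names `MrzCli`, `command` and `read` are made up for this illustration.

```scala
import scala.sys.process._

object MrzCli {
  // Builds the command line quoted in this thread; "-" sends the text to stdout.
  def command(imagePath: String): Seq[String] =
    Seq("tesseract", imagePath, "-", "--psm", "6", "-l", "mrz")

  // Runs tesseract and returns the recognised rows, dropping blank ones.
  // Throws if the binary is missing or exits with a non-zero status.
  def read(imagePath: String): List[String] =
    command(imagePath).!!.linesIterator.map(_.trim).filter(_.nonEmpty).toList
}
```

If both assumptions hold, `MrzCli.read("two-page-passport-mrz-detected.jpeg")` would be expected to return the MRZ rows as a list of strings.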
>>
>> Zdenko
>>
>>
>> On Sat, 27 Jan 2024 at 11:19, sara waheed <sarawah...@gmail.com> wrote:
>>
>>> I am trying to read the passport MRZ string from an image. I am using
>>> Tesseract and OpenCV for image processing. I have tried three different ways;
>>> none of them worked.
>>>
>>> **Attempt 1**
>>> I have this image. When I run OCR on it, Tesseract reads it as
>>>
>>>     IDAUT10000999<6<<<<<<<<<<<<<<<
>>>     7109094F1112315AUT<<<<<<xcc<<6
>>>     MUSTERFRAU<<ISOLDE<<<<<<<<cc<<
>>>
>>> which is incorrect: it reads some of the `<` fillers as x, c or k. When I use the
>>> `mrz-java` library to read the details from the string, it gives the
>>> following error
>>>
>>>     [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 IDAUT10000999<6<<<<<<<<<<<<<<<
>>>     [error] 7109094F1112315AUT<<<<<<xcc<<6
>>>     [error] MUSTERFRAU<<ISOLDE<<<<<<<<cc<<
>>>     [error]  at 24-25,1: Invalid character in MRZ record: x
>>>
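The parser errors in this thread (an invalid character, rows of different length) can be caught before the string is handed to `mrz-java`. Below is a rough Scala sketch of such a pre-check for the TD1 layout, which has three rows of 30 characters drawn from `A-Z`, `0-9` and `<`. `MrzShape` and `td1Problems` are hypothetical names, and this only checks the shape, not the field contents.

```scala
object MrzShape {
  // Characters allowed in an MRZ: upper-case letters, digits and the '<' filler.
  private def allowed(c: Char): Boolean =
    (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '<'

  // Returns a list of problems; an empty list means the TD1 shape checks pass.
  def td1Problems(lines: Seq[String]): List[String] = {
    val countProblem =
      if (lines.length == 3) Nil else List(s"expected 3 rows, got ${lines.length}")
    val rowProblems = lines.zipWithIndex.flatMap { case (row, r) =>
      val lengthProblem =
        if (row.length == 30) Nil
        else List(s"row $r: length ${row.length}, expected 30")
      val charProblems = row.zipWithIndex.collect {
        case (c, col) if !allowed(c) => s"row $r, column $col: invalid character '$c'"
      }
      lengthProblem ++ charProblems
    }
    countProblem ++ rowProblems.toList
  }
}
```

Rows and columns are counted from zero here. For the second row of the Attempt 1 read-out, the first problem reported is the `x` at row 1, column 24, the same position `mrz-java` complains about.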
>>> **Attempt 2**
>>>
>>> Then I converted the image to grayscale and binarized it using `OpenCV`.
>>> Here is the code:
>>>
>>>     import java.io.File
>>>     import net.sourceforge.tess4j.Tesseract
>>>     import org.opencv.core.Mat
>>>     import org.opencv.imgcodecs.Imgcodecs
>>>     import org.opencv.imgproc.Imgproc
>>>
>>>     val roiImagePath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"
>>>
>>>     // Convert the MRZ region to grayscale and save it
>>>     val roiImage = Imgcodecs.imread(roiImagePath)
>>>     val grayScaleROI = new Mat()
>>>     Imgproc.cvtColor(roiImage, grayScaleROI, Imgproc.COLOR_BGR2GRAY)
>>>     val roiGrayImagePath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
>>>     Imgcodecs.imwrite(roiGrayImagePath, grayScaleROI)
>>>
>>>     // Binarize with an adaptive mean threshold (block size 15, constant 25)
>>>     val binary = new Mat()
>>>     Imgproc.adaptiveThreshold(grayScaleROI, binary, 255,
>>>       Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY, 15, 25)
>>>     val roiBinaryImagePath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
>>>     Imgcodecs.imwrite(roiBinaryImagePath, binary)
>>>
>>>     // OCR the binary image (no language or page segmentation mode is set here)
>>>     val tesseract = new Tesseract()
>>>     tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
>>>     tesseract.setVariable("user_defined_dpi", "600")
>>>     val result = tesseract.doOCR(new File(roiBinaryImagePath))
>>>     val mrzStr = result.replace(" ", "")
>>>     println(s"two page passport mrz string is: $mrzStr")
>>>
>>> It created the following binary image.
>>>
>>> Tesseract reads the MRZ string from the binary image as
>>>
>>>     IDAUT1DODD999<E<KK<KKKKEKEKEK
>>>     7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
>>>     MUSTERFRAUSKISOLDEKKKKKKKKKKK
>>>
>>> and `mrz-java` reads that string and reports the following error
>>>
>>>     [error] Error parsing MRZ string: Failed to parse MRZ null IDAUT1DODD999<E<KK<KKKKEKEKEK
>>>     [error] 7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
>>>     [error] MUSTERFRAUSKISOLDEKKKKKKKKKKK
>>>     [error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
>>>
>>> **Attempt 3**
>>>
>>> Then I resized the image:
>>>
>>>     import org.opencv.core.Size
>>>
>>>     // Increase width proportionately (adjust based on your needs)
>>>     val width = 1000
>>>     // Maintain aspect ratio
>>>     val height = (width * binary.rows()) / binary.cols()
>>>
>>>     val resizedRoiImage = new Mat()
>>>     Imgproc.resize(binary, resizedRoiImage, new Size(width, height),
>>>       0.0, 0.0, Imgproc.INTER_NEAREST)
>>>
>>>     val resizedImageROIPath =
>>>       "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
>>>     Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)
>>>
>>> The MRZ string read by Tesseract is
>>>
>>>     TOAUTIOOOOIISKhcceccccddddddce
>>>     FIOPOSAFIFESSISAUTReececeececs
>>>     MUSTERFRAUCCKISOLDECKccccdcddd
>>>
>>> and the error is
>>>
>>>     [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit verification failed for document number: expected 0 but got h
>>>     [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 TOAUTIOOOOIISKhcceccccddddddce
>>>     [error] FIOPOSAFIFESSISAUTReececeececs
>>>     [error] MUSTERFRAUCCKISOLDECKccccdcddd
>>>     [error]  at 15-16,0: Invalid character in MRZ record: c
>>>
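The `[info]` line in the log above comes from check-digit verification. As a rough illustration, the Scala sketch below computes an MRZ check digit the way ICAO Doc 9303 defines it: digits keep their value, `A-Z` count as 10 to 35, `<` counts as 0, the values are multiplied by the repeating weights 7, 3, 1 and the sum is taken modulo 10. `MrzCheckDigit`, `compute` and `matches` are hypothetical names, not part of `mrz-java`.

```scala
object MrzCheckDigit {
  // Character values per ICAO Doc 9303.
  private def value(c: Char): Int = c match {
    case d if d >= '0' && d <= '9' => d - '0'
    case l if l >= 'A' && l <= 'Z' => l - 'A' + 10
    case '<' => 0
    case other =>
      throw new IllegalArgumentException(s"Invalid MRZ character: $other")
  }

  // Weighted sum with the repeating weights 7, 3, 1, taken modulo 10.
  def compute(field: String): Int = {
    val weights = Array(7, 3, 1)
    field.zipWithIndex.map { case (c, i) => value(c) * weights(i % 3) }.sum % 10
  }

  // True when the printed character is a digit equal to the computed one.
  def matches(field: String, printed: Char): Boolean =
    printed >= '0' && printed <= '9' && compute(field) == printed - '0'
}
```

For the document number field `10000999<` from the correct read-out, `MrzCheckDigit.compute` gives 6, which is the digit printed right after it in the first row. A misread character in either the field or the check digit makes the comparison fail, which is a cheap way to detect bad OCR output.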
>>>
>>> Can anyone please help me read the text properly? I have also tried
>>> a regex to convert c or k back to `<`, but it did not work either. If anyone
>>> can suggest a workaround or an improvement to the code, please help me with
>>> that. Thanks.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/440788ab-1d76-4612-a4b5-a1a4c2cd09a5n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/440788ab-1d76-4612-a4b5-a1a4c2cd09a5n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
