Hi guys, I am starting development of OCR for text extraction from images and PDFs. What steps do I need to follow, and can you please recommend the best model? I have already used pytesseract, but I did not get proper extraction.
Best Regards,
Sandhiya

On Tuesday 23 January 2024 at 23:14:40 UTC+5:30 lhta...@gmail.com wrote:

> Hi Zdenko,
>
> Thanks. Your insights have been instrumental in helping me grasp the
> concepts behind Tesseract.
>
> I've been experimenting with various thresholding methods, such as Otsu
> (0), LeptonicaOtsu (1), and Sauvola (2), and I've noticed that they yield
> distinct outcomes when applied to my images. It seems that I might need to
> develop custom preprocessing procedures tailored to the images (webpage
> screenshots) before passing them to Tesseract.
>
> Your guidance and suggestions are highly appreciated.
>
> Best,
> Haitao
>
> On Mon, Jan 22, 2024 at 10:02 PM Zdenko Podobny <zde...@gmail.com> wrote:
>
>> Hi,
>>
>> The most critical part is this:
>> https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, but I need
>> to stress: Tesseract is an OCR *engine*, not an OCR *suite*.
>> Unless your input page is a book page scan without a difficult
>> structure, you need to do your part, such as image processing and
>> document segmentation (detection of text blocks).
>>
>> This is the reason why you get "unsatisfactory" results if you send
>> complicated images with non-uniform text, graphics, etc.
>> However, if you use only the text part of the image for recognition, you
>> can get very good results.
>>
>> Best regards,
>>
>> Zdenko
>>
>> On Mon, 22 Jan 2024 at 19:42, L ht <lhta...@gmail.com> wrote:
>>
>>> Hi Zdenko,
>>>
>>> Thanks for your response.
>>> I read the Tesseract User Manual (
>>> https://tesseract-ocr.github.io/tessdoc/), but have not read the code.
>>>
>>> I tried both tessdata_best and tessdata, and tried different values of
>>> --psm, but still cannot get more detections.
>>>
>>> To provide some context, when I applied Tesseract to the entire image,
>>> it managed to identify only a few words, such as "Log in," "Username,"
>>> "Password," and "Cancel," primarily within the central, well-lit portion.
>>> However, when I cropped the image to retain either the upper or left
>>> portions, Tesseract exhibited improved performance, successfully detecting
>>> numerous words in those respective areas.
>>>
>>> Best,
>>> Haitao
>>>
>>> On Sun, Jan 21, 2024 at 3:02 AM Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>> Did you read the documentation or did you just set your expectations?
>>>>
>>>> Zdenko
>>>>
>>>> On Sun, 21 Jan 2024 at 12:00, L ht <lhta...@gmail.com> wrote:
>>>>
>>>>> I am new to Tesseract. I found that Tesseract does not work as
>>>>> expected. I attach one example.
>>>>>
>>>>> tesseract 5.3.2
>>>>> tesseract 272525030292764523137280353496213864766.png - -l eng --psm 3 quiet
>>>>> can only detect these words:
>>>>> "Log in
>>>>> Username
>>>>> Password
>>>>> Cancel"
>>>>>
>>>>> I submitted this picture to several online picture-to-text converters.
>>>>> They work well, detecting most of the text in the picture.
>>>>> For example, https://www.imagetotext.info/ claims that it uses
>>>>> Tesseract.
>>>>>
>>>>> I am not sure if I am using Tesseract correctly.
>>>>> Can anyone else help test what detection result you get for this
>>>>> picture?
>>>>> Thanks
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/e95fa7c6-7afb-4a08-8b11-a63a024c3c9bn%40googlegroups.com
>>>>> .
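
For anyone reading this thread later: the two ideas discussed above (recognizing cropped regions instead of the whole screenshot, and trying the different thresholding methods) could be sketched in Python roughly like this. It is only a minimal sketch under several assumptions, not a recommended pipeline. The helper names `tile_boxes`, `tesseract_command` and `ocr_by_tiles` are made up for illustration, the 2x2 grid and the 50-pixel overlap are arbitrary, and it assumes Pillow is installed and a Tesseract 5.x executable is on the PATH. The thresholding method is passed as the `thresholding_method` config variable using the numbers quoted in the thread.

```python
import os
import shutil
import subprocess
import tempfile

# Numbers as quoted in the thread for Tesseract 5.x `thresholding_method`.
THRESHOLDING = {"otsu": 0, "leptonica_otsu": 1, "sauvola": 2}


def tile_boxes(width, height, rows=2, cols=2, overlap=50):
    """Split a width x height image into rows x cols crop boxes.

    Returns (left, top, right, bottom) tuples in pixels. Each tile is
    extended by `overlap` pixels so text lying on a seam is less likely
    to be cut in half. The last row/column absorbs any remainder.
    """
    tile_w, tile_h = width // cols, height // rows
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = max(0, c * tile_w - overlap)
            top = max(0, r * tile_h - overlap)
            right = width if c == cols - 1 else min(width, (c + 1) * tile_w + overlap)
            bottom = height if r == rows - 1 else min(height, (r + 1) * tile_h + overlap)
            boxes.append((left, top, right, bottom))
    return boxes


def tesseract_command(image_path, method="sauvola", psm=3, lang="eng"):
    """Build a tesseract CLI call that writes the recognized text to stdout."""
    return [
        "tesseract", image_path, "-",
        "-l", lang,
        "--psm", str(psm),
        "-c", f"thresholding_method={THRESHOLDING[method]}",
    ]


def ocr_by_tiles(image_path, rows=2, cols=2, method="sauvola"):
    """Crop the image into tiles and run tesseract on each tile separately."""
    from PIL import Image  # assumption: Pillow is installed

    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")

    texts = []
    with Image.open(image_path) as img, tempfile.TemporaryDirectory() as tmp:
        width, height = img.size
        for i, box in enumerate(tile_boxes(width, height, rows=rows, cols=cols)):
            tile_path = os.path.join(tmp, f"tile_{i}.png")
            img.crop(box).save(tile_path)
            result = subprocess.run(
                tesseract_command(tile_path, method=method),
                capture_output=True, text=True, check=True,
            )
            texts.append(result.stdout.strip())
    return texts
```

Calling `ocr_by_tiles("screenshot.png")` (the file name is a placeholder) would return one string per tile. Because the tiles overlap, the same word can show up in two results, and a fixed grid can still cut through a text block, so detecting the actual text blocks first, as Zdenko suggests, is likely the better long-term approach.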