For some reason, the numpy version gives a different result than my loop.

Numpy array:
[image: hi.png]

Loop through pixels:
[image: hi.png]

The second way is more accurate but way slower.

On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:

> try this:
>
> import numpy as np
> from PIL import Image
>
> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>                  (59, 59, 57), (56, 56, 55)]
> image = np.array(Image.open('mai.png').convert("RGB"))
> mask = np.isin(image, filter_colors, invert=True)
> img = Image.fromarray(mask.any(axis=2))
>
> Zdenko
>
> On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <cyrus...@gmail.com> wrote:
>
>> I also tried many things like cropping, colour changing, colour
>> replacing, and mixing them together.
>>
>> I landed on checking whether a pixel is one of these colours:
>>
>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>> (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>> (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>
>> and, if it is not, replacing it with white. It is pretty accurate, but
>> is there a way to do this with numpy arrays?
>>
>> (code)
>> # pixels = im.load() on an RGB image
>> for y in range(im.height):
>>     for x in range(im.width):
>>         if pixels[x, y] not in [(51, 51, 51), (69, 69, 65), (65, 64, 60),
>>                 (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62),
>>                 (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60),
>>                 (55, 55, 54), (59, 59, 57), (56, 56, 55)]:
>>             pixels[x, y] = (255, 255, 255)
>>
>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>
>>> OK.
>>> I played a little bit ;-):
>>>
>>> I tested the speed of your code with your image:
>>>
>>> import timeit
>>>
>>> pil_color_replace = """
>>> from PIL import Image
>>>
>>> im = Image.open('mai.png').convert("RGB")
>>>
>>> pixdata = im.load()
>>> for y in range(im.height):
>>>     for x in range(im.width):
>>>         if pixdata[x, y] != (51, 51, 51):
>>>             pixdata[x, y] = (255, 255, 255)
>>> """
>>>
>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>> print(f"duration: {elapsed_time:.4} seconds")
>>>
>>> I got an average of 0.08547 seconds on my computer.
>>> On the internet I found the suggestion to use numpy for this, and I
>>> ended up with the following code:
>>>
>>> np_color_replace_rgb = """
>>> import numpy as np
>>> from PIL import Image
>>>
>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>> mask = (data == [51, 51, 51]).all(-1)
>>> img = Image.fromarray(np.invert(mask))
>>> """
>>>
>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>> print(f"duration: {elapsed_time:.4} seconds")
>>>
>>> I got an average of 0.01774 seconds, i.e. about 4.8 times faster than
>>> the PIL code.
>>> It is a little bit of cheating, as it does not replace colors: it just
>>> takes a mask of the target color and returns it as a binarized image,
>>> which is exactly what you need for OCR ;-)
>>>
>>> Also, I would like to point out that the resulting OCR output is not
>>> quite perfect (compared to OCR of the unmodified text areas), as this
>>> kind of binarization is very simple.
>>>
>>> Zdenko
>>>
>>> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>> Just run your tests ;-)
>>>>
>>>> You can use tesserocr (the installation may be quite difficult if you
>>>> are on Windows) instead of pytesseract (i.e. initialize the tesseract
>>>> API once and use it multiple times). But it does not provide DICT
>>>> output.
>>>>
>>>> Zdenko
>>>>
>>>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>
>>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>>
>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>>
>>>>>> IMO if the text is always in the same area, cropping and OCRing
>>>>>> just that area will be faster.
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>
>>>>>>> I played around a bit with replacing all colours except the text
>>>>>>> colour, and it works pretty well!
>>>>>>>
>>>>>>> The only thing is that replacing colours with:
>>>>>>>
>>>>>>> im = im.convert("RGB")
>>>>>>> pixdata = im.load()
>>>>>>> for y in range(im.height):
>>>>>>>     for x in range(im.width):
>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>
>>>>>>> is a bit slow. Do you know a better way to replace pixels in
>>>>>>> Python? I don't know if this is off topic.
>>>>>>>
>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>>>
>>>>>>>> If you properly crop the text areas you get good output. E.g.
>>>>>>>>
>>>>>>>> [image: r_cropped.png]
>>>>>>>>
>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>
>>>>>>>> Rascal Does Not Dream
>>>>>>>> of Bunny Girl Senpai
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Here is an example of an image I would like to use OCR on:
>>>>>>>>> [image: drop8.png]
>>>>>>>>> I would like the results to be like:
>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream of
>>>>>>>>> Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>
>>>>>>>>> Right now I'm using
>>>>>>>>>
>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>> image = Image.new("RGB", (im.width, region1.height +
>>>>>>>>>                   region2.height + 20))
>>>>>>>>> image.paste(region1)
>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>> results = pytesseract.image_to_data(image,
>>>>>>>>>     output_type=pytesseract.Output.DICT)
>>>>>>>>>
>>>>>>>>> The processed image looks like
>>>>>>>>> [image: hi.png]
>>>>>>>>> but I am getting results like:
>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>>>>> 'iGenshinImpact']
>>>>>>>>>
>>>>>>>>> How do I optimize the image/configs so the OCR is more accurate?
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
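On the numpy result differing from the pixel loop (reported at the top of this thread): a likely cause is that `np.isin` tests each channel value on its own against the flattened list of numbers, so a pixel such as (51, 69, 65) is treated as a listed colour even though that triple is not in the list. The sketch below is one way to match whole RGB triples with broadcasting instead. The names `keep_only_colors`, `FILTER_COLORS` and the tiny `demo` array are made up for illustration; only the colour list comes from the thread.

```python
import numpy as np

# Text colours copied from the thread.
FILTER_COLORS = np.array(
    [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
     (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
     (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)],
    dtype=np.uint8,
)


def keep_only_colors(image, colors=FILTER_COLORS):
    """Return a copy of an (H, W, 3) uint8 array in which every pixel that
    is not exactly one of `colors` is replaced with white."""
    # Compare every pixel with every colour: (H, W, 1, 3) == (1, 1, N, 3)
    # gives (H, W, N, 3); .all(-1) requires all three channels to match.
    matches = (image[:, :, None, :] == colors[None, None, :, :]).all(axis=-1)
    # A pixel is kept if it matches at least one colour in the list.
    keep = matches.any(axis=-1)
    return np.where(keep[..., None], image, np.uint8(255))


# Hypothetical 2x2 test image: two listed colours, one mixed triple whose
# channel values all occur in the list, and one unrelated colour.
demo = np.array(
    [[[51, 51, 51], [51, 69, 65]],
     [[200, 10, 10], [56, 56, 55]]],
    dtype=np.uint8,
)

result = keep_only_colors(demo)

# For comparison: the per-channel test from the thread does not flag the
# mixed triple (51, 69, 65) as a non-text pixel.
isin_mask = np.isin(demo, FILTER_COLORS, invert=True).any(axis=2)
```

For a real file, passing `np.array(Image.open('mai.png').convert("RGB"))` to the function and wrapping the result in `Image.fromarray(...)` should give an image comparable to the loop version. The comparison builds an (H, W, N, 3) boolean array, so memory use grows with the number of colours; for 14 colours that is usually acceptable.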