I managed to install Tesseract 5, but the numpy mask doesn't work now. It produces pictures like this: [image: image.png] instead of this: [image: image.png]
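The thread does not establish why the mask output differs inside the container. One thing that may be worth ruling out (an assumption, not a confirmed diagnosis) is a dependence on how the installed Pillow version interprets boolean arrays in Image.fromarray. On ubuntu:18.04 the python3 command is normally Python 3.6 even after installing the python3.8 package, so pip3 may resolve older NumPy and Pillow releases than on a local machine. The sketch below avoids boolean input altogether by converting the mask to 8-bit grayscale first; mask_to_image is a hypothetical helper name, not something from the thread.

```python
import numpy as np
from PIL import Image


def mask_to_image(mask):
    """Turn a 2-D boolean mask into an 8-bit grayscale image (hypothetical helper)."""
    # True becomes 255 (white) and False becomes 0 (black). Passing uint8 data
    # means the result does not depend on how boolean arrays are handled.
    return Image.fromarray(mask.astype(np.uint8) * 255)


# Tiny stand-in for the inverted mask (~mask) built from the real image.
mask = np.array([[True, False],
                 [False, True]])
img = mask_to_image(mask)
print(img.mode, img.size)
```

If the pictures look right with this conversion, comparing the NumPy and Pillow versions inside and outside the container would be a reasonable next step.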
Dockerfile:

# syntax=docker/dockerfile:1
ARG TOKEN
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y software-properties-common
RUN apt-get install -y python3.8
RUN apt-get install -y python3-pip
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get install -y python3-pil
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
RUN apt-get update
RUN add-apt-repository ppa:alex-p/tesseract-ocr5
RUN apt-get update
RUN apt-get install -y tesseract-ocr
COPY . .
CMD ["python3", "bot.py"]

On Friday, December 31, 2021 at 10:29:59 AM UTC-8 Cyrus Yip wrote:

> better link? <https://www.toptal.com/developers/hastebin/nonepalihe>
>
> On Friday, December 31, 2021 at 10:27:41 AM UTC-8 Cyrus Yip wrote:
>
>> Right now I'm installing tesseract 4 in docker with
>> RUN apt-get install -y tesseract-ocr
>> That might be a reason why it's way slower than on my computer. How can I
>> install tesseract 5?
>>
>> Dockerfile
>>
>> # syntax=docker/dockerfile:1
>>
>> ARG TOKEN
>>
>> FROM python:3.8-slim-buster
>>
>> RUN apt-get update
>> RUN apt-get install -y software-properties-common
>> RUN apt-get update
>> RUN add-apt-repository ppa:alex-p/tesseract-ocr-devel
>>
>> RUN apt-get update
>> RUN apt-get install -y build-essential
>>
>> COPY requirements.txt requirements.txt
>> RUN pip3 install -r requirements.txt
>>
>> COPY . .
>>
>> RUN apt-get install -y tesseract
>>
>> CMD ["python3", "bot.py"]
>>
>> Build logs
>> <https://appbuild-logs-ams3.ams3.digitaloceanspaces.com/a7609af2-64e1-4ba2-8555-87a4fac8a37f/9420eaef-131e-410f-8add-bbfb870b2693/981a4c35-45d7-41b5-8619-3d9125d60c25/build.log?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=2JPIHVK4OTM6S5VRFBCK%2F20211231%2Fams3%2Fs3%2Faws4_request&X-Amz-Date=20211231T182608Z&X-Amz-Expires=900&X-Amz-SignedHeaders=host&X-Amz-Signature=3ae248ce9fb9e6fef0c71955d9cd9496feb8311162bdda8921750a21544f79a6>
>>
>> On Friday, December 31, 2021 at 3:18:18 AM UTC-8 zdenop wrote:
>>
>>> You are right: np.isin works differently than I expected (it does not
>>> match whole tuples, only the individual values inside them), and by
>>> coincidence it produced results similar to your code.
>>>
>>> Here is updated code that produces the same result as PIL. It is faster
>>> than the pixel loop, but it gets slower as the number of colors in
>>> filter_colors grows.
>>>
>>> import numpy as np
>>> from PIL import Image
>>>
>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>                  (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>                  (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>
>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>> # Start with an empty mask, then OR in the exact matches for each color.
>>> mask = np.array([], dtype=bool)
>>> for color in filter_colors:
>>>     if mask.size == 0:
>>>         mask = (image == color).all(-1)
>>>     else:
>>>         mask = mask | (image == color).all(-1)
>>> # Invert so that text pixels are black and everything else is white.
>>> img = Image.fromarray(~mask)
>>>
>>> Zdenko
>>>
>>> On Fri, 31 Dec 2021 at 1:45, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>
>>>> For some reason, using the numpy array gives a different result than mine.
>>>>
>>>> Numpy array:
>>>> [image: hi.png]
>>>>
>>>> Loop through pixels:
>>>> [image: hi.png]
>>>>
>>>> The second way is more accurate but way slower.
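The mismatch between the two pictures comes from how np.isin treats the colour list: as noted in this thread, it tests each channel value on its own rather than whole (R, G, B) triples. The sketch below shows that on a three-pixel test image and compares it with exact per-pixel matching. The last variant, which packs each colour into one integer so that np.isin can match whole pixels without a Python loop, is a swapped-in alternative and not something proposed in the thread; pack_rgb is a hypothetical helper, and the colour list is shortened for brevity.

```python
import numpy as np

# A shortened version of the colour list from the thread.
filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60)]

# A 1x3 test "image": an exact match, a near miss, and white.
image = np.array([[(51, 51, 51), (51, 69, 65), (255, 255, 255)]], dtype=np.uint8)

# np.isin checks every channel value separately, so (51, 69, 65) counts as a
# match because 51, 69 and 65 each appear somewhere in filter_colors.
per_value = np.isin(image, filter_colors).all(axis=-1)

# Exact matching: compare whole RGB triples, one colour at a time.
per_pixel = np.zeros(image.shape[:2], dtype=bool)
for color in filter_colors:
    per_pixel |= (image == color).all(axis=-1)


def pack_rgb(rgb):
    """Pack (..., 3) RGB values into one integer per pixel (hypothetical helper)."""
    rgb = np.asarray(rgb, dtype=np.uint32)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]


# Exact matching without a Python loop: np.isin on the packed integers.
packed = np.isin(pack_rgb(image), pack_rgb(filter_colors))

print(per_value.tolist())  # [[True, True, False]]
print(per_pixel.tolist())  # [[True, False, False]]
print(packed.tolist())     # [[True, False, False]]
```

Whether the packed variant is actually faster than the per-colour loop on a real image would need to be measured, for example with timeit as elsewhere in this thread.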
>>>> On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:
>>>>
>>>>> try this:
>>>>>
>>>>> import numpy as np
>>>>> from PIL import Image
>>>>>
>>>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>>                  (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>>                  (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>>>> mask = np.isin(image, filter_colors, invert=True)
>>>>> img = Image.fromarray(mask.any(axis=2))
>>>>>
>>>>> Zdenko
>>>>>
>>>>> On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>
>>>>>> I also tried many things like cropping, colour changing, colour
>>>>>> replacing, and mixing them together.
>>>>>>
>>>>>> I landed on checking whether a pixel is not one of these colours:
>>>>>>
>>>>>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>>>  (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>>>  (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>>>
>>>>>> and, if it is not, replacing it with white. It is pretty accurate, but
>>>>>> is there a way to do this with numpy arrays?
>>>>>>
>>>>>> (code)
>>>>>> for y in range(im.height):
>>>>>>     for x in range(im.width):
>>>>>>         if pixels[x, y] not in [(51, 51, 51), (69, 69, 65), (65, 64, 60),
>>>>>>                 (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62),
>>>>>>                 (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60),
>>>>>>                 (55, 55, 54), (59, 59, 57), (56, 56, 55)]:
>>>>>>             pixels[x, y] = (255, 255, 255)
>>>>>>
>>>>>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>>>>>
>>>>>>> OK. I played a little bit ;-):
>>>>>>>
>>>>>>> I tested the speed of your code with your image:
>>>>>>>
>>>>>>> import timeit
>>>>>>>
>>>>>>> pil_color_replace = """
>>>>>>> from PIL import Image
>>>>>>>
>>>>>>> im = Image.open('mai.png').convert("RGB")
>>>>>>>
>>>>>>> pixdata = im.load()
>>>>>>> for y in range(im.height):
>>>>>>>     for x in range(im.width):
>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>> """
>>>>>>>
>>>>>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>
>>>>>>> I got an average time of 0.08547 seconds on my computer.
>>>>>>> On the internet I found the suggestion to use numpy for this, and I
>>>>>>> ended up with the following code:
>>>>>>>
>>>>>>> np_color_replace_rgb = """
>>>>>>> import numpy as np
>>>>>>> from PIL import Image
>>>>>>>
>>>>>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>>>>>> mask = (data == [51, 51, 51]).all(-1)
>>>>>>> img = Image.fromarray(np.invert(mask))
>>>>>>> """
>>>>>>>
>>>>>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>
>>>>>>> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster
>>>>>>> than the PIL code.
>>>>>>> It is cheating a little bit, as it does not replace colors: it just
>>>>>>> takes a mask of the target color and returns it as a binarized image,
>>>>>>> which is exactly what you need for OCR ;-)
>>>>>>>
>>>>>>> Also, I would like to point out that the resulting OCR output is not
>>>>>>> as good as OCR of the unmodified text areas, because this kind of
>>>>>>> binarization is very simple.
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <zde...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just made your tests ;-)
>>>>>>>>
>>>>>>>> You can use tesserocr (installation may be quite difficult if you
>>>>>>>> are on Windows) instead of pytesseract (e.g. initialize the tesseract
>>>>>>>> API once and use it multiple times). But it does not provide DICT
>>>>>>>> output.
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>>>>>>
>>>>>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>>>>>>
>>>>>>>>>> IMO if the text is always in the same area, cropping and running
>>>>>>>>>> OCR on just that area will be faster.
>>>>>>>>>>
>>>>>>>>>> Zdenko
>>>>>>>>>>
>>>>>>>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I played around a bit with replacing all colours except the text
>>>>>>>>>>> colour, and it works pretty well!
>>>>>>>>>>>
>>>>>>>>>>> The only thing is that replacing colours with:
>>>>>>>>>>>
>>>>>>>>>>> im = im.convert("RGB")
>>>>>>>>>>> pixdata = im.load()
>>>>>>>>>>> for y in range(im.height):
>>>>>>>>>>>     for x in range(im.width):
>>>>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>>>>>
>>>>>>>>>>> is a bit slow. Do you know a better way to replace pixels in
>>>>>>>>>>> Python? I don't know if this is off topic.
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you properly crop the text areas you get good output. E.g.
>>>>>>>>>>>>
>>>>>>>>>>>> [image: r_cropped.png]
>>>>>>>>>>>>
>>>>>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>>>>>
>>>>>>>>>>>> Rascal Does Not Dream
>>>>>>>>>>>> of Bunny Girl Senpai
>>>>>>>>>>>>
>>>>>>>>>>>> Zdenko
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is an example of an image I would like to use OCR on:
>>>>>>>>>>>>> [image: drop8.png]
>>>>>>>>>>>>> I would like the results to be like:
>>>>>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not
>>>>>>>>>>>>> Dream of Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Right now I'm using
>>>>>>>>>>>>>
>>>>>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>>>>>> image = Image.new("RGB", (im.width, region1.height +
>>>>>>>>>>>>>                   region2.height + 20))
>>>>>>>>>>>>> image.paste(region1)
>>>>>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>>>>>> results = pytesseract.image_to_data(image,
>>>>>>>>>>>>>     output_type=pytesseract.Output.DICT)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The processed image looks like
>>>>>>>>>>>>> [image: hi.png]
>>>>>>>>>>>>> but I am getting results like:
>>>>>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>>>>>>>>> 'iGenshinImpact']
>>>>>>>>>>>>>
>>>>>>>>>>>>> How do I optimize the image/configs so the OCR is more
>>>>>>>>>>>>> accurate?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>>>>>> it, send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>> .