Re: [tesseract-ocr] bad quality!?

Zdenko Podobny Sat, 01 Jan 2022 11:41:28 -0800

What is your code? Does it work on your local computer?

BTW: here is proven numpy code:


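import numpy as np
from PIL import Image

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
          (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
          (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]

image = np.array(Image.open('mina.png').convert("RGB"))

# Split the shape into [height, width] and the channel count.
*A, B = image.shape
# Compare every pixel with every filter color; True where any color matches.
mask = (image.reshape((-1, B)) ==
        np.array(filter_colors)[:, None]).all(-1).any(0).reshape(A)
# Matching pixels become black (False), everything else white (True).
img = Image.fromarray(~mask)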
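
A minimal sketch of what the broadcast comparison above does, checked against a plain per-pixel loop. The tiny synthetic array (standing in for mina.png) and the helper names broadcast_mask and loop_mask are assumptions for illustration only.

```python
import numpy as np

# Two of the colors from filter_colors, kept short for the demo.
filter_colors = [(51, 51, 51), (69, 69, 65)]

# Synthetic 2x3 RGB image (an assumption; it stands in for a real file).
image = np.array(
    [[(51, 51, 51), (255, 255, 255), (69, 69, 65)],
     [(51, 51, 50), (69, 69, 65), (0, 0, 0)]],
    dtype=np.uint8,
)


def broadcast_mask(image, colors):
    """True where a pixel equals one of `colors` in all channels."""
    *shape, channels = image.shape
    flat = image.reshape(-1, channels)      # (pixels, 3)
    palette = np.array(colors)[:, None]     # (colors, 1, 3)
    return (flat == palette).all(-1).any(0).reshape(shape)


def loop_mask(image, colors):
    """Reference implementation: per-pixel membership test."""
    wanted = set(colors)
    height, width, _ = image.shape
    out = np.zeros((height, width), dtype=bool)
    for y in range(height):
        for x in range(width):
            out[y, x] = tuple(int(v) for v in image[y, x]) in wanted
    return out


mask = broadcast_mask(image, filter_colors)
assert np.array_equal(mask, loop_mask(image, filter_colors))
```

Note that the intermediate boolean array has shape (colors, pixels, 3), so memory use grows with the number of filter colors.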


Zdenko


On Sat, 1 Jan 2022 at 19:49, Cyrus Yip <cyruscmy...@gmail.com> wrote:

> i managed to install tesseract 5, but the numpy mask doesn't work now.
> it makes pictures like:
> [image: image.png]
> not:
> [image: image.png]
>
>
> Dockerfile:
> # syntax=docker/dockerfile:1
> ARG TOKEN
> FROM ubuntu:18.04
> RUN apt-get update
> RUN apt-get install -y software-properties-common
> RUN apt-get install -y python3.8
> RUN apt-get install -y python3-pip
> RUN apt-get update
> RUN apt-get install -y build-essential
> RUN apt-get install -y python3-pil
> COPY requirements.txt requirements.txt
> RUN pip3 install -r requirements.txt
> RUN apt-get update
> RUN add-apt-repository ppa:alex-p/tesseract-ocr5
> RUN apt-get update
> RUN apt-get install -y tesseract-ocr
> COPY . .
> CMD ["python3", "bot.py"]
>
> On Friday, December 31, 2021 at 10:29:59 AM UTC-8 Cyrus Yip wrote:
>
>> better link? <https://www.toptal.com/developers/hastebin/nonepalihe>
>>
>> On Friday, December 31, 2021 at 10:27:41 AM UTC-8 Cyrus Yip wrote:
>>
>>> Right now I'm installing tesseract 4 in docker with
>>> RUN apt-get install -y tesseract-ocr
>>> That might be a reason why it's way slower than on my computer, how can
>>> I install tesseract 5?
>>>
>>> Dockerfile:
>>> # syntax=docker/dockerfile:1
>>>
>>> ARG TOKEN
>>>
>>> FROM python:3.8-slim-buster
>>>
>>> RUN apt-get update
>>> RUN apt-get install -y software-properties-common
>>> RUN apt-get update
>>> RUN add-apt-repository ppa:alex-p/tesseract-ocr-devel
>>>
>>> RUN apt-get update
>>> RUN apt-get install -y build-essential
>>>
>>> COPY requirements.txt requirements.txt
>>> RUN pip3 install -r requirements.txt
>>>
>>> COPY . .
>>>
>>> RUN apt-get install -y tesseract
>>>
>>> CMD ["python3", "bot.py"]
>>>
>>> Build logs
>>> <https://appbuild-logs-ams3.ams3.digitaloceanspaces.com/a7609af2-64e1-4ba2-8555-87a4fac8a37f/9420eaef-131e-410f-8add-bbfb870b2693/981a4c35-45d7-41b5-8619-3d9125d60c25/build.log?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=2JPIHVK4OTM6S5VRFBCK%2F20211231%2Fams3%2Fs3%2Faws4_request&X-Amz-Date=20211231T182608Z&X-Amz-Expires=900&X-Amz-SignedHeaders=host&X-Amz-Signature=3ae248ce9fb9e6fef0c71955d9cd9496feb8311162bdda8921750a21544f79a6>
>>>
>>>
>>> On Friday, December 31, 2021 at 3:18:18 AM UTC-8 zdenop wrote:
>>>
>>>> You are right - np.isin works differently than I expected (it matches
>>>> individual values, not whole tuples) and, by coincidence, it produces
>>>> results similar to your code.
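
To illustrate the point about np.isin, here is a small sketch with a single made-up pixel. The pixel value is an assumption chosen so that each channel occurs somewhere in the color list although the pixel as a whole does not.

```python
import numpy as np

filter_colors = [(51, 51, 51), (69, 69, 65)]

# One pixel that is NOT in filter_colors, but whose channel values
# (69, 51 and 65) each appear somewhere in the list.
pixel = np.array([[[69, 51, 65]]], dtype=np.uint8)

# np.isin treats filter_colors as a flat set of numbers and tests each
# channel value separately, so all three channels report a match.
per_value = np.isin(pixel, filter_colors)

# Comparing whole tuples: all channels must match the same color.
per_tuple = (pixel == np.array(filter_colors)[:, None, None]).all(-1).any(0)

print(per_value.all())   # every channel "matched" on its own
print(per_tuple.any())   # but no whole color matched
```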
>>>>
>>>> Here is updated code that produces the same result as PIL. It is faster,
>>>> but it will get slower as the number of colors in filter_colors grows.
>>>>
>>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>           (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>           (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>
>>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>>> mask = np.array([], dtype=bool)
>>>> for color in filter_colors:
>>>>     if mask.size == 0:
>>>>         mask = (image == color).all(-1)
>>>>     else:
>>>>         mask = mask | (image == color).all(-1)
>>>> img = Image.fromarray(~mask)
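
If the list of colors grows, one possible alternative is to pack each RGB triple into a single integer so that np.isin can compare whole pixels. This is a sketch under that assumption and has not been benchmarked here; pack_rgb and color_mask are hypothetical helper names.

```python
import numpy as np


def pack_rgb(rgb):
    """Combine the last axis (R, G, B) into one integer per pixel (0xRRGGBB)."""
    rgb = np.asarray(rgb, dtype=np.uint32)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]


def color_mask(image, colors):
    """True where a pixel of `image` (H, W, 3) equals one of `colors`."""
    return np.isin(pack_rgb(image), pack_rgb(colors))


# Synthetic 1x2 image: the first pixel is in the list, the second is not,
# even though each of its channel values occurs in the list.
image = np.array([[(51, 51, 51), (69, 51, 65)]], dtype=np.uint8)
mask = color_mask(image, [(51, 51, 51), (69, 69, 65)])
```

As in the code above, Image.fromarray(~mask) would then give the binarized image.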
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Fri, 31 Dec 2021 at 1:45, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>
>>>>> For some reason, using the numpy array has a different result than
>>>>> mine.
>>>>>
>>>>> Numpy array:
>>>>>
>>>>> [image: hi.png]
>>>>> Loop through pixels:
>>>>> [image: hi.png]
>>>>> The second way is more accurate but way slower.
>>>>> On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:
>>>>>
>>>>>> try this:
>>>>>>
>>>>>> import numpy as np
>>>>>> from PIL import Image
>>>>>>
>>>>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>>>           (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>>>           (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>>>>> mask = np.isin(image, filter_colors, invert=True)
>>>>>> img = Image.fromarray(mask.any(axis=2))
>>>>>>
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>
>>>>>>> I also tried many things like cropping, colour changing, colour
>>>>>>> replacing, and mixing them together.
>>>>>>>
>>>>>>> I landed on checking whether a pixel is one of these colours:
>>>>>>>
>>>>>>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>>>> (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>>>> (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>>>>
>>>>>>> and, if it is not, replacing it with white. It is pretty accurate, but is
>>>>>>> there a way to do this with numpy arrays?
>>>>>>>
>>>>>>> (code)
>>>>>>> for y in range(im.height):
>>>>>>>     for x in range(im.width):
>>>>>>>         if pixels[x, y] not in [(51, 51, 51), (69, 69, 65), (65, 64, 60),
>>>>>>>                 (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62),
>>>>>>>                 (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60),
>>>>>>>                 (55, 55, 54), (59, 59, 57), (56, 56, 55)]:
>>>>>>>             pixels[x, y] = (255, 255, 255)
>>>>>>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>>>>>>
>>>>>>>> OK. I played a little bit ;-):
>>>>>>>>
>>>>>>>> I tested the speed of your code with your image:
>>>>>>>>
>>>>>>>> import timeit
>>>>>>>>
>>>>>>>> pil_color_replace = """
>>>>>>>> from PIL import Image
>>>>>>>>
>>>>>>>> im = Image.open('mai.png').convert("RGB")
>>>>>>>>
>>>>>>>> pixdata = im.load()
>>>>>>>> for y in range(im.height):
>>>>>>>>     for x in range(im.width):
>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>> """
>>>>>>>>
>>>>>>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>>
>>>>>>>> I got an average time of 0.08547 seconds on my computer.
>>>>>>>> On the internet I found a suggestion to use numpy for this, and I
>>>>>>>> ended up with the following code:
>>>>>>>>
>>>>>>>> np_color_replace_rgb = """
>>>>>>>> import numpy as np
>>>>>>>> from PIL import Image
>>>>>>>>
>>>>>>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>>>>>>> mask = (data == [51, 51, 51]).all(-1)
>>>>>>>> img = Image.fromarray(np.invert(mask))
>>>>>>>> """
>>>>>>>>
>>>>>>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>>
>>>>>>>> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster
>>>>>>>> than the PIL code.
>>>>>>>> It is a little bit of cheating, as it does not replace colors - it just
>>>>>>>> takes a mask of the target color and returns it as a binarized image,
>>>>>>>> which is exactly what you need for OCR ;-)
>>>>>>>>
>>>>>>>> Also, I would like to point out that the resulting OCR output is not
>>>>>>>> perfect (compared to OCR of unmodified text areas), as this kind of
>>>>>>>> binarization is very simple.
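
If the colors really need to be replaced rather than returned as a mask, boolean indexing can do that as well. A sketch, assuming the data is an (H, W, 3) uint8 array; keep_only is a hypothetical helper name and the two-pixel array is made up for the demo.

```python
import numpy as np


def keep_only(data, color, fill=(255, 255, 255)):
    """Return a copy of `data` with every pixel that is not `color` set to `fill`."""
    mask = (data == np.array(color)).all(-1)   # True where the pixel is `color`
    out = data.copy()                          # leave the input untouched
    out[~mask] = fill
    return out


# Synthetic 1x2 image: one text-colored pixel, one background pixel.
data = np.array([[(51, 51, 51), (10, 20, 30)]], dtype=np.uint8)
out = keep_only(data, (51, 51, 51))
```

The result can be turned back into a PIL image with Image.fromarray(out).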
>>>>>>>>
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <zde...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just run your own tests ;-)
>>>>>>>>>
>>>>>>>>> You can use tesserocr (installation may be quite difficult if you are
>>>>>>>>> on Windows) instead of pytesseract, i.e. initialize the tesseract API
>>>>>>>>> once and use it multiple times. But it does not provide DICT output.
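
A rough sketch of the "initialize once, use multiple times" idea with tesserocr. It assumes tesserocr and the English traineddata are installed. PyTessBaseAPI, SetImage, SetRectangle and GetUTF8Text are tesserocr calls; box_to_rect and ocr_regions are hypothetical helper names.

```python
def box_to_rect(box):
    """Convert a PIL crop box (left, upper, right, lower)
    to the (left, top, width, height) form used by SetRectangle."""
    left, upper, right, lower = box
    return left, upper, right - left, lower - upper


def ocr_regions(image, boxes, lang="eng"):
    """OCR several regions of one PIL image with a single Tesseract instance."""
    # Imported lazily so the helper above works without tesserocr installed.
    from tesserocr import PyTessBaseAPI

    texts = []
    with PyTessBaseAPI(lang=lang) as api:   # the engine is initialised once
        api.SetImage(image)
        for box in boxes:
            # Restrict recognition to this region instead of cropping.
            api.SetRectangle(*box_to_rect(box))
            texts.append(api.GetUTF8Text().strip())
    return texts
```

With the crop boxes from the first message of this thread, a call could look like ocr_regions(im, [(0, 55, im.width, 110), (0, 312, im.width, 360)]). It returns plain strings, not the DICT structure of pytesseract.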
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> but won't multiple ocr's and crops use a lot of time?
>>>>>>>>>>
>>>>>>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>>>>>>>
>>>>>>>>>>> IMO if the text is always in the same area, cropping and OCR
>>>>>>>>>>> just that area will be faster.
>>>>>>>>>>>
>>>>>>>>>>> Zdenko
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I played around a bit with replacing all colours except the text
>>>>>>>>>>>> colour, and it works pretty well!
>>>>>>>>>>>>
>>>>>>>>>>>> The only thing is replacing colours with:
>>>>>>>>>>>> im = im.convert("RGB")
>>>>>>>>>>>> pixdata = im.load()
>>>>>>>>>>>> for y in range(im.height):
>>>>>>>>>>>>     for x in range(im.width):
>>>>>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>>>>>> is a bit slow. Do you know a better way to replace pixels in
>>>>>>>>>>>> python? I don't know if this is off topic.
>>>>>>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: r_cropped.png]
>>>>>>>>>>>>>
>>>>>>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>>>>>>
>>>>>>>>>>>>> Rascal Does Not Dream
>>>>>>>>>>>>> of Bunny Girl Senpai
>>>>>>>>>>>>>
>>>>>>>>>>>>> Zdenko
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> here is an example of an image i would like to use ocr on:
>>>>>>>>>>>>>> [image: drop8.png]
>>>>>>>>>>>>>> I would like the results to be like:
>>>>>>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not
>>>>>>>>>>>>>> Dream of Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right now I'm using
>>>>>>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>>>>>>> image = Image.new("RGB", (im.width, region1.height +
>>>>>>>>>>>>>> region2.height + 20))
>>>>>>>>>>>>>> image.paste(region1)
>>>>>>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>>>>>>> results = pytesseract.image_to_data(image,
>>>>>>>>>>>>>> output_type=pytesseract.Output.DICT)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the processed image looks like
>>>>>>>>>>>>>> [image: hi.png]
>>>>>>>>>>>>>> but getting results like:
>>>>>>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>>>>>>>>>> 'iGenshinImpact']
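
One way to turn the DICT output of image_to_data into strings is to group the words by their block, paragraph and line numbers. This is only a sketch: it groups by text line, not by card, and it does not by itself improve recognition. words_to_lines is a hypothetical helper, and the sample dictionary is hand-made rather than real OCR output.

```python
from collections import defaultdict


def words_to_lines(data):
    """Join the words of an image_to_data DICT result into one string per text line."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if word.strip():  # skip the empty entries for blocks, paragraphs and lines
            key = (data["page_num"][i], data["block_num"][i],
                   data["par_num"][i], data["line_num"][i])
            lines[key].append(word)
    # Sorting the keys keeps the lines in page/block/paragraph/line order.
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hand-made sample with the same keys that pytesseract returns.
sample = {
    "page_num":  [1, 1, 1, 1],
    "block_num": [1, 1, 1, 2],
    "par_num":   [1, 1, 1, 1],
    "line_num":  [0, 1, 1, 1],
    "text":      ["", "Genshin", "Impact", "Naruto"],
}
lines = words_to_lines(sample)
```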
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> How do I optimize the image/configs so the ocr is more
>>>>>>>>>>>>>> accurate?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from
>>>>>>>>>>>>>> it, send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>

