Re: [tesseract-ocr] bad quality!?

Cyrus Yip Sat, 01 Jan 2022 10:49:50 -0800

I managed to install Tesseract 5, but the numpy mask doesn't work now.
It produces pictures like:
[image: image.png]
instead of:
[image: image.png]



Dockerfile:
# syntax=docker/dockerfile:1
ARG TOKEN
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y software-properties-common
RUN apt-get install -y python3.8
RUN apt-get install -y python3-pip
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get install -y python3-pil
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
RUN apt-get update
RUN add-apt-repository ppa:alex-p/tesseract-ocr5
RUN apt-get update
RUN apt-get install -y tesseract-ocr
COPY . .
CMD ["python3", "bot.py"]
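
One thing that may be worth ruling out, offered as a guess since the thread does not confirm the cause: the Ubuntu 18.04 base image pulls in older Python packages, and handing a boolean array straight to `Image.fromarray` may behave differently depending on the installed Pillow version. A minimal sketch that converts the mask to an explicit 0/255 `uint8` array first (the helper name `mask_to_uint8` is made up here):

```python
import numpy as np

def mask_to_uint8(mask):
    """Turn a boolean mask into a 0/255 uint8 array (True -> 255)."""
    return mask.astype(np.uint8) * 255

# Tiny stand-in for the real mask: True marks text pixels.
text_mask = np.array([[True, False], [False, True]])

# Black text on a white background, as an explicit 8-bit grayscale array.
gray = mask_to_uint8(~text_mask)

# With Pillow installed, the array could then be wrapped as:
#   img = Image.fromarray(gray)
```

If the output looks correct with the `uint8` array but not with the boolean one, the difference is in the image conversion rather than in the mask itself.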

On Friday, December 31, 2021 at 10:29:59 AM UTC-8 Cyrus Yip wrote:

> better link? <https://www.toptal.com/developers/hastebin/nonepalihe>
>
> On Friday, December 31, 2021 at 10:27:41 AM UTC-8 Cyrus Yip wrote:
>
>> Right now I'm installing Tesseract 4 in Docker with
>> RUN apt-get install -y tesseract-ocr
>> That might be why it's much slower than on my computer. How can I
>> install Tesseract 5?
>>
>> Dockerfile:
>> # syntax=docker/dockerfile:1
>>
>> ARG TOKEN
>>
>> FROM python:3.8-slim-buster
>>
>> RUN apt-get update
>> RUN apt-get install -y software-properties-common
>> RUN apt-get update
>> RUN add-apt-repository ppa:alex-p/tesseract-ocr-devel
>>
>> RUN apt-get update
>> RUN apt-get install -y build-essential
>>
>> COPY requirements.txt requirements.txt
>> RUN pip3 install -r requirements.txt
>>
>> COPY . .
>>
>> RUN apt-get install -y tesseract
>>
>> CMD ["python3", "bot.py"]
>>
>> Build logs 
>> <https://appbuild-logs-ams3.ams3.digitaloceanspaces.com/a7609af2-64e1-4ba2-8555-87a4fac8a37f/9420eaef-131e-410f-8add-bbfb870b2693/981a4c35-45d7-41b5-8619-3d9125d60c25/build.log?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=2JPIHVK4OTM6S5VRFBCK%2F20211231%2Fams3%2Fs3%2Faws4_request&X-Amz-Date=20211231T182608Z&X-Amz-Expires=900&X-Amz-SignedHeaders=host&X-Amz-Signature=3ae248ce9fb9e6fef0c71955d9cd9496feb8311162bdda8921750a21544f79a6>
>>
>>
>> On Friday, December 31, 2021 at 3:18:18 AM UTC-8 zdenop wrote:
>>
>>> You are right - np.isin works differently than I expected (it does not
>>> match whole tuples, only the individual values inside them), and by
>>> coincidence it produces results similar to your code.
>>>
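
A small sketch of the difference described here, using a made-up 1x3 image and a two-colour list. It uses the non-inverted form of `np.isin` with `.all`, which is the logical complement of the `invert=True` plus `.any` combination from the earlier message:

```python
import numpy as np

# A 1x3 RGB "image": only the first pixel is really in the colour list.
image = np.array([[[51, 51, 51], [51, 69, 65], [200, 200, 200]]], dtype=np.uint8)
filter_colors = [(51, 51, 51), (69, 69, 65)]

# np.isin tests each channel value on its own against the flattened
# values {51, 65, 69}, so the mixed pixel (51, 69, 65) looks like a match.
per_value = np.isin(image, filter_colors).all(axis=-1)

# Comparing whole pixels against each colour matches complete tuples only.
per_pixel = np.zeros(image.shape[:2], dtype=bool)
for color in filter_colors:
    per_pixel |= (image == color).all(axis=-1)

print(per_value)  # the second pixel is a false positive
print(per_pixel)  # only the first pixel matches
```

On real images the two masks often agree because such mixed pixels are rare, which would explain the "similar by coincidence" results.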
>>> Here is updated code that produces the same result as PIL. It is faster
>>> than the pixel loop, but it gets slower as the number of colors in
>>> filter_colors grows.
>>>
>>> import numpy as np
>>> from PIL import Image
>>>
>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>                  (59, 59, 57), (56, 56, 55)]
>>>
>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>> # Start with an all-False mask and OR in one exact-colour match per colour.
>>> mask = np.zeros(image.shape[:2], dtype=bool)
>>> for color in filter_colors:
>>>     mask |= (image == color).all(-1)
>>> # Listed colours become False (black), everything else True (white).
>>> img = Image.fromarray(~mask)
>>>
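
The per-colour loop above can also be written as a single broadcast comparison. This is only a sketch of an alternative, not something proposed in the thread; the function name `color_mask` is made up, and the intermediate array grows with height x width x number of colours, so it trades memory for the loop:

```python
import numpy as np

def color_mask(image, colors):
    """Return an (H, W) boolean mask, True where a pixel equals any listed colour."""
    palette = np.asarray(colors, dtype=image.dtype)  # shape (K, 3)
    # (H, W, 1, 3) compared with (K, 3) broadcasts to (H, W, K, 3).
    matches = image[:, :, None, :] == palette
    # A pixel matches a colour if all 3 channels agree; keep it if any colour matches.
    return matches.all(axis=-1).any(axis=-1)

# Made-up 2x2 test image.
demo = np.array([[[51, 51, 51], [51, 69, 65]],
                 [[69, 69, 65], [0, 0, 0]]], dtype=np.uint8)
demo_mask = color_mask(demo, [(51, 51, 51), (69, 69, 65)])
```

For a short colour list like the one in this thread, either form should give the same mask; which is faster would need to be measured on the actual images.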
>>>
>>> Zdenko
>>>
>>>
>>> On Fri, Dec 31, 2021 at 1:45, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>
>>>> For some reason, the numpy array version gives a different result than my loop.
>>>>
>>>> Numpy array:
>>>>
>>>> [image: hi.png]
>>>> Loop through pixels:
>>>> [image: hi.png]
>>>> The second way is more accurate but much slower.
>>>> On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:
>>>>
>>>>> try this:
>>>>>
>>>>> import numpy as np
>>>>> from PIL import Image
>>>>>
>>>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>>>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>>>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>>>                  (59, 59, 57), (56, 56, 55)]
>>>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>>>> mask = np.isin(image, filter_colors, invert=True)
>>>>> img = Image.fromarray(mask.any(axis=2))
>>>>>
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Thu, Dec 30, 2021 at 18:14, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>
>>>>>> I also tried many things like cropping, changing colours, replacing
>>>>>> colours, and combining them.
>>>>>>
>>>>>> I settled on checking whether a pixel is one of these colours:
>>>>>>
>>>>>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>>> (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>>> (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>>>
>>>>>> and, if it is not, replacing it with white. It is pretty accurate, but is
>>>>>> there a way to do this with numpy arrays?
>>>>>>
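>>>>>> (code)
>>>>>> text_colours = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>>>>                 (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>>>>                 (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>>>>                 (59, 59, 57), (56, 56, 55)]
>>>>>> pixels = im.load()  # assumed: im is the RGB image opened earlier
>>>>>> for y in range(im.height):  # row loop, missing from the pasted excerpt
>>>>>>     for x in range(im.width):
>>>>>>         # Anything that is not a known text colour becomes white.
>>>>>>         if pixels[x, y] not in text_colours:
>>>>>>             pixels[x, y] = (255, 255, 255)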
>>>>>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>>>>>
>>>>>>> OK. I played a little bit ;-):
>>>>>>>
>>>>>>> I tested the speed of your code with your image:
>>>>>>>
>>>>>>> import timeit
>>>>>>>
>>>>>>> pil_color_replace = """
>>>>>>> from PIL import Image
>>>>>>>
>>>>>>> im = Image.open('mai.png').convert("RGB")
>>>>>>>
>>>>>>> pixdata = im.load()
>>>>>>> for y in range(im.height):
>>>>>>>     for x in range(im.width):
>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>> """
>>>>>>>
>>>>>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>
>>>>>>> I got an average time of 0.08547 seconds on my computer.
>>>>>>> On the internet I found a suggestion to use numpy for this, and I
>>>>>>> ended up with the following code:
>>>>>>>
>>>>>>> np_color_replace_rgb = """
>>>>>>> import numpy as np
>>>>>>> from PIL import Image
>>>>>>>
>>>>>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>>>>>> mask = (data == [51, 51, 51]).all(-1)
>>>>>>> img = Image.fromarray(np.invert(mask)) 
>>>>>>> """
>>>>>>>
>>>>>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>
>>>>>>> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster
>>>>>>> than the PIL code.
>>>>>>> It is cheating a little, as it does not replace colors: it just builds
>>>>>>> a mask of the target color and returns it as a binarized image, which is
>>>>>>> exactly what you need for OCR ;-)
>>>>>>>
>>>>>>> Also, I would like to point out that the resulting OCR output is not
>>>>>>> perfect (compared to OCR of unmodified text areas), as this kind of
>>>>>>> binarization is very simple.
>>>>>>>
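
For comparison, a sketch of what an actual colour replacement (rather than just returning the mask) could look like in numpy. The function name `keep_only` and its defaults are made up for illustration; the behaviour mirrors the PIL loop, which turns every pixel that is not (51, 51, 51) white:

```python
import numpy as np

def keep_only(data, target=(51, 51, 51), fill=(255, 255, 255)):
    """Return a copy of an (H, W, 3) array with every non-target pixel set to fill."""
    # True where all three channels equal the target colour.
    mask = (data == np.array(target, dtype=data.dtype)).all(axis=-1)
    out = data.copy()
    out[~mask] = fill  # boolean indexing selects whole pixels
    return out

# Made-up 1x2 image: one text-coloured pixel, one background pixel.
pixels = np.array([[[51, 51, 51], [10, 20, 30]]], dtype=np.uint8)
replaced = keep_only(pixels)
```

The result is still an RGB array, so it could be passed to `Image.fromarray` the same way as the original data.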
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 30, 2021 at 11:19, Zdenko Podobny <zde...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just run your own tests ;-)
>>>>>>>>
>>>>>>>> You can use tesserocr instead of pytesseract (installation may be quite
>>>>>>>> difficult if you are on Windows), i.e. initialize the Tesseract API once
>>>>>>>> and use it multiple times. But it does not provide DICT output.
>>>>>>>>
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Dec 29, 2021 at 21:18, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>>>>>>
>>>>>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>>>>>>
>>>>>>>>>> IMO if the text is always in the same area, cropping and running OCR
>>>>>>>>>> on just that area will be faster.
>>>>>>>>>>
>>>>>>>>>> Zdenko
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 29, 2021 at 18:58, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I played around a bit with replacing all colours except the text
>>>>>>>>>>> colour, and it works pretty well!
>>>>>>>>>>>
>>>>>>>>>>> The only thing is that replacing colours with:
>>>>>>>>>>> im = im.convert("RGB")
>>>>>>>>>>> pixdata = im.load()
>>>>>>>>>>> for y in range(im.height):
>>>>>>>>>>>     for x in range(im.width):
>>>>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>>>>> is a bit slow. Do you know a better way to replace pixels in
>>>>>>>>>>> Python? I don't know if this is off topic.
>>>>>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>>>>>>>
>>>>>>>>>>>> [image: r_cropped.png]
>>>>>>>>>>>>
>>>>>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>>>>>
>>>>>>>>>>>> Rascal Does Not Dream
>>>>>>>>>>>> of Bunny Girl Senpai
>>>>>>>>>>>>
>>>>>>>>>>>> Zdenko
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 29, 2021 at 18:21, Cyrus Yip <cyrus...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is an example of an image I would like to use OCR on:
>>>>>>>>>>>>> [image: drop8.png]
>>>>>>>>>>>>> I would like the results to be like:
>>>>>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not 
>>>>>>>>>>>>> Dream of Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Right now I'm using
>>>>>>>>>>>>> import pytesseract
>>>>>>>>>>>>> from PIL import Image
>>>>>>>>>>>>>
>>>>>>>>>>>>> # im is assumed to be the already-opened PIL image
>>>>>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>>>>>> # Stack both text strips on one canvas with a 20 px gap between them
>>>>>>>>>>>>> image = Image.new("RGB", (im.width, region1.height + region2.height + 20))
>>>>>>>>>>>>> image.paste(region1)
>>>>>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>>>>>> results = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The processed image looks like
>>>>>>>>>>>>> [image: hi.png]
>>>>>>>>>>>>> but I am getting results like:
>>>>>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai', 
>>>>>>>>>>>>> 'iGenshinImpact']
>>>>>>>>>>>>>
>>>>>>>>>>>>> How do I optimize the image/configs so the ocr is more 
>>>>>>>>>>>>> accurate?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
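
The thread does not show how the result list was built from the `image_to_data` dictionary, so the following is only a sketch of one way to do it: grouping the recognised words by their page, block, paragraph and line numbers and joining them with spaces. The function name `lines_from_data` and the sample dictionary are made up; the keys are the ones pytesseract returns with `Output.DICT`:

```python
from collections import defaultdict

def lines_from_data(data):
    """Join the recognised words of each text line with single spaces."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # structural rows (page, block, line) carry empty text
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Sorting the keys keeps the lines in reading order.
    return [" ".join(words) for _, words in sorted(lines.items())]

# Hand-written stand-in for pytesseract output; not real OCR results.
sample = {
    "page_num":  [1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 2, 2, 2],
    "par_num":   [1, 1, 1, 1, 1, 1],
    "line_num":  [1, 1, 1, 1, 1, 1],
    "text":      ["", "Mai", "Sakurajima", "", "Genshin", "Impact"],
}
print(lines_from_data(sample))
```

This only helps if Tesseract already separates the words; if it returns a whole name as one token, the image preparation is the part to improve.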
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>>> it, send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>>>>>>  
>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>> .
>>>>>>>>>>>>>

