Re: [tesseract-ocr] IF I could make .unicharset by box/tif pairs instead of fonts files by tesstrain.sh?

2018-08-28 Thread WangSiyuan

Hey shree

Thank you for the reply.

I had noticed the /tmp directory, and I will try this new flag to see 
how the font files are converted into box/tiff pairs.
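
For anyone following along, here is a minimal sketch of how the flag might be passed. It only assembles the command line and does not run it. The --save_box_tiff flag comes from this thread; the other flags follow the usual tesstrain.sh invocation from the Tesseract 4 training tutorial, and every path below is a hypothetical placeholder to adapt to your own checkout.

```python
import shlex


def build_tesstrain_command(lang, fonts_dir, langdata_dir, tessdata_dir,
                            output_dir, save_box_tiff=True):
    """Return the argv list for a tesstrain.sh run (nothing is executed)."""
    cmd = [
        "src/training/tesstrain.sh",      # assumed location inside the tesseract repo
        "--fonts_dir", fonts_dir,
        "--lang", lang,
        "--linedata_only",
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]
    if save_box_tiff:
        # Keep the generated box/tiff pairs next to the lstmf files.
        cmd.append("--save_box_tiff")
    return cmd


# Hypothetical paths, for illustration only.
example = build_tesstrain_command(
    lang="eng",
    fonts_dir="/usr/share/fonts",
    langdata_dir="../langdata",
    tessdata_dir="./tessdata",
    output_dir="/tmp/engtrain",
)
print(" ".join(shlex.quote(arg) for arg in example))
```

The resulting list can be handed to subprocess.run once the paths point at real directories.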




On Tuesday, August 28, 2018 at 1:49:40 PM UTC+8, shree wrote:
>
> When using tesstrain.sh, you can add --save_box_tiff to the command line.
>
> The original tesstrain.sh did not move the box/tiff files along with the 
> lstmf files (they remained in the /tmp directory).
>
> I first modified it to move the box/tiff files along with the lstmf files 
> in all cases.
>
> This option now gives the user the choice whether to save the box/tiff 
> pairs or not. The default is NOT to save them, since they are not needed 
> for training and are useful just for reference/review.
>
> On Tue, Aug 28, 2018 at 6:56 AM, 王思远 wrote:
>
>> I see there is a new flag in tesseract/src/training/tesstrain.sh, 
>> introduced in the change on 2018/8/20:
>> "add variable --save_box_tiff to Save box/tiff pairs along with lstmf files"
>>
>> So how can I use this new flag? Is there a demo that I can refer to?
>>
>>
>
>
>
> -- 
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6ce71583-72a6-4955-a2d8-d668dd7bc6b3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


RE: [tesseract-ocr] Tesseract 3.x multiprocessing weird behaviour

2018-08-28 Thread Adrian Owen
When multiprocessing using V4 (and TessAPI), I had to make multiple copies of 
tessdata and give each worker its own unique copy.

Now it works okay. Hope this is helpful.
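
A minimal sketch of that workaround, assuming each worker is a separate process: copy the shared tessdata directory into a per-process directory at worker start-up. The helper name private_tessdata and the directory naming are made up for illustration. How the returned path is handed to Tesseract depends on the binding in use (a datapath-style argument, or the TESSDATA_PREFIX environment variable, whose exact meaning differs between Tesseract versions).

```python
import os
import shutil
import tempfile
from pathlib import Path


def private_tessdata(source_dir, base_dir=None, worker_id=None):
    """Copy source_dir into a directory unique to this worker and return it.

    worker_id defaults to the current process id, so every gunicorn or
    multiprocessing worker ends up with its own copy.
    """
    worker_id = os.getpid() if worker_id is None else worker_id
    base = Path(base_dir) if base_dir else Path(tempfile.gettempdir())
    target = base / f"tessdata-worker-{worker_id}"
    if not target.exists():
        # First call in this worker: make the private copy.
        shutil.copytree(source_dir, target)
    return target
```

Call it once when the worker starts, then point that worker's Tesseract instance at the returned directory. The copies cost disk space (one tessdata per worker), so it may be worth copying only the traineddata files actually needed.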

From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com] On 
Behalf Of ignas...@gmail.com
Sent: 28 August 2018 14:40
To: tesseract-ocr 
Subject: [tesseract-ocr] Tesseract 3.x multiprocessing weird behaviour


I am not sure whether it is my infrastructure that does this weird stuff or 
tesseract-ocr itself.

Whenever I use image_to_string in a single-process environment, 
tesseract-ocr works fine. But when I spawn multiple workers with gunicorn and 
all of them do some OCR work, tesseract-ocr starts reading very poorly (not 
performance-wise, but accuracy-wise). Even after the load is done, tesseract 
never regains the same accuracy. I need to restart all the workers in order 
to get tesseract working well again.

This is super weird. Has anyone experienced or heard of this issue?


Re: [tesseract-ocr] What i need to do fine tuning for only numbers and specific font?

2018-08-28 Thread Soumik Ranjan Dasgupta
Hey Yasin,
Sorry to reply so late. As far as I know, Tesseract doesn't work on macOS
yet. Maybe you can install a Linux environment inside a VM and make do with
it?
No, you don't have to create box files manually; tesstrain.sh will do that
for you. In fact, it will take care of the entire training procedure.
If you want to fine-tune, you have to specify the modified architecture in
the VGSL specifications as the CLI parameter.
In order to train Tesseract on a custom font list, you'd have to install
the fonts and then mention their names in two separate files: the
font_properties file and the language-specific.sh file. Note that in both
files, you need to list the fonts in a particular format.
The traineddata for Tesseract 3 is not compatible with version 4, so
it's better if you train from scratch.
Do get back to me if you have any more queries.
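
To make the point about the VGSL specification concrete, here is a rough sketch that only builds an lstmtraining command line and does not run it. The flags (--continue_from, --append_index, --net_spec and so on) are the ones described in the Tesseract 4 training documentation for replacing the top layers of an existing model. All paths are hypothetical, and the net spec is just an example value: the number after "O1c" has to match the size of your own unicharset.

```python
def build_lstmtraining_command(continue_from, traineddata, net_spec,
                               append_index, model_output, train_listfile,
                               max_iterations):
    """Return the argv list for an lstmtraining fine-tuning run."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,      # existing .lstm model to start from
        "--traineddata", traineddata,          # starter traineddata with your unicharset
        "--append_index", str(append_index),   # cut the old network at this layer
        "--net_spec", net_spec,                # VGSL spec for the new top layers
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]


# Hypothetical paths and values, for illustration only.
example = build_lstmtraining_command(
    continue_from="/tmp/digits/eng.lstm",
    traineddata="/tmp/digits/eng/eng.traineddata",
    net_spec="[Lfx256 O1c111]",
    append_index=5,
    model_output="/tmp/digits/base",
    train_listfile="/tmp/digits/eng.training_files.txt",
    max_iterations=3000,
)
print(" ".join(example))
```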

On Sat, Aug 25, 2018 at 3:00 PM Yasin Nazlıcan wrote:

> Hey Soumik Ranjan,
>
> Thank you for the reply, mate. Like I said, I tried to follow this
> documentation, but I couldn't get any further. I couldn't find any info
> about macOS and had to stop. I assume I should create boxes for the font
> and text and then fine-tune. Do you have any links for macOS that I can
> follow? Also, if you don't mind, could you give me some more explanation
> of the process?
>


-- 
Regards,
Soumik Ranjan Dasgupta



[tesseract-ocr] OCR images with arbitrary foreign language text

2018-08-28 Thread loiodice
Is anyone aware of best practices for recognizing text in an image which 
could be in English or any other language?

- Is configuring tesseract with all 100+ supported trained datasets and 
just letting it figure out the best language dataset to use an option?
  Does anybody have experience with accuracy and performance in such a 
configuration?

- Is there a suggested alternative ... like trying to guess what the 
language is, then doing language identification on the returned blob and 
later re-OCRing with the recognized language?
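
The second alternative could be sketched roughly as below. This is only an outline of the control flow, not a tested recipe: ocr and identify_language are hypothetical callables you would supply yourself (for example, ocr could wrap pytesseract.image_to_string(image, lang=...), and identify_language could be any text language detector that returns a Tesseract language code).

```python
def two_pass_ocr(image, ocr, identify_language, first_pass_lang="eng"):
    """OCR once with a guessed language, identify the language, then re-OCR.

    ocr(image, lang) -> str and identify_language(text) -> str are
    caller-supplied. Returns (detected_language, final_text).
    """
    rough_text = ocr(image, first_pass_lang)
    detected = identify_language(rough_text)
    if detected == first_pass_lang:
        # The guess was right, so the first pass is already the final result.
        return detected, rough_text
    # Second pass with the language the detector settled on.
    return detected, ocr(image, detected)
```

Whether the first pass gives the detector enough usable text will depend on how far the guessed language is from the real one, particularly when the script differs.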

Thanks! 
