Hi, I'm trying to train as well and I have the same problem. I got this result:

"P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 54 0 0 # # P [50 ]A
A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 38 0 0 # # A [41 ]A
S 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 53 0 0 # # S [53 ]A"

The problem is in the glyph_metrics and script fields. Does anyone have an idea how to fix this?

On Tuesday, 1 December 2015 00:42:23 UTC+2, Gustavo Polledri wrote:
>
> In some recent posts, I've seen people with similar problems to mine, but
> no answer as to how to fix it. I'm trying to train tesseract to be more
> accurate with a new font. When creating the unicharset using
> unicharset_extractor on my box file:
>
> ```
> a 32 692 165 958 0
> b 221 734 354 958 0
> c 32 446 165 628 0
> d 221 488 354 628 0
> e 32 275 165 373 0
> f 221 317 277 373 0
> ```
>
> I get the following output:
>
> ```
> 9
> NULL 0 NULL 0
> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
> b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # b [62 ]
> c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # c [63 ]
> d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # d [64 ]
> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
> f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # f [66 ]
> ```
>
> and when I run shapeclustering, the first few lines of its output are:
>
> ```
> Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0
> Bad properties for index 4, char b: 0,255 0,
> ```
>
> It seems that unicharset_extractor isn't properly parsing the box
> file. Some obvious problems with the unicharset file: the "properties"
> bit mask is 0, the "glyph_metrics" field appears invalid
> (0,255,0,255,0,0,0,0,0,0), and the "script" field should be either
> "Latin" or "Common" but is NULL, etc.
>
> Does anyone have an idea why this is happening?
>
> O/S: Ubuntu 15.10
> Tesseract Ver: 3.04
>
> Posts with no simple resolution:
> https://github.com/tesseract-ocr/tesseract/issues/139
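
In case it helps anyone compare the two outputs in this thread, here is a rough Python sketch that splits a unicharset entry into the fields discussed above and flags entries whose properties mask is 0 and whose script is NULL. It is only a sketch under assumptions: the helper names (`parse_unicharset_line`, `looks_unset`) are made up for illustration, the field order (character, properties, glyph_metrics, script) is read off the lines quoted here, and the meaning of the property bits (isalpha, islower, isupper, isdigit, ispunctuation, lowest bit first, written in hex) is my reading of the unicharset format documentation rather than something confirmed in this thread.

```python
# Names for the low bits of the "properties" mask, least significant first.
# Assumption: taken from the unicharset file-format documentation.
PROPERTY_BITS = ["isalpha", "islower", "isupper", "isdigit", "ispunctuation"]


def parse_unicharset_line(line):
    """Split one unicharset entry into its first four fields.

    Assumes the layout seen in this thread:
    <char> <properties> <glyph_metrics> <script> ...
    The special "NULL 0 NULL 0" entry is not handled by this sketch.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: %r" % line)
    char, props_text, metrics_text, script = fields[:4]
    props = int(props_text, 16)  # assumed to be a hex bit mask
    flags = [name for bit, name in enumerate(PROPERTY_BITS)
             if props & (1 << bit)]
    return {
        "char": char,
        "properties": props,
        "flags": flags,
        "glyph_metrics": [int(v) for v in metrics_text.split(",")],
        "script": script,
    }


def looks_unset(entry):
    """True when neither character properties nor a script were filled in."""
    return entry["properties"] == 0 and entry["script"] == "NULL"


if __name__ == "__main__":
    samples = [
        "a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]",
        "P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 54 0 0 # # P [50 ]A",
    ]
    for text in samples:
        entry = parse_unicharset_line(text)
        print(entry["char"], entry["flags"], entry["script"],
              "unset" if looks_unset(entry) else "has properties")
```

Run on the two sample lines, this treats the `a` entry as completely unset, while the `P` entry has property bits set but still reports NULL for script, which matches the difference between the two outputs described in this thread.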

