On Fri, Jun 8, 2012 at 11:40 AM, Nick White <nick.wh...@durham.ac.uk> wrote:
> Hi Zdenko,
>
> I saw the descriptions you give below; I just wasn't very clear on
> what they meant.
>
> On Thu, Jun 07, 2012 at 02:50:57PM +0200, zdenko podobny wrote:
> > lang.punc-dawg
> > (Optional) A dawg made from punctuation patterns found around words. The
> > "word" part is replaced by a single space.
>
> So for English, ( ) and " " spring to mind. Is this the sort of
> thing that is expected?

Yes. Have a look at *.punc-dawg and *.number-dawg for more examples, e.g.
" http://rapidshare.com/files/ /HITMAN.part .rar" ;-)

> > lang.number-dawg
> > (Optional) A dawg made from tokens which originally contained digits. Each
> > digit is replaced by a space character.
>
> Ah, looking at one of the official trainings with dawg2wordlist I
> see entries such as '(c) ' (without quotes). Thanks, that makes
> sense. Though I'm surprised (and impressed) that Tesseract goes down
> to that level of granularity in its scanning.
>
> Nick
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesseract-ocr@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-ocr+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
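
For anyone reading this thread later, here is a rough sketch of the two transformations as they are described above: the "word" part of a token collapsed to a single space for punc-dawg entries, and each digit replaced by a space for number-dawg entries. The function names `punc_pattern` and `number_pattern` are made up for illustration and are not part of Tesseract. The rule used to decide where the "word" part starts and ends is also an assumption, so the real training tools may well tokenise differently.

```python
import re


def punc_pattern(token: str) -> str:
    """Collapse the word part of a token to one space, keeping the
    punctuation around it (assumed reading of the punc-dawg description)."""
    # Leading punctuation, then the shortest possible "word" part,
    # then trailing punctuation up to the end of the token.
    match = re.match(r"^([^\w\s]*)(.*?)([^\w\s]*)$", token, re.DOTALL)
    lead, word, trail = match.groups()
    if not word:
        # The token is punctuation only, so there is no word to replace.
        return lead + trail
    return f"{lead} {trail}"


def number_pattern(token: str) -> str:
    """Replace each digit with a space character (number-dawg description)."""
    return re.sub(r"\d", " ", token)


if __name__ == "__main__":
    for tok in ["(hello)", '"word"', "don't,"]:
        print(repr(punc_pattern(tok)))
    for tok in ["part01.rar", "Win95"]:
        print(repr(number_pattern(tok)))
```

Under these assumptions `(hello)` becomes `( )` and `"word"` becomes `" "`, which matches the two patterns Nick guessed for English. To see what a real traineddata file contains, dawg2wordlist on the unpacked dawg is still the thing to check.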