Training for Swedish, Danish, Norwegian, old spelling, fraktur

2010-04-23 Thread Lars Aronsson
eaning different spelling standards. Could these be used for training Tesseract? How do I start? -- Lars Aronsson (l...@aronsson.se) Aronsson Datateknik - http://aronsson.se Project Runeberg - free Nordic literature - http://runeberg.org/ -- You received this message because you are subscr

Re: Training for Swedish, Danish, Norwegian, old spelling, fraktur

2010-04-24 Thread Lars Aronsson
ccess to the source code. Oh, really. Is anybody taking the lead, and do you have any funding for this?

Re: Training for Swedish, Danish, Norwegian, old spelling, fraktur

2010-04-28 Thread Lars Aronsson
books that Google has already scanned. Does anybody know of an open source OCR project that is based on statistics from scanned books? Could parts of the Tesseract software library be used to cut out letters from scanned pages, so some other software could group them statistically?

Re: Benefit of the dictionary

2010-05-02 Thread Lars Aronsson
pace be excluded when computing the accuracy? In my example, only missing characters are counted as errors, but added extra characters are not.
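
The distinction raised in this snippet (missing characters counted as errors, extra characters not) can be illustrated with a small sketch. This is not Tesseract's accuracy tooling or the poster's actual method; the function names `count_errors` and `accuracy` are hypothetical, and the alignment is done with Python's standard `difflib`, which approximates but is not a true minimum edit distance.

```python
from difflib import SequenceMatcher


def count_errors(truth: str, ocr: str) -> dict:
    """Align OCR output against ground truth and classify the differences.

    Hypothetical helper for illustration; difflib's alignment is a
    heuristic, not a guaranteed minimal edit script.
    """
    missing = extra = substituted = 0
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":        # characters in truth that OCR dropped
            missing += i2 - i1
        elif tag == "insert":      # characters OCR added that are not in truth
            extra += j2 - j1
        elif tag == "replace":     # characters OCR got wrong
            substituted += max(i2 - i1, j2 - j1)
    return {"missing": missing, "extra": extra, "substituted": substituted}


def accuracy(truth: str, ocr: str, count_extra: bool = True) -> float:
    """Character accuracy relative to the ground-truth length.

    With count_extra=False, inserted characters are ignored, which
    reproduces the lenient counting described in the snippet.
    """
    if not truth:
        return 1.0 if not ocr else 0.0
    errors = count_errors(truth, ocr)
    total = errors["missing"] + errors["substituted"]
    if count_extra:
        total += errors["extra"]
    return 1.0 - total / len(truth)
```

Calling `accuracy(truth, ocr, count_extra=False)` and `accuracy(truth, ocr)` on the same pair shows how much of the reported accuracy depends on whether insertions are penalized.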

Re: Danish fraktur support in r319

2010-05-24 Thread Lars Aronsson
some documentation for this file format, so I can read and understand what's in there? I want to keep the part that is about fraktur/blackletter and replace the part that is about pre-1870 Danish spelling with something based on my Swedish dictionaries.

Re: Danish fraktur support in r319

2010-05-24 Thread Lars Aronsson
ng -r319, but then combine_tessdata doesn't have all these flags. Still, I'm less interested in running the program than in understanding the data. Is there no documentation for the format? Should we write some? Or is that something you keep internally at Google?

Re: Danish fraktur support in r319

2010-05-24 Thread Lars Aronsson
Jimmy O'Regan wrote: On 24 May 2010 17:41, Lars Aronsson wrote: I tried to compile the current version (svn -r354 up), but failed: Looks like a pair of missing casts - have you opened an issue? No, I have not. I don't know enough of the software. Err... I have no affiliation w

Re: Danish fraktur support in r319

2010-05-24 Thread Lars Aronsson
<http://code.google.com/p/tesseract-ocr/source/detail?r=354>, Mandrivalinux 2010.1 64bit", but the compiler error message is full of "inT32" and the prototype above says "int".

Re: Danish fraktur support in r319

2010-05-24 Thread Lars Aronsson
rks for you? Yes, this works fine, both "tesseract eurotext.tif output2" and "combine_tessdata -u dan-frak.traineddata /tmp/foo."

Re: need help converting .jpg files to .tif for OCR

2010-07-05 Thread Lars Aronsson
om the failure to explain what Tesseract is.

[tesseract-ocr] Swedish and Danish Fraktur

2022-12-26 Thread Lars Aronsson
more than expected. What to do? Should those who made those files make new versions that will work with the new Tesseract? Or will Tesseract finally incorporate Fraktur reading without the need to load separate training files?

[tesseract-ocr] Hyphenation postprocessing

2023-02-05 Thread Lars Aronsson
source has code to recognize hyphenated words, and it should be possible to implement this behaviour as an option.
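
The snippet is truncated, so the exact behaviour being requested is not visible here. Assuming it means rejoining words that were hyphenated across a line break, a post-processing pass over plain-text OCR output might look like the sketch below. The function `join_hyphenated` and its lowercase-continuation rule are hypothetical and are not part of Tesseract.

```python
import re

# A letter, a hyphen at the end of a line, then a letter on the next line.
# [^\W\d_] matches any Unicode letter, including å, ä, ö, æ and ø.
_LINE_END_HYPHEN = re.compile(r"([^\W\d_])-\n([^\W\d_])")


def join_hyphenated(text: str) -> str:
    """Rejoin words split by a line-end hyphen (illustrative heuristic).

    Assumption: the hyphen is a line-break artifact only when the
    continuation starts with a lowercase letter; a capitalized
    continuation is treated as a genuine compound and left untouched.
    """
    def repl(match: re.Match) -> str:
        before, after = match.group(1), match.group(2)
        if after.islower():
            return before + after   # drop the hyphen and the line break
        return match.group(0)       # keep the text exactly as it was
    return _LINE_END_HYPHEN.sub(repl, text)
```

A heuristic like this cannot tell a soft hyphen from a real one in words that keep their hyphen at a line end, which is one reason a dictionary lookup inside the OCR engine would be preferable to a regex pass.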

[tesseract-ocr] Strange OCR results from table of contents

2024-01-19 Thread Lars Aronsson
ired: I. Den ældre Stenalders Bopladser . 7. How come? Is it the unusual line spacing that confuses Tesseract? Or the dotted line? Why does it fill in letters where there should be word-separating spaces?