On 05.06.2010 14:57, Jimmy O'Regan wrote:
> On Saturday, June 5, 2010, zdpo <zde...@gmail.com> wrote:
>> Dear Sriranga,
>>
>> your box file is wrong (for tesseract 3.0 and >r319). It does not match
>> the description in "Make Box Files" on
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.
>>
>> BTW: I am not aware of any tool that supports this new box format (for
>> multipage tif).
>
> It shouldn't matter. The code is supposed to accept the old style too,
> provided that the number of pages is set to zero, which is determined
> by the image reading code, which doesn't work on Windows.
>
> If it fails on Linux, then I'd consider it a bug.
Running:

    /usr/local/bin/tesseract slk.arial.001.tif slk.arial.001 makebox batch.nochop

created a slk.arial.001.box file with 6 columns (the last one is 0). When I run:

    /usr/local/bin/unicharset_extractor slk.arial.001.box

the output is OK. When I convert it to the 2.x format:

    awk '{print $1" "$2" "$3" "$4" "$5}' <slk.arial.001.box >slk.arial.002.box

and run:

    /usr/local/bin/unicharset_extractor slk.arial.002.box

I get errors:

    Extracting unicharset from slk.arial.002.box
    Box file format error on line 1 ignored
    ...

Anyway, if Sriranga's tesseract 3.0 produced the old format, then something is wrong in his (Windows) installation/compilation process. Or maybe he simply mixed up outputs from tesseract 2.x and 3.0...

Zd.
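P.S. A quick way to see which format a box file is in is to count the columns on each line. The snippet below is only a sketch: box_format is a made-up helper name, and it assumes that the extra sixth column is the only difference between the 2.x and 3.0 box formats, as in the files above.

```shell
# box_format FILE (hypothetical helper): count how many lines of a box file
# have each number of whitespace-separated columns.
# Assumption: 5 columns means the old 2.x format, 6 columns means 3.0.
box_format() {
    awk '{ count[NF]++ }
         END { for (n in count) printf "%d columns: %d line(s)\n", n, count[n] }' "$1"
}
```

Calling it as `box_format slk.arial.001.box` should report only 6-column lines for a file written by tesseract 3.0; a mix of 5 and 6 would suggest that outputs from two versions were combined.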