By "assert" I just mean verifying that the text is an exact match of the image. You have to check every box file and, where necessary, split, merge, or delete some boxes. Once that is done, I still check the result with this simple command: cat <file> | cut -c 1 | tr '\n' ' '. Then I read every word again until I am satisfied that the box file is absolutely correct. I then store the image and the box file in a directory to be used when I want to create a traineddata file. I am creating several directories, one per type of font.

But since version 3.03, traineddata created from scanned images has had less impact for me. It does have an effect, but I get more negative effects than good ones. I am fighting hard to isolate one single effect. For the moment, the best results are obtained by cleaning the FRA dictionary of seldom-used short words (2 letters). Now I feel the need to set up regression tests over 20 certified box/text pairs in order to measure the impact of a single change.
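The `cut -c 1` check above can also be sketched in Python. This is only a minimal sketch, assuming the usual box-file layout of one `symbol left bottom right top page` line per box; `box_symbols` and `first_mismatch` are hypothetical helper names, not part of Tesseract. Taking the first space-separated field rather than the first character keeps multi-character symbols whole and, depending on the `cut` implementation, avoids cutting a multi-byte UTF-8 character such as é in half.

```python
from typing import List, Optional, Tuple


def box_symbols(box_text: str) -> List[str]:
    """Return the symbol field (first space-separated field) of each box line."""
    symbols = []
    for line in box_text.splitlines():
        if line.strip():
            symbols.append(line.split(" ", 1)[0])
    return symbols


def first_mismatch(box_text: str, certified_text: str) -> Optional[Tuple[int, str, str]]:
    """Compare the box symbols with the certified text, ignoring whitespace.

    Returns None when both agree, otherwise (index, box_symbol, expected_symbol).
    An empty string in the result means that side ran out of symbols.
    """
    got = "".join(box_symbols(box_text))
    expected = "".join(certified_text.split())
    for i, (g, e) in enumerate(zip(got, expected)):
        if g != e:
            return (i, g, e)
    if len(got) != len(expected):
        i = min(len(got), len(expected))
        return (i, got[i:i + 1], expected[i:i + 1])
    return None
```

Run over each certified image/box/text set, a check like this could serve as a first building block for the regression tests mentioned above, since it points at the first symbol where a box file and its certified text diverge.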
Work in progress; ABBY is already off, but I hope for more progress before submitting to my group.

On Tuesday, March 11, 2014 00:08:34 UTC+1, Quan Nguyen wrote:
> Bernard,
>
> What do you mean by "assert a text box of 200 words"? Can you elaborate?
> Thanks.
>
> Quan
>
> On Monday, March 10, 2014 11:06:18 AM UTC-5, Bernard Polarski wrote:
>> Since I have the source, I will recompile it this evening at home and
>> will let you know.
>> It takes an average of 30 min to assert a text box of 200 words using
>> jTessBoxEditor.
>> This is a real issue.
>>
>> On Monday, March 10, 2014 13:31:39 UTC+1, zdenop wrote:
>>> I have not run QBE on Windows for a long time.
>>> Try this (QBE+depends)[1] - I run it on win7 pro 64bit (even though the
>>> app and libs are 32bit, built with mingw 4.8, leptonica 1.70 and
>>> tesseract 3.03rc1)
>>>
>>> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>>>
>>> Zdenko
>>>
>>> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>>>> I downloaded QBE and the additional libraries, but it does not start on
>>>> my Windows Seven. I just get the message that the application has
>>>> stopped working and Windows has to close it.
>>>>
>>>> On Sunday, March 9, 2014 21:19:23 UTC+1, zdenop wrote:
>>>>> If I understood you correctly, you would like to have something
>>>>> like this:
>>>>>
>>>>> tesseract lm-110.jpg lm-110 -l fra makebox
>>>>>
>>>>> that creates a box file, and then some tool that will replace the
>>>>> symbol (text) part of the box file with the content of e.g.
>>>>> lm-110.txt (certified text)? I did this with QBE[1]. But there are
>>>>> some (QBE) limitations:
>>>>>
>>>>> - there must be one symbol per box
>>>>> - the number of boxes must be the same as the count of symbols in
>>>>>   your text file (without spaces)
>>>>>
>>>>> So my workflow was something like this:
>>>>>
>>>>> 1. create the box file (or open the image in QBE - it will offer to
>>>>>    create the box file)
>>>>> 2. remove unnecessary boxes (heading, footer, page numbers, scan
>>>>>    relics...)
>>>>> 3. split multi-symbol boxes (e.g. a single box that contains several
>>>>>    symbols)
>>>>> 4. import the text from an external file (QBE->File->Import...->Import
>>>>>    text file)
>>>>>
>>>>> It still needs user interaction (it is not automatic), but it can
>>>>> help if you need something like that.
>>>>>
>>>>> [1] https://github.com/zdenop/qt-box-editor
>>>>>
>>>>> Zdenko
>>>>>
>>>>> On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski
>>>>> <[email protected]> wrote:
>>>>>> Let me summarize what I am doing and what I am trying to achieve.
>>>>>>
>>>>>> Tesseract is excellent when it comes to recognizing binary fonts
>>>>>> (fonts that come from a computer, printed or directly generated
>>>>>> from an application).
>>>>>>
>>>>>> The match is near perfect, and many times it is perfect.
>>>>>> And it is now easy to train a text for a zillion fonts when it
>>>>>> comes to binary fonts:
>>>>>>
>>>>>> text2image --text=$FIN --outputbase=$FOUT --fonts_dir=$FONT_DIR
>>>>>> --render_per_font --find_fonts
>>>>>>
>>>>>> This will generate a zillion fonts. This is a big plus of
>>>>>> version 3.03. But honestly, this job has already been done at Google.
>>>>>>
>>>>>> But trainings made from binary fonts are disappointing when they are
>>>>>> applied to printed fonts, especially for books from the 19th century.
>>>>>> I belong to a group that edits EPUBs of 19th-century books.
>>>>>> That kind of book comes in collections, and a collection was
>>>>>> often printed on the same machine.
>>>>>>
>>>>>> So instead of creating a library of 'Century old school' fonts, I am
>>>>>> exploring the idea of creating a font dedicated to a publisher for a
>>>>>> given period,
>>>>>> i.e. 'EFlammarion1870.ttf', to be used on these books.
>>>>>>
>>>>>> I have plenty of scripts to automatically generate a
>>>>>> traineddata file, starting from a directory containing img.tif files
>>>>>> and their img.box files.
>>>>>> But it is very time consuming to generate every one of these box files.
>>>>>>
>>>>>> The idea is to start from a set of scanned images and grab a certified
>>>>>> text from a site like Gutenberg (for French, ebooksgratuits.com
>>>>>> provides more books).
>>>>>> A string search on the first 3 words in the certified text, and there
>>>>>> is the needed certified transcription.
>>>>>>
>>>>>> So I am now looking for a method to transform the certified
>>>>>> text into a box file.
>>>>>> I would do this for a few pages in order to quickly generate a new
>>>>>> traineddata and test it.
>>>>>> In this respect, it is clear that jTessBoxEditor is very good,
>>>>>> but the process to generate the box file is too slow, though not
>>>>>> prone to errors.
>>>>>>
>>>>>> Here is a page extracted from "La Maison Nucingen" whose print is
>>>>>> quite bad, so it is interesting.
>>>>>>
>>>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>>>
>>>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>>>
>>>>>> The text:
>>>>>> proposait d’opérer avec ses millions faits d’une
>>>>>> main de papier rose à l’aide d’une pierre litho-
>>>>>> graphique, de jolies petites actions à placer, pré-
>>>>>> cieusement conservées dans son cabinet. Les ac-
>>>>>> tions réelles allaient servir à fonder l’affaire,
>>>>>> acheter un magnifique hôtel et commencer les
>>>>>> opérations. Nucingen se trouvait encore des ac-
>>>>>> tions dans je ne sais quelles mines de plomb ar-
>>>>>> gentifère, dans des mines de houille et dans deux
>>>>>> canaux, actions bénéficiaires accordées pour la
>>>>>> mise en scène de ces quatre entreprises en pleine
>>>>>> activité, supérieurement montées et en faveur, au
>>>>>> moyen du dividende pris sur le capital. Nucin-
>>>>>> gen pouvait compter sur un agio si les actions
>>>>>> montaient, mais le baron le négligea dans ses
>>>>>> calculs, il le laissait à fleur d’eau, sur la place,
>>>>>> afin d’attirer les poissons ! Il avait donc massé
>>>>>> ses valeurs, comme Napoléon massait ses trou-
>>>>>> piers, afin de liquider durant la crise qui se des-
>>>>>> sinait et qui révolutionna, en 26 et 27 les places
>>>>>> européennes. S’il avait eu son prince de Wagram,
>>>>>> il aurait pu dire comme Napoléon du haut du
>>>>>> Santon : « Examinez bien la place, tel jour, à telle
>>>>>> heure, il y aura là des fonds répandus ! » Mais à
>>>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To post to this group, send email to [email protected]
>>>>>> To unsubscribe from this group, send email to
>>>>>> [email protected]
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>
>>>>>> ---
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> For more options, visit https://groups.google.com/d/optout.
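The import step Zdenko describes in the thread (replacing the symbol part of each box line with the certified text) can be sketched in Python as follows. This is a minimal sketch under the two limitations he lists: one symbol per box, and as many boxes as there are non-space symbols in the text. It assumes the usual `symbol left bottom right top page` box-line layout; `import_text_into_box` is a hypothetical name, not a Tesseract or QBE function.

```python
from typing import List


def import_text_into_box(box_text: str, certified_text: str) -> str:
    """Replace the symbol field of each box line with the next certified character.

    Whitespace in the certified text is ignored. Raises ValueError when the
    number of boxes differs from the number of non-space characters.
    """
    chars: List[str] = [c for c in certified_text if not c.isspace()]
    lines: List[str] = [ln for ln in box_text.splitlines() if ln.strip()]
    if len(lines) != len(chars):
        raise ValueError(
            f"{len(lines)} boxes but {len(chars)} symbols in the certified text"
        )
    out = []
    for line, ch in zip(lines, chars):
        # Keep the coordinates and page number, swap only the symbol.
        _, coords = line.split(" ", 1)
        out.append(f"{ch} {coords}")
    return "\n".join(out) + "\n"
```

The count check is deliberately strict: when it fails, some boxes still need to be split, merged, or deleted by hand before the text can be imported, which is the manual part of the workflow that remains.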

