Since I have the source, I will recompile it this evening at home and let you know. It takes an average of 30 minutes to check the boxes for a 200-word text using jTessBoxEditor. This is a real issue.

On Monday, 10 March 2014 13:31:39 UTC+1, zdenop wrote:
> I have not run QBE on Windows for a long time.
> Try this (QBE + dependencies) [1]. I run it on Win7 Pro 64-bit (even though the app and libs are
> 32-bit, built with mingw 4.8, leptonica 1.70 and tesseract 3.03rc1).
>
> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>
> Zdenko
>
>
> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>
>> I downloaded QBE and the additional libraries, but it does not start on
>> my Windows 7. I just get the message that the application has stopped
>> working and Windows has to close it.
>>
>>
>> On Sunday, 9 March 2014 21:19:23 UTC+1, zdenop wrote:
>>>
>>> If I understood you correctly, you would like to have something like
>>> this:
>>>
>>> tesseract lm-110.jpg lm-110 -l fra makebox
>>>
>>> that creates a box file, and then some tool that replaces the symbol (text)
>>> part of the box file with the content of e.g. lm-110.txt (the certified text)? I did
>>> this with QBE [1]. But there are some (QBE) limitations:
>>>
>>> - there must be one symbol per box
>>> - the number of boxes must be the same as the count of symbols in your text
>>> file (without spaces)
>>>
>>> So my workflow was something like this:
>>>
>>> 1. create the box file (or open the image in QBE; it will offer to
>>> create the box file)
>>> 2. remove unnecessary boxes (headings, footers, page numbers, scan
>>> relics...)
>>> 3. split multi-symbol boxes (e.g. where one box contained several
>>> symbols)
>>> 4. import text from an external file (QBE->File->Import...->Import text
>>> file)
>>>
>>> It still needs user interaction (it is not automatic), but it can help if you
>>> need something like that.
>>>
>>> [1] https://github.com/zdenop/qt-box-editor
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski <[email protected]> wrote:
>>>
>>>> Let me summarize what I am doing and what I am trying to achieve.
>>>>
>>>> Tesseract is excellent at recognizing binary fonts
>>>> (fonts that come from a computer, whether printed or generated directly by an
>>>> application).
>>>>
>>>> The match is near perfect, and often it is perfect.
>>>> It is now easy to train on a text with a zillion fonts when it comes
>>>> to binary fonts:
>>>>
>>>> text2image --text=$FIN --outputbase=$FOUT --fonts_dir=$FONT_DIR
>>>> --render_per_font --find_fonts
>>>>
>>>> This generates output for a zillion fonts. This is a big plus of version
>>>> 3.03. But honestly, this job has already been done at Google.
>>>>
>>>> Training on binary fonts, however, gives disappointing results when applied
>>>> to printed fonts, especially for books from the 19th century.
>>>> I belong to a group that edits EPUBs of 19th-century books.
>>>> Such books come in collections, and the collections were often
>>>> printed on the same machine.
>>>>
>>>> So instead of creating a library of 'Century old school' fonts, I am
>>>> exploring the idea of creating a font dedicated to one publisher for a
>>>> given period,
>>>> e.g. 'EFlammarion1870.ttf', to be used on these books.
>>>>
>>>> I have plenty of scripts to generate a traineddata file
>>>> automatically, starting from a directory containing img.tif files and
>>>> their img.box files.
>>>> But it is very time consuming to generate every one of these box files.
>>>>
>>>> The idea is to start from a set of scanned images and grab a certified text
>>>> from a site like Gutenberg (for French, ebooksgratuits.com provides more
>>>> books).
>>>> A string search on the first 3 words of the page in the certified text locates
>>>> the needed certified transcription.
>>>>
>>>> So I am now looking for a method to transform the certified
>>>> text into a box file,
>>>> doing this for a few pages in order to quickly generate a new
>>>> traineddata and test it.
>>>> In this respect, it is clear that jTessBoxEditor is very good,
>>>> but the process of generating the box file with it is too slow and prone to errors.
>>>>
>>>>
>>>> Here is a page extracted from "La maison nucingen" whose print is
>>>> quite bad, so it is interesting.
>>>>
>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>
>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>
>>>>
>>>> The text:
>>>> proposait d’opérer avec ses millions faits d’une
>>>> main de papier rose à l’aide d’une pierre litho-
>>>> graphique, de jolies petites actions à placer, pré-
>>>> cieusement conservées dans son cabinet. Les ac-
>>>> tions réelles allaient servir à fonder l’affaire,
>>>> acheter un magnifique hôtel et commencer les
>>>> opérations. Nucingen se trouvait encore des ac-
>>>> tions dans je ne sais quelles mines de plomb ar-
>>>> gentifère, dans des mines de houille et dans deux
>>>> canaux, actions bénéficiaires accordées pour la
>>>> mise en scène de ces quatre entreprises en pleine
>>>> activité, supérieurement montées et en faveur, au
>>>> moyen du dividende pris sur le capital. Nucin-
>>>> gen pouvait compter sur un agio si les actions
>>>> montaient, mais le baron le négligea dans ses
>>>> calculs, il le laissait à fleur d’eau, sur la place,
>>>> afin d’attirer les poissons ! Il avait donc massé
>>>> ses valeurs, comme Napoléon massait ses trou-
>>>> piers, afin de liquider durant la crise qui se des-
>>>> sinait et qui révolutionna, en 26 et 27 les places
>>>> européennes. S’il avait eu son prince de Wagram,
>>>> il aurait pu dire comme Napoléon du haut du
>>>> Santon : « Examinez bien la place, tel jour, à telle
>>>> heure, il y aura là des fonds répandus ! » Mais à
>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna
>>>>
>>>>
>>>> --
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected]
>>>>
>>>> To unsubscribe from this group, send email to
>>>> [email protected]
>>>>
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
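
For reference, the text-import step of zdenop's workflow quoted above (replace the symbol part of each box with the certified text) can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not how QBE implements it: it assumes the box file has one box per line in the usual Tesseract layout (`symbol left bottom right top page`), that the boxes have already been cleaned up so there is exactly one symbol per box, and that the certified text follows the printed page character for character, including end-of-line hyphens. The function name `replace_box_symbols` is made up for this example.

```python
def replace_box_symbols(box_lines, certified_text):
    """Return new box lines whose symbols come from the certified text.

    box_lines: iterable of strings such as "p 10 20 30 40 0"
               (assumed layout: symbol left bottom right top page).
    certified_text: the proofread text for the same page.
    """
    # Whitespace never gets a box of its own, so drop it from the text.
    symbols = [ch for ch in certified_text if not ch.isspace()]

    # Ignore empty lines in the box file.
    boxes = [line.strip() for line in box_lines if line.strip()]

    # Same constraint as in the thread: one symbol per box, equal counts.
    if len(symbols) != len(boxes):
        raise ValueError(
            f"{len(boxes)} boxes but {len(symbols)} symbols in the text"
        )

    result = []
    for symbol, line in zip(symbols, boxes):
        # Keep the coordinates, replace only the leading symbol field.
        _old_symbol, coords = line.split(None, 1)
        result.append(f"{symbol} {coords}")
    return result
```

To try it on the page from the thread, one would read `lm-110.box` (as produced by the `makebox` command) and `lm-110.txt` as UTF-8, pass them to the function, and write the returned lines back to the box file. A count mismatch raises an error, which is the cue that some boxes still need to be split or removed by hand.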

