I have not run QBE on Windows for a long time. Try this build (QBE plus dependencies)[1]. It runs for me on Windows 7 Pro 64-bit (even though the application and libraries are 32-bit, built with MinGW 4.8, Leptonica 1.70 and Tesseract 3.03rc1).
[1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP

Zdenko

On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:

> I downloaded QBE and the additional libraries, but it does not start on my
> Windows 7. I just get the message that the application has stopped working
> and Windows has to close it.
>
>
> On Sunday, March 9, 2014 at 21:19:23 UTC+1, zdenop wrote:
>>
>> If I understood you correctly, you would like to have something like
>> this:
>>
>> tesseract lm-110.jpg lm-110 -l fra makebox
>>
>> which creates a box file, and then some tool that replaces the symbol
>> (text) part of the box file with the content of e.g. lm-110.txt (the
>> certified text)? I did this with QBE[1]. But there are some (QBE)
>> limitations:
>>
>> - there must be one symbol per box
>> - the number of boxes must be the same as the number of symbols in your
>>   text file (not counting spaces)
>>
>> So my workflow was something like this:
>>
>> 1. create the box file (or open the image in QBE, which will offer to
>>    create the box file)
>> 2. remove unnecessary boxes (headings, footers, page numbers, scan
>>    artifacts...)
>> 3. split multi-symbol boxes (i.e. boxes that contain more than one
>>    symbol)
>> 4. import the text from an external file (QBE->File->Import...->Import
>>    text file)
>>
>> It still needs user interaction (it is not automatic), but it can help
>> if you need something like that.
>>
>> [1] https://github.com/zdenop/qt-box-editor
>>
>> Zdenko
>>
>>
>> On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski <[email protected]> wrote:
>>
>>> Let me summarize what I am doing and what I am trying to achieve.
>>>
>>> Tesseract is excellent at recognizing binary fonts (fonts that come
>>> from a computer, either printed or generated directly by an
>>> application).
>>>
>>> The match is near perfect, and many times it is perfect.
>>> And it is now easy to train on a text in a zillion fonts when it comes
>>> to binary fonts:
>>>
>>> text2image --text=$FIN --outputbase=$FOUT --fonts_dir=$FONT_DIR
>>> --render_per_font --find_fonts
>>>
>>> This will generate a zillion fonts. This is a big plus of version
>>> 3.03. But honestly, this job has already been done at Google.
>>>
>>> But training data built from binary fonts is disappointing when applied
>>> to printed fonts, especially for books from the 19th century.
>>> I belong to a group that edits EPUBs of 19th-century books.
>>> Such books come in collections, and the collections were often
>>> printed on the same machine.
>>>
>>> So instead of creating a library for a 'Century Old School' font, I am
>>> exploring the idea of creating a font dedicated to a publisher for a
>>> given period,
>>> i.e. 'EFlammarion1870.ttf', to be used on these books.
>>>
>>> I have plenty of scripts to automatically generate a traineddata
>>> file, starting from a directory containing img.tif files and their img.box.
>>> But it is very time-consuming to generate every one of these box files.
>>>
>>> The idea is to start from a set of scanned images and grab a certified text
>>> from a site like Gutenberg (for French, ebooksgratuits.com provides more
>>> books).
>>> A string search on the first 3 words in the certified text, and there is
>>> the certified text we need.
>>>
>>> So I am now looking for a method to transform the certified text
>>> into a box file,
>>> doing this for some pages in order to quickly generate a new
>>> traineddata and test it.
>>> In this respect, it is clear that JTessBoxEditor is very good,
>>> but its process
>>> for generating the box file is too slow and error-prone.
>>>
>>>
>>> Here is a page extracted from "La maison nucingen" whose print is quite
>>>> bad, so it is interesting.
>>>>
>>>
>>>
>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>
>>>
>>>
>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>
>>>
>>> The text:
>>> proposait d’opérer avec ses millions faits d’une
>>> main de papier rose à l’aide d’une pierre litho-
>>> graphique, de jolies petites actions à placer, pré-
>>> cieusement conservées dans son cabinet. Les ac-
>>> tions réelles allaient servir à fonder l’affaire,
>>> acheter un magnifique hôtel et commencer les
>>> opérations. Nucingen se trouvait encore des ac-
>>> tions dans je ne sais quelles mines de plomb ar-
>>> gentifère, dans des mines de houille et dans deux
>>> canaux, actions bénéficiaires accordées pour la
>>> mise en scène de ces quatre entreprises en pleine
>>> activité, supérieurement montées et en faveur, au
>>> moyen du dividende pris sur le capital. Nucin-
>>> gen pouvait compter sur un agio si les actions
>>> montaient, mais le baron le négligea dans ses
>>> calculs, il le laissait à fleur d’eau, sur la place,
>>> afin d’attirer les poissons ! Il avait donc massé
>>> ses valeurs, comme Napoléon massait ses trou-
>>> piers, afin de liquider durant la crise qui se des-
>>> sinait et qui révolutionna, en 26 et 27 les places
>>> européennes. S’il avait eu son prince de Wagram,
>>> il aurait pu dire comme Napoléon du haut du
>>> Santon : « Examinez bien la place, tel jour, à telle
>>> heure, il y aura là des fonds répandus ! » Mais à
>>> qui pouvait-il se confier ? Du Tillet ne soupçonna
>>>
>>>
>>> --
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>>
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>>
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> For more options, visit https://groups.google.com/d/optout.
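
The text-import step discussed in the thread (replace the symbol part of each box with the matching character from the certified text) could be sketched roughly as follows. This is not QBE's code. It is a minimal illustration that assumes the Tesseract 3.x box format of one box per line, "symbol left bottom right top page", and it enforces the two limitations named above: one symbol per box, and as many boxes as non-space characters. The function name replace_box_symbols is hypothetical.

```python
def replace_box_symbols(box_lines, certified_text):
    """Return box lines whose symbol field is replaced by the successive
    non-whitespace characters of certified_text.

    Assumes each box line looks like: "<symbol> <left> <bottom> <right> <top> <page>".
    """
    # Spaces and line breaks have no box of their own, so drop them.
    symbols = [ch for ch in certified_text if not ch.isspace()]
    boxes = [line.strip() for line in box_lines if line.strip()]

    # The number of boxes must match the number of symbols in the text.
    if len(boxes) != len(symbols):
        raise ValueError(
            "box count (%d) differs from symbol count (%d)"
            % (len(boxes), len(symbols))
        )

    result = []
    for line, symbol in zip(boxes, symbols):
        # Split from the right so the five numeric fields are kept intact
        # and only the leading symbol field is swapped out.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            raise ValueError("unexpected box line: %r" % line)
        result.append(" ".join([symbol] + fields[1:]))
    return result


if __name__ == "__main__":
    # Tiny illustrative input; the coordinates are made up.
    ocr_boxes = ["x 10 20 30 40 0", "y 32 20 50 40 0", "z 52 20 70 40 0"]
    for box in replace_box_symbols(ocr_boxes, "d' u"):
        print(box)
```

Note that end-of-line hyphens in the certified text (as in "litho-" above) count as symbols, so they need a box of their own. Steps 2 and 3 of the workflow (removing stray boxes, splitting multi-symbol boxes) still have to be done by hand before the counts will match.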

