Bernard,

What do you mean by "assert a text box of 200 words"? Can you elaborate? Thanks.
Quan

On Monday, March 10, 2014 11:06:18 AM UTC-5, Bernard Polarski wrote:
>
> Since I have the source, I will recompile it this evening at home and
> will let you know.
> It takes an average of 30 min to assert a text box of 200 words using
> JTessBoxEditor.
> This is a real issue.
>
> On Monday, March 10, 2014 13:31:39 UTC+1, zdenop wrote:
>
>> I have not run QBE on Windows for a long time.
>> Try this (QBE+depends)[1] - I run it on Win7 Pro 64-bit (even though
>> the app and libs are 32-bit, built with mingw 4.8, leptonica 1.70 and
>> tesseract 3.03rc1)
>>
>> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>>
>> Zdenko
>>
>>
>> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>>
>>> I downloaded QBE and the additional libraries, but it does not start
>>> on my Windows 7. I just get the message that the application has
>>> stopped working and Windows has to close it.
>>>
>>>
>>> On Sunday, March 9, 2014 21:19:23 UTC+1, zdenop wrote:
>>>>
>>>> If I understood you correctly - you would like to have something
>>>> like this:
>>>>
>>>> tesseract lm-110.jpg lm-110 -l fra makebox
>>>>
>>>>
>>>> that creates a box file, and then some tool that will replace the
>>>> symbol (text) part of the box file with the content of e.g.
>>>> lm-110.txt (certified text)? I did this with QBE[1]. But there are
>>>> some (QBE) limitations:
>>>>
>>>> - there must be one symbol per box
>>>> - the number of boxes must be the same as the count of symbols in
>>>>   your text file (without spaces)
>>>>
>>>> So my workflow was something like this:
>>>>
>>>> 1. create the box file (or open the image in QBE - it will offer to
>>>>    create the box file)
>>>> 2. remove unnecessary boxes (heading, footer, page numbers, scan
>>>>    relics...)
>>>> 3. split multi-symbol boxes (e.g. where one box contained more than
>>>>    one symbol)
>>>> 4. import the text from an external file
>>>>    (QBE->File->Import...->Import text file)
>>>>
>>>> It still needs user interaction (it is not automatic), but it can
>>>> help if you need something like that.
>>>>
>>>> [1] https://github.com/zdenop/qt-box-editor
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski <[email protected]> wrote:
>>>>
>>>>> Let me summarize what I am doing and what I am trying to achieve.
>>>>>
>>>>> Tesseract is excellent when it comes to recognizing binary fonts
>>>>> (fonts that come from a computer, whether printed or generated
>>>>> directly by an application).
>>>>>
>>>>> The match is near perfect, and many times it is perfect.
>>>>> And it is now easy to train on a text for a zillion fonts when it
>>>>> comes to binary fonts:
>>>>>
>>>>> text2image --text=$FIN --outputbase=$FOUT --fonts_dir=$FONT_DIR
>>>>>     --render_per_font --find_fonts
>>>>>
>>>>> This will generate a zillion fonts. This is a big plus of version
>>>>> 3.03. But honestly this job has been done at Google.
>>>>>
>>>>> But training from binary fonts is disappointing when it is applied
>>>>> to printed fonts, especially for books from the 19th century.
>>>>> I belong to a group that edits EPUBs of 19th-century books.
>>>>> Books of that kind come in collections, and the collections were
>>>>> often printed on the same machine.
>>>>>
>>>>> So instead of creating a library for the 'Century old school' font,
>>>>> I am exploring the idea of creating a font dedicated to a publisher
>>>>> for a given period, i.e. 'EFlammarion1870.ttf', to be used on these
>>>>> books.
>>>>>
>>>>> I have plenty of scripts to automatically generate a traineddata
>>>>> file, starting from a directory containing img.tif files and their
>>>>> img.box files.
>>>>> But it is very time-consuming to generate every one of these box
>>>>> files.
>>>>>
>>>>> The idea is to start from a set of scanned images and grab a
>>>>> certified text from a site like Gutenberg (for French,
>>>>> ebooksgratuits.com provides more books).
>>>>> A string search for the first 3 words of the page in the certified
>>>>> text locates the needed certified text.
>>>>>
>>>>> So I am now looking for a method to transform the certified text
>>>>> into a box file, doing this for some pages in order to quickly
>>>>> generate a new traineddata and test it.
>>>>> In this respect, it is clear that JTessBoxEditor is very good, but
>>>>> the process to generate the box file is too slow and error-prone.
>>>>>
>>>>>
>>>>> Here is a page extracted from "La maison nucingen" whose print is
>>>>> quite bad, so it is interesting.
>>>>>
>>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>>
>>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>>
>>>>>
>>>>> The text:
>>>>> proposait d’opérer avec ses millions faits d’une
>>>>> main de papier rose à l’aide d’une pierre litho-
>>>>> graphique, de jolies petites actions à placer, pré-
>>>>> cieusement conservées dans son cabinet. Les ac-
>>>>> tions réelles allaient servir à fonder l’affaire,
>>>>> acheter un magnifique hôtel et commencer les
>>>>> opérations. Nucingen se trouvait encore des ac-
>>>>> tions dans je ne sais quelles mines de plomb ar-
>>>>> gentifère, dans des mines de houille et dans deux
>>>>> canaux, actions bénéficiaires accordées pour la
>>>>> mise en scène de ces quatre entreprises en pleine
>>>>> activité, supérieurement montées et en faveur, au
>>>>> moyen du dividende pris sur le capital. Nucin-
>>>>> gen pouvait compter sur un agio si les actions
>>>>> montaient, mais le baron le négligea dans ses
>>>>> calculs, il le laissait à fleur d’eau, sur la place,
>>>>> afin d’attirer les poissons ! Il avait donc massé
>>>>> ses valeurs, comme Napoléon massait ses trou-
>>>>> piers, afin de liquider durant la crise qui se des-
>>>>> sinait et qui révolutionna, en 26 et 27 les places
>>>>> européennes. S’il avait eu son prince de Wagram,
>>>>> il aurait pu dire comme Napoléon du haut du
>>>>> Santon : « Examinez bien la place, tel jour, à telle
>>>>> heure, il y aura là des fonds répandus ! » Mais à
>>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna
>>>>>

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

---
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
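
For reference, the text-import step in Zdenko's workflow (step 4: replace the symbol part of each box with the certified text) could be sketched in Python roughly as follows. This is only an illustration, not QBE's actual code. `relabel_boxes` is a hypothetical helper name, and the sketch assumes the usual Tesseract 3.x box line layout of `<symbol> <left> <bottom> <right> <top> <page>`, plus the two limitations quoted above: one symbol per box, and as many boxes as non-space characters in the text.

```python
def relabel_boxes(box_lines, certified_text):
    """Replace the symbol of each box line with the matching character
    of the certified text (whitespace in the text is ignored).

    Hypothetical helper: assumes one symbol per box and box lines of the
    form "<symbol> <left> <bottom> <right> <top> <page>".
    """
    # Ignore blank lines in the box file.
    boxes = [line.strip() for line in box_lines if line.strip()]
    # One entry per non-space character, in reading order.
    symbols = [ch for ch in certified_text if not ch.isspace()]

    if len(boxes) != len(symbols):
        raise ValueError(
            "box count (%d) differs from symbol count (%d)"
            % (len(boxes), len(symbols))
        )

    relabelled = []
    for line, symbol in zip(boxes, symbols):
        # Split from the right so the five numeric fields are kept intact.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            raise ValueError("malformed box line: %r" % line)
        relabelled.append(" ".join([symbol] + parts[1:]))
    return relabelled


if __name__ == "__main__":
    # Tiny made-up example: three boxes with wrong OCR symbols.
    sample_boxes = ["x 10 20 30 40 0", "y 35 20 50 40 0", "z 55 20 70 40 0"]
    for box in relabel_boxes(sample_boxes, "l’ a"):
        print(box)
```

The end-of-line hyphens in the certified text above are printed on the page, so they count as symbols and need a box each. If the box count and the symbol count differ, the sketch stops with an error instead of guessing, which corresponds to going back to steps 2 and 3 (removing stray boxes and splitting multi-symbol boxes) before importing.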

