Re: Create boxfile from a certified text

Bernard Polarski Mon, 10 Mar 2014 09:06:45 -0700

Since I have the source, I will recompile it this evening at home and will 
let you know.
It takes an average of 30 minutes to validate the boxes for a 200-word text 
using JTessBoxEditor. 
This is a real issue.
 
On Monday, 10 March 2014 13:31:39 UTC+1, zdenop wrote:


> I have not run QBE on Windows for a long time. 
> Try this (QBE + dependencies)[1]. I ran it on Win7 Pro 64-bit (even though 
> the app and libs are 32-bit, built with MinGW 4.8, Leptonica 1.70 and 
> Tesseract 3.03rc1). 
>
> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>
> Zdenko
>
>
> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>
>> I downloaded QBE and the additional libraries, but it does not start on 
>> my Windows 7. I just get the message that the application has stopped 
>> working and Windows has to close it. 
>>
>>
>> On Sunday, 9 March 2014 21:19:23 UTC+1, zdenop wrote: 
>>>
>>>  If I understood you correctly, you would like to have something like 
>>> this: 
>>>
>>>  tesseract lm-110.jpg lm-110 -l fra makebox
>>>
>>>
>>> that creates a box file, and then some tool that replaces the symbol 
>>> (text) part of the box file with the content of e.g. lm-110.txt (the 
>>> certified text)? I did this with QBE[1]. But there are some (QBE) 
>>> limitations:
>>>  
>>>    - there must be one symbol per box  
>>>    - number of boxes must be the same as count of symbols in your text 
>>>    file (without spaces)
>>>
>>>  So my workflow was something like this:
>>>  
>>>    1. create box file (or open image in QBE - it will offer you to 
>>>    create box file)
>>>    2. remove unnecessary boxes (heading, footer, page numbers, scan 
>>>    relics...) 
>>>    3. split multi-symbol boxes (e.g. where one box contained more than 
>>>    one symbol) 
>>>    4. import text from external file (QBE->File->Import...->Import text 
>>>    file)
>>>
>>> It still needs user interaction (it is not automatic), but it can help 
>>> if you need something like that.
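
A minimal sketch of the replacement step described above, in Python. This is not QBE's implementation. It assumes the Tesseract 3.x box format of one line per symbol ("symbol left bottom right top page"), and the function name replace_box_symbols is hypothetical. It enforces the same limitation: the number of boxes must equal the number of non-space symbols in the certified text.

```python
def replace_box_symbols(box_lines, certified_text):
    """Return box lines whose symbols are replaced by the certified text.

    Assumes one symbol per box and the layout
    "symbol left bottom right top page" on every line.
    """
    # Spaces and line breaks have no box of their own, so drop them.
    symbols = [ch for ch in certified_text if not ch.isspace()]
    boxes = [line.split() for line in box_lines if line.strip()]
    if len(symbols) != len(boxes):
        raise ValueError(
            f"{len(boxes)} boxes but {len(symbols)} symbols; "
            "split or remove boxes first"
        )
    # Keep the coordinates (the last five fields), swap only the symbol.
    return [" ".join([sym] + fields[-5:]) for sym, fields in zip(symbols, boxes)]
```

In use, box_lines would be read from a file such as lm-110.box and certified_text from lm-110.txt, both opened as UTF-8. A count mismatch raises an error rather than guessing, because a single extra or missing box would shift every symbol after it.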
>>>
>>> [1] https://github.com/zdenop/qt-box-editor
>>>  
>>>  Zdenko
>>>
>>>
>>>  On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski <[email protected]> wrote:
>>>
>>>>   Let me summarize what I am doing and what I am trying to achieve.
>>>>
>>>> Tesseract is excellent at recognizing binary fonts (fonts that come 
>>>> from a computer, printed or directly generated from an application). 
>>>>
>>>> The match is near perfect, and many times it is perfect. 
>>>> And it is now easy to train on a text in one zillion fonts when it 
>>>> comes to binary fonts:
>>>>
>>>>    text2image --text=$FIN  --outputbase=$FOUT  --fonts_dir=$FONT_DIR 
>>>> --render_per_font --find_fonts
>>>>
>>>> This will render the text in one zillion fonts. This is a big plus of 
>>>> version 3.03. But honestly, this job has already been done at Google.
>>>>
>>>> But training on binary fonts gives disappointing results when applied 
>>>> to printed fonts, especially for books from the 19th century.
>>>> I belong to a group that edits EPUBs of 19th-century books.
>>>> Such books come in collections, and the collections were often 
>>>> printed on the same machine.
>>>>
>>>> So instead of creating a library of 'Century old school' fonts, I am 
>>>> exploring the idea of creating a font dedicated to one publisher for 
>>>> a given period, 
>>>> i.e. 'EFlammarion1870.ttf', to be used on these books.
>>>>
>>>> I have plenty of scripts to automatically generate a traineddata 
>>>> file, starting from a directory containing img.tif files and their 
>>>> img.box files.
>>>> But it is very time-consuming to generate every one of these box files.
>>>>
>>>> The idea is to start from a set of scanned images and grab a certified 
>>>> text from a site like Gutenberg (for French, ebooksgratuits.com 
>>>> provides more books).
>>>> A string search for the first 3 words in the certified text, and 
>>>> there is the needed certified text for the page.
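
A minimal sketch of that search step, in Python. It assumes that "the first 3 words" means the first three words of the scanned page as read by OCR, and that those words were recognized correctly. The function name find_page_start is hypothetical.

```python
import re


def find_page_start(book_text, ocr_page_text, n_words=3):
    """Return the offset in book_text where the page begins, or -1."""
    words = ocr_page_text.split()[:n_words]
    if len(words) < n_words:
        return -1
    # Allow any run of whitespace (including line breaks) between the words.
    pattern = r"\s+".join(re.escape(w) for w in words)
    match = re.search(pattern, book_text)
    return match.start() if match else -1
```

An exact match fails as soon as OCR misreads one of the three words, so in practice the search might need to fall back to a later group of words on the page.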
>>>>
>>>> So I am now looking for a method to transform the certified text 
>>>> into a box file, 
>>>> doing this for a few pages in order to quickly generate a new 
>>>> traineddata file and test it.
>>>> In this respect, it is clear that JTessBoxEditor is very good, but 
>>>> the process of generating the box file is too slow and prone to 
>>>> errors.
>>>>
>>>>
>>>> Here is a page extracted from "La Maison Nucingen" whose print is 
>>>> quite bad, so it is interesting.
>>>>
>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>
>>>>
>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>
>>>>
>>>> The text :
>>>> proposait d’opérer avec ses millions faits d’une
>>>> main de papier rose à l’aide d’une pierre litho-
>>>> graphique, de jolies petites actions à placer, pré-
>>>> cieusement conservées dans son cabinet. Les ac-
>>>> tions réelles allaient servir à fonder l’affaire,
>>>> acheter un magnifique hôtel et commencer les
>>>> opérations. Nucingen se trouvait encore des ac-
>>>> tions dans je ne sais quelles mines de plomb ar-
>>>> gentifère, dans des mines de houille et dans deux
>>>> canaux, actions bénéficiaires accordées pour la 
>>>> mise en scène de ces quatre entreprises en pleine
>>>> activité, supérieurement montées et en faveur, au
>>>> moyen du dividende pris sur le capital. Nucin-
>>>> gen pouvait compter sur un agio si les actions 
>>>> montaient, mais le baron le négligea dans ses 
>>>> calculs, il le laissait à fleur d’eau, sur la place, 
>>>> afin d’attirer les poissons ! Il avait donc massé 
>>>> ses valeurs, comme Napoléon massait ses trou-
>>>> piers, afin de liquider durant la crise qui se des-
>>>> sinait et qui révolutionna, en 26 et 27 les places 
>>>> européennes. S’il avait eu son prince de Wagram, 
>>>> il aurait pu dire comme Napoléon du haut du 
>>>> Santon : « Examinez bien la place, tel jour, à telle 
>>>> heure, il y aura là des fonds répandus ! » Mais à 
>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna 
>>>>
>>>>
>>>>  
>>>>   
>>>> -- 
>>>> -- 
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected] 
>>>>
>>>> To unsubscribe from this group, send email to
>>>> [email protected] 
>>>>
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected]. 
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>
