Re: Create boxfile from a certified text

Bernard Polarski Tue, 11 Mar 2014 00:04:29 -0700

By "assert" I just mean verifying that the text is an exact match of the 
image. You have to check every box file and, where needed, split, merge, 
or delete some boxes. Once that is done, I still compare the result using 
this simple command: cat <file> | cut -c 1 | tr '\n' ' '.
Then I read every word again until I am satisfied that the box file is 
absolutely correct. I then store the image and the box file in a directory 
to be used when I want to create a traineddata file. I am building several 
directories, one per type of font. But since version 3.03, traineddata 
created from scanned images has had less impact for me. It does have an 
effect, but I get more negative impacts than good ones. I am fighting 
hard to isolate each individual effect. For the moment, the best results 
are obtained by removing seldom-used short words (2 letters) from the FRA 
dictionary. Now I feel the need to set up regression tests over 20 
certified box/text pairs in order to measure the impact of a single change.
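
The comparison of a box file against its certified text can also be 
scripted. Below is a minimal Python sketch, not something taken from this 
thread. It assumes the usual Tesseract 3.x box-file layout (glyph, left, 
bottom, right, top, page, separated by spaces), UTF-8 files, and a 
certified text whose hyphenation matches the image. The names box_glyphs 
and first_mismatch are hypothetical. It compares whole characters rather 
than bytes, which may matter for accented letters if cut -c counts bytes 
on your system.

```python
import unicodedata


def box_glyphs(box_lines):
    """Return the glyph (first field) of every non-empty box-file line."""
    glyphs = []
    for line in box_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumed layout: <glyph> <left> <bottom> <right> <top> <page>
        glyphs.append(fields[0])
    return glyphs


def first_mismatch(box_lines, certified_text):
    """Compare box glyphs with the certified text, ignoring whitespace.

    Returns None when they agree, otherwise a tuple
    (index, glyph_in_box, expected_char) for the first difference.
    """
    got = unicodedata.normalize("NFC", "".join(box_glyphs(box_lines)))
    want = unicodedata.normalize("NFC", "".join(certified_text.split()))
    for i, (g, w) in enumerate(zip(got, want)):
        if g != w:
            return i, g, w
    if len(got) != len(want):
        # One side ran out first: report the first unmatched position.
        i = min(len(got), len(want))
        return i, got[i:i + 1], want[i:i + 1]
    return None
```

A call could look like first_mismatch(open("lm-110.box", encoding="utf-8"), 
open("lm-110.txt", encoding="utf-8").read()), where the file names are 
only illustrative. A result of None means every box carries the expected 
symbol. Any other result gives the position of the first box to inspect.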


This is a work in progress. ABBY is already off, but I hope to make more 
progress before submitting to my group.

On Tuesday, March 11, 2014 00:08:34 UTC+1, Quan Nguyen wrote:
>
> Bernard,
>
> What do you mean by "assert a text box of 200 words"? Can you elaborate? 
> Thanks.
>
> Quan
>
> On Monday, March 10, 2014 11:06:18 AM UTC-5, Bernard Polarski wrote:
>>
>>
>> Since I have the source, I will recompile it this evening at home and 
>> will let you know.
>> It takes an average of 30 min to assert a text box of 200 words using 
>> JTessBoxEditor. 
>> This is a real issue.
>>  
>> On Monday, March 10, 2014 13:31:39 UTC+1, zdenop wrote:
>>
>>> I have not run QBE on Windows for a long time. 
>>> Try this (QBE plus dependencies)[1]. I run it on Win7 Pro 64-bit (even 
>>> though the app and libs are 32-bit, built with mingw 4.8, leptonica 
>>> 1.70 and tesseract 3.03rc1). 
>>>
>>> [1] http://www.sk-spell.sk.cx/tmp/qtb-1.11.1.ZIP
>>>
>>> Zdenko
>>>
>>>
>>> On Mon, Mar 10, 2014 at 7:21 AM, Bernard Polarski <[email protected]> wrote:
>>>
>>>> I downloaded QBE and the additional libraries, but it does not start on 
>>>> my Windows 7. I just get the message that the application has stopped 
>>>> working and Windows has to close it. 
>>>>
>>>>
>>>> On Sunday, March 9, 2014 21:19:23 UTC+1, zdenop wrote: 
>>>>>
>>>>>  If I understood you correctly, you would like to have something 
>>>>> like this: 
>>>>>
>>>>>  tesseract lm-110.jpg lm-110 -l fra makebox
>>>>>
>>>>>
>>>>> that creates a box file, and then some tool that will replace the 
>>>>> symbol (text) part of the box file with the content of e.g. lm-110.txt 
>>>>> (the certified text)? I did this with QBE[1]. But there are some QBE 
>>>>> limitations:
>>>>>  
>>>>>    - there must be exactly one symbol per box  
>>>>>    - the number of boxes must equal the number of symbols in your 
>>>>>    text file (not counting spaces)
>>>>>
>>>>>  So my workflow was something like this:
>>>>>  
>>>>>    1. create box file (or open image in QBE - it will offer you to 
>>>>>    create box file)
>>>>>    2. remove unnecessary boxes (heading, footer, page numbers, scan 
>>>>>    relics...) 
>>>>>    3. split multi-symbol boxes (i.e. where one box contained more than 
>>>>>    one symbol) 
>>>>>    4. import text from external file (QBE->File->Import...->Import 
>>>>>    text file)
>>>>>
>>>>> It still needs user interaction (it is not automatic), but it can 
>>>>> help if you need something like that.
>>>>>
>>>>> [1] https://github.com/zdenop/qt-box-editor
>>>>>  
>>>>>  Zdenko
>>>>>
>>>>>
>>>>>  On Sat, Mar 8, 2014 at 7:47 PM, Bernard Polarski 
>>>>> <[email protected]> wrote:
>>>>>
>>>>>>   Let me summarize what I am doing and what I am trying to achieve.
>>>>>>
>>>>>> Tesseract is excellent when it comes to recognizing binary fonts 
>>>>>> (fonts that come from a computer, printed or directly generated from 
>>>>>> an application). 
>>>>>>
>>>>>> The match is near perfect, and many times it is perfect. 
>>>>>> And it is now easy to train on a text with a zillion fonts when it 
>>>>>> comes to binary fonts:
>>>>>>
>>>>>>    text2image --text=$FIN  --outputbase=$FOUT  --fonts_dir=$FONT_DIR 
>>>>>> --render_per_font --find_fonts
>>>>>>
>>>>>> This will generate a zillion fonts. This is a big plus of 
>>>>>> version 3.03. But honestly, this job has already been done at Google.
>>>>>>
>>>>>> But training from binary fonts is disappointing when it is applied 
>>>>>> to printed fonts, especially for books from the 19th century.
>>>>>> I belong to a group that edits EPUBs of 19th-century books.
>>>>>> Such books come in collections, and the collections were 
>>>>>> often printed on the same machine.
>>>>>>
>>>>>> So instead of creating a library of 'Century old school' fonts, I am 
>>>>>> exploring the idea of creating a font dedicated to a publisher for a 
>>>>>> given period, 
>>>>>> e.g. 'EFlammarion1870.ttf', to be used on these books.
>>>>>>
>>>>>> I have plenty of scripts to automatically generate a traineddata 
>>>>>> file, starting from a directory containing img.tif files and 
>>>>>> their img.box files.
>>>>>> But it is very time-consuming to generate every one of these box files.
>>>>>>
>>>>>> The idea is to start from a set of scanned images and grab a certified 
>>>>>> text from a site like Gutenberg (for French, ebooksgratuits.com 
>>>>>> provides more books).
>>>>>> Searching the certified text for the first 3 words of a page locates 
>>>>>> the needed certified transcription.
>>>>>>
>>>>>> So I am now looking for a method to transform the certified 
>>>>>> text into a box file, 
>>>>>> doing this for a few pages in order to quickly generate a new 
>>>>>> traineddata and test it.
>>>>>> In this respect, JTessBoxEditor is very good, but the process of 
>>>>>> generating the box file is clearly too slow and error-prone.
>>>>>>
>>>>>>
>>>>>> Here is a page extracted from "La Maison Nucingen" whose print is 
>>>>>> quite bad, so it is interesting.
>>>>>>
>>>>>> http://gallica.bnf.fr/ark:/12148/bpt6k58135211/f107.image.r=la%20maison%20nucingen.langEN
>>>>>>
>>>>>> <https://lh4.googleusercontent.com/-7xPLX_2HR54/UxtWUEx8nBI/AAAAAAAAAB4/ro0vwKP0Oh4/s1600/lm-110.tif>
>>>>>>
>>>>>> The text :
>>>>>> proposait d’opérer avec ses millions faits d’une
>>>>>> main de papier rose à l’aide d’une pierre litho-
>>>>>> graphique, de jolies petites actions à placer, pré-
>>>>>> cieusement conservées dans son cabinet. Les ac-
>>>>>> tions réelles allaient servir à fonder l’affaire,
>>>>>> acheter un magnifique hôtel et commencer les
>>>>>> opérations. Nucingen se trouvait encore des ac-
>>>>>> tions dans je ne sais quelles mines de plomb ar-
>>>>>> gentifère, dans des mines de houille et dans deux
>>>>>> canaux, actions bénéficiaires accordées pour la 
>>>>>> mise en scène de ces quatre entreprises en pleine
>>>>>> activité, supérieurement montées et en faveur, au
>>>>>> moyen du dividende pris sur le capital. Nucin-
>>>>>> gen pouvait compter sur un agio si les actions 
>>>>>> montaient, mais le baron le négligea dans ses 
>>>>>> calculs, il le laissait à fleur d’eau, sur la place, 
>>>>>> afin d’attirer les poissons ! Il avait donc massé 
>>>>>> ses valeurs, comme Napoléon massait ses trou-
>>>>>> piers, afin de liquider durant la crise qui se des-
>>>>>> sinait et qui révolutionna, en 26 et 27 les places 
>>>>>> européennes. S’il avait eu son prince de Wagram, 
>>>>>> il aurait pu dire comme Napoléon du haut du 
>>>>>> Santon : « Examinez bien la place, tel jour, à telle 
>>>>>> heure, il y aura là des fonds répandus ! » Mais à 
>>>>>> qui pouvait-il se confier ? Du Tillet ne soupçonna 
>>>>>>
>>>>>>
>>>>>>  
>>>>>>   
>>>>>> -- 
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To post to this group, send email to [email protected] 
>>>>>>
>>>>>> To unsubscribe from this group, send email to
>>>>>> [email protected] 
>>>>>>
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>
>>>>>> --- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected]. 
>>>>>>
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>
>>>
