Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

Bryan Tarpley Tue, 10 Dec 2013 12:29:07 -0800

Janusz,

I'm going to try to interpret your comments as constructive criticism :)

We tried using MUFI.  It simply has no Unicode code point
for "ke," for example (we looked:
http://www.ub.uib.no/elpub/2003/r/000001/MUFI-standard-1.0.pdf).  I
strongly disagree that we're training on different character shapes than
those occurring in the texts.  We're actually cutting out images of the
characters themselves and training on those.  What you are saying is that
we should not treat them as separate entities, that we should value
typographical faithfulness over readability in our OCR.  You seem to be
advocating a kind of purity or exact consistency with the original
typesetting that is not the immediate goal of the eMOP project.  Our
ultimate concern is to make these texts searchable for early modern
scholars--not to produce 100% typographically faithful textual simulacra.
We believe this caliber of work (the production of scholarly digital
editions) is best left to textual scholars, not machines.  How is a scholar
supposed to search for instances of the word "turkey" if there is no
Unicode value they could type on the keyboard (or even copy and paste
from the character map) for "ke"?  There are great initiatives like the
TCP that are more interested in the kind of digitization you seem to be
advocating.
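
To make the search concern concrete, here is a minimal sketch (not eMOP or Franken+ code) of what a search front end would have to do if ligatures were stored as single code points.  The PUA code point U+E000 for "ke" and the names PUA_TO_PLAIN and fold_for_search are assumptions made up for illustration; only unicodedata.normalize is a real library call.

```python
import unicodedata

# Hypothetical PUA assignment for a "ke" ligature. U+E000 is an arbitrary
# choice for illustration; it is not taken from MUFI or from eMOP.
PUA_TO_PLAIN = {"\ue000": "ke"}


def fold_for_search(text: str) -> str:
    """Return a search-friendly form of OCR output."""
    # NFKC expands standard compatibility ligatures such as U+FB02 ("fl")
    # into their component letters.
    folded = unicodedata.normalize("NFKC", text)
    # PUA code points have no standard decomposition, so each one needs
    # an explicit mapping maintained by the project.
    for pua, plain in PUA_TO_PLAIN.items():
        folded = folded.replace(pua, plain)
    return folded.casefold()


ocr_line = "A tur\ue000y and a \ufb02ock"
print("turkey" in fold_for_search(ocr_line))
```

Without the explicit mapping step, a query for "turkey" would not match the line, because normalization leaves private-use code points untouched.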

Best,
Bryan

On Tue, Dec 10, 2013 at 1:49 PM, Janusz S. Bien <[email protected]> wrote:

> Quote/Cytat - Bryan Tarpley <[email protected]> (Tue 10 Dec 2013
> 08:35:00 PM CET):
>
>
>  Nick,
>>
>> No--the example I provided was from a footnote.  I'm sure you're right
>> that
>> the original printer used "ligatures" in the sense that two or more
>> characters were present on the same plate (Forgive me for not knowing book
>> history terminology!  We work with folks at the Cushing library here who
>> are book history scholars, and they fill that knowledge gap for people
>> like
>> me :) ).
>>
>
> The terminology is strange and confusing: "type" or "sort".
>
>
>  The problem is that these custom "ligatures" are not available as
>> single characters in unicode.
>>
>
> So what? As I've already mentioned, you can assign code points from
> Unicode Private Use Area. This is actually what Medieval Unicode Font
> Initiative is doing.
>
>
>  We originally tried to place multiple
>> characters in single "boxes" to train Tesseract.  The results for us were
>> poor.  While you may put more than one character per line in a Tesseract
>> box file, you cannot use more than one character at a time in the
>> unicharambigs file, for instance (Google claims you can but you
>> can't--it's
>> a bug).
>>
>
> I don't think you would have this problem with PUA characters.
>
>
>  We made a decision to treat most "ligatures" as separate
>> characters, and while we're still amassing testing data, the results are
>> better.  Granted, for certain ligatures like the "fl" or "sl," they have
>> unicode values, so we use those.
>>
>> With Franken+, using polygons to bound those characters that normally
>> overlap with others has allowed us to snip them out of context and
>> reproduce synthetic tiff images where they do not overlap.  These
>> synthetic
>> images (where each of the characters are pristine and none overlap) are
>> what we're using to train Tesseract.
>>
>
> In other words, you train Tesseract on different character shapes than
> those actually occurring in texts.
>
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> [email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/
>

-- 
Bryan Tarpley
Graduate Research Assistant
Texas A&M | IDHMC
LAAH 439
[email protected]
