Quote/Cytat - Bryan Tarpley <[email protected]> (Tue 10 Dec 2013
08:35:00 PM CET):
Nick,
No--the example I provided was from a footnote. I'm sure you're right that
the original printer used "ligatures" in the sense that two or more
characters were present on the same plate (Forgive me for not knowing book
history terminology! We work with folks at the Cushing library here who
are book history scholars, and they fill that knowledge gap for people like
me :) ).
The terminology is indeed strange and confusing: the usual term is "type" or "sort".
The problem is that these custom "ligatures" are not available as
single characters in Unicode.
So what? As I've already mentioned, you can assign code points from
the Unicode Private Use Area. This is actually what the Medieval Unicode
Font Initiative is doing.
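A minimal sketch of the idea in Python, in case it helps. The ligature
list and the code points chosen are purely illustrative assumptions (they
are not MUFI's actual assignments), and the helper names assign_pua,
encode and decode are made up for this example:

```python
# Basic Multilingual Plane Private Use Area: U+E000..U+F8FF.
PUA_START = 0xE000
PUA_END = 0xF8FF


def assign_pua(ligatures, start=PUA_START):
    """Give each ligature (a multi-character string) its own PUA character.

    The assignment is arbitrary and local to this project; it is NOT a
    standard mapping.
    """
    mapping = {}
    for offset, ligature in enumerate(ligatures):
        code_point = start + offset
        if code_point > PUA_END:
            raise ValueError("ran out of Private Use Area code points")
        mapping[ligature] = chr(code_point)
    return mapping


def encode(text, mapping):
    """Replace ligature sequences with their PUA characters (longest first)."""
    for ligature in sorted(mapping, key=len, reverse=True):
        text = text.replace(ligature, mapping[ligature])
    return text


def decode(text, mapping):
    """Expand PUA characters back into ordinary character sequences."""
    for ligature, pua_char in mapping.items():
        text = text.replace(pua_char, ligature)
    return text


# Hypothetical ligatures that have no code point of their own.
mapping = assign_pua(["ct", "sh"])
encoded = encode("actor", mapping)
print([f"U+{ord(c):04X}" for c in encoded])
print(decode(encoded, mapping))
```

The training transcription would then use the single PUA character for
each ligature, and the recognized text can be expanded back to ordinary
characters afterwards.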
We originally tried to place multiple
characters in single "boxes" to train Tesseract. The results for us were
poor. While you may put more than one character per line in a Tesseract
box file, you cannot use more than one character at a time in the
unicharambigs file, for instance (Google claims you can but you can't--it's
a bug).
I don't think you would have this problem with PUA characters.
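To illustrate, here is a rough Python sketch of a box file entry for such
a ligature, assuming the usual one-glyph-per-line box layout (character,
left, bottom, right, top, page). The code point U+E000 and the coordinates
are invented for the example, and format_box_line is a hypothetical helper:

```python
def format_box_line(glyph, left, bottom, right, top, page=0):
    """Return one box file line for a single-character glyph."""
    if len(glyph) != 1:
        raise ValueError("expected exactly one character per box")
    return f"{glyph} {left} {bottom} {right} {top} {page}"


# U+E000 stands in for a ligature that has no code point of its own.
line = format_box_line("\ue000", 10, 20, 30, 40)
print(line.encode("utf-8"))
```

Since the ligature is now a single character, each box holds exactly one
character as far as the file format is concerned.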
We made a decision to treat most "ligatures" as separate
characters, and while we're still amassing testing data, the results are
better. Granted, certain ligatures like the "fl" or "sl" have their own
Unicode code points, so we use those.
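For reference, a short Python sketch using the standard unicodedata module
to look at a few ligatures that do have their own code points and at what
they expand to under NFKC normalization. The choice of these three
characters is just an illustration:

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, NFKC expansion) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )


# Latin ligatures from the Alphabetic Presentation Forms block.
for ch in ("\ufb02", "\ufb05", "\ufb06"):
    print(*describe(ch))
```

The NFKC expansion is handy when comparing OCR output that contains such
ligatures against a ground truth typed with ordinary letters.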
With Franken+, using polygons to bound those characters that normally
overlap with others has allowed us to snip them out of context and
reproduce synthetic tiff images where they do not overlap. These synthetic
images (where each of the characters is pristine and none overlap) are
what we're using to train Tesseract.
In other words, you train Tesseract on different character shapes than
those actually occurring in the texts.
Best regards
Janusz
--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en