On 23/2/16 10:37, Zdenek Wagner wrote:
Hi Jonathan,

how do you put the ActualText to PDF? Is it per syllable, or per word?

Per word.

We have commercial OCR software that can convert a scanned PDF to pages
with selectable text. I have not examined it thoroughly, but it seems to
me that it analyzes the scanned image, splits it into subimages "per word",
and attaches ActualText to each word. That way it is impossible to
select just a group of characters; the smallest entity that can be
copied & pasted (or searched for) is a word. It might fix the
highlighting problem, but I am just guessing.

I don't think so. Even single-syllable words like भी don't highlight well in the example.

(FWIW, it is possible to search for a substring within a word, and Acrobat finds it OK, but it can't accurately highlight what's been found; you get the same (inaccurate) highlighting of the word regardless of what substring within it was searched.)

Setting ActualText per syllable would make finer-grained copy/paste possible (currently, entire words are always copied), but would be significantly more complex to implement (as well as adding to the PDF file bloat). I think the per-word version should be a useful start, at least.
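
To make the per-word idea concrete, here is a minimal sketch of what such tagging could look like in a PDF content stream. It assumes the ActualText is attached to a marked-content Span (BDC ... EMC) as a UTF-16BE hex string with a leading FEFF byte-order mark, which the PDF specification allows; it is not necessarily byte-for-byte what XeTeX writes. The helper names and the glyph operators are made up for illustration.

```python
def actualtext_hex(text: str) -> str:
    """Return text as a PDF hex string: UTF-16BE with a leading FEFF byte-order mark."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def tag_span(text: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a marked-content Span carrying ActualText."""
    return f"/Span << /ActualText {actualtext_hex(text)} >> BDC {show_ops} EMC"


def tag_line(words, show_ops_per_word):
    """Per-word tagging: one Span for each word on the line."""
    return "\n".join(tag_span(w, ops) for w, ops in zip(words, show_ops_per_word))


# Hypothetical glyph operators; real ones come from the font's shaping output.
words = ["भी", "हिन्दी"]
ops = ["<00A10057> Tj", "<01B2004C00E3> Tj"]
print(tag_line(words, ops))
```

A per-syllable variant would call tag_span once per syllable cluster instead of once per word, which means the glyph run for each word would first have to be split at cluster boundaries; that is where the extra complexity (and the extra Span overhead) would come from.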



Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

2016-02-23 11:06 GMT+01:00 Jonathan Kew <jfkth...@gmail.com
<mailto:jfkth...@gmail.com>>:

    On 23/2/16 02:54, Andrew Cunningham wrote:

        It would probably more than double. I was under the impression that
        ActualText was a tag attribute, so extensive tagging would be
        needed, with the actual text added to the tags.


    The ActualText tagging is highly compressible, so in practice the
    increase in overall PDF size is not all that great.
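
The compressibility point can be illustrated with a toy experiment. The sketch below builds a fake content stream of 500 made-up "words", once plain and once with a Span/ActualText wrapper around each word, and compares the size overhead before and after zlib (Flate) compression. Everything in it is an assumption for illustration: the random code points, the stand-in glyph IDs, and the exact wrapper layout; it says nothing about the actual sample files.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the toy stream is reproducible


def fake_word():
    """A made-up 'word': 2-6 code points from the Devanagari block."""
    return [rng.randint(0x0905, 0x0939) for _ in range(rng.randint(2, 6))]


def hex16(values):
    """Format integers as concatenated 4-digit uppercase hex."""
    return "".join(f"{v:04X}" for v in values)


plain_lines, tagged_lines = [], []
for _ in range(500):
    cps = fake_word()
    # Stand-in glyph IDs (hypothetical); a real font maps clusters to arbitrary IDs.
    glyphs = [(cp * 7 + 3) % 0x10000 for cp in cps]
    show = f"<{hex16(glyphs)}> Tj"
    plain_lines.append(show)
    tagged_lines.append(f"/Span << /ActualText <FEFF{hex16(cps)}> >> BDC {show} EMC")

plain = "\n".join(plain_lines).encode("ascii")
tagged = "\n".join(tagged_lines).encode("ascii")

# Extra bytes caused by tagging, before and after Flate compression.
raw_overhead = len(tagged) - len(plain)
compressed_overhead = len(zlib.compress(tagged)) - len(zlib.compress(plain))
print(raw_overhead, compressed_overhead)
```

The wrapper text is identical for every word, so the compressor mostly has to store only the hex payload; the compressed overhead should come out well below the raw overhead.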


        But the question is how to practically make use of ActualText if
        there is a visible text layer.

        PDF/UA, for instance, leaves the question deliberately ambiguous.
        ActualText is the way to make the content accessible, but developers
        creating tools for PDF do not actually have to process the
        ActualText.

        So to index and search PDF files you need to build a discovery
        system utilising tools that allow you to specify the use of
        ActualText in preference to a visible text layer.


    Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
    results in the correct Unicode text (more or less), and Find behaves
    as expected.

    Other PDF readers (such as Apple's Preview) may well ignore the
    ActualText tagging, in which case it doesn't help. I don't know
    whether tools like Evince or Okular handle it....


    I'm attaching two sample PDFs with a simple chunk of Hindi text
    (from the Unicode web site). The first, dev-old.pdf, is what XeTeX
    currently generates (using the "Annapurna SIL" OpenType font). In
    general, Copy/Paste and text search don't work very well -- a few
    characters may be OK, but others are junk.

    The second sample, dev-actualtext.pdf, was generated with an
    experimental new \XeTeXgenerateactualtext feature, which
    automatically "tags" each word with an ActualText representation.

    Some points to note:

    - The file size is 24662 bytes, while dev-old was 22875 bytes (an
    increase of 1787 bytes, about 7.8%). Not too bad. Of course, a lot
    of that is the embedded font data; with longer documents that have
    lots of text but only a few fonts, the relative difference would
    presumably be somewhat greater.

    - Copy/Paste and Search work pretty well in Acrobat Reader. Not in
    Preview.app.

    - Highlighting of selected text (in Acrobat Reader) is somewhat
    broken, apparently due to the ActualText tagging (it looks better in
    dev-old). This may be fixable by tweaking exactly how the tagging is
    written into the PDF; I haven't investigated it further.


    No guarantees at this point as to whether/when this feature will
    actually be available. It was just a quick attempt to hack something
    up, to see how promising the results might be...

    JK




    --------------------------------------------------
    Subscriptions, Archive, and List information, etc.:
    http://tug.org/mailman/listinfo/xetex





