On 23/2/16 10:37, Zdenek Wagner wrote:
Hi Jonathan,
how do you put the ActualText to PDF? Is it per syllable, or per word?
Per word.
We have commercial OCR software that can convert scanned PDFs to pages
with selectable text. I have not examined it thoroughly, but it seems to
me that it analyzes the scanned image, splits it into "per word"
subimages, and attaches ActualText to each word. That way it is
impossible to select just a group of characters; the smallest entity
that can be copied and pasted (or searched for) is a word. It might fix
the highlighting problem, but I am just guessing.
I don't think so. Even single-syllable words like भी don't highlight
well in the example.
(FWIW, it is possible to search for a substring within a word, and
Acrobat finds it OK, but it can't accurately highlight what's been
found; you get the same (inaccurate) highlighting of the word regardless
of what substring within it was searched.)
Setting ActualText per syllable would make finer-grained copy/paste
possible (currently, entire words are always copied), but would be
significantly more complex to implement (as well as adding to the PDF
file bloat). I think the per-word version should be a useful start, at
least.
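For readers unfamiliar with the mechanism, here is a minimal sketch of what per-word ActualText tagging can look like in a PDF content stream: a marked-content sequence (/Span ... BDC ... EMC) whose property dictionary carries the Unicode text as a UTF-16BE hex string with a byte-order mark. The helper names and the glyph-showing operator are hypothetical placeholders, and this is not necessarily the exact form XeTeX writes.

```python
def utf16be_hex(text: str) -> str:
    """Encode text as a PDF hex string body: UTF-16BE preceded by a BOM."""
    return (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()


def actualtext_span(word: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a /Span carrying ActualText.

    show_ops is a placeholder for whatever operators draw the word's
    glyphs; the glyph IDs used below are made up for illustration.
    """
    return f"/Span <</ActualText <{utf16be_hex(word)}>>> BDC\n{show_ops}\nEMC"


# One span per word, as in the per-word approach discussed above.
example = actualtext_span("\u092d\u0940", "<00120034> Tj")  # the word भी
print(example)
```

Per-syllable tagging would mean emitting one such span per syllable rather than per word, which is where the extra complexity and file size would come from.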
Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
2016-02-23 11:06 GMT+01:00 Jonathan Kew <jfkth...@gmail.com>:
On 23/2/16 02:54, Andrew Cunningham wrote:
It would probably more than double. I was under the impression that
ActualText was a tag attribute, so extensive tagging would be needed,
and actual text added to the tags.
The ActualText tagging is highly compressible, so in practice the
increase in overall PDF size is not all that great.
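A rough way to see why the tagging compresses well is to build a toy content stream with and without per-word spans and compare the sizes after Deflate. Everything here is an illustrative assumption: the word list, the placeholder glyph operators, and the function names are made up, and the repetitive sample text exaggerates the effect compared with a real document.

```python
import zlib


def utf16be_hex(text: str) -> str:
    """UTF-16BE with BOM, as uppercase hex digits."""
    return (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()


def content_stream(words, tagged: bool) -> bytes:
    """Build a toy content stream, optionally with per-word ActualText."""
    lines = []
    for i, word in enumerate(words):
        show = f"<{i % 97:04X}> Tj"  # placeholder glyph-showing operator
        if tagged:
            show = f"/Span <</ActualText <{utf16be_hex(word)}>>> BDC {show} EMC"
        lines.append(show)
    return "\n".join(lines).encode("ascii")


words = ["यह", "भी", "एक", "शब्द", "है"] * 200  # repetitive sample text
plain = content_stream(words, tagged=False)
tagged = content_stream(words, tagged=True)

raw_overhead = len(tagged) - len(plain)
packed_overhead = len(zlib.compress(tagged)) - len(zlib.compress(plain))
print("uncompressed overhead:", raw_overhead, "bytes")
print("compressed overhead:  ", packed_overhead, "bytes")
```

The fixed wrapper text repeats for every word and the hex digits use a small alphabet, so the compressed overhead should come out much smaller than the uncompressed one.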
But the question is how to practically make use of ActualText if there
is a visible text layer.
PDF/UA, for instance, leaves the question deliberately ambiguous.
ActualText is the way to make the content accessible, but developers
creating tools for PDF do not actually have to process the ActualText.
So to index and search PDF files you need to build a discovery system
utilising tools that allow you to specify the use of ActualText in
preference to a visible text layer.
Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
results in the correct Unicode text (more or less), and Find behaves
as expected.
Other PDF readers (such as Apple's Preview) may well ignore the
ActualText tagging, in which case it doesn't help. I don't know
whether tools like Evince or Okular handle it....
I'm attaching two sample PDFs with a simple chunk of Hindi text
(from the Unicode web site). The first, dev-old.pdf, is what XeTeX
currently generates (using the "Annapurna SIL" OpenType font). In
general, Copy/Paste and text search don't work very well -- a few
characters may be OK, but others are junk.
The second sample, dev-actualtext.pdf, was generated with an
experimental new \XeTeXgenerateactualtext feature, which
automatically "tags" each word with an ActualText representation.
Some points to note:
- The file size is 24662 bytes, while dev-old was 22875 bytes. Not
too bad. Of course, a lot of that is the embedded font data; with
longer documents that have lots of text but only a few fonts, the
difference would presumably be somewhat greater.
- Copy/Paste and Search work pretty well in Acrobat Reader. Not in
Preview.app.
- Highlighting of selected text (in Acrobat Reader) is somewhat
broken, apparently due to the ActualText tagging (it looks better in
dev-old). This may be fixable by tweaking exactly how the tagging is
written into the PDF; I haven't investigated it further.
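For reference, the size difference between the two samples works out as follows. This is plain arithmetic on the byte counts quoted above; the variable names are only for illustration.

```python
old_size = 22875  # dev-old.pdf, bytes
new_size = 24662  # dev-actualtext.pdf, bytes

overhead = new_size - old_size
percent = 100 * overhead / old_size
print(f"{overhead} bytes, {percent:.1f}% larger")  # 1787 bytes, 7.8% larger
```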
No guarantees at this point as to whether/when this feature will
actually be available. It was just a quick attempt to hack something
up, to see how promising the results might be...
JK
--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex