On 23/2/16 02:54, Andrew Cunningham wrote:
It would probably more than double; I was under the impression that ActualText was a tag attribute, so extensive tagging would be needed, with the actual text added to the tags.
The ActualText tagging is highly compressible, so in practice the increase in overall PDF size is not all that great.
But the question is how to practically make use of ActualText if there is a visible text layer. PDF/UA, for instance, leaves the question deliberately ambiguous. ActualText is the way to make the content accessible, but developers creating tools for PDF do not actually have to process the ActualText. So to index and search PDF files you need to build a discovery system utilising tools that allow you to specify the use of ActualText in preference to a visible text layer.
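To make "use ActualText in preference to the visible text layer" a bit more concrete, here is a rough Python sketch of what an indexing tool would have to do. It assumes an already-decompressed content stream in which each word is wrapped as `/Span <</ActualText <hex>>> BDC ... EMC` with the text given as a hex string; the function names and the sample stream (including its glyph IDs) are made up for illustration. Real PDFs can also carry ActualText as a literal string or on structure elements, which this toy does not handle.

```python
import re

# Matches one marked-content span that carries an ActualText hex string.
# Assumption: the property list is inline and contains only /ActualText.
SPAN = re.compile(
    rb"/Span\s*<<\s*/ActualText\s*<([0-9A-Fa-f\s]+)>\s*>>\s*BDC.*?EMC",
    re.S,
)

def decode_pdf_hex_text(hex_bytes: bytes) -> str:
    """Decode a PDF hex string into Python text."""
    digits = "".join(hex_bytes.decode("ascii").split())
    if len(digits) % 2:
        digits += "0"  # PDF treats a missing final hex digit as 0
    raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):
        # A leading FEFF byte-order mark signals UTF-16BE text.
        return raw[2:].decode("utf-16-be")
    # Rough stand-in for PDFDocEncoding, good enough for a sketch.
    return raw.decode("latin-1")

def actual_text_words(content_stream: bytes) -> list:
    """Return the ActualText of every tagged span, ignoring the glyph codes."""
    return [decode_pdf_hex_text(m.group(1)) for m in SPAN.finditer(content_stream)]

# Hypothetical content stream: the glyph IDs inside Tj are invented.
sample = (
    b"BT /F1 12 Tf "
    b"/Span <</ActualText <FEFF0939093F0928094D09260940>>> BDC "
    b"<00A5004B0031> Tj EMC ET"
)
words = actual_text_words(sample)
print(words)
```

A tool that only looks at the glyph codes inside `Tj` would have to map them back to Unicode through the font; reading the ActualText entry sidesteps that step entirely, which is why it matters whether a given reader or indexer honours it.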
Acrobat Reader uses it, if present, so that Copy/Paste from the PDF results in the correct Unicode text (more or less), and Find behaves as expected.
Other PDF readers (such as Apple's Preview) may well ignore the ActualText tagging, in which case it doesn't help. I don't know whether tools like Evince or Okular handle it....
I'm attaching two sample PDFs with a simple chunk of Hindi text (from the Unicode web site). The first, dev-old.pdf, is what XeTeX currently generates (using the "Annapurna SIL" OpenType font). In general, Copy/Paste and text search don't work very well -- a few characters may be OK, but others are junk.
The second sample, dev-actualtext.pdf, was generated with an experimental new \XeTeXgenerateactualtext feature, which automatically "tags" each word with an ActualText representation.
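For readers who haven't looked at what such tagging amounts to inside the PDF, the following Python sketch builds one tagged word by hand. It relies on the general PDF conventions (a marked-content sequence `BDC ... EMC` with an `/ActualText` property, and Unicode text strings written as UTF-16BE with a leading FEFF). The helper name and the glyph codes are hypothetical, and the exact bytes XeTeX's driver writes may differ.

```python
def actualtext_span(word: str, glyph_ops: str) -> str:
    """Wrap glyph-showing operators in a span that carries the word as ActualText."""
    # Unicode text strings in PDF are UTF-16BE preceded by the BOM U+FEFF;
    # here they are written in hex-string form.
    hex_text = ("\ufeff" + word).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{glyph_ops}\nEMC"

# The Hindi word "hindi" (U+0939 U+093F U+0928 U+094D U+0926 U+0940).
# The glyph IDs in the Tj operand are invented for illustration.
span = actualtext_span("\u0939\u093f\u0928\u094d\u0926\u0940", "<00A5004B0031> Tj")
print(span)
```

The point of the wrapper is that the glyph sequence after shaping (reordered vowel signs, conjunct glyphs and so on) no longer needs to be reverse-mapped to characters; the reader can take the Unicode text straight from the tag.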
Some points to note:
- The file size is 24662 bytes, while dev-old was 22875 bytes (about 8% larger). Not too bad. Of course, a lot of that is the embedded font data; with longer documents that have lots of text but only a few fonts, the relative difference would presumably be somewhat greater.
- Copy/Paste and Search work pretty well in Acrobat Reader. Not in Preview.app.
- Highlighting of selected text (in Acrobat Reader) is somewhat broken, apparently due to the ActualText tagging (it looks better in dev-old). This may be fixable by tweaking exactly how the tagging is written into the PDF; I haven't investigated it further.
No guarantees at this point as to whether/when this feature will actually be available. It was just a quick attempt to hack something up, to see how promising the results might be...
JK