Hi Andy,
On 08/06/2010, at 11:39 AM, Andy Lin wrote:
It seems I misunderstood what exactly the TECkit mapping does. All it
does is change the input as instructed. All the other "features" --
copy/paste and search compatibility -- that I'd attributed to
TECkit are actually those of the PDF reader (in my case, Adobe Reader).
So, when Adobe Reader encounters the f-ligature, it knows to treat it
as 'f' and another character; they have specific Unicode code points
and thus any program can decompose them if they need to. However, the
'ch' and 'Th' ligatures in Linux Libertine are in the Private Use
Area, which is, by definition, non-standard, so they cannot be
anticipated by a PDF reader.
Yes, that is true.
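The difference shows up in the Unicode character data itself. As a minimal Python sketch (the helper name `describe` is only illustrative, and whether any particular PDF reader works this way internally is an assumption), the standard f-ligatures carry a compatibility decomposition, while a Private Use code point carries nothing:

```python
import unicodedata

def describe(ch):
    """Summarise what Unicode itself says about a single character."""
    code_point = f"U+{ord(ch):04X}"
    if unicodedata.category(ch) == "Co":  # "Co" = private use
        return f"{code_point}: private use, no standard meaning"
    # NFKD applies the compatibility decomposition, if the character has one.
    letters = unicodedata.normalize("NFKD", ch)
    return f"{code_point}: {unicodedata.name(ch)} -> {letters}"

# U+FB01 and U+FB04 are the standard fi and ffl ligatures;
# U+E03B is where Linux Libertine keeps its 'ch' ligature.
for ch in "\uFB01\uFB04\uE03B":
    print(describe(ch))
```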
However, PDF has two separate mechanisms to overcome this.
1. a CMap resource for the font
2. the /ActualText tagging construction
Now, I'm assuming it's possible to make these ligatures
copy/paste/search-able, just as it's possible to make small caps
searchable (although Charis SIL is the only one I've found that's managed
it), but TECkit is not the way to do it. All TECkit does is take the
input, modify it based on the mapping, and pass the result to the
font/type engine without any additional information.
That seems to be accurate.
The reason why the TECkit mapping worked for the fonts I mentioned in
my previous post is because they had the ligatures at both the
standard Unicode codepoint and in the PUA, but for whatever reason,
had their ligature tables point to the PUA glyph. At least, I think
that's what was happening.
Concerning method 1. CMap resources:
With "Linux Libertine O" a CMap is created on-the-fly, using the
characters that are used in the document.
e.g. for the following text:
"Play in the field; riffle the deck."
(11 letters + 3 ligatures + 2 punctuation marks)
the CMap generated with a XeTeX run is:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /LinLibertineO/H/65536/0,000-UTF16 def
/CMapType 2 def
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
17 beginbfchar
<000F> <002E>
<0012> <0031>
<001C> <003B>
<0031> <0050>
<0042> <0061>
<0045> <0064>
<0046> <0065>
<0049> <0068>
<004A> <0069>
<004D> <006C>
<004F> <006E>
<0053> <0072>
<0055> <0074>
<005A> <0079>
<08A2> <E03A>
<0977> <FB01>
<097A> <FB04>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
Beware that '1' also occurs as the page number.
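See how the 3 ligature glyphs are each mapped to a single code point:
one in the PUA (E03A) and two in the Alphabetic Presentation Forms
block (FB01, FB04).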
<08A2> <E03A>
<0977> <FB01>
<097A> <FB04>
For searchability, these really should be:
<08A2> <0063006B>
<0977> <00660069>
<097A> <00660066006C>
for the ck, fi and ffl ligatures respectively.
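Those replacement values are simply the UTF-16BE encoding of the letters each ligature stands for. A minimal Python sketch that derives them (the helper name `to_unicode_hex` is an illustrative assumption, not part of XeTeX or xdvipdfmx; the glyph IDs are copied from the CMap above):

```python
def to_unicode_hex(text):
    """Return the UTF-16BE code units of `text` as uppercase hex digits."""
    return text.encode("utf-16-be").hex().upper()

# Glyph ID in the subsetted font -> letters the ligature should map back to.
ligatures = {"08A2": "ck", "0977": "fi", "097A": "ffl"}

for glyph_id, letters in ligatures.items():
    print(f"<{glyph_id}> <{to_unicode_hex(letters)}>")
```

The same helper yields the hex strings used as /ActualText values in the example document further down.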
I don't know where that CMap resource is being constructed.
Presumably it is by xdvipdfmx as it subsets the font
for inclusion. Presumably it is getting information from
the complete font itself.
Is there a way to override some entries and get those
ligatures pointing to letter combinations?
Again, I don't know. Maybe someone else can comment.
Concerning method 2. /ActualText tagging:
Here is an example document that demonstrates how it
a. does work with pdfTeX
but
b. produces broken PDFs with XeTeX + xdvipdfmx.
\documentclass[11pt]{article}
\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
\geometry{letterpaper} % ... or a4paper or a5paper or ...
\usepackage{ifxetex,ifpdf}
\ifxetex
\usepackage{xltxtra}
%\setmainfont{Charis SIL}
\setmainfont{Linux Libertine O}
\newcommand{\XetexActualText}[2]{%
\special{pdf:literal BT /Span <</ActualText<#2>>> BDC}#1\special
{pdf:literal EMC ET}}
\newcommand{\FFI}{\XetexActualText{ffi}{006600660069}}
\newcommand{\FF}{\XetexActualText{ff}{00660066}}
\newcommand{\FI}{\XetexActualText{fi}{00660069}}
\newcommand{\FFL}{\XetexActualText{ffl}{00660066006c}}
\newcommand{\FL}{\XetexActualText{fl}{0066006c}}
\newcommand{\CK}{\XetexActualText{ck}{0063006b}}
\fi
\ifpdf
\pdfcompresslevel 0
\newcommand{\PDFTeXActualText}[2]{%
\pdfliteral direct {/Span<</ActualText<#2>>> BDC}#1\pdfliteral
direct {EMC}}
\newcommand{\FFI}{\PDFTeXActualText{ffi}{006600660069}}
\newcommand{\FF}{\PDFTeXActualText{ff}{00660066}}
\newcommand{\FI}{\PDFTeXActualText{fi}{00660069}}
\newcommand{\FFL}{\PDFTeXActualText{ffl}{00660066006C}}
\newcommand{\FL}{\PDFTeXActualText{fl}{0066006C}}
\newcommand{\CK}{\PDFTeXActualText{ck}{0063006b}}
\fi
\begin{document}
Play in the {\FI}eld; ri{\FFL}e the de\CK.
Play in the field; riffle the deck.
\end{document}
When processed by XeTeX this file produces a PDF that is readable
in Apple's Preview as well as in Adobe Reader and Acrobat Pro.
However, Acrobat Pro reports the content stream to be malformed.
The stream looks as follows:
stream
q 1 0 0 1 72 720 cm 0 G 0 g BT /F1 10.909 Tf 36.74 -34 Td
[<0031>-11<004d0042005a>-250<004a004f>-250<005500490046>]TJ ET 1 0 0
1 86.24 -34 cm BT /Span <</ActualText<00660069>>> BDC 1 0 0 1 -86.24
34 cm BT /F1 10.909 Tf 86.24 -34 Td[<0977>]TJ ET 1 0 0 1 92.2 -34 cm
EMC ET 1 0 0 1 -92.2 34 cm BT /F1 10.909 Tf 92.2 -34 Td
[<0046004d0045001c>-249<0053004a>]TJ ET 1 0 0 1 117.09 -34 cm BT
/Span <</ActualText<00660066006c>>> BDC 1 0 0 1 -117.09 34 cm BT /F1
10.909 Tf 117.09 -34 Td[<097a>]TJ ET 1 0 0 1 125.65 -34 cm EMC ET 1 0
0 1 -125.65 34 cm BT /F1 1 ...
Note how the operators come out as
... BT ... ET ... BT /Span ... BDC ... BT ... ET ... EMC ET ...
so that a second BT is opened inside a text object that is still open,
when it really should be nested like:
... BT ... /Span ... BDC ... EMC ... ET ...
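This kind of nesting can be checked mechanically. Here is a rough Python sketch, assuming an uncompressed content stream and looking only at the four operators involved (BT, ET, BDC, EMC); it is not a real PDF parser, and the function name and the abbreviated sample streams are illustrative only:

```python
import re

# Which opening operator each closing operator has to match.
CLOSES = {"ET": "BT", "EMC": "BDC"}

def nesting_problems(stream):
    """List nesting errors among BT/ET and BDC/EMC in a content stream."""
    # Drop hex strings such as <0977> so they are never read as operators.
    stream = re.sub(r"<[0-9A-Fa-f]+>", " ", stream)
    stack = []
    problems = []
    for op in re.findall(r"\b(?:BT|ET|BDC|EMC)\b", stream):
        if op in CLOSES:
            if stack and stack[-1] == CLOSES[op]:
                stack.pop()
            else:
                problems.append(f"{op} without a matching {CLOSES[op]}")
        else:
            if op == "BT" and "BT" in stack:
                problems.append("BT inside an already open text object")
            stack.append(op)
    problems.extend(f"{op} never closed" for op in stack)
    return problems

# Abbreviated from the XeTeX output shown above.
broken = ("BT /F1 10.909 Tf [<0031>]TJ ET BT /Span <</ActualText<00660069>>> BDC "
          "BT /F1 10.909 Tf [<0977>]TJ ET EMC ET")
# The same content with the marked-content span nested inside one text object.
nested = "BT /F1 10.909 Tf /Span <</ActualText<00660069>>> BDC [<0977>]TJ EMC ET"

print(nesting_problems(broken))
print(nesting_problems(nested))
```

The check reports the nested BT in the first stream and nothing in the second. It says nothing about the other operators in the stream, so an empty result is no guarantee that the stream is well formed.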
If a macro definition is changed to:
\newcommand{\XetexActualText}[2]{%
\special{pdf:literal /Span <</ActualText<#2>>> BDC}#1\special
{pdf:literal EMC}}
then the PDF content stream is still malformed, so much so that
Adobe software will not show anything, even though Apple software
does produce a display.
In neither case, using XeTeX, does Copy/Paste respect the
/ActualText.
So my conclusion is that xdvipdfmx does not provide a way to put
tagging directly into the content stream, which is what /ActualText
(and other forms of tagging) would require.
pdfTeX, on the other hand, does allow this to some extent.
That is, /ActualText works in some situations.
Other kinds of tagging are more delicate, requiring an especially
modified version of pdfTeX having extra primitives.
I gave a talk on this at the TUG 2009 meeting last year,
and will be giving another at TUG 2010 a few weeks from now.
If I am mistaken, please correct me.
You are not mistaken in that XeTeX cannot use /ActualText
at present --- unless there have been some recent developments
to XeTeX or xdvipdfmx of which I am not aware.
(That's quite possibly the case.)
You are mistaken in that what you want is certainly doable,
so far as the PDF specifications are concerned.
-Andy Lin
I had noticed that the ligatures 'ch' and 'Th' are not searchable in
Linux Libertine. I added the following mappings:
U+0063 U+0068 <> U+E03B ; ch -> ch ligature
U+0054 U+0068 <> U+E049 ; Th -> Th ligature
But these do not make it possible to search or copy/paste as
uncompiled.
The .tec file is compiled correctly and XeTeX finds it. Any thoughts?
Hope this helps,
Ross
------------------------------------------------------------------------
Ross Moore ross.mo...@mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------