Hi all,

several years ago I prepared some texts with pdflatex and the devnag package (XeTeX did not exist at that time); the results are still here: http://icebearsoft.euweb.cz/dvngpdf/
The situation in the Indic scripts is much more complex and cannot be solved by a ToUnicode map. Half-consonants can be mapped to a consonant followed by a virama. Conjuncts such as ksha can be mapped to ka + virama + sha. The problem is with reordering. I will give examples in Hindi only, because I do not know other Indic languages.

Take the word kitaab (= किताब, meaning a book). The correct character order is ka + i-matra + ta + aa-matra + ba, but in the visual representation the glyphs are ordered as i-matra + ka + ta + aa-matra + ba. You cannot blindly move the i-matra behind the following consonant. The word shakti (= शक्ति, force) is sha + ka + virama + ta + i-matra in character order but visually sha + i-matra + {kta-conjunct | half-ka + ta}, where the second form is usually preferred in present-day Hindi. Even stranger reorderings exist: marzii is ma + ra + virama + za + ii-matra in character order but visually ma + za + ii-matra + hook-repha.

The case of two-part vowels in some scripts is difficult too. You generally have the following scheme:

vowel-part-1 + consonant-group or conjunct + vowel-part-2

Both parts exist as separate glyphs mapped to other characters, so you must know whether a glyph represents a character on its own or whether two glyphs together compose a two-part vowel.

These are not things that can be solved by simple ToUnicode maps. On the other hand, it is not necessary to add ActualText to every word, but certainly to a great many words.

Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

2016-02-23 6:21 GMT+01:00 Andrew Cunningham <lang.supp...@gmail.com>:
> Simon,
>
> On 23 February 2016 at 14:12, Simon Cozens <si...@simon-cozens.org> wrote:
>
>> On 23/02/2016 13:54, Andrew Cunningham wrote:
>> > PDF/UA for instance leaves the question deliberately ambiguous.
>> > ActualText is the way to make the content accessible, but developers
>> > creating tools for PDF do not actually have to process the ActualText.
>>
>> Yeah. (Sorry to keep banging the drum but) I've just done some tests
>> with SILE, which includes some support for tagged/accessible PDFs. Even
>> when the ActualText includes the correct Devanagari, I am still seeing
>> the same problems with cut-and-paste. I'm not sure what needs to be done
>> to get it right.
>>
>>
> In terms of SILE ... supporting generation of other formats like XPS as an
> alternative to PDF is probably the only way forward for complex script
> languages.
>
> If SILE is tagging the PDFs and adding ActualText attributes, then it is
> doing everything it should be doing. The problems are with the PDF
> specification itself, what it was originally designed to be (a pre-print
> format based on the PostScript language) and the limitations placed on it
> by the developers of the spec.
>
> Andrew
>
>
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
> http://tug.org/mailman/listinfo/xetex
>
>
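
PS: To make the i-matra reordering described in this message concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not how any real shaping engine or PDF tool works: the function names are hypothetical, glyphs are stood in for by Unicode characters, a "cluster" is taken to be just consonant (+ virama + consonant)*, and nukta, repha and two-part vowels are ignored. It shows that a per-glyph swap recovers kitaab but fails on shakti, because the i-matra has to cross the whole cluster.

```python
VIRAMA = "\u094D"   # Devanagari sign virama
I_MATRA = "\u093F"  # Devanagari vowel sign i


def is_consonant(ch):
    """True for the basic Devanagari consonants KA..HA (nukta forms ignored)."""
    return "\u0915" <= ch <= "\u0939"


def logical_to_visual(text):
    """Simplified reordering: move each i-matra in front of its consonant cluster."""
    out = []
    cluster_start = None  # index in `out` where the current cluster begins
    for ch in text:
        if is_consonant(ch):
            # A consonant continues the cluster only right after a virama.
            if not (out and out[-1] == VIRAMA and cluster_start is not None):
                cluster_start = len(out)
            out.append(ch)
        elif ch == VIRAMA:
            out.append(ch)
        elif ch == I_MATRA and cluster_start is not None:
            out.insert(cluster_start, ch)
            cluster_start = None
        else:
            out.append(ch)
            cluster_start = None
    return "".join(out)


def naive_visual_to_logical(glyphs):
    """The 'blind' fix: swap every i-matra with the single glyph after it."""
    chars = list(glyphs)
    i = 0
    while i < len(chars) - 1:
        if chars[i] == I_MATRA:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


kitaab = "\u0915\u093F\u0924\u093E\u092C"  # ka + i-matra + ta + aa-matra + ba
shakti = "\u0936\u0915\u094D\u0924\u093F"  # sha + ka + virama + ta + i-matra

for word in (kitaab, shakti):
    visual = logical_to_visual(word)
    recovered = naive_visual_to_logical(visual)
    print(word, "round-trips with the naive swap:", recovered == word)
```

With these inputs the naive swap restores kitaab but not shakti, which is the reason a table-driven ToUnicode mapping is not enough and ActualText is needed for such words.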
--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex