Hi Egmont, hi all,
This is an interesting discussion, if only because I would have thought that the actual target audience has only minimal interest in supporting these scripts in a terminal, given the severe limitations of that environment. The most important limitation, it seems to me, is that a monospaced font must be used, which does not suit most scripts that need shaping. At the script level I am familiar with Arabic, Syriac and Mandaic (I don't actually speak any of these languages, so if you want a real expert, I am not that person). Monospaced Arabic struggles and is not very elegant. I have not seen solutions for monospaced Syriac or Mandaic, and I have trouble even imagining them.

OTOH, that inelegance can maybe be an excuse (or a guide, if you prefer) to make the implementation simpler in other respects, because expectations should be lower than for a graphical application.

Anyway, as a concrete addition to the discussion, I have a simple Arabic shaping solution for Emacs on the terminal, especially on the Linux console, and this discussion finally made me publish it on GitLab, see https://gitlab.com/cc_benny/termshape. GitLab CI/CD is activated, so (mostly) ready-made Emacs packages can be downloaded as build artifacts. If anybody wants to discuss this implementation, we should probably move that discussion somewhere else, for example to the Emacs mailing list (https://lists.gnu.org/mailman/listinfo/emacs-devel).

Some specific technical points from thinking about the problem on my side:

Presentation forms: Termshape uses the Arabic presentation forms available in Unicode, so it is somewhat limited, as mentioned by Eli. Given that we need to keep the implementation simple anyway, I am not sure that significantly more is really needed, at least given what Emacs provides already. Additional character forms could be added where the Unicode repertoire is not sufficient. This could use PUA characters or other means like terminal control sequences.
In both cases a common understanding, outside of Unicode, would be needed between the terminal (or the font used by it) and the application.

Ligatures: With most shaping, one character is transformed into a character form that still occupies exactly one cell. A ligature like lam-alif, OTOH, occupies only one cell for two characters, so for justification etc. the application has to know that the two characters together have a width of 1 on the screen. This is easier if the application does the selection of ligatures. If you want to do this in the terminal, the application would probably need some way to measure the display width of a string, so that it can handle the situation. Be prepared, though, for the application to make quite a lot of these requests. For my own main use case for Emacs on a terminal, display over SSH, that could become a problem.

Diacritics: The application can know what is a non-spacing character and what is not. So it can know that diacritics do not occupy their own cell, and it should be able to ignore whether the terminal supports a specific diacritic or not. If the terminal does not support a diacritic, the terminal can either just leave it out or mess up the display more or less irreparably. In the first case, the worst outcome is that the user does not see the character; in the second case, the application cannot do anything about it with reasonable effort, IMO.

A real problem is the combination of diacritics and ligatures. Any diacritic applies to only one character in the ligature, and between the application and the terminal it is currently not possible to determine which one. This is one area where an implementation in the terminal would clearly have the advantage. But a terminal control sequence could also help. IMO we are talking about a luxury problem here, though.
Do we want to set as our first goal showing complete Quranic verses in all their glory, or are we satisfied with everyday Arabic like, say, the website of a modern Arabic newspaper?

Thanks for your effort and for starting this discussion,
benny
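
P.S. To make the width bookkeeping for ligatures and diacritics a bit more concrete, here is a minimal sketch in Python. It is not taken from termshape; the names `cell_width`, `LAM` and `ALEFS` are made up for illustration. It assumes a terminal (or application) that always renders lam followed by an alif variant as a single one-cell ligature, treats non-spacing marks as zero width, and ignores wide characters and format characters such as ZWNJ entirely.

```python
import unicodedata

# Assumption: only lam + alif variants form a one-cell ligature.
LAM = "\u0644"
ALEFS = {"\u0622", "\u0623", "\u0625", "\u0627"}  # alif with madda/hamza, plain alif


def cell_width(text):
    """Estimate the number of terminal cells `text` occupies.

    Simplified model: non-spacing marks take no cell, and an alif that
    follows a lam (possibly with diacritics in between) shares the lam's cell.
    """
    cells = 0
    prev_base = None  # last spacing character that can still start a ligature
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            # Diacritic: no cell of its own. Note that we cannot express here
            # which half of a ligature it belongs to.
            continue
        if prev_base == LAM and ch in ALEFS:
            # The alif merges into the lam's cell; the ligature is complete.
            prev_base = None
            continue
        cells += 1
        prev_base = ch
    return cells


if __name__ == "__main__":
    print(cell_width("\u0644\u0627"))          # lam + alif
    print(cell_width("\u0644\u064e\u0627"))    # lam + fatha + alif
    print(cell_width("\u0633\u0644\u0627\u0645"))  # seen + lam + alif + meem
```

An application doing its own ligature selection can compute this locally. If the terminal picks the ligatures instead, the same number would have to come back from some kind of width query, which is where the round trips over SSH would start to hurt.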