Tom Gillespie <tgb...@gmail.com> writes:

> The way I have implemented this is by maintaining an explicit list of
> characters that are safe for pre markup and another for post markup.
>
> It is not possible to use unicode punctuation for this because there
> are a variety of punctuation marks that cannot appear in that position
> and be considered markup, those include @, #, % to name just a few.

Not that bad. The Unicode standard defines the following general
categories (I listed those that might be of use):

Pc = Punctuation, connector
Pd = Punctuation, dash
Ps = Punctuation, open
Pe = Punctuation, close
Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po = Punctuation, other
Zs = Separator, space
Zl = Separator, line
Zp = Separator, paragraph

We currently use the following:

PRE  = <bol> <space> - ( ' " {
POST = <space> - . ; : ! ? ' " ) } \ [

At least ( and { have

(get-char-code-property ?{ 'general-category) ;=> Ps (punctuation, open)

We could probably generalize to

PRE  = Zs Zl Pc Pd Ps Pi ' "
POST = Zs Zl Pc Pd Pe Pf . ; : ! ? ' " \ [

Though we need to take care to exclude zero-width spaces.

I can find https://www.unicode.org/review/pr-23.html, which defines
punctuation terminals like .;:!?
It looks like it has been adopted, via special properties:
https://www.unicode.org/reports/tr44/#STerm and
https://www.unicode.org/reports/tr44/#Terminal_Punctuation
Emacs does not support them though (yet?).

> Therefore, if we want to do this we commit to extending and then
> maintaining the lists of valid pre and post markup delimiters as
> special cases.

We certainly do not want to do this. It is out of scope for Org when
Unicode can be of use.

> Note also this could produce changes from current behavior because
> things that previously tokenized as a series of words connected by
> e.g. underscores could become markup.

Indeed. And we should study the feedback. However, most of the scenarios
that change will involve non-standard Unicode markup characters. The
odds are low that users will put such Unicode characters at a markup
boundary and _also expect markup to be ignored_. In the end, it is the
current ASCII limitation plus a partially arbitrary choice of boundaries
that keeps some users confused (we get bug reports about confusing
markup from time to time).

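To make the proposed generalization concrete, here is a rough sketch of the category-based boundary test, together with a helper that lists characters whose behaviour would change. It is written in Python with the standard unicodedata module purely for illustration; it is not Org code. All function and variable names are hypothetical, <bol> is left out, and the current <space> is approximated with str.isspace().

```python
import unicodedata

# Current ASCII-only boundary characters, as listed above.
# <bol> is not modelled; <space> is approximated by str.isspace() (assumption).
CURRENT_PRE = set("-('\"{")
CURRENT_POST = set("-.;:!?'\")}\\[")

# Proposed generalization: Unicode general categories plus ASCII extras.
PRE_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Ps", "Pi"}
POST_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Pe", "Pf"}
PRE_EXTRA = set("'\"")
POST_EXTRA = set(".;:!?'\"\\[")


def is_current_pre(ch):
    """True if CH may precede markup under the current ASCII rules."""
    return ch.isspace() or ch in CURRENT_PRE


def is_pre(ch):
    """True if CH may precede markup under the proposed rules."""
    return ch in PRE_EXTRA or unicodedata.category(ch) in PRE_CATEGORIES


def is_post(ch):
    """True if CH may follow markup under the proposed rules."""
    return ch in POST_EXTRA or unicodedata.category(ch) in POST_CATEGORIES


def newly_accepted_pre(chars):
    """Characters accepted before markup by the proposal but not today.

    These are the candidates a linter would want to warn about.
    """
    return [c for c in chars if is_pre(c) and not is_current_pre(c)]


# "_" (Pc) and U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK (Pi)
# become valid; "(" is already valid today; "@" (Po) stays excluded.
print(newly_accepted_pre("_\u00ab(@"))
```

With this sketch, @, # and % remain excluded because they are all Po, while the underscore (Pc) becomes a valid boundary, which is exactly the behaviour change mentioned in the quoted message. Note that in Python's Unicode database U+200B ZERO WIDTH SPACE has category Cf rather than Zs, so the category sets above do not match it; whether the same holds for the Emacs tables would need to be checked.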
Of course, we can, as usual, provide a linter to catch such scenarios
and warn in ORG_NEWS.

I do believe that better Unicode support will benefit many Org users
who use non-Latin scripts.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>