On 29/06/2021 10:47, James Harkins wrote:
So, it would make sense to add a rule to the exporter: if one of the
characters before or after a source-text line break is a Chinese,
Japanese or Korean character, do not add a space.

On 29/06/2021 11:43, tumashu wrote:
You can try the config below :-)
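     ;; `text' is assumed to be bound by the enclosing function,
     ;; e.g. an export filter that receives the exported text.
     (let ((regexp "[[:multibyte:]]")
           (string text))
       ;; Drop the line break (and surrounding spaces) between two
       ;; multibyte characters instead of turning it into a space.
       (setq string
             (replace-regexp-in-string
              (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
              "\\1\\2" string))
       string)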

Note that [[:multibyte:]] matches characters of almost any non-ASCII script, e.g. Cyrillic:

(let ((sample "abc абв def"))
  (and (string-match "[[:multibyte:]]+" sample)
       (match-string 0 sample)))
"абв"

It seems that `org-fill-paragraph' (M-q) is smart enough to avoid inserting a space before or after a CJK character, so it should be possible to determine the correct way to splice lines, even though e.g. the "Script" Unicode property is not exposed to Elisp: https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html (Anyway, maintaining an explicit list of scripts is not a straightforward approach.)

P.S.
JavaScript in browsers allows filtering characters that belong to a particular script:

"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]
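
For illustration only, the rule proposed at the top of the thread (no space when a CJK character sits on either side of a source line break) could be sketched with the same property escapes. The helper name `joinLines` is hypothetical, and treating "CJK" as the Han, Hiragana, Katakana and Hangul scripts is an assumption; CJK punctuation such as "。" belongs to the Common script and is not covered. The sketch expects a single paragraph without blank lines.

```javascript
// Hypothetical helper, not part of Org or Emacs.
// Assumption: "CJK" is approximated by these four Unicode scripts.
const CJK =
  "[\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}]";

// A line break (with any spaces around it) that directly follows
// or directly precedes a CJK character.
const breakNextToCJK = new RegExp(
  `(?<=${CJK}) *\\n *| *\\n *(?=${CJK})`,
  "gu"
);

function joinLines(text) {
  return text
    .replace(breakNextToCJK, "") // splice without a space next to CJK
    .replace(/ *\n */g, " ");    // every other line break becomes one space
}

console.log(joinLines("abc\ndef"));     // abc def
console.log(joinLines("中文\n字符"));   // 中文字符
console.log(joinLines("абв\nгде"));     // абв где
```

Unlike [[:multibyte:]], this keeps the space between two Cyrillic lines, because only the listed scripts suppress it.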

I have not found such a feature in the regular expressions available in Emacs.

