On 29/06/2021 10:47, James Harkins wrote:
So, it would make sense to add a rule to the exporter: if one of the
characters before or after a source-text line break is a Chinese,
Japanese or Korean character, do not add a space.
On 29/06/2021 11:43, tumashu wrote:
You can try the below config :-)
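    (let ((regexp "[[:multibyte:]]")
          (string text))  ; `text' is assumed to be bound by the caller
      ;; Remove a line break (and surrounding spaces) between two
      ;; multibyte characters instead of turning it into a space.
      (setq string
            (replace-regexp-in-string
             (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
             "\\1\\2" string))
      string)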
Note that [[:multibyte:]] matches characters of almost any non-ASCII
script, e.g. Cyrillic:
    (let ((sample "abc абв def"))
      (and (string-match "[[:multibyte:]]+" sample)
           (match-string 0 sample)))
    ;; => "абв"
It seems that `org-fill-paragraph' (M-q) is smart enough to avoid
inserting a space before or after a CJK character, so it is possible to
determine the correct way to join lines, even though e.g. the Unicode
"Script" property is not exposed to elisp:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
(In any case, maintaining an explicit list of scripts is not a
straightforward approach.)
P.S.
JavaScript in browsers allows matching characters that belong to a
particular script:
"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]
I have not found such a feature in the regular expressions available in
Emacs.
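
For comparison, here is a rough JavaScript sketch of the rule proposed
at the top of this message (no space when a CJK character is on either
side of the line break), next to an approximation of the
[[:multibyte:]] approach. The names `joinLines' and `joinNonAscii' are
hypothetical, and treating Han, Hiragana, Katakana and Hangul as "CJK"
is my assumption; none of this is what Org or Emacs actually does.

```javascript
// Assumption: "CJK" is approximated by these four Unicode scripts.
const CJK =
  "\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}";

// A line break, with optional surrounding spaces, that has a CJK
// character immediately before it or immediately after it.
const cjkBreak = new RegExp(
  `(?<=[${CJK}]) *\\n *| *\\n *(?=[${CJK}])`, "gu");

// Hypothetical helper: drop line breaks next to CJK characters,
// turn every other line break into a single space.
function joinLines(text) {
  return text.replace(cjkBreak, "").replace(/ *\n */g, " ");
}

// Rough analogue of the [[:multibyte:]] regexp: drop the line break
// whenever both neighbours are non-ASCII, whatever their script.
function joinNonAscii(text) {
  return text
    .replace(/([^\x00-\x7F]) *\n *([^\x00-\x7F])/gu, "$1$2")
    .replace(/ *\n */g, " ");
}

console.log(joinLines("中文\n日本語"));     // 中文日本語
console.log(joinLines("абв\nгде"));        // абв где
console.log(joinNonAscii("абв\nгде"));     // абвгде (space wrongly dropped)
```

The last line shows the problem described above: a test based only on
"non-ASCII" also glues Cyrillic words together, while a script-based
test leaves them separated.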