Ihor Radchenko <[email protected]> 2025/12/26 02:44 0800 writes:
> >> There are still problems with the proposed approach. In particular,
> >> using Po Unicode character category might be problematic.
> >> "!?.," are all Po, but we currently just allow them as right boundary.
> >> This makes sense since !* is probably intentional - in English, ! is
> >> end of sentence and should be followed by a space. So, it is unlikely to
> >> be expected as a left boundary of emphasis.
> >> 、 。 ! , . ? are also Po, but I am not sure whether one may expect
> >> to write 您好。*我*叫Ihor。
> >
> > Yes, this is exactly what we expect.
> >
> > Unlike English, CJK languages do not use spaces to separate sentences or
> > phrases. The punctuation marks themselves act as the delimiters.
>
> What about other Po characters like in
> https://www.compart.com/en/unicode/category/Po
> ?

The Po category (Punctuation, other) is a vast collection that goes far
beyond the everyday punctuation of Chinese or English. It includes many
symbols from specialized scripts or historical contexts where the
spacing convention is effectively "undefined" for a general-purpose
markup parser.

I believe trying to define a universal spacing rule for every character
in the Po table would be over-engineering. I think the primary goal
should be to ensure that common CJK delimiters (like 。, , and !) are
treated as valid boundaries for emphasis.

> Tests would be helpful, yes.
> But before that, I need to think hard about what to do with
> .,!? and similar English punctuation.
> We might possibly consult
> https://www.unicodecharacter.org/property/Line_Break/CL
> (Line_Break property = close punctuation), but 。 is also CL.
> But then ⁈ has Line_Break = Nonstarter, and we might want to allow ⁈
> after emphasis. So, Line_Break is a shaky metric.

Yes, Unicode properties (like Line_Break or the Po general category) are
indeed too broad and blurry to distinguish CJK punctuation from English
punctuation for our purpose.
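
To illustrate the point about Po, here is a quick check in Python (not
Elisp, purely for illustration, using the standard unicodedata module).
The helper name all_po is made up for this sketch. It only shows that
the ASCII and CJK characters discussed above share the same general
category, so Po alone cannot tell them apart:

```python
import unicodedata

# Characters taken from this thread; the grouping is only for illustration.
ASCII_PUNCT = "!?.,"
CJK_PUNCT = "\u3001\u3002\uff01\uff0c\uff0e\uff1f"  # 、 。 ! , . ?


def all_po(chars):
    """Return True if every character has general category Po."""
    return all(unicodedata.category(ch) == "Po" for ch in chars)


if __name__ == "__main__":
    for ch in ASCII_PUNCT + CJK_PUNCT + "\u2048":  # U+2048 is ⁈
        # Print code point, character, and its general category.
        print(f"U+{ord(ch):04X} {ch} {unicodedata.category(ch)}")
```

Both groups report Po, so a rule keyed on Po would treat "!" and "!"
identically.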

> Maybe "Terminal Punctuation" property.

Terminal Punctuation is indeed more promising. If we use it as a
baseline and then adjust it, either by cherry-picking the subset that
should act as valid boundaries or by excluding a few problematic
characters, the workload should be quite manageable.
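
A rough sketch of what "baseline plus exclusions" could look like,
again in Python for illustration only. Everything here is an assumption
rather than a proposal for the actual implementation: Python's standard
library does not expose Terminal_Punctuation, so the set below is a
small hand-copied excerpt limited to characters mentioned in this
thread, and the names TERMINAL_PUNCT_EXCERPT, RIGHT_ONLY,
may_precede_emphasis and may_follow_emphasis are hypothetical.

```python
# Hand-copied excerpt of characters with Terminal_Punctuation=Yes.
# Assumption: a real implementation would derive this from Unicode data.
TERMINAL_PUNCT_EXCERPT = (
    set("!?.,")                                   # ASCII
    | set("\u3001\u3002\uff01\uff0c\uff0e\uff1f")  # 、 。 ! , . ?
    | {"\u2048"}                                  # ⁈
)

# Hypothetical exclusion list: ASCII punctuation keeps its current role
# as a right boundary only.
RIGHT_ONLY = set("!?.,")


def may_follow_emphasis(ch):
    """Right boundary: may CH appear directly after a closing marker?"""
    return ch in TERMINAL_PUNCT_EXCERPT


def may_precede_emphasis(ch):
    """Left boundary: may CH appear directly before an opening marker?"""
    return ch in TERMINAL_PUNCT_EXCERPT and ch not in RIGHT_ONLY


if __name__ == "__main__":
    for ch in "!\u3002\u2048":
        print(ch, may_precede_emphasis(ch), may_follow_emphasis(ch))
```

Under this policy 。 is accepted on both sides of emphasis, so
您好。*我*叫Ihor。 would work, while "!" stays a right boundary only.
Whether ⁈ should also be allowed on the left is an open question; the
sketch allows it only because it is not in the exclusion list.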
