Ihor Radchenko <[email protected]> 2025/12/26 02:44 0800 writes:

> >> There are still problems with the proposed approach. In particular,
> >> using Po Unicode character category might be problematic.
> >> "!?.," are all Po, but we currently just allow them as right boundary.
> >> This makes sense since !* is probably intentional - in English, ! is
> >> end of sentence and should be followed by a space. So, it is unlikely to
> >> be expected as a left boundary of emphasis.
> >> 、 。 ! , . ? are also Po, but I am not sure whether one may expect
> >> to write 您好。*我*叫Ihor。
> >
> > Yes, this is exactly what we expect.
> >
> > Unlike English, CJK languages do not use spaces to separate sentences or
> > phrases. The punctuation marks themselves act as the delimiters.
>
> What about other Po characters like in
> https://www.compart.com/en/unicode/category/Po
> ?

The Po category (Punctuation, other) is a vast collection that goes far
beyond the everyday punctuation of Chinese or English. It includes many
symbols from specialized scripts or historical contexts where the spacing
convention is effectively "undefined" for a general-purpose markup parser.

I believe that trying to define a universal spacing rule for every
character in the Po table would be over-engineering. The primary goal
should probably be to ensure that common CJK delimiters (like 。, ,, and !)
are treated as valid boundaries for emphasis.

> Tests would be helpful, yes.
> But before that, I need to think hard about what to do with
> .,!? and similar English punctuation.
> We might possibly consult
> https://www.unicodecharacter.org/property/Line_Break/CL
> (Line_Break property = close punctuation), but 。 is also CL.
> But then ⁈ has Line_Break = Nonstarter, and we might want to allow ⁈
> after emphasis. So, Line_Break is a shaky metrics.

Yes, Unicode properties such as Line_Break or the Po general category are
indeed too broad and blurry to distinguish CJK punctuation from English
punctuation for our purpose.

> Maybe "Terminal Punctuation" property.

Terminal Punctuation is indeed more promising. If we use it as a baseline
and then cherry-pick a specific subset (or exclude a few problematic
characters) to act as valid boundaries, the workload should be quite
manageable.
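
To make the Po problem concrete, here is a small Python sketch (only an
illustration, not Org code). It shows that the general category alone cannot
separate the ASCII marks from the CJK ones, and what a hand-picked allow-list
could look like. CJK_BOUNDARY and is_cjk_boundary are hypothetical names, and
the set itself is just my assumption about which characters we would pick.
Python's unicodedata module does not expose Terminal_Punctuation, so the list
is written out by hand instead of being derived from that property.

```python
import unicodedata

# ASCII punctuation that is currently allowed only as a right boundary.
ASCII_PUNCT = "!?.,"

# Common CJK delimiters, written as escapes to survive copy/paste:
# U+3001 ideographic comma, U+3002 ideographic full stop, and the
# fullwidth forms of ! , . ?
CJK_PUNCT = "\u3001\u3002\uff01\uff0c\uff0e\uff1f"

# Hypothetical allow-list of characters accepted on both sides of emphasis.
# This hand-picked set is an assumption made for illustration only.
CJK_BOUNDARY = frozenset(CJK_PUNCT)


def general_category(ch: str) -> str:
    """Return the Unicode general category of CH, e.g. 'Po'."""
    return unicodedata.category(ch)


def is_cjk_boundary(ch: str) -> bool:
    """Return True if CH is in the hand-picked allow-list."""
    return ch in CJK_BOUNDARY


if __name__ == "__main__":
    # U+2048 is the question exclamation mark mentioned in the thread.
    for ch in ASCII_PUNCT + CJK_PUNCT + "\u2048":
        print(
            f"U+{ord(ch):04X} category={general_category(ch)} "
            f"cjk_boundary={is_cjk_boundary(ch)}"
        )
```

Every character in the loop reports Po, so the category cannot be the
deciding test, while the explicit set gives different answers for the ASCII
marks and the CJK ones.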
