Re: svn commit: r1934426 - subversion/trunk/subversion/svn

Branko Čibej Sat, 23 May 2026 08:40:26 -0700

On 23. 5. 26 10:21, Branko Čibej wrote:

On 22. 5. 26 20:31, Branko Čibej wrote:

On 21. 5. 26 14:30, Branko Čibej wrote:

On 21. 5. 26 14:25, Branko Čibej wrote:

Sent to the wrong address... &*^#@! gmail...


On 21. 5. 26 09:57, Branko Čibej wrote:

On Thu, 21 May 2026, 08:39 Daniel Sahlberg,<[email protected]> wrote:


    Den tors 21 maj 2026 kl 07:22 skrev Branko Čibej
    <[email protected]>:

        On Thu, 21 May 2026, 05:56 Nathan Hartman,
        <[email protected]> wrote:

            DSahlberg wrote:

            > If I understand the idea it was to ensure that an
            author name containing a Unicode character which is
            encoded using more than one byte in UTF-8 is correctly
            aligned in the output. Take the Swedish character A
            with two dots (Ä, encoded as 0xC3 0x84 in UTF-8) as an example.
            >
            > With the old code, note how the author name for Line
            6 is only showing 9 letters (but 10 bytes due to Ä
            encoded as two separate bytes):
            > [[[
            > dsg@devi-25-01:~/svn_trunk3$ svn blame ../wc/foo
            >      1        dsg 1
            >      1        dsg 2
            >      1        dsg 3
            >      2 averylonga Line 4
            >      1        dsg 5
            >      3 Äabcdefgh Line 6
            >      1        dsg 7
            >      1        dsg 8
            >      1        dsg 9
            > ]]]
            > (The above should probably be viewed with a monospace
            font; there it is clear that the text "Line 6" starts
            one position to the left of all other lines).
            >
            > With the change, the code detects that Ä only occupies
            a single "position" and thus we can display an
            additional letter in the author name and (with a
            monospace font) the columns are perfectly aligned:
            > [[[
            > dsg@devi-25-01:~/svn_trunk3$ ./subversion/svn/svn blame ../wc/foo
            >      1        dsg 1
            >      1        dsg 2
            >      1        dsg 3
            >      2 longauthor Line 4
            >      1        dsg 5
            >      3 Äabcdefghi Line 6
            >      1        dsg 7
            >      1        dsg 8
            >      1        dsg 9
            > ]]]
            >
            > However, the nicer columnar layout of course has a
            drawback if we are counting bytes from the start of the
            line. Previously `cut -c19-` would extract the
            contents of the file but now it would add an extra
            space preceding "Line 6". With multiple double-byte
            UTF-8 characters (in Swedish, ÅÄÖ would be 6 bytes),
            part of the author name would be extracted by cut.
            >
            > TLDR; It all depends on if we want a pretty layout
            (which is probably a design goal for svnbrowse) or if
            we want to ensure correct column width.
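
The byte-versus-glyph difference described above can be reproduced with a minimal sketch. The helper names `truncate_bytes` and `truncate_code_points` are hypothetical and only illustrate the two counting strategies; they are not Subversion functions.

```python
def truncate_bytes(name, limit):
    """Keep at most `limit` UTF-8 bytes (roughly what byte-based column logic does).

    An incomplete multi-byte sequence at the cut point is dropped.
    """
    return name.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def truncate_code_points(name, limit):
    """Keep at most `limit` Unicode code points."""
    return name[:limit]


author = "\u00c4abcdefghi"  # "Äabcdefghi": 10 code points, 11 UTF-8 bytes

by_bytes = truncate_bytes(author, 10)
by_code_points = truncate_code_points(author, 10)

print(len(author), len(author.encode("utf-8")))
print(by_bytes, len(by_bytes))
print(by_code_points, len(by_code_points.encode("utf-8")))
```

Counting bytes keeps the output at a fixed byte offset but shows one letter fewer; counting code points shows all ten letters but makes that line one byte longer than its neighbours.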

            Ah! (A light bulb goes on somewhere!) This makes much
            more sense now.
            Thank you for this clarity.

            As already said, we try to keep command line output
            stable for 3rd
            party scripts that might rely on it.

            BUT, we have changed output formats in the past
            (example: [2]), so a
            change isn't unprecedented.

            Yes, it should be discussed first.

            It looks to me like the old code was:
            - wrong in showing 9 Unicode glyphs instead of 10
            - right in printing 10 bytes to stdout
            - wrong in that a glyph with multiple code points
            might get split up
              mid-glyph.

            So, good catch by Timofei!

            Now, what to do about it?

            The old code is backwards compatible, won't break
            scripts, and is
            wrong.

            Changing it means we can make it "right" (for some
            value of "right")
            but it's a breaking change.

            And what does "right" even mean?

            If we say: the column is 10 glyphs wide, truncate
            longer names at 10
            glyphs, then we run into all sorts of edge cases, such as:

            - what if a glyph is visually wider than 1 character,
            e.g., a glyph
              that uses 2 spaces in a terminal? Do we count it as
            2 to keep the
              columns aligned?

            - what if an author name has 9 single-width glyphs
            followed by a 10th
              double-width glyph? Then I suppose you have to
            truncate the name at
              9 glyphs and insert a space, right?
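
One possible answer to the two edge cases above, sketched under stated assumptions: a character counts as two columns when its East Asian Width class is W or F, combining marks count as zero, and a name that cannot be cut exactly at the column limit is padded with spaces. `char_width` and `fit_right_aligned` are hypothetical helpers built on Python's `unicodedata`, not anything in Subversion or utf8proc.

```python
import unicodedata


def char_width(ch):
    """Approximate terminal width of one code point (assumption: 0, 1 or 2)."""
    if unicodedata.combining(ch):
        return 0  # combining marks are drawn on top of the previous character
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fit_right_aligned(name, columns):
    """Truncate `name` to at most `columns` cells, then left-pad with spaces."""
    kept = []
    used = 0
    for ch in name:
        width = char_width(ch)
        if used + width > columns:
            break  # a double-width character that does not fit is dropped whole
        kept.append(ch)
        used += width
    return " " * (columns - used) + "".join(kept)


print(repr(fit_right_aligned("dsg", 10)))
print(repr(fit_right_aligned("abcdefghi\u4e2d", 10)))
```

With nine single-width letters followed by one double-width character, the sketch keeps the nine letters and fills the tenth cell with a space, which is the behaviour guessed at in the second bullet.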


    Very good questions that we should have an answer to before
    doing something.


            Sheesh... text is hard!

            There is a point to all this rambling: 1.16 is
            bringing various utf8
            improvements. With that being front-and-center,
            breaking changes in
            stdout formats might seem reasonable if they improve
            utf8 correctness.
            They can be documented clearly under "compatibility
            considerations"
            (hopefully more clearly than in the 1.12 release notes
            where the
            breaking stdout change is buried here: [2]).

            Not saying we just push ahead and do it. Just saying
            we should think
            about it and make a decision.

            Nathan

            P.S., Circling back to my earlier reply...

            I wrote:

            > I also remember a discussion from several years
            back. It might be the same one you're thinking of. AFK
            right now but I'll try to find it.

            The discussion is at [1].

            Further, I wrote:

            > In fact, I'm also confused about the column width
            and truncation after 10 characters: I thought it
            starts with some column width and if a line is
            encountered which has a longer user name that doesn't
            fit, then the column width is increased for that line
            and all subsequent lines. (The rationale was, it's
            ugly, but better to be accurate than pretty.) Has that
            changed sometime in the last few years?

            And I stand corrected; it is 'svn ls -v' (not 'svn
            blame') that
            sacrifices column alignment to show full author names.
            Notice how my
            long name (doh!) messes up the formatting here:

            $ svn ls -v https://svn.apache.org/repos/asf/subversion/trunk | head -10
            1934444 ivan                  May 20 14:38 ./
            1922149 dsahlberg         1417 Nov 27  2024 .asf.yaml
            1922089 dsahlberg         1834 Nov 25  2024 .clang-format
            1921436 rinrab            1002 Oct 20  2024 .editorconfig
            1934358 ivan                   May 18 11:49 .github/
            1659509 rhuijben          3091 Feb 13  2015 .ycm_extra_conf.py
            1903577 hartmannathan           95 Aug 19  2022 BUGS
            1933381 kotkov              382263 Apr 27 07:30 CHANGES
            1934149 rinrab               33342 May 12 14:18 CMakeLists.txt
            1934118 dsahlberg            14780 May 11 14:40 COMMITTERS

            Starting at my name, the author column grows wider by
            4 for all
            subsequent lines. (Need fixed-width font to see it, or
            just take my
            word for it.)

            As Brane points out in that thread [2]:

            > This is by design, it was a change made in 1.12.
            >
            >
            > https://subversion.apache.org/docs/release-notes/1.12.html#client-server-improvements
            >
            > Yes, we realize that this makes simple output
            parsing harder, but the
            > alternative -- truncating author names at 8
            characters -- was considered
            > worse.


    Thanks for digging out the svn ls example!


            In contrast, 'svn blame' truncates author names, which are
            right-justified in the column. See how my long name
            gets truncat

            $ svn blame -r0:1926362 CMakeLists.txt
            [...]
            1926349     rinrab   endif()
            1926344     rinrab endif()
            1918878     rinrab
            1926362 hartmannat # APR and APR-Util include directories must be available to all our sources,
            1926360      brane # not just those that happen to link with one or the other of these libraries.
            1926360      brane get_target_property(_apr_include external-apr INTERFACE_INCLUDE_DIRECTORIES)
            1926360      brane get_target_property(_apu_include external-aprutil INTERFACE_INCLUDE_DIRECTORIES)
            [...]

            Column widths do not change.

            Not sure why I was mistaken about 'svn blame' acting
            like 'svn ls -v'.
            Maybe it was discussed and never implemented.

            Or maybe it's the Mandela Effect [3]. (Cue 'twilight
            zone' music...)

            [1] the dev@ thread
            '"svn list -v" column alignment issue' started 20 Dec
            2019:
            https://lists.apache.org/thread/3b03sbohwrcnqnyhs9gyb2r7hfphop75

            [2] same dev@ thread, message on 21 Dec 2019:
            https://lists.apache.org/thread/sglfobn66vt9pgdxfphygghjs9q7t00g

            [3]
            https://en.wikipedia.org/wiki/False_memory#Mandela_effect



        The plain text output of svn commands is meant for humans,
        not scripts. Always has been. This means that readability
        for people takes priority over ease of scripting. Scripts
        that need precision should use --xml.
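
As a rough illustration of the --xml route, the sketch below parses a hand-written sample instead of running svn. The sample is an assumption about the general shape of `svn blame --xml` output (entries carrying a line number, a commit revision and an author element); the author names, revisions and the helper `blame_authors` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to resemble `svn blame --xml` output.
SAMPLE = """<blame>
<target path="foo">
<entry line-number="1">
<commit revision="1"><author>dsg</author></commit>
</entry>
<entry line-number="2">
<commit revision="3"><author>\u00c4abcdefghi with spaces</author></commit>
</entry>
</target>
</blame>"""


def blame_authors(xml_text):
    """Return a list of (line number, revision, author) tuples."""
    rows = []
    for entry in ET.fromstring(xml_text).iter("entry"):
        commit = entry.find("commit")
        if commit is None:
            rows.append((int(entry.get("line-number")), None, None))
            continue
        rows.append((int(entry.get("line-number")),
                     int(commit.get("revision")),
                     commit.findtext("author")))
    return rows


for row in blame_authors(SAMPLE):
    print(row)
```

The author comes back as a complete string whatever its length, spaces or encoding, so no column offsets are involved.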


    (y)


        The output of 'svn blame' is a good example of the
        trade-offs this implies. The width of the blame info
        column is fixed at the expense of truncating the username
        – people can interpret that well enough – so that the file
        contents' format is preserved.

        A script can't presume to find the username at a specific
        column offset, nor even in a specific space-separated
        field because usernames can contain spaces. A human,
        however, can expect a fixed visual column width.

        Now, what "fixed-width" means when it comes to general
        Unicode is an interesting question that I can't
        answer. In most scripts I'm familiar with (Latin,
        Cyrillic, modern Greek, etc.) this seems like a simple
        question. But what about Arabic and Hebrew, for example?
        They're right-to-left and AFAIK we don't deal with that.
        Georgian? Thai? Sinhalese? I have no clue at all.


    I think this is what utf8proc_charwidth() tries to accomplish.
    But then there is the utf8proc_charwidth_ambiguous function:
    "Given a codepoint, return whether it has East Asian width
    class A (Ambiguous). Codepoints with this property are
    considered to have charwidth 1 (if they are printable) but
    some East Asian fonts render them as double width."
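
The East Asian Width classes mentioned in that description can be inspected with Python's `unicodedata` module, used here only as a stand-in for the utf8proc lookups. `is_ambiguous_width` is a hypothetical helper, and the sample characters are chosen purely for illustration.

```python
import unicodedata


def is_ambiguous_width(ch):
    """True if the code point has East Asian Width class A (Ambiguous)."""
    return unicodedata.east_asian_width(ch) == "A"


# Print the width class of a few sample code points.
for ch in ("A", "\u00c4", "\u00a7", "\u044c", "\u4e2d"):
    print(f"U+{ord(ch):04X} class={unicodedata.east_asian_width(ch)} "
          f"ambiguous={is_ambiguous_width(ch)}")
```

Whether an ambiguous character takes one or two cells depends on the terminal and font, so any width calculation has to pick a convention for class A.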

    RTL would probably require some really significant changes.
    And if someone combines Arabic and English text in the same
    commit message...



        Even the "simple" scripts are tricky. For example,
        chopping off a ъ or ь will change the sound of the
        preceding consonant but I don't know if it could also
        change the assumed (by humans) meaning of the word...

        TL;DR: text is HARD. We'll never get it right, so we
        shouldn't try too hard. Our goal should be to make the
        visual width calculation good enough for most cases and to
        not split multi-codepoint glyphs.

        That includes Ä by the way, it can be encoded as one or
        two codepoints.


    Unless I'm mistaken (and I have only read the docs, not
    experimented with it), utf8proc_map should be able to combine
    all codepoints that have a composite codepoint. Maybe that
    would help to avoid splitting?
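
The two encodings of Ä and the effect of composing them can be shown with Python's `unicodedata.normalize`, used here as a stand-in for what utf8proc would do. The helper name `nfc` is just a shorthand for this sketch.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC: pre-compose every character that has a composed form."""
    return unicodedata.normalize("NFC", text)


composed = "\u00c4"      # Ä as one code point
decomposed = "A\u0308"   # A followed by COMBINING DIAERESIS

print(composed == decomposed)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
print(nfc(decomposed) == composed)
```

Composition only helps for sequences that have a pre-composed code point. A base letter with a mark that has no composed form stays as two code points after NFC, so truncation still has to respect cluster boundaries.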


        Välkommen till Unicode.

        -- Brane


    Thanks! It seems to be a nice rabbit hole.

    Cheers,
     /Daniel

I think, but I'm not sure, we already normalize to NFC in some places – that's the normalisation form that pre-composes all the characters it can and leaves the remaining combining marks in a well-defined order. If we don't, we could do that – carefully and only in certain places on output.


Utf8proc gives us all the tools we need for that.

What I wrote above is actually a bit of nonsense. There's no need to normalise as long as we calculate the right visual width and find the correct glyph boundaries.

I think utf8proc_grapheme_break_stateful() is what we're looking for. With it, we can iterate through a UCS-4 string to find grapheme boundaries, and we can safely truncate on those boundaries without mangling glyphs. It appears that counting graphemes is still not enough to precisely compute the visual width of the text even assuming a monospace font. However, I think we will have to live with that limitation. Or at least, I can't think of a simple-ish way to improve on this.
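
To make the idea of truncating on grapheme boundaries concrete, here is a deliberately simplified sketch. It is not the UAX #29 algorithm that utf8proc implements: it only attaches combining marks and zero-width joiner sequences to the preceding character. `grapheme_clusters` and `truncate_clusters` are hypothetical names.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_extender(ch):
    """Assumption: marks (Mn, Me, Mc) and ZWJ never start a new cluster."""
    return unicodedata.category(ch) in ("Mn", "Me", "Mc") or ch == ZWJ


def grapheme_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters = []
    for ch in text:
        if clusters and (is_extender(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch   # stays glued to the previous cluster
        else:
            clusters.append(ch)  # starts a new cluster
    return clusters


def truncate_clusters(text, max_clusters):
    """Keep at most `max_clusters` whole clusters."""
    return "".join(grapheme_clusters(text)[:max_clusters])


name = "A\u0308bc"  # decomposed Ä, then "bc"
print(grapheme_clusters(name))
print(repr(name[:1]), repr(truncate_clusters(name, 1)))
```

Slicing by code point cuts the diaeresis off the A, while cutting on cluster boundaries keeps the letter intact.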

I tried this out in r1934528, and it seems to work well; see svn_utf__cstring_utf8_grapheme_breaks().

I reimplemented svn_utf_cstring_utf8_width() on top of this function, and it (of course) yields the same result; because the string width calculation turns out to be exactly the same, just a bit less direct. In the interest of performance, we should revert to the previous implementation but include int overflow checks.

We could now reimplement ..._align_left() and ..._align_right() to use the grapheme array and avoid splitting graphemes. I'd also suggest these two functions be renamed to ..._trim_left() and ..._trim_right(), because that's what they actually do. I'd let the caller deal with padding, either via *printf() or in any other way they like, and have the trim functions return the (byte) length of the transformed strings as well as their visual width.
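
A sketch of what such a trim function's contract could look like, assuming that "trim right" means dropping characters from the end of the string. `trim_right` and `char_width` are hypothetical Python helpers, not the proposed C API, and the width rules are a rough approximation (zero for combining marks and ZWJ, two for East Asian wide characters, one otherwise).

```python
import unicodedata


def char_width(ch):
    """Approximate cell width of a code point (assumption, not utf8proc)."""
    if unicodedata.category(ch) in ("Mn", "Me") or ch == "\u200d":
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def trim_right(text, max_width):
    """Drop whole clusters from the end until the text fits in max_width cells.

    Returns (trimmed text, UTF-8 byte length, visual width); no padding is added.
    """
    width = 0
    end = 0
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and char_width(text[j]) == 0:
            j += 1  # zero-width followers belong to the cluster starting at i
        cluster_width = char_width(text[i])
        if width + cluster_width > max_width:
            break
        width += cluster_width
        end = j
        i = j
    trimmed = text[:end]
    return trimmed, len(trimmed.encode("utf-8")), width


# The caller decides how to pad, here on the left for a right-justified column.
trimmed, nbytes, width = trim_right("A\u0308abcdefghij", 10)
print(" " * (10 - width) + trimmed, nbytes, width)
```

Returning both the byte length and the visual width lets the caller choose between byte-exact buffers and cell-exact padding without measuring the string again.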

I'd like to expand a bit on grapheme-aware trimming. The function I added is a proof of concept but is not suitable for general use because it allocates way too much data for graphemes. Both trim-right and trim-left can be implemented without additional allocations for finding the split point – though trim-left would have to calculate the width of the whole string first. I think we'd rather traverse the string twice than allocate more memory, especially since trim-left probably won't be used often.
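
The two-pass approach to trim-left can be sketched as follows, assuming that "trim left" means dropping characters from the start of the string and using the same rough width rules as elsewhere in this thread's discussion. `trim_left` and `char_width` are hypothetical Python helpers; the point is only that the split position is found by scanning, without building a list of graphemes.

```python
import unicodedata


def char_width(ch):
    """Approximate cell width of a code point (assumption, not utf8proc)."""
    if unicodedata.category(ch) in ("Mn", "Me") or ch == "\u200d":
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def trim_left(text, max_width):
    """Drop whole clusters from the start until the rest fits in max_width cells.

    Returns (kept text, UTF-8 byte length, visual width); no padding is added.
    """
    # Pass 1: total width of the whole string.
    total = sum(char_width(ch) for ch in text)
    excess = total - max_width
    # Pass 2: skip clusters until at least `excess` cells have been removed.
    i = 0
    removed = 0
    while i < len(text) and removed < excess:
        removed += char_width(text[i])
        i += 1
        while i < len(text) and char_width(text[i]) == 0:
            i += 1  # zero-width followers leave together with their base
    kept = text[i:]
    return kept, len(kept.encode("utf-8")), total - removed


print(trim_left("hartmannathan", 10))
print(trim_left("\u4e2dabcdefghi", 10))
```

When a double-width character straddles the cut, the result is one cell narrower than requested, and the caller's padding makes up the difference.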


-- Brane
