On Thu, 21 May 2026, 05:56 Nathan Hartman, <[email protected]> wrote:

> DSahlberg wrote:
>
> > If I understand the idea it was to ensure that an author name containing
> a Unicode character which is encoded using more than one byte in UTF-8 is
> correctly aligned in the output. Using the Swedish character A with two
> dots (Ä, encoded as 0xC3, 0x84):
> >
> > With the old code, note how the author name for Line 6 is only showing 9
> letters (but 10 bytes due to Ä encoded as two separate bytes):
> > [[[
> > dsg@devi-25-01:~/svn_trunk3$ svn blame ../wc/foo
> >      1        dsg 1
> >      1        dsg 2
> >      1        dsg 3
> >      2 averylonga Line 4
> >      1        dsg 5
> >      3 Äabcdefgh Line 6
> >      1        dsg 7
> >      1        dsg 8
> >      1        dsg 9
> > ]]]
> > (The above should probably be viewed with a monospace font, where it is
> clear that the text "Line 6" starts one position to the left of all other
> lines).
> >
> > With the change, the code detects that Ä only occupies a single "position"
> and thus we can display an additional letter in the author name and
> (with a monospace font) the columns are perfectly aligned:
> > [[[
> > dsg@devi-25-01:~/svn_trunk3$ ./subversion/svn/svn blame ../wc/foo
> >      1        dsg 1
> >      1        dsg 2
> >      1        dsg 3
> >      2 longauthor Line 4
> >      1        dsg 5
> >      3 Äabcdefghi Line 6
> >      1        dsg 7
> >      1        dsg 8
> >      1        dsg 9
> > ]]]
> >
> > However, the nicer columnar layout of course has a drawback if we are
> counting bytes from the start of the line. Previously `cut -c19-` would extract
> the contents of the file but now it would add an extra space preceding
> "Line 6". (With multiple double-byte UTF-8 characters, such as Swedish ÅÄÖ at
> 6 bytes, we would have part of the author name extracted by cut.)
> >
> > TL;DR: It all depends on whether we want a pretty layout (which is probably a
> design goal for svnbrowse) or we want to ensure correct column width.
>
> Ah! (A light bulb goes on somewhere!) This makes much more sense now.
> Thank you for this clarity.
>
> As already said, we try to keep command line output stable for 3rd
> party scripts that might rely on it.
>
> BUT, we have changed output formats in the past (example: [2]), so a
> change isn't unprecedented.
>
> Yes, it should be discussed first.
>
> It looks to me like the old code was:
> - wrong in showing 9 Unicode glyphs instead of 10
> - right in printing 10 bytes to stdout
> - wrong in that a glyph with multiple code points might get split up
>   mid-glyph.
>
> So, good catch by Timofei!
>
> Now, what to do about it?
>
> The old code is backwards compatible, won't break scripts, and is
> wrong.
>
> Changing it means we can make it "right" (for some value of "right")
> but it's a breaking change.
>
> And what does "right" even mean?
>
> If we say: the column is 10 glyphs wide, truncate longer names at 10
> glyphs, then we run into all sorts of edge cases, such as:
>
> - what if a glyph is visually wider than 1 character, e.g., a glyph
>   that uses 2 spaces in a terminal? Do we count it as 2 to keep the
>   columns aligned?
>
> - what if an author name has 9 single-width glyphs followed by a 10th
>   double-width glyph? Then I suppose you have to truncate the name at
>   9 glyphs and insert a space, right?
>
> Sheesh... text is hard!
>
> There is a point to all this rambling: 1.16 is bringing various utf8
> improvements. With that being front-and-center, breaking changes in
> stdout formats might seem reasonable if they improve utf8 correctness.
> They can be documented clearly under "compatibility considerations"
> (hopefully more clearly than in the 1.12 release notes where the
> breaking stdout change is buried here: [2]).
>
> Not saying we just push ahead and do it. Just saying we should think
> about it and make a decision.
>
> Nathan
>
> P.S., Circling back to my earlier reply...
>
> I wrote:
>
> > I also remember a discussion from several years back. It might be the
> same one you're thinking of. AFK right now but I'll try to find it.
>
> The discussion is at [1].
>
> Further, I wrote:
>
> > In fact, I'm also confused about the column width and truncation after
> 10 characters: I thought it starts with some column width and if a line is
> encountered which has a longer user name that doesn't fit, then the column
> width is increased for that line and all subsequent lines. (The rationale
> was, it's ugly, but better to be accurate than pretty.) Has that changed
> sometime in the last few years?
>
> And I stand corrected; it is 'svn ls -v' (not 'svn blame') that
> sacrifices column alignment to show full author names. Notice how my
> long name (doh!) messes up the formatting here:
>
> $ svn ls -v https://svn.apache.org/repos/asf/subversion/trunk | head -10
> 1934444 ivan                  May 20 14:38 ./
> 1922149 dsahlberg         1417 Nov 27  2024 .asf.yaml
> 1922089 dsahlberg         1834 Nov 25  2024 .clang-format
> 1921436 rinrab            1002 Oct 20  2024 .editorconfig
> 1934358 ivan                   May 18 11:49 .github/
> 1659509 rhuijben          3091 Feb 13  2015 .ycm_extra_conf.py
> 1903577 hartmannathan           95 Aug 19  2022 BUGS
> 1933381 kotkov              382263 Apr 27 07:30 CHANGES
> 1934149 rinrab               33342 May 12 14:18 CMakeLists.txt
> 1934118 dsahlberg            14780 May 11 14:40 COMMITTERS
>
> Starting at my name, the author column grows wider by 4 for all
> subsequent lines. (Need fixed-width font to see it, or just take my
> word for it.)
>
> As Brane points out in that thread [2]:
>
> > This is by design, it was a change made in 1.12.
> >
> >
> https://subversion.apache.org/docs/release-notes/1.12.html#client-server-improvements
> >
> > Yes, we realize that this makes simple output parsing harder, but the
> > alternative -- truncating author names at 8 characters -- was considered
> > worse.
>
> In contrast, 'svn blame' truncates author names, which are
> right-justified in the column. See how my long name gets truncated:
>
> $ svn blame -r0:1926362 CMakeLists.txt
> [...]
> 1926349     rinrab   endif()
> 1926344     rinrab endif()
> 1918878     rinrab
> 1926362 hartmannat # APR and APR-Util include directories must be
> available to all our sources,
> 1926360      brane # not just those that happen to link with one or
> the other of these libraries.
> 1926360      brane get_target_property(_apr_include external-apr
> INTERFACE_INCLUDE_DIRECTORIES)
> 1926360      brane get_target_property(_apu_include external-aprutil
> INTERFACE_INCLUDE_DIRECTORIES)
> [...]
>
> Column widths do not change.
>
> Not sure why I was mistaken about 'svn blame' acting like 'svn ls -v'.
> Maybe it was discussed and never implemented.
>
> Or maybe it's the Mandela Effect [3]. (Cue 'twilight zone' music...)
>
> [1] the dev@ thread
> '"svn list -v" column alignment issue' started 20 Dec 2019:
> https://lists.apache.org/thread/3b03sbohwrcnqnyhs9gyb2r7hfphop75
>
> [2] same dev@ thread, message on 21 Dec 2019:
> https://lists.apache.org/thread/sglfobn66vt9pgdxfphygghjs9q7t00g
>
> [3] https://en.wikipedia.org/wiki/False_memory#Mandela_effect



The plain text output of svn commands is meant for humans, not scripts.
Always has been. This means that readability for people takes priority over
ease of scripting. Scripts that need precision should use --xml.
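
To make the --xml point concrete, here is a minimal Python sketch. It
assumes the element and attribute names that 'svn blame --xml' emits
(blame/target/entry/commit/author, line-number, revision); the sample
document and the helper name blame_authors are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape `svn blame --xml` is assumed to produce.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<blame>
<target path="foo">
<entry line-number="1">
<commit revision="1">
<author>dsg</author>
<date>2026-05-20T10:00:00.000000Z</date>
</commit>
</entry>
<entry line-number="6">
<commit revision="3">
<author>Äabcdefghijkl</author>
<date>2026-05-20T11:00:00.000000Z</date>
</commit>
</entry>
</target>
</blame>
"""


def blame_authors(xml_text):
    """Return (line_number, revision, author) tuples, authors untruncated."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for entry in root.iter("entry"):
        commit = entry.find("commit")
        if commit is None:
            continue  # skip entries that carry no commit information
        rows.append((int(entry.get("line-number")),
                     int(commit.get("revision")),
                     commit.findtext("author")))
    return rows


for row in blame_authors(SAMPLE_XML):
    print(row)
```

The author comes back whole, regardless of how wide the plain-text
column happens to be.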

The output of 'svn blame' is a good example of the trade-offs this implies.
The width of the blame info column is fixed at the expense of truncating
the username – people can interpret that well enough – so that the file
contents' format is preserved.

A script can't presume to find the username at a specific column offset,
nor even in a specific space-separated field because usernames can contain
spaces. A human, however, can expect a fixed visual column width.
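
A small Python sketch of why a fixed byte offset is fragile, using the
two "Line 6" rows from Daniel's example. It assumes a byte-oriented
`cut -c19-`, which is what his description implies; cut_bytes_from is a
hypothetical stand-in for that, not anything in our code.

```python
def cut_bytes_from(line, start):
    """Mimic a byte-oriented `cut -c<start>-`: drop the first start-1 bytes."""
    return line.encode("utf-8")[start - 1:].decode("utf-8", errors="replace")


old_line = "     3 Äabcdefgh Line 6"    # author truncated to 10 bytes
new_line = "     3 Äabcdefghi Line 6"   # author truncated to 10 characters

print(repr(cut_bytes_from(old_line, 19)))  # 'Line 6'
print(repr(cut_bytes_from(new_line, 19)))  # ' Line 6'
```

The second result carries the extra leading space: the author column is
now 11 bytes wide, so the file contents start one byte later.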

Now, what "fixed-width" means when it comes to general Unicode is an
interesting question, and one that I can't answer. In most scripts I'm familiar
with (Latin, Cyrillic, modern Greek, etc.) this seems like a simple
question. But what about Arabic and Hebrew, for example? They're
right-to-left and AFAIK we don't deal with that. Georgian? Thai? Sinhalese?
I have no clue at all.

Even the "simple" scripts are tricky. For example, chopping off a ъ or ь
will change the sound of the preceding consonant but I don't know if it
could also change the assumed (by humans) meaning of the word...

TL;DR: text is HARD. We'll never get it right, so we shouldn't try too
hard. Our goal should be to make the visual width calculation good enough
for most cases and to not split multi-codepoint glyphs.
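
As a rough illustration of "good enough", here is a Python sketch, not
a proposal for the actual C code. The names cell_width and fit_author
are hypothetical. It counts East Asian wide and fullwidth characters as
two cells, counts combining marks as zero so they stay attached to
their base character, and pads with spaces when a double-width
character would not fit. It does not handle emoji sequences,
zero-width joiners, or right-to-left text.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of terminal cells used by one code point."""
    if unicodedata.combining(ch):
        return 0  # combining marks attach to the preceding character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide / fullwidth
    return 1


def fit_author(name, width=10):
    """Truncate name to at most `width` cells, then right-justify it."""
    out, used = [], 0
    for ch in name:
        w = cell_width(ch)
        if used + w > width:
            break
        out.append(ch)
        used += w
    return " " * (width - used) + "".join(out)


print(repr(fit_author("dsg")))
print(repr(fit_author("Äabcdefghijk")))
print(repr(fit_author("abcdefghi\u4e2d")))  # 9 narrow characters + 1 wide
```

In the last call the wide character would need cells 10 and 11, so the
name is cut at nine characters and padded with one space, which is the
case Nathan asked about.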

That includes Ä, by the way: it can be encoded as one or two codepoints.
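
The two encodings, sketched in Python with the standard unicodedata
module (the variable names are only for illustration):

```python
import unicodedata

composed = "\u00c4"      # Ä as a single code point, U+00C4
decomposed = "A\u0308"   # A followed by U+0308 COMBINING DIAERESIS

# The strings differ, but normalization maps one form onto the other.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

for label, s in (("composed", composed), ("decomposed", decomposed)):
    print(label, len(s), "code point(s),", len(s.encode("utf-8")), "UTF-8 bytes")
```

So the same visible letter is 2 bytes in one form and 3 in the other,
and truncating between the A and the combining mark would change what
the reader sees.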

Välkommen till Unicode.

-- Brane
