On Thu, 21 May 2026, 08:39 Daniel Sahlberg,
<[email protected]> wrote:
On Thu, 21 May 2026 at 07:22, Branko Čibej <[email protected]> wrote:
On Thu, 21 May 2026, 05:56 Nathan Hartman,
<[email protected]> wrote:
DSahlberg wrote:
> If I understand the idea, it was to ensure that an
author name containing a Unicode character which is
encoded using more than one byte in UTF-8 is correctly
aligned in the output. Take the Swedish character A
with two dots (Ä, encoded as 0xC3, 0x84) as an example.
>
> With the old code, note how the author name for Line 6
is only showing 9 letters (but 10 bytes due to Ä encoded
as two separate bytes):
> [[[
> dsg@devi-25-01:~/svn_trunk3$ svn blame ../wc/foo
>      1        dsg 1
>      1        dsg 2
>      1        dsg 3
>      2 averylonga Line 4
>      1        dsg 5
>      3 Äabcdefgh Line 6
>      1        dsg 7
>      1        dsg 8
>      1        dsg 9
> ]]]
> (The above should probably be viewed with a monospace
font; there it is clear that the text "Line 6" starts one
position to the left of all other lines.)
>
> With the change, the code detects that Ä only occupies a
single "position" and thus we can display an additional
letter in the author name and (with a monospace
font) the columns are perfectly aligned:
> [[[
> dsg@devi-25-01:~/svn_trunk3$ ./subversion/svn/svn blame ../wc/foo
>      1        dsg 1
>      1        dsg 2
>      1        dsg 3
>      2 longauthor Line 4
>      1        dsg 5
>      3 Äabcdefghi Line 6
>      1        dsg 7
>      1        dsg 8
>      1        dsg 9
> ]]]
>
> However, the nicer columnar layout of course has a
drawback if we are counting bytes to find the start of the line.
Previously `cut -c19-` would extract the contents of
the file, but now it would add an extra space preceding
"Line 6". (With multiple double-byte UTF-8 characters,
e.g. the Swedish ÅÄÖ, which would be 6 bytes, we would have part of
the author name extracted by cut.)
>
> TL;DR: It all depends on whether we want a pretty layout
(which is probably a design goal for svnbrowse) or whether we
want to keep the column width fixed in bytes.
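To make the byte-versus-glyph difference concrete, here is a minimal Python sketch (not Subversion's C code). It assumes the blame prefix is a 6-wide revision, a space, a 10-wide right-justified author and another space, which is what makes `cut -c19-` land on the file contents. The helper names and the full author name "Äabcdefghijk" are hypothetical.

```python
def pad_by_bytes(author, width=10):
    """Sketch of the old behaviour: truncate and pad counting UTF-8 bytes."""
    raw = author.encode("utf-8")[:width]
    # errors="ignore" drops a multi-byte character that was cut in half
    name = raw.decode("utf-8", errors="ignore")
    return " " * (width - len(name.encode("utf-8"))) + name

def pad_by_code_points(author, width=10):
    """Sketch of the new behaviour: truncate and pad counting code points."""
    name = author[:width]
    return " " * (width - len(name)) + name

def blame_line(rev, author_field, text):
    return f"{rev:>6} {author_field} {text}"

author = "Äabcdefghijk"
old = blame_line(3, pad_by_bytes(author), "Line 6")
new = blame_line(3, pad_by_code_points(author), "Line 6")

# Byte offset of the file contents (what a byte-based cut sees)
print(old.encode("utf-8").index(b"Line 6"))  # 18
print(new.encode("utf-8").index(b"Line 6"))  # 19
# Character offset (what a human sees in a monospace terminal)
print(old.index("Line 6"))  # 17
print(new.index("Line 6"))  # 18
```

The old variant keeps the byte offset at 18 but shifts the visible column by one; the new variant does the opposite.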
Ah! (A light bulb goes on somewhere!) This makes much
more sense now.
Thank you for this clarity.
As already said, we try to keep command line output stable for 3rd
party scripts that might rely on it.
BUT, we have changed output formats in the past (example: [2]), so a
change isn't unprecedented.
Yes, it should be discussed first.
It looks to me like the old code was:
- wrong in showing 9 Unicode glyphs instead of 10
- right in printing 10 bytes to stdout
- wrong in that a glyph with multiple code points might get split up
  mid-glyph.
So, good catch by Timofei!
Now, what to do about it?
The old code is backwards compatible, won't break scripts, and is
wrong.
Changing it means we can make it "right" (for some value of "right")
but it's a breaking change.
And what does "right" even mean?
If we say: the column is 10 glyphs wide, truncate longer names at 10
glyphs, then we run into all sorts of edge cases, such as:
- what if a glyph is visually wider than 1 character, e.g., a glyph
  that uses 2 spaces in a terminal? Do we count it as 2 to keep the
  columns aligned?
- what if an author name has 9 single-width glyphs followed by a 10th
  double-width glyph? Then I suppose you have to truncate the name at
  9 glyphs and insert a space, right?
Very good questions that we should have an answer to before
doing something.
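One possible answer to both edge cases, sketched in Python rather than in Subversion's C: count terminal cells instead of glyphs, and pad with a space when a double-width glyph would not fit. Python's `unicodedata` is used here only as a stand-in for a width function; `char_width` and `fit_to_columns` are hypothetical names, and the width rule is a rough approximation.

```python
import unicodedata

def char_width(ch):
    """Approximate number of terminal cells used by one code point."""
    if unicodedata.combining(ch):
        return 0                      # combining marks sit on the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide and fullwidth characters
    return 1

def fit_to_columns(name, columns=10):
    """Truncate to at most `columns` cells, then left-pad to exactly that width."""
    kept, used = [], 0
    for ch in name:
        w = char_width(ch)
        if used + w > columns:
            break                     # never emit half of a wide glyph
        kept.append(ch)
        used += w
    return " " * (columns - used) + "".join(kept)

# 9 single-width glyphs followed by a double-width one:
# the name is cut at 9 glyphs and one space of padding is added.
print(repr(fit_to_columns("abcdefghi中")))  # ' abcdefghi'
```

Because the author is right-justified, the padding space ends up in front of the name in this sketch; whether it should go before or after is part of the open question.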
Sheesh... text is hard!
There is a point to all this rambling: 1.16 is bringing various utf8
improvements. With that being front-and-center, breaking changes in
stdout formats might seem reasonable if they improve utf8 correctness.
They can be documented clearly under "compatibility considerations"
(hopefully more clearly than in the 1.12 release notes where the
breaking stdout change is buried here: [2]).
Not saying we just push ahead and do it. Just saying we should think
about it and make a decision.
Nathan
P.S., Circling back to my earlier reply...
I wrote:
> I also remember a discussion from several years back.
It might be the same one you're thinking of. AFK right
now but I'll try to find it.
The discussion is at [1].
Further, I wrote:
> In fact, I'm also confused about the column width and
truncation after 10 characters: I thought it starts with
some column width and if a line is encountered which has
a longer user name that doesn't fit, then the column
width is increased for that line and all subsequent
lines. (The rationale was, it's ugly, but better to be
accurate than pretty.) Has that changed sometime in the
last few years?
And I stand corrected; it is 'svn ls -v' (not 'svn blame') that
sacrifices column alignment to show full author names. Notice how my
long name (doh!) messes up the formatting here:
$ svn ls -v https://svn.apache.org/repos/asf/subversion/trunk | head -10
1934444 ivan May 20 14:38 ./
1922149 dsahlberg 1417 Nov 27 2024 .asf.yaml
1922089 dsahlberg 1834 Nov 25 2024 .clang-format
1921436 rinrab 1002 Oct 20 2024 .editorconfig
1934358 ivan May 18 11:49 .github/
1659509 rhuijben 3091 Feb 13 2015 .ycm_extra_conf.py
1903577 hartmannathan 95 Aug 19 2022 BUGS
1933381 kotkov 382263 Apr 27 07:30 CHANGES
1934149 rinrab 33342 May 12 14:18 CMakeLists.txt
1934118 dsahlberg 14780 May 11 14:40 COMMITTERS
Starting at my name, the author column grows wider by 4 for all
subsequent lines. (Need fixed-width font to see it, or just take my
word for it.)
As Brane points out in that thread [2]:
> This is by design, it was a change made in 1.12.
>
> https://subversion.apache.org/docs/release-notes/1.12.html#client-server-improvements
>
> Yes, we realize that this makes simple output parsing harder, but the
> alternative -- truncating author names at 8 characters -- was considered
> worse.
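The growing-column behaviour can be sketched roughly as follows. This is a Python illustration, not the actual 'svn ls -v' code: it assumes an initial author width of 8 (taken from the "8 characters" remark above), omits the size and date columns, and `format_listing` is a hypothetical name.

```python
def format_listing(entries, initial_width=8):
    """Format (rev, author, name) rows; the author column only ever grows."""
    width = initial_width
    lines = []
    for rev, author, name in entries:
        # Widen the column for this and all subsequent rows instead of truncating
        width = max(width, len(author))
        lines.append(f"{rev:>7} {author:<{width}} {name}")
    return lines

ENTRIES = [
    (1934444, "ivan", "./"),
    (1922149, "dsahlberg", ".asf.yaml"),
    (1659509, "rhuijben", ".ycm_extra_conf.py"),
    (1903577, "hartmannathan", "BUGS"),
    (1933381, "kotkov", "CHANGES"),
]

lines = format_listing(ENTRIES)
for line in lines:
    print(line)
```

With these rows the name column starts at offset 17, moves to 18 at "dsahlberg", and jumps by 4 more at "hartmannathan", which is why a fixed `cut` offset stops working.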
Thanks for digging out the svn ls example!
In contrast, 'svn blame' truncates author names, which are
right-justified in the column. See how my long name gets
truncated:
$ svn blame -r0:1926362 CMakeLists.txt
[...]
1926349 rinrab endif()
1926344 rinrab endif()
1918878 rinrab
1926362 hartmannat # APR and APR-Util include directories must be available to all our sources,
1926360 brane # not just those that happen to link with one or the other of these libraries.
1926360 brane get_target_property(_apr_include external-apr INTERFACE_INCLUDE_DIRECTORIES)
1926360 brane get_target_property(_apu_include external-aprutil INTERFACE_INCLUDE_DIRECTORIES)
[...]
Column widths do not change.
Not sure why I was mistaken about 'svn blame' acting
like 'svn ls -v'.
Maybe it was discussed and never implemented.
Or maybe it's the Mandela Effect [3]. (Cue 'twilight
zone' music...)
[1] the dev@ thread '"svn list -v" column alignment issue' started 20 Dec 2019:
https://lists.apache.org/thread/3b03sbohwrcnqnyhs9gyb2r7hfphop75
[2] same dev@ thread, message on 21 Dec 2019:
https://lists.apache.org/thread/sglfobn66vt9pgdxfphygghjs9q7t00g
[3] https://en.wikipedia.org/wiki/False_memory#Mandela_effect
The plain text output of svn commands is meant for humans,
not scripts. Always has been. This means that readability
for people takes priority over ease of scripting. Scripts
that need precision should use --xml.
(y)
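For scripts, a minimal Python sketch of reading author names from 'svn blame --xml' instead of cutting columns. The XML below is a hand-written sample in the shape that command prints, not captured output; the author names in it and the function name `blame_authors` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of `svn blame --xml` output (assumption)
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<blame>
<target path="foo">
<entry line-number="4">
<commit revision="2">
<author>averylongauthor</author>
<date>2026-05-20T10:00:00.000000Z</date>
</commit>
</entry>
<entry line-number="6">
<commit revision="3">
<author>Äabcdefghijk</author>
<date>2026-05-20T11:00:00.000000Z</date>
</commit>
</entry>
</target>
</blame>
"""

def blame_authors(xml_bytes):
    """Return (line number, revision, full author name) for each blamed line."""
    root = ET.fromstring(xml_bytes)
    result = []
    for entry in root.iter("entry"):
        commit = entry.find("commit")
        if commit is None:
            continue  # skip entries without commit info
        result.append((int(entry.get("line-number")),
                       int(commit.get("revision")),
                       commit.findtext("author")))
    return result

print(blame_authors(SAMPLE_XML.encode("utf-8")))
```

In a real script the bytes would come from something like `subprocess.run(["svn", "blame", "--xml", path], capture_output=True, check=True).stdout`. The author arrives untruncated, regardless of column width or spaces in the name.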
The output of 'svn blame' is a good example of the
trade-offs this implies. The width of the blame info column
is fixed at the expense of truncating the username – people
can interpret that well enough – so that the file contents'
format is preserved.
A script can't presume to find the username at a specific
column offset, nor even in a specific space-separated field
because usernames can contain spaces. A human, however, can
expect a fixed visual column width.
Now, what "fixed-width" means when it comes to general
Unicode is an interesting question that I can't answer.
In most scripts I'm familiar with (Latin, Cyrillic, modern
Greek, etc.) this seems like a simple question. But what
about Arabic and Hebrew, for example? They're right-to-left
and AFAIK we don't deal with that. Georgian? Thai?
Sinhalese? I have no clue at all.
I think this is what utf8proc_charwidth() tries to accomplish.
But then there is the utf8proc_charwidth_ambiguous() function: "Given
a codepoint, return whether it has East Asian width class A
(Ambiguous). Codepoints with this property are considered to
have charwidth 1 (if they are printable) but some East Asian
fonts render them as double width."
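To illustrate the "Ambiguous" class without pulling in utf8proc, here is a small Python sketch using the standard `unicodedata` module, which exposes the same East Asian Width property. The function names and the `ambiguous_is_wide` switch are hypothetical; they only show that the width of such a code point is a policy decision.

```python
import unicodedata

def is_ambiguous_width(ch):
    """True if the code point has East Asian Width class A (Ambiguous)."""
    return unicodedata.east_asian_width(ch) == "A"

def width_with_policy(ch, ambiguous_is_wide=False):
    """Cells for one code point, with a caller-chosen policy for class A."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and ambiguous_is_wide:
        return 2
    return 1

for ch in ("a", "α", "中"):
    print(ch, is_ambiguous_width(ch),
          width_with_policy(ch), width_with_policy(ch, ambiguous_is_wide=True))
```

The Greek letter α is one such ambiguous code point: it counts as 1 cell by default, but as 2 if we decide to follow the East Asian font convention.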
RTL would probably require some really significant changes. And
if someone combines Arabic and English text in the same commit
message...
Even the "simple" scripts are tricky. For example, chopping
off a ъ or ь will change the sound of the preceding
consonant but I don't know if it could also change the
assumed (by humans) meaning of the word...
TL;DR: text is HARD. We'll never get it right, so we
shouldn't try too hard. Our goal should be to make the
visual width calculation good enough for most cases and to
not split multi-codepoint glyphs.
That includes Ä by the way, it can be encoded as one or two
codepoints.
Unless I'm mistaken (and I have only read the docs, not
experimented with it), utf8proc_map should be able to combine
all codepoints that have a composite codepoint. Maybe that would
help to avoid splitting?
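The one-or-two-codepoints point is easy to check with Python's `unicodedata.normalize`, used here as a stand-in for composing with utf8proc (I have not verified how utf8proc_map behaves in detail):

```python
import unicodedata

precomposed = "\u00c4"    # Ä as a single code point
decomposed = "A\u0308"    # A followed by COMBINING DIAERESIS

# (code points, UTF-8 bytes) for each spelling
print(len(precomposed), len(precomposed.encode("utf-8")))  # 1 2
print(len(decomposed), len(decomposed.encode("utf-8")))    # 2 3

# NFC composes the pair into the single code point
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)  # True
```

Both spellings render as one glyph in one cell, but they differ in both code point count and byte count, so neither count is a safe proxy for visual width.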
Välkommen till Unicode.
-- Brane
Thanks! It seems to be a nice rabbit hole.
Cheers,
/Daniel
I think, but I'm not sure, we already normalize to NFC in some
places – that's the normalisation form that pre-composes all the
characters it can and leaves the remaining combining marks in a
well-defined order. If we don't, we could do that – carefully and
only in certain places on output.
Utf8proc gives us all the tools we need for that.
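A rough Python sketch of "normalize, then never split a base character from its combining marks". It uses `unicodedata` in place of utf8proc and is not full grapheme cluster segmentation (emoji sequences, Hangul jamo and the like are not handled); `clusters` and `truncate_clusters` are hypothetical names.

```python
import unicodedata

def clusters(text):
    """Yield each base character together with the combining marks after it."""
    group = ""
    for ch in unicodedata.normalize("NFC", text):
        # A non-combining character starts a new group
        if group and unicodedata.combining(ch) == 0:
            yield group
            group = ""
        group += ch
    if group:
        yield group

def truncate_clusters(text, limit=10):
    """Keep at most `limit` groups, so no combining mark is cut off its base."""
    return "".join(list(clusters(text))[:limit])

# A + combining diaeresis composes to Ä under NFC
print(truncate_clusters("A\u0308bc", 2) == "\u00c4b")     # True
# q + combining diaeresis has no precomposed form, so it stays two code points
print(truncate_clusters("q\u0308abc", 2) == "q\u0308a")   # True
```

The second case is the one NFC alone does not solve: the remaining combining mark still has to travel with its base character when truncating.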