On 4. 6. 2026 17:29, Timofei Zhakov wrote:
On Thu, Jun 4, 2026 at 3:22 PM Branko Čibej <[email protected]> wrote:
On 4. 6. 2026 13:48, Ivan Zhakov wrote:
Hi,
I was reviewing r1931334 nomination and noticed that
svn__adler32() uses apr_off_t for buffer size. The apr_off_t
stands for an offset in a file. It can be larger or even smaller
than maximum addressable size (apr_size_t).
The diff code uses off_t because it deals with file sizes. The
svn_adler32 wrapper probably inherited that. According to POSIX.1
[1] there is no way that off_t can be smaller than size_t but it
can certainly be larger. It's perfectly normal to have a system
with 32-bit size_t and 64-bit off_t, so in that respect, your
proposed change is a regression.
The other major change here is that off_t is signed but size_t is
unsigned.
Taken together, the patch looks OK but -1 to the concept. I do not
see any improvement. The assumption that's the motivation for the
patch, that off_t can be smaller than size_t, is incorrect.
-- Brane
I have prepared patch that fixes this. Patch changes the diff
code which very complicated, so review will be much appreciated.
Log message:
[[[
Use apr_size_t instead of apr_off_t for buffer size in
svn__adler32().
The apr_off_t stands for an offset in a file. It can be larger or
even smaller
than maximum addressable size (apr_size_t).
Update all callers.
* subversion/include/private/svn_adler32.h
* subversion/libsvn_subr/adler32.c
(svn__adler32): Use apr_size_t instead of apr_off_t for SIZE
argument.
* subversion/libsvn_diff/diff.h
* subversion/libsvn_diff/util.c
(svn_diff__normalize_buffer): Use apr_size_t instead of
apr_off_t for
LENGTGP argument.
LENGTHP?
* subversion/libsvn_diff/diff_file.c
(datasource_get_next_token): Use apr_size_t instead of
apr_off_t for
local variable LENGTH.
(token_compare): Rewrite a code a bit to make it clear that size is
apr_size_t.
* subversion/libsvn_diff/diff_memory.c
(datasource_get_next_token, token_compare): Use apr_size_t
instead of
apr_off_t for local variables to match source type.
]]]
Looking at the patch itself:
I don't see the need to complicate things by introducing a
'read_length' variable in subversion/libsvn_diff/diff_file.c. It's
not really an optimisation and to my eyes it makes the code less
clear, not more.
-- Brane
[1] see attached image.
I would argue and support this change. We use size_t whenever we deal
with memory (including buffers, offsets in for loops, and anything of
this kind). off_t is only applicable for files in the context of a
file system. I'm pretty sure that most systems that work on 32 bit
architectures (if there are still any)
There are many.
allow for file sizes more than the amount of addressable space. So
these types could be different and size_t is IMO more correct.
The major change is in libsvn_diff/diff_file.c which deals with *files*.
Addressable space is not the issue here. Just by the way, there are
architectures where size_t is less than the addressable space and
smaller than ptrdiff_t.
Furthermore, the patch doesn't even touch the structures at the top of
the file that define the tokens that the diff algorithm deals with and,
surprise, use apr_off_t everywhere. They also explicitly contain
apr_file_t* which should be a hint that something is indeed dealing with
files and file offsets. At the very least, on systems where off_t and
size_t are the same size, the patch introduces a silent signed/unsigned
integer promotion, which may not be safe.
/* A token, i.e. a line read from a file. */
typedef struct svn_diff__file_token_t
{
/* Next token in free list. */
struct svn_diff__file_token_t *next;
svn_diff_datasource_e datasource;
/* Offset in the datasource. */
apr_off_t offset;
/* Offset of the normalized token (may skip leading whitespace) */
apr_off_t norm_offset;
/* Total length - before normalization. */
apr_off_t raw_length;
/* Total length - after normalization. */
apr_off_t length;
} svn_diff__file_token_t;
typedef struct svn_diff__file_baton_t
{
const svn_diff_file_options_t *options;
struct file_info {
const char *path; /* path to this file, absolute or relative to CWD */
/* All the following fields are active while this datasource is open */
apr_file_t *file; /* handle of this file */
apr_off_t size; /* total raw size in bytes of this file */
/* The current chunk: CHUNK_SIZE bytes except for the last chunk. */
int chunk; /* the current chunk number, zero-based */
char *buffer; /* a buffer containing the current chunk */
char *curp; /* current position in the current chunk */
char *endp; /* next memory address after the current chunk */
svn_diff__normalize_state_t normalize_state;
/* Where the identical suffix starts in this datasource */
int suffix_start_chunk;
apr_off_t suffix_offset_in_chunk;
} files[4];
/* List of free tokens that may be reused. */
svn_diff__file_token_t *tokens;
apr_pool_t *pool;
} svn_diff__file_baton_t;
Okay this is a situation where a smaller type could be converted to a
larger one which will technically never be an issue but it doesn't
mean it's correct...
Which smaller type converted to a larger type are we talking about? If
anything, the patch goes the other way.
Anyway: the most important questions to answer when discussing a change
are: 1) Is there a bug that needs fixing? 2) Is this change an
improvement? and 3) Does the change introduce a possible regression?
The answers I see are NO, NO and YES, respectively. Hence, -1.
I have trouble understanding this penchant for changing code that has
worked just fine since before there was an Apache in Subversion, and for
no valid reason that I can see. If someone can prove that the current
code has a bug because it uses apr_off_t, that would completely change
the context of this discussion.
-- Brane