On Mon, Jul 09, 2012 at 02:47:25PM +0200, Bert Huijben wrote:
> How do you check if the file you are merging is valid utf-8?

See the merge_chunks() function.

We convert data to UTF-8 from the native (locale) encoding.
This cannot fail (every encoding can be represented in UTF-8)
but the result might look funny in case the file uses some other encoding
than the native one. But that's OK -- this conversion happens only for
display purposes, data in the actual file is never changed, so you can
still edit individual chunks in their original form.

> I assumed that we currently just passed files to the console mostly 
> unmodified to allow the terminal to do the hard work.

That works fine as long as you don't care about the width of the
line you're printing. 

For the side-by-side display we make an effort to make it look nice,
which means computing the display width of each line. If that fails,
we use a fallback mode which assumes width=1 and one-byte-per-character
for all characters. In that mode the side-by-side display might look
strange because lines appear with varying lengths.
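
To illustrate the fallback, here is a rough sketch in Python. It is not
the Subversion code: columns() and side_by_side() are made-up names, and
the width rule (2 columns for East Asian wide/fullwidth characters, 0 for
combining marks, 1 otherwise) is only an approximation of what a wcwidth
implementation does.

```python
import unicodedata


def columns(data: bytes) -> int:
    """Estimate how many terminal columns `data` occupies (hypothetical helper)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback mode: assume one byte per character and width=1.
        return len(data)
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks take no column of their own
        # Wide and fullwidth characters take two columns (approximation).
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def side_by_side(left: bytes, right: bytes, col_width: int) -> bytes:
    """Pad `left` to `col_width` columns and append `right` (hypothetical helper)."""
    pad = max(col_width - columns(left), 0)
    return left + b" " * pad + b" | " + right
```

When the width is known, the separator lines up across rows. In the
fallback case a multi-byte character counts as several columns, so the
left side gets too little padding and the rows look ragged.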
 
> I'm pretty sure that we can assume at least many (if not most) text files 
> stored in Subversion are *not* utf-8 and will fail when tested for utf-8 
> validness.
>
> How does this library handle non-utf8 strings?

You mean the svn_utf_cstring_utf8_width() function? It will return
an error for invalid UTF-8.

In our usage of this API, the UTF-8 validity check is performed
on data that the merge tool has converted to UTF-8. The API must fail for
invalid UTF-8 input since it cannot convert such input to UTF-32 in
order to run mk_wcwidth() on it.
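
As a sketch of why invalid input has to be an error, here is the same
idea in Python. This is not svn_utf_cstring_utf8_width() itself:
utf8_display_width() is a made-up name, and the per-character width rule
is a simplified stand-in for mk_wcwidth() (for instance, it does not
treat control characters specially).

```python
import unicodedata


def utf8_display_width(data: bytes) -> int:
    """Return the column width of UTF-8 `data` (hypothetical helper).

    Raises UnicodeDecodeError if `data` is not valid UTF-8.
    """
    # Strict decoding: there are no code points to measure if this fails.
    text = data.decode("utf-8")
    width = 0
    for code_point in (ord(ch) for ch in text):
        ch = chr(code_point)
        if unicodedata.combining(ch):
            continue  # zero-width combining mark
        # Simplified rule: wide/fullwidth = 2 columns, everything else = 1.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

The width is defined per code point, so the decoding step has to succeed
before any width can be computed. A caller that wants to display the data
anyway has to catch the error and choose its own fallback.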

Again, this is in-memory data which we're going to display to the user
in a formatted way, so we need to know its width.
None of this has anything to do with any versioned data in files.
