On Mon, Jul 09, 2012 at 02:47:25PM +0200, Bert Huijben wrote:
> How do you check if the file you are merging is valid utf-8?

See the merge_chunks() function. We convert data to UTF-8 from the
native (locale) encoding. This cannot fail (every encoding can be
represented in UTF-8) but the result might look funny in case the file
uses an encoding other than the native one. But that's OK -- this
conversion happens only for display purposes, data in the actual file
is never changed, so you can still edit individual chunks in their
original form.

> I assumed that we currently just passed files to the console mostly
> unmodified to allow the terminal to do the hard work.

That works fine as long as you don't care about the width of the line
you're printing. For the side-by-side display we make an effort to
make it look nice. If that doesn't work, we fall back to a mode which
assumes width=1 and one byte per character for all characters. In that
mode the side-by-side display might look strange because lines appear
with varying lengths.

> I'm pretty sure that we can assume at least many (if not most) text files
> stored in Subversion are *not* utf-8 and will fail when tested for utf-8
> validness.
>
> How does this library handle non-utf8 strings?

You mean the svn_utf_cstring_utf8_width() function?
It will return an error for invalid UTF-8.

In our usage of this API, the UTF-8 validness check is performed on
data that the merge tool has already converted to UTF-8. The API must
fail for invalid UTF-8 input since it cannot convert such input to
UTF-32 in order to run mk_wcwidth() on it.

Again, this is in-memory data which we're going to display to the user
in a formatted way, so we need to know its width. None of this has
anything to do with any versioned data in files.
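
To make the width calculation and its fallback concrete, here is a minimal sketch. It is not the code in the merge tool: utf8_width_or_error() and column_width() are hypothetical names made up for this mail, and the sketch counts every code point as one column, where the real code runs mk_wcwidth() on the decoded characters.

```c
#include <string.h>

/* Hypothetical helper, not Subversion's code: count the code points in a
 * NUL-terminated string, or return -1 if it is not valid UTF-8.
 * Simplification: every code point counts as one column; a real
 * implementation would look up each one in a wcwidth()-style table. */
int utf8_width_or_error(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;
  int width = 0;

  while (*p)
    {
      unsigned long cp;
      int extra;  /* number of continuation bytes expected */
      int i;

      if (*p < 0x80)                { cp = *p;        extra = 0; }
      else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
      else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
      else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
      else
        return -1;  /* stray continuation byte or invalid lead byte */
      p++;

      for (i = 0; i < extra; i++, p++)
        {
          if ((*p & 0xC0) != 0x80)
            return -1;  /* truncated or malformed sequence */
          cp = (cp << 6) | (*p & 0x3F);
        }

      /* Reject overlong forms, surrogates and out-of-range values. */
      if ((extra == 1 && cp < 0x80) || (extra == 2 && cp < 0x800)
          || (extra == 3 && cp < 0x10000) || cp > 0x10FFFF
          || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

      width++;
    }
  return width;
}

/* Width used to pad a side-by-side column: the decoded width when the
 * string is valid UTF-8, otherwise one column per byte (the fallback). */
int column_width(const char *s)
{
  int w = utf8_width_or_error(s);
  return w >= 0 ? w : (int)strlen(s);
}
```

With this sketch, a line that decodes cleanly is padded by its character count, while a line that is not valid UTF-8 is padded by its byte count, which is where the uneven columns in the fallback mode come from.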