On Mon, Jul 09, 2012 at 04:04:42PM +0200, Johan Corveleyn wrote:
> On Mon, Jul 9, 2012 at 3:30 PM, Stefan Sperling <s...@apache.org> wrote:
> > On Mon, Jul 09, 2012 at 02:47:25PM +0200, Bert Huijben wrote:
> >> How do you check if the file you are merging is valid utf-8?
> >
> > See the merge_chunks() function.
> >
> > We convert data to UTF-8 from the native (locale) encoding.
> > This cannot fail (every encoding can be represented in UTF-8)
> > but the result might look funny in case the file uses some other encoding
> > than the native one. But that's OK -- this conversion happens only for
> > display purposes, data in the actual file is never changed, so you can
> > still edit individual chunks in their original form.
>
> I'm a bit confused (encoding issues always confuse me). If we only
> care about the width of the string for display purposes, doesn't this
> (also) depend on the encoding used by the console / terminal? How does
> that actually work: if you have a UTF-8 encoded file, and you 'cat' it
> to a terminal with LC_ALL=iso_8859_1 ... ?

Our cmdline output routines accept UTF-8 and try to convert it back to the
locale's native encoding before printing. If this conversion fails, they
fall back to svn_cmdline_cstring_from_utf8_fuzzy(), which creates an
ASCII representation of the data.

So what will happen in that case is that you'll see whatever Unicode
characters latin1 can represent as-is, while the others are converted in a
fuzzy way. This might lead to misaligned side-by-side diff output.
However, if you're trying to display Unicode data on a terminal that isn't
Unicode-capable, then such issues are the norm rather than the exception.

In general, if your terminal can display your files, then the side-by-side
diff will also be shown properly. Otherwise, the side-by-side diff might
look OK, or it might not, depending on how much longer the "fuzzy"
representation of the string really is.

Configure your locale properly and you won't have an issue.
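
To illustrate why the fuzzy fallback can break column alignment, here is a
rough Python sketch. It is not Subversion's code: the helper name
fuzzy_for_terminal() is made up, and the "?\NNN" escape format (one escape
per UTF-8 byte) is an assumption chosen for illustration, which may differ
from what svn_cmdline_cstring_from_utf8_fuzzy() actually emits. It also
uses len() as a stand-in for display width, which is only accurate for
simple, single-column characters.

```python
def fuzzy_for_terminal(text, encoding):
    """Return roughly what a terminal using `encoding` would be sent.

    Characters the target encoding can represent are kept as-is.
    Anything else is replaced by an ASCII escape of its UTF-8 bytes.
    The "?\\NNN" escape format is an assumption made for this sketch.
    """
    out = []
    for ch in text:
        try:
            ch.encode(encoding)          # representable in the locale?
            out.append(ch)
        except UnicodeEncodeError:
            # Hypothetical fuzzy form: one "?\NNN" escape per UTF-8 byte.
            out.append("".join("?\\%03d" % b for b in ch.encode("utf-8")))
    return "".join(out)


if __name__ == "__main__":
    for sample in ("café", "a€b"):
        shown = fuzzy_for_terminal(sample, "latin-1")
        # The length difference is what throws off a fixed-width column.
        print(repr(sample), len(sample), "->", repr(shown), len(shown))
```

With a latin1 locale, "café" passes through unchanged, but the euro sign in
"a€b" is not representable and expands into several ASCII characters, so a
column that was sized for three characters no longer lines up.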