Troy Curtis Jr wrote on Thu, Oct 19, 2017 at 04:05:57 +0000:
> The places where we'd expect this to actually need to happen are less on
> the Python API side (since it should be str in and out), but more on the
> binding's usage of the underlying Subversion API. All the 'char *' will be
> considered 'bytes' by py3, and need to be decoded to str (unicode), before
> it is given to the Python API so that the user gets the expected str
> objects.

Ah! So you're thinking of data going from libsvn_* to user Python code, not
the other way around. Now it makes sense. (I was thinking of user data
passed in to the bindings.)

In this case, I think the rule is: if it's NUL terminated, then it's in UTF-8
(except for some isolated exceptions such as svn_cmdline_*); if it's
a counted-length string, then it's bytes. For example, a property hash is an
apr_hash_t* mapping const char* to const svn_string_t*, corresponding to the
data model where property names are UTF-8 strings and property values are
opaque binary blobs.

This is supposed to be explicitly documented, by the way. For example,
svn_path.h states
.
     * All incoming and outgoing paths are non-NULL and in UTF-8, unless
     * otherwise documented.
.
but, apparently, the newer svn_dirent_uri.h doesn't have such a statement.

> So in general I think going with utf8 is the way to go. However, there a
> few places I plan on looking carefully at:
> 1. Anywhere raw data is "streamed in": I'm not sure if this exists in the
> API or not, I am assuming it does somewhere to manually feed data into the
> library to be used as content of a commit.

Functions that add data, at various layers, are:

- svn_client_add5()
- svn_delta_editor_t
- svn_repos_load_fs6()
- svn_fs_make_file()

Paths in the repository are always in UTF-8. File contents are treated as
opaque binary blobs and are generally presented as streams (either
svn_stream_t or an svndiff/txdelta stream).

> 2. Filesystem paths: The general case is there are separate functions for
> dealing with "the filesystem encoding" whichever it happens to be [1]. So
> perhaps one concrete question is are paths provided to the API and
> elsewhere within Subversion assumed to be UTF8?

Subversion's functions generally take UTF-8, but there are exceptions, such as
*_canonicalize() and *_internal_style(). They're generally implemented in
terms of apr_* functions, which expect a different encoding (see e.g.
svn_io_check_file()).

Cheers,

Daniel
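
P.S. Here is a rough sketch of how the NUL-terminated vs. counted-length rule
might look on the Python side. It is only an illustration: the helper names
(decode_cstring, keep_counted, convert_proplist) are made up and are not part
of the existing bindings, and a plain dict of bytes stands in for a property
hash that has already been pulled out of the apr_hash_t*.

```python
def decode_cstring(raw):
    """Hypothetical helper: a NUL-terminated 'const char *' from libsvn_*
    is assumed to be UTF-8, so it becomes a str."""
    if raw is None:
        return None
    return raw.decode("utf-8")


def keep_counted(raw):
    """Hypothetical helper: counted-length data (svn_string_t) is opaque,
    so it stays bytes and is never decoded."""
    return bytes(raw)


def convert_proplist(raw_props):
    """Turn a {name: value} mapping of raw bytes into {str: bytes},
    matching 'property names are UTF-8, values are binary blobs'."""
    return {
        decode_cstring(name): keep_counted(value)
        for name, value in raw_props.items()
    }


# Stand-in for data coming out of the C layer (an assumption, not real output).
raw = {
    b"svn:mime-type": b"image/png",
    b"caf\xc3\xa9": b"\xff\xfe\x00",  # UTF-8 name, value that is not valid UTF-8
}
props = convert_proplist(raw)
```

The value of the second property would fail to decode as UTF-8, which is
exactly why counted-length strings should be handed to the user as bytes.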