On 2020/12/15 5:06, Stefan Sperling wrote: > On Tue, Dec 15, 2020 at 03:51:03AM +0900, Yasuhito FUTATSUKI wrote: >> First of all, thank you for doing this. I worried that mailer.py didn't >> support Python 3 for a long time. >> >> In message <20201214165710.deb3517b...@svn01-us-east.apache.org>, >> stsp@apache.o >> rg writes: >>> Author: stsp >>> Date: Mon Dec 14 16:57:10 2020 >>> New Revision: 1884427 >>> >>> URL: http://svn.apache.org/viewvc?rev=1884427&view=rev >>> Log: >>> Make mailer.py work properly with Python 3, and drop Python 2 support. >>> >>> Most of the changes deal with the handling binary data vs Python strings. >>> >>> I've made sure that mailer.py will work in a UTF-8 environment. In general, >>> UTF-8 is recommended for hook scripts. See the SVNUseUTF8 mod_dav_svn >>> option. >>> Environments using other encodings may not work as expected, but those will >>> be problematic for hook scripts in general. >> >> Perhaps the problem on locale other than UTF-8 is only on iterpretation of >> REPOS-PATH argument. Other paths in the script are all Subversion's >> internal paths. > > Yes, I agree that the REPOS-PATH argument is an exceptional case. > Unforunately, we have to decode repos_dir because it is used with the > os.path.join() function.
As far as I read the code again, it seems this is already support locales other than C/C.UTF-8 if mailer.py is kicked with appropriate LC_CTYPE environment. The REPOS-PATH argment comes from sys.argv[2] already decoded in file system encoding (i.e. encoding corresponding to LC_CTYPE) and Python encodes it if access to file system. On the other hand, svn.repos.open() wrapper function accepts str and encode it to UTF-8 string, as the svn_repos_open() C API expects. Then I overlooked sys.argv[x] and cmd_args[x] in main() are already str. Perhaps it needs (although the last hunk is not nessesary): [[[ Index: tools/hook-scripts/mailer/mailer.py =================================================================== --- tools/hook-scripts/mailer/mailer.py (revision 1884434) +++ tools/hook-scripts/mailer/mailer.py (working copy) @@ -88,10 +88,10 @@ messenger = Commit(pool, cfg, repos) elif cmd == 'propchange' or cmd == 'propchange2': revision = int(cmd_args[0]) - author = cmd_args[1].decode('utf-8') - propname = cmd_args[2].decode('utf-8') + author = cmd_args[1] + propname = cmd_args[2] if cmd == 'propchange2' and cmd_args[3]: - action = cmd_args[3].decode('utf-8') + action = cmd_args[3] else: action = 'A' repos = Repository(repos_dir, revision, pool) @@ -103,7 +103,7 @@ }) messenger = PropChange(pool, cfg, repos, author, propname, action) elif cmd == 'lock' or cmd == 'unlock': - author = cmd_args[0].decode('utf-8') + author = cmd_args[0] repos = Repository(repos_dir, 0, pool) ### any old revision will do # Override the repos revision author with the author of the lock/unlock repos.author = author @@ -1510,7 +1510,7 @@ usage() cmd = sys.argv[1] - repos_dir = svn.core.svn_path_canonicalize(sys.argv[2]) + repos_dir = svn.core.svn_path_canonicalize(sys.argv[2].encode('utf-8')) repos_dir = repos_dir.decode('utf-8') try: expected_args = cmd_list[cmd] ]]] (Also, I found svn.repos.open() and svn.core.svn_path_canonicalize() already accepted str in commited code :)) > There is no way of knowing which character encoding is used for paths. > At least on UNIX a filename is just a string of bytes and it is independent > of the locale. > > Note that hook scripts will run in an empty environment by default. > Which means this script will run in the "C" locale by default. > If another locale is desired, the administrator needs to configure the > server as documented here: > https://subversion.apache.org/docs/release-notes/1.8.html#hooks-env > > My assumption is that for hook scripts like this one, the server will be > configured with SVNUseUTF8 and the hook script environment will contain > something like LANG=C.UTF-8. > > If svnadmin is then also used in the UTF-8 locale to create repositories, > the hook script should work fine. I assumed that hook scripts kicked by file:// or svn+ssh:// scheme. I think even if the repos path is neither UTF-8 nor ascii but appropriate encoding string, hook scripts themselves know it and can invoke mailer.py script with appropriate LC_CTYPE by using env(1) utility or directly setting environment variable within hook scripts themselves. Of course, we can easily move or make symlink of repository path from non UTF-8 path to UTF-8 or ascii path, so I don't insist that we should support non UTF-8 path. >>> data such as paths in UTF-8. Our Python3 bindings do not deal with encoding >>> or decoding of such data, and thus need to work with raw UTF-8 strings, not >>> Python strings. >> >> I have no objection the code itself, however it is a bit incorrect. >> >> Our Python3 bindings accept str objects as inputs for char *. >> ^ as well as bytes objects If str objects are passed, v >> It always >> convert them to UTF-8 C strings on all APIs without regarding preferred >> encoding or file system encoding on Python. As some of APIs accept >> local paths and they should be encoded as locale's charset/encoding, >> I also prefer to encode to bytes on caller side explicitly before call >> APIs, to avoid making bugs. Of course, return values of wrapper APIs >> corresponding char ** args and return value of C API are not decoded, >> raw bytes objects. > > Thank you, I did not know this. The patch below undoes changes which I > made in order to encode input to SVN APIs as UTF-8. > > The mailer.py test is still passing fine with this change. Yes, however as I wrote above I prefer not to revert them. I also found some bytes/str mixture still left around return values for each call of Repository.get_rev_prop(), however I just found, but not tested. So I continue to look over the code and I'll try to fix the problems if they still exist. Cheers, -- Yasuhito FUTATSUKI <futat...@yf.bsclub.org>