[ trimming CC ] Danny Trebbien wrote on Sun, Dec 26, 2010 at 15:39:05 -0800: > Attached are version 3 of the patch and corresponding log message. > Also attached is a TGZ archive of two Subversion repository dumps that > are used by a new test case.
> [[[ > Add a command line option (--source-encoding) to the svnsync init, sync, and > copy-revprops subcommands that allows the user to specify the character > encoding of translatable properties from the source repository. This is needed > to allow svnsync to sync some older Subversion repositories that have > properties that were not encoded in UTF-8. > > As discussed at: > http://thread.gmane.org/gmane.comp.version-control.subversion.user/100020 > http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122518 > http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122550 > > Around half of the changes are to rename the "normalized count" variables so > that it is more clear that the counters only count line ending normalizations > and not re-encoding normalizations. The other half of the changes are cosmetic > (adding comments or trimming trailing whitespace in lines) or exist to pass > the > argument to --source-encoding through to the functions that need it (mainly > normalize_string() in subversion/svnsync/sync.c). > +1. Refresh my memory please: did we decide in some thread that the recodings should be counted too? > * subversion/svnsync/main.c > (svnsync__opt) Add svnsync_opt_source_encoding. > (svnsync_cmd_table): Add svnsync_opt_source_encoding to the list of > acceptable options for the init, sync, and copy-revprops subcommands. > (svnsync_options): Add a description of the --source-encoding option. > (opt_baton_t, subcommand_baton_t): Add the SOURCE_ENCODING field. > (log_properties_normalized): Rename the NORMALIZED_REV_PROPS_COUNT parameter > to LE_NORMALIZED_REV_PROPS_COUNT and the NORMALIZED_NODE_PROPS_COUNT > parameter to LE_NORMALIZED_NODE_PROPS_COUNT. > (copy_revprops): Add the PROP_ENCODING parameter. Rename the > NORMALIZED_COUNT > parameter to LE_NORMALIZED_COUNT. > (make_subcommand_baton): Set the SOURCE_ENCODING field of the resulting > subcommand_baton_t object to the value of SOURCE_ENCODING from the > opt_baton_t object. > (do_initialize, do_synchronize, do_copy_revprops, replay_rev_started, > replay_rev_finished): Pass SOURCE_ENCODING to svnsync_* functions and > copy_revprops(). > (replay_baton_t): Rename the NORMALIZED_REV_PROPS_COUNT field to > LE_NORMALIZED_REV_PROPS_COUNT and the NORMALIZED_NODE_PROPS_COUNT field to > LE_NORMALIZED_NODE_PROPS_COUNT. > (main): Handle the case when the command line option is --source-encoding. > Set the SOURCE_ENCODING field of the opt_baton_t object to either OPT_ARG > or NULL. > > * subversion/svnsync/sync.c > (normalize_string): Rename WAS_NORMALIZED to WAS_LE_NORMALIZED to make clear > that the counter only counts line ending normalizations. Add the ENCODING > parameter. Handle the case when ENCODING is not NULL by calling > svn_subst_translate_string2(). Switch to the "two pools" (result & scratch > pools) pattern. Use memchr() instead of strchr() because the length of > (*STR)->data is known. > (svnsync_normalize_revprops): Rename NORMALIZED_COUNT to > NORMALIZED_LE_COUNT. Add the ENCODING parameter. Move the call to > apr_hash_set() outside of the if block (if (le_normalized)), as the > property value may have been changed because of re-encoding even when no > line ending was normalized. The sentence "Move.*" is too much detail. It's probably fine to just omit it --- there is no functional change, and the details of the hunk need no introduction. > * subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.dump > A dump of a repository with the following features: > 1. The log message (`svn:log` revision property) of revision 1 has CRLF line > endings. > 2. The log message of revision 2 has CR line endings. > 3. Revision 3 introduces an `svn:ignore` node property with CRLF line > endings. > 4. Revision 4 introduces a custom node property, `x:related-to`, with CRLF > line endings. > I don't understand * What is the difference between copy_bad_line_endings() and copy_bad_line_endings2()? * How does copy_bad_line_endings2() relate to the subject --- which is non-UTF-8 log messages, as opposed to non-LF-only log messages? i.e., why aren't you adding a dump where the 'before' has an ISO-8859-1 property, and the 'after' has that property recoded to UTF-8? > * > subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.expected.dump > A dump of the expected result of using svnsync to sync > copy-bad-line-endings2.dump. > ]]] > Index: subversion/svnsync/sync.h > =================================================================== > --- subversion/svnsync/sync.h (revision 1052903) > +++ subversion/svnsync/sync.h (working copy) > @@ -51,19 +56,25 @@ svnsync_normalize_revprops(apr_hash_t *rev_props, > * the commit. TO_URL is the URL of the root of the repository into > * which the commit is being made. > * > + * If PROP_ENCODING is NULL, then property values are presumed to be encoded > + * in UTF-8 and are not re-encoded. Otherwise, the property values are > + * presumed to be encoded in PROP_ENCODING, and are normalized to UTF-8. > + * +1 > * As the sync editor encounters property values, it might see the need to > - * normalize them (to LF line endings). Each carried out normalization adds 1 > - * to the *NORMALIZED_NODE_PROPS_COUNTER (for notification). > + * normalize them (re-encode and/or change to LF line endings). Each > carried-out > + * line ending normalization adds 1 to the *LE_NORMALIZED_NODE_PROPS_COUNTER > + * (for notification). > */ > svn_error_t * > svnsync_get_sync_editor(const svn_delta_editor_t *wrapped_editor, > void *wrapped_edit_baton, > svn_revnum_t base_revision, > const char *to_url, > + const char *prop_encoding, > svn_boolean_t quiet, > const svn_delta_editor_t **editor, > void **edit_baton, > - int *normalized_node_props_counter, > + int *le_normalized_node_props_counter, I wonder: it might be a good idea to split those renames to their own patch, so as to not hide the needle (meat of the change) in a haystack of hunks. (but unlike the anonymous struct naming thing, this one isn't just a simple global search/replace, so I can't do it non-manually) > apr_pool_t *pool); > > > Index: subversion/svnsync/main.c > =================================================================== > --- subversion/svnsync/main.c (revision 1052903) > +++ subversion/svnsync/main.c (working copy) > @@ -105,8 +106,9 @@ static const svn_opt_subcommand_desc2_t svnsync_cm > "the destination repository by any method other than 'svnsync'.\n" > "In other words, the destination repository should be a read-only\n" > "mirror of the source repository.\n"), > - { SVNSYNC_OPTS_DEFAULT, 'q', svnsync_opt_allow_non_empty, > - svnsync_opt_disable_locking, svnsync_opt_steal_lock } }, > + { SVNSYNC_OPTS_DEFAULT, svnsync_opt_source_encoding, 'q', > + svnsync_opt_allow_non_empty, svnsync_opt_disable_locking, > + svnsync_opt_steal_lock } }, > { "synchronize", synchronize_cmd, { "sync" }, Why does 'svnsync init' need to take the new option? > N_("usage: svnsync synchronize DEST_URL [SOURCE_URL]\n" > "\n" > @@ -226,7 +234,7 @@ static const apr_getopt_option_t svnsync_options[] > { 0, 0, 0, 0 } > }; > > -typedef struct { > +typedef struct { /* opt_baton_t */ I fixed that globally in r1053996, thanks for the reminder. > svn_boolean_t non_interactive; > svn_boolean_t trust_server_cert; > svn_boolean_t no_auth_cache; > @@ -441,7 +453,7 @@ check_if_session_is_at_repos_root(svn_ra_session_t > * revision REV of the repository associated with RA session SESSION. > * > * For REV zero, don't remove properties with the "svn:sync-" prefix. > - * > + * No whitespace-only hunks in a patch that makes a functional change, please. Thank you. > * All allocations will be done in a subpool of POOL. > */ > static svn_error_t * > @@ -1707,6 +1729,7 @@ main(int argc, const char *argv[]) > const char *username = NULL, *source_username = NULL, *sync_username = > NULL; > const char *password = NULL, *source_password = NULL, *sync_password = > NULL; > apr_array_header_t *config_options = NULL; > + opt_baton.source_encoding = NULL; > apr_allocator_t *allocator; Oops, you can't have a declaration (of ALLOCATOR) after statement (non-declaration assignment) in C89. Also, we memset(&opt_baton, 0), so this NULL may not be necessary. (all 0-v-NULL discussions aside, I'd leave it in only if we init to NULL the other pointer members of the struct) > > if (svn_cmdline_init("svnsync", stderr) != EXIT_SUCCESS) > @@ -1831,6 +1854,10 @@ main(int argc, const char *argv[]) > return svn_cmdline_handle_exit_error(err, pool, "svnsync: "); > break; > > + case svnsync_opt_source_encoding: > + opt_baton.source_encoding = apr_pstrdup(pool, opt_arg); > + break; Why do you need to pstrdup() it? Doesn't it have the proper lifetime already? Why don't you call svn_utf_cstring_to_utf8(), like is done in almost every other use of opt_arg? (Yes, ironic, isn't it.) > + > case svnsync_opt_disable_locking: > opt_baton.disable_locking = TRUE; > break; > Index: subversion/svnsync/sync.c > =================================================================== > --- subversion/svnsync/sync.c (revision 1052903) > +++ subversion/svnsync/sync.c (working copy) > @@ -45,56 +45,72 @@ > #include <apr_uuid.h> > > > -/* Normalize the line ending style of *STR, so that it contains only > - * LF (\n) line endings. After return, *STR may point at a new > - * svn_string_t* allocated from POOL. > +/* Normalize the encoding and line ending style of *STR, so that it contains > + * only LF (\n) line endings and is encoded in UTF-8. After return, *STR may > + * point at a new svn_string_t* allocated in RESULT_POOL. > * > + * If ENCODING is NULL, then *STR is presumed to be encoded in UTF-8. > + * > - * *WAS_NORMALIZED is set to TRUE when *STR needed to be normalized, > - * and to FALSE if *STR remains unchanged. > + * *WAS_LE_NORMALIZED is set to TRUE when *STR needed line ending > normalization. > + * You lost the "... and set to FALSE" otherwise bit. It's important --- the semantics of the contract are different without that statement --- so please restore it. > + * SCRATCH_POOL is used for temporary allocations. > */ > static svn_error_t * > normalize_string(const svn_string_t **str, > - svn_boolean_t *was_normalized, > - apr_pool_t *pool) > + svn_boolean_t *was_le_normalized, > + const char *encoding, > + apr_pool_t *result_pool, > + apr_pool_t *scratch_pool) > { > - *was_normalized = FALSE; > + *was_le_normalized = FALSE; > > if (*str == NULL) > return SVN_NO_ERROR; > > SVN_ERR_ASSERT((*str)->data != NULL); > > - /* Detect inconsistent line ending style simply by looking > - for carriage return (\r) characters. */ > - if (strchr((*str)->data, '\r') != NULL) > + if (encoding) > { > + svn_string_t *new_str = NULL; > + SVN_ERR(svn_subst_translate_string2(&new_str, NULL, was_le_normalized, > + *str, encoding, result_pool, > + scratch_pool)); > + *str = new_str; > + } > + > + /* Only detect inconsistent line ending style. This is accomplished by > simply > + looking for carriage return (\r) characters. */ > + else if (memchr((*str)->data, (unsigned char) '\r', (*str)->len) != NULL) I see the reason for the s/strchr/memchr/ in the log message: because the length is known. I suppose the underlying motivation is to make the condition slightly faster? We repeat that strchr() in a couple of other places --- svn_repos__validate_prop() is one of them --- so if you make the change here, you probably need to change there too. > + { > /* Found some. Normalize. */ > const char* cstring = NULL; > SVN_ERR(svn_subst_translate_cstring2((*str)->data, &cstring, > "\n", TRUE, > NULL, FALSE, > - pool)); Should this call be to svn_subst_translate_string2()? ^^ > - *str = svn_string_create(cstring, pool); > - *was_normalized = TRUE; > + scratch_pool)); > + *str = svn_string_create(cstring, result_pool); > + *was_le_normalized = TRUE; > } > > return SVN_NO_ERROR; > } > > > -/* Normalize the line ending style of the values of properties in REV_PROPS > - * that "need translation" (according to svn_prop_needs_translation(), > - * currently all svn:* props) so that they contain only LF (\n) line endings. > - * The number of properties that needed normalization is returned in > - * *NORMALIZED_COUNT. > +/* Normalize the encoding and line ending style of the values of properties > + * in REV_PROPS that "need translation" (according to > + * svn_prop_needs_translation(), which is currently all svn:* props) so that > + * they are encoded in UTF-8 and contain only LF (\n) line endings. > + * The number of properties that needed line ending normalization is > returned in > + * *NORMALIZED_LE_COUNT. No re-encoding is performed if ENCODING is NULL. > */ > svn_error_t * > svnsync_normalize_revprops(apr_hash_t *rev_props, > - int *normalized_count, > + int *normalized_le_count, > + const char *encoding, > apr_pool_t *pool) > { > apr_hash_index_t *hi; > - *normalized_count = 0; > + *normalized_le_count = 0; > > for (hi = apr_hash_first(pool, rev_props); > hi; > @@ -105,15 +121,15 @@ svnsync_normalize_revprops(apr_hash_t *rev_props, > > if (svn_prop_needs_translation(propname)) > { > - svn_boolean_t was_normalized; > - SVN_ERR(normalize_string(&propval, &was_normalized, pool)); > - if (was_normalized) > - { > - /* Replace the existing prop value. */ > - apr_hash_set(rev_props, propname, APR_HASH_KEY_STRING, > propval); > - /* And count this. */ > - (*normalized_count)++; > - } > + svn_boolean_t was_le_normalized; > + SVN_ERR(normalize_string(&propval, &was_le_normalized, encoding, > pool, > + pool)); > + > + /* Replace the existing prop value. */ > + apr_hash_set(rev_props, propname, APR_HASH_KEY_STRING, propval); > + > + if (was_le_normalized) > + (*normalized_le_count)++; /* Count it. */ > } > } > return SVN_NO_ERROR; > Index: subversion/tests/cmdline/svnsync_tests.py > =================================================================== > --- subversion/tests/cmdline/svnsync_tests.py (revision 1052903) > +++ subversion/tests/cmdline/svnsync_tests.py (working copy) > @@ -780,11 +780,17 @@ def info_not_synchronized(sbox): > #---------------------------------------------------------------------- > > def copy_bad_line_endings(sbox): > - "copy with inconsistent lineendings in svn:props" > + "copy with inconsistent line endings in svn:* props" Thanks, r1054320. > run_test(sbox, "copy-bad-line-endings.dump", > exp_dump_file_name="copy-bad-line-endings.expected.dump", > bypass_prop_validation=True) > > +def copy_bad_line_endings2(sbox): > + "copy with non-LF line endings in svn:* props" > + run_test(sbox, "copy-bad-line-endings2.dump", > + exp_dump_file_name="copy-bad-line-endings2.expected.dump", > + bypass_prop_validation=True) > + Most svnsync tests are used by svnrdump too. Could you check if a similar test should be added to svnrdump_tests.py? If yes, could you please add an XFail test in svnrdump_tests.py? (should be three lines of code) > #---------------------------------------------------------------------- > > def delete_svn_props(sbox): *phew*. This took a good while...