Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)

Daniel Shahaf Sat, 01 Jan 2011 16:06:10 -0800

[ trimming CC ]

Danny Trebbien wrote on Sun, Dec 26, 2010 at 15:39:05 -0800:
> Attached are version 3 of the patch and corresponding log message.
> Also attached is a TGZ archive of two Subversion repository dumps that
> are used by a new test case.


> [[[
> Add a command line option (--source-encoding) to the svnsync init, sync, and
> copy-revprops subcommands that allows the user to specify the character
> encoding of translatable properties from the source repository. This is needed
> to allow svnsync to sync some older Subversion repositories that have
> properties that were not encoded in UTF-8.
> 
> As discussed at:
>   http://thread.gmane.org/gmane.comp.version-control.subversion.user/100020
>   http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122518
>   http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122550
> 
> Around half of the changes are to rename the "normalized count" variables so
> that it is more clear that the counters only count line ending normalizations
> and not re-encoding normalizations. The other half of the changes are cosmetic
> (adding comments or trimming trailing whitespace in lines) or exist to pass 
> the
> argument to --source-encoding through to the functions that need it (mainly
> normalize_string() in subversion/svnsync/sync.c).
> 

+1.

Refresh my memory please: did we decide in some thread that the
recodings should be counted too?

> * subversion/svnsync/main.c
>   (svnsync__opt) Add svnsync_opt_source_encoding.
>   (svnsync_cmd_table): Add svnsync_opt_source_encoding to the list of
>     acceptable options for the init, sync, and copy-revprops subcommands.
>   (svnsync_options): Add a description of the --source-encoding option.
>   (opt_baton_t, subcommand_baton_t): Add the SOURCE_ENCODING field.
>   (log_properties_normalized): Rename the NORMALIZED_REV_PROPS_COUNT parameter
>     to LE_NORMALIZED_REV_PROPS_COUNT and the NORMALIZED_NODE_PROPS_COUNT
>     parameter to LE_NORMALIZED_NODE_PROPS_COUNT.
>   (copy_revprops): Add the PROP_ENCODING parameter. Rename the 
> NORMALIZED_COUNT
>     parameter to LE_NORMALIZED_COUNT.
>   (make_subcommand_baton): Set the SOURCE_ENCODING field of the resulting
>     subcommand_baton_t object to the value of SOURCE_ENCODING from the
>     opt_baton_t object.
>   (do_initialize, do_synchronize, do_copy_revprops, replay_rev_started,
>    replay_rev_finished): Pass SOURCE_ENCODING to svnsync_* functions and
>     copy_revprops().
>   (replay_baton_t): Rename the NORMALIZED_REV_PROPS_COUNT field to
>     LE_NORMALIZED_REV_PROPS_COUNT and the NORMALIZED_NODE_PROPS_COUNT field to
>     LE_NORMALIZED_NODE_PROPS_COUNT.
>   (main): Handle the case when the command line option is --source-encoding.
>     Set the SOURCE_ENCODING field of the opt_baton_t object to either OPT_ARG
>     or NULL.
> 
> * subversion/svnsync/sync.c
>   (normalize_string): Rename WAS_NORMALIZED to WAS_LE_NORMALIZED to make clear
>     that the counter only counts line ending normalizations. Add the ENCODING
>     parameter. Handle the case when ENCODING is not NULL by calling
>     svn_subst_translate_string2(). Switch to the "two pools" (result & scratch
>     pools) pattern. Use memchr() instead of strchr() because the length of
>     (*STR)->data is known.
>   (svnsync_normalize_revprops): Rename NORMALIZED_COUNT to
>     NORMALIZED_LE_COUNT. Add the ENCODING parameter. Move the call to
>     apr_hash_set() outside of the if block (if (le_normalized)), as the
>     property value may have been changed because of re-encoding even when no
>     line ending was normalized.

The sentence "Move.*" is too much detail.  It's probably fine to just
omit it --- there is no functional change, and the details of the hunk
need no introduction.

> * subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.dump
>   A dump of a repository with the following features:
>   1. The log message (`svn:log` revision property) of revision 1 has CRLF line
>      endings.
>   2. The log message of revision 2 has CR line endings.
>   3. Revision 3 introduces an `svn:ignore` node property with CRLF line 
> endings.
>   4. Revision 4 introduces a custom node property, `x:related-to`, with CRLF
>      line endings.
> 

I don't understand

* What is the difference between copy_bad_line_endings() and
  copy_bad_line_endings2()?

* How does copy_bad_line_endings2() relate to the subject --- which is
  non-UTF-8 log messages, as opposed to non-LF-only log messages?

  i.e., why aren't you adding a dump where the 'before' has an
  ISO-8859-1 property, and the 'after' has that property recoded to
  UTF-8?

> * 
> subversion/tests/cmdline/svnsync_tests_data/copy-bad-line-endings2.expected.dump
>   A dump of the expected result of using svnsync to sync
>   copy-bad-line-endings2.dump.
> ]]]

> Index: subversion/svnsync/sync.h
> ===================================================================
> --- subversion/svnsync/sync.h (revision 1052903)
> +++ subversion/svnsync/sync.h (working copy)
> @@ -51,19 +56,25 @@ svnsync_normalize_revprops(apr_hash_t *rev_props,
>   * the commit.  TO_URL is the URL of the root of the repository into
>   * which the commit is being made.
>   *
> + * If PROP_ENCODING is NULL, then property values are presumed to be encoded
> + * in UTF-8 and are not re-encoded. Otherwise, the property values are
> + * presumed to be encoded in PROP_ENCODING, and are normalized to UTF-8.
> + *

+1

>   * As the sync editor encounters property values, it might see the need to
> - * normalize them (to LF line endings). Each carried out normalization adds 1
> - * to the *NORMALIZED_NODE_PROPS_COUNTER (for notification).
> + * normalize them (re-encode and/or change to LF line endings). Each 
> carried-out
> + * line ending normalization adds 1 to the *LE_NORMALIZED_NODE_PROPS_COUNTER
> + * (for notification).
>   */
>  svn_error_t *
>  svnsync_get_sync_editor(const svn_delta_editor_t *wrapped_editor,
>                          void *wrapped_edit_baton,
>                          svn_revnum_t base_revision,
>                          const char *to_url,
> +                        const char *prop_encoding,
>                          svn_boolean_t quiet,
>                          const svn_delta_editor_t **editor,
>                          void **edit_baton,
> -                        int *normalized_node_props_counter,
> +                        int *le_normalized_node_props_counter,

I wonder: it might be a good idea to split those renames to their own
patch, so as to not hide the needle (meat of the change) in a haystack
of hunks.

(but unlike the anonymous struct naming thing, this one isn't just
a simple global search/replace, so I can't do it non-manually)

>                          apr_pool_t *pool);
>  
>  
> Index: subversion/svnsync/main.c
> ===================================================================
> --- subversion/svnsync/main.c (revision 1052903)
> +++ subversion/svnsync/main.c (working copy)
> @@ -105,8 +106,9 @@ static const svn_opt_subcommand_desc2_t svnsync_cm
>           "the destination repository by any method other than 'svnsync'.\n"
>           "In other words, the destination repository should be a read-only\n"
>           "mirror of the source repository.\n"),
> -      { SVNSYNC_OPTS_DEFAULT, 'q', svnsync_opt_allow_non_empty,
> -        svnsync_opt_disable_locking, svnsync_opt_steal_lock } },
> +      { SVNSYNC_OPTS_DEFAULT, svnsync_opt_source_encoding, 'q',
> +        svnsync_opt_allow_non_empty, svnsync_opt_disable_locking,
> +        svnsync_opt_steal_lock } },
>      { "synchronize", synchronize_cmd, { "sync" },

Why does 'svnsync init' need to take the new option?

>        N_("usage: svnsync synchronize DEST_URL [SOURCE_URL]\n"
>           "\n"
> @@ -226,7 +234,7 @@ static const apr_getopt_option_t svnsync_options[]
>      { 0, 0, 0, 0 }
>    };
>  
> -typedef struct {
> +typedef struct { /* opt_baton_t */

I fixed that globally in r1053996, thanks for the reminder.

>    svn_boolean_t non_interactive;
>    svn_boolean_t trust_server_cert;
>    svn_boolean_t no_auth_cache;
> @@ -441,7 +453,7 @@ check_if_session_is_at_repos_root(svn_ra_session_t
>   * revision REV of the repository associated with RA session SESSION.
>   *
>   * For REV zero, don't remove properties with the "svn:sync-" prefix.
> - * 
> + *

No whitespace-only hunks in a patch that makes a functional change,
please.  Thank you.

>   * All allocations will be done in a subpool of POOL.
>   */
>  static svn_error_t *
> @@ -1707,6 +1729,7 @@ main(int argc, const char *argv[])
>    const char *username = NULL, *source_username = NULL, *sync_username = 
> NULL;
>    const char *password = NULL, *source_password = NULL, *sync_password = 
> NULL;
>    apr_array_header_t *config_options = NULL;
> +  opt_baton.source_encoding = NULL;
>    apr_allocator_t *allocator;

Oops, you can't have a declaration (of ALLOCATOR) after statement
(non-declaration assignment) in C89.

Also, we memset(&opt_baton, 0), so this NULL may not be necessary.

(all 0-v-NULL discussions aside, I'd leave it in only if we init to NULL
the other pointer members of the struct)

>  
>    if (svn_cmdline_init("svnsync", stderr) != EXIT_SUCCESS)
> @@ -1831,6 +1854,10 @@ main(int argc, const char *argv[])
>                return svn_cmdline_handle_exit_error(err, pool, "svnsync: ");
>              break;
>  
> +          case svnsync_opt_source_encoding:
> +            opt_baton.source_encoding = apr_pstrdup(pool, opt_arg);
> +            break;

Why do you need to pstrdup() it?  Doesn't it have the proper lifetime
already?

Why don't you call svn_utf_cstring_to_utf8(), like is done in almost
every other use of opt_arg?

(Yes, ironic, isn't it.)

> +
>            case svnsync_opt_disable_locking:
>              opt_baton.disable_locking = TRUE;
>              break;
> Index: subversion/svnsync/sync.c
> ===================================================================
> --- subversion/svnsync/sync.c (revision 1052903)
> +++ subversion/svnsync/sync.c (working copy)
> @@ -45,56 +45,72 @@
>  #include <apr_uuid.h>
>  
>  
> -/* Normalize the line ending style of *STR, so that it contains only
> - * LF (\n) line endings. After return, *STR may point at a new
> - * svn_string_t* allocated from POOL.
> +/* Normalize the encoding and line ending style of *STR, so that it contains
> + * only LF (\n) line endings and is encoded in UTF-8. After return, *STR may
> + * point at a new svn_string_t* allocated in RESULT_POOL.
>   *
> + * If ENCODING is NULL, then *STR is presumed to be encoded in UTF-8.
> + *
> - * *WAS_NORMALIZED is set to TRUE when *STR needed to be normalized,
> - * and to FALSE if *STR remains unchanged.
> + * *WAS_LE_NORMALIZED is set to TRUE when *STR needed line ending 
> normalization.
> + *

You lost the "... and set to FALSE" otherwise bit.  It's important --- the
semantics of the contract are different without that statement --- so please
restore it.

> + * SCRATCH_POOL is used for temporary allocations.
>   */
>  static svn_error_t *
>  normalize_string(const svn_string_t **str,
> -                 svn_boolean_t *was_normalized,
> -                 apr_pool_t *pool)
> +                 svn_boolean_t *was_le_normalized,
> +                 const char *encoding,
> +                 apr_pool_t *result_pool,
> +                 apr_pool_t *scratch_pool)
>  {
> -  *was_normalized = FALSE;
> +  *was_le_normalized = FALSE;
>  
>    if (*str == NULL)
>      return SVN_NO_ERROR;
>  
>    SVN_ERR_ASSERT((*str)->data != NULL);
>  
> -  /* Detect inconsistent line ending style simply by looking
> -     for carriage return (\r) characters. */
> -  if (strchr((*str)->data, '\r') != NULL)
> +  if (encoding)
>      {
> +      svn_string_t *new_str = NULL;
> +      SVN_ERR(svn_subst_translate_string2(&new_str, NULL, was_le_normalized,
> +                                          *str, encoding, result_pool,
> +                                          scratch_pool));
> +      *str = new_str;
> +    }
> +
> +  /* Only detect inconsistent line ending style. This is accomplished by 
> simply
> +     looking for carriage return (\r) characters. */
> +  else if (memchr((*str)->data, (unsigned char) '\r', (*str)->len) != NULL)

I see the reason for the s/strchr/memchr/ in the log message: because
the length is known.  I suppose the underlying motivation is to make the
condition slightly faster?

We repeat that strchr() in a couple of other places ---
svn_repos__validate_prop() is one of them ---  so if you make the change
here, you probably need to change there too.

> +    {
>        /* Found some. Normalize. */
>        const char* cstring = NULL;
>        SVN_ERR(svn_subst_translate_cstring2((*str)->data, &cstring,
>                                             "\n", TRUE,
>                                             NULL, FALSE,
> -                                           pool));

Should this call be to svn_subst_translate_string2()?
                                          ^^

> -      *str = svn_string_create(cstring, pool);
> -      *was_normalized = TRUE;
> +                                           scratch_pool));
> +      *str = svn_string_create(cstring, result_pool);
> +      *was_le_normalized = TRUE;
>      }
>  
>    return SVN_NO_ERROR;
>  }
>  
>  
> -/* Normalize the line ending style of the values of properties in REV_PROPS
> - * that "need translation" (according to svn_prop_needs_translation(),
> - * currently all svn:* props) so that they contain only LF (\n) line endings.
> - * The number of properties that needed normalization is returned in
> - * *NORMALIZED_COUNT.
> +/* Normalize the encoding and line ending style of the values of properties
> + * in REV_PROPS that "need translation" (according to
> + * svn_prop_needs_translation(), which is currently all svn:* props) so that
> + * they are encoded in UTF-8 and contain only LF (\n) line endings.
> + * The number of properties that needed line ending normalization is 
> returned in
> + * *NORMALIZED_LE_COUNT. No re-encoding is performed if ENCODING is NULL.
>   */
>  svn_error_t *
>  svnsync_normalize_revprops(apr_hash_t *rev_props,
> -                           int *normalized_count,
> +                           int *normalized_le_count,
> +                           const char *encoding,
>                             apr_pool_t *pool)
>  {
>    apr_hash_index_t *hi;
> -  *normalized_count = 0;
> +  *normalized_le_count = 0;
>  
>    for (hi = apr_hash_first(pool, rev_props);
>         hi;
> @@ -105,15 +121,15 @@ svnsync_normalize_revprops(apr_hash_t *rev_props,
>  
>        if (svn_prop_needs_translation(propname))
>          {
> -          svn_boolean_t was_normalized;
> -          SVN_ERR(normalize_string(&propval, &was_normalized, pool));
> -          if (was_normalized)
> -            {
> -              /* Replace the existing prop value. */
> -              apr_hash_set(rev_props, propname, APR_HASH_KEY_STRING, 
> propval);
> -              /* And count this. */
> -              (*normalized_count)++;
> -            }
> +          svn_boolean_t was_le_normalized;
> +          SVN_ERR(normalize_string(&propval, &was_le_normalized, encoding, 
> pool,
> +                  pool));
> +
> +          /* Replace the existing prop value. */
> +          apr_hash_set(rev_props, propname, APR_HASH_KEY_STRING, propval);
> +
> +          if (was_le_normalized)
> +            (*normalized_le_count)++; /* Count it. */
>          }
>      }
>    return SVN_NO_ERROR;
> Index: subversion/tests/cmdline/svnsync_tests.py
> ===================================================================
> --- subversion/tests/cmdline/svnsync_tests.py (revision 1052903)
> +++ subversion/tests/cmdline/svnsync_tests.py (working copy)
> @@ -780,11 +780,17 @@ def info_not_synchronized(sbox):
>  #----------------------------------------------------------------------
>  
>  def copy_bad_line_endings(sbox):
> -  "copy with inconsistent lineendings in svn:props"
> +  "copy with inconsistent line endings in svn:* props"

Thanks, r1054320.

>    run_test(sbox, "copy-bad-line-endings.dump",
>             exp_dump_file_name="copy-bad-line-endings.expected.dump",
>             bypass_prop_validation=True)
>  
> +def copy_bad_line_endings2(sbox):
> +  "copy with non-LF line endings in svn:* props"
> +  run_test(sbox, "copy-bad-line-endings2.dump",
> +           exp_dump_file_name="copy-bad-line-endings2.expected.dump",
> +           bypass_prop_validation=True)
> +

Most svnsync tests are used by svnrdump too.  Could you check if
a similar test should be added to svnrdump_tests.py?

If yes, could you please add an XFail test in svnrdump_tests.py?
(should be three lines of code)

>  #----------------------------------------------------------------------
>  
>  def delete_svn_props(sbox):

*phew*.  This took a good while...

Re: [PATCH] allow svnsync to translate non-UTF-8 log messages to UTF-8 (v. 3)

Reply via email to