Re: [PATCH v1 3/5] list-objects-filter: implement composite filters

Emily Shaffer Tue, 28 May 2019 14:54:12 -0700

On Tue, May 21, 2019 at 05:21:52PM -0700, Matthew DeVore wrote:
> Allow combining filters such that only objects accepted by all filters
> are shown. The motivation for this is to allow getting directory
> listings without also fetching blobs. This can be done by combining
> blob:none with tree:<depth>. There are massive repositories that have
> larger-than-expected trees - even if you include only a single commit.
> 
> The current usage requires passing the filter to rev-list, or sending
> it over the wire, as:
> 
>       combine:<FILTER1>+<FILTER2>
> 
> (i.e.: git rev-list --filter=combine:tree:2+blob:limit=32k). This is
> potentially awkward because individual filters must be URL-encoded if
> they contain + or %. This can potentially be improved by supporting a
> repeated flag syntax, e.g.:
> 
>       $ git rev-list --filter=tree:2 --filter=blob:limit=32k
> 
> Such usage is currently an error, so giving it a meaning is backwards-
> compatible.
> 
> Signed-off-by: Matthew DeVore <matv...@google.com>
> ---
>  Documentation/rev-list-options.txt     |  12 ++
>  contrib/completion/git-completion.bash |   2 +-
>  list-objects-filter-options.c          | 161 ++++++++++++++++++++++++-
>  list-objects-filter-options.h          |  14 ++-
>  list-objects-filter.c                  | 114 +++++++++++++++++
>  t/t6112-rev-list-filters-objects.sh    | 159 +++++++++++++++++++++++-
>  6 files changed, 455 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/rev-list-options.txt 
> b/Documentation/rev-list-options.txt
> index ddbc1de43f..4fb0c4fbb0 100644
> --- a/Documentation/rev-list-options.txt
> +++ b/Documentation/rev-list-options.txt
> @@ -730,20 +730,32 @@ specification contained in <path>.
>  +
>  The form '--filter=tree:<depth>' omits all blobs and trees whose depth
>  from the root tree is >= <depth> (minimum depth if an object is located
>  at multiple depths in the commits traversed). <depth>=0 will not include
>  any trees or blobs unless included explicitly in the command-line (or
>  standard input when --stdin is used). <depth>=1 will include only the
>  tree and blobs which are referenced directly by a commit reachable from
>  <commit> or an explicitly-given object. <depth>=2 is like <depth>=1
>  while also including trees and blobs one more level removed from an
>  explicitly-given commit or tree.
> ++
> +The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines
> +several filters. Only objects which are accepted by every filter are
> +included. Filters are joined by '{plus}' and individual filters are %-encoded
> +(i.e. URL-encoded). Besides the '{plus}' and '%' characters, the following
> +characters are reserved and also must be encoded:
> +`~!@#$^&*()[]{}\;",<>?`+&#39;&#96;+ as well as all characters with ASCII code
> +&lt;= `0x20`, which includes space and newline.
> ++
> +Other arbitrary characters can also be encoded. For instance,
> +'combine:tree:3+blob:none' and 'combine:tree%3A2+blob%3Anone' are
> +equivalent.
>  
>  --no-filter::
>       Turn off any previous `--filter=` argument.
>  
>  --filter-print-omitted::
>       Only useful with `--filter=`; prints a list of the objects omitted
>       by the filter.  Object IDs are prefixed with a ``~'' character.
>  
>  --missing=<missing-action>::
>       A debug option to help with future "partial clone" development.
> diff --git a/contrib/completion/git-completion.bash 
> b/contrib/completion/git-completion.bash
> index 3eefbabdb1..0fd0a10d0c 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -1529,21 +1529,21 @@ _git_difftool ()
>  __git_fetch_recurse_submodules="yes on-demand no"
>  
>  _git_fetch ()
>  {
>       case "$cur" in
>       --recurse-submodules=*)
>               __gitcomp "$__git_fetch_recurse_submodules" "" 
> "${cur##--recurse-submodules=}"
>               return
>               ;;
>       --filter=*)
> -             __gitcomp "blob:none blob:limit= sparse:oid= sparse:path=" "" 
> "${cur##--filter=}"
> +             __gitcomp "blob:none blob:limit= sparse:oid= sparse:path= 
> combine: tree:" "" "${cur##--filter=}"
>               return
>               ;;
>       --*)
>               __gitcomp_builtin fetch
>               return
>               ;;
>       esac
>       __git_complete_remote_or_refspec
>  }
>  
> diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
> index e46ea467bc..d7a1516188 100644
> --- a/list-objects-filter-options.c
> +++ b/list-objects-filter-options.c
> @@ -1,19 +1,24 @@
>  #include "cache.h"
>  #include "commit.h"
>  #include "config.h"
>  #include "revision.h"
>  #include "argv-array.h"
>  #include "list-objects.h"
>  #include "list-objects-filter.h"
>  #include "list-objects-filter-options.h"
>  
> +static int parse_combine_filter(
> +     struct list_objects_filter_options *filter_options,
> +     const char *arg,
> +     struct strbuf *errbuf);
> +
>  /*
>   * Parse value of the argument to the "filter" keyword.
>   * On the command line this looks like:
>   *       --filter=<arg>
>   * and in the pack protocol as:
>   *       "filter" SP <arg>
>   *
>   * The filter keyword will be used by many commands.
>   * See Documentation/rev-list-options.txt for allowed values for <arg>.
>   *
> @@ -31,22 +36,20 @@ static int gently_parse_list_objects_filter(
>  
>       if (filter_options->choice) {
>               if (errbuf) {
>                       strbuf_addstr(
>                               errbuf,
>                               _("multiple filter-specs cannot be combined"));
>               }
>               return 1;
>       }
>  
> -     filter_options->filter_spec = strdup(arg);
> -
>       if (!strcmp(arg, "blob:none")) {
>               filter_options->choice = LOFC_BLOB_NONE;
>               return 0;
>  
>       } else if (skip_prefix(arg, "blob:limit=", &v0)) {
>               if (git_parse_ulong(v0, &filter_options->blob_limit_value)) {
>                       filter_options->choice = LOFC_BLOB_LIMIT;
>                       return 0;
>               }
>  
> @@ -74,37 +77,183 @@ static int gently_parse_list_objects_filter(
>               if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB,
>                                         &sparse_oid, &oc))
>                       filter_options->sparse_oid_value = oiddup(&sparse_oid);
>               filter_options->choice = LOFC_SPARSE_OID;
>               return 0;
>  
>       } else if (skip_prefix(arg, "sparse:path=", &v0)) {
>               filter_options->choice = LOFC_SPARSE_PATH;
>               filter_options->sparse_path_value = strdup(v0);
>               return 0;
> +
> +     } else if (skip_prefix(arg, "combine:", &v0)) {
> +             int sub_parse_res = parse_combine_filter(
> +                     filter_options, v0, errbuf);
> +             if (sub_parse_res)
> +                     return sub_parse_res;
> +             return 0;


Couldn't the three lines above be said more succinctly as "return
sub_parse_res;"?

> +
>       }
>       /*
>        * Please update _git_fetch() in git-completion.bash when you
>        * add new filters
>        */
>  
>       if (errbuf)
>               strbuf_addf(errbuf, _("invalid filter-spec '%s'"), arg);
>  
>       memset(filter_options, 0, sizeof(*filter_options));
>       return 1;
>  }
>  
> +static int digit_value(int c, struct strbuf *errbuf) {
> +     if (c >= '0' && c <= '9')
> +             return c - '0';
> +     if (c >= 'a' && c <= 'f')
> +             return c - 'a' + 10;
> +     if (c >= 'A' && c <= 'F')
> +             return c - 'A' + 10;

I'm sure there's something I'm missing here. But why are you manually
decoding hex instead of using strtol or sscanf or something?

> +
> +     if (!errbuf)
> +             return -1;
> +
> +     strbuf_addf(errbuf, _("error in filter-spec - "));
> +     if (c)
> +             strbuf_addf(
> +                     errbuf,
> +                     _("expect two hex digits after %%, but got: '%c'"),
> +                     c);
> +     else
> +             strbuf_addf(
> +                     errbuf,
> +                     _("not enough hex digits after %%; expected two"));
> +
> +     return -1;
> +}
> +
> +static int url_decode(struct strbuf *s, struct strbuf *errbuf) {
> +     char *dest = s->buf;
> +     char *src = s->buf;
> +     size_t new_len;
> +
> +     while (*src) {
> +             int digit_value_0, digit_value_1;
> +
> +             if (src[0] != '%') {
> +                     *dest++ = *src++;
> +                     continue;
> +             }
> +             src++;
> +
> +             digit_value_0 = digit_value(*src++, errbuf);
> +             if (digit_value_0 < 0)
> +                     return 1;
> +             digit_value_1 = digit_value(*src++, errbuf);
> +             if (digit_value_1 < 0)
> +                     return 1;
> +             *dest++ = digit_value_0 * 16 + digit_value_1;
> +     }
> +     new_len = dest - s->buf;
> +     strbuf_remove(s, new_len, s->len - new_len);
> +
> +     return 0;
> +}
> +
> +static const char *RESERVED_NON_WS = "~`!@#$^&*()[]{}\\;'\",<>?";
> +
> +static int has_reserved_character(
> +     struct strbuf *sub_spec, struct strbuf *errbuf)
> +{
> +     const char *c = sub_spec->buf;
> +     while (*c) {
> +             if (*c <= ' ' || strchr(RESERVED_NON_WS, *c))
> +                     goto found_reserved;
> +             c++;
> +     }
> +
> +     return 0;
> +
> +found_reserved:

What's the value of doing this in a goto instead of embedded in the
while loop?

> +     if (errbuf)
> +             strbuf_addf(errbuf,
> +                         "must escape char in sub-filter-spec: '%c'",
> +                         *c);
> +     return 1;
> +}
> +
> +static int parse_combine_filter(
> +     struct list_objects_filter_options *filter_options,
> +     const char *arg,
> +     struct strbuf *errbuf)
> +{
> +     struct strbuf **sub_specs = strbuf_split_str(arg, '+', 2);
> +     int result;
> +
> +     if (!sub_specs[0]) {
> +             if (errbuf)
> +                     strbuf_addf(errbuf,
> +                                 _("expected something after combine:"));
> +             result = 1;
> +             goto cleanup;
> +     }
> +
> +     result = has_reserved_character(sub_specs[0], errbuf);
> +     if (result)
> +             goto cleanup;
> +
> +     /*
> +      * Only decode the first sub-filter, since the rest will be decoded on
> +      * the recursive call.
> +      */
> +     result = url_decode(sub_specs[0], errbuf);
> +     if (result)
> +             goto cleanup;
> +
> +     if (!sub_specs[1]) {
> +             /*
> +              * There is only one sub-filter, so we don't need the
> +              * combine: - just parse it as a non-composite filter.
> +              */
> +             result = gently_parse_list_objects_filter(
> +                     filter_options, sub_specs[0]->buf, errbuf);
> +             goto cleanup;
> +     }
> +
> +     /* Remove trailing "+" so we can parse it. */
> +     assert(sub_specs[0]->buf[sub_specs[0]->len - 1] == '+');
> +     strbuf_remove(sub_specs[0], sub_specs[0]->len - 1, 1);
> +
> +     filter_options->choice = LOFC_COMBINE;
> +     filter_options->lhs = xcalloc(1, sizeof(*filter_options->lhs));
> +     filter_options->rhs = xcalloc(1, sizeof(*filter_options->rhs));
> +
> +     result = gently_parse_list_objects_filter(filter_options->lhs,
> +                                               sub_specs[0]->buf,
> +                                               errbuf) ||
> +             parse_combine_filter(filter_options->rhs,
> +                                   sub_specs[1]->buf,
> +                                   errbuf);

I guess you're recursing to combine filter 2 onto filter 1 which has
been combined onto filter 0 here. But why not just use a list or array?

> +
> +cleanup:
> +     strbuf_list_free(sub_specs);
> +     if (result) {
> +             list_objects_filter_release(filter_options);
> +             memset(filter_options, 0, sizeof(*filter_options));
> +     }
> +     return result;
> +}
> +
>  int parse_list_objects_filter(struct list_objects_filter_options 
> *filter_options,
>                             const char *arg)
>  {
>       struct strbuf buf = STRBUF_INIT;
> +     filter_options->filter_spec = strdup(arg);
>       if (gently_parse_list_objects_filter(filter_options, arg, &buf))
>               die("%s", buf.buf);
>       return 0;
>  }
>  
>  int opt_parse_list_objects_filter(const struct option *opt,
>                                 const char *arg, int unset)
>  {
>       struct list_objects_filter_options *filter_options = opt->value;
>  
> @@ -127,23 +276,29 @@ void expand_list_objects_filter_spec(
>       else if (filter->choice == LOFC_TREE_DEPTH)
>               strbuf_addf(expanded_spec, "tree:%lu",
>                           filter->tree_exclude_depth);
>       else
>               strbuf_addstr(expanded_spec, filter->filter_spec);
>  }
>  
>  void list_objects_filter_release(
>       struct list_objects_filter_options *filter_options)
>  {
> +     if (!filter_options)
> +             return;
>       free(filter_options->filter_spec);
>       free(filter_options->sparse_oid_value);
>       free(filter_options->sparse_path_value);
> +     list_objects_filter_release(filter_options->lhs);
> +     free(filter_options->lhs);
> +     list_objects_filter_release(filter_options->rhs);
> +     free(filter_options->rhs);

Is there a reason that the free shouldn't be included in
list_objects_filter_release()? Maybe this is a common style guideline
I've missed, but it seems to me like I'd expect a magic memory cleanup
function to do it all, and not leave it to me to free.

>       memset(filter_options, 0, sizeof(*filter_options));
>  }
>  
>  void partial_clone_register(
>       const char *remote,
>       const struct list_objects_filter_options *filter_options)
>  {
>       /*
>        * Record the name of the partial clone remote in the
>        * config and in the global variable -- the latter is
> @@ -171,14 +326,16 @@ void partial_clone_register(
>  }
>  
>  void partial_clone_get_default_filter_spec(
>       struct list_objects_filter_options *filter_options)
>  {
>       /*
>        * Parse default value, but silently ignore it if it is invalid.
>        */
>       if (!core_partial_clone_filter_default)
>               return;
> +
> +     filter_options->filter_spec = strdup(core_partial_clone_filter_default);
>       gently_parse_list_objects_filter(filter_options,
>                                        core_partial_clone_filter_default,
>                                        NULL);
>  }
> diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
> index e3adc78ebf..6c0f0ecd08 100644
> --- a/list-objects-filter-options.h
> +++ b/list-objects-filter-options.h
> @@ -7,20 +7,21 @@
>  /*
>   * The list of defined filters for list-objects.
>   */
>  enum list_objects_filter_choice {
>       LOFC_DISABLED = 0,
>       LOFC_BLOB_NONE,
>       LOFC_BLOB_LIMIT,
>       LOFC_TREE_DEPTH,
>       LOFC_SPARSE_OID,
>       LOFC_SPARSE_PATH,
> +     LOFC_COMBINE,
>       LOFC__COUNT /* must be last */
>  };
>  
>  struct list_objects_filter_options {
>       /*
>        * 'filter_spec' is the raw argument value given on the command line
>        * or protocol request.  (The part after the "--keyword=".)  For
>        * commands that launch filtering sub-processes, or for communication
>        * over the network, don't use this value; use the result of
>        * expand_list_objects_filter_spec() instead.
> @@ -32,28 +33,35 @@ struct list_objects_filter_options {
>        * the filtering algorithm to use.
>        */
>       enum list_objects_filter_choice choice;
>  
>       /*
>        * Choice is LOFC_DISABLED because "--no-filter" was requested.
>        */
>       unsigned int no_filter : 1;
>  
>       /*
> -      * Parsed values (fields) from within the filter-spec.  These are
> -      * choice-specific; not all values will be defined for any given
> -      * choice.
> +      * BEGIN choice-specific parsed values from within the filter-spec. Only
> +      * some values will be defined for any given choice.
>        */
> +
>       struct object_id *sparse_oid_value;
>       char *sparse_path_value;
>       unsigned long blob_limit_value;
>       unsigned long tree_exclude_depth;
> +
> +     /* LOFC_COMBINE values */
> +     struct list_objects_filter_options *lhs, *rhs;
> +
> +     /*
> +      * END choice-specific parsed values.
> +      */
>  };
>  
>  /* Normalized command line arguments */
>  #define CL_ARG__FILTER "filter"
>  
>  int parse_list_objects_filter(
>       struct list_objects_filter_options *filter_options,
>       const char *arg);
>  
>  int opt_parse_list_objects_filter(const struct option *opt,
> diff --git a/list-objects-filter.c b/list-objects-filter.c
> index 8e8616b9b8..b97277a46f 100644
> --- a/list-objects-filter.c
> +++ b/list-objects-filter.c
> @@ -453,34 +453,148 @@ static void filter_sparse_path__init(
>  
>       ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
>       d->array_frame[d->nr].defval = 0; /* default to include */
>       d->array_frame[d->nr].child_prov_omit = 0;
>  
>       ctx->filter_fn = filter_sparse;
>       ctx->free_fn = filter_sparse_free;
>       ctx->data = d;
>  }
>  
> +struct filter_combine_data {
> +     /* sub[0] corresponds to lhs, sub[1] to rhs. */

Jeff H had a comment about this too, but this seems unwieldy for >2
filters. (I also personally don't like using set index to incidate
lhs/rhs.) Why not an array of multiple `struct sub`? There's a macro
utility to generate types and helpers for an array of arbitrary struct
that may suit...

> +     struct {
> +             struct filter_context ctx;
> +             struct oidset seen;
> +             struct object_id skip_tree;
> +             unsigned is_skipping_tree : 1;
> +     } sub[2];
> +
> +     struct oidset rhs_omits;
> +};
> +
> +static void add_all(struct oidset *dest, struct oidset *src) {
> +     struct oidset_iter iter;
> +     struct object_id *src_oid;
> +
> +     oidset_iter_init(src, &iter);
> +     while ((src_oid = oidset_iter_next(&iter)) != NULL)
> +             oidset_insert(dest, src_oid);
> +}
> +
> +static void filter_combine_free(void *filter_data)
> +{
> +     struct filter_combine_data *d = filter_data;
> +     int i;
> +
> +     /* Anything omitted by rhs should be added to the overall omits set. */
> +     if (d->sub[0].ctx.omits)
> +             add_all(d->sub[0].ctx.omits, d->sub[1].ctx.omits);
> +
> +     for (i = 0; i < 2; i++) {
> +             list_objects_filter__release(&d->sub[i].ctx);
> +             oidset_clear(&d->sub[i].seen);
> +     }
> +     oidset_clear(&d->rhs_omits);
> +     free(d);
> +}
> +
> +static int should_delegate(enum list_objects_filter_situation 
> filter_situation,
> +                        struct object *obj,
> +                        struct filter_combine_data *d,
> +                        int side)
> +{
> +     if (!d->sub[side].is_skipping_tree)
> +             return 1;
> +     if (filter_situation == LOFS_END_TREE &&
> +             oideq(&obj->oid, &d->sub[side].skip_tree)) {
> +             d->sub[side].is_skipping_tree = 0;
> +             return 1;
> +     }
> +     return 0;
> +}
> +
> +static enum list_objects_filter_result filter_combine(
> +     struct repository *r,
> +     enum list_objects_filter_situation filter_situation,
> +     struct object *obj,
> +     const char *pathname,
> +     const char *filename,
> +     struct filter_context *ctx)
> +{
> +     struct filter_combine_data *d = ctx->data;
> +     enum list_objects_filter_result result[2];
> +     enum list_objects_filter_result combined_result = LOFR_ZERO;
> +     int i;
> +
> +     for (i = 0; i < 2; i++) {

I suppose your lhs and rhs are in sub[0] and sub[1] in part for the sake
of this loop. But I think it would be easier to understand what is going
on if you were to perform the loop contents in a helper function (as the
name of the function would provide some more documentation).

> +             if (oidset_contains(&d->sub[i].seen, &obj->oid) ||
> +                     !should_delegate(filter_situation, obj, d, i)) {
> +                     result[i] = LOFR_ZERO;
> +                     continue;
> +             }
> +
> +             result[i] = d->sub[i].ctx.filter_fn(
> +                     r, filter_situation, obj, pathname, filename,
> +                     &d->sub[i].ctx);
> +
> +             if (result[i] & LOFR_MARK_SEEN)
> +                     oidset_insert(&d->sub[i].seen, &obj->oid);
> +
> +             if (result[i] & LOFR_SKIP_TREE) {
> +                     d->sub[i].is_skipping_tree = 1;
> +                     d->sub[i].skip_tree = obj->oid;
> +             }
> +     }
> +
> +     if ((result[0] & LOFR_DO_SHOW) && (result[1] & LOFR_DO_SHOW))
> +             combined_result |= LOFR_DO_SHOW;
> +     if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree)
> +             combined_result |= LOFR_SKIP_TREE;
> +
> +     return combined_result;
> +}

I see that you tested that >2 filters works okay. But by doing it the
way you have it seems like you're setting up to need recursion all over
the place to check against all the filters. I suppose I don't see the
benefit of doing all this recursively, as compared to doing it
iteratively.

 - Emily

Re: [PATCH v1 3/5] list-objects-filter: implement composite filters

Reply via email to