Re: Query expressions for schema fields

Reuven Lax Mon, 07 Jan 2019 10:05:35 -0800

On Mon, Jan 7, 2019 at 1:44 AM Robert Bradshaw <[email protected]> wrote:


> On Sun, Jan 6, 2019 at 12:46 PM Reuven Lax <[email protected]> wrote:
> >
> > Some time ago, @Jean-Baptiste Onofré made the excellent suggestion that
> we look into using JsonPath as a selector format for schema fields. This
> provides a simple and natural way for users to select nested schema fields,
> as well as wildcards. This would allow users to more simply select nested
> fields using the Select transform, e.g.:
> >
> > p.apply(Select.fields("event.userid", "event.location.*"));
> >
> > It would also fit into NewDoFn (Java) like this:
> >
> > @ProcessElement
> > public void process(@Field("userid") String userId,
> >                     @Field("action.location.*") Location location) {
> > }
> >
> > After some investigation, I believe that we're better off with something
> very close to a subset of JsonPath, but not precisely JsonPath.
>
> I am very wary of creating something that's very close to, but not
> quite, a (subset of a) well-established standard. Is there a
> disadvantage to not being a strict, actual subset? If we go this route,
> we should at least ensure that any divergence is illegal JsonPath
> rather than having different semantic meaning.
>

As far as I can tell, JsonPath isn't much of a "standard." There doesn't
seem to be much of a spec other than the implementations themselves.

For the most part, I am speaking of a strict subset of JsonPath. The only
incompatibility is that JsonPath expressions all start with a '$' (which
represents the root node). So in the above expression you would write
"$.action.location.*" instead. I think staying closer to BeamSql syntax
makes more sense here, and I would like to dispense with the need to begin
with a $ character. JsonPath also assumes that each object is a
JavaScript object (which makes no sense here), and some of the JsonPath
features are based on that assumption.
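
To make the '$' difference concrete, here is a minimal sketch of how a
selector in the '$'-less form could be mapped to and from JsonPath form.
The names SelectorPrefix, toJsonPath and fromJsonPath are hypothetical;
they are not part of Beam or of any JsonPath library, and the sketch only
deals with the leading root marker, not the rest of the grammar.

```java
class SelectorPrefix {
    // Hypothetical helper: add the JsonPath root marker to a '$'-less selector.
    static String toJsonPath(String selector) {
        if (selector.isEmpty()) {
            return "$";                      // the root node itself
        }
        if (selector.startsWith("$")) {
            return selector;                 // already in JsonPath form
        }
        return "$." + selector;
    }

    // Hypothetical helper: strip the JsonPath root marker, if present.
    static String fromJsonPath(String jsonPath) {
        if (jsonPath.equals("$")) {
            return "";
        }
        if (jsonPath.startsWith("$.")) {
            return jsonPath.substring(2);
        }
        return jsonPath;                     // no root marker to strip
    }

    public static void main(String[] args) {
        System.out.println(toJsonPath("action.location.*"));
        System.out.println(fromJsonPath("$.action.location.*"));
    }
}
```

If the mapping really is this mechanical, then any expression in the
'$'-less form has exactly one JsonPath equivalent, which is the sense in
which I mean "strict subset" above.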


> > JsonPath has many features that are JavaScript-specific (e.g. the
> ability to embed JavaScript expressions). JsonPath also includes the
> ability to do complex filtering and aggregation, which I don't think we
> want here; Beam already provides the ability to do such filtering and
> aggregation, and it's not needed here. One example of a change: JsonPath
> queries always begin with $ (representing the root node), and I think we're
> better off not requiring that so that these queries look more like BeamSql
> queries.
> >
> > I've created a small ANTLR grammar (which has the advantage that it's
> easy to extend) for these expressions and have everything working in a
> branch. However there are a few more features of JsonPath that might be
> useful here, and I wanted community feedback to see whether it's worth
> implementing them.
> >
> > The first are array/map slices and selectors. Currently if a schema
> contains an array (or map) field, you can only select all elements of the
> array or map. JsonPath however supports selecting and slicing the array.
> For example, consider the following:
> >
> > @DefaultSchema(JavaFieldSchema.class)
> > public class Event {
> >   public final String userId;
> >   public final List<Action> actions;
> > }
> >
> > Currently you can apply Select.fields("actions.location"), and that will
> return a schema containing a list of Locations, one for every action in the
> original event. If we allowed slicing, you could instead write
> Select.fields("actions[0:9].location"), which would do the same but only
> for the first 10 elements of the array.
> >
> > Is this useful in Beam? It would not be hard to implement, but I want to
> see what folks think first.
> >
> > The second feature is recursive field selection. The example often given
> in JsonPath is a Json document containing the inventory for a store. There
> are lists of subobjects representing books, bicycles, tables, chairs, etc.
> The JsonPath query "$..price" recursively finds every object that has
> a field named price, and returns those prices; in this case it returns the
> price of every element in the store.
> >
> > I'm a bit less convinced that recursive field selection is useful in
> Beam. The usual example for Json involves a document that represents an
> entire corpus, e.g. a store inventory. In Beam, the schemas are applied to
> individual records, and I don't know how often there will be a use for this
> sort of recursive selection. However I could be wrong here, so if anyone
> has a good use case for this sort of selection, please let me know.
>
> Records often contain lists, e.g. the record could be an order, and it
> could be useful to select on the price of the items (just to throw it
> out there).
>

BTW, that already works. The .. operator in JsonPath is a recursive field
search, across any lists or records that are lower in the tree.
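
For anyone unfamiliar with the .. operator, its semantics can be sketched
roughly as follows. This is only an illustration over plain java.util Maps
and Lists standing in for nested records; it does not use Beam's Row or
Schema classes and is not how the ANTLR-based branch implements it. The
names RecursiveFieldSearch and findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RecursiveFieldSearch {
    // Returns the value of every field called `name`, at any depth below `node`.
    static List<Object> findAll(Object node, String name) {
        List<Object> out = new ArrayList<>();
        collect(node, name, out);
        return out;
    }

    private static void collect(Object node, String name, List<Object> out) {
        if (node instanceof Map) {
            // A record: check each field, then descend into its value.
            for (Map.Entry<?, ?> field : ((Map<?, ?>) node).entrySet()) {
                if (name.equals(field.getKey())) {
                    out.add(field.getValue());
                }
                collect(field.getValue(), name, out);
            }
        } else if (node instanceof List) {
            // A list: descend into every element.
            for (Object element : (List<?>) node) {
                collect(element, name, out);
            }
        }
        // Primitive values have no fields, so there is nothing to do.
    }
}
```

For an order record whose "items" field is a list of records that each
have a "price" field, findAll(order, "price") collects one price per item,
which is the case you describe.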
