On Sun, Jan 6, 2019 at 12:46 PM Reuven Lax <[email protected]> wrote:
>
> Some time ago, @Jean-Baptiste Onofré made the excellent suggestion that we 
> look into using JsonPath as a selector format for schema fields. This 
> provides a simple and natural way for users to select nested schema fields, 
> as well as wildcards. This would allow users to more simply select nested 
> fields using the Select transform, e.g.:
>
> p.apply(Select.fields("event.userid", "event.location.*"));
>
> It would also fit into NewDoFn (Java) like this:
>
> @ProcessElement
> public void process(@Field("userid") String userId,
>                     @Field("action.location.*") Location location) {
> }
>
> After some investigation, I believe that we're better off with something very 
> close to a subset of JsonPath, but not precisely JsonPath.

I am very wary of creating something that's very close to, but not
quite, (a subset of) a well-established standard. Is there a
disadvantage to not being a strict actual subset? If we go this route,
we should at least ensure that any divergence is illegal JsonPath
rather than having a different semantic meaning.

> JsonPath has many features that are JavaScript specific (e.g. the ability to 
> embed JavaScript expressions). JsonPath also includes the ability to do 
> complex filtering and aggregation, which I don't think we want here; Beam 
> already provides the ability to do such filtering and aggregation, and it's 
> not needed here. One example of a change: JsonPath queries always begin with 
> $ (representing the root node), and I think we're better off not requiring 
> that so that these queries look more like BeamSql queries.
>
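
To make the "no leading $" idea concrete, here is a minimal sketch of how
such a selector could be mapped back to a JsonPath expression. The class
and method names are hypothetical (they are not part of Beam or of any
JsonPath library), and it assumes the only divergence is the missing root.
Since a selector without the leading $ is not a legal JsonPath query, this
divergence cannot silently change the meaning of a valid query.

```java
final class SelectorToJsonPath {
    // Hypothetical helper, for illustration only: turns a root-less
    // selector such as "event.userid" into "$.event.userid".
    static String toJsonPath(String selector) {
        if (selector == null || selector.isEmpty()) {
            throw new IllegalArgumentException("selector must not be empty");
        }
        if (selector.startsWith("$")) {
            // Assumption: root-less selectors never start with '$'.
            throw new IllegalArgumentException(
                "selector must not start with '$': " + selector);
        }
        return "$." + selector;
    }

    public static void main(String[] args) {
        System.out.println(toJsonPath("event.userid"));     // $.event.userid
        System.out.println(toJsonPath("event.location.*")); // $.event.location.*
    }
}
```
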
> I've created a small ANTLR grammar (which has the advantage that it's easy to 
> extend) for these expressions and have everything working in a branch. 
> However there are a few more features of JsonPath that might be useful here, 
> and I wanted community feedback to see whether it's worth implementing them.
>
> The first are array/map slices and selectors. Currently if a schema contains 
> an array (or map) field, you can only select all elements of the array or 
> map. JsonPath however supports selecting and slicing the array. For example, 
> consider the following:
>
> @DefaultSchema(JavaFieldSchema.class)
> public class Event {
>   public final String userId;
>   public final List<Action> actions;
> }
>
> Currently you can apply Select.fields("actions.location"), and that will 
> return a schema containing a list of Locations, one for every action in the 
> original event. If we allowed slicing, you could instead write 
> Select.fields("actions[0:9].location"), which would do the same but only for 
> the first 10 elements of the array.
>
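
For reference, a rough sketch of what a slice would do to the list before
the nested field is selected. It assumes JsonPath's Python-style bounds,
where the end index is exclusive; under that reading [0:10], rather than
[0:9], covers the first ten elements. The helper is hypothetical, it
ignores the optional step, and it rejects negative indices rather than
counting from the end.

```java
import java.util.ArrayList;
import java.util.List;

final class SliceSketch {
    // Hypothetical helper: returns values[start:end] with an exclusive
    // end, clamping both bounds to the size of the list.
    static <T> List<T> slice(List<T> values, int start, int end) {
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException(
                "negative indices are not handled in this sketch");
        }
        int from = Math.min(start, values.size());
        int to = Math.max(from, Math.min(end, values.size()));
        return new ArrayList<>(values.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> locations = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            locations.add("location-" + i);
        }
        System.out.println(slice(locations, 0, 9).size());  // 9
        System.out.println(slice(locations, 0, 10).size()); // 10
    }
}
```
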
> Is this useful in Beam? It would not be hard to implement, but I want to see 
> what folks think first.
>
> The second feature is recursive field selection. The example often given in 
> JsonPath is a Json document containing the inventory for a store. There are 
> lists of subobjects representing books, bicycles, tables, chairs, etc. etc. 
> The JsonPath query "$..price" recursively finds every object that has a field 
> named price, and returns those prices; in this case it returns the price of 
> every element in the store.
>
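
As an illustration of the recursive semantics, here is a small sketch that
walks a nested structure and collects every value stored under a given
field name, in document order. It models the document with plain
java.util maps and lists instead of Beam Rows, and all names in it are
hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RecursiveSelectSketch {
    // Collects every value stored under fieldName, at any depth.
    static List<Object> selectRecursive(Object node, String fieldName) {
        List<Object> out = new ArrayList<>();
        collect(node, fieldName, out);
        return out;
    }

    private static void collect(Object node, String fieldName, List<Object> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                if (fieldName.equals(e.getKey())) {
                    out.add(e.getValue());
                }
                collect(e.getValue(), fieldName, out); // descend into the value
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                collect(child, fieldName, out);
            }
        }
        // Scalars (strings, numbers, ...) have no children to visit.
    }

    // Builds an insertion-ordered map from alternating keys and values.
    static Map<String, Object> obj(Object... keyValues) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            m.put((String) keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Object> store = obj(
            "book", List.of(
                obj("title", "A", "price", 8.95),
                obj("title", "B", "price", 12.99)),
            "bicycle", obj("color", "red", "price", 19.95));
        System.out.println(selectRecursive(store, "price")); // [8.95, 12.99, 19.95]
    }
}
```
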
> I'm a bit less convinced that recursive field selection is useful in Beam. 
> The usual example for Json involves a document that represents an entire 
> corpus, e.g. a store inventory. In Beam, the schemas are applied to 
> individual records, and I don't know how often there will be a use for this 
> sort of recursive selection. However I could be wrong here, so if anyone has 
> a good use case for this sort of selection, please let me know.

Records often contain lists, e.g. the record could be an order, and it
could be useful to select on the price of the items (just to throw it
out there).
