Re: Query expressions for schema fields

Reuven Lax Mon, 07 Jan 2019 14:31:03 -0800

Some parts of schemas need a spec that is standardized cross language (in
particular, the definition of what a schema is). Other things can be
language specific. The next step with schemas will be formalizing those
answers.


On Mon, Jan 7, 2019 at 2:13 PM Robert Burke <[email protected]> wrote:

> In the eventual future where the Go SDK supports schemas, it should be
> possible to use struct field tags to specify paths for extraction from
> schema data, much as Java uses parameter annotations.
>
> e.g.
>
> type MyKey struct {
>     K string `jsonpath:"userid"`
> }
> type MyValue struct {
>     K   string     `jsonpath:"userid"`
>     Loc []Location `jsonpath:"action.location.*"`
> }
>
> func MyDoFn(k MyKey, v MyValue) (...) {...}
>
> One could likely access any number of schema fields this way, and it would
> be statically analysable, so fast extraction would be possible at runtime,
> rather than the default reflection paths.
>
> This would be agnostic to whichever path approach is decided on as the Beam
> standard approach.
>
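
A minimal sketch of how such tags could be read, using only the standard reflect package. This is not Beam Go SDK code: the helper name fieldPaths and the fields of Location are hypothetical, and the tags use the conventional key:"value" form so that reflect can look them up. The point is only that the paths can be collected once per type before any elements are processed.

```go
package main

import (
	"fmt"
	"reflect"
)

// Location is a hypothetical nested type; its fields are placeholders.
type Location struct {
	Lat, Lng float64
}

// MyValue mirrors the struct from the message above.
type MyValue struct {
	K   string     `jsonpath:"userid"`
	Loc []Location `jsonpath:"action.location.*"`
}

// fieldPaths returns a map from Go field name to the path found in the
// field's "jsonpath" tag. It accepts a struct value or a pointer to one.
func fieldPaths(v interface{}) map[string]string {
	paths := make(map[string]string)
	t := reflect.TypeOf(v)
	if t == nil {
		return paths
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return paths
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// Lookup reports whether the tag key is present at all.
		if p, ok := f.Tag.Lookup("jsonpath"); ok {
			paths[f.Name] = p
		}
	}
	return paths
}

func main() {
	// Map iteration order is not fixed, so the two lines may print in either order.
	for name, path := range fieldPaths(MyValue{}) {
		fmt.Printf("%s <- %s\n", name, path)
	}
}
```
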
> On Mon, Jan 7, 2019, 11:59 AM Reuven Lax <[email protected]> wrote:
>
>> I'll take a look.
>>
>> Honestly though, if we leave out features such as array slices this is a
>> dirt-simple path syntax that pretty much matches what SQL does. It's
>> basically just field1.field2, or field1.*.
>>
>> JMESPath along with JsonPath also supports various aggregations, which I
>> think is beyond the scope of what we want here; all that's needed here is a
>> selector expression. AFAICT what I have is already a strict subset of
>> JMESPath, though I'll take a closer look to make sure there are no semantic
>> incompatibilities.
>>
>> Reuven
>>
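
To make the field1.field2 / field1.* form concrete, here is a rough sketch of a selector over nested maps and lists. It is not the ANTLR-based implementation mentioned in this thread and uses no Beam types: rows are modelled as map[string]interface{}, lists as []interface{}, and the names selectPath and applyStep are hypothetical. It assumes that a step applied to a list is applied to every element, matching the "actions.location" behaviour described further down.

```go
package main

import (
	"fmt"
	"strings"
)

// selectPath walks a dotted path such as "event.userid" or
// "event.location.*" through nested maps and lists.
func selectPath(row interface{}, path string) (interface{}, error) {
	cur := row
	for _, step := range strings.Split(path, ".") {
		next, err := applyStep(cur, step)
		if err != nil {
			return nil, err
		}
		cur = next
	}
	return cur, nil
}

// applyStep applies one path step to a node.
func applyStep(node interface{}, step string) (interface{}, error) {
	switch n := node.(type) {
	case map[string]interface{}:
		if step == "*" {
			return n, nil // wildcard: keep every field of this record
		}
		v, ok := n[step]
		if !ok {
			return nil, fmt.Errorf("no field %q", step)
		}
		return v, nil
	case []interface{}:
		if step == "*" {
			return n, nil // wildcard: keep every element of this list
		}
		// A named step on a list is applied to each element in turn.
		out := make([]interface{}, 0, len(n))
		for _, e := range n {
			v, err := applyStep(e, step)
			if err != nil {
				return nil, err
			}
			out = append(out, v)
		}
		return out, nil
	default:
		return nil, fmt.Errorf("cannot apply %q to a %T", step, node)
	}
}

func main() {
	row := map[string]interface{}{
		"event": map[string]interface{}{
			"userid":   "u1",
			"location": map[string]interface{}{"lat": 40.7, "lng": -74.0},
		},
	}
	for _, p := range []string{"event.userid", "event.location.*"} {
		v, err := selectPath(row, p)
		fmt.Println(p, "=>", v, err)
	}
}
```
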
>> On Mon, Jan 7, 2019 at 10:21 AM Jeff Klukas <[email protected]> wrote:
>>
>>> There is also JMESPath (http://jmespath.org/) which is quite similar to
>>> JsonPath, but does have a spec and lacks the leading $ character. The AWS
>>> CLI uses JMESPath for defining queries.
>>>
>>>
>>>
>>> On Mon, Jan 7, 2019 at 1:05 PM Reuven Lax <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jan 7, 2019 at 1:44 AM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> On Sun, Jan 6, 2019 at 12:46 PM Reuven Lax <[email protected]> wrote:
>>>>> >
>>>>> > Some time ago, @Jean-Baptiste Onofré made the excellent suggestion
>>>>> > that we look into using JsonPath as a selector format for schema
>>>>> > fields. This provides a simple and natural way for users to select
>>>>> > nested schema fields, as well as wildcards. This would allow users to
>>>>> > more simply select nested fields using the Select transform, e.g.:
>>>>> >
>>>>> > p.apply(Select.fields("event.userid", "event.location.*"));
>>>>> >
>>>>> > It would also fit into NewDoFn (Java) like this:
>>>>> >
>>>>> > @ProcessElement
>>>>> > public void process(@Field("userid") String userId,
>>>>> >                     @Field("action.location.*") Location location) {
>>>>> > }
>>>>> >
>>>>> > After some investigation, I believe that we're better off with
>>>>> > something very close to a subset of JsonPath, but not precisely
>>>>> > JsonPath.
>>>>>
>>>>> I am very wary of creating something that's very close to, but not
>>>>> quite, a (subset of) a well-established standard. Is there a
>>>>> disadvantage to not being a strict actual subset? If we go this route,
>>>>> we should at least ensure that any divergence is illegal JsonPath
>>>>> rather than having different semantic meaning.
>>>>>
>>>>
>>>> As far as I can tell, JsonPath isn't much of a "standard." There
>>>> doesn't seem to be much of a spec other than implementation.
>>>>
>>>> For the most part, I am speaking of a strict subset of JsonPath. The
>>>> only incompatibility is that JsonPath expressions all start with a '$'
>>>> (which represents the root node). So in the above expression you would
>>>> write "$.action.location.*" instead. I think staying closer to BeamSql
>>>> syntax makes more sense here, and I would like to dispense with the need to
>>>> begin with a $ character. JsonPath also assumes that each object is also a
>>>> JavaScript object (which makes no sense here), and some of the JsonPath
>>>> features are based on that.
>>>>
>>>>
>>>>> > JsonPath has many features that are JavaScript-specific (e.g. the
>>>>> > ability to embed JavaScript expressions). JsonPath also includes the
>>>>> > ability to do complex filtering and aggregation, which I don't think
>>>>> > we want here; Beam already provides the ability to do such filtering
>>>>> > and aggregation. One example of a change: JsonPath queries always
>>>>> > begin with $ (representing the root node), and I think we're better
>>>>> > off not requiring that so that these queries look more like BeamSql
>>>>> > queries.
>>>>> >
>>>>> > I've created a small ANTLR grammar (which has the advantage that
>>>>> > it's easy to extend) for these expressions and have everything
>>>>> > working in a branch. However there are a few more features of
>>>>> > JsonPath that might be useful here, and I wanted community feedback
>>>>> > to see whether it's worth implementing them.
>>>>> >
>>>>> > The first are array/map slices and selectors. Currently if a schema
>>>>> > contains an array (or map) field, you can only select all elements of
>>>>> > the array or map. JsonPath however supports selecting and slicing the
>>>>> > array. For example, consider the following:
>>>>> >
>>>>> > @DefaultSchema(JavaFieldSchema.class)
>>>>> > public class Event {
>>>>> >   public final String userId;
>>>>> >   public final List<Action> actions;
>>>>> > }
>>>>> >
>>>>> > Currently you can apply Select.fields("actions.location"), and that
>>>>> > will return a schema containing a list of Locations, one for every
>>>>> > action in the original event. If we allowed slicing, you could instead
>>>>> > write Select.fields("actions[0:9].location"), which would do the same
>>>>> > but only for the first 10 elements of the array.
>>>>> >
>>>>> > Is this useful in Beam? It would not be hard to implement, but I
>>>>> > want to see what folks think first.
>>>>> >
>>>>> > The second feature is recursive field selection. The example often
>>>>> > given in JsonPath is a JSON document containing the inventory for a
>>>>> > store. There are lists of subobjects representing books, bicycles,
>>>>> > tables, chairs, etc. The JsonPath query "$..price" recursively finds
>>>>> > every object that has a field named price, and returns those prices;
>>>>> > in this case it returns the price of every element in the store.
>>>>> >
>>>>> > I'm a bit less convinced that recursive field selection is useful in
>>>>> > Beam. The usual example for JSON involves a document that represents
>>>>> > an entire corpus, e.g. a store inventory. In Beam, the schemas are
>>>>> > applied to individual records, and I don't know how often there will
>>>>> > be a use for this sort of recursive selection. However I could be
>>>>> > wrong here, so if anyone has a good use case for this sort of
>>>>> > selection, please let me know.
>>>>>
>>>>> Records often contain lists, e.g. the record could be an order, and it
>>>>> could be useful to select on the price of the items (just to throw it
>>>>> out there).
>>>>>
>>>>
>>>> BTW, that already works. The .. operator in JsonPath is a recursive
>>>> field search, across any lists or records that are lower in the tree.
>>>>
>>>
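
For reference, the recursive search that the JsonPath .. operator performs can be sketched roughly as below, applied to the order example. This is an illustration over plain maps and lists rather than Beam rows, the name findAll is hypothetical, and map keys are visited in sorted order only to make the result deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// findAll collects the value of every field called name, at any depth,
// descending through nested records (maps) and lists.
func findAll(node interface{}, name string) []interface{} {
	var out []interface{}
	switch n := node.(type) {
	case map[string]interface{}:
		keys := make([]string, 0, len(n))
		for k := range n {
			keys = append(keys, k)
		}
		sort.Strings(keys) // fixed visiting order for repeatable output
		for _, k := range keys {
			if k == name {
				out = append(out, n[k])
			}
			// Keep descending: a matching field may itself contain matches.
			out = append(out, findAll(n[k], name)...)
		}
	case []interface{}:
		for _, e := range n {
			out = append(out, findAll(e, name)...)
		}
	}
	return out
}

func main() {
	order := map[string]interface{}{
		"id": "o-1",
		"items": []interface{}{
			map[string]interface{}{"name": "book", "price": 10.0},
			map[string]interface{}{"name": "pen", "price": 2.5},
		},
	}
	fmt.Println(findAll(order, "price")) // prints [10 2.5]
}
```
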
