Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
It's indeed the first Logical identifier with Row base type. The UUID is generated from the name of the class, but to do it in code (from a string) you need to create bytes from the string, then a UUID from those bytes. _/ _/ Alex Van Boxel On Mon, Jan 13, 2020 at 10:40 PM Brian Hulette wrote: > I guess these
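The step described above (string to bytes, then bytes to UUID) can be sketched with the JDK alone. This is a minimal illustration, not the code from the Beam PR: the class name `LogicalTypeIds`, the method `idFor`, and the identifier string are hypothetical. `UUID.nameUUIDFromBytes` is the standard JDK call that derives a name-based (version 3) UUID from a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, for illustration only; not part of Beam.
final class LogicalTypeIds {
  private LogicalTypeIds() {}

  // Derives a stable, name-based (version 3) UUID from an identifier string.
  static UUID idFor(String identifier) {
    byte[] bytes = identifier.getBytes(StandardCharsets.UTF_8);
    return UUID.nameUUIDFromBytes(bytes);
  }

  public static void main(String[] args) {
    // The identifier below is a made-up example, not a real Beam class name.
    UUID first = idFor("com.example.TimestampLogicalType");
    UUID second = idFor("com.example.TimestampLogicalType");
    // The same input string always yields the same UUID.
    System.out.println(first.equals(second)); // prints "true"
  }
}
```

Because the UUID depends only on the input bytes, the same identifier would produce the same id at construction time and on the workers, which seems to be the property the thread is after.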

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Brian Hulette
I guess these are the first logical types we've defined with a base type of row. It does seem reasonable that a static schema for a logical type could have some fixed id, but it feels odd to have a fixed UUID, it would be nice if we could give the schema some meaningful static identifier. I think

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
Fix in this PR: [BEAM-9113] Fix serialization proto logical types https://github.com/apache/beam/pull/10569 or we all agree to *promote* the logical types to top-level logical types (as described in the design document, see ticket): [BEAM-9037] Instant and duration as logical type https://github

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
So I think the following happens: 1. the schema tree is initialized at construction time. The tree gets serialized and sent to the workers 2. the workers deserialize the tree, but as the Timestamp logical type has a logical type with a *static* schema the schema will be *re-initialized

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Reuven Lax
SchemaCoder today recursively sets UUIDs for all schemas, including logical types, in setSchemaIds. Is it possible that your changes modified that logic somehow? On Mon, Jan 13, 2020 at 9:39 AM Alex Van Boxel wrote: > This is the stacktrace: > > > java.lang.IllegalStateException at > org.apache.

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
This is the stacktrace: java.lang.IllegalStateException at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491) at org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:380) at org.apache.beam.sdk.coders.RowCoderGene

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Reuven Lax
I don't think that should be the case. Also SchemaCoder will automatically set the UUID for such logical types. On Mon, Jan 13, 2020 at 8:24 AM Alex Van Boxel wrote: > OK, I've rechecked everything and eventually found the problem. The > problem is when you use a LogicalType backed by a Row, t

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
OK, I've rechecked everything and eventually found the problem. The problem is when you use a LogicalType backed by a Row, then the UUID needs to be set to make it work. (this is the case for Proto based Timestamps). I'll create a fix. _/ _/ Alex Van Boxel On Mon, Jan 13, 2020 at 8:36 AM Reuv

Re: master on Dataflow with schema aware PCollections stuck

2020-01-12 Thread Reuven Lax
Can you elucidate? All BeamSQL pipelines use schemas and I believe those tests are working just fine on the Dataflow runner. In addition, there are a number of ValidatesRunner schema-aware pipelines that are running regularly on the Dataflow runner. On Sun, Jan 12, 2020 at 1:43 AM Alex Van Boxel w

Re: master on Dataflow with schema aware PCollections stuck

2020-01-12 Thread Alex Van Boxel
BTW. This is not a support ticket, I'm wondering if we are aware and we're missing schema aware integration tests as well. _/ _/ Alex Van Boxel On Sun, Jan 12, 2020 at 10:43 AM Alex Van Boxel wrote: > Hey all, > > anyone tried master with a *schema aware pipeline* on Dataflow? I'm > testing s

master on Dataflow with schema aware PCollections stuck

2020-01-12 Thread Alex Van Boxel
Hey all, anyone tried master with a *schema aware pipeline* on Dataflow? I'm testing some PRs to see if they run on Dataflow (as they are working on Direct) but they got: Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. You ca

Re: Schema Aware PCollections

2018-08-08 Thread Anton Kedin
ava>), but it's still experimental and there are no good end-to-end examples yet. Regards, Anton On Wed, Aug 8, 2018 at 5:45 AM Akanksha Sharma B < akanksha.b.sha...@ericsson.com> wrote: > Hi, > > > (changed the email-subject to make it generic) > > > It is

Re: Schema-Aware PCollections revisited

2018-03-05 Thread Reuven Lax
Of course! I think some BeamSQL folks should be involved as well, as this directly affects SQL work. Anton especially has expressed interest in Row and schemas. Reuven On Mon, Mar 5, 2018 at 4:30 AM Jean-Baptiste Onofré wrote: > Cool, > > can I work with you on this (sharing a branch for insta

Re: Schema-Aware PCollections revisited

2018-03-05 Thread Jean-Baptiste Onofré
Cool, can I work with you on this (sharing a branch for instance) ? Thanks ! Regards JB On 03/05/2018 01:01 PM, Reuven Lax wrote: > Yes, I do have a PoC in progress. The Beam Row class was being refactored, so > I > paused to wait for that to finish. > > > On Sun, Mar 4, 2018 at 8:24 PM Jean-

Re: Schema-Aware PCollections revisited

2018-03-05 Thread Reuven Lax
Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I paused to wait for that to finish. On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré wrote: > Hi Reuven, > > I revive this discussion as I think it would be a great addition. > > We had discussion on the fly, but

Re: Schema-Aware PCollections revisited

2018-03-04 Thread Jean-Baptiste Onofré
Hi Reuven, I revive this discussion as I think it would be a great addition. We had discussion on the fly, but I think now, as base for discussion, it would be great to have a feature branch where we can start some sketch/impl and discuss. @Reuven, did you start a PoC with what you proposed: -

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
On Mon, Feb 5, 2018 at 9:06 PM, Kenneth Knowles wrote: > Joining late, but very interested. Commented on the doc. Since there's a > forked discussion between doc and thread, I want to say this on the thread: > > 1. I have used JSON schema in production for describing the structure of > analytics

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
I would add a use case: single serialization mechanism across a pipeline. JSON allows handling generic records (JsonObject) as well as POJO serialization and both are compatible. Compared to Avro's built-in mechanism, it is not intrusive in the models which is a key feature of an API. It also increas

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Kenneth Knowles
Joining late, but very interested. Commented on the doc. Since there's a forked discussion between doc and thread, I want to say this on the thread: 1. I have used JSON schema in production for describing the structure of analytics events and it is OK but not great. If you are sure your data is on

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
None, Json-p - the spec so no strong impl requires - as record API and a custom light wrapping for schema - like https://github.com/Talend/component-runtime/blob/master/component-form/component-form-model/src/main/java/org/talend/sdk/component/form/model/jsonschema/JsonSchema.java (note this code i

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
Which json library are you thinking of? At least in Java, there's always been a problem of no good standard Json library. On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau wrote: > > > On Feb 5, 2018 at 19:54, "Reuven Lax" wrote: > > multiplying by 1.0 doesn't really solve the right proble

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
On Feb 5, 2018 at 19:54, "Reuven Lax" wrote: multiplying by 1.0 doesn't really solve the right problems. The number type used by Javascript (and by extension, the standard for json) only has 53 bits of precision. I've seen many, many bugs caused because of this - the input data may easily conta

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
multiplying by 1.0 doesn't really solve the right problems. The number type used by Javascript (and by extension, the standard for json) only has 53 bits of precision. I've seen many, many bugs caused because of this - the input data may easily contain numbers too large for 53 bits. In addition,
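The 53-bit limit mentioned in this message can be seen by round-tripping a 64-bit integer through an IEEE 754 double, which is the number type JavaScript uses. A small sketch follows; the class and method names are hypothetical and chosen only for this illustration.

```java
// Hypothetical demo class, for illustration only.
final class JsonNumberPrecision {
  private JsonNumberPrecision() {}

  // Returns true if converting the value to a double and back yields the same long.
  static boolean survivesDouble(long value) {
    return (long) (double) value == value;
  }

  public static void main(String[] args) {
    long limit = 1L << 53; // 9007199254740992
    System.out.println(survivesDouble(limit));     // prints "true"
    System.out.println(survivesDouble(limit + 1)); // prints "false"
  }
}
```

Any 64-bit id or counter above 2^53 that passes through a JSON number parsed as a double can therefore change silently, which is the class of bug the message refers to.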

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
I'm off tonight but can we try to do it next week (tomorrow)? If not please answer to this thread with outcomes and I'll catch up tomorrow morning. On Feb 4, 2018 at 20:23, "Reuven Lax" wrote: Cool, let's chat about this on slack for a bit (which I realized I've been signed out of for some time). Reu

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
Cool, let's chat about this on slack for a bit (which I realized I've been signed out of for some time). Reuven On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré wrote: > Sorry guys, I was off today. Happy to be part of the party too ;) > > Regards > JB > > On 02/04/2018 06:19 PM, Reuven Lax

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
You can handle integers using multipleOf: 1.0 IIRC. Yes limitations are still here but it is a good starting model and to be honest it is good enough - not a single model will work good enough even if you can go a little bit further with other models a bit more complex. That said the idea is to enr

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Jean-Baptiste Onofré
Sorry guys, I was off today. Happy to be part of the party too ;) Regards JB On 02/04/2018 06:19 PM, Reuven Lax wrote: > Romain, since you're interested maybe the two of us should put together a > proposal for how to set these things (hints, schema) on PCollections? I don't > think it'll be hard -

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
Romain, since you're interested maybe the two of us should put together a proposal for how to set these things (hints, schema) on PCollections? I don't think it'll be hard - the previous list thread on hints already agreed on a general approach, and we would just need to flesh it out. BTW in the pa

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
2018-02-04 17:53 GMT+01:00 Reuven Lax : > > > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau > wrote: > >> >> 2018-02-04 17:37 GMT+01:00 Reuven Lax : >> >>> I'm not sure where proto comes from here. Proto is one example of a type >>> that has a schema, but only one example. >>> >>> 1. In the

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau wrote: > > 2018-02-04 17:37 GMT+01:00 Reuven Lax : > >> I'm not sure where proto comes from here. Proto is one example of a type >> that has a schema, but only one example. >> >> 1. In the initial prototype I want to avoid modifying the PCollecti

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
2018-02-04 17:37 GMT+01:00 Reuven Lax : > I'm not sure where proto comes from here. Proto is one example of a type > that has a schema, but only one example. > > 1. In the initial prototype I want to avoid modifying the PCollection API. > So I think it's best to create a special SchemaCoder, and p

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
I'm not sure where proto comes from here. Proto is one example of a type that has a schema, but only one example. 1. In the initial prototype I want to avoid modifying the PCollection API. So I think it's best to create a special SchemaCoder, and pass the schema into this coder. Later we might tar

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
@Reuven: is the proto only about passing schema or also the generic type? There are 2.5 topics to solve this issue: 1. How to pass schema 1.a. hints? 2. What is the generic record type associated to a schema and how to express a schema relative to it I would be happy to help on 1.a and 2 someh

Re: Schema-Aware PCollections revisited

2018-02-03 Thread Reuven Lax
One more thing. If anyone here has experience with various OSS metadata stores (e.g. Kafka Schema Registry is one example), would you like to collaborate on implementation? I want to make sure that source schemas can be stored in a variety of OSS metadata stores, and be easily pulled into a Beam pi

Re: Schema-Aware PCollections revisited

2018-02-03 Thread Reuven Lax
Hi all, If there are no concerns, I would like to start working on a prototype. It's just a prototype, so I don't think it will have the final API (e.g. for the prototype I'm going to avoid changing the API of PCollection, and use a "special" Coder instead). Also even once we go beyond prototype, it

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
If you need help on the json part I'm happy to help. To give a few hints on what is very doable: we can add an avro module to johnzon (asf json{p,b} impl) to back jsonp by avro (guess it will be one of the first to be asked) for instance. Romain Manni-Bucau @rmannibucau

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
Agree. The initial implementation will be a prototype. On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré wrote: > Hi Reuven, > > Agree to be able to describe the schema with different format. The good > point about json schemas is that they are described by a spec. My point is > also to avo

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Jean-Baptiste Onofré
Hi Reuven, Agree to be able to describe the schema with different formats. The good point about json schemas is that they are described by a spec. My point is also to avoid reinventing the wheel. Just an abstraction to be able to use Avro, Json, Calcite, custom schema descriptors would be great.

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
Hmm, it is a hint semantically or it is deducible from the transform. Doing the union of both you cover all cases. Then how it is forwarded from the transform to the runtime is in runner API not the user (pipeline) API so I'm not sure I see the case you reference where it has a semantic API. Can yo

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
I don't think "hint" is the right API, as schema is not a hint (it has semantic meaning). However I think the API for schema should look similar to any "hint" API. On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau wrote: > > > On Jan 31, 2018 at 20:16, "Reuven Lax" wrote: > > As to the ques

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
On Jan 31, 2018 at 20:16, "Reuven Lax" wrote: As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a Json schema, or an Avro schema, or a Calcite schema, etc. there should be adapters that allow setting a schema from any of the

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a Json schema, or an Avro schema, or a Calcite schema, etc. there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the ot

Re: Schema-Aware PCollections revisited

2018-01-30 Thread Jean-Baptiste Onofré
Hi, I think we should avoid mixing two things in the discussion (and so the document): 1. The element of the collection and the schema itself are two different things. By essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the PCollect

Re: Schema-Aware PCollections revisited

2018-01-29 Thread Romain Manni-Bucau
On Jan 30, 2018 at 01:09, "Reuven Lax" wrote: On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau wrote: > Hi > > I have some questions on this: how hierarchic schemas would work? Seems it > is not really supported by the ecosystem (out of custom stuff) :(. How > would it integrate smoothly

Re: Schema-Aware PCollections revisited

2018-01-29 Thread Reuven Lax
On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau wrote: > Hi > > I have some questions on this: how hierarchic schemas would work? Seems it > is not really supported by the ecosystem (out of custom stuff) :(. How > would it integrate smoothly with other generic record types - N bridges? > Do

Re: Schema-Aware PCollections revisited

2018-01-29 Thread Romain Manni-Bucau
Hi, I have some questions on this: how would hierarchical schemas work? Seems it is not really supported by the ecosystem (out of custom stuff) :(. How would it integrate smoothly with other generic record types - N bridges? Concretely I wonder if using json API couldn't be beneficial: json-p is a ni

Re: Schema-Aware PCollections revisited

2018-01-28 Thread Jean-Baptiste Onofré
Hi Reuven, Thanks for the update ! As I'm working with you on this, I fully agree and great doc gathering the ideas. It's clearly something we have to add asap in Beam, because it would allow new use cases for our users (in a simple way) and open new areas for the runners (for instance dataframe

Schema-Aware PCollections revisited

2018-01-28 Thread Reuven Lax
Previously I submitted a proposal for adding schemas as a first-class concept on Beam PCollections. The proposal engendered quite a bit of discussion from the community - more discussion than I've seen from almost any of our proposals to date! Based on the feedback and comments, I reworked the pro

Re: Schema-Aware PCollections

2017-12-04 Thread Kenneth Knowles
Nice. Commented a bit on the doc. +1 to working up the Python, Go, portability implications. Kenn On Thu, Nov 30, 2017 at 1:06 PM, Reuven Lax wrote: > Thanks! > > > On Thu, Nov 30, 2017 at 11:25 AM, Holden Karau > wrote: > >> Rocking, I'll start leaving some comments on this. I'm excited

Re: Schema-Aware PCollections

2017-11-30 Thread Reuven Lax
Thanks! On Thu, Nov 30, 2017 at 11:25 AM, Holden Karau wrote: > Rocking, I'll start leaving some comments on this. I'm excited to see work > being done in this area as well :) > > On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau wrote: > >> On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax wrote: >> >>

Re: Schema-Aware PCollections

2017-11-30 Thread Holden Karau
Rocking, I'll start leaving some comments on this. I'm excited to see work being done in this area as well :) On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau wrote: > On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax wrote: > >> There has been a lot of conversation about schemas on PCollections >> recen

Re: Schema-Aware PCollections

2017-11-30 Thread Tyler Akidau
On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax wrote: > There has been a lot of conversation about schemas on PCollections > recently. There are a number of reasons for this. Schemas as first-class > objects in Beam provide a nice base for building BeamSQL. Spark has > provided schema-support via Dat

Schema-Aware PCollections

2017-11-29 Thread Reuven Lax
There has been a lot of conversation about schemas on PCollections recently. There are a number of reasons for this. Schemas as first-class objects in Beam provide a nice base for building BeamSQL. Spark has provided schema-support via Dataframes for over two years, and it has proved to be very pop