Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Robert Bradshaw
On Thu, Jan 26, 2017 at 6:58 PM, Kenneth Knowles wrote: > On Thu, Jan 26, 2017 at 4:15 PM, Robert Bradshaw < > rober...@google.com.invalid> wrote: > >> On Thu, Jan 26, 2017 at 3:42 PM, Eugene Kirpichov >> wrote: >> > >> > you can't wrap DoFn's, period >> >> As a simple example, given a DoFn it's

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Kobi Salant
Congrats! Well deserved Stas בתאריך 27 בינו' 2017 7:26,‏ "Frances Perry" כתב: > Woohoo! Congrats ;-) > > On Thu, Jan 26, 2017 at 9:05 PM, Jean-Baptiste Onofré > wrote: > > > Welcome aboard !⁣ > > > > Regards > > JB > > > > On Jan 27, 2017, 01:27, at 01:27, Davor Bonaci wrote: > > >Please join

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Jean-Baptiste Onofré
⁣Hi Eugene A simple way would be to create a BatchedDoFn in an extension. WDYT ? Regards JB On Jan 26, 2017, 21:48, at 21:48, Eugene Kirpichov wrote: >I don't think we should make batching a core feature of the Beam >programming model (by adding it to DoFn as this code snippet implies). >I'm

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Frances Perry
Woohoo! Congrats ;-) On Thu, Jan 26, 2017 at 9:05 PM, Jean-Baptiste Onofré wrote: > Welcome aboard !⁣ > > Regards > JB > > On Jan 27, 2017, 01:27, at 01:27, Davor Bonaci wrote: > >Please join me and the rest of Beam PMC in welcoming the following > >contributors as our newest committers. They h

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Jean-Baptiste Onofré
Welcome aboard !⁣ Regards JB On Jan 27, 2017, 01:27, at 01:27, Davor Bonaci wrote: >Please join me and the rest of Beam PMC in welcoming the following >contributors as our newest committers. They have significantly >contributed >to the project in different ways, and we look forward to many more

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Aviem Zur
Congrats! On Fri, Jan 27, 2017, 06:25 Thomas Weise wrote: > Congrats! > > > On Thu, Jan 26, 2017 at 7:49 PM, María García Herrero < > mari...@google.com.invalid> wrote: > > > Congratulations and thank you for your contributions thus far! > > > > On Thu, Jan 26, 2017 at 6:00 PM, Robert Bradshaw <

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Thomas Weise
Congrats! On Thu, Jan 26, 2017 at 7:49 PM, María García Herrero < mari...@google.com.invalid> wrote: > Congratulations and thank you for your contributions thus far! > > On Thu, Jan 26, 2017 at 6:00 PM, Robert Bradshaw < > rober...@google.com.invalid> wrote: > > > Welcome and congratulations! >

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread María García Herrero
Congratulations and thank you for your contributions thus far! On Thu, Jan 26, 2017 at 6:00 PM, Robert Bradshaw < rober...@google.com.invalid> wrote: > Welcome and congratulations! > > On Thu, Jan 26, 2017 at 5:05 PM, Sourabh Bajaj > wrote: > > Congrats!! > > > > On Thu, Jan 26, 2017 at 5:02 PM

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Kenneth Knowles
On Thu, Jan 26, 2017 at 4:15 PM, Robert Bradshaw < rober...@google.com.invalid> wrote: > On Thu, Jan 26, 2017 at 3:42 PM, Eugene Kirpichov > wrote: > > > > you can't wrap DoFn's, period > > As a simple example, given a DoFn it's perfectly natural to want > to "wrap" this as a DoFn, KV>. State, si

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Robert Bradshaw
Welcome and congratulations! On Thu, Jan 26, 2017 at 5:05 PM, Sourabh Bajaj wrote: > Congrats!! > > On Thu, Jan 26, 2017 at 5:02 PM Jason Kuster > wrote: > >> Congrats all! Very exciting. :) >> >> On Thu, Jan 26, 2017 at 4:48 PM, Jesse Anderson >> wrote: >> >> > Welcome! >> > >> > On Thu, Jan 2

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Robert Bradshaw
On Thu, Jan 26, 2017 at 5:04 PM, Eugene Kirpichov wrote: > It would be nice to start with an inventory of the batching use cases we > already have implemented manually, and see what kind of API would be > sufficient to replace all of them. Sounds like a good way to make forward progress. > E.g.:

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Robert Bradshaw
On Thu, Jan 26, 2017 at 4:20 PM, Ben Chambers wrote: > Here's an example API that would make this part of a DoFn. The idea here is > that it would still be run as `ParDo.of(new MyBatchedDoFn())`, but the > runner (and DoFnRunner) could see that it has asked for batches, so rather > than calling a

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Sourabh Bajaj
Congrats!! On Thu, Jan 26, 2017 at 5:02 PM Jason Kuster wrote: > Congrats all! Very exciting. :) > > On Thu, Jan 26, 2017 at 4:48 PM, Jesse Anderson > wrote: > > > Welcome! > > > > On Thu, Jan 26, 2017, 7:27 PM Davor Bonaci wrote: > > > > > Please join me and the rest of Beam PMC in welcoming

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Eugene Kirpichov
It would be nice to start with an inventory of the batching use cases we already have implemented manually, and see what kind of API would be sufficient to replace all of them. E.g.: - both fixed and dynamic batch size as specified above are insufficient for what PubsubIO and ElasticsearchIO do (t

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Jason Kuster
Congrats all! Very exciting. :) On Thu, Jan 26, 2017 at 4:48 PM, Jesse Anderson wrote: > Welcome! > > On Thu, Jan 26, 2017, 7:27 PM Davor Bonaci wrote: > > > Please join me and the rest of Beam PMC in welcoming the following > > contributors as our newest committers. They have significantly > c

Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Jesse Anderson
Welcome! On Thu, Jan 26, 2017, 7:27 PM Davor Bonaci wrote: > Please join me and the rest of Beam PMC in welcoming the following > contributors as our newest committers. They have significantly contributed > to the project in different ways, and we look forward to many more > contributions in the

[ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Davor Bonaci
Please join me and the rest of Beam PMC in welcoming the following contributors as our newest committers. They have significantly contributed to the project in different ways, and we look forward to many more contributions in the future. * Stas Levin Stas has contributed across the breadth of the

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Ben Chambers
Here's an example API that would make this part of a DoFn. The idea here is that it would still be run as `ParDo.of(new MyBatchedDoFn())`, but the runner (and DoFnRunner) could see that it has asked for batches, so rather than calling a `processElement` on every input `I`, it assembles a `Collectio

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Robert Bradshaw
On Thu, Jan 26, 2017 at 3:42 PM, Eugene Kirpichov wrote: > I agree that wrapping the DoFn is probably not the way to go, because the > DoFn may be quite tricky due to all the reflective features: e.g. how do > you automatically "batch" a DoFn that uses state and timers? What about a > DoFn that us

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Ben Chambers
The third option for batching: - Functionality within the DoFn and DoFnRunner built as part of the SDK. I haven't thought through Batching, but at least for the IntraBundleParallelization use case this actually does make sense to expose as a part of the model. Knowing that a DoFn supports paralle

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Kenneth Knowles
On Thu, Jan 26, 2017 at 3:42 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > The class for invoking DoFn's, > DoFnInvokers, is absent from the SDK (and present in runners-core) for a > good reason. > This would be true if it weren't for that pesky DoFnTester :-) And even if we solv

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Eugene Kirpichov
I agree that wrapping the DoFn is probably not the way to go, because the DoFn may be quite tricky due to all the reflective features: e.g. how do you automatically "batch" a DoFn that uses state and timers? What about a DoFn that uses a BoundedWindow parameter? What about a splittable DoFn? What a

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Robert Bradshaw
On Thu, Jan 26, 2017 at 3:31 PM, Ben Chambers wrote: > I think that wrapping the DoFn is tricky -- we backed out > IntraBundleParallelization because it did that, and it has weird > interactions with both the reflective DoFn and windowing. We could maybe > make some kind of "DoFnDelegatingDoFn" th

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Ben Chambers
I think that wrapping the DoFn is tricky -- we backed out IntraBundleParallelization because it did that, and it has weird interactions with both the reflective DoFn and windowing. We could maybe make some kind of "DoFnDelegatingDoFn" that could act as a base class and get some of that right, but..

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Robert Bradshaw
On Thu, Jan 26, 2017 at 12:48 PM, Eugene Kirpichov wrote: > I don't think we should make batching a core feature of the Beam > programming model (by adding it to DoFn as this code snippet implies). I'm > reasonably sure there are less invasive ways of implementing it. +1, either as a PTransform,

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Eugene Kirpichov
I don't think we should make batching a core feature of the Beam programming model (by adding it to DoFn as this code snippet implies). I'm reasonably sure there are less invasive ways of implementing it. On Thu, Jan 26, 2017 at 12:22 PM Jean-Baptiste Onofré wrote: > Agree, I'm curious as well. >

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Jean-Baptiste Onofré
Agree, I'm curious as well. I guess it would be something like: .apply(ParDo(new DoFn() { @Override public long batchSize() { return 1000; } @ProcessElement public void processElement(ProcessContext context) { ... } })); If batchSize (overrided by user) returns a p

Re: Default Timestamp and Watermark

2017-01-26 Thread Shen Li
Thank you Kenn! Shen On Thu, Jan 26, 2017 at 1:33 PM, Kenneth Knowles wrote: > On Thu, Jan 26, 2017 at 9:48 AM, Thomas Groh > wrote: > > > > The default watermark policy for a bounded source should be negative > > infinity until all of the data is read, then positive infinity. > > > Just to el

Re: Default Timestamp and Watermark

2017-01-26 Thread Kenneth Knowles
On Thu, Jan 26, 2017 at 9:48 AM, Thomas Groh wrote: > > The default watermark policy for a bounded source should be negative > infinity until all of the data is read, then positive infinity. Just to elaborate - there isn't a way for a bounded source to communicate a watermark. Runners each do th

Re: Default Timestamp and Watermark

2017-01-26 Thread Shen Li
Hi Thomas, Thanks a lot for explaining. Best, Shen On Thu, Jan 26, 2017 at 12:48 PM, Thomas Groh wrote: > The default timestamp should be BoundedWindow.TIMESTAMP_MIN_VALUE, which > is > equivalent to -2**63 microseconds. We also occasionally refer to this > timestamp as "negative infinity". >

Re: Default Timestamp and Watermark

2017-01-26 Thread Thomas Groh
The default timestamp should be BoundedWindow.TIMESTAMP_MIN_VALUE, which is equivalent to -2**63 microseconds. We also occasionally refer to this timestamp as "negative infinity". The default watermark policy for a bounded source should be negative infinity until all of the data is read, then posi

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Robert Bradshaw
First off, let me say that a *correctly* batching DoFn is a lot of value, especially because it's (too) easy to (often unknowingly) implement it incorrectly. My take is that a BatchingParDo should be a PTransform, PCollection> that takes a DoFn, ? extends Iterable> as a parameter, as well as some

Re: Committed vs. attempted metrics results

2017-01-26 Thread Ben Chambers
It think relaxing the query to not be an exact match is reasonable. I'm wondering if it should be substring or regex. either one preserves the existing behavior of, when passed a full step path returning only the metrics for that specific step, but it adds the ability to just know approximately the

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Eugene Kirpichov
Hi Etienne, Could you post some snippets of how your transform is to be used in a pipeline? I think that would make it easier to discuss on this thread and could save a lot of churn if the discussion ends up leading to a different API. On Thu, Jan 26, 2017 at 8:29 AM Etienne Chauchot wrote: > W

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Etienne Chauchot
Wonderful ! Thanks Kenn ! Etienne Le 26/01/2017 à 15:34, Kenneth Knowles a écrit : Hi Etienne, I was drafting a proposal about @OnWindowExpiration when this email arrived. I thought I would try to quickly unblock you by responding with a TL;DR: you can achieve your goals with state & timers

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Jean-Baptiste Onofré
Fantastic ! Let me take a look on the Spark runner ;) Thanks ! Regards JB On 01/26/2017 03:34 PM, Kenneth Knowles wrote: Hi Etienne, I was drafting a proposal about @OnWindowExpiration when this email arrived. I thought I would try to quickly unblock you by responding with a TL;DR: you can ac

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Kenneth Knowles
Hi Etienne, I was drafting a proposal about @OnWindowExpiration when this email arrived. I thought I would try to quickly unblock you by responding with a TL;DR: you can achieve your goals with state & timers as they currently exist. You'll set a timer for window.maxTimestamp().plus(allowedLatenes

Re: Committed vs. attempted metrics results

2017-01-26 Thread Aviem Zur
Ben - yes, there is still some ambiguity regarding the querying of the metrics results. You've discussed in this thread the notion that metrics step names should be converted to unique names when aggregating metrics, so that each step will aggregate its own metrics, and not join with other steps b

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Etienne Chauchot
Hi, I have started to implement this ticket. For now it is implemented as a PTransform that simply does ParDo.of(new DoFn) and all the processing related to batching is done in the DoFn. I'm starting to deal with windows and bundles (starting to take a look at the State API to process trans-

Re: Better developer instructions for using Maven?

2017-01-26 Thread Aljoscha Krettek
+1 to what Dan said On Wed, 25 Jan 2017 at 21:40 Kenneth Knowles wrote: > +1 > > On Jan 25, 2017 11:15, "Jean-Baptiste Onofré" wrote: > > > +1 > > > > It sounds good to me. > > > > Thanks Dan ! > > > > Regards > > JB⁣​ > > > > On Jan 25, 2017, 19:39, at 19:39, Dan Halperin > > > wrote: > > >He

Re: Conceptually, what are bundles?

2017-01-26 Thread Jean-Baptiste Onofré
It makes sense. Agreed. Regards JB On 01/25/2017 08:34 PM, Kenneth Knowles wrote: There's actually not a JIRA filed beyond BEAM-25 for what Eugene is referring to. Context: Prior to windowing and streaming it was safe to buffer elements in @ProcessElement and then actually perform output in @F

Re: Conceptually, what are bundles?

2017-01-26 Thread Etienne Chauchot
Le 25/01/2017 à 20:34, Kenneth Knowles a écrit : There's actually not a JIRA filed beyond BEAM-25 for what Eugene is referring to. Context: Prior to windowing and streaming it was safe to buffer elements in @ProcessElement and then actually perform output in @FinishBundle. This pattern is only