Re: Flatten input data streams with skewed watermark progress

2018-03-12 Thread Reuven Lax
Ah, that's fine I think. What's not fine is .for an on-time element to later turn into a late element. On Mon, Mar 12, 2018 at 4:05 PM Shen Li wrote: > Sure. Consider the following case, where I have two input streams A and B. > (ts = timestamp, wm = watermark) > > processing timest

Re: Flatten input data streams with skewed watermark progress

2018-03-12 Thread Shen Li
Thomas and Reuven, Thank you for the explanation. Shen On Mon, Mar 12, 2018 at 7:05 PM, Thomas Groh wrote: > That one would be, for example, having a PCollection with a highly > advanced watermark and a PCollection with a much earlier watermark, and > have an input that is behind the watermark

Re: Flatten input data streams with skewed watermark progress

2018-03-12 Thread Shen Li
Sure. Consider the following case, where I have two input streams A and B. (ts = timestamp, wm = watermark) processing timestream A stream B 0 elem=x, ts=1 wm=3 1 wm=1

Re: Flatten input data streams with skewed watermark progress

2018-03-12 Thread Thomas Groh
That one would be, for example, having a PCollection with a highly advanced watermark and a PCollection with a much earlier watermark, and have an input that is behind the watermark of the former PCollection go through the flatten - at which point it moves to being ahead of the watermark. That's f

Re: Flatten input data streams with skewed watermark progress

2018-03-12 Thread Reuven Lax
Logically a Flatten is just a way to create a multi-input transform downstream of the flatten (you can imagine a model in which Flatten was not explicit, we just allowed multiple main inputs). This means that yes, the watermark is the minimum of all inputs. I don't see how a late tuple can become

Re: Configuring file-based transforms with different options

2018-03-12 Thread Romain Manni-Bucau
Le 12 mars 2018 23:05, "Chamikara Jayalath" a écrit : On Mon, Mar 12, 2018 at 2:36 PM Romain Manni-Bucau wrote: > > > Le 12 mars 2018 22:22, "Chamikara Jayalath" a > écrit : > > > > On Mon, Mar 12, 2018 at 12:42 PM Romain Manni-Bucau > wrote: > >> >> >> Le 12 mars 2018 18:56, "Chamikara Jay

Re: Configuring file-based transforms with different options

2018-03-12 Thread Chamikara Jayalath
On Mon, Mar 12, 2018 at 2:36 PM Romain Manni-Bucau wrote: > > > Le 12 mars 2018 22:22, "Chamikara Jayalath" a > écrit : > > > > On Mon, Mar 12, 2018 at 12:42 PM Romain Manni-Bucau > wrote: > >> >> >> Le 12 mars 2018 18:56, "Chamikara Jayalath" a >> écrit : >> >> Agree. We need file-system abst

Re: Configuring file-based transforms with different options

2018-03-12 Thread Romain Manni-Bucau
Le 12 mars 2018 22:22, "Chamikara Jayalath" a écrit : On Mon, Mar 12, 2018 at 12:42 PM Romain Manni-Bucau wrote: > > > Le 12 mars 2018 18:56, "Chamikara Jayalath" a > écrit : > > Agree. We need file-system abstractions in all languages since (1) users > may need to directly access file-syste

Re: Configuring file-based transforms with different options

2018-03-12 Thread Chamikara Jayalath
On Mon, Mar 12, 2018 at 12:42 PM Romain Manni-Bucau wrote: > > > Le 12 mars 2018 18:56, "Chamikara Jayalath" a > écrit : > > Agree. We need file-system abstractions in all languages since (1) users > may need to directly access file-systems from DoFns (2) common file-based > sources/sinks will p

Re: Flatten input data streams with skewed watermark progress

2018-03-12 Thread Shen Li
Hi Reuven, What about watermark? Should Flatten emit the min watermark of all input data streams? If that is the case, one late tuple can become early after Flatten, right? Will that cause any problem? Shen On Mon, Mar 12, 2018 at 4:09 PM, Reuven Lax wrote: > No, I don't think it makes sense f

Re: Flatten input data streams with skewed watermark progress

2018-03-12 Thread Reuven Lax
No, I don't think it makes sense for the Flatten operator to cache element. On Mon, Mar 12, 2018 at 11:55 AM Shen Li wrote: > If multiple inputs of Flatten proceed at different speeds, should the > Flatten operator cache tuples before emitting output watermarks? This can > prevent a late tuple

Re: Configuring file-based transforms with different options

2018-03-12 Thread Reuven Lax
I think a way to have transform-specific options could be useful, regardless of this use case. On Mon, Mar 12, 2018 at 12:42 PM Romain Manni-Bucau wrote: > > > Le 12 mars 2018 18:56, "Chamikara Jayalath" a > écrit : > > Agree. We need file-system abstractions in all languages since (1) users >

Re: Configuring file-based transforms with different options

2018-03-12 Thread Romain Manni-Bucau
Le 12 mars 2018 18:56, "Chamikara Jayalath" a écrit : Agree. We need file-system abstractions in all languages since (1) users may need to directly access file-systems from DoFns (2) common file-based sources/sinks will probably will be available in multiple languages even with portability API an

Flatten input data streams with skewed watermark progress

2018-03-12 Thread Shen Li
If multiple inputs of Flatten proceed at different speeds, should the Flatten operator cache tuples before emitting output watermarks? This can prevent a late tuple from becoming early. But if the watermark gap (i.e., cache size) becomes too large among inputs, can the application tell Beam/runner

Re: Advice on parallelizing network calls in DoFn

2018-03-12 Thread Romain Manni-Bucau
No more but can try to gather some figures and compare it to beam dofn overhead which should be at the same level or a bit more here since it is never unwrapped whereas completionfuture is a raw code chain without beam in the middle. Le 12 mars 2018 18:18, "Lukasz Cwik" a écrit : > Do you have d

Re: Configuring file-based transforms with different options

2018-03-12 Thread Romain Manni-Bucau
Agree and since all languages will support options and strings (didnt check this last one but i hope so ;)) then prefix is by design portable :). Passing directly pipeline options works too but still requires a portable way to read options and requires a way to loosely typed it too without enforci

Re: Configuring file-based transforms with different options

2018-03-12 Thread Chamikara Jayalath
Agree. We need file-system abstractions in all languages since (1) users may need to directly access file-systems from DoFns (2) common file-based sources/sinks will probably will be available in multiple languages even with portability API and cross language IO (these are usually the first sources

Re: Configuring file-based transforms with different options

2018-03-12 Thread Lukasz Cwik
There is still a lot of work before we get to supporting cross language transforms and hence get access to filesystems written in different languages but how the options are passed through from one to the other will need to be well understood and it would be best if the way a user defines these fil

Re: Advice on parallelizing network calls in DoFn

2018-03-12 Thread Lukasz Cwik
Do you have data that supports this? Note that in reality for something like passing an element between DoFns, the constant in o(1) actually matters. Decreasing SDK harness overhead is a good thing though. On Mon, Mar 12, 2018 at 10:14 AM, Romain Manni-Bucau wrote: > By itself just the overhead

Re: Advice on parallelizing network calls in DoFn

2018-03-12 Thread Romain Manni-Bucau
By itself just the overhead of instantiating a wrapper (so nothing with the recent JVM GC improvement done for the stream/optional usages). After if you use the chaining you have a light overhead but still o(1) you can desire to skip when doing sync code but which will enable you to run way faster

Re: Advice on parallelizing network calls in DoFn

2018-03-12 Thread Lukasz Cwik
It is expected that SDKs will have all their cores fully utilized by processing bundles in parallel and not by performing intrabundle parallelization. This allows for DoFns to be chained together via regular method calls because the overhead to pass a single element through all the DoFn's should be

Python Users List

2018-03-12 Thread Albina Jones
Hi, I was curious to know if you would be interested in Python Users List 2018? Let me know your target criteria and we will revert back to you with further details. Target Users: Geography: Looking forward to continued success with you. Regards, Albina Jones | Marketing Analyst Note: If you

Re: NoSuchElementException in reader.getCurrent*.

2018-03-12 Thread Eugene Kirpichov
I think I'm actually with Romain here and I'm somewhat in favor of dropping this. The only entity who can violate the start/advance/getCurrent contract is the runner or a test utility, and they have their own unit tests to ensure they don't - there is no reason to re-enforce their contract via ever

Re: NoSuchElementException in reader.getCurrent*.

2018-03-12 Thread Romain Manni-Bucau
Hmm, wrong link? Iterator contract doesn't require hasNext() to be called compared to beam (even if it is highly recommanded). Aligned on beam it means the runner can acll getCurrent() without calling start or advanced and consider it is done if there is a NSEE which will likely never works since s

Re: NoSuchElementException in reader.getCurrent*.

2018-03-12 Thread Thomas Groh
The correct sequencing and respecting return values of `start` and `advance` is a precondition, and `NoSuchElementException` is the failure mode if the precondition isn't met. Documenting the behavior in the case of precondition failures is entirely reasonable. For example, look at Java's `Iterator

Re: NoSuchElementException in reader.getCurrent*.

2018-03-12 Thread Romain Manni-Bucau
I agree Thomas but I kind of read it as "yes we can drop that constraint". If not we should also check we are used in a thread safe context etc which will likely never hit the user sdk API so why doing that case a particular case? Am I missing something? Romain Manni-Bucau @rmannibucau

Re: NoSuchElementException in reader.getCurrent*.

2018-03-12 Thread Thomas Groh
If a call to `getCurrentWhatever` happens after `start` or `advance` has returned false, it's a bug in the runner, but the reader needs to be able to fail, otherwise you'll get a synthetic element that doesn't really exist. If a reader throws `NoSuchElementException` after the most recent call retu

NoSuchElementException in reader.getCurrent*.

2018-03-12 Thread Romain Manni-Bucau
Hi guys, why reader#getCurrent* can throw NoSuchElementException, my understanding is that the runner will guarantee that start or advance was called and returned true when calling getCurrent so this is a case which shouldn't happen, no? Romain Manni-Bucau @rmannibucau

Re: [YouTube channel] Add video: Apache Beam meetup London 2: use case in finance + IO in Beam and Splittable DoFns

2018-03-12 Thread Matthias Baetens
Hey Ismaël, Thanks for the feedback. I think that is a great idea tbh. When getting to know Beam it was more of a coincidence bumping into these and having one repository with all of them is kind of what I was looking to do with the Beam YouTube channel. Either linking them on the page or some oth

Re: [PROPOSITION] schedule some sanity tests on a daily basis

2018-03-12 Thread Etienne Chauchot
Thanks everyone for your comments and support. Le vendredi 09 mars 2018 à 21:28 +, Alan Myrvold a écrit : > Great ideas. I want to see a daily signal for anything that could prevent a > release from happening, and precommits > that are fast and reliable for areas that are commonly broken by co

Re: [PROPOSITION] schedule some sanity tests on a daily basis

2018-03-12 Thread Etienne Chauchot
Le samedi 10 mars 2018 à 12:57 +0100, Łukasz Gajowy a écrit : > > - Integration tests: AFAIK we only run the ones in examples module and only > > on demand. What about running all the IT > > (in > > particular IO IT) as a cron job on a daily basis with direct runner? Please > > note that it will

Re: [PROPOSITION] schedule some sanity tests on a daily basis

2018-03-12 Thread Etienne Chauchot
Hi JB, Le samedi 10 mars 2018 à 06:59 +0100, Jean-Baptiste Onofré a écrit : > Good ideas ! > > Validates runner tests and Integration tests should be nightly executed. > > For the Performance tests, it's a great idea, but not sure daily basis  > is required. Maybe two times per week ? As these te

Re: [PROPOSITION] schedule some sanity tests on a daily basis

2018-03-12 Thread Etienne Chauchot
Le vendredi 09 mars 2018 à 20:57 +, Kenneth Knowles a écrit : > On Fri, Mar 9, 2018 at 3:08 AM Etienne Chauchot wrote: > > Hi guys, > > > > I was looking at the various jenkins jobs and I wanted to submit a > > proposition: > > > > - Validates runner tests: currently run at PostCommit for a