Pipeline design including (sub / nested)-pipelines

2018-08-16 Thread Pascal Gula
Hello, I am currently evaluating Apache Beam (later executing on Google DataFlow), and for the first use-case I am working on, I have a kind of design question to see if any of you already had a similar one. Namely, we have a DB describing dashboard views, and for each view, we would like to perfor

Re: Pipeline design including (sub / nested)-pipelines

2018-08-16 Thread Pascal Gula
As a bonus, here is a simplified diagram view of the use-case: Cheers, Pascal On Thu, Aug 16, 2018 at 3:12 PM, Pascal Gula wrote: > Hello, > I am currently evaluating Apache Beam (later executing on Google > DataFlow), and for the first use-case I am working on, I have a kinda > design questio

Re: Retry on failures when using KafkaIO and Google Dataflow

2018-08-16 Thread Lukasz Cwik
Raghu, based upon your description, do you think it would be a good idea for KafkaIO to checkpoint on the first read without producing any actual records? On Wed, Aug 15, 2018 at 11:49 AM Raghu Angadi wrote: > > It is due to "enable.autocommit=true". Auto commit is an option to Kafka > client and ho

Re: Pipeline design including (sub / nested)-pipelines

2018-08-16 Thread Robin Qiu
Hi Pascal, As far as I know, you can't create a sub-pipeline within a DoFn, i.e. nested pipelines are not supported. Best, Robin On Thu, Aug 16, 2018 at 7:03 AM Pascal Gula wrote: > As a bonus, here is a simplified diagram view of the use-case: > > Cheers, > Pascal > > > On Thu, Aug 16, 2018 at

Re: Retry on failures when using KafkaIO and Google Dataflow

2018-08-16 Thread Raghu Angadi
I am not sure if it helps much. External checkpoints like Kafka 'autocommit' outside Beam's own checkpoint domain will always have quirks like this. I wanted to ask what their use case for using autocommit was (over commitOffsetsInFinalize()). If we wanted to, it is not clear how an unbounded so
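For context, the commitOffsetsInFinalize() alternative that Raghu mentions is configured on the KafkaIO read transform. A minimal sketch follows; the topic name, broker address, and downstream DoFn are hypothetical placeholders, not taken from the thread:

```java
// Sketch only: read from Kafka with offsets committed when Beam finalizes
// its own checkpoint, instead of relying on the Kafka client's autocommit.
pipeline
    .apply(KafkaIO.<Long, String>read()
        .withBootstrapServers("broker:9092")          // hypothetical
        .withTopic("events")                          // hypothetical
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .commitOffsetsInFinalize())   // ties offset commits to Beam checkpoints
    .apply(ParDo.of(new ProcessEventFn()));           // hypothetical DoFn
```

This keeps the committed offsets inside Beam's checkpoint domain, avoiding the mismatch between autocommit and Beam's own retry semantics that this thread discusses.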

Re: Pipeline design including (sub / nested)-pipelines

2018-08-16 Thread Pascal Gula
Hi Robin, this is unfortunate news, but I had already anticipated such an answer with an alternative implementation. It would however be interesting to support such a feature, since I am probably not the first person asking for this. Best regards, Pascal On Thu, Aug 16, 2018 at 6:20 PM, Robin Qiu wrote: >

Re: Retry on failures when using KafkaIO and Google Dataflow

2018-08-16 Thread André Missaglia
Hello everyone. Thanks for your answers. We've managed to solve our problem using the solutions proposed here. First, there was no need for using autocommit. But switching to "commitOffsetsInFinalize()" also didn't help. In our example, if a failure occurred when processing message #25, the runner

coder issue?

2018-08-16 Thread Mahesh Vangala
Hello all - I am trying to run a barebones Beam pipeline to understand the "combine" logic. I am from the Python world, trying to learn the Java Beam SDK due to my use case of ETL with a Spark cluster. So, pardon me for my grotesque Java code :) I would appreciate it if you could nudge me onto the right path with this

Re: Retry on failures when using KafkaIO and Google Dataflow

2018-08-16 Thread Raghu Angadi
On Thu, Aug 16, 2018 at 10:39 AM André Missaglia < andre.missag...@arquivei.com.br> wrote: > Hello everyone. > > Thanks for your answers. We've managed to solve our problem using the > solutions proposed here. > > First, there was no need for using autocommit. But switching to > "commitOffsetsInFi

Re: coder issue?

2018-08-16 Thread Robin Qiu
Hello Mahesh, You can add "implements Serializable" to the Accum class, then it should work. By the way, in Java String is immutable, so in order to change, for example, accum.line, you need to write accum.line = accum.line.concat(line). Best, Robin On Thu, Aug 16, 2018 at 10:42 AM Mahesh Vanga
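Robin's two fixes can be seen in a plain-Java sketch. The Accum class name comes from the thread; the addLine helper and its field are hypothetical stand-ins for Mahesh's actual code:

```java
import java.io.Serializable;

// A CombineFn accumulator must be serializable so the runner can ship it
// between workers; "implements Serializable" is the fix discussed above.
class Accum implements Serializable {
    String line = "";

    void addLine(String l) {
        // Java Strings are immutable: concat() returns a NEW String, so the
        // result has to be assigned back to the field, exactly as Robin notes.
        line = line.concat(l);
    }
}
```

For example, calling addLine("foo") and then addLine("bar") leaves accum.line as "foobar"; calling accum.line.concat(l) without the assignment would leave the field unchanged.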

Re: Pipeline design including (sub / nested)-pipelines

2018-08-16 Thread Lukasz Cwik
You can launch another Dataflow job from within an existing Dataflow job. For all intents and purposes, Dataflow won't know that the jobs are related in any way, so they will only be "nested" in the sense that your outer pipeline knows about the inner pipeline. You should be able to do this for all runners (gr
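A rough sketch of what Lukasz describes: code running in the outer job simply constructs and submits a second, independent pipeline. The project ID, job naming, and viewId variable are hypothetical, and error handling is omitted:

```java
// Build and launch a separate Dataflow job from inside the outer job.
// Dataflow treats it as an unrelated job; only your code knows they "nest".
DataflowPipelineOptions options =
    PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-project");              // hypothetical project ID
options.setJobName("inner-job-" + viewId);     // hypothetical naming scheme

Pipeline inner = Pipeline.create(options);
// ... build the per-view transforms here ...
inner.run();  // submits the job; result.waitUntilFinish() would block on it
```

Since the jobs are unrelated from the runner's point of view, the outer job must itself track the inner job's identity and outcome if it needs them.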

Re: coder issue?

2018-08-16 Thread Mahesh Vangala
Hello Robin - Thank you so much for your help. I added Serializable to Accum, and I got the following error. (Sorry, for being a pain. I hope once I get past the initial hump ...) Aug 16, 2018 3:00:09 PM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes INFO: Filepattern test_in.csv m

Re: coder issue?

2018-08-16 Thread Rui Wang
Hi Mahesh, I think I had the same NPE when I explored a self-defined CombineFn. I think your CombineFn might still need to define a coder to help Beam run it in a distributed environment. Beam tries to invoke the coder somewhere and then throws an NPE as there is none defined. Here is a PR I wrote that d
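One way to supply the missing accumulator coder, along the lines Rui suggests, is to override getAccumulatorCoder on the CombineFn. This sketch assumes the Accum class from the thread has been made Serializable as Robin advised; the String input type is an assumption about Mahesh's pipeline:

```java
// Tell Beam how to encode the accumulator when it moves between workers.
// SerializableCoder works here because Accum implements Serializable.
@Override
public Coder<Accum> getAccumulatorCoder(CoderRegistry registry,
                                        Coder<String> inputCoder) {
    return SerializableCoder.of(Accum.class);
}
```

Without this (or a coder registered for Accum some other way), Beam cannot materialize the accumulator, which is consistent with the NPE both posters hit.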

Re: coder issue?

2018-08-16 Thread Rui Wang
Sorry I forgot to attach the PR link: https://github.com/apache/beam/pull/6154/files#diff-7358f3f0511940ea565e6584f652ed02R342 -Rui On Thu, Aug 16, 2018 at 12:13 PM Rui Wang wrote: > Hi Mahesh, > > I think I had the same NPE when I explored self defined combineFn. I think > your combineFn might

Dynamic topologies / replay

2018-08-16 Thread Thomas Browne
I just watched the excellent presentation by Markku Leppisto in Singapore. I consult for a financial broker in London. Our use case is streaming financial data on which various analytics are performed to find relative value trading opportunities in the fixed income markets. I have two questions a

Re: Retry on failures when using KafkaIO and Google Dataflow

2018-08-16 Thread Raghu Angadi
The only way Reshuffle could help is if you are able to separate processing that is prone to errors from processing that produces side effects, i.e. IO --> DoFn_prone_to_exceptions --> Reshuffle --> (B) DoFn_producing_side_effects. This way, it might look like (B) processes each record exactly o
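Raghu's pattern, written out as a pipeline sketch. The DoFn class names follow his labels and are hypothetical:

```java
// Isolate the failure-prone processing from the side-effecting step with a
// Reshuffle, so retries of the first stage don't re-run the side effects.
input
    .apply("ErrorProne", ParDo.of(new DoFnProneToExceptions()))
    .apply(Reshuffle.viaRandomKey())   // checkpoint barrier between the stages
    .apply("SideEffects", ParDo.of(new DoFnProducingSideEffects()));
```

The Reshuffle acts as a stable materialization point: once a record has crossed it, a retry of the upstream stage no longer reaches the side-effecting DoFn for that record.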

Re: Retry on failures when using KafkaIO and Google Dataflow

2018-08-16 Thread Lukasz Cwik
Raghu, yes, I was thinking that advance would continuously return false. This would force a runner to checkpoint and then resume, at which point we would have a stable reading point. On Thu, Aug 16, 2018 at 3:17 PM Raghu Angadi wrote: > The only way reshuffle could help is if you are able to sepa

Re: Retry on failures when using KafkaIO and Google Dataflow

2018-08-16 Thread Raghu Angadi
On Thu, Aug 16, 2018 at 4:54 PM Lukasz Cwik wrote: > Raghu, yes, I was thinking that advance would continuously return false. > This would force a runner to checkpoint and then resume at which point we > would have a stable reading point. > When the actual checkpoint occurs is still transparent