Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
It’s more like I have a window and I create a side input for that window. Once I am done processing the window, I want to discard that side input and create new ones for subsequent windows. On Tue, May 15, 2018 at 21:54 Harshvardhan Agrawal < harshvardhan.ag...@gmail.com> wrote: > In my cas

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
In my case, since I am performing a lookup for some reference data that can change periodically, I can’t really have a global window. I would want to update/re-lookup the data per window. On Tue, May 15, 2018 at 21:06 Raghu Angadi wrote: > You are applying windowing to 'billingDataPairs' in the exampl
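[Editor's note: a minimal sketch of the per-window pattern being described, assuming fixed one-minute windows and a hypothetical LookupAccountsFn that emits KV<String, Account>. Because the side input is derived from the same windowed PCollection as the main input, each window pairs with a freshly built map:]

```java
// Window the main input; everything derived from it keeps these windows.
PCollection<KV<String, BillingModel>> windowed =
    billingDataPairs.apply(
        Window.into(FixedWindows.of(Duration.standardMinutes(1))));

// Side input built from the same windowed data: one map per window,
// discarded once the window is done.
PCollectionView<Map<String, Account>> accountsView =
    windowed
        .apply(Keys.create())
        .apply(Distinct.create())
        .apply("LookupAccounts", ParDo.of(new LookupAccountsFn()))
        .apply(View.asMap());
```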

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Raghu Angadi
You are applying windowing to 'billingDataPairs' in the example above. A side input pairs with all the main input windows that exactly match or fall completely within the side input window. A common use case is a side input defined in the default global window, which matches all the main input windows.
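[Editor's note: a sketch of the common case mentioned; accountKvs, windowedBilling, and the element types are hypothetical. A side input left in the default global window is visible to every window of the main input:]

```java
// Side input in the default global window: it matches all main input windows.
PCollectionView<Map<String, Account>> globalView =
    accountKvs.apply(View.asMap());

windowedBilling.apply("Enrich",
    ParDo.of(new DoFn<KV<String, BillingModel>, Report>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // The global-window side input is mapped to whatever window
        // the current main-input element belongs to.
        Account acct = c.sideInput(globalView).get(c.element().getKey());
        // ... combine acct with c.element().getValue() and c.output(...) ...
      }
    }).withSideInputs(globalView));
```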

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
Thank you for your help! On Tue, May 15, 2018 at 20:05 Lukasz Cwik wrote: > Yes, that is correct. > > On Tue, May 15, 2018 at 4:40 PM Harshvardhan Agrawal < > harshvardhan.ag...@gmail.com> wrote: > >> Got it. >> >> Since I am not applying any windowing strategy to the side input, does >> Beam au

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
Yes, that is correct. On Tue, May 15, 2018 at 4:40 PM Harshvardhan Agrawal < harshvardhan.ag...@gmail.com> wrote: > Got it. > > Since I am not applying any windowing strategy to the side input, does > Beam automatically pick up the windowing strategy for the side inputs from > the main input? By t

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
Got it. Since I am not applying any windowing strategy to the side input, does Beam automatically pick up the windowing strategy for the side inputs from the main input? By that I mean the scope of the side input would be per window, and it would be different for every window. Is that correct?

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
Using deduplication + side inputs will allow you to have a consistent view of the account information for the entire window, which can be nice since it gives consistent processing semantics, but using a simple in-memory cache to reduce the number of lookups will likely be much easier to debug and simpl
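[Editor's note: a sketch of the in-memory cache alternative, with a hypothetical AccountService client and buildReport helper; each DoFn instance caches lookups so repeated account IDs skip the remote call:]

```java
class CachedLookupFn extends DoFn<KV<String, BillingModel>, Report> {
  // transient: rebuilt per worker rather than serialized with the DoFn
  private transient ConcurrentHashMap<String, Account> cache;
  private transient AccountService service;  // hypothetical client

  @Setup
  public void setup() {
    cache = new ConcurrentHashMap<>();
    service = AccountService.connect();
  }

  @ProcessElement
  public void process(ProcessContext c) {
    // Only the first occurrence of an account ID triggers a remote lookup.
    Account acct = cache.computeIfAbsent(
        c.element().getKey(), service::lookup);
    c.output(buildReport(c.element().getValue(), acct));  // hypothetical
  }
}
```

[The trade-off noted above applies: a plain cache can serve stale entries across windows, so an expiring cache, e.g. Guava's CacheBuilder with expireAfterWrite, is a common refinement.]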

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
Thanks Raghu! Lukasz, do you think lookups would be a better option than side inputs in my case? On Tue, May 15, 2018 at 16:33 Raghu Angadi wrote: > It should work. I think you need to apply Distinct before looking up account > info: > billingDataPairs.apply(Keys.create()).apply(Distinct.create

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Raghu Angadi
It should work. I think you need to apply Distinct before looking up account info: billingDataPairs.apply(Keys.create()).apply(Distinct.create()).apply("LookupAccounts", ...). Note that all of the accounts are stored in a single in-memory map. It should be small enough for that. On Tue, May 15, 2018 a
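[Editor's note: the inline snippet above, filled out a little as a sketch; LookupAccountsFn is a hypothetical DoFn emitting KV<String, Account>:]

```java
PCollectionView<Map<String, Account>> accountsView =
    billingDataPairs
        .apply(Keys.create())       // account IDs only
        .apply(Distinct.create())   // one lookup per distinct ID per window
        .apply("LookupAccounts", ParDo.of(new LookupAccountsFn()))
        .apply(View.asMap());       // the single in-memory map noted above
```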

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
Well, I actually made the example a little simpler than the real case. In the actual pipeline I have multiple reference datasets. Say I have a tuple of Account and Product as the key. The reason we don’t do the lookup in the DoFn directly is that we don’t want to look up the data for the same account or same prod

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
Is there a reason you don't want to read the account information within the DoFn directly from the datastore? It seems like that would be your simplest approach. On Tue, May 15, 2018 at 12:43 PM Harshvardhan Agrawal < harshvardhan.ag...@gmail.com> wrote: > Hi, > > No we don’t receive any such
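[Editor's note: the suggestion sketched out, with a hypothetical DatastoreClient and buildReport helper; the connection lives for the lifetime of the DoFn instance rather than per element:]

```java
class DirectLookupFn extends DoFn<BillingModel, Report> {
  private transient DatastoreClient client;  // hypothetical client

  @Setup
  public void setup() { client = DatastoreClient.connect(); }

  @Teardown
  public void teardown() { client.close(); }

  @ProcessElement
  public void process(ProcessContext c) {
    // One lookup per element, straight against the external store.
    Account acct = client.getAccount(c.element().getAccountId());
    c.output(buildReport(c.element(), acct));  // hypothetical helper
  }
}
```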

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
Hi, No, we don’t receive any such information from Kafka. The account information in the external store does change. Every time we have a change in the account information, we will have to recompute all the billing info. Our source systems will make sure that they publish messages for those account

Re: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
For each BillingModel you receive over Kafka, how "fresh" should the account information be? Does the account information in the external store change? On Tue, May 15, 2018 at 11:22 AM Harshvardhan Agrawal < harshvardhan.ag...@gmail.com> wrote: > Hi, > > We have certain billing data that arrives

Fwd: Create SideInputs for PTransform using lookups for each window of data

2018-05-15 Thread Harshvardhan Agrawal
Hi, We have certain billing data that arrives to us from Kafka. The billing data is in JSON and it contains an account ID. In order for us to generate the final report, we need to use some account data that is associated with the account ID and is stored in an external database. It is possible that we get
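[Editor's note: for context, the ingestion step might look like this sketch; the topic, broker address, and parseBilling JSON decoder are all hypothetical:]

```java
PCollection<BillingModel> billing = pipeline
    .apply(KafkaIO.<String, String>read()
        .withBootstrapServers("kafka:9092")
        .withTopic("billing")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata())           // drop Kafka metadata, keep KVs
    .apply(Values.create())           // keep just the JSON payloads
    .apply("ParseJson", MapElements
        .into(TypeDescriptor.of(BillingModel.class))
        .via(json -> parseBilling(json)));
```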

Beam PTransform names not showing on Flink UI

2018-05-15 Thread Harshvardhan Agrawal
Hi, I am currently working on building a pipeline using Apache Beam on top of Flink. One of the things we observed was that the DAG that gets created in the Flink UI doesn't contain the names we gave to the PTransforms. As a result, it becomes hard to debug the pipeline or even understand t
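[Editor's note: for reference, transform names come from the two-argument apply; whether a runner surfaces them in its UI is runner-specific, and the Flink runner's operator naming has varied across versions. A sketch with hypothetical DoFns:]

```java
// The string passed to apply() becomes the transform's name; without it
// Beam autogenerates one (e.g. "ParDo(Anonymous)").
input
    .apply("ParseBillingJson", ParDo.of(new ParseFn()))
    .apply("LookupAccountInfo", ParDo.of(new LookupFn()));
```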

Controlling parallelism of a ParDo Transform while writing to DB

2018-05-15 Thread Harshvardhan Agrawal
Hi guys, I am currently in the process of developing a pipeline using Apache Beam with Flink as the execution engine. As part of the process, I read data from Kafka and perform a number of transformations that involve joins and aggregations, as well as lookups to an external DB. The idea is that we wa
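[Editor's note: as far as I know the core SDK has no per-ParDo parallelism knob; with the Flink runner the usual lever is the job-wide parallelism pipeline option. A sketch:]

```java
FlinkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
options.setParallelism(4);  // whole-job setting, not per transform
Pipeline p = Pipeline.create(options);
```

[Within a single ParDo, batching writes, e.g. with GroupIntoBatches, is a common way to limit pressure on the DB.]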

Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-15 Thread chandan prakash
Also, 3. With Spark 2.x Structured Streaming, if we want to switch across different modes, e.g. from micro-batching to continuous streaming mode, how can that be done while using Beam? These are some of the initial questions which I have not been able to answer yet. Regards, Chandan On Tue, M

Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-15 Thread chandan prakash
Hi everyone, I have just started exploring and understanding Apache Beam for a new project in my firm. In particular, we have to decide whether to implement our product over Spark Streaming (as Spark batch is already in our ecosystem) or use Beam over the Spark runner to have future lib