Re: preserve order of records while writing in a file

2018-08-21 Thread Ankur Goenka
In case of multiple files, you can use Dataflow to parallelize processing across individual files. However, as mentioned earlier, processing records within a single file is not worth parallelizing in this case. Your pipeline can start with a fixed set of file names followed by GroupBy (to shuffle the file nam
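
The idea of parallelizing across files while keeping each file sequential can be sketched outside Dataflow. This is only an illustration: it swaps Dataflow for a local thread pool from the Python standard library, and the names `copy_in_order`, `process_files`, and `transform` are hypothetical, not part of Beam or of the original message. It assumes one record per line.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def copy_in_order(src_path: str, dst_path: str,
                  transform: Callable[[str], str]) -> int:
    """Process one file strictly sequentially, so record order is kept."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(transform(line.rstrip("\n")) + "\n")
            count += 1
    return count


def process_files(jobs: List[Tuple[str, str]],
                  transform: Callable[[str], str],
                  max_workers: int = 4) -> List[int]:
    """Run one sequential job per (input, output) pair; files run in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_in_order, src, dst, transform)
                   for src, dst in jobs]
        # Results are collected in the order the jobs were submitted.
        return [f.result() for f in futures]
```

Parallelism here exists only between files; within a file there is a single reader and a single writer, which is what preserves the record order.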

Re: preserve order of records while writing in a file

2018-08-21 Thread asharma . gd
On 2018/08/21 16:20:13, Lukasz Cwik wrote: > I would agree with Eugene. A simple application that does this is probably > what you're looking for. > > There are ways to make this work with parallel processing systems but it's > quite a hassle and only worthwhile if your computation is very expen

Re: Beam Summit London 2018

2018-08-21 Thread Griselda Cuevas
Hi there, We'll have 20min talks with 10min for Q&A, so 30min total. G On Tue, 21 Aug 2018 at 11:45, javier ramirez < javier.ramirez.gom...@gmail.com> wrote: > Hi, > > What'd be the duration of the talks? So I can scope the contents of my > proposal. > > Looking forward to the summit! > > J

Re: Controlling Kafka Checkpoint Persistence

2018-08-21 Thread Raghu Angadi
On Tue, Aug 21, 2018 at 2:49 PM Micah Whitacre wrote: > > Is there a reason you can't trust the runner to be durable storage for > in-process work? > > That's a fair question. Are there any good resources documenting the > durability/stability of the different runners? I assume there are some >

Re: Controlling Kafka Checkpoint Persistence

2018-08-21 Thread Micah Whitacre
Thanks for the response. I was talking about using "commitOffsetsInFinalize". > It is invoked after the first stage ('Simple Transformation' in your case). Does the Group By Key step cause it? > Is there a reason you can't trust the runner to be durable storage for in-process work? That's a fa

Re: Controlling Kafka Checkpoint Persistence

2018-08-21 Thread Raghu Angadi
On Tue, Aug 21, 2018 at 2:04 PM Lukasz Cwik wrote: > Is there a reason you can't trust the runner to be durable storage for > in-process work? > > I can understand that the DirectRunner only stores things in memory but > other runners have stronger durability guarantees. > I think the requirement

Re: Controlling Kafka Checkpoint Persistence

2018-08-21 Thread Lukasz Cwik
Is there a reason you can't trust the runner to be durable storage for in-process work? I can understand that the DirectRunner only stores things in memory but other runners have stronger durability guarantees. On Tue, Aug 21, 2018 at 9:58 AM Raghu Angadi wrote: > I think by 'KafkaUnboundedSourc

Re: Beam Summit London 2018

2018-08-21 Thread javier ramirez
Hi, What'd be the duration of the talks? So I can scope the contents of my proposal. Looking forward to the summit! J On Tue, 21 Aug 2018, 14:47 Pascal Gula, wrote: > Hi Matthias, > we (Peat / Plantix) might be interested in submitting a talk and I would > like to know if we can get access to

Re: Controlling Kafka Checkpoint Persistence

2018-08-21 Thread Raghu Angadi
I think by 'KafkaUnboundedSource checkpointing' you mean enabling 'commitOffsetsInFinalize()' on KafkaIO source. It is a better option than enable.auto.commit, but does not do exactly what you want at the moment. It is invoked after the first stage ('Simple Transformation' in your case). This is cer

Re: preserve order of records while writing in a file

2018-08-21 Thread Lukasz Cwik
I would agree with Eugene. A simple application that does this is probably what you're looking for. There are ways to make this work with parallel processing systems but it's quite a hassle and only worthwhile if your computation is very expensive and you want the additional computational power of multip

Controlling Kafka Checkpoint Persistence

2018-08-21 Thread Micah Whitacre
I'm starting with a very simple pipeline that will read from Kafka -> Simple Transformation -> GroupByKey -> Persist the data. We are also applying some simple windowing/triggering that will persist the data after every 100 elements or every 60 seconds to balance slow trickles of data as well as n

Re: preserve order of records while writing in a file

2018-08-21 Thread Eugene Kirpichov
It sounds like you want to sequentially read a file, sequentially process the records and sequentially write them. The best way to do this is likely without using Beam: just write some Java or Python code using standard file APIs (use Beam's FileSystem APIs if you need to access data on a non-local
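
A minimal sketch of that sequential approach, in Python with standard file APIs only. The function name `process_file_in_order` and the `transform` callback are hypothetical stand-ins for the chain of transformations in the question, and the sketch assumes one record per line.

```python
from typing import Callable


def process_file_in_order(src_path: str, dst_path: str,
                          transform: Callable[[str], str]) -> int:
    """Read records line by line, transform each, and write them out
    in the same order they were read. Returns the number of records."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:                      # streams; does not load the whole file
            record = line.rstrip("\n")
            dst.write(transform(record) + "\n")
            count += 1
    return count
```

Because there is one reader and one writer, the output order matches the input order by construction, and a single output file is produced without needing a unique key per record.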

preserve order of records while writing in a file

2018-08-21 Thread asharma . gd
Hi, I have to process a big file and call several ParDos to do some transformations. Records in the file don't have any unique key. Let's say the file 'testfile' has 1 million records. After processing, I want to generate only one output file, the same as my input 'testfile', and I also have a requirement

Re: Beam Summit London 2018

2018-08-21 Thread Pascal Gula
Hi Matthias, we (Peat / Plantix) might be interested in submitting a talk and I would like to know if we can get access to the list of already submitted titles to avoid submitting on a similar topic! Cheers, Pascal On Tue, Aug 21, 2018 at 1:59 PM, Matthias Baetens wrote: > Hi everyone, > > We are

Beam Summit London 2018

2018-08-21 Thread Matthias Baetens
Hi everyone, We are happy to invite you to the first Beam Summit in London. The summit will be held in London at Level39 on *October 1 and 2.* You can register to attend for free on the Eventbrite page