Re: Bundling in ParDos

2018-05-23 Thread Etienne Chauchot
Hi Abdul, Going back to your use case: if the goal is to batch elements from an unbounded source, then you can use the GroupIntoBatches transform, which groups elements into batches (Iterables) of the size you specify. You can then process each batch downstream in your pipeline. PS: to add
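
For readers skimming the archive, the per-key batching that GroupIntoBatches is described as doing above can be sketched in plain Python. This is only an illustration of the idea, not the Beam API: the function name group_into_batches, its parameters, and the flush-at-end behaviour are assumptions made for the sketch.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple


def group_into_batches(
    elements: Iterable[Tuple[Hashable, object]], batch_size: int
) -> Iterator[Tuple[Hashable, List[object]]]:
    """Hypothetical stand-in for per-key batching (not Beam code).

    Buffers values per key and emits a (key, batch) pair as soon as a
    key's buffer holds batch_size values. Leftover partial batches are
    emitted once the input is exhausted.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    buffers: Dict[Hashable, List[object]] = defaultdict(list)
    for key, value in elements:
        buf = buffers[key]
        buf.append(value)
        if len(buf) == batch_size:
            # Emit a copy so that clearing the buffer does not affect it.
            yield key, list(buf)
            buf.clear()

    # Assumption for this sketch: flush whatever is left at end of input.
    for key, buf in buffers.items():
        if buf:
            yield key, list(buf)


if __name__ == "__main__":
    events = [("a", 1), ("a", 2), ("b", 10), ("a", 3), ("a", 4), ("a", 5)]
    for key, batch in group_into_batches(events, batch_size=2):
        print(key, batch)
```

The point of the sketch is that the batch size is set by the caller, independently of how a runner happens to split the input into bundles.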

Re: Bundling in ParDos

2018-05-22 Thread Eugene Kirpichov
Different runners decide it differently. E.g. for the Dataflow runner: in batch mode, bundles are usually quite large, e.g. something like several-dozen-MB chunks of files, or pretty big key ranges of something like BigTable or GroupByKey output. The bundle sizes are not known in advance (e.g. whe

Re: Bundling in ParDos

2018-05-22 Thread Abdul Qadeer
Hi Eugene! I had gone through that link before sending an email here. It does a decent job of explaining when to use which method and what kind of optimisations we are looking at, but it didn’t really answer the question I had, i.e. controlling the granularity of elements of a PCollection in a bundle. K

Re: Bundling in ParDos

2018-05-22 Thread Abdul Qadeer
Thanks for the insight, Kenneth. It would surprise me if the decision made by the runner about latency vs. amortized cost is non-deterministic. Are there any benchmarking results with respect to bundling kicking in somewhere? > On May 21, 2018, at 8:52 PM, Kenneth Knowles wrote: > > Hi Abdul, >

Re: Bundling in ParDos

2018-05-21 Thread Eugene Kirpichov
Hi Abdul, Please see https://stackoverflow.com/questions/45985753/what-is-the-difference-between-dofn-setup-and-dofn-startbundle - let me know if it answers your question sufficiently. On Mon, May 21, 2018 at 7:04 PM Abdul Qadeer wrote: > Hi! > > I was trying to understand the behavior of StartB

Re: Bundling in ParDos

2018-05-21 Thread Kenneth Knowles
Hi Abdul, The bundle is chosen by the runner in order to best balance low latency with amortized cost of FinishBundle and committing transactions, so you cannot generally control it, by design. If you have a very large amount of data coming through, or are running a batch job, then the runner will
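
To make the point about runner-chosen bundles concrete, here is a small plain-Python simulation. It is not Beam code: the class BufferingFn, its method names, and the run_bundles helper are hypothetical stand-ins for a DoFn that buffers elements and flushes them when its bundle finishes. It shows that the same elements produce different "batches" depending on how the input is split into bundles.

```python
from typing import Iterable, List


class BufferingFn:
    """Hypothetical DoFn-like object that batches per bundle (not Beam code)."""

    def start_bundle(self) -> None:
        # A fresh buffer for every bundle.
        self.buffer: List[int] = []

    def process(self, element: int) -> None:
        self.buffer.append(element)

    def finish_bundle(self) -> List[int]:
        # Flush everything buffered during this bundle as one batch.
        batch, self.buffer = self.buffer, []
        return batch


def run_bundles(fn: BufferingFn, bundles: Iterable[List[int]]) -> List[List[int]]:
    """Plays the role of a runner: it alone decides the bundle boundaries."""
    flushed = []
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            fn.process(element)
        flushed.append(fn.finish_bundle())
    return flushed


if __name__ == "__main__":
    fn = BufferingFn()
    # The same six elements, split two different ways by the "runner".
    print(run_bundles(fn, [[1, 2, 3, 4, 5, 6]]))
    print(run_bundles(fn, [[1], [2, 3], [4, 5, 6]]))
```

Because the flush happens at the end of each bundle, the batch sizes simply mirror whatever split the runner chose, which is why bundling alone cannot guarantee a particular batch size.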

Bundling in ParDos

2018-05-21 Thread Abdul Qadeer
Hi! I was trying to understand the behavior of StartBundle and FinishBundle w.r.t. DoFns. I have an unbounded data source and I am trying to leverage bundling to achieve batching. From the docs of ParDo: "when a ParDo transform is executed, the elements of the input PCollection are first divided