Re: Bundling in ParDos

2018-05-22 Thread Eugene Kirpichov
Different runners decide it differently. E.g. for the Dataflow runner: in batch mode, bundles are usually quite large, e.g. something like several-dozen-MB chunks of files, or pretty big key ranges of something like BigTable or GroupByKey output. The bundle sizes are not known in advance (e.g. whe
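Since bundle sizes vary by runner and are not known in advance, per-element code that wants a predictable batch size usually has to do its own buffering inside the bundle. The following is a minimal sketch of that idea in plain Python. It does not import Beam: `BatchingFn`, `run_bundle`, and `max_batch` are hypothetical names used for illustration, and only the method names `start_bundle`, `process`, and `finish_bundle` are borrowed from the DoFn lifecycle. A real DoFn would emit its output through the SDK rather than through a callback.

```python
from typing import Callable, Iterable, List


class BatchingFn:
    """Plain-Python stand-in for a DoFn (hypothetical, not the Beam API).

    Buffers elements and flushes them in groups of at most `max_batch`,
    regardless of how large a bundle the runner happens to deliver.
    """

    def __init__(self, flush: Callable[[List[str]], None], max_batch: int = 3):
        self._flush = flush
        self._max_batch = max_batch
        self._buffer: List[str] = []

    def start_bundle(self) -> None:
        # A new bundle starts with an empty buffer.
        self._buffer = []

    def process(self, element: str) -> None:
        self._buffer.append(element)
        if len(self._buffer) >= self._max_batch:
            self._emit()

    def finish_bundle(self) -> None:
        # Flush whatever is left so nothing is held across bundles.
        if self._buffer:
            self._emit()

    def _emit(self) -> None:
        self._flush(list(self._buffer))
        self._buffer = []


def run_bundle(fn: BatchingFn, elements: Iterable[str]) -> None:
    """Simulates a runner handing one bundle to the function."""
    fn.start_bundle()
    for element in elements:
        fn.process(element)
    fn.finish_bundle()


batches: List[List[str]] = []
fn = BatchingFn(batches.append, max_batch=3)
run_bundle(fn, ["a", "b", "c", "d", "e"])  # a bundle of five elements
run_bundle(fn, ["f"])                      # a bundle of one element
print(batches)
```

With this approach the batch size seen downstream is bounded by `max_batch`, while the bundle boundaries chosen by the runner only determine where a partial batch gets flushed.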

Re: I'm back and ready to help grow our community!

2018-05-22 Thread Matthias Baetens
Same here - shame on me. Congratulations on the graduation Gris, very happy to have you back! On Tue, 22 May 2018 at 09:19 Ismaël Mejía wrote: > I somehow missed this email thread. > Congratulations Gris and welcome back! > > On Fri, May 18, 2018 at 5:34 AM Jesse Anderson > wrote: > > > Congrat

Re: I'm back and ready to help grow our community!

2018-05-22 Thread Ismaël Mejía
I somehow missed this email thread. Congratulations Gris and welcome back! On Fri, May 18, 2018 at 5:34 AM Jesse Anderson wrote: > Congrats! > On Thu, May 17, 2018, 6:44 PM Robert Burke wrote: >> Congrats & welcome back! >> On Thu, May 17, 2018, 5:44 PM Huygaa Batsaikhan wrote: >>> Welcome

Re: Bundling in ParDos

2018-05-22 Thread Abdul Qadeer
Hi Eugene! I had gone through that link before sending an email here. It does a decent job explaining when to use which method and what kind of optimisations we are looking at, but didn’t really answer the question I had, i.e. controlling the granularity of the elements of a PCollection in a bundle. K

Re: Bundling in ParDos

2018-05-22 Thread Abdul Qadeer
Thanks for the insight, Kenneth. It would surprise me if the decision made by the runner about latency vs. amortized cost is non-deterministic. Are there any benchmarking results showing where bundling kicks in? > On May 21, 2018, at 8:52 PM, Kenneth Knowles wrote: > > Hi Abdul, >