Hi Abdul,
Going back to your use case: if the goal is to batch elements from an
unbounded source, you can use the GroupIntoBatches transform, which groups
elements into batches (Iterables) of the size you specify. You can then
process each batch downstream in your pipeline.
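To make the batching behaviour concrete, here is a minimal plain-Python sketch of per-key batching. It does not use Beam: `group_into_batches` is a hypothetical helper written for illustration only. In the Beam Python SDK the corresponding transform is `beam.GroupIntoBatches(batch_size)`, applied to a PCollection of key-value pairs. The final flush of partial batches below is an assumption of this sketch; in a real pipeline the runner decides when leftover elements are emitted.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple


def group_into_batches(
    pairs: Iterable[Tuple[Hashable, object]], batch_size: int
) -> Iterator[Tuple[Hashable, List[object]]]:
    """Yield (key, batch) pairs holding at most batch_size values each.

    Hypothetical helper, not a Beam API: it only mimics the idea of
    buffering values per key and emitting a batch once it is full.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")

    buffers: Dict[Hashable, List[object]] = defaultdict(list)
    for key, value in pairs:
        buffers[key].append(value)
        if len(buffers[key]) == batch_size:
            # The batch for this key is full: emit it and start a new one.
            yield key, buffers.pop(key)

    # Assumption of this sketch: flush partial batches once the input ends.
    for key, values in buffers.items():
        yield key, values


events = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5), ("a", 6), ("a", 7)]
for key, batch in group_into_batches(events, batch_size=2):
    print(key, batch)
```

Each emitted batch holds values for a single key, so downstream steps can treat a batch as one unit of work, for example one bulk write.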
PS: to add
Different runners decide it differently.
E.g. for the Dataflow runner: in batch mode, bundles are usually quite
large, e.g. something like several-dozen-MB chunks of files, or pretty big
key ranges of something like BigTable or GroupByKey output. The bundle
sizes are not known in advance (e.g. whe
Hi Eugene!
I had gone through that link before sending an email here. It does a decent job
explaining when to use which method and what kind of optimisations we are
looking at, but it didn’t really answer my question, i.e. controlling the
granularity of the elements of a PCollection in a bundle. K
Thanks for the insight, Kenneth. It would surprise me if the decision made
by the runner about latency vs. amortized cost is non-deterministic. Are there
any benchmark results showing when bundling kicks in?
> On May 21, 2018, at 8:52 PM, Kenneth Knowles wrote:
>
> Hi Abdul,
>
Hi Abdul,
Please see
https://stackoverflow.com/questions/45985753/what-is-the-difference-between-dofn-setup-and-dofn-startbundle
and let me know if it answers your question sufficiently.
On Mon, May 21, 2018 at 7:04 PM Abdul Qadeer wrote:
> Hi!
>
> I was trying to understand the behavior of StartB
Hi Abdul,
The bundle is chosen by the runner in order to best balance low latency
with amortized cost of FinishBundle and committing transactions, so you
cannot generally control it, by design. If you have a very large amount of
data coming through, or are running a batch job, then the runner will
Hi!
I was trying to understand the behavior of StartBundle and FinishBundle
w.r.t. DoFns.
I have an unbounded data source and I am trying to leverage bundling to
achieve batching.
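The pattern in question can be sketched in plain Python as follows. This is only an illustration and does not use Beam: `BufferingFn` and `run_bundles` are hypothetical names, and the method names mirror the `start_bundle`, `process` and `finish_bundle` hooks of a Beam Python `DoFn`. The bundle boundaries passed to `run_bundles` are made up; in a real pipeline the runner chooses them.

```python
from typing import Callable, Iterable, List


class BufferingFn:
    """Hypothetical stand-in for a DoFn that buffers elements per bundle."""

    def __init__(self, max_batch_size: int, sink: Callable[[List[int]], None]):
        self._max_batch_size = max_batch_size
        self._sink = sink
        self._buffer: List[int] = []

    def start_bundle(self) -> None:
        # A fresh buffer for every bundle.
        self._buffer = []

    def process(self, element: int) -> None:
        self._buffer.append(element)
        if len(self._buffer) >= self._max_batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Whatever is left must be flushed before the bundle is committed.
        self._flush()

    def _flush(self) -> None:
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer = []


def run_bundles(fn: BufferingFn, bundles: Iterable[Iterable[int]]) -> None:
    """Drive the lifecycle the way a runner would, one bundle at a time."""
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            fn.process(element)
        fn.finish_bundle()


flushed: List[List[int]] = []
run_bundles(BufferingFn(3, flushed.append), [[1, 2, 3, 4], [5], [6, 7]])
print(flushed)
```

A batch never spans two bundles in this sketch, so small bundles produce small batches regardless of `max_batch_size`.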
From the docs of ParDo:
"when a ParDo transform is executed, the elements of the input PCollection
are first divided