It is indeed well documented that numShards is required for unbounded
input. And I do believe that a helpful error is thrown in the case of
unbounded input and runner-determined sharding.
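To illustrate the documented requirement, here's a minimal sketch assuming the Beam Java SDK (the destination path, window size, and shard count are placeholders, not from the thread):

```java
// Writing an unbounded, windowed PCollection with FileIO: an explicit
// shard count must be supplied, because runner-determined sharding
// (numShards = 0, the default) is rejected for unbounded input.
PCollection<String> lines = ...;  // some unbounded input

lines
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
    .apply(FileIO.<String>write()
        .via(TextIO.sink())
        .to("gs://my-bucket/output/")   // placeholder destination
        .withNumShards(10));            // required for unbounded input
```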
I do believe there's still a bug here; it's just wandered quite a bit from
the original title of the thread.
Actually, this is a documented known issue.
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L152
On Fri, Jan 11, 2019 at 9:23 AM Jeff Klukas wrote:
> Indeed, I was wrong about the ValueProvider distinction. I updated that in
> the JIRA.
It's when numShards is 0 (so runner-determined sharding) vs. an explicit
number. Things work fine for explicit sharding. It's the runner-determined
sharding mode that encounters the Flatten of PCollections with conflicting
windowing.
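To make the distinction concrete, a sketch assuming the Beam Java SDK (`input` and the output prefix are placeholders):

```java
// Explicit sharding: works for windowed/unbounded writes.
input.apply(TextIO.write()
    .to("output")              // placeholder file prefix
    .withWindowedWrites()
    .withNumShards(5));        // explicit shard count

// Runner-determined sharding: numShards defaults to 0, i.e. "let the
// runner decide" -- this is the mode that hits the Flatten problem.
input.apply(TextIO.write()
    .to("output")
    .withWindowedWrites());    // no withNumShards -> runner-determined
```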
A runner is free to process things in streaming mode, batch mode, or
even alternate between the two. Generally there are certain
efficiencies/simplifications that only work (well) in batch mode, and
on the other hand the presence of an unbounded source means one cannot
wait for a PCollection to be complete.
A question for the runner implementers:
The Beam model is stream vs batch agnostic. But I have use cases where we
replay history (from BigTable or BigQuery) but then transition into
streaming.
Now with Splittable DoFns it's easier to create inputs that start batch,
then go streaming. But I have