The spilling will only happen when joining the branched data sets.
If you keep them separate and eventually emit them, no intermediate data
will be spilled.
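For example, a minimal sketch of that "branch and emit" shape (class name,
paths, and the text source are made up; any DataSet branches the same way):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class BranchAndEmit {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Stand-in for the real source.
    DataSet<String> source = env.readTextFile("hdfs:///input");

    // Branch 1 ends in its own sink.
    source.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public void flatMap(String value, Collector<String> out) {
        out.collect("A:" + value);
      }
    }).writeAsText("hdfs:///out/a");

    // Branch 2 ends in its own sink. Since no join/coGroup ever merges
    // the two branches, no intermediate result needs to be spilled.
    source.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public void flatMap(String value, Collector<String> out) {
        out.collect("B:" + value);
      }
    }).writeAsText("hdfs:///out/b");

    env.execute("branch-and-emit");
  }
}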
On Fri, May 4, 2018 at 6:05 PM, Flavio Pompermaier wrote:
Does this duplication happen when I write directly to disk after the
flatMaps?
On Fri, May 4, 2018 at 6:02 PM, Fabian Hueske wrote:
That will happen if you join (or coGroup) the branched DataSets, i.e., when
you have a branching-and-merging pattern in your program.
The problem in that case is that one of the inputs is pipelined (e.g., the
probe side of a hash join) and the other one is blocking.
In order to execute such a plan, we must fully materialize one of the
branches before the merging operator; otherwise the pipelined branches could
deadlock each other.
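Concretely, the branch-and-merge pattern looks something like this (a hedged
fragment reusing `source` from the sketch above; `ExtractA`/`ExtractB` are
assumed to be FlatMapFunctions producing Tuple2<Long, String>, and the key
positions and output path are illustrative):

// Both branches come from the same source ...
DataSet<Tuple2<Long, String>> branchA = source.flatMap(new ExtractA());
DataSet<Tuple2<Long, String>> branchB = source.flatMap(new ExtractB());

// ... and meet again in a binary operator. To run this without a
// deadlock, the runtime fully materializes one join input (the blocking
// build side) while the other is pipelined (the probe side).
branchA.join(branchB)
       .where(0)     // key field of branchA
       .equalTo(0)   // key field of branchB
       .projectFirst(0, 1)
       .writeAsCsv("hdfs:///out/joined");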
Hi Fabian,
thanks for the detailed reply.
The problem I see is that the source dataset is huge and, since it doesn't
fit in memory, it's spilled twice to disk (I checked the increasing disk
usage during the job and it corresponded exactly to the size estimated
by the Flink UI, that is, twice the dataset's size).
Hi Flavio,
No, there's no way around it.
A DataSet that is consumed by more than one operator cannot be forwarded
through chained operators.
The records need to be copied to avoid concurrent modifications. However,
the data should not be shipped over the network if all operators have the
same parallelism.
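As an illustration of why that copy matters (hypothetical POJO, sink path,
and a DataSet<Event> `events` that feeds two branches): if both branches
were handed the same object instance, an in-place mutation in one flatMap
would be visible to the other branch.

// Mutable POJO consumed by two branches.
public static class Event {
  public String tag;
  public long id;
  public Event() {}
}

// Branch A mutates its input in place. This is only safe because each
// branch gets its own copy of the record; with the same parallelism on
// both branches, those copies are handed over locally rather than being
// shipped across the network.
events.flatMap(new FlatMapFunction<Event, Event>() {
  @Override
  public void flatMap(Event e, Collector<Event> out) {
    e.tag = "branch-a";  // cannot leak into branch B's copy
    out.collect(e);
  }
}).writeAsText("hdfs:///out/a");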
Flink 1.3.1 (I'm waiting for 1.5 before upgrading.)
On Fri, May 4, 2018 at 2:50 PM, Amit Jain wrote:
Hi Flavio,
Which version of Flink are you using?
--
Thanks,
Amit
On Fri, May 4, 2018 at 6:14 PM, Flavio Pompermaier wrote:
Hi all,
I have a Flink batch job that reads a Parquet dataset and then applies 2
flatMaps to it (see pseudocode below).
The problem is that this dataset is quite big and Flink duplicates it before
sending the data to these 2 operators (I inferred this from the doubled
amount of sent bytes).
Is there a way to avoid this duplication?
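(The pseudocode did not survive in the archive; below is a hedged
reconstruction of the job shape described above, with placeholder names for
the input format and the two functions.)

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Big Parquet source; `parquetFormat` stands in for the actual
// InputFormat used by the job.
DataSet<Row> dataset = env.createInput(parquetFormat);

// Two flatMaps branch off the same DataSet. Each branch receives its
// own copy of the records, which is where the doubled "bytes sent"
// reported by the UI comes from.
DataSet<Row> first  = dataset.flatMap(new FirstFlatMap());
DataSet<Row> second = dataset.flatMap(new SecondFlatMap());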