It might even be materialized (to disk) if both derived data sets are
joined.
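
For illustration, the setup being discussed would look roughly like the sketch
below. This is untested; XmlProjectionInputFormat is a hypothetical custom input
format (the one Pieter describes, emitting records tagged with the name of the
projection they belong to), and the tag names and HDFS paths are placeholders:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class ProjectionSplitSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One pass over the XML input; the (hypothetical) input format tags every
        // record with the name of the projection / path expression it belongs to.
        DataSet<Tuple2<String, String>> tagged =
                env.createInput(new XmlProjectionInputFormat("hdfs:///path/to/input"));

        // Two DataSets derived from the same source. The two filters cannot be
        // chained to each other because they share one input, but 'tagged' is
        // streamed to them rather than materialized.
        DataSet<Tuple2<String, String>> pathA =
                tagged.filter(new FilterFunction<Tuple2<String, String>>() {
                    @Override
                    public boolean filter(Tuple2<String, String> record) {
                        return "pathA".equals(record.f0);
                    }
                });

        DataSet<Tuple2<String, String>> pathB =
                tagged.filter(new FilterFunction<Tuple2<String, String>>() {
                    @Override
                    public boolean filter(Tuple2<String, String> record) {
                        return "pathB".equals(record.f0);
                    }
                });

        // Joining the two derived DataSets is the case where the shared
        // intermediate result may be materialized (spilled to disk).
        pathA.join(pathB).where(1).equalTo(1)
             .writeAsText("hdfs:///path/to/output");

        env.execute("projection split sketch");
    }
}

As long as the two derived DataSets only go to separate sinks, the tagged
DataSet is simply streamed to the two filters; it is the join at the end that
may force the shared intermediate result to be spilled.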

2015-10-22 12:01 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:

> I fear that the filter operations are not chained because there are at
> least two of them which have the same DataSet as input. However, it's true
> that the intermediate results are not materialized.
>
> It is also correct that the filter operators are deployed co-located with the
> data sources, so there is no network traffic. However, the data will
> still be serialized/deserialized between the non-chained operators (even if
> they reside on the same machine).
>
>
>
> On Thu, Oct 22, 2015 at 11:49 AM, Gábor Gévay <gga...@gmail.com> wrote:
>
>> Hello!
>>
>> > I have thought about a workaround where the InputFormat would return
>> > Tuple2s and the first field is the name of the dataset to which a record
>> > belongs. This would however require me to filter the read data once for
>> > each dataset or to do a groupReduce, which is some overhead I'm
>> > looking to prevent.
>>
>> I think that those two filters might not have that much overhead,
>> because of several optimizations Flink does under the hood:
>> - The dataset of Tuple2s won't be materialized, but instead will be
>> streamed directly to the two filter operators.
>> - The input format and the two filters will probably end up on the
>> same machine, because of chaining, so there won't be
>> serialization/deserialization between them.
>>
>> Best,
>> Gabor
>>
>>
>>
>> 2015-10-22 11:38 GMT+02:00 Pieter Hameete <phame...@gmail.com>:
>> > Good morning!
>> >
>> > I have the following use case:
>> >
>> > My program reads nested data (in this specific case XML) based on
>> > projections (path expressions) of this data. Often multiple paths are
>> > projected onto the same input. I would like each path to result in its
>> > own dataset.
>> >
>> > Is it possible to generate more than 1 dataset using a readFile
>> > operation, to prevent reading the input twice?
>> >
>> > I have thought about a workaround where the InputFormat would return
>> > Tuple2s and the first field is the name of the dataset to which a record
>> > belongs. This would however require me to filter the read data once for
>> > each dataset or to do a groupReduce, which is some overhead I'm looking
>> > to prevent.
>> >
>> > Is there a better (less overhead) workaround for doing this? Or is there
>> > some mechanism in Flink that would allow me to do this?
>> >
>> > Cheers!
>> >
>> > - Pieter
>>
>
>
