It might even be materialized (to disk) if the two derived data sets are later joined with each other.
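For illustration, here is a rough, untested sketch of the Tuple2-tagging workaround Pieter describes below. The fromElements source only stands in for the custom XML InputFormat (which would normally be plugged in via env.createInput), and all class names, tags, and paths are placeholders:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class SplitOneReadIntoTwoDataSets {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the projection-aware XML InputFormat: in the real job
        // this would be env.createInput(new YourXmlInputFormat(...)), emitting
        // each record tagged with the name of the dataset it belongs to.
        DataSet<Tuple2<String, String>> tagged = env.fromElements(
                new Tuple2<>("authors", "<author>Jane Doe</author>"),
                new Tuple2<>("titles", "<title>Some Title</title>"),
                new Tuple2<>("authors", "<author>John Doe</author>"));

        // One cheap filter per target dataset; the tagged DataSet is streamed
        // to both filters and is not materialized in between.
        DataSet<String> authors = tagged
                .filter(t -> t.f0.equals("authors"))
                .map(t -> t.f1)
                .returns(String.class);

        DataSet<String> titles = tagged
                .filter(t -> t.f0.equals("titles"))
                .map(t -> t.f1)
                .returns(String.class);

        authors.writeAsText("file:///tmp/authors");
        titles.writeAsText("file:///tmp/titles");

        // A single execution reads the source once and fills both sinks.
        env.execute("split one read into two datasets");
    }
}

Both derived DataSets come out of one execution, so the input is read only once. As discussed below, the two filters may not be chained to the source because they share the same input, but they will still be deployed co-located with it.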
2015-10-22 12:01 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:

> I fear that the filter operations are not chained because there are at
> least two of them which have the same DataSet as input. However, it's true
> that the intermediate results are not materialized.
>
> It is also correct that the filter operators are deployed colocated to the
> data sources. Thus, there is no network traffic. However, the data will
> still be serialized/deserialized between the not-chained operators (also
> if they reside on the same machine).
>
> On Thu, Oct 22, 2015 at 11:49 AM, Gábor Gévay <gga...@gmail.com> wrote:
>
>> Hello!
>>
>> > I have thought about a workaround where the InputFormat would return
>> > Tuple2s and the first field is the name of the dataset to which a
>> > record belongs. This would however require me to filter the read data
>> > once for each dataset or to do a groupReduce, which is some overhead
>> > I'm looking to prevent.
>>
>> I think that those two filters might not have that much overhead,
>> because of several optimizations Flink does under the hood:
>> - The dataset of Tuple2s won't be materialized, but instead will be
>>   streamed directly to the two filter operators.
>> - The input format and the two filters will probably end up on the
>>   same machine, because of chaining, so there won't be
>>   serialization/deserialization between them.
>>
>> Best,
>> Gabor
>>
>> 2015-10-22 11:38 GMT+02:00 Pieter Hameete <phame...@gmail.com>:
>> > Good morning!
>> >
>> > I have the following use case:
>> >
>> > My program reads nested data (in this specific case XML) based on
>> > projections (path expressions) of this data. Often multiple paths are
>> > projected onto the same input. I would like each path to result in
>> > its own dataset.
>> >
>> > Is it possible to generate more than one dataset using a readFile
>> > operation, to prevent reading the input twice?
>> >
>> > I have thought about a workaround where the InputFormat would return
>> > Tuple2s and the first field is the name of the dataset to which a
>> > record belongs. This would however require me to filter the read data
>> > once for each dataset or to do a groupReduce, which is some overhead
>> > I'm looking to prevent.
>> >
>> > Is there a better (less overhead) workaround for doing this? Or is
>> > there some mechanism in Flink that would allow me to do this?
>> >
>> > Cheers!
>> >
>> > - Pieter