Hello!

> I have thought about a workaround where the InputFormat would return
> Tuple2s and the first field is the name of the dataset to which a record
> belongs. This would however require me to filter the read data once for
> each dataset or to do a groupReduce, which is some overhead I'm
> looking to prevent.

I think those two filters might not add much overhead, because of several
optimizations Flink does under the hood:

- The dataset of Tuple2s won't be materialized; instead, it will be streamed
  directly to the two filter operators.
- The input format and the two filters will probably end up on the same
  machine because of chaining, so there won't be any
  serialization/deserialization between them.

Best,
Gabor

2015-10-22 11:38 GMT+02:00 Pieter Hameete <[email protected]>:
> Good morning!
>
> I have the following use case:
>
> My program reads nested data (in this specific case XML) based on
> projections (path expressions) of this data. Often multiple paths are
> projected onto the same input. I would like each path to result in its own
> dataset.
>
> Is it possible to generate more than one dataset using a readFile operation
> to prevent reading the input twice?
>
> I have thought about a workaround where the InputFormat would return Tuple2s
> and the first field is the name of the dataset to which a record belongs.
> This would however require me to filter the read data once for each dataset
> or to do a groupReduce, which is some overhead I'm looking to prevent.
>
> Is there a better (less overhead) workaround for doing this? Or is there
> some mechanism in Flink that would allow me to do this?
>
> Cheers!
>
> - Pieter
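
For reference, the tag-then-filter workaround discussed in this thread can be modeled roughly as below. This is only a plain-Java sketch of the dataflow, not Flink code: the `Tagged` class stands in for Flink's Tuple2, and `readOnce`, `select`, and the `path=value` line format are all hypothetical names and assumptions made for illustration. Note also that this model keeps the tagged records in a list, whereas in Flink they would be streamed to the filters without being materialized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java model of the workaround; none of this is Flink API.
class TaggedSplit {

    // Stand-in for a Tuple2<String, String>: (dataset name, record value).
    static final class Tagged {
        final String dataset;
        final String value;

        Tagged(String dataset, String value) {
            this.dataset = dataset;
            this.value = value;
        }
    }

    // Hypothetical single read pass. Assumes each line looks like
    // "path=value" and uses the path as the dataset tag.
    static List<Tagged> readOnce(List<String> lines) {
        List<Tagged> out = new ArrayList<>();
        for (String line : lines) {
            int sep = line.indexOf('=');
            if (sep < 0) {
                continue; // skip lines that do not match the assumed format
            }
            out.add(new Tagged(line.substring(0, sep), line.substring(sep + 1)));
        }
        return out;
    }

    // One filter per target dataset: keep the records carrying the given
    // tag, then drop the tag.
    static List<String> select(List<Tagged> tagged, String dataset) {
        return tagged.stream()
                .filter(t -> t.dataset.equals(dataset))
                .map(t -> t.value)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "/book/title=Dune",
                "/book/author=Herbert",
                "/book/title=Emma");

        // The input is parsed once; each dataset is then one filter away.
        List<Tagged> tagged = readOnce(input);
        System.out.println(select(tagged, "/book/title"));  // prints [Dune, Emma]
        System.out.println(select(tagged, "/book/author")); // prints [Herbert]
    }
}
```

In a Flink program the same shape would be one source producing the tagged tuples, followed by one filter per path expression.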
