Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
In principle, a data set that branches only needs to be materialized if both branches are pipelined until they are merged (i.e., in a hybrid-hash join). Otherwise, the data flow might deadlock due to pipelining. If you group both data sets before they are joined, the pipeline is broken due to the b

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Pieter Hameete
Thanks for your responses! The derived datasets would indeed be grouped after the filter operations. Why would this cause them to be materialized to disk? And if I understand correctly, the data source will not chain to more than one filter, causing (de)serialization to transfer the records fro

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
It might even be materialized (to disk) if both derived data sets are joined.
2015-10-22 12:01 GMT+02:00 Till Rohrmann:
> I fear that the filter operations are not chained because there are at
> least two of them which have the same DataSet as input. However, it's true
> that the intermediate re

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Till Rohrmann
I fear that the filter operations are not chained because at least two of them have the same DataSet as input. However, it's true that the intermediate results are not materialized. It is also correct that the filter operators are deployed co-located with the data sources. Thus, there

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Gábor Gévay
Hello!
> I have thought about a workaround where the InputFormat would return
> Tuple2s and the first field is the name of the dataset to which a record
> belongs. This would however require me to filter the read data once for
> each dataset or to do a groupReduce which is some overhead i'm
> look

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Till Rohrmann
Hi Pieter,
at the moment there is no support for partitioning a `DataSet` into multiple subsets with one pass over it. If you really want distinct data sets for each path, then you have to filter, afaik.
Cheers,
Till
On Thu, Oct 22, 2015 at 11:38 AM, Pieter Hameete wrote:
> Good morning!
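
To make the trade-off discussed in this thread concrete, here is a minimal sketch in plain Python. It is not the Flink API. It assumes records are tagged with the name of the dataset they belong to (the Tuple2 idea mentioned in the thread) and compares one filter per dataset with a single grouping pass. The record contents and the function names `split_by_filter` and `split_single_pass` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical tagged records: (dataset_name, payload). The first field
# plays the role of the Tuple2 tag described in the thread.
records = [
    ("orders", {"id": 1}),
    ("customers", {"id": 7}),
    ("orders", {"id": 2}),
]


def split_by_filter(tagged, names):
    """One filter per dataset name: the input is scanned len(names) times."""
    return {
        name: [payload for tag, payload in tagged if tag == name]
        for name in names
    }


def split_single_pass(tagged):
    """Group by tag in one scan, loosely comparable to a group-by on the tag."""
    groups = defaultdict(list)
    for tag, payload in tagged:
        groups[tag].append(payload)
    return dict(groups)


by_filter = split_by_filter(records, ["orders", "customers"])
by_group = split_single_pass(records)
```

Both functions produce the same subsets here. The filter variant mirrors the "filter once for each dataset" approach, and the grouping variant mirrors the groupReduce alternative. How either one performs in an actual Flink job depends on chaining and materialization, as discussed in the replies above.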