Re: Best way to compute the difference between 2 datasets

2019-09-20 Thread Fabian Hueske
Btw., there is a set difference (minus) operator in the Table API [1] that might be helpful. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/tableApi.html#set-operations On Fri, 20 Sept 2019 at 15:30, Fabian Hueske wrote: > Hi Juan, > > Both the local execution
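
For readers unfamiliar with the operator, here is a minimal plain-Java sketch of the semantics the linked Table API page describes for its two set-difference variants: `minus` (like SQL EXCEPT, duplicates removed) and `minusAll` (like SQL EXCEPT ALL). It does not call Flink at all; the class `MinusSemantics` and its method names are hypothetical and exist only to illustrate the behaviour on in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of Flink.
public class MinusSemantics {

    // EXCEPT-style: distinct left records that never occur on the right.
    public static <T> List<T> minus(List<T> left, List<T> right) {
        Set<T> rightSet = new HashSet<>(right);
        Set<T> result = new LinkedHashSet<>(); // drops duplicates, keeps first-seen order
        for (T record : left) {
            if (!rightSet.contains(record)) {
                result.add(record);
            }
        }
        return new ArrayList<>(result);
    }

    // EXCEPT ALL-style: a record present n times on the left and m times
    // on the right is kept max(n - m, 0) times.
    public static <T> List<T> minusAll(List<T> left, List<T> right) {
        Map<T, Integer> remaining = new HashMap<>();
        for (T record : right) {
            remaining.merge(record, 1, Integer::sum);
        }
        List<T> result = new ArrayList<>();
        for (T record : left) {
            int count = remaining.getOrDefault(record, 0);
            if (count > 0) {
                remaining.put(record, count - 1); // cancel one right-side occurrence
            } else {
                result.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "a", "b", "c");
        List<String> right = Arrays.asList("a", "c");
        System.out.println(minus(left, right));
        System.out.println(minusAll(left, right));
    }
}
```

Which variant fits depends on whether duplicates in the left input should survive; the original question does not say, so both are shown.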

Re: Best way to compute the difference between 2 datasets

2019-09-20 Thread Fabian Hueske
Hi Juan, Both the local execution environment and the remote execution environment run the same code to execute the program. The implementation of the sortPartition operator was designed to scale to data sizes that exceed the available memory. Internally, it serializes all records into byte arrays and sort

Re: Best way to compute the difference between 2 datasets

2019-09-16 Thread Juan Rodríguez Hortalá
Hi Ken, Thanks for the suggestion, that idea should also work for implementing a data set difference operation, which is what concerns me here. However, I was also curious about why there is such a large performance difference between using sortPartition and sorting in memory by partition, for datasets

Re: Best way to compute the difference between 2 datasets

2019-07-21 Thread Ken Krugler
Hi Juan, If you want to deduplicate, then you could group by the record, and use a (very simple) reduce function to only emit a record if the group contains one element. There will be performance issues, though - Flink will have to generate all groups first, which typically means spilling to di
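
The group-then-filter idea above can be sketched in plain Java as follows. This is only an in-memory illustration of the logic (group by the record itself, emit a record only when its group has exactly one element); it is not Flink code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Flink.
public class SingletonGroups {

    // Groups records by value and returns those whose group has exactly one member.
    public static <T> List<T> recordsOccurringOnce(List<T> records) {
        // Phase 1: build every group (here, just its size) before emitting anything.
        Map<T, Integer> groupSizes = new LinkedHashMap<>();
        for (T record : records) {
            groupSizes.merge(record, 1, Integer::sum);
        }
        // Phase 2: the "very simple reduce": keep only singleton groups.
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Integer> group : groupSizes.entrySet()) {
            if (group.getValue() == 1) {
                result.add(group.getKey());
            }
        }
        return result;
    }
}
```

Note that nothing is emitted until all groups are complete, which mirrors the cost mentioned above: the whole input has to be grouped first.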

Best way to compute the difference between 2 datasets

2019-07-21 Thread Juan Rodríguez Hortalá
Hi, I've been trying to write a function to compute the difference between two datasets. By that I mean computing a dataset that has all the elements of one dataset that are not present in another dataset. I first tried using coGroup, but it was very slow in a local execution environment, and ofte
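
As a point of reference for the operation being asked about, here is a minimal plain-Java sketch of a coGroup-style difference: both inputs are grouped by the element itself, and a left-hand group is emitted only when the matching right-hand group is empty. It runs on in-memory lists and does not use Flink; the class `CoGroupDifference` is hypothetical, and keeping left-side duplicates is an assumption, since the question does not specify how duplicates should be handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Flink.
public class CoGroupDifference {

    // Returns the elements of `left` that do not occur in `right`.
    // Assumption: duplicates on the left are preserved.
    public static <T> List<T> difference(List<T> left, List<T> right) {
        // For each key: counts[0] = occurrences on the left, counts[1] = on the right.
        Map<T, int[]> groups = new LinkedHashMap<>();
        for (T record : left) {
            groups.computeIfAbsent(record, k -> new int[2])[0]++;
        }
        for (T record : right) {
            groups.computeIfAbsent(record, k -> new int[2])[1]++;
        }
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, int[]> group : groups.entrySet()) {
            int[] counts = group.getValue();
            if (counts[1] == 0) { // right-hand group is empty
                for (int i = 0; i < counts[0]; i++) {
                    result.add(group.getKey());
                }
            }
        }
        return result;
    }
}
```

For example, `difference(Arrays.asList(1, 2, 2, 3), Arrays.asList(3, 4))` keeps 1 and both copies of 2, and drops 3 because it also appears on the right.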