subject:"Best way to compute the difference between 2 datasets"

Re: Best way to compute the difference between 2 datasets

2019-09-20 Thread Fabian Hueske

By the way, there is a set difference or minus operator in the Table API [1] that might be helpful. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/tableApi.html#set-operations On Fri, 20 Sept 2019 at 15:30, Fabian Hueske wrote: > Hi Juan, > > Both the local execution

Re: Best way to compute the difference between 2 datasets

2019-09-20 Thread Fabian Hueske

Hi Juan, Both the local execution environment and the remote execution environment run the same code to execute the program. The implementation of the sortPartition operator was designed to scale to data sizes that exceed the available memory. Internally, it serializes all records into byte arrays and sort

Re: Best way to compute the difference between 2 datasets

2019-09-16 Thread Juan Rodríguez Hortalá

Hi Ken, Thanks for the suggestion, that idea should also work for implementing a data set difference operation, which is what concerns me here. However, I was also curious about why there is such a large performance difference between using sortPartition and sorting in memory by partition, for datasets

Re: Best way to compute the difference between 2 datasets

2019-07-21 Thread Ken Krugler

Hi Juan, If you want to deduplicate, then you could group by the record, and use a (very simple) reduce function to only emit a record if the group contains one element. There will be performance issues, though - Flink will have to generate all groups first, which typically means spilling to di
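The group-then-filter idea in this message can be illustrated outside Flink. The following is a minimal plain-Java sketch, not Flink code and not necessarily what Ken had in mind: it assumes each record is tagged by its source dataset (here, by counting occurrences per side), groups by the record itself, and emits a record only when its group contains no element from the second dataset. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of "group by the record, then filter the groups".
// This only mimics the logic; it does not use the Flink API.
final class GroupedDifference {
    private GroupedDifference() {}

    // Returns each distinct element of `left` that never occurs in `right`,
    // in the order the elements were first seen.
    static <T> List<T> difference(List<T> left, List<T> right) {
        // One group per distinct record:
        // counts[0] = occurrences in left, counts[1] = occurrences in right.
        Map<T, int[]> groups = new LinkedHashMap<>();
        for (T record : left) {
            groups.computeIfAbsent(record, k -> new int[2])[0]++;
        }
        for (T record : right) {
            groups.computeIfAbsent(record, k -> new int[2])[1]++;
        }

        // The "very simple" reduce step: keep a group only if it has
        // left-side records and no right-side records.
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, int[]> group : groups.entrySet()) {
            int[] counts = group.getValue();
            if (counts[0] > 0 && counts[1] == 0) {
                result.add(group.getKey());
            }
        }
        return result;
    }
}
```

Every group has to be built before any record can be emitted, which matches the performance concern raised in the message.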

Best way to compute the difference between 2 datasets

2019-07-21 Thread Juan Rodríguez Hortalá

Hi, I've been trying to write a function to compute the difference between 2 datasets. By that I mean computing a dataset that has all the elements of one dataset that are not present in another dataset. I first tried using coGroup, but it was very slow in a local execution environment, and ofte
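To make the operation being asked about concrete, here is a minimal plain-Java sketch of that definition. It is not Flink code; it assumes the second dataset fits in memory as a hash set, which a distributed implementation cannot generally rely on, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java illustration of the difference described above:
// all elements of `left` that are not present in `right`.
final class DifferenceDefinition {
    private DifferenceDefinition() {}

    static <T> List<T> difference(List<T> left, List<T> right) {
        // Assumption: `right` is small enough to hold in memory.
        Set<T> exclude = new HashSet<>(right);
        // Keep the left-side elements, in order, that `right` does not contain.
        return left.stream()
                .filter(element -> !exclude.contains(element))
                .collect(Collectors.toList());
    }
}
```

For example, the difference of [1, 2, 3, 4, 2] and [2, 5] under this definition is [1, 3, 4].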

5 matches
