Btw, there is a set difference (minus) operator in the Table API [1] that
might be helpful.
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/tableApi.html#set-operations
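
As a rough illustration of what such a set-difference operator computes, here is a small model in plain Python. It is not Flink code: the function names `minus` and `minus_all` are hypothetical, chosen only to mirror the Table API's Minus (SQL EXCEPT-like) and MinusAll (SQL EXCEPT ALL-like) operators described in [1], and rows are assumed to be hashable values.

```python
from collections import Counter


def minus(left, right):
    """EXCEPT-like: rows of `left` that do not occur in `right`, each returned once."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        # Keep a row only if it is absent from the right side and not yet emitted.
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def minus_all(left, right):
    """EXCEPT ALL-like: a row present n times in `left` and m times in `right`
    is returned max(n - m, 0) times."""
    remaining = Counter(right)
    result = []
    for row in left:
        if remaining[row] > 0:
            # Cancel this left row against one matching right row.
            remaining[row] -= 1
        else:
            result.append(row)
    return result
```

The difference between the two shows up only when the left input contains duplicates: `minus` collapses them, while `minus_all` keeps whatever is left after cancelling against the right input.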
On Fri, 20 Sept 2019 at 15:30, Fabian Hueske wrote:
> Hi Juan,
>
> Both the local execution
Hi Juan,
Both the local execution environment and the remote execution environment
run the same code to execute the program.
The implementation of the sortPartition operator was designed to scale to
data sizes that exceed the memory.
Internally, it serializes all records into byte arrays and sort
Hi Ken,
Thanks for the suggestion. That idea should also work for implementing a
data set difference operation, which is what concerns me here. However, I
was also curious about why there is such a large performance difference between
using sortPartition and sorting in memory by partition, for datasets
Hi Juan,
If you want to deduplicate, then you could group by the record, and use a (very
simple) reduce function to only emit a record if the group contains one element.
There will be performance issues, though - Flink will have to generate all
groups first, which typically means spilling to di
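
One way the grouping idea could be adapted to a difference is sketched below in plain Python, not Flink code. It assumes each record is tagged with its source, the two inputs are combined, and a record is emitted only when its group contains nothing from the right-hand input. The function name `difference_by_grouping` and the tags "L" and "R" are hypothetical, and records are assumed to be hashable.

```python
from collections import defaultdict


def difference_by_grouping(left, right):
    """Records of `left` that never occur in `right`, found by grouping on the record."""
    # Group on the record itself and remember which inputs contributed to each group.
    sources = defaultdict(set)
    for record in left:
        sources[record].add("L")
    for record in right:
        sources[record].add("R")
    # Emit a record only if its group holds left-side elements exclusively.
    return [record for record, tags in sources.items() if tags == {"L"}]
```

Note that every group is built before anything is emitted, and duplicates in the left input are collapsed into a single output record.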
Hi,
I've been trying to write a function to compute the difference between two
datasets. By that I mean computing a dataset that has all the elements of
a dataset that are not present in another dataset. I first tried using
coGroup, but it was very slow in a local execution environment, and ofte