Hi Flink developers,

I read the Flink hybrid hash join documentation and implementation; very nice job. For the case where the small table does not fit entirely into memory, I think we may be able to improve performance further.

Currently in the hybrid hash join, when the small table does not fit into memory, part of the small table data is spilled to disk, and the corresponding partitions of the big table are spilled to disk during the probe phase as well. You can find a detailed description here: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

If we build a bloom filter over the join keys while spilling the small table to disk during the build phase, and use it to filter the big table records that would otherwise be spilled, this could greatly reduce the size of the spilled big table files and save the disk IO cost of writing and later re-reading them. A big-table record whose key is rejected by the bloom filter cannot have a match in the spilled build-side partition, so it can be discarded immediately instead of being written to disk; a false positive only costs an unnecessary spill, never a lost match. A rough sketch of the idea is below.

I have created FLINK-2240 <https://issues.apache.org/jira/browse/FLINK-2240> about this, and I would like to contribute this optimization if someone can assign the JIRA to me. But before that, I would like to hear your comments.
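To make the idea concrete, here is a minimal, self-contained sketch in plain Java. This is not Flink's actual hash table code; the class name, the key-hash representation, and the sizing parameters are assumptions purely for illustration:

import java.util.BitSet;

// Hypothetical bloom filter attached to a spilled build-side partition.
// Build phase: add the key hash of every spilled small-table record.
// Probe phase: test each big-table record before spilling it; a negative
// answer means the key is definitely absent, so the record can be dropped.
public class SpillBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SpillBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit position from two base hashes taken from the
    // halves of a 64-bit key hash (Kirsch-Mitzenmacher scheme).
    private int position(long keyHash, int i) {
        int h1 = (int) keyHash;
        int h2 = (int) (keyHash >>> 32);
        int combined = h1 + i * h2;
        if (combined < 0) {
            combined = ~combined;
        }
        return combined % numBits;
    }

    // Called in the build phase for every spilled small-table record.
    public void add(long keyHash) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(keyHash, i));
        }
    }

    // Called in the probe phase before spilling a big-table record.
    // Returns false only if the key is definitely not in the spilled
    // build partition.
    public boolean mightContain(long keyHash) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(keyHash, i))) {
                return false;
            }
        }
        return true;
    }
}

In the probe phase, the spill path would then look roughly like: if (filter.mightContain(hash(key))) { writeToSpillFile(record); } else { /* definitely no match, skip the spill */ }. The filter size could be chosen from the number of spilled build-side records, and since the filter lives only as long as the spilled partition, its memory footprint should be small compared to the IO it saves.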
Thanks,
Chengxiang