Hi Flink developers,

I read the Flink hybrid hash join documentation and implementation; very nice job. For the case where the small table does not fit entirely into memory, I think we may be able to improve performance further.

Currently in the hybrid hash join, when the small table does not fit into memory, part of the small table data is spilled to disk, and the corresponding partitions of the big table are spilled to disk during the probe phase as well. You can find a detailed description here: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

If we build a bloom filter over the join keys while spilling the small table to disk during the build phase, and use it to filter the big table records that would otherwise be spilled, this could greatly reduce the size of the spilled big table files and save the disk IO cost of writing and later re-reading them. A big-table record whose key is rejected by the bloom filter cannot have a match in the spilled build-side partition, so it can be discarded immediately instead of being written to disk; a false positive only costs an unnecessary spill, never a lost match. A rough sketch of the idea is below.

I have created FLINK-2240 <https://issues.apache.org/jira/browse/FLINK-2240> about this, and I would like to contribute this optimization if someone can assign the JIRA to me. But before that, I would like to hear your comments.
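To make the idea concrete, here is a minimal, self-contained sketch in plain Java. This is not Flink's actual hash table code; the class name, the key-hash representation, and the sizing parameters are assumptions purely for illustration:

import java.util.BitSet;

// Hypothetical bloom filter attached to a spilled build-side partition.
// Build phase: add the key hash of every spilled small-table record.
// Probe phase: test each big-table record before spilling it; a negative
// answer means the key is definitely absent, so the record can be dropped.
public class SpillBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SpillBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit position from two base hashes taken from the
    // halves of a 64-bit key hash (Kirsch-Mitzenmacher scheme).
    private int position(long keyHash, int i) {
        int h1 = (int) keyHash;
        int h2 = (int) (keyHash >>> 32);
        int combined = h1 + i * h2;
        if (combined < 0) {
            combined = ~combined;
        }
        return combined % numBits;
    }

    // Called in the build phase for every spilled small-table record.
    public void add(long keyHash) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(keyHash, i));
        }
    }

    // Called in the probe phase before spilling a big-table record.
    // Returns false only if the key is definitely not in the spilled
    // build partition.
    public boolean mightContain(long keyHash) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(keyHash, i))) {
                return false;
            }
        }
        return true;
    }
}

In the probe phase, the spill path would then look roughly like: if (filter.mightContain(hash(key))) { writeToSpillFile(record); } else { /* definitely no match, skip the spill */ }. The filter size could be chosen from the number of spilled build-side records, and since the filter lives only as long as the spilled partition, its memory footprint should be small compared to the IO it saves.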
Thanks,
Chengxiang