[ https://issues.apache.org/jira/browse/ARROW-16389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648574#comment-17648574 ]
Zane Wilbert Keeler commented on ARROW-16389:
---------------------------------------------

dr jim changing server hdwe update. z

> [C++] Support hash-join on larger than memory datasets
> ------------------------------------------------------
>
> Key: ARROW-16389
> URL: https://issues.apache.org/jira/browse/ARROW-16389
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10.5h
> Remaining Estimate: 0h
>
> The current implementation of the hash-join node queues in memory the
> hash table, the entire build-side input, and the entire probe-side input (e.g.
> the entire dataset). This means the current implementation will run out of
> memory and crash if the input dataset is larger than the memory on the system.
> By spilling to disk when memory starts to fill up, we can allow the hash-join
> node to process datasets larger than the available memory on the machine.

--
This message was sent by Atlassian Jira (v8.20.10#820010)