[ https://issues.apache.org/jira/browse/ARROW-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-14635:
-----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + cache) used by dataset writes
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14635
>                 URL: https://issues.apache.org/jira/browse/ARROW-14635
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>              Labels: dataset, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The dataset writer now correctly applies backpressure. However, that
> backpressure is only applied when the write calls slow down, which only
> happens when the OS disk cache fills up.
>
> However, filling up the OS disk cache is undesirable. It will cause all
> running processes to get swapped (assuming the system has any swap
> configured) and will make the system unusable for anything else.
>
> This typically has no actual benefit to the dataset write. The marginal
> performance boost provided by the extra RAM is often not worth the cost.
>
> One way to avoid filling the cache would be to use direct I/O (although
> that comes with a plethora of warnings). Another way might be to flag the
> output as WONTNEED, but I don't know for sure if this works (the OS might
> still cache it so that it can satisfy the write call quickly). Another way
> might be to somehow track how much disk cache is being used for writes, but
> that would get complex. I'm sure there are other ways that I'm just not
> aware of yet.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
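
For illustration only, the "flag the output as WONTNEED" idea from the issue description could look roughly like the sketch below on Linux. It assumes the flag meant is POSIX_FADV_DONTNEED passed to posix_fadvise(), which is a guess and not something the issue states. WriteAndDropCache is a hypothetical helper, not part of Arrow, and the chunk size is arbitrary. Whether this actually keeps the page cache small under a real dataset write is exactly the open question raised in the issue.

```cpp
// Sketch only: plain POSIX/Linux calls, not Arrow code.
// WriteAndDropCache is a hypothetical helper introduced for illustration.
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Writes `data` to `path` in chunks of `chunk_size` bytes. After each chunk
// the file data is flushed and the kernel is advised that the written range
// will not be needed again, so its cached pages become eviction candidates.
// Returns the number of bytes written, or -1 on error.
long long WriteAndDropCache(const std::string& path,
                            const std::vector<char>& data,
                            std::size_t chunk_size) {
  if (chunk_size == 0) return -1;
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return -1;

  std::size_t offset = 0;
  while (offset < data.size()) {
    const std::size_t want = std::min(chunk_size, data.size() - offset);
    std::size_t done = 0;
    while (done < want) {  // write() may write fewer bytes than requested
      ssize_t n = ::write(fd, data.data() + offset + done, want - done);
      if (n <= 0) {
        ::close(fd);
        return -1;
      }
      done += static_cast<std::size_t>(n);
    }
    // Dirty pages cannot be dropped, so flush this chunk to disk first.
    if (::fdatasync(fd) != 0) {
      ::close(fd);
      return -1;
    }
    // Advisory hint only: the kernel is free to ignore it, so the return
    // value is deliberately not treated as an error.
    (void)::posix_fadvise(fd, static_cast<off_t>(offset),
                          static_cast<off_t>(want), POSIX_FADV_DONTNEED);
    offset += want;
  }
  if (::close(fd) != 0) return -1;
  return static_cast<long long>(offset);
}
```

Calling fdatasync() after every chunk makes each write wait for the disk, so the sketch trades throughput for a bounded amount of dirty cache. A real implementation would likely sync less often or asynchronously.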