[ 
https://issues.apache.org/jira/browse/ARROW-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14635:
-----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + 
> cache) used by dataset writes
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14635
>                 URL: https://issues.apache.org/jira/browse/ARROW-14635
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>              Labels: dataset, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The dataset writer now correctly applies backpressure.  However, that 
> backpressure is only applied when the write calls slow down.  This only 
> happens when the OS disk cache fills up.
> However, filling up the OS disk cache is undesirable.  It will cause all 
> running processes to get swapped out (assuming the system has any swap 
> configured) and will make the system unusable for anything else.
> This typically has no actual benefit to the dataset write.  The marginal 
> performance boost provided by the extra RAM is often not worth the cost.
> One way to limit cache use would be to use direct I/O (although that comes 
> with a plethora of warnings).  Another way might be to flag the output as 
> DONTNEED, but I don't know for sure if this works (the OS might still cache 
> it so that it can satisfy the write call quickly).  Another way might be to 
> somehow track how much disk cache is being used for writes, but that would 
> get complex.  I'm sure there are other ways that I'm just not aware of yet.
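
For illustration, the "flag the output as no longer needed" idea could be
sketched roughly as follows. This is a minimal sketch that assumes Linux and
the POSIX posix_fadvise() call with POSIX_FADV_DONTNEED; it is not the Arrow
dataset writer's implementation, and WriteAndDropCache is a hypothetical
helper name. The advice is only a hint, so the kernel may still keep the
pages cached.

```cpp
// Sketch only (assumes Linux/POSIX): write a buffer, flush it to disk, then
// hint to the kernel that the file's cached pages are no longer needed.
#include <fcntl.h>
#include <unistd.h>

#include <cstddef>
#include <string>

// Hypothetical helper, not part of Arrow. Returns true if the data was
// written and flushed; the cache hint itself is treated as best-effort.
bool WriteAndDropCache(const std::string& path, const char* data,
                       std::size_t size) {
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;

  // write() may accept fewer bytes than requested, so loop until done.
  std::size_t written = 0;
  while (written < size) {
    ssize_t n = ::write(fd, data + written, size - written);
    if (n < 0) {
      ::close(fd);
      return false;
    }
    written += static_cast<std::size_t>(n);
  }

  // Dirty pages are not dropped by DONTNEED, so flush them first.
  bool ok = (::fsync(fd) == 0);

  // Offset 0 with length 0 covers the whole file. The result is ignored
  // because this is advisory and some filesystems may not honor it.
  (void)::posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

  ok = (::close(fd) == 0) && ok;
  return ok;
}
```

Calling fsync() after every file trades write throughput for a bounded cache
footprint, which is the same trade-off the issue describes. A real writer
would more likely apply the hint to ranges of a file that have already been
flushed rather than to each whole file.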



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
