Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-31 Thread Wenbo Hu
Hi, Glad to know the problem has been identified. But the workaround is not very suitable for my situation, from the data source has no idea about whether the queue is filling up. We were expecting to make flow control based on used memory. But we cannot get detailed memory used by each write t

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-31 Thread Weston Pace
Thanks. This is a very helpful reproduction. I was able to reproduce and diagnose the problem. There is a bug on our end and I have filed [1] to address it. There are a lot more details in the ticket if you are interested. In the meantime, the only workaround I can think of is probably to slow

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-30 Thread Wenbo Hu
Hi, The following code should reproduce the problem. ``` import pyarrow as pa import pyarrow.fs, pyarrow.dataset schema = pa.schema([("id", pa.utf8()), ("bucket", pa.uint8())]) def rb_generator(buckets, rows, batches): batch = pa.record_batch( [[f"id-{i}" for i in range(rows)],

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-30 Thread Wenbo Hu
Hi, Then another question is that "why back pressure not working on the input stream of write_dataset api?". If back pressure happens on the end of the acero stream for some reason (on queue stage or write stage), then the input stream should backpressure as well? It should keep the memory to a

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-28 Thread Weston Pace
> How many io threads are writing concurrently in a single write_dataset > call? With the default options, and no partitioning, it will only use 1 I/O thread. This is because we do not write to a single file in parallel. If you change FileSystemDatasetWriteOptions::max_rows_per_file then you may

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-28 Thread Wenbo Hu
Hi, Thanks for your detailed explanation, I made some experiment today. Before experiment, 1. To limit the resources used by the server, I use docker, which uses cgroups. But "free" does not respect the resource limit inside the container. 2. I measured the write speed on the host by "dd if=/de

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-27 Thread Weston Pace
You'll need to measure more but generally the bottleneck for writes is usually going to be the disk itself. Unfortunately, standard OS buffered I/O has some pretty negative behaviors in this case. First I'll describe what I generally see happen (the last time I profiled this was a while back but

dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-27 Thread Wenbo Hu
Hi, I'm using flight to receive streams from client and write to the storage with python `pa.dataset.write_dataset` API. The whole data is 1 Billion rows, over 40GB with one partition column ranges from 0~63. The container runs at 8-cores CPU and 4GB ram resources. It can be done about 160s