Hi,
Glad to know the problem has been identified. But the workaround is not
very suitable for my situation, since the data source has no idea
whether the queue is filling up. We were expecting to do flow control
based on used memory, but we cannot get the detailed memory used by
each write t
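The closest thing I can see is the process-wide allocation reported by pyarrow's default memory pool, so a minimal sketch of the flow control I have in mind would look like this (the 2 GB budget and the throttled() wrapper are made-up placeholders, not code we actually run):
```
import time
import pyarrow as pa

# Made-up budget: roughly half of the 4GB container limit.
SOFT_LIMIT = 2 * 1024 ** 3

def throttled(batches):
    # `batches` is any iterable of record batches feeding write_dataset.
    for batch in batches:
        # Stall the producer while Arrow's default memory pool is over budget.
        while pa.total_allocated_bytes() > SOFT_LIMIT:
            time.sleep(0.05)
        yield batch
```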
Thanks. This is a very helpful reproduction.
I was able to reproduce and diagnose the problem. There is a bug on our
end and I have filed [1] to address it. There are a lot more details in
the ticket if you are interested. In the meantime, the only workaround I
can think of is probably to slow
Hi,
The following code should reproduce the problem.
```
import pyarrow as pa
import pyarrow.fs, pyarrow.dataset
schema = pa.schema([("id", pa.utf8()), ("bucket", pa.uint8())])
def rb_generator(buckets, rows, batches):
    for _ in range(batches):
        batch = pa.record_batch(
            [[f"id-{i}" for i in range(rows)],
             [i % buckets for i in range(rows)]],  # bucket values assumed round-robin
            schema=schema)
        yield batch
```
Hi,
Then another question: why is backpressure not working on the input
stream of the write_dataset API? If backpressure happens at the end of
the Acero stream for some reason (at the queue stage or the write
stage), then shouldn't the input stream be backpressured as well? It
should keep the memory to a
> How many io threads are writing concurrently in a single write_dataset
> call?
With the default options, and no partitioning, it will only use 1 I/O
thread. This is because we do not write to a single file in parallel.
If you change FileSystemDatasetWriteOptions::max_rows_per_file then you may
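In Python the same knob is exposed as the max_rows_per_file argument to write_dataset; a small sketch with made-up sizes and an illustrative path (note max_rows_per_group must not exceed max_rows_per_file when both are set):
```
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(10_000))})   # stand-in for the real data

ds.write_dataset(
    table,
    "/tmp/out",                 # illustrative output path
    format="parquet",
    max_rows_per_file=1_000,    # split the output into multiple files
    max_rows_per_group=1_000,   # keep row groups within the file limit
)
```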
Hi,
Thanks for your detailed explanation; I ran some experiments today.
Before the experiment:
1. To limit the resources used by the server, I use Docker, which uses
cgroups. But "free" does not respect the resource limit inside the
container (a small sketch for reading the limit directly is below).
2. I measured the write speed on the host with "dd if=/de
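On point 1: the actual limit can still be read from the cgroup files rather than from "free". A small sketch, assuming the usual cgroup v2 / v1 mount points (paths may differ per distro):
```
from pathlib import Path

def cgroup_memory_limit():
    # Common cgroup v2 and v1 locations; paths may vary per distro.
    for p in ("/sys/fs/cgroup/memory.max",
              "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
        f = Path(p)
        if f.exists():
            value = f.read_text().strip()
            return None if value == "max" else int(value)
    return None
```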
You'll need to measure more, but the bottleneck for writes is usually
going to be the disk itself. Unfortunately, standard OS buffered I/O
has some pretty negative behaviors in this case. First I'll describe
what I generally see happen (the last time I profiled this was a while
back but
Hi,
I'm using Flight to receive streams from the client and write them to
storage with the Python `pa.dataset.write_dataset` API. The whole
dataset is 1 billion rows, over 40 GB, with one partition column
ranging from 0~63. The container runs with 8 CPU cores and 4 GB of RAM.
It can be done in about 160s
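For context, the receiving side is roughly the following shape (a stripped-down sketch, not the actual server code: the output path, the `bucket` column name, and reading the whole stream with read_all() are assumptions based on the description above):
```
import pyarrow.dataset as ds
import pyarrow.flight as flight

class Receiver(flight.FlightServerBase):
    def do_put(self, context, descriptor, reader, writer):
        # Reading everything up front only to keep the sketch short;
        # the real server streams batches into write_dataset instead.
        table = reader.read_all()
        ds.write_dataset(
            table,
            "/data/output",               # placeholder path
            format="parquet",
            partitioning=["bucket"],      # the 0~63 partition column
            partitioning_flavor="hive",
            existing_data_behavior="overwrite_or_ignore",
        )
```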