I added these configs to the C++ code as you suggested:
>
> parquet::ArrowReaderProperties arrow_reader_properties =
>     parquet::default_arrow_reader_properties();
> arrow_reader_properties.set_pre_buffer(true);
> arrow_reader_properties.set_use_threads(true);
> parquet::ReaderProperties reader_properties = parquet::default_reader_properties();
>
> I suspect the reason for the difference is that pyarrow uses the datasets
> API internally (I'm pretty sure) even for single file reads now (this
> allows us to have consistent behavior). This is also an asynchronous path
> internally.
I see, thanks for explaining. Does this mean that the C++
I will do an audit of the configs. Is there a way to get a dump of all of them?
Also, are there any particular candidate configs to look for?
On Thu, Jun 13, 2024 at 10:37 AM wish maple wrote:
> Some configs, like use_threads, would be true in Python but false in C++.
>
> Maybe we can fill all configs explicitly with the same values.
Some configs, like use_threads, would be true in Python but false in C++.
Maybe we can fill all configs explicitly with the same values.
Best,
Xuwei Fu
J N wrote on Thu, Jun 13, 2024 at 13:32:
> Hello,
> We all know that there is inherent overhead in Python, and we wanted to
> compare the performance of reading data using C++ Arrow against PyArrow
pyarrow uses C++ code internally. With large files, I would guess that less
than 0.1% of your pyarrow benchmark is spent in the Python interpreter.
Given this, my main advice is not to worry too much about the difference
between pyarrow and C++ Arrow. A lot of work goes into pyarrow to make
Hello,
We all know that there is inherent overhead in Python, and we wanted to
compare the performance of reading data using C++ Arrow against PyArrow for
high throughput systems. Since I couldn't find any benchmarks online for
this comparison, I decided to create my own. These programs read a Parquet
file