Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread J N
I added these configs to the C++ code, as you suggested:
    parquet::ArrowReaderProperties arrow_reader_properties =
        parquet::default_arrow_reader_properties();
    arrow_reader_properties.set_pre_buffer(true);
    arrow_reader_properties.set_use_threads(true);
    parquet::ReaderProperties read

Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread J N
> I suspect the reason for the difference is that pyarrow uses the datasets API internally (I'm pretty sure), even for single-file reads now (this allows us to have consistent behavior). This is also an asynchronous path internally.
I see, thanks for explaining. Does this mean that the C++

Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread J N
I will do an audit of the configs. Is there a way to get a dump of all of them? Also, are there any particular candidate configs to look for?
On Thu, Jun 13, 2024 at 10:37 AM wish maple wrote:
> Some configs, like use_threads, would be true in Python but false in C++.
> Maybe we can fill all configs explicitly wit

Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread wish maple
Some configs, like use_threads, would be true in Python but false in C++. Maybe we can fill all configs explicitly with the same values. Best, Xuwei Fu
On Thu, Jun 13, 2024 at 13:32, J N wrote:
> Hello, We all know that there is inherent overhead in Python, and we wanted to compare the performance of reading d

Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread Weston Pace
pyarrow uses C++ code internally. With the large files, I would guess that less than 0.1% of your pyarrow benchmark is spent in the Python interpreter. Given this, my main advice is not to worry too much about the difference between pyarrow and C++ Arrow. A lot of work goes into pyarrow to make

[C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-12 Thread J N
Hello, We all know that there is inherent overhead in Python, and we wanted to compare the performance of reading data using C++ Arrow against PyArrow for high-throughput systems. Since I couldn't find any benchmarks online for this comparison, I decided to create my own. These programs read a Par