That will work, but the size of a single row group could be very large:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
This function also appears to have a bug in it. If any column is a
ChunkedArray after calling ReadRowGroup, then the call to
TableBatchReader::ReadNext will return only part of the row group:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200

I opened https://issues.apache.org/jira/browse/ARROW-3822

On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou <[email protected]> wrote:
>
> Hi,
>
> I think that we can use
> parquet::arrow::FileReader::GetRecordBatchReader()
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
> for this purpose.
>
> It doesn't read a specified number of rows, but it'll read
> only the rows in each row group.
> (Do I misunderstand?)
>
>
> Thanks,
> --
> kou
>
> In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=cfjyh8k...@mail.gmail.com>
>   "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
>   Wes McKinney <[email protected]> wrote:
>
> > garrow_record_batch_stream_reader_new() is for reading files that use
> > the stream IPC protocol described in
> > https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> > Parquet files.
> >
> > We don't have a streaming reader implemented yet for Parquet files.
> > The relevant JIRA (a bit thin on detail) is
> > https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> > to implement this interface, with the option to read some number of
> > "rows" at a time:
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
> >
> > On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> We didn't implement the record batch reader feature for Parquet
> >> in the C API yet. It's easy to implement, so we can provide the
> >> feature in the next release. Can you open a JIRA issue for
> >> this feature?
> >> You can find the "Create" button at
> >> https://issues.apache.org/jira/projects/ARROW/issues/
> >>
> >> If you can use the C++ API, you can use the feature with the
> >> current release.
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <[email protected]>
> >>   "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
> >>   Korry Douglas <[email protected]> wrote:
> >>
> >> > Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
> >> > that will let PostgreSQL read Parquet-format files.
> >> >
> >> > I have just a few questions for now:
> >> >
> >> > 1) I have created a few sample Parquet data files using AWS Glue. Glue
> >> > split my CSV input into many (48) smaller xxx.snappy.parquet files,
> >> > each about 30MB. When I open one of these files using
> >> > gparquet_arrow_file_reader_new_path(), I can then call
> >> > gparquet_arrow_file_reader_read_table() (and then access the content
> >> > of the table). However, …_read_table() seems to read the entire file
> >> > into memory all at once (I say that based on the amount of time it
> >> > takes for gparquet_arrow_file_reader_read_table() to return). That’s
> >> > not the behavior I need.
> >> >
> >> > I have tried to use garrow_memory_mapped_input_stream_new() to open
> >> > the file, followed by garrow_record_batch_stream_reader_new(). The
> >> > call to garrow_record_batch_stream_reader_new() fails with the message:
> >> >
> >> > [record-batch-stream-reader][open]: Invalid: Expected to read
> >> > 827474256 metadata bytes, but only read 30284162
> >> >
> >> > Does this error occur because Glue split the input data? Or because
> >> > Glue compressed the data using snappy? Do I need to uncompress the
> >> > data before I can read/open the file? Do I need to merge the files
> >> > before I can open/read the data?
> >> >
> >> > 2) If I use garrow_record_batch_stream_reader_new() instead of
> >> > gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> >> > reading the entire file into memory before I fetch the first row?
> >> >
> >> >
> >> > Thanks in advance for help and any advice.
> >> >
> >> >
> >> > ― Korry
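[Editor's note] For reference, the row-group-at-a-time read path that Kouhei points to (parquet::arrow::FileReader::GetRecordBatchReader) looks roughly like this in the C++ API. This is a minimal, untested sketch, not code from the thread: the function name ScanParquetByRowGroup is hypothetical, the signatures match the Arrow C++ API of this era, and error handling is reduced to ARROW_RETURN_NOT_OK.

```cpp
// Sketch: stream a Parquet file one row group at a time via
// parquet::arrow::FileReader::GetRecordBatchReader(), so peak memory is
// bounded by row-group size rather than file size.
// NOTE: hypothetical example code; assumes the Arrow and Parquet C++
// libraries are installed and linked.
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

arrow::Status ScanParquetByRowGroup(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // Request every row group; the reader materializes only one row group
  // at a time as it iterates.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);

  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // Process batch->num_rows() rows here.
  }
  return arrow::Status::OK();
}
```

One caveat, per the ARROW-3822 bug described at the top of this message: when a column comes back as a multi-chunk ChunkedArray, a single row group may be split across several record batches, so a consumer should not assume one batch corresponds to one row group.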
