Hello Korry,

the C (GLib) API calls the C++ functions in the background, so it is only another layer on top. The parquet::arrow C++ API is built in a way that it does not use C++ exceptions; instead, if there is a failure, we return arrow::Status objects indicating this.
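Since you asked for examples: below is a minimal sketch of how reading a Parquet file batch by batch (row group by row group) could look with the C++ API. It uses the Status-returning signatures of arrow::io::ReadableFile::Open, parquet::arrow::OpenFile and FileReader::GetRecordBatchReader as they exist at the time of writing; please check them against the headers of the release you actually build against, since signatures can change between versions.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Open a Parquet file and read it one record batch at a time.
// Every step returns an arrow::Status that is checked and propagated;
// no C++ exception is thrown on failure.
arrow::Status ReadParquetInBatches(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> input;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &input));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

  // Ask for all row groups; they are still delivered one batch at a time
  // instead of materializing the whole file as a single table.
  std::vector<int> row_groups;
  for (int i = 0; i < reader->num_row_groups(); ++i) {
    row_groups.push_back(i);
  }

  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) {
      break;  // end of stream
    }
    std::cout << "got a batch with " << batch->num_rows() << " rows" << std::endl;
  }
  return arrow::Status::OK();
}

At the very end of this mail there is also a short sketch of how such a Status can be handed over to PostgreSQL's error reporting without skipping any destructors.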
Uwe

On Fri, Nov 16, 2018, at 3:27 PM, Korry Douglas wrote:
> Thanks Kouhei and Wes for the fast response, much appreciated.
>
> C++ is a bit troublesome for me because of the difference between
> PostgreSQL exception handling (setjmp/longjmp) and C++ exception
> handling (throw/catch) - I’m worried that destructors might not get
> invoked properly when cleaning up errors in Postgres.
>
> I’ve found very few examples on the web that demonstrate how to use the
> Parquet C or C++ APIs. Are you aware of any projects that I might look
> into to understand how to use the APIs? Any blogs that might be
> helpful?
>
>
>
> — Korry
>
>
>
> On Nov 16, 2018, at 8:41 AM, Wes McKinney <[email protected]> wrote:
> >
> > That will work, but the size of a single row group could be very large
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> >
> > This function also appears to have a bug in it. If any column is a
> > ChunkedArray after calling ReadRowGroup, then the call to
> > TableBatchReader::ReadNext will return only part of the row group
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200
> >
> > I opened https://issues.apache.org/jira/browse/ARROW-3822
> > On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I think that we can use
> >> parquet::arrow::FileReader::GetRecordBatchReader()
> >> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
> >> for this purpose.
> >>
> >> It doesn't read the specified number of rows but it'll read
> >> only rows in each row group.
> >> (Do I misunderstand?)
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=cfjyh8k...@mail.gmail.com>
> >>   "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
> >>   Wes McKinney <[email protected]> wrote:
> >>
> >>> garrow_record_batch_stream_reader_new() is for reading files that use
> >>> the stream IPC protocol described in
> >>> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> >>> Parquet files
> >>>
> >>> We don't have a streaming reader implemented yet for Parquet files.
> >>> The relevant JIRA (a bit thin on detail) is
> >>> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> >>> to implement this interface, with the option to read some number of
> >>> "rows" at a time:
> >>>
> >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
> >>> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> We didn't implement the record batch reader feature for Parquet
> >>>> in the C API yet. It's easy to implement. So we can provide the
> >>>> feature in the next release. Can you open a JIRA issue for
> >>>> this feature? You can find the "Create" button at
> >>>> https://issues.apache.org/jira/projects/ARROW/issues/
> >>>>
> >>>> If you can use the C++ API, you can use the feature with the
> >>>> current release.
> >>>>
> >>>>
> >>>> Thanks,
> >>>> --
> >>>> kou
> >>>>
> >>>> In <[email protected]>
> >>>>   "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
> >>>>   Korry Douglas <[email protected]> wrote:
> >>>>
> >>>>> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
> >>>>> that will let PostgreSQL read Parquet-format files.
> >>>>>
> >>>>> I have just a few questions for now:
> >>>>>
> >>>>> 1) I have created a few sample Parquet data files using AWS Glue. Glue
> >>>>> split my CSV input into many (48) smaller xxx.snappy.parquet files,
> >>>>> each about 30MB. When I open one of these files using
> >>>>> gparquet_arrow_file_reader_new_path(), I can then call
> >>>>> gparquet_arrow_file_reader_read_table() (and then access the content of
> >>>>> the table). However, …_read_table() seems to read the entire file into
> >>>>> memory all at once (I say that based on the amount of time it takes for
> >>>>> gparquet_arrow_file_reader_read_table() to return). That’s not the
> >>>>> behavior I need.
> >>>>>
> >>>>> I have tried to use garrow_memory_mapped_input_stream_new() to open
> >>>>> the file, followed by garrow_record_batch_stream_reader_new(). The
> >>>>> call to garrow_record_batch_stream_reader_new() fails with the message:
> >>>>>
> >>>>> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256
> >>>>> metadata bytes, but only read 30284162
> >>>>>
> >>>>> Does this error occur because Glue split the input data? Or because
> >>>>> Glue compressed the data using snappy? Do I need to uncompress before
> >>>>> I can read/open the file? Do I need to merge the files before I can
> >>>>> open/read the data?
> >>>>>
> >>>>> 2) If I use garrow_record_batch_stream_reader_new() instead of
> >>>>> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> >>>>> reading the entire file into memory before I fetch the first row?
> >>>>>
> >>>>>
> >>>>> Thanks in advance for help and any advice.
> >>>>>
> >>>>>
> >>>>> ― Korry
>
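P.S. Regarding the setjmp/longjmp worry in the quoted mail: because failures come back as arrow::Status values rather than as thrown exceptions, you can let every C++ object go out of scope normally and only then raise the error on the PostgreSQL side. The sketch below is only an illustration; fdw_read_parquet() is a hypothetical wrapper, not part of any existing FDW, and ReadParquetInBatches() is the sketch from my reply above. The error text is copied into a plain C buffer first, so nothing with a destructor is still alive when ereport() performs its longjmp.

extern "C" {
#include <postgres.h>
}

#include <cstdio>
#include <string>

#include <arrow/status.h>

// Hypothetical FDW entry point; ReadParquetInBatches() is declared in the
// sketch earlier in this mail.
arrow::Status ReadParquetInBatches(const std::string& path);

extern "C" void fdw_read_parquet(const char* path) {
  // Keep the error text in a plain C buffer so that nothing with a
  // destructor is alive when ereport() longjmps out of this function.
  char errbuf[1024];
  errbuf[0] = '\0';
  {
    arrow::Status st = ReadParquetInBatches(path);
    if (!st.ok()) {
      snprintf(errbuf, sizeof(errbuf), "%s", st.ToString().c_str());
    }
  }  // All C++ objects are destroyed here, before any longjmp happens.

  if (errbuf[0] != '\0') {
    ereport(ERROR, (errmsg("could not read Parquet file: %s", errbuf)));
  }
}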
