Hello Korry,

the C (GLib) API calls the C++ functions in the background, so it is only another layer on top. The parquet::arrow C++ API is built in a way that it does not use C++ exceptions; instead, if there is a failure, we return arrow::Status objects indicating this.
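Since you asked for examples: below is a minimal sketch of how reading a Parquet file batch by batch (row group by row group) could look with the C++ API. It uses the Status-returning signatures of arrow::io::ReadableFile::Open, parquet::arrow::OpenFile and FileReader::GetRecordBatchReader as they exist at the time of writing; please check them against the headers of the release you actually build against, since signatures can change between versions.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Open a Parquet file and read it one record batch at a time.
// Every step returns an arrow::Status that is checked and propagated;
// no C++ exception is thrown on failure.
arrow::Status ReadParquetInBatches(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> input;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &input));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

  // Ask for all row groups; they are still delivered one batch at a time
  // instead of materializing the whole file as a single table.
  std::vector<int> row_groups;
  for (int i = 0; i < reader->num_row_groups(); ++i) {
    row_groups.push_back(i);
  }

  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) {
      break;  // end of stream
    }
    std::cout << "got a batch with " << batch->num_rows() << " rows" << std::endl;
  }
  return arrow::Status::OK();
}

At the very end of this mail there is also a short sketch of how such a Status can be handed over to PostgreSQL's error reporting without skipping any destructors.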
Uwe

On Fri, Nov 16, 2018, at 3:27 PM, Korry Douglas wrote:
> Thanks Kouhei and Wes for the fast response, much appreciated.
>
> C++ is a bit troublesome for me because of the difference between
> PostgreSQL exception handling (setjmp/longjmp) and C++ exception
> handling (throw/catch) - I’m worried that destructors might not get
> invoked properly when cleaning up errors in Postgres.
>
> I’ve found very few examples on the web that demonstrate how to use the
> Parquet C or C++ APIs. Are you aware of any projects that I might look
> into to understand how to use the APIs? Any blogs that might be
> helpful?
>
>
>
> — Korry
>
>
>
> On Nov 16, 2018, at 8:41 AM, Wes McKinney <[email protected]> wrote:
> >
> > That will work, but the size of a single row group could be very large
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> >
> > This function also appears to have a bug in it. If any column is a
> > ChunkedArray after calling ReadRowGroup, then the call to
> > TableBatchReader::ReadNext will return only part of the row group
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200
> >
> > I opened https://issues.apache.org/jira/browse/ARROW-3822
> > On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I think that we can use
> >> parquet::arrow::FileReader::GetRecordBatchReader()
> >> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
> >> for this purpose.
> >>
> >> It doesn't read the specified number of rows but it'll read
> >> only rows in each row group.
> >> (Do I misunderstand?)
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=cfjyh8k...@mail.gmail.com>
> >>   "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
> >>   Wes McKinney <[email protected]> wrote:
> >>
> >>> garrow_record_batch_stream_reader_new() is for reading files that use
> >>> the stream IPC protocol described in
> >>> https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> >>> Parquet files
> >>>
> >>> We don't have a streaming reader implemented yet for Parquet files.
> >>> The relevant JIRA (a bit thin on detail) is
> >>> https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> >>> to implement this interface, with the option to read some number of
> >>> "rows" at a time:
> >>>
> >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
> >>> On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> We didn't implement the record batch reader feature for Parquet
> >>>> in the C API yet. It's easy to implement. So we can provide the
> >>>> feature in the next release. Can you open a JIRA issue for
> >>>> this feature? You can find the "Create" button at
> >>>> https://issues.apache.org/jira/projects/ARROW/issues/
> >>>>
> >>>> If you can use the C++ API, you can use the feature with the
> >>>> current release.
> >>>>
> >>>>
> >>>> Thanks,
> >>>> --
> >>>> kou
> >>>>
> >>>> In <[email protected]>
> >>>>   "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
> >>>>   Korry Douglas <[email protected]> wrote:
> >>>>
> >>>>> Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
> >>>>> that will let PostgreSQL read Parquet-format files.
> >>>>>
> >>>>> I have just a few questions for now:
> >>>>>
> >>>>> 1) I have created a few sample Parquet data files using AWS Glue. Glue
> >>>>> split my CSV input into many (48) smaller xxx.snappy.parquet files,
> >>>>> each about 30MB. When I open one of these files using
> >>>>> gparquet_arrow_file_reader_new_path(), I can then call
> >>>>> gparquet_arrow_file_reader_read_table() (and then access the content of
> >>>>> the table). However, …_read_table() seems to read the entire file into
> >>>>> memory all at once (I say that based on the amount of time it takes for
> >>>>> gparquet_arrow_file_reader_read_table() to return). That’s not the
> >>>>> behavior I need.
> >>>>>
> >>>>> I have tried to use garrow_memory_mapped_input_stream_new() to open
> >>>>> the file, followed by garrow_record_batch_stream_reader_new(). The
> >>>>> call to garrow_record_batch_stream_reader_new() fails with the message:
> >>>>>
> >>>>> [record-batch-stream-reader][open]: Invalid: Expected to read 827474256
> >>>>> metadata bytes, but only read 30284162
> >>>>>
> >>>>> Does this error occur because Glue split the input data? Or because
> >>>>> Glue compressed the data using snappy? Do I need to uncompress before
> >>>>> I can read/open the file? Do I need to merge the files before I can
> >>>>> open/read the data?
> >>>>>
> >>>>> 2) If I use garrow_record_batch_stream_reader_new() instead of
> >>>>> gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> >>>>> reading the entire file into memory before I fetch the first row?
> >>>>>
> >>>>>
> >>>>> Thanks in advance for help and any advice.
> >>>>>
> >>>>>
> >>>>> ― Korry
>
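P.S. Regarding the setjmp/longjmp worry in the quoted mail: because failures come back as arrow::Status values rather than as thrown exceptions, you can let every C++ object go out of scope normally and only then raise the error on the PostgreSQL side. The sketch below is only an illustration; fdw_read_parquet() is a hypothetical wrapper, not part of any existing FDW, and ReadParquetInBatches() is the sketch from my reply above. The error text is copied into a plain C buffer first, so nothing with a destructor is still alive when ereport() performs its longjmp.

extern "C" {
#include <postgres.h>
}

#include <cstdio>
#include <string>

#include <arrow/status.h>

// Hypothetical FDW entry point; ReadParquetInBatches() is declared in the
// sketch earlier in this mail.
arrow::Status ReadParquetInBatches(const std::string& path);

extern "C" void fdw_read_parquet(const char* path) {
  // Keep the error text in a plain C buffer so that nothing with a
  // destructor is alive when ereport() longjmps out of this function.
  char errbuf[1024];
  errbuf[0] = '\0';
  {
    arrow::Status st = ReadParquetInBatches(path);
    if (!st.ok()) {
      snprintf(errbuf, sizeof(errbuf), "%s", st.ToString().c_str());
    }
  }  // All C++ objects are destroyed here, before any longjmp happens.

  if (errbuf[0] != '\0') {
    ereport(ERROR, (errmsg("could not read Parquet file: %s", errbuf)));
  }
}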
