That will work, but the size of a single row group could be very large:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
This function also appears to have a bug in it. If any column is a
ChunkedArray after calling ReadRowGroup, then the call to
TableBatchReader::ReadNext will return only part of the row group:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L200

I opened https://issues.apache.org/jira/browse/ARROW-3822

On Thu, Nov 15, 2018 at 11:23 PM Kouhei Sutou <[email protected]> wrote:
>
> Hi,
>
> I think that we can use
> parquet::arrow::FileReader::GetRecordBatchReader()
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L175
> for this purpose.
>
> It doesn't read a specified number of rows, but it'll read
> only the rows in each row group.
> (Do I misunderstand?)
>
>
> Thanks,
> --
> kou
>
> In <CAJPUwMBY_KHF84T4KAXPUtVP0AVYiKv05erNA_N=cfjyh8k...@mail.gmail.com>
>   "Re: Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 22:41:13 -0500,
>   Wes McKinney <[email protected]> wrote:
>
> > garrow_record_batch_stream_reader_new() is for reading files that use
> > the stream IPC protocol described in
> > https://github.com/apache/arrow/blob/master/format/IPC.md, not for
> > Parquet files.
> >
> > We don't have a streaming reader implemented yet for Parquet files.
> > The relevant JIRA (a bit thin on detail) is
> > https://issues.apache.org/jira/browse/ARROW-1012. To be clear, I mean
> > to implement this interface, with the option to read some number of
> > "rows" at a time:
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166
> >
> > On Thu, Nov 15, 2018 at 10:33 PM Kouhei Sutou <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> We didn't implement the record batch reader feature for Parquet
> >> in the C API yet. It's easy to implement, so we can provide the
> >> feature in the next release. Can you open a JIRA issue for
> >> this feature?
> >> You can find the "Create" button at
> >> https://issues.apache.org/jira/projects/ARROW/issues/
> >>
> >> If you can use the C++ API, you can use the feature with the
> >> current release.
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <[email protected]>
> >>   "Joining Parquet & PostgreSQL" on Thu, 15 Nov 2018 12:56:34 -0500,
> >>   Korry Douglas <[email protected]> wrote:
> >>
> >> > Hi all, I’m exploring the idea of adding a foreign data wrapper (FDW)
> >> > that will let PostgreSQL read Parquet-format files.
> >> >
> >> > I have just a few questions for now:
> >> >
> >> > 1) I have created a few sample Parquet data files using AWS Glue. Glue
> >> > split my CSV input into many (48) smaller xxx.snappy.parquet files,
> >> > each about 30MB. When I open one of these files using
> >> > gparquet_arrow_file_reader_new_path(), I can then call
> >> > gparquet_arrow_file_reader_read_table() (and then access the content
> >> > of the table). However, …_read_table() seems to read the entire file
> >> > into memory all at once (I say that based on the amount of time it
> >> > takes for gparquet_arrow_file_reader_read_table() to return). That’s
> >> > not the behavior I need.
> >> >
> >> > I have tried to use garrow_memory_mapped_input_stream_new() to open
> >> > the file, followed by garrow_record_batch_stream_reader_new(). The
> >> > call to garrow_record_batch_stream_reader_new() fails with the message:
> >> >
> >> > [record-batch-stream-reader][open]: Invalid: Expected to read
> >> > 827474256 metadata bytes, but only read 30284162
> >> >
> >> > Does this error occur because Glue split the input data? Or because
> >> > Glue compressed the data using snappy? Do I need to uncompress the
> >> > data before I can read/open the file? Do I need to merge the files
> >> > before I can open/read the data?
> >> >
> >> > 2) If I use garrow_record_batch_stream_reader_new() instead of
> >> > gparquet_arrow_file_reader_new_path(), will I avoid the overhead of
> >> > reading the entire file into memory before I fetch the first row?
> >> >
> >> >
> >> > Thanks in advance for help and any advice.
> >> >
> >> >
> >> > ― Korry
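[Editor's note] For reference, the row-group-at-a-time read path that Kouhei points to (parquet::arrow::FileReader::GetRecordBatchReader) looks roughly like this in the C++ API. This is a minimal, untested sketch, not code from the thread: the function name ScanParquetByRowGroup is hypothetical, the signatures match the Arrow C++ API of this era, and error handling is reduced to ARROW_RETURN_NOT_OK.

```cpp
// Sketch: stream a Parquet file one row group at a time via
// parquet::arrow::FileReader::GetRecordBatchReader(), so peak memory is
// bounded by row-group size rather than file size.
// NOTE: hypothetical example code; assumes the Arrow and Parquet C++
// libraries are installed and linked.
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

arrow::Status ScanParquetByRowGroup(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // Request every row group; the reader materializes only one row group
  // at a time as it iterates.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);

  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // Process batch->num_rows() rows here.
  }
  return arrow::Status::OK();
}
```

One caveat, per the ARROW-3822 bug described at the top of this message: when a column comes back as a multi-chunk ChunkedArray, a single row group may be split across several record batches, so a consumer should not assume one batch corresponds to one row group.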
