It looks like on Python 2.7 the open_stream/open_file functions are
treating the file name you are passing as a binary buffer rather than a
file path (inferring from the fact that '1' is one byte in Py2.7 and
'foo' is 3 bytes; likewise, 1886221359 in the earlier "Expected to read
1886221359 metadata bytes" error is 0x706D742F, the ASCII bytes "/tmp"
read as a little-endian length). Try passing an open file handle
instead.
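
Something along these lines (an untested sketch, reusing the /tmp/foo
path from earlier in the thread) should behave the same on 2.7 and 3.x:

    import pyarrow

    # Open the file ourselves so pyarrow receives a file object rather
    # than a Py2 str, which it interprets as an in-memory buffer of
    # Arrow data.
    with open('/tmp/foo', 'rb') as f:
        df = pyarrow.ipc.open_stream(f).read_pandas()

A unicode literal (u'/tmp/foo') may also work, since the path/buffer
ambiguity only exists for the Py2 str/bytes type.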

On Tue, Jun 16, 2020 at 11:28 AM Rares Vernica <rvern...@gmail.com> wrote:
>
> Thank you for your help in getting to the bottom of this. It seems that
> there is no problem with the C++ code; the issue is with the
> PyArrow/Python 2.7 combination.
>
> Here are more details. I have two C++ programs writing two Arrow files. The
> first one is the bigger plugin I'm attempting to port and the second one is
> the small example listed earlier in this thread. The resulting Arrow files
> cannot be read by PyArrow in Python 2.7 but they work fine in Python 3.8.
> The Arrow and PyArrow versions match. I'm using 0.16.0 since there is a
> PyArrow .whl for Python 2.7 on PyPI.
>
> Here is the output from Python 2.7:
>
> > python
> Python 2.7.12 (default, Apr 15 2020, 17:07:12)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow
> >>> pyarrow.__version__
> '0.16.0'
> >>> pyarrow.ipc.open_stream('1').read_pandas()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137,
> in open_stream
>     return RecordBatchStreamReader(source)
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in
> __init__
>     self._open(source)
>   File "pyarrow/ipc.pxi", line 352, in
> pyarrow.lib._RecordBatchStreamReader._open
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Corrupted message, only 1 bytes available
> >>> pyarrow.ipc.open_stream('foo').read_pandas()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137,
> in open_stream
>     return RecordBatchStreamReader(source)
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in
> __init__
>     self._open(source)
>   File "pyarrow/ipc.pxi", line 352, in
> pyarrow.lib._RecordBatchStreamReader._open
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Corrupted message, only 3 bytes available
>
> And here is the output from Python 3.8:
>
> > python
> Python 3.8.3 (default, May 15 2020, 00:00:00)
> [GCC 10.1.1 20200507 (Red Hat 10.1.1-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow
> >>> pyarrow.__version__
> '0.16.0'
> >>> pyarrow.ipc.open_stream('1').read_pandas()
>      x     y
> 0  -10 -10.0
> 1   -9   NaN
> 2   -8  -8.0
> 3   -7   NaN
> 4   -6  -6.0
> 5   -5   NaN
> 6   -4  -4.0
> 7   -3   NaN
> 8   -2  -2.0
> 9   -1   NaN
> 10   0   0.0
> 11   1   NaN
> 12   2   2.0
> 13   3   NaN
> 14   4   4.0
> 15   5   NaN
> 16   6   6.0
> 17   7   NaN
> 18   8   8.0
> 19   9   NaN
> 20  10  10.0
> >>> pyarrow.ipc.open_stream('foo').read_pandas()
>     foo   bar
> 0     0   0.0
> 1     1   NaN
> 2     2   2.0
> 3     3   NaN
> 4     4   4.0
> 5     5   NaN
> 6     6   6.0
> 7     7   NaN
> 8     8   8.0
> 9     9   NaN
> 10   10  10.0
> 11   11   NaN
> 12   12  12.0
> 13   13   NaN
> 14   14  14.0
> 15   15   NaN
> 16   16  16.0
> 17   17   NaN
> 18   18  18.0
> 19   19   NaN
> 20   20  20.0
>
> Is this a bug in PyArrow or some Python 2.7 package issue?
>
> Thanks!
> Rares
>
> On Mon, Jun 15, 2020 at 10:55 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Rares,
> > This last issue sounds like you are trying to write data with the 0.16.0
> > version of the library and read it with a pre-0.15.0 version of the Python
> > library. If you want to do this you need to set "bool
> > write_legacy_ipc_format" to true on the IpcWriteOptions/IpcOptions object
> > and construct the StreamWriter with that object.
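> >
> > A rough sketch against the 0.17 C++ API (in 0.16 the options struct is
> > named IpcOptions instead), reusing the variable names from the example
> > program below:
> >
> >   arrow::ipc::IpcWriteOptions options =
> >       arrow::ipc::IpcWriteOptions::Defaults();
> >   // Ask the writer to emit the pre-0.15 IPC encapsulated message format.
> >   options.write_legacy_ipc_format = true;
> >   ARROW_ASSIGN_OR_RAISE(arrowWriter,
> >       arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema, options));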
> >
> > -Micah
> >
> >
> > On Mon, Jun 15, 2020 at 10:38 PM Rares Vernica <rvern...@gmail.com> wrote:
> >
> > > With open_stream I get a different error:
> > >
> > > > python -c "import pyarrow; pyarrow.ipc.open_stream('/tmp/foo')"
> > > Traceback (most recent call last):
> > >   File "<string>", line 1, in <module>
> > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137,
> > > in open_stream
> > >     return RecordBatchStreamReader(source)
> > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61,
> > in
> > > __init__
> > >     self._open(source)
> > >   File "pyarrow/ipc.pxi", line 352, in
> > > pyarrow.lib._RecordBatchStreamReader._open
> > >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > > pyarrow.lib.ArrowInvalid: Expected to read 1886221359 metadata bytes, but
> > > only read 4
> > >
> > >
> > > On Mon, Jun 15, 2020 at 10:08 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > On Mon, Jun 15, 2020 at 11:24 PM Rares Vernica <rvern...@gmail.com>
> > > > wrote:
> > > > >
> > > > > I was able to reproduce my issue in a small, fully contained
> > > > > program. Here is the source code:
> > > > >
> > > > > #include <arrow/builder.h>
> > > > > #include <arrow/io/file.h>
> > > > > #include <arrow/ipc/writer.h>
> > > > > #include <arrow/record_batch.h>
> > > > >
> > > > > arrow::Status foo() {
> > > > >   std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > > >   std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > > >   std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > > >   std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > > >
> > > > >   std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
> > > > >   arrowFields[0] = arrow::field("foo", arrow::int64());
> > > > >   arrowFields[1] = arrow::field("bar", arrow::int64());
> > > > >   std::shared_ptr<arrow::Schema> arrowSchema =
> > > > >       arrow::schema(arrowFields);
> > > > >
> > > > >   std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
> > > > >   arrow::Int64Builder arrowBuilder;
> > > > >   for (int i = 0; i < 2; i++) {
> > > > >     for (int j = 0; j < 21; j++)
> > > > >       if (i && (j % 2))
> > > > >         ARROW_RETURN_NOT_OK(arrowBuilder.AppendNull());
> > > > >       else
> > > > >         ARROW_RETURN_NOT_OK(arrowBuilder.Append(j));
> > > > >     ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
> > > > >   }
> > > > >   arrowBatch = arrow::RecordBatch::Make(arrowSchema,
> > > > > arrowArrays[0]->length(), arrowArrays);
> > > > >
> > > > >   ARROW_ASSIGN_OR_RAISE(arrowStream,
> > > > > arrow::io::FileOutputStream::Open("/tmp/foo"));
> > > > >   ARROW_ASSIGN_OR_RAISE(arrowWriter,
> > > > > arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
> > > > >   ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > > > >   ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > > >   ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > > >
> > > > >   return arrow::Status::OK();
> > > > > }
> > > > >
> > > > > int main() {
> > > > >   // Propagate failure via the exit code instead of ignoring it.
> > > > >   return foo().ok() ? 0 : 1;
> > > > > }
> > > > >
> > > > > I compile and run it like this:
> > > > >
> > > > > > g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
> > > > > -rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo
> > > > >
> > > > > The file is small and I can't read it from PyArrow:
> > > > >
> > > > > > python -c "import pyarrow;
> > > > > pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
> > > >
> > > > Here is your problem. Try `pyarrow.ipc.open_stream`.
> > > >
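> > > > The program writes the IPC stream format (NewStreamWriter), while
> > > > open_file expects the file format with its ARROW1 magic and footer.
> > > > If you do want the file format, a sketch against the 0.17 C++ API
> > > > (reusing your variable names) would be:
> > > >
> > > >   // NewFileWriter emits the ARROW1 magic and footer that
> > > >   // open_file expects.
> > > >   ARROW_ASSIGN_OR_RAISE(arrowWriter,
> > > >       arrow::ipc::NewFileWriter(arrowStream.get(), arrowSchema));
> > > >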
> > > > > Traceback (most recent call last):
> > > > >   File "<string>", line 1, in <module>
> > > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line
> > > 156,
> > > > > in open_file
> > > > >     return RecordBatchFileReader(source, footer_offset=footer_offset)
> > > > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line
> > > 99,
> > > > in
> > > > > __init__
> > > > >     self._open(source, footer_offset=footer_offset)
> > > > >   File "pyarrow/ipc.pxi", line 398, in
> > > > > pyarrow.lib._RecordBatchFileReader._open
> > > > >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > > > > pyarrow.lib.ArrowInvalid: File is too small: 8
> > > > >
> > > > > Here are the Arrow and g++ versions:
> > > > >
> > > > > > dpkg -s libarrow-dev
> > > > > Package: libarrow-dev
> > > > > Status: install ok installed
> > > > > Priority: optional
> > > > > Section: libdevel
> > > > > Installed-Size: 38738
> > > > > Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
> > > > > Architecture: amd64
> > > > > Multi-Arch: same
> > > > > Source: apache-arrow
> > > > > Version: 0.17.1-1
> > > > > Depends: libarrow17 (= 0.17.1-1)
> > > > >
> > > > > > g++ --version
> > > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > > >
> > > > > Does this make sense?
> > > > >
> > > > > Cheers,
> > > > > Rares
> > > > >
> > > > >
> > > > > On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <rvern...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > This is the compiler:
> > > > > >
> > > > > > > g++ --version
> > > > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > > > >
> > > > > > And this is how I compile the code:
> > > > > >
> > > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC
> > > > > > -D_STDC_FORMAT_MACROS -Wno-system-headers -O3 -g -DNDEBUG
> > > > > > -D_STDC_LIMIT_MACROS -fno-omit-frame-pointer -std=c++14 -DCPP11
> > > > > > -DARROW_NO_DEPRECATED_API -DUSE_ARROW -I.
> > > > > > -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/"
> > > > > > -I"/opt/scidb/19.11/include" -c PhysicalAioSave.cpp
> > > > > > -o PhysicalAioSave.o
> > > > > >
> > > > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC
> > > > > > -D_STDC_FORMAT_MACROS -Wno-system-headers -O3 -g -DNDEBUG
> > > > > > -D_STDC_LIMIT_MACROS -fno-omit-frame-pointer -std=c++14 -DCPP11
> > > > > > -DARROW_NO_DEPRECATED_API -DUSE_ARROW -I.
> > > > > > -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > > > -I"/opt/scidb/19.11/3rdparty/boost/include/"
> > > > > > -I"/opt/scidb/19.11/include" -o libaccelerated_io_tools.so
> > > > > > plugin.o LogicalSplit.o PhysicalSplit.o LogicalParse.o
> > > > > > PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> > > > > > LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> > > > > > -Wl,-soname,libaccelerated_io_tools.so -L.
> > > > > > -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> > > > > > -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
> > > > > >
> > > > > > We targeted 0.16.0 because we are still stuck on Python 2.7 and
> > > > > > PyPI still has PyArrow binaries for 2.7.
> > > > > >
> > > > > > Anyway, I temporarily upgraded to 0.17.1 but the result is the
> > > > > > same. I also fixed all the deprecation warnings but that did not
> > > > > > help either.
> > > > > >
> > > > > > Setting a breakpoint might be a challenge since this code runs as a
> > > > > > plug-in, but I'll try to isolate this further.
> > > > > >
> > > > > > Thanks!
> > > > > > Rares
> > > > > >
> > > > > > On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> What compiler are you using?
> > > > > >>
> > > > > >> In 0.16.0 (what you said you were targeting, though it would be
> > > > > >> better for you to upgrade to 0.17.1) the schema is written in the
> > > > > >> CheckStarted function here:
> > > > > >>
> > > > > >> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
> > > > > >>
> > > > > >> Status CheckStarted() {
> > > > > >>   if (!started_) {
> > > > > >>     return Start();
> > > > > >>   }
> > > > > >>   return Status::OK();
> > > > > >> }
> > > > > >>
> > > > > >> started_ is set to false by a default member initializer in the
> > > > > >> protected block. Maybe you should set a breakpoint in this
> > > > > >> function and see if for some reason started_ is true on the first
> > > > > >> invocation (in which case it makes me wonder if there is something
> > > > > >> not-fully-C++11-compliant about your toolchain).
> > > > > >>
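> > > > > >> For example, with a debug build of Arrow (972 being the line the
> > > > > >> link above points at; it may differ in your build, and
> > > > > >> ./your_program stands in for whatever loads the plugin):
> > > > > >>
> > > > > >>   $ gdb --args ./your_program
> > > > > >>   (gdb) break writer.cc:972
> > > > > >>   (gdb) run
> > > > > >>   (gdb) print started_
> > > > > >>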
> > > > > >> Otherwise I'm a bit stumped since there are lots of production
> > > > > >> applications that use this code.
> > > > > >>
> > > > > >> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > Sure, here is briefly what I'm doing:
> > > > > >> >
> > > > > >> >     bool append = false;
> > > > > >> >     std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > > > >> >     auto arrowResult =
> > > > > >> >         arrow::io::FileOutputStream::Open(fileName, append);
> > > > > >> >     arrowStream = arrowResult.ValueOrDie();
> > > > > >> >
> > > > > >> >     std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > > > >> >     std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > > > >> >     std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > > > >> >
> > > > > >> >     std::shared_ptr<arrow::Schema> arrowSchema =
> > > > > >> >         attributes2ArrowSchema(inputSchema, settings.isAttsOnly());
> > > > > >> >     ARROW_RETURN_NOT_OK(
> > > > > >> >             arrow::ipc::RecordBatchStreamWriter::Open(
> > > > > >> >                 arrowStream.get(), arrowSchema, &arrowWriter));
> > > > > >> >
> > > > > >> >     // Setup "arrowReader" using BufferReader and
> > > > > >> RecordBatchStreamReader
> > > > > >> >     ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> > > > > >> >     ARROW_RETURN_NOT_OK(
> > > > > >> >                 arrowWriter->WriteRecordBatch(*arrowBatch));
> > > > > >> >     ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > > > >> >     ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > > > >> >
> > > > > >> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Can you show the code you are writing? The first thing the
> > > > > >> > > stream writer does before writing any record batch is write
> > > > > >> > > the schema. It sounds like you are using
> > > > > >> > > arrow::ipc::WriteRecordBatch somewhere.
> > > > > >> > >
> > > > > >> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Hello,
> > > > > >> > > >
> > > > > >> > > > I have a RecordBatch that I would like to write to a file.
> > > > > >> > > > I'm using FileOutputStream::Open to open the file and
> > > > > >> > > > RecordBatchStreamWriter::Open to open the stream. I write a
> > > > > >> > > > record batch with WriteRecordBatch. Finally, I close the
> > > > > >> > > > RecordBatchWriter and OutputStream.
> > > > > >> > > >
> > > > > >> > > > The resulting file size is exactly the size of the Buffer
> > > > > >> > > > used to store the RecordBatch. It looks like it is missing
> > > > > >> > > > the schema. When I try to open the resulting file from
> > > > > >> > > > PyArrow I get:
> > > > > >> > > >
> > > > > >> > > > >>> pa.ipc.open_file('/tmp/1')
> > > > > >> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> > > > > >> > > >
> > > > > >> > > > $ ll /tmp/1
> > > > > >> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> > > > > >> > > >
> > > > > >> > > > How can I write the schema as well?
> > > > > >> > > >
> > > > > >> > > > I was browsing the documentation at
> > > > > >> > > > https://arrow.apache.org/docs/cpp/index.html but I can't
> > > > > >> > > > locate any C++ documentation about RecordBatchStreamWriter
> > > > > >> > > > or RecordBatchWriter. Is this intentional?
> > > > > >> > > >
> > > > > >> > > > Thank you!
> > > > > >> > > > Rares
> > > > > >> > > >
> > > > > >> > >
> > > > > >>
> > > > > >
> > > >
> > >
> >
