Re: C++ Write Schema with RecordBatchStreamWriter

Micah Kornfield Mon, 15 Jun 2020 22:56:16 -0700

Hi Rares,
This last issue sounds like you are trying to write data with the 0.16.0
version of the library and read it with a pre-0.15.0 version of the Python
library.  If you want to do this, you need to set the "bool
write_legacy_ipc_format" member to true on the IpcWriteOptions/IpcOptions
object and construct the StreamWriter with that object.
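
To illustrate why the format flag matters, here is a minimal Python sketch
of the message framing change, based on the Arrow IPC format description:
since 0.15.0 each encapsulated message starts with a 4-byte continuation
marker (0xFFFFFFFF) followed by the 4-byte little-endian metadata length,
whereas the legacy format starts directly with the length. The helper names
are hypothetical, the metadata is a placeholder rather than a real
Flatbuffer, and padding and the message body are left out.

```python
import struct

# Continuation marker that precedes the length prefix since 0.15.0.
CONTINUATION = 0xFFFFFFFF


def frame_legacy(metadata):
    # Simplified pre-0.15.0 framing: <int32 length><metadata>
    return struct.pack("<i", len(metadata)) + metadata


def frame_current(metadata):
    # Simplified 0.15.0+ framing: <0xFFFFFFFF><int32 length><metadata>
    return struct.pack("<Ii", CONTINUATION, len(metadata)) + metadata


def legacy_reader_length(stream):
    # A legacy reader takes the first four bytes as the metadata length.
    return struct.unpack_from("<i", stream, 0)[0]


metadata = b"\x00" * 16  # placeholder bytes, not a real schema message

print(legacy_reader_length(frame_legacy(metadata)))   # 16
print(legacy_reader_length(frame_current(metadata)))  # -1
```

In this simplified model a legacy reader that is handed the current framing
sees the marker where it expects a length, which is why the writer has to be
told to emit the legacy framing for such readers.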


-Micah


On Mon, Jun 15, 2020 at 10:38 PM Rares Vernica <[email protected]> wrote:

> With open_stream I get a different error:
>
> > python -c "import pyarrow; pyarrow.ipc.open_stream('/tmp/foo')"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 137, in open_stream
>     return RecordBatchStreamReader(source)
>   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 61, in __init__
>     self._open(source)
>   File "pyarrow/ipc.pxi", line 352, in pyarrow.lib._RecordBatchStreamReader._open
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected to read 1886221359 metadata bytes, but only read 4
>
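
A side observation on the number in this error, offered as a hypothesis
rather than a diagnosis: 1886221359 is exactly the four ASCII bytes "/tmp"
read as a little-endian 32-bit integer, and "only read 4" matches the four
remaining characters of '/tmp/foo'. The "File is too small: 8" and "File is
too small: 6" errors quoted further down likewise match the lengths of
'/tmp/foo' and '/tmp/1'. This would be consistent with pyarrow on Python 2.7
treating the str argument as a byte buffer instead of a file path, although
nothing in this thread confirms that. A short Python check of the
arithmetic:

```python
import struct

# Length reported in the ArrowInvalid message of the traceback.
reported_length = 1886221359

# Re-encode it the way a reader would have decoded it: little-endian int32.
prefix = struct.pack("<i", reported_length)
print(prefix.decode("ascii"))  # /tmp

# The argument that was passed to pyarrow.ipc.open_stream.
path = "/tmp/foo"
remaining = len(path) - len(prefix)
print(len(path))   # 8
print(remaining)   # 4
```

If the hypothesis holds, passing an already opened binary file object (for
example open('/tmp/foo', 'rb')) instead of the path string would be one way
to test it.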
>
> On Mon, Jun 15, 2020 at 10:08 PM Wes McKinney <[email protected]> wrote:
>
> > On Mon, Jun 15, 2020 at 11:24 PM Rares Vernica <[email protected]> wrote:
> > >
> > > I was able to reproduce my issue in a small, self-contained program.
> > > Here is the source code:
> > >
> > > #include <iostream>
> > > #include <memory>
> > > #include <vector>
> > >
> > > #include <arrow/builder.h>
> > > #include <arrow/io/file.h>
> > > #include <arrow/ipc/writer.h>
> > > #include <arrow/record_batch.h>
> > >
> > > arrow::Status foo() {
> > >   std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > >   std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > >   std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > >
> > >   // Schema with two int64 columns.
> > >   std::vector<std::shared_ptr<arrow::Field>> arrowFields(2);
> > >   arrowFields[0] = arrow::field("foo", arrow::int64());
> > >   arrowFields[1] = arrow::field("bar", arrow::int64());
> > >   std::shared_ptr<arrow::Schema> arrowSchema = arrow::schema(arrowFields);
> > >
> > >   // 21 rows per column; odd rows of the second column are null.
> > >   std::vector<std::shared_ptr<arrow::Array>> arrowArrays(2);
> > >   arrow::Int64Builder arrowBuilder;
> > >   for (int i = 0; i < 2; i++) {
> > >     for (int j = 0; j < 21; j++) {
> > >       if (i && (j % 2)) {
> > >         ARROW_RETURN_NOT_OK(arrowBuilder.AppendNull());
> > >       } else {
> > >         ARROW_RETURN_NOT_OK(arrowBuilder.Append(j));
> > >       }
> > >     }
> > >     ARROW_RETURN_NOT_OK(arrowBuilder.Finish(&arrowArrays[i]));
> > >   }
> > >   arrowBatch = arrow::RecordBatch::Make(
> > >       arrowSchema, arrowArrays[0]->length(), arrowArrays);
> > >
> > >   // Write the schema and the batch in the IPC stream format.
> > >   ARROW_ASSIGN_OR_RAISE(
> > >       arrowStream, arrow::io::FileOutputStream::Open("/tmp/foo"));
> > >   ARROW_ASSIGN_OR_RAISE(
> > >       arrowWriter,
> > >       arrow::ipc::NewStreamWriter(arrowStream.get(), arrowSchema));
> > >   ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > >   ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > >   ARROW_RETURN_NOT_OK(arrowStream->Close());
> > >
> > >   return arrow::Status::OK();
> > > }
> > >
> > > int main() {
> > >   // Report a failed Status instead of silently discarding it.
> > >   arrow::Status status = foo();
> > >   if (!status.ok()) {
> > >     std::cerr << status.ToString() << std::endl;
> > >     return 1;
> > >   }
> > >   return 0;
> > > }
> > >
> > > I compile and run it like this:
> > >
> > > > g++ -std=c++11 src/foo.cpp -larrow && ./a.out && ll /tmp/foo
> > > -rw-r--r--. 1 root root 720 Jun 16 04:16 /tmp/foo
> > >
> > > The file is small and I can't read it from PyArrow:
> > >
> > > > python -c "import pyarrow; pyarrow.ipc.open_file('/tmp/foo').read_pandas()"
> >
> > Here is your problem. Try `pyarrow.ipc.open_stream`.
> >
> > > Traceback (most recent call last):
> > >   File "<string>", line 1, in <module>
> > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 156, in open_file
> > >     return RecordBatchFileReader(source, footer_offset=footer_offset)
> > >   File "/usr/local/lib/python2.7/dist-packages/pyarrow/ipc.py", line 99, in __init__
> > >     self._open(source, footer_offset=footer_offset)
> > >   File "pyarrow/ipc.pxi", line 398, in pyarrow.lib._RecordBatchFileReader._open
> > >   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> > > pyarrow.lib.ArrowInvalid: File is too small: 8
> > >
> > > Here is the Arrow and G++ version:
> > >
> > > > dpkg -s libarrow-dev
> > > Package: libarrow-dev
> > > Status: install ok installed
> > > Priority: optional
> > > Section: libdevel
> > > Installed-Size: 38738
> > > Maintainer: Apache Arrow Developers <[email protected]>
> > > Architecture: amd64
> > > Multi-Arch: same
> > > Source: apache-arrow
> > > Version: 0.17.1-1
> > > Depends: libarrow17 (= 0.17.1-1)
> > >
> > > > g++ --version
> > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > >
> > > Does this make sense?
> > >
> > > Cheers,
> > > Rares
> > >
> > >
> > > On Mon, Jun 15, 2020 at 10:45 AM Rares Vernica <[email protected]> wrote:
> > >
> > > > This is the compiler:
> > > >
> > > > > g++ --version
> > > > g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
> > > >
> > > > And this is how I compile the code:
> > > >
> > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC
> > > > -D_STDC_FORMAT_MACROS
> > > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > -I"/opt/scidb/19.11/3rdparty/boost/include/"
> > > > -I"/opt/scidb/19.11/include"
> > > > -c PhysicalAioSave.cpp -o PhysicalAioSave.o
> > > >
> > > > g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros
> > > > -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC
> > > > -D_STDC_FORMAT_MACROS
> > > > -Wno-system-headers -O3 -g -DNDEBUG -D_STDC_LIMIT_MACROS
> > > > -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API
> > > > -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\""
> > > > -I"/opt/scidb/19.11/3rdparty/boost/include/"
> > > > -I"/opt/scidb/19.11/include"
> > > > -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o
> > > > LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o
> > > > LogicalAioSave.o PhysicalAioSave.o Functions.o -shared
> > > > -Wl,-soname,libaccelerated_io_tools.so -L.
> > > > -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib"
> > > > -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow
> > > >
> > > > We targeted 0.16.0 because we are still stuck on Python 2.7 and PyPI
> > > > still has PyArrow binaries for 2.7.
> > > >
> > > > Anyway, I temporarily upgraded to 0.17.1 but the result is the same. I
> > > > also fixed all the deprecation warnings but that did not help either.
> > > >
> > > > Setting a breakpoint might be a challenge since this code runs as a
> > > > plug-in, but I'll try to isolate this further.
> > > >
> > > > Thanks!
> > > > Rares
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <[email protected]> wrote:
> > > >
> > > >> What compiler are you using?
> > > >>
> > > >> In 0.16.0 (what you said you were targeting, though it would be better
> > > >> for you to upgrade to 0.17.1) the schema is written in the CheckStarted
> > > >> function here:
> > > >>
> > > >> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
> > > >>
> > > >> Status CheckStarted() {
> > > >>   if (!started_) {
> > > >>     return Start();
> > > >>   }
> > > >>   return Status::OK();
> > > >> }
> > > >>
> > > >> started_ is set to false by a default member initializer in the
> > > >> protected block. Maybe you should set a breakpoint in this function
> > > >> and see if for some reason started_ is true on the first invocation
> > > >> (in which case it makes me wonder if there is something
> > > >> not-fully-C++11-compliant about your toolchain).
> > > >>
> > > >> Otherwise I'm a bit stumped since there are lots of production
> > > >> applications that use this code.
> > > >>
> > > >> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <[email protected]> wrote:
> > > >> >
> > > >> > Sure, here is briefly what I'm doing:
> > > >> >
> > > >> >     bool append = false;
> > > >> >     std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > > >> >     auto arrowResult =
> > > >> >         arrow::io::FileOutputStream::Open(fileName, append);
> > > >> >     arrowStream = arrowResult.ValueOrDie();
> > > >> >
> > > >> >     std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > > >> >     std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > > >> >     std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> > > >> >
> > > >> >     std::shared_ptr<arrow::Schema> arrowSchema =
> > > >> >         attributes2ArrowSchema(inputSchema, settings.isAttsOnly());
> > > >> >     ARROW_RETURN_NOT_OK(
> > > >> >             arrow::ipc::RecordBatchStreamWriter::Open(
> > > >> >                 arrowStream.get(), arrowSchema, &arrowWriter));
> > > >> >
> > > >> >     // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
> > > >> >     ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> > > >> >     ARROW_RETURN_NOT_OK(
> > > >> >                 arrowWriter->WriteRecordBatch(*arrowBatch));
> > > >> >     ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > > >> >     ARROW_RETURN_NOT_OK(arrowStream->Close());
> > > >> >
> > > >> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <[email protected]> wrote:
> > > >> >
> > > >> > > Can you show the code you are writing? The first thing the stream
> > > >> > > writer does before writing any record batch is write the schema. It
> > > >> > > sounds like you are using arrow::ipc::WriteRecordBatch somewhere.
> > > >> > >
> > > >> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <[email protected]> wrote:
> > > >> > >
> > > >> > > > Hello,
> > > >> > > >
> > > >> > > > I have a RecordBatch that I would like to write to a file. I'm using
> > > >> > > > FileOutputStream::Open to open the file and
> > > >> > > > RecordBatchStreamWriter::Open to open the stream. I write a record
> > > >> > > > batch with WriteRecordBatch. Finally, I close the RecordBatchWriter
> > > >> > > > and OutputStream.
> > > >> > > >
> > > >> > > > The resulting file size is exactly the size of the Buffer used to
> > > >> > > > store the RecordBatch. It looks like it is missing the schema. When
> > > >> > > > I try to open the resulting file from PyArrow I get:
> > > >> > > >
> > > >> > > > >>> pa.ipc.open_file('/tmp/1')
> > > >> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> > > >> > > >
> > > >> > > > $ ll /tmp/1
> > > >> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> > > >> > > >
> > > >> > > > How can I write the schema as well?
> > > >> > > >
> > > >> > > > I was browsing the documentation at
> > > >> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate any
> > > >> > > > C++ documentation about RecordBatchStreamWriter or
> > > >> > > > RecordBatchWriter. Is this intentional?
> > > >> > > >
> > > >> > > > Thank you!
> > > >> > > > Rares
> > > >> > > >
> > > >> > >
> > > >>
> > > >
> >
>
