This is the compiler:

> g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
And this is how I compile the code:

g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D__STDC_FORMAT_MACROS -Wno-system-headers -O3 -g -DNDEBUG -D__STDC_LIMIT_MACROS -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\"" -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include" -c PhysicalAioSave.cpp -o PhysicalAioSave.o

g++ -W -Wextra -Wall -Wno-unused-parameter -Wno-variadic-macros -Wno-strict-aliasing -Wno-long-long -Wno-unused -fPIC -D__STDC_FORMAT_MACROS -Wno-system-headers -O3 -g -DNDEBUG -D__STDC_LIMIT_MACROS -fno-omit-frame-pointer -std=c++14 -DCPP11 -DARROW_NO_DEPRECATED_API -DUSE_ARROW -I. -DPROJECT_ROOT="\"/opt/scidb/19.11\"" -I"/opt/scidb/19.11/3rdparty/boost/include/" -I"/opt/scidb/19.11/include" -o libaccelerated_io_tools.so plugin.o LogicalSplit.o PhysicalSplit.o LogicalParse.o PhysicalParse.o LogicalAioInput.o PhysicalAioInput.o LogicalAioSave.o PhysicalAioSave.o Functions.o -shared -Wl,-soname,libaccelerated_io_tools.so -L. -L"/opt/scidb/19.11/3rdparty/boost/lib" -L"/opt/scidb/19.11/lib" -Wl,-rpath,/opt/scidb/19.11/lib -lm -larrow

We targeted 0.16.0 because we are still stuck on Python 2.7, and PyPI still has PyArrow binaries for 2.7. Anyway, I temporarily upgraded to 0.17.1, but the result is the same. I also fixed all the deprecation warnings, but that did not help either. Setting a breakpoint might be a challenge since this code runs as a plug-in, but I'll try to isolate this further.
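In case it helps, here is the call sequence from my earlier message (quoted below) boiled down to a standalone sketch against the 0.16.0-style API. This is an untested minimal repro, not the real plug-in code: WriteOneBatch is a made-up name, and the one-column schema and batch are placeholders for what attributes2ArrowSchema produces.

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/api.h>

arrow::Status WriteOneBatch(const std::string& fileName) {
  // Placeholder one-column batch standing in for the real SciDB data.
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.Append(42));
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  auto batch = arrow::RecordBatch::Make(schema, 1, {array});

  // Same call sequence as the plug-in: open the file, open the stream
  // writer (which should emit the schema before the first batch), write
  // one batch, then close the writer and the stream.
  auto result = arrow::io::FileOutputStream::Open(fileName, false);
  std::shared_ptr<arrow::io::OutputStream> stream = result.ValueOrDie();
  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
  ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(
      stream.get(), schema, &writer));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  ARROW_RETURN_NOT_OK(writer->Close());
  return stream->Close();
}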
Thanks!
Rares

On Mon, Jun 15, 2020 at 9:15 AM Wes McKinney <wesmck...@gmail.com> wrote:

> What compiler are you using?
>
> In 0.16.0 (what you said you were targeting, though it would be better
> for you to upgrade to 0.17.1) schema is written in the CheckStarted
> function here
>
> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/writer.cc#L972
>
> Status CheckStarted() {
>   if (!started_) {
>     return Start();
>   }
>   return Status::OK();
> }
>
> started_ is set to false by a default member initializer in the
> protected block. Maybe you should set a breakpoint in this function
> and see if for some reason started_ is true on the first invocation
> (in which case it makes me wonder if there is something
> not-fully-C++11-compliant about your toolchain).
>
> Otherwise I'm a bit stumped since there are lots of production
> applications that use this code.
>
> On Mon, Jun 15, 2020 at 11:01 AM Rares Vernica <rvern...@gmail.com> wrote:
> >
> > Sure, here is briefly what I'm doing:
> >
> > bool append = false;
> > std::shared_ptr<arrow::io::OutputStream> arrowStream;
> > auto arrowResult = arrow::io::FileOutputStream::Open(fileName, append);
> > arrowStream = arrowResult.ValueOrDie();
> >
> > std::shared_ptr<arrow::ipc::RecordBatchWriter> arrowWriter;
> > std::shared_ptr<arrow::RecordBatch> arrowBatch;
> > std::shared_ptr<arrow::RecordBatchReader> arrowReader;
> >
> > std::shared_ptr<arrow::Schema> arrowSchema = attributes2ArrowSchema(
> >     inputSchema, settings.isAttsOnly());
> > ARROW_RETURN_NOT_OK(
> >     arrow::ipc::RecordBatchStreamWriter::Open(
> >         arrowStream.get(), arrowSchema, &arrowWriter));
> >
> > // Setup "arrowReader" using BufferReader and RecordBatchStreamReader
> > ARROW_RETURN_NOT_OK(arrowReader->ReadNext(&arrowBatch));
> > ARROW_RETURN_NOT_OK(arrowWriter->WriteRecordBatch(*arrowBatch));
> > ARROW_RETURN_NOT_OK(arrowWriter->Close());
> > ARROW_RETURN_NOT_OK(arrowStream->Close());
> >
> > On Mon, Jun 15, 2020 at 6:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Can you show the code you are writing? The first thing the stream
> > > writer does before writing any record batch is write the schema. It
> > > sounds like you are using arrow::ipc::WriteRecordBatch somewhere.
> > >
> > > On Sun, Jun 14, 2020, 11:44 PM Rares Vernica <rvern...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I have a RecordBatch that I would like to write to a file. I'm using
> > > > FileOutputStream::Open to open the file and
> > > > RecordBatchStreamWriter::Open to open the stream. I write a record
> > > > batch with WriteRecordBatch. Finally, I close the RecordBatchWriter
> > > > and OutputStream.
> > > >
> > > > The resulting file size is exactly the size of the Buffer used to
> > > > store the RecordBatch. It looks like it is missing the schema. When I
> > > > try to open the resulting file from PyArrow I get:
> > > >
> > > > >>> pa.ipc.open_file('/tmp/1')
> > > > pyarrow.lib.ArrowInvalid: File is too small: 6
> > > >
> > > > $ ll /tmp/1
> > > > -rw-r--r--. 1 root root 720 Jun 15 03:54 /tmp/1
> > > >
> > > > How can I write the schema as well?
> > > >
> > > > I was browsing the documentation at
> > > > https://arrow.apache.org/docs/cpp/index.html but I can't locate any
> > > > C++ documentation about RecordBatchStreamWriter or RecordBatchWriter.
> > > > Is this intentional?
> > > >
> > > > Thank you!
> > > > Rares
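For completeness, a rough sketch of the same write path against the 0.17.x API, where the out-parameter form of RecordBatchStreamWriter::Open is deprecated in favor of arrow::ipc::NewStreamWriter, which returns a Result. WriteBatchStream is a hypothetical helper name, not anything from the thread.

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/api.h>
#include <arrow/result.h>

// Hypothetical helper: write a single batch in the IPC stream format.
arrow::Status WriteBatchStream(const std::string& fileName,
                               const std::shared_ptr<arrow::RecordBatch>& batch) {
  ARROW_ASSIGN_OR_RAISE(
      auto stream, arrow::io::FileOutputStream::Open(fileName, /*append=*/false));
  // NewStreamWriter emits the schema message before the first batch.
  ARROW_ASSIGN_OR_RAISE(
      auto writer, arrow::ipc::NewStreamWriter(stream.get(), batch->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  ARROW_RETURN_NOT_OK(writer->Close());
  return stream->Close();
}

One thing worth double-checking on the reading side: both writer forms produce the IPC stream format, which PyArrow reads back with pa.ipc.open_stream. pa.ipc.open_file expects the file format (framed by the ARROW1 magic and a footer) produced by a RecordBatchFileWriter, so opening a stream-format file with it will fail.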