The file format isn't intended to be used in a streaming setting. Use
RecordBatchStreamWriter if you need to be able to read a dataset as a
stream

That being said I don't have a problem with writing the EOS in the
file, but the current implementation is not "wrong".

On Wed, May 22, 2019 at 8:37 AM John Muehlhausen <j...@jgm.org> wrote:
>
> I believe the change involves updating the File format notes as above, as
> well as something like the following.  The format also mentions "there is
> no requirement that dictionary keys should be defined in a DictionaryBatch
> before they are used in a RecordBatch, as long as the keys are defined
> somewhere in the file."  I'm not sure this is a good idea if the file is
> enormous.  The option to process a file as a stream is how we keep memory
> usage reasonable?
>
> I will try to figure out a workflow for recommending changes to the code
> base.... merge requests etc.  Any tips appreciated.
>
>   Status Close() override {
> *    // Close stream for compatibility with sequential readers*
> *    RETURN_NOT_OK(UpdatePosition());*
> *    int32_t end_of_stream_marker = 0;*
> *    RETURN_NOT_OK(Write(&end_of_stream_marker, sizeof(int32_t)));*
>
>     // Write file footer
>     RETURN_NOT_OK(UpdatePosition());
>     int64_t initial_position = position_;
>     RETURN_NOT_OK(WriteFileFooter(*schema_, dictionaries_, record_batches_,
> sink_));
>
>     // Write footer length
>     RETURN_NOT_OK(UpdatePosition());
>     int32_t footer_length = static_cast<int32_t>(position_ -
> initial_position);
>     if (footer_length <= 0) {
>       return Status::Invalid("Invalid file footer");
>     }
>
>     RETURN_NOT_OK(Write(&footer_length, sizeof(int32_t)));
>
>     // Write magic bytes to end file
>     return Write(kArrowMagicBytes, strlen(kArrowMagicBytes));
>   }
>
>
> On Tue, May 21, 2019 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > This seems like a reasonable change.  Is there any reason that we shouldnt
> > always append EOS?
> >
> > On Tuesday, May 21, 2019, John Muehlhausen <j...@jgm.org> wrote:
> >
> > > Wes,
> > >
> > > Check out reader.cpp.  It seg faults when it gets to the next
> > > message-that-is-not-a-message... it is a footer.  But I have no way to
> > > know this in reader.cpp because I'm piping the File in via stdin.
> > >
> > > In seeker.cpp I seek to the end and figure out where the footer is (this
> > > is a py-arrow-written file) and indeed it is at the offset where my
> > > "streamed File" reader bombed out.  If EOS were mandatory at this
> > location
> > > it would have been fine... I would have said "oh, time for the footer!"
> > >
> > > Basically what I'm saying is that we can't assume that File won't be
> > > processed as a stream.  In an actual non-file stream it is either EOS or
> > > end-of-stream.  But with a file-as-stream there is more data and we have
> > to
> > > know it isn't the stream anymore.
> > >
> > > Otherwise we've locked the File use-cases into those where the File isn't
> > > streamed -- i.e. is seekable.  See what I'm saying?  For reader.cpp to
> > have
> > > been functional it would have had to read the entire File into a buffer
> > > before parsing, since it could not seek().  This could be easily avoided
> > > with a mandatory EOS in the File format.  Basically:
> > >
> > > <magic number "ARROW1">
> > > <empty padding bytes [to 8 byte boundary]>
> > > <STREAMING FORMAT>
> > > *<EOS if not in stream>*
> > > <FOOTER>
> > > <FOOTER SIZE: int32>
> > > <magic number "ARROW1">
> > >
> > > -John
> > >
> > > On Tue, May 21, 2019 at 4:44 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >
> > >> hi John,
> > >>
> > >> I'm not sure I follow. The EOS you're referring to is part of the
> > >> streaming format. It's designed to be readable using an InputStream
> > >> interface that does not support seeking at all. You can see the core
> > >> logic where messages are popped off the InputStream here
> > >>
> > >> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f51
> > >> 1962c8b012/cpp/src/arrow/ipc/message.cc#L281
> > >>
> > >> If the end of the byte stream is reached, or EOS (0) is encountered,
> > >> then the stream reader stops iteration.
> > >>
> > >> - Wes
> > >>
> > >> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org> wrote:
> > >> >
> > >> > https://arrow.apache.org/docs/format/IPC.html#file-format
> > >> >
> > >> > <EOS [optional]: int32>
> > >> >
> > >> > If this stream marker is optional in the file format, doesn't this
> > >> prevent
> > >> > someone from reading the file without being able to seek() it, e.g. if
> > >> it
> > >> > is "piped in" to a program?  Or otherwise they'll have to stream in
> > the
> > >> > entire thing before they can start parsing?
> > >> >
> > >> > Any reason it can't be mandatory for a File?
> > >> >
> > >> > -John
> > >>
> > >
> >

Reply via email to