Hi John,

It the IPC code segfaults on invalid input then it's worth opening an
issue on JIRA.

Regards

Antoine.


Le 21/05/2019 à 23:52, John Muehlhausen a écrit :
> Wes,
> 
> Check out reader.cpp.  It seg faults when it gets to the next
> message-that-is-not-a-message... it is a footer.  But I have no way to
> know this in reader.cpp because I'm piping the File in via stdin.
> 
> In seeker.cpp I seek to the end and figure out where the footer is (this
> is a py-arrow-written file) and indeed it is at the offset where my
> "streamed File" reader bombed out.  If EOS were mandatory at this
> location it would have been fine... I would have said "oh, time for the
> footer!"
> 
> Basically what I'm saying is that we can't assume that File won't be
> processed as a stream.  In an actual non-file stream it is either EOS or
> end-of-stream.  But with a file-as-stream there is more data and we have
> to know it isn't the stream anymore.
> 
> Otherwise we've locked the File use-cases into those where the File
> isn't streamed -- i.e. is seekable.  See what I'm saying?  For
> reader.cpp to have been functional it would have had to read the entire
> File into a buffer before parsing, since it could not seek().  This
> could be easily avoided with a mandatory EOS in the File format.  Basically:
> 
> <magic number "ARROW1">
> <empty padding bytes [to 8 byte boundary]>
> <STREAMING FORMAT>
> *<EOS if not in stream>*
> <FOOTER>
> <FOOTER SIZE: int32>
> <magic number "ARROW1">
> 
> -John
> 
> On Tue, May 21, 2019 at 4:44 PM Wes McKinney <wesmck...@gmail.com
> <mailto:wesmck...@gmail.com>> wrote:
> 
>     hi John,
> 
>     I'm not sure I follow. The EOS you're referring to is part of the
>     streaming format. It's designed to be readable using an InputStream
>     interface that does not support seeking at all. You can see the core
>     logic where messages are popped off the InputStream here
> 
>     
> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281
> 
>     If the end of the byte stream is reached, or EOS (0) is encountered,
>     then the stream reader stops iteration.
> 
>     - Wes
> 
>     On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org
>     <mailto:j...@jgm.org>> wrote:
>     >
>     > https://arrow.apache.org/docs/format/IPC.html#file-format
>     >
>     > <EOS [optional]: int32>
>     >
>     > If this stream marker is optional in the file format, doesn't this
>     prevent
>     > someone from reading the file without being able to seek() it,
>     e.g. if it
>     > is "piped in" to a program?  Or otherwise they'll have to stream
>     in the
>     > entire thing before they can start parsing?
>     >
>     > Any reason it can't be mandatory for a File?
>     >
>     > -John
> 

Reply via email to