Hi John,

If the IPC code segfaults on invalid input then it's worth opening an
issue on JIRA.

Regards

Antoine.


On 21/05/2019 at 23:52, John Muehlhausen wrote:
> Wes,
>
> Check out reader.cpp.  It seg faults when it gets to the next
> message-that-is-not-a-message... it is a footer.  But I have no way to
> know this in reader.cpp because I'm piping the File in via stdin.
>
> In seeker.cpp I seek to the end and figure out where the footer is (this
> is a py-arrow-written file) and indeed it is at the offset where my
> "streamed File" reader bombed out.  If EOS were mandatory at this
> location it would have been fine... I would have said "oh, time for the
> footer!"
>
> Basically what I'm saying is that we can't assume that File won't be
> processed as a stream.  In an actual non-file stream it is either EOS or
> end-of-stream.  But with a file-as-stream there is more data and we have
> to know it isn't the stream anymore.
>
> Otherwise we've locked the File use-cases into those where the File
> isn't streamed -- i.e. is seekable.  See what I'm saying?  For
> reader.cpp to have been functional it would have had to read the entire
> File into a buffer before parsing, since it could not seek().  This
> could be easily avoided with a mandatory EOS in the File format.
> Basically:
>
> <magic number "ARROW1">
> <empty padding bytes [to 8 byte boundary]>
> <STREAMING FORMAT>
> *<EOS if not in stream>*
> <FOOTER>
> <FOOTER SIZE: int32>
> <magic number "ARROW1">
>
> -John
>
> On Tue, May 21, 2019 at 4:44 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi John,
> >
> > I'm not sure I follow. The EOS you're referring to is part of the
> > streaming format. It's designed to be readable using an InputStream
> > interface that does not support seeking at all. You can see the core
> > logic where messages are popped off the InputStream here
> >
> > https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281
> >
> > If the end of the byte stream is reached, or EOS (0) is encountered,
> > then the stream reader stops iteration.
> >
> > - Wes
> >
> > On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org> wrote:
> > >
> > > https://arrow.apache.org/docs/format/IPC.html#file-format
> > >
> > > <EOS [optional]: int32>
> > >
> > > If this stream marker is optional in the file format, doesn't this
> > > prevent someone from reading the file without being able to seek()
> > > it, e.g. if it is "piped in" to a program?  Or otherwise they'll
> > > have to stream in the entire thing before they can start parsing?
> > >
> > > Any reason it can't be mandatory for a File?
> > >
> > > -John
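
For illustration, here is a rough Python sketch of the stopping rule described in the thread: a forward-only reader that stops on end of stream or on an int32 EOS value of 0. It is not the Arrow implementation. The framing is deliberately simplified to an assumed `<int32 length><payload>` layout (real Arrow messages carry flatbuffer metadata plus a body), and the names `iter_framed_messages` and `frame` are hypothetical helpers made up for this example.

```python
import io
import struct


def frame(payload):
    """Prefix a payload with its little-endian int32 length (assumed framing)."""
    return struct.pack("<i", len(payload)) + payload


def iter_framed_messages(stream):
    """Yield payloads from a non-seekable stream, using read() only.

    Iteration stops when the byte stream ends or when a length of 0
    (the EOS marker) is read.
    """
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return  # end of the byte stream
        (length,) = struct.unpack("<i", prefix)
        if length == 0:
            return  # EOS marker: whatever follows is not a message
        if length < 0:
            raise ValueError("negative message length")
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated message")
        yield payload


EOS = struct.pack("<i", 0)

# With an EOS marker before the trailing bytes, the reader stops cleanly
# and the stream is left positioned at the start of the trailer.
with_eos = io.BytesIO(frame(b"schema") + frame(b"batch") + EOS + b"FOOTER")
messages = list(iter_framed_messages(with_eos))
trailer = with_eos.read()

# Without it, the first four trailer bytes are misread as a length prefix.
without_eos = io.BytesIO(frame(b"schema") + frame(b"batch") + b"FOOTER")
try:
    list(iter_framed_messages(without_eos))
    misparsed = False
except ValueError:
    misparsed = True
```

In this toy framing the reader without an EOS marker fails with an exception. A reader that does not validate what it has read could fail less gracefully, which is the situation John describes for a File piped in via stdin.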