If I'm the sole voice on this perspective, I'll concede the point. I didn't even catch the increase in allowed record batch sizes as part of ARROW-661 and ARROW-679. :(
I'm of two minds on this:

- We need more applications, so making sure that we have the features available to support certain applications is good.
- Having all sorts of optional components that differ between our two primary implementations is likely to lead to a bad experience as a cross-compatible representation. It may make sense to require a reference implementation (RI) in both Java and C++ for all format features.

On Fri, Apr 14, 2017 at 4:56 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> > Since Arrow already requires a batch to be no larger than 2^16-1
> > records in size, it won't map 1:1 to an arbitrary construct.
>
> This is only true of some Arrow applications (e.g. Drill), which is why
> I think this is an application-level concern. In ARROW-661 and
> ARROW-679, we modified the metadata to permit very large record batches
> (as a strict opt-in for Arrow implementations; practically speaking,
> the official Arrow format is capped at INT32_MAX) because other C++
> applications (e.g. https://github.com/ray-project/ray) have begun using
> Arrow as the data plane for representing very large unions in shared
> memory.
>
> > We're communicating a stream of records, not batches.
>
> I don't see it this way necessarily; this is one interpretation of
> Arrow's memory model within a particular application domain. We have
> defined a series of in-memory data structures and metadata to enable
> applications to interact with them with zero copy. The stream and file
> tools are a way to faithfully transport these data structures from one
> process's memory space to another (via shared memory, a socket, etc.).
>
> So if the zero-length batches have no semantic meaning in an
> application, then either skip them on load or don't write them in the
> stream at all. This seems less burdensome than the alternative.
>
> On Fri, Apr 14, 2017 at 6:57 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>
> > To Jason's comments:
> >
> > Data and control flow should be separate. Schema (a.k.a. a head-type
> > message) is already defined separately from a batch of records. I'm
> > all for a termination message as well from a stream perspective. (I
> > don't think it makes sense to couple record batch size to
> > termination; I'm agreeing with Wes on the previously mentioned JIRA
> > on this point.)
> >
> > If you want to communicate no records, it would look like:
> >
> >     <SCHEMA> <TERM>
> >
> > If you want to communicate a bunch of data, it might look like:
> >
> >     <SCHEMA> <BATCH> <BATCH> <TERM>
> >
> > I don't fully see the purpose of having an IPC representation of an
> > empty batch in this pattern. In terms of a relationship with common
> > zero-length patterns like collections: I'm fine if an application
> > wants to deal with zero-length batches internally. I'm saying the
> > batch should be a no-op when serialized.
> >
> > Per Wes's thoughts:
> >
> > Since Arrow already requires a batch to be no larger than 2^16-1
> > records in size, it won't map 1:1 to an arbitrary construct.
> >
> > Per Wes's example:
> >
> >     for batch in batches:
> >         writer.write_batch(batch)
> >
> > I'm fine with writer.write_batch being a no-op if the batch has zero
> > length. I don't think you need to throw an error. I'd just expect
> > writer.getNonZeroBatchCount() to be available if you want to turn
> > around and read the stream.
> >
> > In general, it seems like desiring zero-length batches to be
> > faithfully maintained across a stream is attempting to couple them
> > with an external system's needs. We're communicating a stream of
> > records, not batches. The records happen to be chunked on both the
> > sending and the receiving side. A particular implementation of this
> > communication protocol might decide to subdivide or concatenate the
> > batches to better communicate the stream. Should an internal system
> > be able to express this concept? Sure. Should we allow these to be
> > communicated between two separate systems? I'm not so sure.
> >
> > Put another way, I think a zero-length batch should be serialized to
> > zero bytes. :)
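As a minimal sketch of the no-op serialization Jacques describes, the wrapper below drops zero-length batches at write time and exposes the batch count he mentions. This is illustrative only: it assumes the current pa.ipc.new_stream entry point (the modern successor to the pa.StreamWriter used elsewhere in this thread), and SkipEmptyStreamWriter and get_non_zero_batch_count are hypothetical names, not pyarrow APIs.

    import pyarrow as pa

    class SkipEmptyStreamWriter:
        # Hypothetical wrapper: zero-length batches serialize to zero
        # bytes, so the wire stream stays <SCHEMA> <BATCH>* <TERM>
        # with no empty batches in it.

        def __init__(self, sink, schema):
            self._writer = pa.ipc.new_stream(sink, schema)
            self._non_zero_batches = 0

        def write_batch(self, batch):
            if batch.num_rows == 0:
                return  # no-op rather than an error
            self._writer.write_batch(batch)
            self._non_zero_batches += 1

        def get_non_zero_batch_count(self):
            # lets a caller check whether the stream holds any data
            # before turning around and reading it
            return self._non_zero_batches

        def close(self):
            self._writer.close()

Under this scheme a receiver never observes an empty batch; whether that is a convenience or a loss of fidelity is exactly the disagreement in the quoted messages below.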
> > On Fri, Apr 14, 2017 at 3:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > > Speaking as a relative outsider, having the boundary cases for a
> > > transfer protocol be MORE restrictive than the senders and
> > > receivers is asking for boundary bugs.
> > >
> > > In this case, both the senders and the receivers think that the
> > > boundary is 0 (empty lists, empty data frames, 0 results from a
> > > database). Having the Arrow format think that the boundary is 1
> > > just adds impedance mismatch where none is necessary.
> > >
> > > On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
> > >
> > > > I'm with Wes on this one. A bunch of systems have constructs
> > > > that deal with zero-length collections, lists, iterators, etc.
> > > > These are established patterns, and everyone knows they need to
> > > > handle the empty case. Forcing applications to adopt the extra
> > > > protocol complexity of a special sentinel value to represent an
> > > > empty set would be more burdensome.
> > > >
> > > > There is the separate consideration of cases where you want to
> > > > send only a schema; it makes sense to me that someone could use
> > > > Arrow's metadata as a universal representation of schema between
> > > > systems. I think it makes sense to have a separate concept for a
> > > > schema absent a batch, but users shouldn't be forced to make all
> > > > of their APIs return one of these or a batch of data.
> > > >
> > > > On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > > Here is valid pyarrow code that works right now:
> > > > >
> > > > >     import pyarrow as pa
> > > > >
> > > > >     rb = pa.RecordBatch.from_arrays([
> > > > >         pa.from_pylist([1, 2, 3]),
> > > > >         pa.from_pylist(['foo', 'bar', 'baz'])
> > > > >     ], names=['one', 'two'])
> > > > >
> > > > >     batches = [rb, rb.slice(0, 0)]
> > > > >
> > > > >     stream = pa.InMemoryOutputStream()
> > > > >
> > > > >     writer = pa.StreamWriter(stream, rb.schema)
> > > > >     for batch in batches:
> > > > >         writer.write_batch(batch)
> > > > >     writer.close()
> > > > >
> > > > >     reader = pa.StreamReader(stream.get_result())
> > > > >
> > > > >     results = [reader.get_next_batch(), reader.get_next_batch()]
> > > > >
> > > > > With the proposal to disallow length-0 batches, where should
> > > > > this break? Probably StreamWriter.write_batch should raise
> > > > > ValueError, but now we have to write:
> > > > >
> > > > >     for batch in batches:
> > > > >         if len(batch) > 0:
> > > > >             writer.write_batch(batch)
> > > > >
> > > > > That seems worse, because now the user has to think about
> > > > > batch sizes. When we write:
> > > > >
> > > > >     pa.Table.from_batches(results).to_pandas()
> > > > >
> > > > > the 0-length batches get skipped over anyhow:
> > > > >
> > > > >        one  two
> > > > >     0    1  foo
> > > > >     1    2  bar
> > > > >     2    3  baz
> > > > >
> > > > > If you pass only zero-length batches, you get:
> > > > >
> > > > >        one  two
> > > > >
> > > > > There are plenty of reasons why things would end up
> > > > > zero-length, like:
> > > > >
> > > > > * Result of a predicate evaluation that filtered out all the data
> > > > > * Files (e.g. Parquet files) with 0 rows
> > > > >
> > > > > My concern is being able to faithfully represent the in-memory
> > > > > results of operations in an RPC/IPC setting. Having to deal
> > > > > with a null / no message is much worse for us, because we will
> > > > > in many cases still have to construct a length-0 RecordBatch
> > > > > in C++; the question is whether we're letting the IPC loader
> > > > > do it or having to construct a "dummy" object based on the
> > > > > schema.
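The "dummy" construction Wes refers to is mechanical, but it has to happen somewhere. A hedged sketch of manufacturing a length-0 RecordBatch from a schema in current pyarrow (the schema and field names here are illustrative, not from the thread):

    import pyarrow as pa

    # If an empty result arrives as "no message", the consumer must
    # build the length-0 RecordBatch from the schema by hand.
    schema = pa.schema([('one', pa.int64()), ('two', pa.string())])
    empty = pa.RecordBatch.from_arrays(
        [pa.array([], type=field.type) for field in schema],
        names=schema.names)
    assert empty.num_rows == 0

The question in the thread is whether the IPC loader performs this construction (zero-length batches allowed on the wire) or every consuming application does (zero-length batches forbidden).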
> > > > >
> > > > > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > > > Hey All,
> > > > > >
> > > > > > I had a quick comment on ARROW-783 that Wes responded to,
> > > > > > and I wanted to elevate the conversation here for a moment.
> > > > > >
> > > > > > My suggestion there was that we should disallow zero-length
> > > > > > batches.
> > > > > >
> > > > > > Wes thought that should be an application-level concern. I
> > > > > > wanted to see what others thought.
> > > > > >
> > > > > > My general perspective is that zero-length batches are
> > > > > > meaningless and better to disallow than to make every
> > > > > > application have special handling for them. In the JIRA, Wes
> > > > > > noted that they have to deal with zero-length dataframes.
> > > > > > Despite that, I don't think there is a requirement that
> > > > > > there be a 1:1 mapping between Arrow record batches and
> > > > > > dataframes. If someone wants to communicate empty things,
> > > > > > there's no need for them to use Arrow.
> > > > > >
> > > > > > What do others think?
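For reference, Wes's round-trip example maps onto the current pyarrow IPC API roughly as follows; this is a sketch assuming pa.ipc.new_stream, pa.ipc.open_stream, and pa.BufferOutputStream as the modern replacements for pa.StreamWriter, pa.StreamReader, and pa.InMemoryOutputStream. The assertions spell out the behavior debated above: the zero-length batch is preserved end to end, and Table.from_batches absorbs it without special handling.

    import pyarrow as pa

    rb = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['foo', 'bar', 'baz'])],
        names=['one', 'two'])
    batches = [rb, rb.slice(0, 0)]  # the second batch has zero rows

    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, rb.schema) as writer:
        for batch in batches:
            writer.write_batch(batch)

    reader = pa.ipc.open_stream(sink.getvalue())
    results = list(reader)

    # The zero-length batch survives the trip intact...
    assert [b.num_rows for b in results] == [3, 0]
    # ...and is skipped over harmlessly when building a Table.
    assert pa.Table.from_batches(results).num_rows == 3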