I agree with Jacques that the expansion of the allowed batch sizes is a significant change to the format.
Optional features do risk fragmenting the community, but that risk needs to be balanced against the benefits provided to a particular user of Arrow. The requirement that individual columns occupy contiguous memory (even just virtually contiguous in user space) makes this usage pattern seem undesirable to me. If you are currently consuming half of the memory of your system and want to add some data to your Arrow structures, you would need to re-allocate large chunks of memory at expanded sizes. You could make this work to an extent by resizing each column individually, to avoid prematurely running out of memory, but that seems undesirable relative to an approach that just appends a new batch of data to the end of a logical wrapper around several batches.

Am I missing something? Is the case specifically one where the data size is known up front and doesn't need to grow throughout processing? Do the results of the intermediate computation represent a small amount of data?
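To make that concrete, here is a rough sketch of the append-a-batch pattern I have in mind, using the pa.Table.from_batches wrapper that appears in Wes's example further down. more_data() and next_arrays() are hypothetical stand-ins for an incremental data source, so treat this as pseudocode for the access pattern rather than working code:

import pyarrow as pa

batches = []
while more_data():                     # hypothetical incremental source
    # each increment is its own modest allocation; nothing already
    # accumulated has to move or be copied
    batches.append(pa.RecordBatch.from_arrays(next_arrays(),
                                              names=['one', 'two']))

# the logical table just wraps the accumulated chunks, so growing it
# never requires re-allocating large contiguous column buffers
table = pa.Table.from_batches(batches)

Compare that to resizing contiguous columns in place, where each append risks a large re-allocation and copy exactly when memory is tightest.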
On Fri, Apr 14, 2017 at 6:13 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> If I'm the sole voice on this perspective, I'll concede the point.
>
> I didn't even catch the increase in allowed record batch sizes as part
> of ARROW-661 and ARROW-679. :(
>
> I'm of a split mind on this:
> - We need more applications, so making sure that we have the features
> available to support certain applications is good
> - Having all sorts of optional components that differ between our two
> primary implementations is likely to lead to a bad experience as a
> cross-compatible representation.
>
> It may make sense to require having an RI for each of Java and C++ for
> all format features.
>
> On Fri, Apr 14, 2017 at 4:56 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > > Since Arrow already requires a batch to be no larger than 2^16-1
> > > records in size, it won't map 1:1 to an arbitrary construct.
> >
> > This is only true of some Arrow applications (e.g. Drill), which is
> > why I think this is an application-level concern. In ARROW-661 and
> > ARROW-679, we modified the metadata to permit very large record
> > batches (as a strict opt-in for Arrow implementations; e.g.
> > practically speaking the official Arrow format is capped at
> > INT32_MAX) because other C++ applications (e.g.
> > https://github.com/ray-project/ray) have begun using Arrow as the
> > data plane for representing very large unions in shared memory.
> >
> > > We're communicating a stream of records, not batches.
> >
> > I don't see it this way necessarily; this is one interpretation of
> > Arrow's memory model within a particular application domain. We have
> > defined a series of in-memory data structures and metadata to enable
> > applications to interact with them with zero-copy. The stream and
> > file tools are a way to faithfully transport these data structures
> > from one process's memory space to another (either via shared
> > memory, or a socket, etc.).
> >
> > So if the zero-length batches have no semantic meaning in an
> > application, then either skip them on load or don't write them in
> > the stream at all. This seems less burdensome than the alternative.
> >
> > On Fri, Apr 14, 2017 at 6:57 PM, Jacques Nadeau <jacq...@apache.org>
> > wrote:
> >
> > > To Jason's comments:
> > >
> > > Data and control flow should be separate. Schema (a.k.a. a
> > > head-type message) is already defined separately from a batch of
> > > records. I'm all for a termination message as well from a stream
> > > perspective. (I don't think it makes sense to couple record batch
> > > size to termination; I'm agreeing with Wes on the previously
> > > mentioned JIRA on this point.)
> > >
> > > If you want to communicate no records, it would look like:
> > >
> > > <SCHEMA> <TERM>
> > >
> > > If you want to communicate a bunch of data, it might look like:
> > >
> > > <SCHEMA> <BATCH> <BATCH> <TERM>
> > >
> > > I don't fully see the purpose of having an IPC representation of
> > > an empty batch in this pattern. In terms of a relationship with
> > > common zero-length patterns like collections: I'm fine if an
> > > application wants to deal with zero-length batches internally. I'm
> > > saying the batch should be a no-op when serialized.
> > >
> > > Per Wes's thoughts:
> > >
> > > Since Arrow already requires a batch to be no larger than 2^16-1
> > > records in size, it won't map 1:1 to an arbitrary construct.
> > >
> > > Per Wes's example:
> > >
> > > for batch in batches:
> > >     writer.write_batch(batch)
> > >
> > > I'm fine with writer.write_batch being a no-op if the batch has
> > > zero length. I don't think you need to throw an error. I'd just
> > > expect writer.getNonZeroBatchCount() to be available if you want
> > > to turn around and read the stream.
> > >
> > > In general, it seems like desiring zero-length batches to be
> > > faithfully maintained across a stream is attempting to couple them
> > > with an external system's needs. We're communicating a stream of
> > > records, not batches. The records happen to be chunked on both the
> > > sending and the receiving side. A particular implementation of
> > > this communication protocol might decide to subdivide or
> > > concatenate the batches to better communicate the stream. Should
> > > an internal system be able to express this concept? Sure. Should
> > > we allow these to be communicated between two separate systems?
> > > I'm not so sure.
> > >
> > > Put another way, I think a zero-length batch should be serialized
> > > to zero bytes. :)
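To sketch the no-op idea above in pyarrow terms: a thin wrapper over the stream writer from Wes's example below could drop empty batches at serialization time. SkipEmptyWriter and its method names are hypothetical here, not an existing API:

class SkipEmptyWriter:
    # assumes the wrapped writer handles <SCHEMA> up front and emits
    # <TERM> on close; zero-length batches serialize to zero bytes
    def __init__(self, writer):
        self._writer = writer
        self._nonzero = 0

    def write_batch(self, batch):
        if len(batch) == 0:
            return                     # no-op, per the suggestion above
        self._writer.write_batch(batch)
        self._nonzero += 1

    def get_nonzero_batch_count(self):
        return self._nonzero

    def close(self):
        self._writer.close()

Used as SkipEmptyWriter(pa.StreamWriter(stream, schema)), the wire sequence stays <SCHEMA> <BATCH> ... <TERM>, and get_nonzero_batch_count() tells you whether any data actually went out.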
> > > On Fri, Apr 14, 2017 at 3:26 PM, Ted Dunning
> > > <ted.dunn...@gmail.com> wrote:
> > >
> > > > Speaking as a relative outsider, having the boundary cases for a
> > > > transfer protocol be MORE restrictive than the senders and
> > > > receivers is asking for boundary bugs.
> > > >
> > > > In this case, both the senders and receivers think that the
> > > > boundary is 0 (empty lists, empty data frames, 0 results from a
> > > > database). Having the Arrow format think that the boundary is 1
> > > > just adds impedance mismatch where none is necessary.
> > > >
> > > > On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse
> > > > <altekruseja...@gmail.com> wrote:
> > > >
> > > > > I'm with Wes on this one. A bunch of systems have constructs
> > > > > that deal with zero-length collections, lists, iterators, etc.
> > > > > These are established patterns, and everyone knows they need
> > > > > to handle the empty case. Forcing applications to deal with
> > > > > the unnecessary protocol complexity of a special sentinel
> > > > > value to represent an empty set would be more burdensome.
> > > > >
> > > > > There is the separate consideration of cases where you want to
> > > > > send only a schema; it makes sense to me that someone could
> > > > > use Arrow's metadata as a universal representation of schema
> > > > > between systems. I think it makes sense to have a separate
> > > > > concept for schema absent a batch, but users shouldn't be
> > > > > forced to make all of their APIs return one of these, or a
> > > > > batch of data.
> > > > >
> > > > > On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney
> > > > > <wesmck...@gmail.com> wrote:
> > > > >
> > > > > > Here is valid pyarrow code that works right now:
> > > > > >
> > > > > > import pyarrow as pa
> > > > > >
> > > > > > rb = pa.RecordBatch.from_arrays([
> > > > > >     pa.from_pylist([1, 2, 3]),
> > > > > >     pa.from_pylist(['foo', 'bar', 'baz'])
> > > > > > ], names=['one', 'two'])
> > > > > >
> > > > > > batches = [rb, rb.slice(0, 0)]
> > > > > >
> > > > > > stream = pa.InMemoryOutputStream()
> > > > > >
> > > > > > writer = pa.StreamWriter(stream, rb.schema)
> > > > > > for batch in batches:
> > > > > >     writer.write_batch(batch)
> > > > > > writer.close()
> > > > > >
> > > > > > reader = pa.StreamReader(stream.get_result())
> > > > > >
> > > > > > results = [reader.get_next_batch(), reader.get_next_batch()]
> > > > > >
> > > > > > With the proposal to disallow length-0 batches, where should
> > > > > > this break? Probably StreamWriter.write_batch should raise
> > > > > > ValueError, but now we have to write:
> > > > > >
> > > > > > for batch in batches:
> > > > > >     if len(batch) > 0:
> > > > > >         writer.write_batch(batch)
> > > > > >
> > > > > > That seems worse, because now the user has to think about
> > > > > > batch sizes. When we write:
> > > > > >
> > > > > > pa.Table.from_batches(results).to_pandas()
> > > > > >
> > > > > > the 0-length batches get skipped over anyhow:
> > > > > >
> > > > > >    one  two
> > > > > > 0    1  foo
> > > > > > 1    2  bar
> > > > > > 2    3  baz
> > > > > >
> > > > > > If you pass only zero-length batches, you get:
> > > > > >
> > > > > >    one  two
> > > > > >
> > > > > > There are plenty of reasons why things would end up
> > > > > > zero-length, like:
> > > > > >
> > > > > > * Result of a predicate evaluation that filtered out all the
> > > > > > data
> > > > > > * Files (e.g. Parquet files) with 0 rows
> > > > > >
> > > > > > My concern is being able to faithfully represent the
> > > > > > in-memory results of operations in an RPC/IPC setting.
> > > > > > Having to deal with a null / no message is much worse for
> > > > > > us, because we will in many cases still have to construct a
> > > > > > length-0 RecordBatch in C++; the question is whether we're
> > > > > > letting the IPC loader do it or having to construct a
> > > > > > "dummy" object based on the schema.
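To make the "dummy" object point above concrete: if producers drop zero-length batches, a consumer that needs a concrete (possibly zero-row) result has to synthesize one from the schema. A rough sketch, assuming the reader signals end-of-stream by raising StopIteration; empty_batch is a hypothetical helper (with a prototype batch in hand, rb.slice(0, 0) would serve):

def read_all(reader, schema):
    results = []
    while True:
        try:
            results.append(reader.get_next_batch())
        except StopIteration:          # assumed end-of-stream signal
            break
    if not results:
        # synthesize the zero-row batch from the schema alone
        # (empty_batch is hypothetical)
        results.append(empty_batch(schema))
    return results

Whether the IPC loader constructs that object or the application does, someone has to.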
> > > > > > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau
> > > > > > <jacq...@apache.org> wrote:
> > > > > >
> > > > > > > Hey All,
> > > > > > >
> > > > > > > I had a quick comment on ARROW-783 that Wes responded to,
> > > > > > > and I wanted to elevate the conversation here for a
> > > > > > > moment.
> > > > > > >
> > > > > > > My suggestion there was that we should disallow
> > > > > > > zero-length batches.
> > > > > > >
> > > > > > > Wes thought that should be an application-level concern. I
> > > > > > > wanted to see what others thought.
> > > > > > >
> > > > > > > My general perspective is that zero-length batches are
> > > > > > > meaningless and better to disallow than to make every
> > > > > > > application have special handling for them. In the JIRA,
> > > > > > > Wes noted that they have to deal with zero-length
> > > > > > > dataframes. Despite that, I don't think there is a
> > > > > > > requirement that there should be a 1:1 mapping between
> > > > > > > Arrow record batches and dataframes. If someone wants to
> > > > > > > communicate empty things, no need for them to use Arrow.
> > > > > > >
> > > > > > > What do others think?