It seems like we could address these concerns by adding alternate write/read APIs that do the dropping (on write) / skipping (on load) automatically, so it doesn't have to bubble up into application logic.
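
For concreteness, here is a rough sketch of what such wrapper APIs could look like, reusing the 2017-era pyarrow stream classes from Wes's example further down the thread. The wrapper class names and the drop/skip behavior are hypothetical, not existing pyarrow APIs:

    import pyarrow as pa

    class DropEmptyStreamWriter:
        """Hypothetical wrapper: drops zero-length batches on write."""

        def __init__(self, sink, schema):
            self._writer = pa.StreamWriter(sink, schema)

        def write_batch(self, batch):
            if len(batch) > 0:  # the dropping happens here, not in application code
                self._writer.write_batch(batch)

        def close(self):
            self._writer.close()

    class SkipEmptyStreamReader:
        """Hypothetical wrapper: skips zero-length batches on load."""

        def __init__(self, source):
            self._reader = pa.StreamReader(source)

        def get_next_batch(self):
            # Assumes get_next_batch raises StopIteration at end of stream,
            # as when iterating over the reader.
            batch = self._reader.get_next_batch()
            while len(batch) == 0:
                batch = self._reader.get_next_batch()
            return batch

Application code could then keep writing `for batch in batches: writer.write_batch(batch)` unchanged, with the empty-batch policy confined to the wrapper.
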
On Fri, Apr 14, 2017 at 7:56 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> > Since Arrow already requires a batch to be no larger than 2^16-1
> > records in size, it won't map 1:1 to an arbitrary construct.
>
> This is only true of some Arrow applications (e.g. Drill), which is why I
> think this is an application-level concern. In ARROW-661 and ARROW-679, we
> modified the metadata to permit very large record batches (as a strict
> opt-in for Arrow implementations; practically speaking the official Arrow
> format is capped at INT32_MAX) because other C++ applications (e.g.
> https://github.com/ray-project/ray) have begun using Arrow as the data
> plane for representing very large unions in shared memory.
>
> > We're communicating a stream of records, not batches.
>
> I don't see it this way necessarily; this is one interpretation of Arrow's
> memory model within a particular application domain. We have defined a
> series of in-memory data structures and metadata to enable applications
> to interact with them with zero copy. The stream and file tools are a way
> to faithfully transport these data structures from one process's memory
> space to another (via shared memory, a socket, etc.).
>
> So if zero-length batches have no semantic meaning in an application,
> then either skip them on load or don't write them to the stream at all.
> This seems less burdensome than the alternative.
>
> On Fri, Apr 14, 2017 at 6:57 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>
>> To Jason's comments:
>>
>> Data and control flow should be separate. Schema (a.k.a. a head-type
>> message) is already defined separately from a batch of records. I'm all
>> for a termination message as well from a stream perspective. (I don't
>> think it makes sense to couple record batch size to termination; I'm
>> agreeing with Wes on the previously mentioned JIRA on this point.)
>>
>> If you want to communicate no records, it would look like:
>>
>> <SCHEMA> <TERM>
>>
>> If you want to communicate a bunch of data, it might look like:
>>
>> <SCHEMA> <BATCH> <BATCH> <TERM>
>>
>> I don't fully see the purpose of having an IPC representation of an
>> empty batch in this pattern. As for the relationship with common
>> zero-length patterns like collections: I'm fine if an application wants
>> to deal with zero-length batches internally. I'm saying the batch should
>> be a no-op when serialized.
>>
>> Per Wes's thoughts:
>>
>> Since Arrow already requires a batch to be no larger than 2^16-1 records
>> in size, it won't map 1:1 to an arbitrary construct.
>>
>> Per Wes's example:
>>
>> for batch in batches:
>>     writer.write_batch(batch)
>>
>> I'm fine with writer.write_batch being a no-op if the batch has zero
>> length. I don't think you need to throw an error. I'd just expect
>> writer.getNonZeroBatchCount() to be available if you want to turn around
>> and read the stream.
>>
>> In general, it seems like desiring zero-length batches to be faithfully
>> maintained across a stream is an attempt to couple them to an external
>> system's needs. We're communicating a stream of records, not batches.
>> The records happen to be chunked on both the sending and the receiving
>> side. A particular implementation of this communication protocol might
>> decide to subdivide or concatenate the batches to better communicate the
>> stream. Should an internal system be able to express this concept? Sure.
>> Should we allow these to be communicated between two separate systems?
>> I'm not so sure.
>>
>> Put another way, I think a zero-length batch should be serialized to
>> zero bytes. :)
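
Taken literally, "serialized to zero bytes" means an empty stream and a stream holding only zero-length batches come out byte-for-byte identical, both just <SCHEMA> <TERM>. A sketch using the stream API from this thread (Buffer.to_pybytes for the byte comparison is an assumption about the buffer API of the time):

    import pyarrow as pa

    def stream_bytes(schema, batches):
        # Serialize batches to an in-memory Arrow stream, applying the
        # proposed rule: a zero-length batch contributes zero bytes.
        sink = pa.InMemoryOutputStream()
        writer = pa.StreamWriter(sink, schema)
        for batch in batches:
            if len(batch) > 0:
                writer.write_batch(batch)
        writer.close()
        return sink.get_result().to_pybytes()

    # Under this rule these would be identical <SCHEMA> <TERM> streams:
    # stream_bytes(rb.schema, []) == stream_bytes(rb.schema, [rb.slice(0, 0)])

The writer.getNonZeroBatchCount() Jacques mentions would then just be a counter incremented inside that if block.
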
>>
>> On Fri, Apr 14, 2017 at 3:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>> > Speaking as a relative outsider, having the boundary cases for a
>> > transfer protocol be MORE restrictive than the senders and receivers
>> > is asking for boundary bugs.
>> >
>> > In this case, both the senders and the receivers think that the
>> > boundary is 0 (empty lists, empty data frames, 0 results from a
>> > database). Having the Arrow format think that the boundary is 1 just
>> > adds impedance mismatch where none is necessary.
>> >
>> > On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>> >
>> > > I'm with Wes on this one. A bunch of systems have constructs that
>> > > deal with zero-length collections, lists, iterators, etc. These are
>> > > established patterns; everyone knows they need to handle the empty
>> > > case. Forcing applications to adopt the unnecessary protocol
>> > > complexity of a special sentinel value representing an empty set
>> > > would be more burdensome.
>> > >
>> > > There is the separate consideration of cases where you want to send
>> > > only a schema; it makes sense to me that someone could use Arrow's
>> > > metadata as a universal representation of schema between systems. I
>> > > think it makes sense to have a separate concept for a schema absent
>> > > a batch, but users shouldn't be forced to make all of their APIs
>> > > return either one of these or a batch of data.
>> > >
>> > > On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> > >
>> > > > Here is valid pyarrow code that works right now:
>> > > >
>> > > > import pyarrow as pa
>> > > >
>> > > > rb = pa.RecordBatch.from_arrays([
>> > > >     pa.from_pylist([1, 2, 3]),
>> > > >     pa.from_pylist(['foo', 'bar', 'baz'])
>> > > > ], names=['one', 'two'])
>> > > >
>> > > > batches = [rb, rb.slice(0, 0)]
>> > > >
>> > > > stream = pa.InMemoryOutputStream()
>> > > >
>> > > > writer = pa.StreamWriter(stream, rb.schema)
>> > > > for batch in batches:
>> > > >     writer.write_batch(batch)
>> > > > writer.close()
>> > > >
>> > > > reader = pa.StreamReader(stream.get_result())
>> > > >
>> > > > results = [reader.get_next_batch(), reader.get_next_batch()]
>> > > >
>> > > > With the proposal to disallow length-0 batches, where should this
>> > > > break? Probably StreamWriter.write_batch should raise ValueError,
>> > > > but now we have to write:
>> > > >
>> > > > for batch in batches:
>> > > >     if len(batch) > 0:
>> > > >         writer.write_batch(batch)
>> > > >
>> > > > That seems worse, because now the user has to think about batch
>> > > > sizes. When we write:
>> > > >
>> > > > pa.Table.from_batches(results).to_pandas()
>> > > >
>> > > > the 0-length batches get skipped over anyhow:
>> > > >
>> > > >    one  two
>> > > > 0    1  foo
>> > > > 1    2  bar
>> > > > 2    3  baz
>> > > >
>> > > > If you pass only zero-length batches, you get a frame with just
>> > > > the column headers:
>> > > >
>> > > >    one  two
>> > > >
>> > > > There are plenty of reasons why things would end up zero-length,
>> > > > like:
>> > > >
>> > > > * The result of a predicate evaluation that filtered out all the
>> > > >   data
>> > > > * Files (e.g. Parquet files) with 0 rows
>> > > >
>> > > > My concern is being able to faithfully represent the in-memory
>> > > > results of operations in an RPC/IPC setting. Having to deal with
>> > > > a null / no message is much worse for us, because we will in many
>> > > > cases still have to construct a length-0 RecordBatch in C++; the
>> > > > question is whether we're letting the IPC loader do it or having
>> > > > to construct a "dummy" object based on the schema.
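
On the "dummy object based on the schema" point: if the IPC loader never hands back a length-0 batch, each consumer ends up writing a helper roughly like the one below. This sketch uses the present-day pa.array constructor with an explicit type (the from_pylist call above cannot infer a type from an empty list), so treat the details as illustrative rather than as the API of the time:

    import pyarrow as pa

    def empty_batch(schema):
        # Build a zero-row RecordBatch whose columns match the schema's types.
        arrays = [pa.array([], type=field.type) for field in schema]
        names = [field.name for field in schema]
        return pa.RecordBatch.from_arrays(arrays, names=names)

Either way the empty batch gets materialized; disallowing it in the stream only moves that construction out of the IPC loader and into every application.
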
>> > > >
>> > > > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> > > >
>> > > > > Hey All,
>> > > > >
>> > > > > I had a quick comment on ARROW-783 that Wes responded to, and I
>> > > > > wanted to elevate the conversation here for a moment.
>> > > > >
>> > > > > My suggestion there was that we should disallow zero-length
>> > > > > batches.
>> > > > >
>> > > > > Wes thought that should be an application-level concern. I
>> > > > > wanted to see what others thought.
>> > > > >
>> > > > > My general perspective is that zero-length batches are
>> > > > > meaningless, and it is better to disallow them than to make
>> > > > > every application add special handling for them. In the JIRA,
>> > > > > Wes noted that they have to deal with zero-length dataframes.
>> > > > > Despite that, I don't think there is a requirement that there
>> > > > > be a 1:1 mapping between Arrow record batches and dataframes.
>> > > > > If someone wants to communicate empty things, there is no need
>> > > > > for them to use Arrow.
>> > > > >
>> > > > > What do others think?