Speaking as a relative outsider: making the boundary cases for a transfer protocol MORE restrictive than those of its senders and receivers is asking for boundary bugs.

In this case, both the senders and the receivers think that the boundary is 0 (empty lists, empty data frames, zero results from a database). Having the Arrow format think that the boundary is 1 just adds an impedance mismatch where none is necessary.
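To make that concrete: an empty batch is already a perfectly well-formed value, with a schema and zero rows, so neither side needs a sentinel. A minimal sketch, assuming the newer pyarrow names (pa.array rather than the from_pylist used in the quoted code below):

    import pyarrow as pa

    # A zero-row batch carries full schema information; no sentinel
    # value is required to represent "no results".
    empty = pa.RecordBatch.from_arrays(
        [pa.array([], type=pa.int64()), pa.array([], type=pa.string())],
        names=['one', 'two'])

    assert empty.num_rows == 0
    assert empty.schema.names == ['one', 'two']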
On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:

> I'm with Wes on this one. A bunch of systems have constructs that deal
> with zero-length collections, lists, iterators, etc. These are
> established patterns, and everyone knows they need to handle the empty
> case. Forcing the unnecessary protocol complexity of a special sentinel
> value for the empty set onto applications would be more burdensome.
>
> There is a separate consideration for cases where you want to send only
> a schema; it makes sense to me that someone could use Arrow's metadata
> as a universal representation of schemas between systems. I think it
> makes sense to have a separate concept for a schema absent a batch, but
> users shouldn't be forced to make all of their APIs return either one
> of those or a batch of data.
>
> On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Here is valid pyarrow code that works right now:
> >
> >     import pyarrow as pa
> >
> >     rb = pa.RecordBatch.from_arrays([
> >         pa.from_pylist([1, 2, 3]),
> >         pa.from_pylist(['foo', 'bar', 'baz'])
> >     ], names=['one', 'two'])
> >
> >     batches = [rb, rb.slice(0, 0)]
> >
> >     stream = pa.InMemoryOutputStream()
> >
> >     writer = pa.StreamWriter(stream, rb.schema)
> >     for batch in batches:
> >         writer.write_batch(batch)
> >     writer.close()
> >
> >     reader = pa.StreamReader(stream.get_result())
> >
> >     results = [reader.get_next_batch(), reader.get_next_batch()]
> >
> > With the proposal to disallow length-0 batches, where should this
> > break? Probably StreamWriter.write_batch should raise ValueError, but
> > now we have to write:
> >
> >     for batch in batches:
> >         if len(batch) > 0:
> >             writer.write_batch(batch)
> >
> > That seems worse, because now the user has to think about batch
> > sizes. When we write:
> >
> >     pa.Table.from_batches(results).to_pandas()
> >
> > the 0-length batches get skipped over anyhow:
> >
> >        one  two
> >     0    1  foo
> >     1    2  bar
> >     2    3  baz
> >
> > If you pass only zero-length batches, you get:
> >
> >        one  two
> >
> > There are plenty of reasons why things would end up zero-length, like:
> >
> > * Result of a predicate evaluation that filtered out all the data
> > * Files (e.g. Parquet files) with 0 rows
> >
> > My concern is being able to faithfully represent the in-memory
> > results of operations in an RPC/IPC setting. Having to deal with a
> > null / no message is much worse for us, because we will in many cases
> > still have to construct a length-0 RecordBatch in C++; the question
> > is whether we're letting the IPC loader do it or having to construct
> > a "dummy" object based on the schema.
> >
> > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > > Hey All,
> > >
> > > I had a quick comment on ARROW-783 that Wes responded to, and I
> > > wanted to elevate the conversation here for a moment.
> > >
> > > My suggestion there was that we should disallow zero-length batches.
> > >
> > > Wes thought that should be an application-level concern. I wanted to
> > > see what others thought.
> > >
> > > My general perspective is that zero-length batches are meaningless
> > > and better to disallow than to make every application add special
> > > handling for them.
> > >
> > > In the JIRA, Wes noted that they have to deal with zero-length
> > > dataframes. Despite that, I don't think there is a requirement that
> > > there be a 1:1 mapping between Arrow record batches and dataframes.
> > > If someone wants to communicate empty things, there's no need for
> > > them to use Arrow.
> > >
> > > What do others think?
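For what it's worth, the round trip Wes shows above works with the zero-length batch included. A minimal sketch, assuming the newer pyarrow IPC names (pa.BufferOutputStream, pa.ipc.new_stream, pa.ipc.open_stream) stand in for the InMemoryOutputStream/StreamWriter/StreamReader calls in his snippet:

    import pyarrow as pa

    # Build a 3-row batch plus a zero-length slice of it.
    rb = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['foo', 'bar', 'baz'])],
        names=['one', 'two'])
    batches = [rb, rb.slice(0, 0)]

    # Write both batches to an in-memory stream; the empty one needs
    # no special casing on the sender side.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, rb.schema) as writer:
        for batch in batches:
            writer.write_batch(batch)

    # Read them back: the zero-length batch round-trips like any other.
    reader = pa.ipc.open_stream(sink.getvalue())
    results = list(reader)
    assert [b.num_rows for b in results] == [3, 0]

Either way, the boundary the receiver has to handle is "a batch with zero rows", which is exactly the boundary every list and dataframe library already exposes.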