Here is valid pyarrow code that works right now:

import pyarrow as pa

rb = pa.RecordBatch.from_arrays([
    pa.from_pylist([1, 2, 3]),
    pa.from_pylist(['foo', 'bar', 'baz'])
], names=['one', 'two'])

# One batch with data, plus a zero-length slice of the same batch
batches = [rb, rb.slice(0, 0)]

# Write both batches to an in-memory stream, then read them back
stream = pa.InMemoryOutputStream()
writer = pa.StreamWriter(stream, rb.schema)
for batch in batches:
    writer.write_batch(batch)
writer.close()

reader = pa.StreamReader(stream.get_result())
results = [reader.get_next_batch(), reader.get_next_batch()]

With the proposal to disallow length-0 batches, where should this break?
Probably StreamWriter.write_batch should raise ValueError, but now we have
to write:

for batch in batches:
    if len(batch) > 0:
        writer.write_batch(batch)

That seems worse, because now the user has to think about batch sizes.
When we write:

pa.Table.from_batches(results).to_pandas()

the zero-length batches get skipped over anyhow:

  one  two
0    1  foo
1    2  bar
2    3  baz

If you pass only zero-length batches, you get:

  one  two

There are plenty of reasons why things end up zero-length, like:

* The result of a predicate evaluation that filtered out all the data
* Files (e.g. Parquet files) with 0 rows

My concern is being able to faithfully represent the in-memory results of
operations in an RPC/IPC setting. Having to deal with a null / missing
message is much worse for us, because in many cases we will still have to
construct a length-0 RecordBatch in C++; the question is whether we let
the IPC loader do it or construct a "dummy" object based on the schema
ourselves (a sketch of that construction is at the bottom of this message,
after the quoted text).

On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> Hey All,
>
> I had a quick comment on ARROW-783 that Wes responded to, and I wanted
> to elevate the conversation here for a moment.
>
> My suggestion there was that we should disallow zero-length batches.
>
> Wes thought that should be an application-level concern. I wanted to see
> what others thought.
>
> My general perspective is that zero-length batches are meaningless and
> better to disallow than to make every application have special handling
> for them. In the JIRA, Wes noted that they have to deal with zero-length
> dataframes. Despite that, I don't think there is a requirement that
> there be a 1:1 mapping between Arrow record batches and dataframes. If
> someone wants to communicate empty things, no need for them to use
> Arrow.
>
> What do others think?
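
P.S. For concreteness, here is a minimal sketch of what constructing that
"dummy" zero-length batch from a schema could look like. This is
illustrative only: it assumes array constructors that accept an explicit
type (spelled pa.array(..., type=...) in current pyarrow, rather than the
from_pylist call used above). The point is just that every consumer ends
up carrying this boilerplate if the IPC layer can't deliver the empty
batch itself.

import pyarrow as pa

# The schema we received out-of-band (e.g. from the stream header)
schema = pa.schema([('one', pa.int64()), ('two', pa.string())])

# Build one empty, correctly-typed array per field, then assemble a
# RecordBatch of length 0 that still carries the full schema
empty_columns = [pa.array([], type=field.type) for field in schema]
dummy_batch = pa.RecordBatch.from_arrays(empty_columns, names=schema.names)

assert dummy_batch.num_rows == 0
assert dummy_batch.schema.equals(schema)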