I'm with Wes on this one. Plenty of systems have constructs that deal with
zero-length collections, lists, iterators, and so on. These are established
patterns, and everyone knows to handle the empty case. Forcing applications
to take on extra protocol complexity in the form of a special sentinel value
representing an empty set would be more burdensome.
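
A zero-length batch flows through the same code path as any other batch, so
no sentinel is required. A minimal sketch (consume is a hypothetical name,
not an established API):

def consume(batches):
    # Tally rows across a stream of record batches; a zero-length
    # batch contributes 0 rows and needs no special handling.
    total_rows = 0
    for batch in batches:
        total_rows += batch.num_rows
    return total_rows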

There is a separate consideration for cases where you want to send only a
schema; it makes sense to me that someone could use Arrow's metadata as a
universal representation of a schema between systems. I think it makes sense
to have a separate concept for a schema absent a batch, but users shouldn't
be forced to make all of their APIs return either one of these or a batch of
data.
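
As a rough sketch of what a schema-only exchange could look like
(Schema.serialize and pyarrow.ipc.read_schema are modern pyarrow APIs,
not something settled in this thread):

import pyarrow as pa

# Ship only the schema, with no record batches attached.
schema = pa.schema([
    pa.field('one', pa.int64()),
    pa.field('two', pa.string()),
])

buf = schema.serialize()                 # schema-only IPC message
roundtripped = pa.ipc.read_schema(buf)   # recovered on the receiving side
assert roundtripped.equals(schema)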

On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> Here is valid pyarrow code that works right now:
>
> import pyarrow as pa
>
> # Build a two-column batch, then pair it with a zero-length slice of itself.
> rb = pa.RecordBatch.from_arrays([
>     pa.from_pylist([1, 2, 3]),
>     pa.from_pylist(['foo', 'bar', 'baz'])
> ], names=['one', 'two'])
>
> batches = [rb, rb.slice(0, 0)]  # the second batch has zero rows
>
> # Write both batches to an in-memory stream...
> stream = pa.InMemoryOutputStream()
>
> writer = pa.StreamWriter(stream, rb.schema)
> for batch in batches:
>     writer.write_batch(batch)
> writer.close()
>
> # ...and read them back.
> reader = pa.StreamReader(stream.get_result())
>
> results = [reader.get_next_batch(), reader.get_next_batch()]
>
> With the proposal to disallow length-0 batches, where should this break?
> Probably StreamWriter.write_batch should raise ValueError, but now we have
> to write:
>
> for batch in batches:
>     if len(batch) > 0:
>         writer.write_batch(batch)
>
> That seems worse, because now the user has to think about batch sizes. When
> we write:
>
> pa.Table.from_batches(results).to_pandas()
>
> the zero-length batches get skipped over anyhow:
>
>    one  two
> 0    1  foo
> 1    2  bar
> 2    3  baz
>
> If you pass only zero-length batches, you get:
>
> Empty DataFrame
> Columns: [one, two]
> Index: []
> There are plenty of reasons why things would end up zero-length, for
> example (see the sketch after this list):
>
> * The result of a predicate evaluation that filtered out all the data
> * Files (e.g. Parquet files) with 0 rows
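>
> As an illustration (pyarrow.compute is a modern API, used here purely as a
> sketch): a predicate that matches no rows yields a perfectly well-formed
> zero-length batch.
>
> import pyarrow as pa
> import pyarrow.compute as pc
>
> batch = pa.record_batch([pa.array([1, 2, 3])], names=['x'])
> mask = pc.greater(batch.column('x'), 100)  # no rows satisfy the predicate
> filtered = batch.filter(mask)
> assert filtered.num_rows == 0              # zero-length, but well-formed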
>
> My concern is being able to faithfully represent the in-memory results of
> operations in an RPC/IPC setting. Having to deal with a null / missing
> message is much worse for us, because in many cases we would still have to
> construct a length-0 RecordBatch in C++; the question is whether we let the
> IPC loader do it or construct a "dummy" object from the schema, as sketched
> below.
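>
> In Python terms, that "dummy" construction looks something like the
> following (a sketch in the modern pyarrow spelling, for illustration only):
>
> import pyarrow as pa
>
> schema = pa.schema([('one', pa.int64()), ('two', pa.string())])
>
> # Build a zero-length RecordBatch purely from the schema.
> empty_batch = pa.RecordBatch.from_arrays(
>     [pa.array([], type=field.type) for field in schema],
>     names=[field.name for field in schema],
> )
> assert empty_batch.num_rows == 0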
>
> On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org>
> wrote:
>
> > Hey All,
> >
> > I had a quick comment on ARROW-783 that Wes responded to and I wanted to
> > elevate the conversation here for a moment.
> >
> > My suggestion there was that we should disallow zero-length batches.
> >
> > Wes thought that should be an application-level concern. I wanted to see
> > what others thought.
> >
> > My general perspective is that zero-length batches are meaningless, and it
> > is better to disallow them than to make every application handle them
> > specially. In the JIRA, Wes noted that they have to deal with zero-length
> > DataFrames. Despite that, I don't think there needs to be a 1:1 mapping
> > between Arrow record batches and DataFrames. If someone wants to
> > communicate empty things, there is no need for them to use Arrow.
> >
> > What do others think?
> >
>
