Here is valid pyarrow code that works right now:

import pyarrow as pa

# build a simple two-column record batch
rb = pa.RecordBatch.from_arrays([
    pa.from_pylist([1, 2, 3]),
    pa.from_pylist(['foo', 'bar', 'baz'])
], names=['one', 'two'])

# one normal batch plus a zero-length slice of it
batches = [rb, rb.slice(0, 0)]

stream = pa.InMemoryOutputStream()

# write both batches to the in-memory stream
writer = pa.StreamWriter(stream, rb.schema)
for batch in batches:
    writer.write_batch(batch)
writer.close()

# read them back from the completed stream
reader = pa.StreamReader(stream.get_result())

results = [reader.get_next_batch(), reader.get_next_batch()]
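Both batches come back intact, the zero-length one included; a quick check
(Schema.equals is assumed here for the comparison):

# the zero-length batch round-trips with its schema preserved
assert len(results[0]) == 3
assert len(results[1]) == 0
assert results[1].schema.equals(rb.schema)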

With the proposal to disallow length-0 batches, where should this break?
Presumably StreamWriter.write_batch would raise ValueError, but then we
have to write:

for batch in batches:
    if len(batch) > 0:
        writer.write_batch(batch)

That seems worse, because now the user has to think about batch sizes. When
we write:

pa.Table.from_batches(results).to_pandas()

the 0-length batches get skipped over anyway:

   one  two
0    1  foo
1    2  bar
2    3  baz
If you pass only zero-length batches, you get:

Empty DataFrame
Columns: [one, two]
Index: []
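Concretely, that empty result can be reproduced from the zero-length slice
defined above (a quick sketch reusing the earlier objects):

pa.Table.from_batches([rb.slice(0, 0)]).to_pandas()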
There are plenty of reasons why things would end up zero-length, like:

* Result of a predicate evaluation that filtered out all the data (a
  sketch of this case follows the list)
* Files (e.g. Parquet files) with 0 rows
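
As a quick illustration of the first case (a sketch; RecordBatch.from_pandas
and the preserve_index option are taken from the current Python API and may
differ by version):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'one': [1, 2, 3], 'two': ['foo', 'bar', 'baz']})
empty_df = df[df['one'] > 100]  # predicate filters out every row

# converting back to Arrow yields a zero-length RecordBatch
batch = pa.RecordBatch.from_pandas(empty_df, preserve_index=False)
assert len(batch) == 0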

My concern is being able to faithfully represent the in-memory results of
operations in an RPC/IPC setting. Having to deal with a null or missing
message is much worse for us, because in many cases we will still have to
construct a length-0 RecordBatch in C++; the question is whether we let
the IPC loader do it or construct a "dummy" object ourselves based on the
schema.
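
For reference, constructing such a "dummy" by hand would look roughly like
this in Python (a sketch; pa.schema and pa.array with an explicit type are
assumed from the current API):

import pyarrow as pa

schema = pa.schema([('one', pa.int64()), ('two', pa.string())])

# build an empty array per field, then a zero-length batch from them
arrays = [pa.array([], type=field.type) for field in schema]
dummy = pa.RecordBatch.from_arrays(arrays, names=schema.names)
assert len(dummy) == 0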

On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> Hey All,
>
> I had a quick comment on ARROW-783 that Wes responded to and I wanted to
> elevate the conversation here for a moment.
>
> My suggestion there was that we should disallow zero-length batches.
>
> Wes thought that should be an application level concern. I wanted to see
> what others thought.
>
> My general perspective is that zero-length batches are meaningless and
> better to disallow than make every application have special handling for
> them. In the JIRA, Wes noted that they have to deal with zero-length
> dataframes. Despite that, I don't think there is a requirement that there
> be a 1:1 mapping between Arrow record batches and dataframes. If
> someone wants to communicate empty things, no need for them to use Arrow.
>
> What do others think?
>
