Re: Say no to zero length batches...

Jacques Nadeau Fri, 14 Apr 2017 15:58:07 -0700

To Jason's comments:

Data and control flow should be separate. Schema (a.k.a. a head-type
message) is already defined separate from a batch of records. I'm all for a
termination message as well from a stream perspective. (I don't think it
makes sense to couple record batch size to termination-- I'm agreeing with
Wes on the previously mentioned jira on this point.)


If you want to communicate no records it would look like <SCHEMA> <TERM>

If you want to communicate a bunch of data it might look like:

<SCHEMA> <BATCH> <BATCH> <TERM>

I don't fully see purpose of having an IPC representation of an empty batch
in this pattern. In terms a relationship with common zero length patterns
like collections: i'm find if an application wants to deal with zero
batches internally. I'm saying the batch should be a noop when serialized.

Per Wes's thoughts:

Since Arrow already requires a batch to be no larger than 2^16-1 records in
size, it won't map 1:1 to an arbitrary construct.

Per Wes's example:

for batch in batches:
  writer.write_batch(batch)

I'm fine with writer.write_batch being a noop if the batch has zero-length.
I don't think you need to throw an error. I'd just expect
writer.getNonZeroBatchCount() to be available if you want to turn around
and read the stream.

In general, It seems like desiring zero-length batches to be faithfully
maintained across a stream is attempting to couple them with an external
system's needs. We're communicating a stream of records, not batches. The
records happen to be chunked on both the sending and the receiving side. A
particular implementation of this communication protocol might decide to
subdivide or concatenate the batches to better communicate the stream.
Should an internal system be able to express this concept, sure. Should we
allow these to be communicated between two separate systems, I'm not so
sure.

Put another way, I think a zero length batch should be serialized to zero
bytes. :)






On Fri, Apr 14, 2017 at 3:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Speaking as a relative outsider, having the boundary cases for a transfer
> protocol be MORE restrictive than the senders and receivers is asking for
> boundary bugs.
>
> In this case, both the senders and receiver think that the boundary is 0
> (empty lists, empty data frames, 0 results from a database). Having the
> Arrow format think that the boundary is 1 just adds impedance mismatch
> where none is necessary.
>
>
>
> On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com
> >
> wrote:
>
> > I'm with Wes on this one. A bunch of systems have constructs that deal
> with
> > zero length collections, lists, iterators, etc. These are established
> > patterns that everyone knows they need to handle the empty case. Forcing
> > applications to create an unnecessary protocol complexity of a special
> > sentinel value to represent an empty set would be more burdensome.
> >
> > There is the separate considerations of cases when you want to send only
> > schema, which it makes sense to me that someone could use arrows metadata
> > as a universal representation of schema between systems. I think it makes
> > sense to have a separate concept for schema absent a batch, but users
> > shouldn't be forced to make all of their APIs return one of these, or a
> > batch of data.
> >
> > On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> > > Here is valid pyarrow code that works right now:
> > >
> > > import pyarrow as pa
> > >
> > > rb = pa.RecordBatch.from_arrays([
> > >     pa.from_pylist([1, 2, 3]),
> > >     pa.from_pylist(['foo', 'bar', 'baz'])
> > > ], names=['one', 'two'])
> > >
> > > batches = [rb, rb.slice(0, 0)]
> > >
> > > stream = pa.InMemoryOutputStream()
> > >
> > > writer = pa.StreamWriter(stream, rb.schema)
> > > for batch in batches:
> > >     writer.write_batch(batch)
> > > writer.close()
> > >
> > > reader = pa.StreamReader(stream.get_result())
> > >
> > > results = [reader.get_next_batch(), reader.get_next_batch()]
> > >
> > > With the proposal to disallow length-0 batches, where should this
> break?
> > > Probably StreamWriter.write_batch should raise ValueError, but now we
> > have
> > > to write:
> > >
> > > for batch in batches:
> > >     if len(batch) > 0:
> > >         writer.write_batch(batch)
> > >
> > > That seems worse, because now the user has to think about batch sizes.
> > When
> > > we write:
> > >
> > > pa.Table.from_batches(results).to_pandas()
> > >
> > > the 0 length batches get skipped over anyhow
> > >
> > > onetwo
> > > 0 1 foo
> > > 1 2 bar
> > > 2 3 baz
> > > If you pass only zero-length batches, you get
> > >
> > > onetwo
> > > There are plenty of reasons why things would end up zero-length, like:
> > >
> > > * Result of a predicate evaluation that filtered out all the data
> > > * Files (e.g. Parquet files) with 0 rows
> > >
> > > My concern is being able to faithfully represent the in-memory results
> of
> > > operations in an RPC/IPC setting. Having to deal with a null / no
> message
> > > is much worse for us, because we will in many cases still have to
> > construct
> > > a length-0 RecordBatch in C++; the question is whether we're letting
> the
> > > IPC loader do it or having to construct a "dummy" object based on the
> > > schema.
> > >
> > > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org>
> > > wrote:
> > >
> > > > Hey All,
> > > >
> > > > I had a quick comment on ARROW-783 that Wes responded to and I wanted
> > to
> > > > elevate the conversation here for a moment.
> > > >
> > > > My suggestion there was that we should disallow zero-length batches.
> > > >
> > > > Wes thought that should be an application level concern. I wanted to
> > see
> > > > what others thought.
> > > >
> > > > My general perspective is that zero-length batches are meaningless
> and
> > > > better to disallow than make every application have special handling
> > for
> > > > them. In the jira Wes noted that they have to deal with zero-length
> > > > dataframes. Despite that, I don't think there is a requirement that
> > there
> > > > should be a 1:1 mapping between arrow record batches and dataframes.
> If
> > > > someone wants to communicate empty things, no need for them to use
> > Arrow.
> > > >
> > > > What do others think?
> > > >
> > >
> >
>

Re: Say no to zero length batches...

Reply via email to