It seems like we could address these concerns by adding alternate write/read APIs that do the dropping (on write) / skipping (on load) automatically, so it doesn't have to bubble up into application logic.
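
For concreteness, here is a rough sketch of what such wrapper APIs could look like, reusing the 2017-era pyarrow stream classes from Wes's example further down the thread. The wrapper class names and the drop/skip behavior are hypothetical, not existing pyarrow APIs:

    import pyarrow as pa

    class DropEmptyStreamWriter:
        """Hypothetical wrapper: drops zero-length batches on write."""

        def __init__(self, sink, schema):
            self._writer = pa.StreamWriter(sink, schema)

        def write_batch(self, batch):
            if len(batch) > 0:  # the dropping happens here, not in application code
                self._writer.write_batch(batch)

        def close(self):
            self._writer.close()

    class SkipEmptyStreamReader:
        """Hypothetical wrapper: skips zero-length batches on load."""

        def __init__(self, source):
            self._reader = pa.StreamReader(source)

        def get_next_batch(self):
            # Assumes get_next_batch raises StopIteration at end of stream,
            # as when iterating over the reader.
            batch = self._reader.get_next_batch()
            while len(batch) == 0:
                batch = self._reader.get_next_batch()
            return batch

Application code could then keep writing `for batch in batches: writer.write_batch(batch)` unchanged, with the empty-batch policy confined to the wrapper.
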
On Fri, Apr 14, 2017 at 7:56 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> > Since Arrow already requires a batch to be no larger than 2^16-1
> > records in size, it won't map 1:1 to an arbitrary construct.
>
> This is only true of some Arrow applications (e.g. Drill), which is why I
> think this is an application-level concern. In ARROW-661 and ARROW-679, we
> modified the metadata to permit very large record batches (as a strict
> opt-in for Arrow implementations; practically speaking the official Arrow
> format is capped at INT32_MAX) because other C++ applications (e.g.
> https://github.com/ray-project/ray) have begun using Arrow as the data
> plane for representing very large unions in shared memory.
>
> > We're communicating a stream of records, not batches.
>
> I don't see it this way necessarily; this is one interpretation of Arrow's
> memory model within a particular application domain. We have defined a
> series of in-memory data structures and metadata to enable applications
> to interact with them with zero copy. The stream and file tools are a way
> to faithfully transport these data structures from one process's memory
> space to another (via shared memory, a socket, etc.).
>
> So if zero-length batches have no semantic meaning in an application,
> then either skip them on load or don't write them to the stream at all.
> This seems less burdensome than the alternative.
>
> On Fri, Apr 14, 2017 at 6:57 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>
>> To Jason's comments:
>>
>> Data and control flow should be separate. Schema (a.k.a. a head-type
>> message) is already defined separately from a batch of records. I'm all
>> for a termination message as well from a stream perspective. (I don't
>> think it makes sense to couple record batch size to termination; I'm
>> agreeing with Wes on the previously mentioned JIRA on this point.)
>>
>> If you want to communicate no records, it would look like:
>>
>> <SCHEMA> <TERM>
>>
>> If you want to communicate a bunch of data, it might look like:
>>
>> <SCHEMA> <BATCH> <BATCH> <TERM>
>>
>> I don't fully see the purpose of having an IPC representation of an
>> empty batch in this pattern. As for the relationship with common
>> zero-length patterns like collections: I'm fine if an application wants
>> to deal with zero-length batches internally. I'm saying the batch should
>> be a no-op when serialized.
>>
>> Per Wes's thoughts:
>>
>> Since Arrow already requires a batch to be no larger than 2^16-1 records
>> in size, it won't map 1:1 to an arbitrary construct.
>>
>> Per Wes's example:
>>
>> for batch in batches:
>>     writer.write_batch(batch)
>>
>> I'm fine with writer.write_batch being a no-op if the batch has zero
>> length. I don't think you need to throw an error. I'd just expect
>> writer.getNonZeroBatchCount() to be available if you want to turn around
>> and read the stream.
>>
>> In general, it seems like desiring zero-length batches to be faithfully
>> maintained across a stream is an attempt to couple them to an external
>> system's needs. We're communicating a stream of records, not batches.
>> The records happen to be chunked on both the sending and the receiving
>> side. A particular implementation of this communication protocol might
>> decide to subdivide or concatenate the batches to better communicate the
>> stream. Should an internal system be able to express this concept? Sure.
>> Should we allow these to be communicated between two separate systems?
>> I'm not so sure.
>>
>> Put another way, I think a zero-length batch should be serialized to
>> zero bytes. :)
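
Taken literally, "serialized to zero bytes" means an empty stream and a stream holding only zero-length batches come out byte-for-byte identical, both just <SCHEMA> <TERM>. A sketch using the stream API from this thread (Buffer.to_pybytes for the byte comparison is an assumption about the buffer API of the time):

    import pyarrow as pa

    def stream_bytes(schema, batches):
        # Serialize batches to an in-memory Arrow stream, applying the
        # proposed rule: a zero-length batch contributes zero bytes.
        sink = pa.InMemoryOutputStream()
        writer = pa.StreamWriter(sink, schema)
        for batch in batches:
            if len(batch) > 0:
                writer.write_batch(batch)
        writer.close()
        return sink.get_result().to_pybytes()

    # Under this rule these would be identical <SCHEMA> <TERM> streams:
    # stream_bytes(rb.schema, []) == stream_bytes(rb.schema, [rb.slice(0, 0)])

The writer.getNonZeroBatchCount() Jacques mentions would then just be a counter incremented inside that if block.
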
>>
>> On Fri, Apr 14, 2017 at 3:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>> > Speaking as a relative outsider, having the boundary cases for a
>> > transfer protocol be MORE restrictive than the senders and receivers
>> > is asking for boundary bugs.
>> >
>> > In this case, both the senders and the receivers think that the
>> > boundary is 0 (empty lists, empty data frames, 0 results from a
>> > database). Having the Arrow format think that the boundary is 1 just
>> > adds impedance mismatch where none is necessary.
>> >
>> > On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>> >
>> > > I'm with Wes on this one. A bunch of systems have constructs that
>> > > deal with zero-length collections, lists, iterators, etc. These are
>> > > established patterns; everyone knows they need to handle the empty
>> > > case. Forcing applications to adopt the unnecessary protocol
>> > > complexity of a special sentinel value representing an empty set
>> > > would be more burdensome.
>> > >
>> > > There is the separate consideration of cases where you want to send
>> > > only a schema; it makes sense to me that someone could use Arrow's
>> > > metadata as a universal representation of schema between systems. I
>> > > think it makes sense to have a separate concept for a schema absent
>> > > a batch, but users shouldn't be forced to make all of their APIs
>> > > return either one of these or a batch of data.
>> > >
>> > > On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> > >
>> > > > Here is valid pyarrow code that works right now:
>> > > >
>> > > > import pyarrow as pa
>> > > >
>> > > > rb = pa.RecordBatch.from_arrays([
>> > > >     pa.from_pylist([1, 2, 3]),
>> > > >     pa.from_pylist(['foo', 'bar', 'baz'])
>> > > > ], names=['one', 'two'])
>> > > >
>> > > > batches = [rb, rb.slice(0, 0)]
>> > > >
>> > > > stream = pa.InMemoryOutputStream()
>> > > >
>> > > > writer = pa.StreamWriter(stream, rb.schema)
>> > > > for batch in batches:
>> > > >     writer.write_batch(batch)
>> > > > writer.close()
>> > > >
>> > > > reader = pa.StreamReader(stream.get_result())
>> > > >
>> > > > results = [reader.get_next_batch(), reader.get_next_batch()]
>> > > >
>> > > > With the proposal to disallow length-0 batches, where should this
>> > > > break? Probably StreamWriter.write_batch should raise ValueError,
>> > > > but now we have to write:
>> > > >
>> > > > for batch in batches:
>> > > >     if len(batch) > 0:
>> > > >         writer.write_batch(batch)
>> > > >
>> > > > That seems worse, because now the user has to think about batch
>> > > > sizes. When we write:
>> > > >
>> > > > pa.Table.from_batches(results).to_pandas()
>> > > >
>> > > > the 0-length batches get skipped over anyhow:
>> > > >
>> > > >    one  two
>> > > > 0    1  foo
>> > > > 1    2  bar
>> > > > 2    3  baz
>> > > >
>> > > > If you pass only zero-length batches, you get a frame with just
>> > > > the column headers:
>> > > >
>> > > >    one  two
>> > > >
>> > > > There are plenty of reasons why things would end up zero-length,
>> > > > like:
>> > > >
>> > > > * The result of a predicate evaluation that filtered out all the
>> > > >   data
>> > > > * Files (e.g. Parquet files) with 0 rows
>> > > >
>> > > > My concern is being able to faithfully represent the in-memory
>> > > > results of operations in an RPC/IPC setting. Having to deal with
>> > > > a null / no message is much worse for us, because we will in many
>> > > > cases still have to construct a length-0 RecordBatch in C++; the
>> > > > question is whether we're letting the IPC loader do it or having
>> > > > to construct a "dummy" object based on the schema.
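
On the "dummy object based on the schema" point: if the IPC loader never hands back a length-0 batch, each consumer ends up writing a helper roughly like the one below. This sketch uses the present-day pa.array constructor with an explicit type (the from_pylist call above cannot infer a type from an empty list), so treat the details as illustrative rather than as the API of the time:

    import pyarrow as pa

    def empty_batch(schema):
        # Build a zero-row RecordBatch whose columns match the schema's types.
        arrays = [pa.array([], type=field.type) for field in schema]
        names = [field.name for field in schema]
        return pa.RecordBatch.from_arrays(arrays, names=names)

Either way the empty batch gets materialized; disallowing it in the stream only moves that construction out of the IPC loader and into every application.
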
>> > > >
>> > > > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> > > >
>> > > > > Hey All,
>> > > > >
>> > > > > I had a quick comment on ARROW-783 that Wes responded to, and I
>> > > > > wanted to elevate the conversation here for a moment.
>> > > > >
>> > > > > My suggestion there was that we should disallow zero-length
>> > > > > batches.
>> > > > >
>> > > > > Wes thought that should be an application-level concern. I
>> > > > > wanted to see what others thought.
>> > > > >
>> > > > > My general perspective is that zero-length batches are
>> > > > > meaningless, and it is better to disallow them than to make
>> > > > > every application add special handling for them. In the JIRA,
>> > > > > Wes noted that they have to deal with zero-length dataframes.
>> > > > > Despite that, I don't think there is a requirement that there
>> > > > > be a 1:1 mapping between Arrow record batches and dataframes.
>> > > > > If someone wants to communicate empty things, there is no need
>> > > > > for them to use Arrow.
>> > > > >
>> > > > > What do others think?