If I'm the sole voice on this perspective, I'll concede the point. I didn't even catch the increase in allowed record batch sizes as part of ARROW-661 and ARROW-679. :(
I'm of two minds on this:

- We need more applications, so making sure that we have the features available to support certain applications is good.
- Having all sorts of optional components that differ between our two primary implementations is likely to lead to a bad experience as a cross-compatible representation. It may make sense to require a reference implementation (RI) in both Java and C++ for all format features.

On Fri, Apr 14, 2017 at 4:56 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> > Since Arrow already requires a batch to be no larger than 2^16-1
> > records in size, it won't map 1:1 to an arbitrary construct.
>
> This is only true of some Arrow applications (e.g. Drill), which is why
> I think this is an application-level concern. In ARROW-661 and
> ARROW-679, we modified the metadata to permit very large record batches
> (as a strict opt-in for Arrow implementations; practically speaking,
> the official Arrow format is capped at INT32_MAX) because other C++
> applications (e.g. https://github.com/ray-project/ray) have begun using
> Arrow as the data plane for representing very large unions in shared
> memory.
>
> > We're communicating a stream of records, not batches.
>
> I don't see it this way necessarily; this is one interpretation of
> Arrow's memory model within a particular application domain. We have
> defined a series of in-memory data structures and metadata to enable
> applications to interact with them with zero copy. The stream and file
> tools are a way to faithfully transport these data structures from one
> process's memory space to another (via shared memory, a socket, etc.).
>
> So if the zero-length batches have no semantic meaning in an
> application, then either skip them on load or don't write them in the
> stream at all. This seems less burdensome than the alternative.
>
> On Fri, Apr 14, 2017 at 6:57 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>
> > To Jason's comments:
> >
> > Data and control flow should be separate. Schema (a.k.a. a head-type
> > message) is already defined separately from a batch of records. I'm
> > all for a termination message as well from a stream perspective. (I
> > don't think it makes sense to couple record batch size to
> > termination; I'm agreeing with Wes on the previously mentioned JIRA
> > on this point.)
> >
> > If you want to communicate no records, it would look like:
> >
> >     <SCHEMA> <TERM>
> >
> > If you want to communicate a bunch of data, it might look like:
> >
> >     <SCHEMA> <BATCH> <BATCH> <TERM>
> >
> > I don't fully see the purpose of having an IPC representation of an
> > empty batch in this pattern. In terms of a relationship with common
> > zero-length patterns like collections: I'm fine if an application
> > wants to deal with zero-length batches internally. I'm saying the
> > batch should be a no-op when serialized.
> >
> > Per Wes's thoughts:
> >
> > Since Arrow already requires a batch to be no larger than 2^16-1
> > records in size, it won't map 1:1 to an arbitrary construct.
> >
> > Per Wes's example:
> >
> >     for batch in batches:
> >         writer.write_batch(batch)
> >
> > I'm fine with writer.write_batch being a no-op if the batch has zero
> > length. I don't think you need to throw an error. I'd just expect
> > writer.getNonZeroBatchCount() to be available if you want to turn
> > around and read the stream.
> >
> > In general, it seems like desiring zero-length batches to be
> > faithfully maintained across a stream is attempting to couple them
> > with an external system's needs. We're communicating a stream of
> > records, not batches. The records happen to be chunked on both the
> > sending and the receiving side. A particular implementation of this
> > communication protocol might decide to subdivide or concatenate the
> > batches to better communicate the stream. Should an internal system
> > be able to express this concept? Sure. Should we allow these to be
> > communicated between two separate systems? I'm not so sure.
> >
> > Put another way, I think a zero-length batch should be serialized to
> > zero bytes. :)
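As a minimal sketch of the no-op serialization Jacques describes, the wrapper below drops zero-length batches at write time and exposes the batch count he mentions. This is illustrative only: it assumes the current pa.ipc.new_stream entry point (the modern successor to the pa.StreamWriter used elsewhere in this thread), and SkipEmptyStreamWriter and get_non_zero_batch_count are hypothetical names, not pyarrow APIs.

    import pyarrow as pa

    class SkipEmptyStreamWriter:
        # Hypothetical wrapper: zero-length batches serialize to zero
        # bytes, so the wire stream stays <SCHEMA> <BATCH>* <TERM>
        # with no empty batches in it.

        def __init__(self, sink, schema):
            self._writer = pa.ipc.new_stream(sink, schema)
            self._non_zero_batches = 0

        def write_batch(self, batch):
            if batch.num_rows == 0:
                return  # no-op rather than an error
            self._writer.write_batch(batch)
            self._non_zero_batches += 1

        def get_non_zero_batch_count(self):
            # lets a caller check whether the stream holds any data
            # before turning around and reading it
            return self._non_zero_batches

        def close(self):
            self._writer.close()

Under this scheme a receiver never observes an empty batch; whether that is a convenience or a loss of fidelity is exactly the disagreement in the quoted messages below.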
> > On Fri, Apr 14, 2017 at 3:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > > Speaking as a relative outsider, having the boundary cases for a
> > > transfer protocol be MORE restrictive than the senders and
> > > receivers is asking for boundary bugs.
> > >
> > > In this case, both the senders and the receivers think that the
> > > boundary is 0 (empty lists, empty data frames, 0 results from a
> > > database). Having the Arrow format think that the boundary is 1
> > > just adds impedance mismatch where none is necessary.
> > >
> > > On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
> > >
> > > > I'm with Wes on this one. A bunch of systems have constructs
> > > > that deal with zero-length collections, lists, iterators, etc.
> > > > These are established patterns, and everyone knows they need to
> > > > handle the empty case. Forcing applications to adopt the extra
> > > > protocol complexity of a special sentinel value to represent an
> > > > empty set would be more burdensome.
> > > >
> > > > There is the separate consideration of cases where you want to
> > > > send only a schema; it makes sense to me that someone could use
> > > > Arrow's metadata as a universal representation of schema between
> > > > systems. I think it makes sense to have a separate concept for a
> > > > schema absent a batch, but users shouldn't be forced to make all
> > > > of their APIs return one of these or a batch of data.
> > > >
> > > > On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > > Here is valid pyarrow code that works right now:
> > > > >
> > > > >     import pyarrow as pa
> > > > >
> > > > >     rb = pa.RecordBatch.from_arrays([
> > > > >         pa.from_pylist([1, 2, 3]),
> > > > >         pa.from_pylist(['foo', 'bar', 'baz'])
> > > > >     ], names=['one', 'two'])
> > > > >
> > > > >     batches = [rb, rb.slice(0, 0)]
> > > > >
> > > > >     stream = pa.InMemoryOutputStream()
> > > > >
> > > > >     writer = pa.StreamWriter(stream, rb.schema)
> > > > >     for batch in batches:
> > > > >         writer.write_batch(batch)
> > > > >     writer.close()
> > > > >
> > > > >     reader = pa.StreamReader(stream.get_result())
> > > > >
> > > > >     results = [reader.get_next_batch(), reader.get_next_batch()]
> > > > >
> > > > > With the proposal to disallow length-0 batches, where should
> > > > > this break? Probably StreamWriter.write_batch should raise
> > > > > ValueError, but now we have to write:
> > > > >
> > > > >     for batch in batches:
> > > > >         if len(batch) > 0:
> > > > >             writer.write_batch(batch)
> > > > >
> > > > > That seems worse, because now the user has to think about
> > > > > batch sizes. When we write:
> > > > >
> > > > >     pa.Table.from_batches(results).to_pandas()
> > > > >
> > > > > the 0-length batches get skipped over anyhow:
> > > > >
> > > > >        one  two
> > > > >     0    1  foo
> > > > >     1    2  bar
> > > > >     2    3  baz
> > > > >
> > > > > If you pass only zero-length batches, you get:
> > > > >
> > > > >        one  two
> > > > >
> > > > > There are plenty of reasons why things would end up
> > > > > zero-length, like:
> > > > >
> > > > > * Result of a predicate evaluation that filtered out all the data
> > > > > * Files (e.g. Parquet files) with 0 rows
> > > > >
> > > > > My concern is being able to faithfully represent the in-memory
> > > > > results of operations in an RPC/IPC setting. Having to deal
> > > > > with a null / no message is much worse for us, because we will
> > > > > in many cases still have to construct a length-0 RecordBatch
> > > > > in C++; the question is whether we're letting the IPC loader
> > > > > do it or having to construct a "dummy" object based on the
> > > > > schema.
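The "dummy" construction Wes refers to is mechanical, but it has to happen somewhere. A hedged sketch of manufacturing a length-0 RecordBatch from a schema in current pyarrow (the schema and field names here are illustrative, not from the thread):

    import pyarrow as pa

    # If an empty result arrives as "no message", the consumer must
    # build the length-0 RecordBatch from the schema by hand.
    schema = pa.schema([('one', pa.int64()), ('two', pa.string())])
    empty = pa.RecordBatch.from_arrays(
        [pa.array([], type=field.type) for field in schema],
        names=schema.names)
    assert empty.num_rows == 0

The question in the thread is whether the IPC loader performs this construction (zero-length batches allowed on the wire) or every consuming application does (zero-length batches forbidden).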
> > > > >
> > > > > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > > > Hey All,
> > > > > >
> > > > > > I had a quick comment on ARROW-783 that Wes responded to,
> > > > > > and I wanted to elevate the conversation here for a moment.
> > > > > >
> > > > > > My suggestion there was that we should disallow zero-length
> > > > > > batches.
> > > > > >
> > > > > > Wes thought that should be an application-level concern. I
> > > > > > wanted to see what others thought.
> > > > > >
> > > > > > My general perspective is that zero-length batches are
> > > > > > meaningless and better to disallow than to make every
> > > > > > application have special handling for them. In the JIRA, Wes
> > > > > > noted that they have to deal with zero-length dataframes.
> > > > > > Despite that, I don't think there is a requirement that
> > > > > > there be a 1:1 mapping between Arrow record batches and
> > > > > > dataframes. If someone wants to communicate empty things,
> > > > > > there's no need for them to use Arrow.
> > > > > >
> > > > > > What do others think?
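For reference, Wes's round-trip example maps onto the current pyarrow IPC API roughly as follows; this is a sketch assuming pa.ipc.new_stream, pa.ipc.open_stream, and pa.BufferOutputStream as the modern replacements for pa.StreamWriter, pa.StreamReader, and pa.InMemoryOutputStream. The assertions spell out the behavior debated above: the zero-length batch is preserved end to end, and Table.from_batches absorbs it without special handling.

    import pyarrow as pa

    rb = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['foo', 'bar', 'baz'])],
        names=['one', 'two'])
    batches = [rb, rb.slice(0, 0)]  # the second batch has zero rows

    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, rb.schema) as writer:
        for batch in batches:
            writer.write_batch(batch)

    reader = pa.ipc.open_stream(sink.getvalue())
    results = list(reader)

    # The zero-length batch survives the trip intact...
    assert [b.num_rows for b in results] == [3, 0]
    # ...and is skipped over harmlessly when building a Table.
    assert pa.Table.from_batches(results).num_rows == 3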