> Since Arrow already requires a batch to be no larger than 2^16-1 records
> in size, it won't map 1:1 to an arbitrary construct.

This is only true of some Arrow applications (e.g. Drill), which is why I
think this is an application-level concern. In ARROW-661 and ARROW-679, we
modified the metadata to permit very large record batches (as a strict
opt-in for Arrow implementations; practically speaking, the official
Arrow format is capped at INT32_MAX) because other C++ applications (e.g.
https://github.com/ray-project/ray) have begun using Arrow as the data
plane for representing very large unions in shared memory.

> We're communicating a stream of records, not batches.

I don't see it this way necessarily; this is one interpretation of Arrow's
memory model within a particular application domain. We have defined a
series of in-memory data structures and metadata to enable applications to
interact with them with zero-copy. The stream and file tools are a way to
faithfully transport these data structures from one process's memory space
to another (either via shared memory, or a socket, etc.).

So if the zero-length batches have no semantic meaning in an application,
then either skip them on load or don't write them in the stream at all.
This seems less burdensome than the alternative.
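
For concreteness, a minimal sketch of that application-level policy,
reusing the write_batch / len(batch) calls from the pyarrow example
quoted below (the helper names are mine, purely illustrative):

def write_skipping_empty(writer, batches):
    # Policy lives in the application: zero-length batches are simply
    # never serialized; the stream format itself stays agnostic.
    for batch in batches:
        if len(batch) > 0:
            writer.write_batch(batch)

def load_skipping_empty(batches):
    # Symmetric policy on the read side: drop zero-length batches
    # after they come off the stream.
    return [batch for batch in batches if len(batch) > 0]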

On Fri, Apr 14, 2017 at 6:57 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> To Jason's comments:
>
> Data and control flow should be separate. Schema (a.k.a. a head-type
> message) is already defined separately from a batch of records. I'm all for
> a termination message as well, from a stream perspective. (I don't think it
> makes sense to couple record batch size to termination; I'm agreeing with
> Wes on the previously mentioned JIRA on this point.)
>
> If you want to communicate no records, it would look like: <SCHEMA> <TERM>
>
> If you want to communicate a bunch of data, it might look like:
>
> <SCHEMA> <BATCH> <BATCH> <TERM>
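>
> As a rough sketch of that sequencing (SCHEMA/BATCH/TERM here are just
> illustrative tags, not actual Arrow message types, and the list-based
> sink is a stand-in for a real transport):
>
> def write_stream(sink, schema, batches):
>     sink.append(('SCHEMA', schema))    # head-type message comes first
>     for batch in batches:
>         sink.append(('BATCH', batch))  # zero or more record batches
>     sink.append(('TERM', None))        # explicit termination message
>
> With no records, the sink ends up holding just the <SCHEMA> <TERM> pair.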
>
> I don't fully see the purpose of having an IPC representation of an empty
> batch in this pattern. In terms of a relationship with common zero-length
> patterns like collections: I'm fine if an application wants to deal with
> zero-length batches internally. I'm saying the batch should be a noop when
> serialized.
>
> Per Wes's thoughts:
>
> Since Arrow already requires a batch to be no larger than 2^16-1 records in
> size, it won't map 1:1 to an arbitrary construct.
>
> Per Wes's example:
>
> for batch in batches:
>   writer.write_batch(batch)
>
> I'm fine with writer.write_batch being a noop if the batch has zero length.
> I don't think you need to throw an error. I'd just expect
> writer.getNonZeroBatchCount() to be available if you want to turn around
> and read the stream.
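>
> To make that concrete, a hedged sketch of such a wrapper (the class name
> is illustrative, not a proposed API; it just pairs the noop behavior
> with the count):
>
> class NonZeroBatchWriter:
>     def __init__(self, writer):
>         self._writer = writer
>         self._count = 0
>
>     def write_batch(self, batch):
>         # Zero-length batches serialize to nothing; count the rest.
>         if len(batch) > 0:
>             self._writer.write_batch(batch)
>             self._count += 1
>
>     def getNonZeroBatchCount(self):
>         return self._count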
>
> In general, it seems like desiring zero-length batches to be faithfully
> maintained across a stream is attempting to couple them to an external
> system's needs. We're communicating a stream of records, not batches. The
> records happen to be chunked on both the sending and the receiving side. A
> particular implementation of this communication protocol might decide to
> subdivide or concatenate the batches to better communicate the stream.
> Should an internal system be able to express this concept? Sure. Should we
> allow these to be communicated between two separate systems? I'm not so
> sure.
>
> Put another way, I think a zero length batch should be serialized to zero
> bytes. :)
>
> On Fri, Apr 14, 2017 at 3:26 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Speaking as a relative outsider, having the boundary cases for a transfer
> > protocol be MORE restrictive than the senders and receivers is asking for
> > boundary bugs.
> >
> > In this case, both the senders and receivers think that the boundary is 0
> > (empty lists, empty data frames, 0 results from a database). Having the
> > Arrow format think that the boundary is 1 just adds impedance mismatch
> > where none is necessary.
> >
> >
> > On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
> >
> > > I'm with Wes on this one. A bunch of systems have constructs that deal
> > > with zero-length collections, lists, iterators, etc. These are
> > > established patterns, and everyone knows they need to handle the empty
> > > case. Forcing applications to take on the unnecessary protocol
> > > complexity of a special sentinel value to represent an empty set would
> > > be more burdensome.
> > >
> > > There is the separate consideration of cases where you want to send
> > > only schema; it makes sense to me that someone could use Arrow's
> > > metadata as a universal representation of schema between systems. I
> > > think it makes sense to have a separate concept for schema absent a
> > > batch, but users shouldn't be forced to make all of their APIs return
> > > one of these, or a batch of data.
> > >
> > > On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > Here is valid pyarrow code that works right now:
> > > >
> > > > import pyarrow as pa
> > > >
> > > > rb = pa.RecordBatch.from_arrays([
> > > >     pa.from_pylist([1, 2, 3]),
> > > >     pa.from_pylist(['foo', 'bar', 'baz'])
> > > > ], names=['one', 'two'])
> > > >
> > > > batches = [rb, rb.slice(0, 0)]
> > > >
> > > > stream = pa.InMemoryOutputStream()
> > > >
> > > > writer = pa.StreamWriter(stream, rb.schema)
> > > > for batch in batches:
> > > >     writer.write_batch(batch)
> > > > writer.close()
> > > >
> > > > reader = pa.StreamReader(stream.get_result())
> > > >
> > > > results = [reader.get_next_batch(), reader.get_next_batch()]
> > > >
> > > > With the proposal to disallow length-0 batches, where should this
> > > > break? Probably StreamWriter.write_batch should raise ValueError,
> > > > but now we have to write:
> > > >
> > > > for batch in batches:
> > > >     if len(batch) > 0:
> > > >         writer.write_batch(batch)
> > > >
> > > > That seems worse, because now the user has to think about batch
> > > > sizes. When we write:
> > > >
> > > > pa.Table.from_batches(results).to_pandas()
> > > >
> > > > the zero-length batches get skipped over anyhow:
> > > >
> > > >    one  two
> > > > 0    1  foo
> > > > 1    2  bar
> > > > 2    3  baz
> > > >
> > > > If you pass only zero-length batches, you get an empty frame:
> > > >
> > > >    one  two
> > > >
> > > > There are plenty of reasons why things would end up zero-length,
> > > > like (the first case is sketched just after this list):
> > > >
> > > > * Result of a predicate evaluation that filtered out all the data
> > > > * Files (e.g. Parquet files) with 0 rows
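> > > >
> > > > For instance (a hedged sketch: first_matching_rows and predicate
> > > > are hypothetical helpers, while slice and num_rows follow the
> > > > pyarrow usage above):
> > > >
> > > > def first_matching_rows(batch, predicate):
> > > >     # Keep the leading run of rows satisfying the predicate. If the
> > > >     # first row already fails, this returns batch.slice(0, 0): the
> > > >     # same schema, zero rows.
> > > >     n = 0
> > > >     while n < batch.num_rows and predicate(batch, n):
> > > >         n += 1
> > > >     return batch.slice(0, n)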
> > > >
> > > > My concern is being able to faithfully represent the in-memory
> > > > results of operations in an RPC/IPC setting. Having to deal with a
> > > > null / no message is much worse for us, because we will in many cases
> > > > still have to construct a length-0 RecordBatch in C++; the question
> > > > is whether we're letting the IPC loader do it or having to construct
> > > > a "dummy" object based on the schema.
> > > >
> > > > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> > > >
> > > > > Hey All,
> > > > >
> > > > > I had a quick comment on ARROW-783 that Wes responded to, and I
> > > > > wanted to elevate the conversation here for a moment.
> > > > >
> > > > > My suggestion there was that we should disallow zero-length
> > > > > batches.
> > > > >
> > > > > Wes thought that should be an application-level concern. I wanted
> > > > > to see what others thought.
> > > > >
> > > > > My general perspective is that zero-length batches are meaningless
> > > > > and better to disallow than to make every application have special
> > > > > handling for them. In the JIRA, Wes noted that they have to deal with
> > > > > zero-length dataframes. Despite that, I don't think there is a
> > > > > requirement that there should be a 1:1 mapping between Arrow record
> > > > > batches and dataframes. If someone wants to communicate empty things,
> > > > > there's no need for them to use Arrow.
> > > > >
> > > > > What do others think?
> > > > >
> > > >
> > >
> >
>
