Speaking as a relative outsider: making the boundary cases for a transfer
protocol MORE restrictive than those of the senders and receivers is asking
for boundary bugs.

In this case, both the senders and receivers think that the boundary is 0
(empty lists, empty data frames, 0 results from a database). Having the
Arrow format think that the boundary is 1 just adds impedance mismatch
where none is necessary.
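
For a concrete illustration (a rough sketch against the current pyarrow
API, whose names have shifted since this thread):

import pyarrow as pa

# An ordinary operation that happens to match nothing yields a
# perfectly normal zero-row result; the natural boundary is 0.
table = pa.table({'one': [1, 2, 3], 'two': ['foo', 'bar', 'baz']})
empty = table.filter(pa.array([False, False, False]))
assert empty.num_rows == 0
assert empty.schema == table.schema  # the empty case keeps its schema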



On Fri, Apr 14, 2017 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> I'm with Wes on this one. Plenty of systems have constructs that deal with
> zero-length collections, lists, iterators, etc. These are established
> patterns, and everyone knows they need to handle the empty case. Forcing
> applications to take on the unnecessary protocol complexity of a special
> sentinel value representing an empty set would be more burdensome.
>
> There is a separate consideration for cases where you want to send only a
> schema; it makes sense to me that someone could use Arrow's metadata as a
> universal representation of a schema between systems. I think it makes
> sense to have a separate concept for a schema absent a batch, but users
> shouldn't be forced to make all of their APIs return either one of these
> or a batch of data.
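>
> As a rough sketch of that schema-only case (using today's pyarrow API,
> which differs from the one in this thread):
>
> import pyarrow as pa
>
> # Serialize just a schema, with no record batches at all, and
> # reconstruct it on the receiving side.
> schema = pa.schema([('one', pa.int64()), ('two', pa.string())])
> buf = schema.serialize()               # IPC message carrying only metadata
> assert pa.ipc.read_schema(buf).equals(schema)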
>
> On Fri, Apr 14, 2017 at 3:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Here is valid pyarrow code that works right now:
> >
> > import pyarrow as pa
> >
> > rb = pa.RecordBatch.from_arrays([
> >     pa.from_pylist([1, 2, 3]),
> >     pa.from_pylist(['foo', 'bar', 'baz'])
> > ], names=['one', 'two'])
> >
> > batches = [rb, rb.slice(0, 0)]  # a normal batch plus a zero-length slice
> >
> > stream = pa.InMemoryOutputStream()
> >
> > writer = pa.StreamWriter(stream, rb.schema)
> > for batch in batches:
> >     writer.write_batch(batch)
> > writer.close()
> >
> > reader = pa.StreamReader(stream.get_result())
> >
> > results = [reader.get_next_batch(), reader.get_next_batch()]  # both come back
> >
> > With the proposal to disallow length-0 batches, where should this break?
> > Probably StreamWriter.write_batch should raise ValueError, but now we
> > have to write:
> >
> > for batch in batches:
> >     if len(batch) > 0:
> >         writer.write_batch(batch)
> >
> > That seems worse, because now the user has to think about batch sizes.
> > When we write:
> >
> > pa.Table.from_batches(results).to_pandas()
> >
> > the 0 length batches get skipped over anyhow:
> >
> >    one  two
> > 0    1  foo
> > 1    2  bar
> > 2    3  baz
> >
> > If you pass only zero-length batches, you get
> >
> >    one  two
> >
> > There are plenty of reasons why things would end up zero-length, like:
> >
> > * Result of a predicate evaluation that filtered out all the data
> > * Files (e.g. Parquet files) with 0 rows
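> >
> > For instance (a sketch using the current pyarrow stream API, which has
> > been renamed since the code above):
> >
> > import pyarrow as pa
> >
> > batch = pa.record_batch([pa.array([1, 2, 3]),
> >                          pa.array(['foo', 'bar', 'baz'])],
> >                         names=['one', 'two'])
> > sink = pa.BufferOutputStream()
> > with pa.ipc.new_stream(sink, batch.schema) as writer:
> >     writer.write_batch(batch)
> >     writer.write_batch(batch.slice(0, 0))  # e.g. a filter matched nothing
> > batches = list(pa.ipc.open_stream(sink.getvalue()))
> > assert [len(b) for b in batches] == [3, 0]  # round-trips unchanged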
> >
> > My concern is being able to faithfully represent the in-memory results of
> > operations in an RPC/IPC setting. Having to deal with a null / no message
> > is much worse for us, because we will in many cases still have to
> > construct a length-0 RecordBatch in C++; the question is whether we're
> > letting the IPC loader do it or having to construct a "dummy" object
> > based on the schema.
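> >
> > Constructing that dummy is straightforward but pointless extra work for
> > every receiver; roughly (Schema.empty_table here is from the current
> > pyarrow API):
> >
> > import pyarrow as pa
> >
> > # What a receiver must synthesize itself if the protocol refuses to
> > # carry the zero-length batch:
> > schema = pa.schema([('one', pa.int64()), ('two', pa.string())])
> > dummy = schema.empty_table()   # 0 rows, correct columns and types
> > assert dummy.num_rows == 0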
> >
> > On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <jacq...@apache.org>
> > wrote:
> >
> > > Hey All,
> > >
> > > I had a quick comment on ARROW-783 that Wes responded to, and I wanted
> > > to elevate the conversation here for a moment.
> > >
> > > My suggestion there was that we should disallow zero-length batches.
> > >
> > > Wes thought that should be an application-level concern. I wanted to
> > > see what others thought.
> > >
> > > My general perspective is that zero-length batches are meaningless and
> > > better to disallow than to make every application have special handling
> > > for them. In the JIRA, Wes noted that they have to deal with zero-length
> > > dataframes. Despite that, I don't think there is a requirement that
> > > there should be a 1:1 mapping between Arrow record batches and
> > > dataframes. If someone wants to communicate empty things, there is no
> > > need for them to use Arrow.
> > >
> > > What do others think?
> > >
> >
>
