Re: Say no to zero length batches...

2017-04-17 Thread Jason Altekruse
I agree with Jacques on the expansion of the allowed batch sizes being a significant change to the format. Optional features do have a risk of fragmenting the community, but need to be balanced against the benefits provided to a particular user of Arrow. I think that the requirement that individual ...

Re: Say no to zero length batches...

2017-04-14 Thread Jacques Nadeau
If I'm the sole voice on this perspective, I'll concede the point. I didn't even catch the increase in allowed record batch sizes as part of ARROW-661 and ARROW-679. :( I'm of split mind on the thoughts there: - We need more applications, so making sure that we have the features available to support ...

Re: Say no to zero length batches...

2017-04-14 Thread Wes McKinney
It seems like we could address these concerns by adding alternate write/read APIs that do the dropping (on write) / skipping (on load) automatically, so it doesn't have to bubble up into application logic. On Fri, Apr 14, 2017 at 7:56 PM, Wes McKinney wrote: ...
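
A rough sketch of what such convenience wrappers could look like in pyarrow (the names write_nonempty and read_nonempty are hypothetical illustrations, not an existing Arrow API):

    import pyarrow as pa

    def write_nonempty(writer, batch):
        # Hypothetical write-side helper: drop zero-length batches.
        if batch.num_rows > 0:
            writer.write_batch(batch)

    def read_nonempty(reader):
        # Hypothetical read-side helper: skip zero-length batches on load.
        for batch in reader:
            if batch.num_rows > 0:
                yield batch

Applications that cannot tolerate empty batches could route their I/O through helpers like these, while the stream format itself stays permissive.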

Re: Say no to zero length batches...

2017-04-14 Thread Wes McKinney
> Since Arrow already requires a batch to be no larger than 2^16-1 records in size, it won't map 1:1 to an arbitrary construct. This is only true of some Arrow applications (e.g. Drill), which is why I think this is an application-level concern. In ARROW-661 and ARROW-679, we modified the metadata ...
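
To make the application-level point concrete, a batch well past 2^16-1 rows is legal at the format level; the 65,535-record cap is a Drill-side limit, not an Arrow one. A minimal sketch with the current pyarrow API:

    import pyarrow as pa

    # 100,000 rows comfortably exceeds 2^16 - 1 = 65,535.
    big = pa.RecordBatch.from_arrays([pa.array(range(100_000))], names=['x'])
    print(big.num_rows)   # 100000 -- accepted without complaint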

Re: Say no to zero length batches...

2017-04-14 Thread Jacques Nadeau
To Jason's comments: Data and control flow should be separate. Schema (a.k.a. a head-type message) is already defined separately from a batch of records. I'm all for a termination message as well from a stream perspective. (I don't think it makes sense to couple record batch size to termination -- ...)
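
For orientation, the Arrow IPC stream format already keeps these pieces apart: one schema message up front, zero or more record batch messages, and an explicit end-of-stream marker emitted when the writer closes. A minimal sketch with the current pyarrow API (the comments describe the stream layout, not new behavior):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=['x'])

    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema)  # schema message
    writer.write_batch(batch)                       # record batch message
    writer.close()                                  # end-of-stream marker

    # The reader stops at the marker, regardless of how many batches were sent.
    reader = pa.ipc.open_stream(sink.getvalue())
    print(reader.read_all().num_rows)   # 3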

Re: Say no to zero length batches...

2017-04-14 Thread Ted Dunning
Speaking as a relative outsider, having the boundary cases for a transfer protocol be MORE restrictive than the senders and receivers is asking for boundary bugs. In this case, both the senders and receivers think that the boundary is 0 (empty lists, empty data frames, 0 results from a database). ...
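
As a concrete illustration of those boundary cases, an empty query result already converts to a zero-row Arrow structure with no special handling. A small sketch, assuming pandas is available:

    import pandas as pd
    import pyarrow as pa

    # A query that matched nothing: zero rows, but a perfectly valid schema.
    empty_df = pd.DataFrame({'id': pd.Series([], dtype='int64'),
                             'name': pd.Series([], dtype='object')})
    table = pa.Table.from_pandas(empty_df)
    print(table.num_rows)   # 0
    print(table.schema)     # id: int64, name: string (plus pandas metadata)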

Re: Say no to zero length batches...

2017-04-14 Thread Jason Altekruse
I'm with Wes on this one. A bunch of systems have constructs that deal with zero-length collections, lists, iterators, etc. These are established patterns, and everyone knows they need to handle the empty case. Forcing applications to create unnecessary protocol complexity in the form of a special sentinel ...

Re: Say no to zero length batches...

2017-04-14 Thread Wes McKinney
Here is valid pyarrow code that works right now:

    import pyarrow as pa

    rb = pa.RecordBatch.from_arrays([
        pa.from_pylist([1, 2, 3]),
        pa.from_pylist(['foo', 'bar', 'baz'])
    ], names=['one', 'two'])

    batches = [rb, rb.slice(0, 0)]

    stream = pa.InMemoryOutputStream()
    writer = pa.StreamWriter(s...
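
For anyone trying this against a recent pyarrow, a sketch of the same round trip with the modern API (pa.array, pa.BufferOutputStream, and pa.ipc.new_stream stand in for the 0.x calls above; the zero-length slice survives the round trip):

    import pyarrow as pa

    # Build a small record batch and a zero-length slice of it.
    rb = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['foo', 'bar', 'baz'])],
        names=['one', 'two'])
    batches = [rb, rb.slice(0, 0)]

    # Write both batches to an in-memory IPC stream.
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, rb.schema)
    for batch in batches:
        writer.write_batch(batch)
    writer.close()

    # Read them back; the zero-length batch round-trips intact.
    reader = pa.ipc.open_stream(sink.getvalue())
    for batch in reader:
        print(batch.num_rows)   # 3, then 0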

Say no to zero length batches...

2017-04-14 Thread Jacques Nadeau
Hey All, I had a quick comment on ARROW-783 that Wes responded to, and I wanted to elevate the conversation here for a moment. My suggestion there was that we should disallow zero-length batches. Wes thought that should be an application-level concern. I wanted to see what others thought. My general ...