I agree with Jacques that the expansion of the allowed batch sizes is a
significant change to the format.
Optional features do have a risk of fragmenting the community, but that
risk needs to be balanced against the benefits they provide to a particular
user of Arrow.
I think that the requirement that individual…
If I'm the sole voice on this perspective, I'll concede the point.
I didn't even catch the increase in allowed record batch sizes as part of
ARROW-661 and ARROW-679. :(
I'm of two minds here:
- We need more applications, so making sure that we have the features
available to support…
It seems like we could address these concerns by adding alternate
write/read APIs that do the dropping (on write) / skipping (on load)
automatically, so it doesn't have to bubble up into application logic.
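Something along those lines could work (a rough sketch; these wrapper names
are made up for illustration, not existing pyarrow APIs):

def write_batches_dropping_empty(writer, batches):
    # Hypothetical write-side helper: drop zero-length batches so they
    # never hit the wire.
    for batch in batches:
        if batch.num_rows > 0:
            writer.write_batch(batch)

def read_batches_skipping_empty(batches):
    # Hypothetical read-side helper: filter out zero-length batches so
    # application logic never sees them.
    for batch in batches:
        if batch.num_rows > 0:
            yield batch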
On Fri, Apr 14, 2017 at 7:56 PM, Wes McKinney wrote:
> Since Arrow already requires a batch to be no larger than 2^16-1 records
> in size, it won't map 1:1 to an arbitrary construct.
This is only true of some Arrow applications (e.g. Drill), which is why I
think this is an application-level concern. In ARROW-661 and ARROW-679, we
modified the metadata…
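An application that does impose a cap (like Drill's 2^16 - 1) can enforce it
at its own edges; a rough sketch (chunk_batch is a made-up helper, not a
pyarrow API):

def chunk_batch(batch, max_rows=2**16 - 1):
    # Slice an arbitrarily large record batch into pieces that respect an
    # application-imposed row limit; slices are zero-copy views.
    offset = 0
    while offset < batch.num_rows:
        length = min(max_rows, batch.num_rows - offset)
        yield batch.slice(offset, length)
        offset += length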
To Jason's comments:
Data and control flow should be separate. Schema (a.k.a. a header-type
message) is already defined separately from a batch of records. I'm all for
a termination message as well from a stream perspective. (I don't think it
makes sense to couple record batch size to termination; I…
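The stream format already reflects that separation; a rough sketch (using
today's pyarrow names, which differ from the 0.x API elsewhere in this
thread):

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=['x'])
sink = pa.BufferOutputStream()

writer = pa.ipc.new_stream(sink, batch.schema)  # control: schema message first
writer.write_batch(batch)                       # data: record batches, any length
writer.close()                                  # control: end-of-stream marker

Since termination is its own marker, batch size (zero or otherwise) never
has to carry control meaning.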
Speaking as a relative outsider, having the boundary cases for a transfer
protocol be MORE restrictive than the senders and receivers is asking for
boundary bugs.
In this case, both the senders and receivers think that the boundary is 0
(empty lists, empty data frames, 0 results from a database). H…
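As a concrete example, an empty frame already maps cleanly to a zero-row
table with no special casing on either side (a sketch, assuming pandas and
today's pyarrow names):

import pandas as pd
import pyarrow as pa

# A sender with zero results naturally produces a zero-row table; the
# receiver handles it on the same code path as any other size.
empty = pd.DataFrame({'one': pd.Series([], dtype='int64')})
table = pa.Table.from_pandas(empty)
assert table.num_rows == 0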
I'm with Wes on this one. A bunch of systems have constructs that deal with
zero-length collections, lists, iterators, etc. These are established
patterns, and everyone knows they need to handle the empty case. Forcing
applications to take on the unnecessary protocol complexity of a special
sentinel…
Here is valid pyarrow code that works right now:
import pyarrow as pa

rb = pa.RecordBatch.from_arrays([
    pa.from_pylist([1, 2, 3]),
    pa.from_pylist(['foo', 'bar', 'baz'])
], names=['one', 'two'])

batches = [rb, rb.slice(0, 0)]  # the second batch has zero length

stream = pa.InMemoryOutputStream()
# assuming the 0.x signature StreamWriter(sink, schema)
writer = pa.StreamWriter(stream, rb.schema)
for batch in batches:
    writer.write_batch(batch)
writer.close()
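Reading the stream back works too; a rough sketch of the round trip with
today's pyarrow names (pa.ipc and BufferOutputStream instead of the 0.x API
above):

import pyarrow as pa

rb = pa.record_batch([pa.array([1, 2, 3]),
                      pa.array(['foo', 'bar', 'baz'])],
                     names=['one', 'two'])

sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, rb.schema)
writer.write_batch(rb)
writer.write_batch(rb.slice(0, 0))  # the zero-length batch
writer.close()

reader = pa.ipc.open_stream(sink.getvalue())
print([b.num_rows for b in reader])  # [3, 0]: the empty batch round-trips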
Hey All,
I had a quick comment on ARROW-783 that Wes responded to and I wanted to
elevate the conversation here for a moment.
My suggestion there was that we should disallow zero-length batches.
Wes thought that should be an application-level concern. I wanted to see
what others thought.
My general…