Ah got it, thanks Julien.
I was thinking that each RecordBatch could have a different schema, which in retrospect doesn't seem very logical. In essence, I guess I was thinking each record batch was a partition of the schema's fields rather than a partition of the entire dataset.

Thanks for clearing that up
Brian

On 09/08/2016 05:09 PM, Julien Le Dem wrote:
Hi Brian,
It's not one record batch per field. Each field describes a column in the
schema.
Record batches are partitions of the dataset. As such, all record batches
have the same schema, which is defined in the footer.
There can be any number of record batches for a given schema.
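
To make that concrete, here is a minimal sketch in plain Java - the class names mirror the format's concepts but are hypothetical, not the actual Arrow Java API - showing one Schema describing the columns once while any number of RecordBatches, each a horizontal slice of the rows, share it:

import java.util.List;

// Minimal sketch (hypothetical classes, not the actual Arrow Java API):
// one Schema is written once in the footer, and every RecordBatch in the
// file is a horizontal slice of the rows conforming to that same Schema.
public class ArrowFileSketch {
    record Field(String name) {}
    record Schema(List<Field> fields) {}    // describes the columns, once
    record RecordBatch(int rowCount) {}     // a partition of the dataset

    public static void main(String[] args) {
        Schema schema = new Schema(List.of(new Field("id"), new Field("name")));
        // Any number of record batches, all described by the single schema above.
        List<RecordBatch> batches =
                List.of(new RecordBatch(1024), new RecordBatch(1024), new RecordBatch(512));
        int totalRows = batches.stream().mapToInt(RecordBatch::rowCount).sum();
        System.out.println(schema.fields().size() + " columns, "
                + batches.size() + " batches, " + totalRows + " rows total");
    }
}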

Then in each record batch:
  - there are as many FieldNodes as there are Fields total in the schema
tree.
  - For each field the buffer count is defined by the layout attribute in
Field.
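
For example, here is a minimal counting sketch (plain Java again, with hypothetical classes and illustrative buffer counts only - e.g. assuming a validity bitmap plus a data buffer for a nullable int32 column, and an extra offsets buffer for utf8):

import java.util.List;
import java.util.Map;

// Sketch of how record batch metadata lines up with the schema tree:
// one FieldNode per Field (including nested children), and a per-field
// buffer count determined by that field's layout.
public class RecordBatchMetadataSketch {
    record Field(String name, String type, List<Field> children) {}

    // Illustrative layouts only: e.g. nullable int32 = validity bitmap + data,
    // utf8 = validity bitmap + offsets + data.
    static final Map<String, Integer> BUFFERS_PER_LAYOUT = Map.of("int32", 2, "utf8", 3);

    // Depth-first walk: every field in the schema tree, nested or not,
    // contributes exactly one FieldNode to each record batch.
    static int countFieldNodes(List<Field> fields) {
        int n = 0;
        for (Field f : fields) n += 1 + countFieldNodes(f.children());
        return n;
    }

    static int countBuffers(List<Field> fields) {
        int n = 0;
        for (Field f : fields)
            n += BUFFERS_PER_LAYOUT.getOrDefault(f.type(), 0) + countBuffers(f.children());
        return n;
    }

    public static void main(String[] args) {
        List<Field> schema = List.of(
                new Field("id", "int32", List.of()),
                new Field("name", "utf8", List.of()));
        System.out.println("FieldNodes per record batch: " + countFieldNodes(schema)); // 2
        System.out.println("Buffers per record batch:    " + countBuffers(schema));    // 5
    }
}

The same rule extends to nested types: a List<Int32> field, for instance, would contribute two FieldNodes per batch, one for the list itself and one for its child item.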

IHTH, Julien



On Thu, Sep 8, 2016 at 9:15 AM, Brian Hulette <bhule...@ccri.com> wrote:

Hi all,

I'm very interested in the Arrow file format - I would eventually like
to use it to export data in a columnar format that can be read directly
in a browser through a JavaScript library. I've been reviewing the
specification and Julien's Java implementation, and I'm a little bit
confused about the relationship between the Schema in the footer and the
record batch(es).

If a schema refers to multiple record batches, is it assumed that
the first fields in the schema refer to the first record batch, until
all of its Buffers and FieldNodes are accounted for, then the next set
of fields refer to the next record batch, and so on?

If so, it doesn't seem like the current implementation supports this
behavior. That's fine - I just want to make sure I understand.

Thanks,

Brian Hulette



