Hi Randy,

The first four bytes are the little-endian int32 length of the flatbuffers Message metadata <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/format/Message.fbs#L93>. The Message metadata itself starts immediately after that length prefix, at byte offset 4 (index 5 in Julia's 1-based indexing). The next four bytes in your dump (0x14 0x00 0x00 0x00) are not padding: they are the flatbuffer's own root offset, which the flatbuffers API uses to locate the Message table.

So to read an entire Message, read and store the first four bytes (the metadata length). Then advance past those four bytes, and use the flatbuffers API to read the Message table starting at that position.

The Message table has a bodyLength field, which is the byte length of all the buffers (data, validity, offsets, and typeIds) for all the Arrays in the Message. Since Schema messages don't contain any data, their bodyLength is always 0.
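
For illustration, here is a minimal Python sketch of the lookup the flatbuffers API performs when it reads the header type and bodyLength out of the Message table. It assumes `metadata` holds the flatbuffer bytes that follow the 4-byte length prefix, and that the vtable slots are 6 for the header type and 10 for bodyLength, which follows from the field order in Message.fbs. The function name `read_message_fields` and the hand-built `sample` buffer are hypothetical; in practice the generated flatbuffers code does this for you.

```python
import struct

# Values of the MessageHeader union in Message.fbs.
HEADER_TYPES = {0: "NONE", 1: "Schema", 2: "DictionaryBatch",
                3: "RecordBatch", 4: "Tensor"}

def read_message_fields(metadata):
    """Return (header type name, bodyLength) from Message flatbuffer bytes."""
    # A flatbuffer starts with a uint32 offset to its root table.
    (table,) = struct.unpack_from("<I", metadata, 0)
    # The table starts with a signed offset pointing back to its vtable.
    (to_vtable,) = struct.unpack_from("<i", metadata, table)
    vtable = table - to_vtable
    (vtable_size,) = struct.unpack_from("<H", metadata, vtable)

    def field_pos(slot):
        # A slot beyond the vtable, or holding 0, means "field absent".
        if slot >= vtable_size:
            return None
        (rel,) = struct.unpack_from("<H", metadata, vtable + slot)
        return table + rel if rel else None

    pos = field_pos(6)    # assumed slot of the header type (ubyte)
    header_type = metadata[pos] if pos is not None else 0
    pos = field_pos(10)   # assumed slot of bodyLength (int64)
    body_length = struct.unpack_from("<q", metadata, pos)[0] if pos is not None else 0
    return HEADER_TYPES.get(header_type, "unknown"), body_length

# Hand-built 32-byte example: root offset, a 12-byte vtable, then the table
# (vtable back-offset, version, header type, one pad byte, bodyLength).
sample = struct.pack("<I6HihBxq", 16, 12, 16, 4, 6, 0, 8, 12, 3, 3, 640)
print(read_message_fields(sample))
```

Absent fields fall back to their defaults, which is why a Schema message can omit bodyLength entirely and still read as 0.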

Once you've read the Message table via flatbuffers, advance `metadata length` number of bytes to position yourself to read the Array buffers.

After reading the buffers, advance another `bodyLength` number of bytes to read the next message. Repeat this process to read all Messages from an Arrow stream.
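
The whole loop can be sketched in Python as follows. This is a rough outline under a few assumptions, not the reference implementation: the flatbuffer is taken to begin immediately after the 4-byte length prefix (as the dump in the quoted message suggests), a zero length or running out of bytes is treated as the end of the stream, and `read_body_length` is a hypothetical callable you supply (for example a wrapper around your flatbuffers-generated Message reader) that returns bodyLength for a given metadata slice.

```python
import struct

def iter_messages(buf, read_body_length):
    """Yield (metadata_bytes, body_bytes) for each Message found in buf."""
    pos = 0
    while pos + 4 <= len(buf):
        # 1. Read the int32 metadata length.
        (metadata_length,) = struct.unpack_from("<i", buf, pos)
        if metadata_length <= 0:
            break  # assumed end-of-stream marker
        # 2. The Message flatbuffer follows the length prefix.
        metadata_start = pos + 4
        body_start = metadata_start + metadata_length
        metadata = buf[metadata_start:body_start]
        # 3. bodyLength comes from the Message table inside the metadata.
        body_length = read_body_length(metadata)
        # 4. The Array buffers occupy the next body_length bytes.
        yield metadata, buf[body_start:body_start + body_length]
        # 5. The next Message starts right after the body.
        pos = body_start + body_length
```

Keeping the flatbuffers decoding behind a callable separates the byte framing from the metadata parsing, so the same loop works whichever flatbuffers binding you use.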

If you're familiar with JavaScript/TypeScript, you can reference the implementation here <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/js/src/ipc/reader/binary.ts#L145>.

Hope this clears things up,

Paul


On 07/12/2018 11:30 AM, Randy Zwitch wrote:
I’m trying to understand how to parse a Buffer into a Schema, but
using pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
really cleared much up for me. Nor has studying
https://arrow.apache.org/docs/ipc.html


Here are the steps of what I’ve tried (the code is Julia, but only
because I’m trying to do this natively, rather than wrap the Arrow C code):


# Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size) (works as expected)
julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit 1000", 0, 0, 1000)

MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7, 0x7e, 0x50], 188880)

# Wrap shared memory into julia array, based on handle and size (works as expected)
julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) # wrapper using shmget/shmat
93856-element Array{UInt8,1}:
  0x2c
  0x16
  0x00
  0x00
  0x14
  0x00
  0x00
  0x00
  0x00
  0x00
     ⋮
  0x20
  0x74
  0x6f
  0x20
  0x4d
  0x66
  0x72
  0x00
  0x00

At this point, walking through a similar Python process, I know that
sm_buf represents
- type: Schema
- metadata length: 5676
- body_length: 0

Where I’m confused is how to proceed.

I am getting metadata_length by reinterpreting the first 4 bytes as Int32.

julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
5676

I then assumed that I could start at byte 5 and take the next `mlen`
bytes:

julia> metadata = sm_buf[5:5+mlen-1]
5676-element Array{UInt8,1}:
  0x14
  0x00
  0x00
  0x00
  0x00
  0x00
  0x00
  0x00
  0x0c
  0x00
     ⋮
  0x79
  0x65
  0x61
  0x72
  0x00
  0x00
  0x00
  0x00
  0x00


Am I on the right track here? I *think* that my `metadata` variable above
is a FlatBuffer, but how do I know what its structure is? Additionally,
what am I supposed to do with all of the bytes that haven’t been read from
`sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
+ metadata length, leaving some 88,000 bytes not processed yet.

Any help would be greatly appreciated here. Please note that I’m not asking
for julia coding help, but rather what the Arrow bytes actually mean/their
structure and how to process them further.

Thanks,
Randy Zwitch

