Hi Randy,
The first four bytes are the little-endian int32 length of the flatbuffers
Message metadata
<https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/format/Message.fbs#L93>.
The Message flatbuffer begins immediately after that length prefix, at byte
offset 4 (the 5th byte); there is no extra padding in front of it. The
length includes any padding added after the metadata so that the body
starts on an 8-byte boundary.
So to read an entire Message, read and store the first four bytes (the
metadata length). Then advance past those 4 bytes, and use the
flatbuffers API to read the Message table.
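A minimal Python sketch of that first step, assuming the encapsulated layout from the Arrow IPC docs (an int32 length prefix followed directly by the metadata flatbuffer). The helper name `split_message_prefix` is made up for illustration; the sample bytes are the first eight bytes of `sm_buf` from the quoted message below.

```python
import struct

PREFIX_SIZE = 4  # size in bytes of the int32 metadata-length prefix


def split_message_prefix(buf, pos=0):
    """Return (metadata_length, metadata_start) for the message at `pos`."""
    # "<i" = little-endian signed 32-bit integer
    (metadata_length,) = struct.unpack_from("<i", buf, pos)
    return metadata_length, pos + PREFIX_SIZE


# First eight bytes of sm_buf as printed in the quoted message.
head = bytes([0x2C, 0x16, 0x00, 0x00, 0x14, 0x00, 0x00, 0x00])
print(split_message_prefix(head))  # (5676, 4)
```

The bytes at the returned start offset are what you would hand to the flatbuffers API as the Message root.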
The Message table has a bodyLength field, which is the byte length of all
the buffers (data, validity, offsets, and typeIds) for all the Arrays in
the Message. Schema messages don't contain any data, so their
bodyLength is always 0.
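If generated flatbuffers bindings aren't available yet, bodyLength can be pulled out by hand. The sketch below is only an illustration: it assumes the field order in Message.fbs at the commit linked above (version, the header union, then bodyLength), and the names `read_body_length` and `BODY_LENGTH_FIELD` are mine, not part of any Arrow API. Generated flatbuffers code is the better choice for real use.

```python
import struct

# Assumed field order in the Message table (Message.fbs at the linked commit):
# version, header (a union, which takes two slots: type and value), bodyLength.
# That puts bodyLength at field index 3.
BODY_LENGTH_FIELD = 3


def read_body_length(metadata: bytes) -> int:
    """Return Message.bodyLength from a Message flatbuffer, or 0 if absent."""
    # The first 4 bytes hold the offset of the root table.
    (table,) = struct.unpack_from("<I", metadata, 0)
    # The table starts with a signed offset pointing back to its vtable.
    (to_vtable,) = struct.unpack_from("<i", metadata, table)
    vtable = table - to_vtable
    (vtable_size,) = struct.unpack_from("<H", metadata, vtable)
    # vtable layout: u16 vtable size, u16 table size, then one u16 per field.
    slot = 4 + 2 * BODY_LENGTH_FIELD
    if slot >= vtable_size:
        return 0  # vtable too short to contain this field
    (field_offset,) = struct.unpack_from("<H", metadata, vtable + slot)
    if field_offset == 0:
        return 0  # field omitted, so it takes its default value
    (body_length,) = struct.unpack_from("<q", metadata, table + field_offset)
    return body_length
```

Flatbuffers writers typically omit fields left at their default value, which is why a Schema message falls through to the `return 0` branch.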
Once you've read the Message table via flatbuffers, advance `metadata
length` number of bytes from the start of the metadata to position yourself
to read the Array buffers. The buffers occupy the next `bodyLength` bytes,
so advancing by `bodyLength` from the start of the body lands on the next
message. Repeat this process to read all Messages from an Arrow stream.
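The whole loop can be sketched in Python as follows. It assumes each message is laid out as an int32 length prefix, then the metadata, then the body, and it treats a non-positive length as the end of the stream. Both `iter_message_frames` and the caller-supplied `read_body_length` callable are hypothetical names; in practice the callable would use the flatbuffers API to read `bodyLength` from the Message table.

```python
import struct


def iter_message_frames(buf, read_body_length):
    """Yield (metadata, body) byte slices for each message in `buf`.

    `read_body_length` is a caller-supplied function that takes the
    metadata bytes and returns the Message's bodyLength.
    """
    pos = 0
    while pos + 4 <= len(buf):
        (metadata_length,) = struct.unpack_from("<i", buf, pos)
        if metadata_length <= 0:
            break  # assumption of this sketch: treat as end of stream
        pos += 4  # step over the length prefix
        metadata = buf[pos:pos + metadata_length]
        pos += metadata_length  # now at the start of the body
        body_length = read_body_length(metadata)
        body = buf[pos:pos + body_length]
        pos += body_length  # now at the next message's length prefix
        yield metadata, body
```

Passing the decoder in as a parameter keeps the framing logic separate from the flatbuffers parsing, so each half can be checked on its own.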
If you're familiar with JavaScript/TypeScript, you can reference the
implementation here
<https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/js/src/ipc/reader/binary.ts#L145>.
Hope this clears things up,
Paul
On 07/12/2018 11:30 AM, Randy Zwitch wrote:
I’m trying to understand how to parse a Buffer into a Schema, but using
pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
really cleared much up for me. Nor has studying
https://arrow.apache.org/docs/ipc.html
Here are the steps of what I’ve tried (the code is Julia, but only
because I’m trying to do this natively, rather than wrap the Arrow C code):
# Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size) (works as expected)
julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit 1000", 0, 0, 1000)
MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7, 0x7e, 0x50], 188880)
# Wrap shared memory into julia array, based on handle and size (works as expected)
julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) # wrapper using shmget/shmat
93856-element Array{UInt8,1}:
0x2c
0x16
0x00
0x00
0x14
0x00
0x00
0x00
0x00
0x00
⋮
0x20
0x74
0x6f
0x20
0x4d
0x66
0x72
0x00
0x00
At this point, walking through a similar Python process, I know that
sm_buf represents
- type: Schema
- metadata length: 5676
- body_length: 0
Where I’m confused is how to proceed.
I am getting metadata_length by reinterpreting the first 4 bytes as Int32.
julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
5676
I then assumed that I could start at byte 5 and take the next `mlen`
bytes:
julia> metadata = sm_buf[5:5+mlen-1]
5676-element Array{UInt8,1}:
0x14
0x00
0x00
0x00
0x00
0x00
0x00
0x00
0x0c
0x00
⋮
0x79
0x65
0x61
0x72
0x00
0x00
0x00
0x00
0x00
Am I on the right track here? I *think* that my `metadata` variable above
is a FlatBuffer, but how do I know what its structure is? Additionally,
what am I supposed to do with all of the bytes that haven’t been read from
`sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
+ metadata length, leaving some 88,000 bytes not processed yet.
Any help would be greatly appreciated here. Please note that I’m not asking
for julia coding help, but rather what the Arrow bytes actually mean/their
structure and how to process them further.
Thanks,
Randy Zwitch