Re: Help understanding IPC Message/Buffer structure

Wes McKinney Thu, 12 Jul 2018 13:39:56 -0700

hi Randy,

In Julia I think this is complicated by the lack of a Flatbuffers
compiler for the language. In the case of Feather files, in Feather.jl
they have implemented the Flatbuffers schema in Julia code:


https://github.com/JuliaData/Feather.jl/blob/master/src/metadata.jl#L3

So you need to do one of:

a) Write a Julia compiler for Flatbuffers files,
b) Write a native implementation of the Arrow schemas by hand, or
c) Wrap a C or C++ version of the compiled Flatbuffers schema

Here is some C++ code where we read a generic Message:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/message.cc#L139

Here's where we read the message protocol from a generic InputStream
(and then call Message::ReadFrom):

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/message.cc#L236

In the case of a Schema, the body length will be 0.
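
As a rough illustration of the read loop in the C++ code linked above,
here is a minimal Python sketch. It assumes a little-endian int32 length
prefix, followed by that many bytes of Message metadata, followed by
bodyLength bytes of body, and that a zero prefix marks the end of the
stream. The names read_messages and body_length_of are made up for this
example; body_length_of stands in for the Flatbuffers step that decodes
the Message and returns its bodyLength.

```python
import io
import struct
from typing import BinaryIO, Callable, Iterator, Tuple


def read_messages(
    stream: BinaryIO,
    body_length_of: Callable[[bytes], int],
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (metadata, body) pairs from a length-prefixed message stream.

    body_length_of is a hypothetical callback: it must decode the
    Flatbuffers Message in `metadata` and return its bodyLength field.
    """
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return  # nothing left to read
        # Assumption: the length prefix is a little-endian int32.
        (metadata_length,) = struct.unpack("<i", prefix)
        if metadata_length == 0:
            return  # assumption: a zero length marks the end of the stream
        metadata = stream.read(metadata_length)
        if len(metadata) < metadata_length:
            raise EOFError("stream ended inside the message metadata")
        # For a Schema message this is 0, so `body` is empty.
        body_length = body_length_of(metadata)
        body = stream.read(body_length)
        if len(body) < body_length:
            raise EOFError("stream ended inside the message body")
        yield metadata, body


# Decoding the length prefix from the byte dump quoted in this thread:
first_four = bytes([0x2C, 0x16, 0x00, 0x00])
print(struct.unpack("<i", first_four)[0])  # 5676
```

The callback keeps the framing logic separate from the Flatbuffers
decoding, so whichever of (a), (b) or (c) you pick only has to supply
that one function.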

- Wes

On Thu, Jul 12, 2018 at 3:58 PM, Paul Taylor <ptaylor.apa...@gmail.com> wrote:
> Hi Randy,
>
> The first four bytes are the int32 length of the flatbuffers Message metadata
> <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/format/Message.fbs#L93>
> plus 4 bytes of padding between the length and the Message metadata itself.
> The Message metadata starts on the 8th byte.
>
> So to read an entire Message, read and store the first four bytes (the
> metadata length). Then advance past the 4 padding bytes, and use the
> flatbuffers API to read the Message table.
>
> The Message table has a bodyLength field, which is the byte length of all the
> buffers (data, validity, offsets, and typeIds) for all the Arrays in the
> Message (since Schema messages don't contain any data, their bodyLength is
> always 0).
>
> Once you've read the Message table via flatbuffers, advance `metadata length`
> number of bytes to position yourself to read the Array buffers.
>
> After reading the buffers, advance another `bodyLength` number of bytes to
> read the next message. Repeat this process to read all Messages from an
> Arrow stream.
>
> If you're familiar with JavaScript/TypeScript, you can reference the
> implementation here
> <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/js/src/ipc/reader/binary.ts#L145>.
>
> Hope this clears things up,
>
> Paul
>
>
>
> On 07/12/2018 11:30 AM, Randy Zwitch wrote:
>>
>> I’m trying to understand how to parse a Buffer into a Schema, but
>> using pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
>> really cleared much up for me. Nor has studying
>> https://arrow.apache.org/docs/ipc.html
>>
>>
>> Here are the steps of what I’ve tried (the code is Julia, but only
>> because I’m trying to do this natively, rather than wrap the Arrow C
>> code):
>>
>>
>> # Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size)
>>   (works as expected)
>> julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit 1000", 0, 0, 1000)
>>
>> MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7,
>> 0x7e, 0x50], 188880)
>>
>> # Wrap shared memory into julia array, based on handle and size (works as
>> expected)
>> julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) # wrapper using shmget/shmat
>> 93856-element Array{UInt8,1}:
>>   0x2c
>>   0x16
>>   0x00
>>   0x00
>>   0x14
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>      ⋮
>>   0x20
>>   0x74
>>   0x6f
>>   0x20
>>   0x4d
>>   0x66
>>   0x72
>>   0x00
>>   0x00
>>
>> At this point, walking through a similar Python process, I know that
>> sm_buf represents
>> - type: Schema
>> - metadata length: 5676
>> - body_length: 0
>>
>> Where I’m confused is how to proceed.
>>
>> I am getting metadata_length by reinterpreting the first 4-bytes as Int32.
>>
>> julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
>> 5676
>>
>> I then assumed that I could start at byte 5 and take the next `mlen`
>> bytes:
>>
>> julia> metadata = sm_buf[5:5+mlen-1]
>> 5676-element Array{UInt8,1}:
>>   0x14
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x0c
>>   0x00
>>      ⋮
>>   0x79
>>   0x65
>>   0x61
>>   0x72
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>
>>
>> Am I on the right track here? I *think* that my `metadata` variable above
>> is a FlatBuffer, but how do I know what its structure is? Additionally,
>> what am I supposed to do with all of the bytes that haven’t been read from
>> `sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
>> + metadata length, leaving some 88,000 bytes not processed yet.
>>
>> Any help would be greatly appreciated here. Please note that I’m not asking
>> for Julia coding help, but rather what the Arrow bytes actually mean/their
>> structure and how to process them further.
>>
>> Thanks,
>> Randy Zwitch
>>
>
