I opened https://issues.apache.org/jira/browse/ARROW-2319 per the buffered output point.
On Thu, Mar 15, 2018 at 8:38 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Rares,
>
> sorry for the delay in writing back. To your questions:
>
> - Does this look decent? Could the API be used in more efficient
> ways in order to achieve the same goal?
>
> This seems perfectly reasonable with the API as it is now.
>
> - On each node side, do steps #3 and #7 copy data?
>
> Step 3 does not copy data, but step 7 does. You could do better by
> writing directly to the network protocol instead of using
> io::BufferOutputStream (which accumulates data in an in-memory
> buffer). It would be useful to have a buffered writer to limit the
> number of writes to the socket (or whatever protocol is being used).
> Feel free to open a JIRA about this.
>
> - On the coordinator side, do steps #3.2 and #3.3 copy data?
>
> Per ARROW-2189 -- do you get these problems if you build from source?
> I'm trying to understand if this is a packaging issue or a code issue.
>
> Step 3.2 does not copy data. Step 3.3 does write data to the
> FileOutputStream.
>
> - On the coordinator side, do I really need to read and write a record
> batch? Could I copy the buffer directly somehow?
>
> No, you don't necessarily need to. The idea of the Message* classes in
> arrow::ipc is to facilitate transporting messages while being agnostic
> to their contents. This would be a useful test case to flesh out these
> APIs. Could you please open some JIRAs about this? There are already
> some Message-related JIRAs open, so take a look at what is there
> already.
>
> Thanks
> Wes
>
> On Mon, Feb 26, 2018 at 10:22 AM, Rares Vernica <rvern...@gmail.com> wrote:
>> Hello,
>>
>> I am using the C++ API to serialize and centralize data over the network.
>> I am wondering if I am using the API in an efficient way.
>>
>> I have multiple nodes and a coordinator communicating over the network. I
>> do not have fine control over the network communication. Individual nodes
>> write one chunk of data to the network.
>> The coordinator will receive all the chunks and can loop over them.
>>
>> On each node I do the following (the code is here
>> <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L512>):
>>
>> 1. Append data to Builders
>> 2. Finish Builders and get Arrays
>> 3. Create Record Batch from Arrays
>> 4. Create Pool Buffer
>> 5. Create Buffer Output Stream using Pool Buffer
>> 6. Open Record Batch Stream Writer using Buffer Output Stream
>> 7. Write Record Batch to writer
>> 8. Write Buffer data to network
>>
>> On the coordinator I do the following (the code is here
>> <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L1000>):
>>
>> 1. Open File Output Stream
>> 2. Open Record Batch Stream Writer using File Output Stream
>> 3. For each chunk retrieved from the network:
>>    1. Create Buffer Reader using data retrieved from the network (the
>>       code is here
>>       <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L1054>)
>>    2. Create Record Batch Stream Reader using Buffer Reader and read
>>       Record Batch (I plan to use ReadRecordBatch, but I'm having
>>       issues like in ARROW-2189)
>>    3. Write Record Batch
>>
>> A few questions:
>>
>> - Does this look decent? Could the API be used in more efficient
>>   ways in order to achieve the same goal?
>> - On each node side, do steps #3 and #7 copy data?
>> - On the coordinator side, do steps #3.2 and #3.3 copy data?
>> - On the coordinator side, do I really need to read and write a record
>>   batch? Could I copy the buffer directly somehow?
>>
>> Thank you so much!
>> Rares
>>
>> (Could the same Pool Buffer be reused across calls?)
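For readers following along, the node-side steps 4-8 and coordinator-side steps 3.1-3.2 above map onto the Arrow C++ API roughly as follows. This is a sketch against the API as it stood in early 2018 (around Arrow 0.9); exact signatures have changed in later releases, and the helper names (`SerializeBatch`, `OpenBatchReader`) are my own, not from the thread:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/reader.h>
#include <arrow/ipc/writer.h>

// Node side, steps 4-8: serialize one record batch into an in-memory
// buffer that can then be handed to the network layer.
arrow::Status SerializeBatch(const std::shared_ptr<arrow::RecordBatch>& batch,
                             std::shared_ptr<arrow::Buffer>* out) {
  // Steps 4-5: pool-backed in-memory output stream.
  std::shared_ptr<arrow::io::BufferOutputStream> stream;
  ARROW_RETURN_NOT_OK(arrow::io::BufferOutputStream::Create(
      4096, arrow::default_memory_pool(), &stream));

  // Step 6: stream-format writer over the output stream.
  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
  ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(
      stream.get(), batch->schema(), &writer));

  // Step 7: per Wes's answer, this is where the batch's data is copied
  // into the in-memory buffer.
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  ARROW_RETURN_NOT_OK(writer->Close());

  // Step 8: hand the finished buffer to the network layer.
  return stream->Finish(out);
}

// Coordinator side, steps 3.1-3.2: wrap the received bytes without
// copying (BufferReader is zero-copy over an existing Buffer) and open
// a stream reader over them.
arrow::Status OpenBatchReader(
    const std::shared_ptr<arrow::Buffer>& data,
    std::shared_ptr<arrow::RecordBatchReader>* reader) {
  auto buffer_reader = std::make_shared<arrow::io::BufferReader>(data);
  return arrow::ipc::RecordBatchStreamReader::Open(buffer_reader, reader);
}
```

As the thread notes, step 3 (creating the `RecordBatch` from finished arrays) and step 3.1 (wrapping received bytes in a `BufferReader`) are zero-copy; the copies happen when the writer serializes into the stream and when the coordinator writes out to its `FileOutputStream`.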