For the record, if I run something like "flight-benchmark -num_streams 32 -records_per_batch 65536 -num_threads 1"
then:
- 80% of the perf-server's CPU time seems spent inside the implicit
  memcpy() calls in SerializationTraits<IpcPayload>::Serialize() (going
  through CodedOutputStream::WriteRaw)
- 80% of the benchmark client's CPU time seems spent inside the implicit
  memcpy() calls in GrpcBuffer::Wrap (going through
  grpc_byte_buffer_reader_readall; a sketch of a slice-by-slice
  alternative appears at the end of this message)

Regards

Antoine.


On Tue, 12 Feb 2019 16:06:12 -0600
Wes McKinney <wesmck...@gmail.com> wrote:

> On Tue, Feb 12, 2019 at 3:46 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> > On 12/02/2019 at 22:34, Wes McKinney wrote:
> > > On Tue, Feb 12, 2019 at 2:48 PM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >>
> > >> Hi David,
> > >>
> > >> I think allowing application-specific ancillary data to be sent in
> > >> addition to Arrow data makes sense.
> > >>
> > >> (I'm also wondering whether the choice of gRPC is appropriate at all -
> > >> the current C++ hacks around "zero-copy" are not pretty and they may
> > >> not translate to other languages either)
> > >
> > > This is unrelated to the discussion of extending the Flight protocol,
> > > but I'm not sure I would describe the serialization optimizations that
> > > have been implemented as "hacks". gRPC exposes its message
> > > serialization layer, among other things, to permit extensibility and
> > > to avoid requiring the use of Protocol Buffers.
> >
> > One thing that surfaced is that the current implementation relies on C++
> > undefined behaviour (the reinterpret_cast from pb::FlightData to the
> > unrelated struct FlightData). I don't know if there's a way to
> > reimplement the optimization without that cast, but otherwise it's cause
> > for worry, IMHO.
>
> Is there a JIRA about this? I spent some time looking around gRPC's
> C++ library (which is header-only) and AFAICT the only exposure of
> the template parameter to any relevant part of the code is at the
> SerializationTraits interface, so the two template types should be
> internally isomorphic (but I am not a C++ language lawyer). There may
> be a safer way to get the library to generate the code we are looking
> for. Note that the initial C++ implementation was written over a few
> days; my goal was to get something working and to do more research
> later.
>
> > > The reason we chose the Protobuf wire format for all message types,
> > > including data, is that there is excellent cross-language support for
> > > protobufs, and among production-ready RPC frameworks, gRPC has the
> > > most robust language support, covering pretty much all the languages
> > > we care about: https://github.com/grpc/grpc#to-start-using-grpc. The
> > > only one missing is Rust, and I reckon that will get rectified at some
> > > point (there is already https://github.com/stepancheg/grpc-rust, maybe
> > > it will be adopted into gRPC formally at some point). But to have C++,
> > > C#, Go, Java, and Node officially supported out of the box is not
> > > nothing. I think it would be unwise to go a different way unless you
> > > have some compelling reason that gRPC / HTTP/2 is fundamentally flawed
> > > for this intended use.
> >
> > Since our use case pretty much requires high-performance transmission
> > with as few copies as possible (ideally, data should be sent directly
> > from, and received directly into, Arrow buffers without any intermediate
> > userspace copies), I think we should evaluate whether gRPC can allow us
> > to achieve that (there are still copies currently, AFAICT), and at what
> > cost.
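
To make the mechanism under discussion concrete, here is a minimal sketch of
specializing gRPC's generic SerializationTraits for a non-protobuf message
type. It is illustrative only: the RawPayload struct is hypothetical (not
Arrow's actual FlightData or IpcPayload), the header paths and the exact
Serialize/Deserialize signatures have varied across gRPC versions, and the
sketch assumes the ByteBuffer-based form of the interface. The idea is that
the sender wraps an existing buffer in a refcounted Slice with a release
callback instead of letting the bytes be copied through
CodedOutputStream::WriteRaw.

// Hedged sketch: hypothetical payload type, ByteBuffer-based
// SerializationTraits interface assumed.
#include <grpcpp/grpcpp.h>
#include <grpcpp/impl/codegen/serialization_traits.h>

#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical non-protobuf payload: one user-owned buffer standing in
// for an Arrow record batch body.
struct RawPayload {
  std::shared_ptr<std::vector<uint8_t>> body;
};

namespace grpc {

template <>
class SerializationTraits<RawPayload> {
 public:
  // Sender side: hand gRPC a slice that references the existing buffer,
  // with a callback that releases it once the transport is done.
  static Status Serialize(const RawPayload& msg, ByteBuffer* bb,
                          bool* own_buffer) {
    auto* holder = new std::shared_ptr<std::vector<uint8_t>>(msg.body);
    Slice slice(
        (*holder)->data(), (*holder)->size(),
        [](void* p) {
          delete static_cast<std::shared_ptr<std::vector<uint8_t>>*>(p);
        },
        holder);
    *bb = ByteBuffer(&slice, 1);
    *own_buffer = true;
    return Status::OK;
  }

  // Receiver side: Dump() exposes the received slices without
  // concatenating them; the only byte copy below is the append into a
  // contiguous vector.
  static Status Deserialize(ByteBuffer* bb, RawPayload* msg) {
    std::vector<Slice> slices;
    Status st = bb->Dump(&slices);
    if (!st.ok()) return st;
    msg->body = std::make_shared<std::vector<uint8_t>>();
    for (const Slice& s : slices) {
      msg->body->insert(msg->body->end(), s.begin(), s.end());
    }
    return Status::OK;
  }
};

}  // namespace grpc

Whether the receive side can also avoid its copy depends on whether the
consumer of the bytes can work with non-contiguous slices; the sketch above
still concatenates them.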

> > As a side note, the Flight C++ benchmark currently achieves a bit more
> > than 2 GB/s here. There may be ways to improve this number (does gRPC
> > enable TLS by default? does it compress by default?)...
>
> One design question as we work on this project is how one could open a
> "side channel" of sorts for moving the dataset itself outside of gRPC
> but still using the flexible command layer.
>
> > Regards
> >
> > Antoine.
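
Coming back to the client-side profile at the top of this message:
grpc_byte_buffer_reader_readall() allocates one large slice and memcpy()s
every received slice into it, which is where most of the client's CPU time
went in the benchmark. Below is a hedged sketch of walking the slices in
place with the C-core reader instead; ForEachSlice and SliceConsumer are
hypothetical names, and skipping the concatenation only helps if whatever
reassembles the Arrow IPC message can accept non-contiguous pieces.

// Illustrative only: iterate the slices of a received grpc_byte_buffer
// without concatenating them into a single freshly allocated slice.
#include <grpc/byte_buffer.h>
#include <grpc/byte_buffer_reader.h>
#include <grpc/slice.h>

#include <cstddef>
#include <cstdint>

// Hypothetical consumer callback; a real receiver would hand each piece to
// whatever assembles the incoming message.
using SliceConsumer = void (*)(const uint8_t* data, size_t length, void* arg);

bool ForEachSlice(grpc_byte_buffer* buffer, SliceConsumer consume, void* arg) {
  grpc_byte_buffer_reader reader;
  if (!grpc_byte_buffer_reader_init(&reader, buffer)) {
    return false;
  }
  grpc_slice slice;
  while (grpc_byte_buffer_reader_next(&reader, &slice)) {
    // For an uncompressed message the slice references memory the
    // transport already holds; no copy happens here.
    consume(GRPC_SLICE_START_PTR(slice), GRPC_SLICE_LENGTH(slice), arg);
    grpc_slice_unref(slice);
  }
  grpc_byte_buffer_reader_destroy(&reader);
  return true;
}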