Support of more manipulation for Record Batch
Hi all,

I am working on a distributed sorting program that runs on multiple computation nodes.

In this program, data is represented as pandas DataFrames, and the key operations are groupby, concat, and sort_values. To shuffle data among the computation nodes, the DataFrames are converted to Arrow Record Batches and sent via Arrow Flight.

I have noticed that much of the time is spent on the conversion between DataFrame and Record Batch. The [zero-copy feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy) unfortunately does not apply to my case, since the DataFrames contain strings as well.

I wanted to try replacing the DataFrames with Record Batches, so that no conversion would be needed. However, according to [the documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html), there seems to be no direct way to do groupby and sort_values on Record Batches.

Is there a plan to add such methods to the Record Batch API in the future?

Kind Regards
Chengxin

Sent with [ProtonMail](https://protonmail.com) Secure Email.
Re: Support of more manipulation for Record Batch
Hi Wes,

Thank you for your answer. The projects you mentioned look very exciting. I will keep an eye on them.

Kind Regards
Chengxin

‐‐‐ Original Message ‐‐‐
On Thursday, April 2, 2020 5:46 PM, Wes McKinney wrote:

> hi Chengxin,
>
> Yes, if you look at the JIRA tracker and look for past discussions on
> the mailing list, there are plans to develop comprehensive data
> manipulation and query processing capabilities in this project for use
> in Python, R, and any other language that binds to C++, including
> C/GLib and Ruby.
>
> The way that this functionality is exposed in the pyarrow API will
> almost certainly be different than pandas, though. Rather than have
> objects with long lists of instance methods, we would opt instead for
> computational functions that "act" on the data structures, producing
> one or more data structures as output, more similar to tools like
> dplyr (an R library). Developers are welcome to create pandas-like
> convenience layers, of course, should they so choose.
>
> References:
>
> - C++ datasets API project
>   https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
> - C++ query engine project
>   https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing
> - C++ data frame API project
>   https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
>
> Building these things takes time, especially considering the scope of
> maintenance involved with keeping this project running. If anyone
> reading is interested in contributing time or money to this effort I'd
> be happy to speak with you offline about it. If you would like to
> contribute we would be glad to have you aboard.
>
> Thanks
> Wes
>
> On Thu, Apr 2, 2020 at 6:50 AM Chengxin Ma c...@protonmail.ch.invalid wrote:
>
> > Hi all,
> > I am working on a distributed sorting program that runs on multiple
> > computation nodes.
> > In this program, data is represented as pandas DataFrames, and the key
> > operations are groupby, concat, and sort_values. To shuffle data among
> > the computation nodes, the DataFrames are converted to Arrow Record
> > Batches and sent via Arrow Flight.
> > I have noticed that much of the time is spent on the conversion between
> > DataFrame and Record Batch.
> > The zero-copy feature unfortunately does not apply to my case, since the
> > DataFrames contain strings as well.
> > I wanted to try replacing the DataFrames with Record Batches, so that no
> > conversion would be needed. However, according to the documentation,
> > there seems to be no direct way to do groupby and sort_values on Record
> > Batches.
> > Is there a plan to add such methods to the Record Batch API in the
> > future?
> > Kind Regards
> > Chengxin
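
To make the API style Wes describes a bit more concrete, here is a minimal sketch in plain Python. It is not pyarrow: the "batch" is just a dict of equal-length column lists, and `take`, `sort_by`, and `group_by` are hypothetical free functions invented for illustration. The point is only that each function takes a data structure and returns new ones, instead of being a method on the batch.

```python
from collections import defaultdict
from typing import Dict, List

# A toy columnar "record batch": column name -> list of values.
# This is NOT pyarrow's RecordBatch; it only illustrates the API style.
Batch = Dict[str, List]


def take(batch: Batch, indices: List[int]) -> Batch:
    """Return a new batch containing the rows at the given indices."""
    return {name: [column[i] for i in indices] for name, column in batch.items()}


def sort_by(batch: Batch, key: str) -> Batch:
    """Return a new batch with rows ordered by the `key` column (stable)."""
    n_rows = len(batch[key])
    order = sorted(range(n_rows), key=lambda i: batch[key][i])
    return take(batch, order)


def group_by(batch: Batch, key: str) -> Dict[object, Batch]:
    """Split a batch into one new batch per distinct value of `key`."""
    rows_per_group = defaultdict(list)
    for i, value in enumerate(batch[key]):
        rows_per_group[value].append(i)
    return {value: take(batch, rows) for value, rows in rows_per_group.items()}


# Hypothetical sample data.
batch = {"word": ["pear", "apple", "fig", "apple"], "count": [3, 1, 2, 5]}
sorted_batch = sort_by(batch, "word")
groups = group_by(batch, "word")
```

Because the functions never mutate their input, `batch` is unchanged after both calls, which is what lets such functions be chained or layered under a pandas-like convenience wrapper.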
Re: Flight benchmark question
Hi Yibo,

Your discovery is impressive.

Did you consider the `num_streams` parameter [1] as well? If I understand correctly, this parameter sets the number of conceptual concurrent streams between the client and the server, while `num_threads` sets the size of the thread pool that actually handles these streams [2]. Both parameters default to 4.

As for CPU usage, the `records_per_batch` parameter [3] has an impact as well. If you increase its value, you will probably see the data transfer speed increase while the server-side CPU usage drops [4]. My guess is that as more records are put into one record batch, the total number of batches decreases. CPU is only used for (de)serializing the metadata (i.e. schema) of each record batch, while the payload can be transferred at zero cost [5].

[1] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
[2] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
[3] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
[4] https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
[5] See "Optimizing Data Throughput over gRPC" in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

Kind Regards
Chengxin

‐‐‐ Original Message ‐‐‐
On Wednesday, June 17, 2020 8:35 AM, Yibo Cai wrote:

> Found a way to achieve reasonable benchmark results with multiple threads.
> Diff pasted below for a quick review or try.
> Tested on E5-2650, with this change:
> num_threads = 1, speed = 1996
> num_threads = 2, speed = 3555
> num_threads = 4, speed = 5828
>
> When running `arrow_flight_benchmark`, I find there's only one TCP connection
> between client and server, no matter what `num_threads` is. All clients share
> one TCP connection. At server side, I see only one thread is processing
> network packets. On my machine, one client already saturates a CPU core, so
> it becomes worse when `num_threads` increases, as that single server thread
> becomes the bottleneck.
>
> If running in standalone mode, flight clients are from different processes
> and have their own TCP connections to the server. There are separate server
> threads handling network traffic for each connection, without a central
> bottleneck.
>
> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL[1] just before
> giving up. Setting that arg makes each client establish its own TCP
> connection to the server, similar to standalone mode.
>
> Actually, I'm not quite sure if we should set this arg. Sharing one TCP
> connection is a reasonable configuration, and it's an advantage of gRPC[2].
>
> Per my test, most CPU cycles are spent in kernel mode doing networking and
> data transfer. Maybe a better solution is to leverage modern network
> techniques like RDMA or a user mode stack for higher performance.
>
> [1] https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
> [2] https://platformlab.stanford.edu/Seminar Talks/gRPC.pdf, page5
>
> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
> index d530093d9..6904640d3 100644
> --- a/cpp/src/arrow/flight/client.cc
> +++ b/cpp/src/arrow/flight/client.cc
> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
>      args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
>      // Receive messages of any size
>      args.SetMaxReceiveMessageSize(-1);
> +    // Setting this arg enables each client to open its own TCP connection to server,
> +    // not sharing one single connection, which becomes bottleneck under high load.
> +    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>
>      if (options.override_hostname != "") {
>        args.SetSslTargetNameOverride(options.override_hostname);
>
> On 6/15/20 10:00 PM, Wes McKinney wrote:
>
> > On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou anto...@python.org wrote:
> >
> > > On 15/06/2020 at 15:36, Wes McKinney wrote:
> > >
> > > > When you have only a single server, all the gRPC traffic goes through
> > > > a common port and is handled by a common server, so if both client and
> > > > server are roughly IO bound you aren't going to get better performance
> > > > by hitting the server with multiple clients simultaneously, only worse
> > > > because the packets from different client requests are intermingled in
> > > > the TCP traffic on that port. I'm not a networking expert but this is
> > > > my best understanding of what is going on.
> > >
> > > Yibo Cai's experiment disproves that explanation, though.
> > > When I run a single client against the test server, I get ~4 GB/s. When
> > > I run 6 standalone clients against the same test server, I get ~8 GB/s
> > > aggregate. So there's
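
The `records_per_batch` reasoning in Chengxin's reply comes down to simple arithmetic: for a fixed number of records, larger batches mean fewer batches, and so fewer per-batch metadata messages to (de)serialize. A minimal sketch follows; the total record count and the batch sizes are hypothetical values chosen for illustration, not the benchmark's actual defaults.

```python
import math


def batch_count(total_records: int, records_per_batch: int) -> int:
    """Number of record batches needed to carry `total_records` rows."""
    return math.ceil(total_records / records_per_batch)


# Hypothetical workload size; the real benchmark's defaults may differ.
TOTAL_RECORDS = 10_000_000

for records_per_batch in (1_024, 4_096, 65_536):
    # Each batch carries one metadata message, so fewer batches means
    # less per-batch CPU work for the same payload.
    print(records_per_batch, batch_count(TOTAL_RECORDS, records_per_batch))
```

The payload size is the same in all three cases; only the number of per-batch headers changes.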
Re: [jira] [Created] (ARROW-7434) [GLib] Homebrew packages seem not working
It works. Thanks!

‐‐‐ Original Message ‐‐‐
On Wednesday, December 18, 2019 2:52 PM, Jeroen Ooms wrote:

> Try this:
>
> gcc -o hello_world hello_world.c $(pkg-config --libs --cflags arrow-glib)
>
> On Wed, Dec 18, 2019 at 1:07 PM Chengxin Ma (Jira) j...@apache.org wrote:
>
> > Chengxin Ma created ARROW-7434:
> >
> > Summary: [GLib] Homebrew packages seem not working
> > Key: ARROW-7434
> > URL: https://issues.apache.org/jira/browse/ARROW-7434
> > Project: Apache Arrow
> > Issue Type: Bug
> > Components: GLib
> > Affects Versions: 0.15.1
> > Environment: macOS 10.15.2
> > Reporter: Chengxin Ma
> >
> > After installing {{apache-arrow}} and {{apache-arrow-glib}} via
> > {{Homebrew}} according to the [Installation
> > Guide|https://arrow.apache.org/install/], I wrote a very simple program to
> > test if they were successfully installed.
> > {code}
> > $ cat hello_world.c
> > #include <stdio.h>
> > #include <arrow-glib/arrow-glib.h>
> > int main(int argc, char **argv) {
> >   printf("Hello, World! \n");
> > }
> > {code}
> > {{gcc}} gave the following error:
> > {code}
> > $ gcc -o hello_world hello_world.c
> > In file included from hello_world.c:3:
> > In file included from /usr/local/include/arrow-glib/arrow-glib.h:22:
> > /usr/local/include/arrow-glib/gobject-type.h:22:10: fatal error:
> > 'glib-object.h' file not found
> > #include <glib-object.h>
> > ^~~
> > 1 error generated.
> > {code}
> > Is there any step that I didn’t follow here?
> > --
> > This message was sent by Atlassian Jira
> > (v8.3.4#803005)
Trouble with building Arrow GLib
Hi All,

I am building Arrow GLib on a system where I am not the admin.

The installation instructions (https://github.com/apache/arrow/tree/master/c_glib) offer two options for building Arrow GLib: "How to build by users" and "How to build by developers". I followed the "by users" section and ran into the following error:

"checking for gobject-introspection... configure: error: gobject-introspection-1.0 is not installed".

"GObject Introspection" is mentioned explicitly in the "How to build by developers" section. Does this indicate that it is optional when building as a user?

I checked "./configure --help" but didn't find a way to exclude it from the build process.

I've also tried to build GObject Introspection (http://www.linuxfromscratch.org/blfs/view/svn/general/gobject-introspection.html), but Meson isn't available on the system...

Could someone please suggest how I can build Arrow GLib?

Kind Regards
Chengxin
Re: Trouble with building Arrow GLib
Hi Kou,

I am working on [the Cartesius system](https://userinfo.surfsara.nl/systems/cartesius). The OS on it is [bullx Linux](https://userinfo.surfsara.nl/systems/cartesius/software/rhel).

I’ve successfully installed Meson according to your suggestion, however it couldn’t run, probably due to some compatibility issues on the system. (I’ve written an email to the Helpdesk for help.)

In the meantime of waiting for their response, I would like to know if it is possible to do some modification in the source code of the current Apache Arrow release to disable building GObject Introspection, instead of waiting for the next release.

Kind Regards
Chengxin

‐‐‐ Original Message ‐‐‐
On Friday, December 20, 2019 4:49 AM, Sutou Kouhei wrote:

> Hi,
>
> Arrow GLib can provide C API but it requires GObject
> Introspection for now. So you need to install GObject
> Introspection to use Arrow GLib for now.
> (I'll add a build option to build without GObject
> Introspection. Then you can build Arrow GLib without GObject
> Introspection in the next release.)
>
> Could you show your environment? Are you using a Linux distribution?
>
> You can install Meson by "pip install --user meson".
> See also: https://mesonbuild.com/Getting-meson.html#installing-meson-with-pip
>
> Thanks,
> ---
> kou
>
> In DIVG0Hqiw9iory_bs1T6A_cf5etWsPJ0-lbAFxaJ4H2hrTm87EGUED3ztGenYN6EjVzW5_oYL1KIH4V3F_BE1dMQyu4EtbBflin-m-iGC_Q=@protonmail.ch
> "Trouble with building Arrow GLib" on Thu, 19 Dec 2019 10:53:48 +,
> Chengxin Ma c...@protonmail.ch.INVALID wrote:
>
> > Hi All,
> > I am building Arrow GLib on a system where I am not the admin.
> > The installation instructions
> > (https://github.com/apache/arrow/tree/master/c_glib) offer two options
> > for building Arrow GLib: "How to build by users" and "How to build by
> > developers". I followed the "by users" section and ran into the
> > following error:
> > "checking for gobject-introspection... configure: error:
> > gobject-introspection-1.0 is not installed".
> > "GObject Introspection" is mentioned explicitly in the "How to build by
> > developers" section. Does this indicate that it is optional when
> > building as a user?
> > I checked "./configure --help" but didn't find a way to exclude it from
> > the build process.
> > I've also tried to build GObject Introspection
> > (http://www.linuxfromscratch.org/blfs/view/svn/general/gobject-introspection.html),
> > but Meson isn't available on the system...
> > Could someone please suggest how I can build Arrow GLib?
> > Kind Regards
> > Chengxin
Re: Trouble with building Arrow GLib
Hi Kou,

Thanks for the quick fix. After applying the patch, I am now able to build Arrow GLib.

About the issue related to Meson: I was using Python 3; the problem was solved by using conda instead of pip.

Kind Regards
Chengxin

‐‐‐ Original Message ‐‐‐
On Friday, December 20, 2019 12:27 PM, Sutou Kouhei wrote:

> Hi,
>
> > I’ve successfully installed Meson according to your suggestion, however it
> > couldn’t run, probably due to some compatibility issues on the system.
> > (I’ve written an email to the Helpdesk for help.)
>
> I think that you're using Python 2. Meson requires Python 3.
>
> > In the meantime of waiting for their response, I would like to know if it
> > is possible to do some modification in the source code of the current
> > Apache Arrow release to disable building GObject Introspection, instead of
> > waiting for the next release.
>
> Here is a patch to make GObject Introspection optional:
>
> https://patch-diff.githubusercontent.com/raw/apache/arrow/pull/6072.patch
>
> You need to use Meson to build with this patch.
>
> To use configure, you need to regenerate c_glib/configure
> from c_glib/configure.ac. It requires GObject Introspection.
> So you can't use configure on your environment.
>
> Thanks,
> ---
> kou
>
> In tSWuBtndqpJhkCaHTDcHDaE3za0pbqK-8fnqKT99Vv6QGVxYAjKuZUUSZ4A94U6DiZaWxc8wYc5JXiu9EoZublUwbXhlq8kar_sguY6onWI=@protonmail.ch
> "Re: Trouble with building Arrow GLib" on Fri, 20 Dec 2019 09:59:27 +,
> Chengxin Ma c...@protonmail.ch.INVALID wrote:
>
> > Hi Kou,
> > I am working on the Cartesius system. The OS on it is bullx Linux.
> > I’ve successfully installed Meson according to your suggestion, however it
> > couldn’t run, probably due to some compatibility issues on the system.
> > (I’ve written an email to the Helpdesk for help.)
> > In the meantime of waiting for their response, I would like to know if it
> > is possible to do some modification in the source code of the current
> > Apache Arrow release to disable building GObject Introspection, instead of
> > waiting for the next release.
> > Kind Regards
> > Chengxin
[jira] [Created] (ARROW-8587) Compilation error when linking arrow-flight-perf-server
Chengxin Ma created ARROW-8587:

Summary: Compilation error when linking arrow-flight-perf-server
Key: ARROW-8587
URL: https://issues.apache.org/jira/browse/ARROW-8587
Project: Apache Arrow
Issue Type: Bug
Components: Benchmarking, C++, FlightRPC
Affects Versions: 1.0.0
Environment: Linux HP 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Chengxin Ma

I wanted to play around with the Flight benchmark after seeing the discussion regarding Flight's throughput on the Arrow dev mailing list today.

I ran into the following error when trying to build the benchmark from the latest source code:

{code:java}
[ 95%] Linking CXX executable ../../../debug/arrow-flight-perf-server
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to `boost::filesystem::detail::canonical(boost::filesystem::path const&, boost::filesystem::path const&, boost::system::error_code*)'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to `boost::system::system_category()'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to `boost::filesystem::path::parent_path() const'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `deflate'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `deflateEnd'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to `boost::system::generic_category()'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to `boost::filesystem::detail::current_path(boost::system::error_code*)'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `inflateInit2_'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `inflate'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `deflateInit2_'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `inflateEnd'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to `boost::filesystem::path::operator/=(boost::filesystem::path const&)'
collect2: error: ld returned 1 exit status
src/arrow/flight/CMakeFiles/arrow-flight-perf-server.dir/build.make:154: recipe for target 'debug/arrow-flight-perf-server' failed
make[2]: *** [debug/arrow-flight-perf-server] Error 1
CMakeFiles/Makefile2:2609: recipe for target 'src/arrow/flight/CMakeFiles/arrow-flight-perf-server.dir/all' failed
make[1]: *** [src/arrow/flight/CMakeFiles/arrow-flight-perf-server.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
{code}

I was using {{cmake .. -DCMAKE_BUILD_TYPE=Debug -DARROW_DEPENDENCY_SOURCE=AUTO -DARROW_FLIGHT=ON -DARROW_BUILD_BENCHMARKS=ON -DARROW_CXXFLAGS="-lboost_filesystem -lboost_system"}} to configure the build.

I noticed that there was a {{ARROW_BOOST_BUILD_VERSION: 1.71.0}} in the output, but the Boost library that I installed from the package manager was version {{1.65.1.0ubuntu1}}. Could this be the cause of the problem?

PS: I was able to build the benchmark [before|https://issues.apache.org/jira/browse/ARROW-7200]. That was on AWS with the OS being ubuntu-bionic-18.04-amd64-server-20191002, which should be very similar to the one I'm using on my laptop.
[jira] [Created] (ARROW-8861) Memory not released until Plasma process is killed
Chengxin Ma created ARROW-8861:

Summary: Memory not released until Plasma process is killed
Key: ARROW-8861
URL: https://issues.apache.org/jira/browse/ARROW-8861
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Plasma
Affects Versions: 0.16.0
Environment: Singularity container (Ubuntu 18.04)
Reporter: Chengxin Ma

Invoking the {{Delete(const ObjectID& object_id)}} method of a Plasma client does not seem to actually free the memory used by the object.

To reproduce:
1. use {{htop}} (or other similar tools) to monitor memory usage;
2. start up the Plasma Object Store by {{plasma_store -m 10 -s /tmp/plasma}};
3. use {{put.py}} to put an object into Plasma;
4. compile and run {{delete.cc}} ({{g++ delete.cc `pkg-config --cflags --libs arrow plasma` --std=c++11 -o delete}});
5. kill the {{plasma_store}} process.

Memory usage drops at Step 5, rather than at Step 4.

How can the memory be freed while keeping the Plasma Object Store running?

{{put.py}}:
{code:java}
from pyarrow import plasma

if __name__ == "__main__":
    client = plasma.connect("/tmp/plasma")
    object_id = plasma.ObjectID(20 * b"a")
    object_size = 5
    buffer = memoryview(client.create(object_id, object_size))
    for i in range(5):
        buffer[i] = i % 128
    client.seal(object_id)
    client.disconnect()
{code}

{{delete.cc}}:
{code:java}
#include "arrow/util/logging.h"
#include <plasma/client.h>

using namespace plasma;

int main(int argc, char **argv) {
  PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/plasma"));
  // Plasma object IDs are 20 bytes; this must match the ID used in put.py.
  ObjectID object_id = ObjectID::from_binary("aaaaaaaaaaaaaaaaaaaa");
  client.Delete(object_id);
  ARROW_CHECK_OK(client.Disconnect());
}
{code}
[jira] [Created] (ARROW-7200) Running Arrow Flight benchmark on two hosts doesn't work
Chengxin Ma created ARROW-7200:

Summary: Running Arrow Flight benchmark on two hosts doesn't work
Key: ARROW-7200
URL: https://issues.apache.org/jira/browse/ARROW-7200
Project: Apache Arrow
Issue Type: Bug
Components: Benchmarking, C++, FlightRPC
Affects Versions: 0.15.1, 0.15.0
Environment: AWS EC2
Instance type: t3a.xlarge
AMI: ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20191002
Number of instances: 2
They are capable of pinging each other.
Reporter: Chengxin Ma
Attachments: Screen Shot 2019-11-18 at 16.00.38.png

I was trying to evaluate the performance of Apache Arrow Flight on two hosts (one as the client and the other as the server), using [the official benchmark|https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight_benchmark.cc].

The flags I used to build the project were:
{code:java}
-DARROW_FLIGHT=ON
-DCMAKE_BUILD_TYPE=Debug
-DARROW_BUILD_BENCHMARKS=ON
{code}

The branch I used was maint-0.15.x, since there was a build error on the master branch.

_(The build error on master only occurred in the environment where I set up the two hosts: AWS. In my local environment (macOS) the build was successful on the master branch. I don't think this build error is relevant to the issue, since there is no difference in the cpp source code.)_

On the host acting as the server, I ran
{code:java}
./arrow-flight-perf-server{code}

On the host acting as the client, I ran
{code:java}
./arrow-flight-benchmark --server_host ip-172-31-11-18{code}

It gives the following error:
{code:java}
Failed with error: << IOError: gRPC returned unavailable error, with message: Connect Failed. Detail: Unavailable{code}

If I run
{code:java}
./arrow-flight-benchmark --server_host ip-172-31-11-17{code}
the error is different:
{code:java}
IOError: Server was not available after 10 attempts{code}
This is understandable, since this host doesn't exist at all.

This indicates that Flight is able to find the existing host (ip-172-31-11-18), but the communication somehow doesn't succeed.

The benchmark works fine if I run it against localhost, either by not specifying the server_host flag or by running the server in another process on the same host.

I am not sure if the problem is in the environment or in the code itself. Could someone please give me a hint on how to resolve the problem?
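
One way to narrow this down: a successful ping only shows that ICMP traffic gets through, not that the server's TCP port is reachable from the client host. A minimal diagnostic sketch in Python follows. The helper name `can_connect` is made up for illustration, and the port in the commented usage line (31337) is an assumption based on the "Server port" printed by the benchmark in another report in this archive; this checks reachability only and is not a confirmed cause of the error above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example (run on the client host; the host name comes from the report and
# the port is an assumption, so adjust both to your setup):
# print(can_connect("ip-172-31-11-18", 31337))
```

If this returns False while ping succeeds, the network path or firewall rules for that port are worth checking before looking at the Flight code.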
[jira] [Created] (ARROW-7320) Target arrow-type-benchmark failed to be built on bullx Linux
Chengxin Ma created ARROW-7320:

Summary: Target arrow-type-benchmark failed to be built on bullx Linux
Key: ARROW-7320
URL: https://issues.apache.org/jira/browse/ARROW-7320
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 1.0.0
Environment: bullx Linux
Reporter: Chengxin Ma

I was building Arrow on bullx Linux (a Linux distribution compatible with Red Hat Enterprise Linux).

CMake options:
{code}
-DCMAKE_BUILD_TYPE=Debug -DARROW_FLIGHT=ON -DARROW_BUILD_BENCHMARKS=ON
{code}

{{make}} failed with the following error message:
{code}
Scanning dependencies of target arrow-type-benchmark
[ 72%] Building CXX object src/arrow/CMakeFiles/arrow-type-benchmark.dir/type_benchmark.cc.o
make[2]: *** No rule to make target `gbenchmark_ep/src/gbenchmark_ep-install/lib/libbenchmark_main.a', needed by `debug/arrow-type-benchmark'. Stop.
make[1]: *** [src/arrow/CMakeFiles/arrow-type-benchmark.dir/all] Error 2
make: *** [all] Error 2
{code}

This is due to the same reason as mentioned in [this commit|https://github.com/apache/arrow/pull/4246/commits/f6b0bc7f8dc56f02e2778752235e728b7623a9ee]: if {{-DCMAKE_INSTALL_LIBDIR=lib}} is not explicitly set, {{libbenchmark_main.a}} will be put in {{lib64}} instead of {{lib}}.
[jira] [Created] (ARROW-7411) [C++][Flight] Incorrect Arrow Flight benchmark output
Chengxin Ma created ARROW-7411:

Summary: [C++][Flight] Incorrect Arrow Flight benchmark output
Key: ARROW-7411
URL: https://issues.apache.org/jira/browse/ARROW-7411
Project: Apache Arrow
Issue Type: Improvement
Components: Benchmarking, C++, FlightRPC
Affects Versions: 0.15.1
Environment: macOS
Reporter: Chengxin Ma
Assignee: Chengxin Ma
Fix For: 1.0.0

When running the Arrow Flight benchmark in the following scenario, the output is incorrect.

{code}
$ ./arrow-flight-perf-server &
[1] 12986
Server host: localhost
Server port: 31337
$ ./arrow-flight-benchmark -server_host localhost -test_put
Using remote server: true
Testing method: DoPut
Server host: localhost
Server port: 31337
Bytes read: 128000
Nanos: 496372147
Speed: 2459.25 MB/s
{code}

{{Using remote server}} should be {{false}} and {{Bytes read}} should be {{Bytes write}}.

To correct the result of {{Using remote server}}, we can:
* Change {{if (FLAGS_server_host == "")}} to another condition which checks if there is already an {{arrow-flight-perf-server}} running. This is a bit complicated to do and might add some unnecessary complexity (e.g. we need to make sure it supports all OSes).
* Delete {{Using remote server}}, since we already have {{Server host}} in the output.

I personally prefer the second option and will make a PR.
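
As a cross-check of the figures in the output above, here is a minimal sketch of how the three numbers relate. It assumes the benchmark reports speed as bytes divided by 2^20 and by the elapsed seconds; that formula is an assumption, not something confirmed in this report. Under that assumption, the printed byte count of 128000 would correspond to only about 0.25 MB/s, while 1,280,000,000 bytes reproduces the reported 2459.25 MB/s, which suggests the byte count lost trailing zeros somewhere when the output was copied.

```python
MIB = 1 << 20  # assumption: the benchmark's "MB" means 2**20 bytes


def speed_mb_per_s(total_bytes: int, elapsed_nanos: int) -> float:
    """Throughput in MB/s from a byte count and an elapsed time in ns."""
    seconds = elapsed_nanos / 1e9
    return total_bytes / MIB / seconds


NANOS = 496_372_147  # the "Nanos" value from the report

# Byte count exactly as printed in the report.
print(round(speed_mb_per_s(128_000, NANOS), 4))
# Hypothetical byte count with four more zeros.
print(round(speed_mb_per_s(1_280_000_000, NANOS), 2))
```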
[jira] [Created] (ARROW-7434) [GLib] Homebrew packages seem not working
Chengxin Ma created ARROW-7434:

Summary: [GLib] Homebrew packages seem not working
Key: ARROW-7434
URL: https://issues.apache.org/jira/browse/ARROW-7434
Project: Apache Arrow
Issue Type: Bug
Components: GLib
Affects Versions: 0.15.1
Environment: macOS 10.15.2
Reporter: Chengxin Ma

After installing {{apache-arrow}} and {{apache-arrow-glib}} via {{Homebrew}} according to the [Installation Guide|https://arrow.apache.org/install/], I wrote a very simple program to test if they were successfully installed.

{code}
$ cat hello_world.c
#include <stdio.h>
#include <arrow-glib/arrow-glib.h>
int main(int argc, char **argv) {
  printf("Hello, World! \n");
}
{code}

{{gcc}} gave the following error:
{code}
$ gcc -o hello_world hello_world.c
In file included from hello_world.c:3:
In file included from /usr/local/include/arrow-glib/arrow-glib.h:22:
/usr/local/include/arrow-glib/gobject-type.h:22:10: fatal error: 'glib-object.h' file not found
#include <glib-object.h>
^~~
1 error generated.
{code}

Is there any step that I didn’t follow here?
[jira] [Created] (ARROW-7522) Broken Record Batch returned from a function call
Chengxin Ma created ARROW-7522:

Summary: Broken Record Batch returned from a function call
Key: ARROW-7522
URL: https://issues.apache.org/jira/browse/ARROW-7522
Project: Apache Arrow
Issue Type: Bug
Components: C++, C++ - Plasma
Affects Versions: 0.15.1
Environment: macOS
Reporter: Chengxin Ma

Scenario: retrieving a Record Batch from Plasma with a known Object ID.

The following code snippet works well:
{code:java}
int main(int argc, char **argv) {
  plasma::ObjectID object_id = plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");

  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;
  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(&buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  std::cout << "record_batch->column_name(0): " << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): " << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): " << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): " << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): " << record_batch->column(0)->ToString() << std::endl;
}
{code}

{{record_batch->column(0)->ToString()}} causes a segmentation fault if retrieving the Record Batch is wrapped in a function:
{code:java}
std::shared_ptr<arrow::RecordBatch> GetRecordBatchFromPlasma(plasma::ObjectID object_id) {
  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;
  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(&buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  // Disconnect the client.
  ARROW_CHECK_OK(client.Disconnect());

  return record_batch;
}

int main(int argc, char **argv) {
  plasma::ObjectID object_id = plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");
  std::shared_ptr<arrow::RecordBatch> record_batch = GetRecordBatchFromPlasma(object_id);

  std::cout << "record_batch->column_name(0): " << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): " << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): " << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): " << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): " << record_batch->column(0)->ToString() << std::endl;
}
{code}

The meta info of the Record Batch, such as the number of columns and rows, is still available, but I can't see the content of the columns.

{{lldb}} says that the stop reason is {{EXC_BAD_ACCESS}}, so I think the Record Batch is destroyed after {{GetRecordBatchFromPlasma}} finishes. But why can I still see the meta info of this Record Batch?

What is the proper way to get the Record Batch if we insist on using a function?