Support of more manipulation for Record Batch

2020-04-02 Thread Chengxin Ma
Hi all,

I am working on a distributed sorting program which runs on multiple 
computation nodes.

In this sorting program, data is represented as pandas DataFrames and key 
operations are groupby, concat, and sort_values. For shuffling data among the 
computation nodes, the DataFrames are converted to Arrow Record Batches and 
communicated via Arrow Flight.

What I’ve noticed is that much time was spent on the conversion between 
DataFrame and Record Batch.

The [zero-copy 
feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)
 unfortunately cannot be applied to my case, since the DataFrames contain 
strings as well.

I wanted to try replacing DataFrames with Record Batches, so there would be no 
need for conversion. However, there seems to be no direct way to do groupby and 
sort_values on Record Batches, according to [the 
documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html).

Is there a plan to add such methods to the API of Record Batch in the future?

Kind Regards

Chengxin

Sent with [ProtonMail](https://protonmail.com) Secure Email.

Re: Support of more manipulation for Record Batch

2020-04-03 Thread Chengxin Ma
Hi Wes,

Thank you for your answer.
The projects you mentioned look very exciting. I will keep an eye on them.

Kind Regards
Chengxin


‐‐‐ Original Message ‐‐‐
On Thursday, April 2, 2020 5:46 PM, Wes McKinney  wrote:

> hi Chengxin,
>
> Yes, if you look at the JIRA tracker and look for past discussions on
> the mailing list, there are plans to develop comprehensive data
> manipulation and query processing capabilities in this project for use
> in Python, R, and any other language that binds to C++, including
> C/GLib and Ruby.
>
> The way that this functionality is exposed in the pyarrow API will
> almost certainly be different than pandas, though. Rather than have
> objects with long lists of instance methods, we would opt instead for
> computational functions that "act" on the data structures, producing
> one or more data structures as output, more similar to tools like
> dplyr (an R library). Developers are welcome to create pandas-like
> convenience layers, of course, should they so choose.
>
> References:
>
> -   C++ datasets API project
> 
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
>
> -   C++ query engine project
> 
> https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing
>
> -   C++ data frame API project
> 
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
>
> Building these things takes time, especially considering the scope of
> maintenance involved with keeping this project running. If anyone
> reading is interested in contributing time or money to this effort I'd
> be happy to speak with you offline about it. If you would like to
> contribute we would be glad to have you aboard.
>
> Thanks
> Wes
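A minimal sketch of the function-style design described in the reply above, written in plain Python rather than against any Arrow API. The names `sort_indices` and `take`, and the dict-of-lists stand-in for a record batch, are assumptions made purely for illustration. The point is only that free functions take a data structure and return new ones, instead of the structure carrying methods such as `sort_values`.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a record batch: column name -> list of values.
Columns = Dict[str, List]


def sort_indices(columns: Columns, key: str) -> List[int]:
    """Return the row order that sorts `columns` by the `key` column (stable)."""
    values = columns[key]
    return sorted(range(len(values)), key=values.__getitem__)


def take(columns: Columns, indices: Sequence[int]) -> Columns:
    """Return a new column set holding the rows at `indices`, in that order."""
    return {name: [values[i] for i in indices] for name, values in columns.items()}


batch = {"key": [3, 1, 2], "name": ["c", "a", "b"]}

# Sorting is a composition of two functions; the input is left untouched.
sorted_batch = take(batch, sort_indices(batch, "key"))
print(sorted_batch)  # {'key': [1, 2, 3], 'name': ['a', 'b', 'c']}
```

A pandas-like convenience layer, as mentioned in the reply, could wrap such functions in methods without changing them.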




Re: Flight benchmark question

2020-06-17 Thread Chengxin Ma
Hi Yibo,


Your discovery is impressive.


Did you consider the `num_streams` parameter [1] as well? If I understand 
correctly, this parameter sets the number of logical concurrent streams 
between the client and the server, while `num_threads` sets the size of the 
thread pool that actually handles these streams [2]. Both parameters default 
to 4.


As for CPU usage, the parameter `records_per_batch` [3] has an impact as well. 
If you increase its value, you will probably see the data transfer speed 
increase while the server-side CPU usage drops [4].
My guess is that as more records are put in one record batch, the total number 
of batches decreases. CPU is only used for (de)serializing the metadata 
(i.e. schema) of each record batch, while the payload can be transferred at 
zero cost [5].
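A small sketch of the arithmetic behind that guess. The record counts and batch sizes below are hypothetical values chosen for illustration, not the benchmark's actual settings, and `num_batches` is a helper name introduced here. If per-batch metadata handling is the main CPU cost, the number of batches is a rough proxy for that cost.

```python
def num_batches(total_records: int, records_per_batch: int) -> int:
    """Number of record batches needed to carry `total_records` rows."""
    # Ceiling division without floating point.
    return -(-total_records // records_per_batch)


total = 10_000_000  # hypothetical number of rows to transfer
for per_batch in (4096, 65536):
    print(per_batch, num_batches(total, per_batch))
# 4096 2442
# 65536 153
```

With the same amount of data, a 16x larger batch size means roughly 16x fewer metadata messages to (de)serialize.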


[1] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
[2] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
[3] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
[4] 
https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
[5] See "Optimizing Data Throughput over gRPC" in 
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/


Kind Regards
Chengxin


‐‐‐ Original Message ‐‐‐
On Wednesday, June 17, 2020 8:35 AM, Yibo Cai  wrote:

> Found a way to achieve reasonable benchmark results with multiple threads. Diff 
> pasted below for a quick review or trial.
> Tested on E5-2650, with this change:
> num_threads = 1, speed = 1996
> num_threads = 2, speed = 3555
> num_threads = 4, speed = 5828
>
> When running `arrow_flight_benchmark`, I find there's only one TCP connection 
> between client and server, no matter what `num_threads` is. All clients share 
> one TCP connection. On the server side, I see only one thread processing 
> network packets. On my machine, one client already saturates a CPU core, so 
> it gets worse as `num_threads` increases, because that single server thread 
> becomes the bottleneck.
>
> When running in standalone mode, Flight clients are separate processes 
> and have their own TCP connections to the server. There are separate server 
> threads handling network traffic for each connection, without a central 
> bottleneck.
>
> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL[1] just before 
> giving up. Setting that arg makes each client establish its own TCP connection 
> to the server, similar to standalone mode.
>
> Actually, I'm not quite sure if we should set this arg. Sharing one TCP 
> connection is a reasonable configuration, and it's an advantage of gRPC[2].
>
> Per my test, most CPU cycles are spent in kernel mode doing networking and 
> data transfer. Maybe a better solution is to leverage modern network techniques 
> like RDMA or a user-mode stack for higher performance.
>
> [1] 
> https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
> [2] https://platformlab.stanford.edu/Seminar Talks/gRPC.pdf, page5
>
> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
> index d530093d9..6904640d3 100644
> --- a/cpp/src/arrow/flight/client.cc
> +++ b/cpp/src/arrow/flight/client.cc
> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
>      args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
>      // Receive messages of any size
>      args.SetMaxReceiveMessageSize(-1);
> +    // Setting this arg enables each client to open it's own TCP connection to server,
> +    // not sharing one single connection, which becomes bottleneck under high load.
> +    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>
>      if (options.override_hostname != "") {
>        args.SetSslTargetNameOverride(options.override_hostname);
>
> On 6/15/20 10:00 PM, Wes McKinney wrote:
>
>
> > On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou anto...@python.org wrote:
> >
> > > Le 15/06/2020 à 15:36, Wes McKinney a écrit :
> > >
> > > > When you have only a single server, all the gRPC traffic goes through
> > > > a common port and is handled by a common server, so if both client and
> > > > server are roughly IO bound you aren't going to get better performance
> > > > by hitting the server with multiple clients simultaneously, only worse
> > > > because the packets from different client requests are intermingled in
> > > > the TCP traffic on that port. I'm not a networking expert but this is
> > > > my best understanding of what is going on.
> > >
> > > Yibo Cai's experiment disproves that explanation, though.
> > > When I run a single client against the test server, I get ~4 GB/s. When
> > > I run 6 standalone clients against the same test server, I get ~8 GB/s
> > > aggregate. So there's

Re: [jira] [Created] (ARROW-7434) [GLib] Homebrew packages seem not working

2019-12-18 Thread Chengxin Ma
It works. Thanks!


‐‐‐ Original Message ‐‐‐
On Wednesday, December 18, 2019 2:52 PM, Jeroen Ooms  
wrote:

> Try this:
>
> gcc -o hello_world hello_world.c $(pkg-config --libs --cflags arrow-glib)




Trouble with building Arrow GLib

2019-12-19 Thread Chengxin Ma
Hi All,

I am building Arrow GLib on a system where I'm not the admin.
In the installation instructions 
(https://github.com/apache/arrow/tree/master/c_glib) there are two options for 
building GLib: "How to build by users" and "How to build by developers". I 
followed the "by users" section and met the following problem:
"checking for gobject-introspection... configure: error: 
gobject-introspection-1.0 is not installed".

"GObject Introspection" is mentioned explicitly in the "How to build by 
developers" section. Does this indicate it is optional if we build as a user?
I checked "./configure --help" but didn't find a way to exclude it in the build 
process.
I've also tried to build GObject Introspection 
(http://www.linuxfromscratch.org/blfs/view/svn/general/gobject-introspection.html)
 but Meson isn't available on the system...

Could someone please offer a suggestion for me to build Arrow GLib?

Kind Regards
Chengxin

Re: Trouble with building Arrow GLib

2019-12-20 Thread Chengxin Ma
Hi Kou,

I am working on [the Cartesius 
system](https://userinfo.surfsara.nl/systems/cartesius). The OS on it is [bullx 
Linux](https://userinfo.surfsara.nl/systems/cartesius/software/rhel).

I’ve successfully installed Meson according to your suggestion; however, it 
couldn’t run, probably due to some compatibility issues on the system. (I’ve 
written an email to the Helpdesk for help.)

While waiting for their response, I would like to know if it is possible to 
modify the source code of the current Apache Arrow release to disable 
building GObject Introspection, instead of waiting for the next release.

Kind Regards
Chengxin


‐‐‐ Original Message ‐‐‐
On Friday, December 20, 2019 4:49 AM, Sutou Kouhei  wrote:

> Hi,
>
> Arrow GLib can provide C API but it requires GObject
> Introspection for now. So you need to install GObject
> Introspection to use Arrow GLib for now.
> (I'll add a build option to build without GObject
> Introspection. Then you can build Arrow GLib without GObject
> Introspection in the next release.)
>
> Could you show your environment? Are you using a Linux distribution?
>
> You can install Meson by "pip install --user meson".
> See also: https://mesonbuild.com/Getting-meson.html#installing-meson-with-pip
>
> Thanks,
>
> ---
>
> kou




Re: Trouble with building Arrow GLib

2019-12-20 Thread Chengxin Ma
Hi Kou,

Thanks for the quick fix. After applying the patch I am now able to build 
Arrow GLib.
About the issue related to Meson: I was using Python 3; the problem was solved 
by using conda instead of pip.

Kind Regards
Chengxin


‐‐‐ Original Message ‐‐‐
On Friday, December 20, 2019 12:27 PM, Sutou Kouhei  wrote:

> Hi,
>
> > I’ve successfully installed Meson according to your suggestion; however, it 
> > couldn’t run, probably due to some compatibility issues on the system. 
> > (I’ve written an email to the Helpdesk for help.)
>
> I think that you're using Python 2. Meson requires Python 3.
>
> > While waiting for their response, I would like to know if it is possible 
> > to modify the source code of the current Apache Arrow release to disable 
> > building GObject Introspection, instead of waiting for the next release.
>
> Here is a patch to make GObject Introspection optional:
>
> https://patch-diff.githubusercontent.com/raw/apache/arrow/pull/6072.patch
>
> You need to use Meson to build with this patch.
>
> To use configure, you need to regenerate c_glib/configure
> from c_glib/configure.ac. It requires GObject Introspection.
> So you can't use configure on your environment.
>
> Thanks,
>
> ---
>
> kou




[jira] [Created] (ARROW-8587) Compilation error when linking arrow-flight-perf-server

2020-04-24 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-8587:
--

 Summary: Compilation error when linking arrow-flight-perf-server
 Key: ARROW-8587
 URL: https://issues.apache.org/jira/browse/ARROW-8587
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking, C++, FlightRPC
Affects Versions: 1.0.0
 Environment: Linux HP 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 
31 04:17:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Chengxin Ma


I wanted to play around with the Flight benchmark after seeing the discussion 
regarding Flight's throughput on the Arrow dev mailing list today.

I met the following error when trying to build the benchmark from the latest 
source code:
{code:java}
[ 95%] Linking CXX executable ../../../debug/arrow-flight-perf-server
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to 
`boost::filesystem::detail::canonical(boost::filesystem::path const&, 
boost::filesystem::path const&, boost::system::error_code*)'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to 
`boost::system::system_category()'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to 
`boost::filesystem::path::parent_path() const'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `deflate'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `deflateEnd'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to 
`boost::system::generic_category()'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to 
`boost::filesystem::detail::current_path(boost::system::error_code*)'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `inflateInit2_'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `inflate'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `deflateInit2_'
../../../debug/libarrow_flight.so.18.0.0: undefined reference to `inflateEnd'
../../../debug/libarrow_flight_testing.so.18.0.0: undefined reference to 
`boost::filesystem::path::operator/=(boost::filesystem::path const&)'
collect2: error: ld returned 1 exit status
src/arrow/flight/CMakeFiles/arrow-flight-perf-server.dir/build.make:154: recipe 
for target 'debug/arrow-flight-perf-server' failed
make[2]: *** [debug/arrow-flight-perf-server] Error 1
CMakeFiles/Makefile2:2609: recipe for target 
'src/arrow/flight/CMakeFiles/arrow-flight-perf-server.dir/all' failed
make[1]: *** [src/arrow/flight/CMakeFiles/arrow-flight-perf-server.dir/all] 
Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

{code}
I was using {{cmake .. -DCMAKE_BUILD_TYPE=Debug -DARROW_DEPENDENCY_SOURCE=AUTO 
-DARROW_FLIGHT=ON -DARROW_BUILD_BENCHMARKS=ON 
-DARROW_CXXFLAGS="-lboost_filesystem -lboost_system"}} to configure the build.
I noticed that there was a {{ARROW_BOOST_BUILD_VERSION: 1.71.0}} in the 
output, but the Boost library that I installed from the package manager was 
version {{1.65.1.0ubuntu1}}. Could this be the cause of the problem?

PS:
I was able to build the benchmark 
[before|https://issues.apache.org/jira/browse/ARROW-7200]. It was on AWS with 
the OS being ubuntu-bionic-18.04-amd64-server-20191002, which should be very 
similar to the one I'm using on my laptop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8861) Memory not released until Plasma process is killed

2020-05-19 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-8861:
--

 Summary: Memory not released until Plasma process is killed
 Key: ARROW-8861
 URL: https://issues.apache.org/jira/browse/ARROW-8861
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Plasma
Affects Versions: 0.16.0
 Environment: Singularity container (Ubuntu 18.04)
Reporter: Chengxin Ma


Invoking the {{Delete(const ObjectID& object_id)}} method of a Plasma client 
does not seem to actually free the memory used by the object.

To reproduce:
 1. use {{htop}} (or other similar tools) to monitor memory usage;
 2. start up the Plasma Object Store by {{plasma_store -m 10 -s 
/tmp/plasma}};
 3. use {{put.py}} to put an object into Plasma;
 4. compile and run {{delete.cc}} ({{g++ delete.cc `pkg-config --cflags --libs 
arrow plasma` --std=c++11 -o delete}});
 5. kill the {{plasma_store}} process.

Memory usage drops at Step 5, rather than Step 4.

How can I free the memory while keeping the Plasma Object Store running?

{{put.py}}:
{code:java}
from pyarrow import plasma

if __name__ == "__main__":
    client = plasma.connect("/tmp/plasma")
    object_id = plasma.ObjectID(20 * b"a")
    object_size = 5
    buffer = memoryview(client.create(object_id, object_size))
    for i in range(5):
        buffer[i] = i % 128
    client.seal(object_id)
    client.disconnect()
{code}
{{delete.cc}}:
{code:java}
#include "arrow/util/logging.h"
#include <plasma/client.h>

using namespace plasma;

int main(int argc, char **argv)
{
  PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/plasma"));
  ObjectID object_id = ObjectID::from_binary("");

  client.Delete(object_id);

  ARROW_CHECK_OK(client.Disconnect());
}
{code}
 





[jira] [Created] (ARROW-7200) Running Arrow Flight benchmark on two hosts doesn't work

2019-11-18 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-7200:
--

 Summary: Running Arrow Flight benchmark on two hosts doesn't work
 Key: ARROW-7200
 URL: https://issues.apache.org/jira/browse/ARROW-7200
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking, C++, FlightRPC
Affects Versions: 0.15.1, 0.15.0
 Environment: AWS EC2
Instance type: t3a.xlarge
AMI: ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20191002
Number of instances: 2
They are capable of pinging each other.
Reporter: Chengxin Ma
 Attachments: Screen Shot 2019-11-18 at 16.00.38.png

I was trying to evaluate the performance of Apache Arrow Flight on two hosts 
(one as the client and the other one as the server), using [the official 
benchmark|https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight_benchmark.cc].

Flags I used to build the project were:

 
{code:java}
-DARROW_FLIGHT=ON
-DCMAKE_BUILD_TYPE=Debug
-DARROW_BUILD_BENCHMARKS=ON
{code}
 

The branch I used was maint-0.15.x since there was a build error on the master 
branch. _(The build error on master only existed in the environment where I set 
up two hosts: AWS. On my local environment (macOS) the build was successful on 
the master branch. I don't think this build error is relevant to the issue 
since there is no difference in the cpp source code.)_

On the host acting as the server, I ran 
{code:java}
./arrow-flight-perf-server{code}
On the host acting as the client, I ran 
{code:java}
./arrow-flight-benchmark --server_host ip-172-31-11-18{code}
It gives the following error: 
{code:java}
Failed with error: << IOError: gRPC returned unavailable error, with message: 
Connect Failed. Detail: Unavailable{code}
 

If I instead run 
{code:java}
./arrow-flight-benchmark --server_host ip-172-31-11-17{code}
the error is different:
{code:java}
IOError: Server was not available after 10 attempts{code}
This is understandable since this host doesn't exist at all.

This indicates that Flight is able to find the existing host (ip-172-31-11-18), 
but the communication somehow didn't succeed.

The benchmark works fine if I run it against localhost, either by not 
specifying the server_host flag or by running the server in another process on 
the same host.

I am not sure if the problem is in the environment or in the code itself. Could 
someone please give me a hint on how to resolve the problem?
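One way to narrow this down, independent of Flight, is to separate name resolution from the TCP connection itself. The sketch below is a generic diagnostic using only the Python standard library; `probe` is a helper name introduced here, and port 31337 is assumed only because it is the port the perf server prints in other reports. It does not identify the cause of this issue.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify host:port as 'unresolved', 'refused', 'timeout', 'unreachable' or 'open'."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "unresolved"  # the host name could not be resolved at all
    family, socktype, proto, _, addr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
        except ConnectionRefusedError:
            return "refused"  # host reachable, nothing accepting on this port
        except socket.timeout:
            return "timeout"  # packets dropped, e.g. by a firewall rule
        except OSError:
            return "unreachable"
    return "open"


# Example (run on the client host): probe("ip-172-31-11-18", 31337)
```

A result of "refused" or "timeout" for the server host would point at the listening address or at network rules rather than at the benchmark client.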





[jira] [Created] (ARROW-7320) Target arrow-type-benchmark failed to be built on bullx Linux

2019-12-04 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-7320:
--

 Summary: Target arrow-type-benchmark failed to be built on bullx 
Linux
 Key: ARROW-7320
 URL: https://issues.apache.org/jira/browse/ARROW-7320
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.0
 Environment: bullx Linux
Reporter: Chengxin Ma


I was building Arrow on bullx Linux (a Linux distribution compatible with Red 
Hat Enterprise Linux).

CMake options:
{code}
-DCMAKE_BUILD_TYPE=Debug
-DARROW_FLIGHT=ON
-DARROW_BUILD_BENCHMARKS=ON
{code}

{{make}} failed with the following error message:
{code}
Scanning dependencies of target arrow-type-benchmark
[ 72%] Building CXX object 
src/arrow/CMakeFiles/arrow-type-benchmark.dir/type_benchmark.cc.o
make[2]: *** No rule to make target 
`gbenchmark_ep/src/gbenchmark_ep-install/lib/libbenchmark_main.a', needed by 
`debug/arrow-type-benchmark'.  Stop.
make[1]: *** [src/arrow/CMakeFiles/arrow-type-benchmark.dir/all] Error 2
make: *** [all] Error 2
{code}

This is due to the same reason as mentioned in [this 
commit|https://github.com/apache/arrow/pull/4246/commits/f6b0bc7f8dc56f02e2778752235e728b7623a9ee]:

If {{-DCMAKE_INSTALL_LIBDIR=lib}} is not explicitly set, 
{{libbenchmark_main.a}} will be put in {{lib64}} instead of {{lib}}.





[jira] [Created] (ARROW-7411) [C++][Flight] Incorrect Arrow Flight benchmark output

2019-12-17 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-7411:
--

 Summary: [C++][Flight] Incorrect Arrow Flight benchmark output
 Key: ARROW-7411
 URL: https://issues.apache.org/jira/browse/ARROW-7411
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Benchmarking, C++, FlightRPC
Affects Versions: 0.15.1
 Environment: macOS
Reporter: Chengxin Ma
Assignee: Chengxin Ma
 Fix For: 1.0.0


When running the Arrow Flight benchmark in the following scenario, the output is 
incorrect. 
{code}
$ ./arrow-flight-perf-server &
[1] 12986
Server host: localhost
Server port: 31337
$ ./arrow-flight-benchmark -server_host localhost -test_put 
Using remote server: true
Testing method: DoPut
Server host: localhost
Server port: 31337
Bytes read: 128000
Nanos: 496372147
Speed: 2459.25 MB/s
{code}

{{Using remote server}} should be {{false}} and {{Bytes read}} should be 
{{Bytes write}}.

To correct the result of {{Using remote server}}, we can:

* Change {{if (FLAGS_server_host == "")}} to another condition which checks if 
there is already an {{arrow-flight-perf-server}} running. This is a bit 
complicated to do and might add unnecessary complexity (e.g. we would need to 
make sure it supports all OSes).

* Delete {{Using remote server}}, since we already have {{Server host}} in the 
output.

I personally prefer the second option and will make a PR.





[jira] [Created] (ARROW-7434) [GLib] Homebrew packages seem not working

2019-12-18 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-7434:
--

 Summary: [GLib] Homebrew packages seem not working
 Key: ARROW-7434
 URL: https://issues.apache.org/jira/browse/ARROW-7434
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Affects Versions: 0.15.1
 Environment: macOS 10.15.2
Reporter: Chengxin Ma


After installing {{apache-arrow}} and {{apache-arrow-glib}} via {{Homebrew}} 
according to the [Installation Guide|https://arrow.apache.org/install/], I 
wrote a very simple program to test if they were successfully installed.

{code}
$ cat hello_world.c
#include <stdio.h>

#include <arrow-glib/arrow-glib.h>

int main(int argc, char **argv) {
  printf("Hello, World! \n");
}
{code}

{{gcc}} gave the following error:

{code}
$ gcc -o hello_world hello_world.c
In file included from hello_world.c:3:
In file included from /usr/local/include/arrow-glib/arrow-glib.h:22:
/usr/local/include/arrow-glib/gobject-type.h:22:10: fatal error: 
'glib-object.h' file not found
#include <glib-object.h>
         ^~~~~~~~~~~~~~~
1 error generated.
{code}

Is there any step that I didn’t follow here?






[jira] [Created] (ARROW-7522) Broken Record Batch returned from a function call

2020-01-08 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-7522:
--

 Summary: Broken Record Batch returned from a function call
 Key: ARROW-7522
 URL: https://issues.apache.org/jira/browse/ARROW-7522
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, C++ - Plasma
Affects Versions: 0.15.1
 Environment: macOS
Reporter: Chengxin Ma


Scenario: retrieving Record Batch from Plasma with known Object ID.

The following code snippet works well:
{code:java}
int main(int argc, char **argv)
{
  plasma::ObjectID object_id =
      plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");

  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;

  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(
      &buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  std::cout << "record_batch->column_name(0): "
            << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): "
            << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): "
            << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): "
            << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): "
            << record_batch->column(0)->ToString() << std::endl;
}
{code}
{{record_batch->column(0)->ToString()}} causes a segmentation fault if 
retrieving the Record Batch is wrapped in a function:
{code:java}
std::shared_ptr<arrow::RecordBatch> GetRecordBatchFromPlasma(
    plasma::ObjectID object_id)
{
  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;

  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(
      &buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  // Disconnect the client.
  ARROW_CHECK_OK(client.Disconnect());

  return record_batch;
}

int main(int argc, char **argv)
{
  plasma::ObjectID object_id =
      plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");

  std::shared_ptr<arrow::RecordBatch> record_batch =
      GetRecordBatchFromPlasma(object_id);

  std::cout << "record_batch->column_name(0): "
            << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): "
            << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): "
            << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): "
            << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): "
            << record_batch->column(0)->ToString() << std::endl;
}
{code}
The meta info of the Record Batch such as number of columns and rows is still 
available, but I can't see the content of the columns.

{{lldb}} says that the stop reason is {{EXC_BAD_ACCESS}}, so I think the Record 
Batch is destroyed after {{GetRecordBatchFromPlasma}} finishes. But why can I 
still see the meta info of this Record Batch?
What is the proper way to get the Record Batch if we insist on using a function?
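The report does not confirm the cause. One plausible reading, stated here as an assumption, is that the column data are zero-copy views into memory owned by the client connection, while the metadata has been parsed into independent objects. Under that assumption, a general pattern is to copy the payload before releasing its owner. The sketch below illustrates only that pattern with the Python standard library; the anonymous `mmap` is a hypothetical stand-in for the store's memory and `read_from_store` is a name introduced here, not a Plasma or Arrow API.

```python
import mmap


def read_from_store():
    """Return (length, payload) that stay valid after the mapping is closed."""
    # Hypothetical stand-in for memory owned by a store client.
    store = mmap.mmap(-1, 16)
    store[:5] = b"hello"

    with memoryview(store) as view:  # zero-copy view into the mapping
        num_bytes = len(view[:5])    # "metadata": a plain int, owns no mapping memory
        payload = bytes(view[:5])    # explicit copy of the "column data"

    store.close()                    # analogous to disconnecting the client
    return num_bytes, payload


print(read_from_store())  # (5, b'hello')
```

The alternative design is to keep the owner alive for as long as the views are in use, for example by creating the client outside the function and passing it in.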


