Re: Contributing to Arrow

2020-04-24 Thread karuppayya
Hi Micah, Thanks for letting me know. I will ping Ji Liu on the jira, and see how I can help with the jira issue. Thanks Karuppayya On Fri, 24 Apr 2020, 21:05 Micah Kornfield, wrote: > Hi Karuppayya, > Welcome! > > The only issue I can think of off the top of my head on the Java side that > i

[jira] [Created] (ARROW-8592) [C++] Docs still list LLVM 7 as compiler used

2020-04-24 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-8592: -- Summary: [C++] Docs still list LLVM 7 as compiler used Key: ARROW-8592 URL: https://issues.apache.org/jira/browse/ARROW-8592 Project: Apache Arrow Issue

Re: Contributing to Arrow

2020-04-24 Thread Micah Kornfield
Hi Karuppayya, Welcome! The only issue I can think of off the top of my head on the Java side that is on the basic side is: https://issues.apache.org/jira/browse/ARROW-6931 I'm not sure if Ji Liu is planning on working on it, you might ping Ji Liu on the JIRA and see if you can help out. In parti

Re: Strategy for Writing a Large Table?

2020-04-24 Thread Hei Chan
Hi Wes, Thanks for your pointers. It seems that, to skip pandas as an intermediary, I can only construct pyarrow.RecordBatch from pyarrow.Array or pyarrow.StructArray: https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html And StructArray.from_pandas()'s description states, "Convert

Re: Python is there support for extension types in Parquet?

2020-04-24 Thread Bryan Cutler
Thanks for the tips Micah and Wes. The storage type is an int64 list, which works in a roundtrip for parquet by itself. I'll look into it a bit more to see what is going on. On Fri, Apr 24, 2020 at 11:50 AM Wes McKinney wrote: > Extension types will round trip correctly through Parquet so long a

[jira] [Created] (ARROW-8591) [Rust] Reverse lookup for a key in DictionaryArray

2020-04-24 Thread Mahmut Bulut (Jira)
Mahmut Bulut created ARROW-8591: --- Summary: [Rust] Reverse lookup for a key in DictionaryArray Key: ARROW-8591 URL: https://issues.apache.org/jira/browse/ARROW-8591 Project: Apache Arrow Issue T

[jira] [Created] (ARROW-8590) [Rust] Use Arrow pretty print utility in DataFusion

2020-04-24 Thread Mark Hildreth (Jira)
Mark Hildreth created ARROW-8590: Summary: [Rust] Use Arrow pretty print utility in DataFusion Key: ARROW-8590 URL: https://issues.apache.org/jira/browse/ARROW-8590 Project: Apache Arrow Issu

Re: Question regarding Arrow Flight Throughput

2020-04-24 Thread Wes McKinney
gRPC breaks large buffers into smaller pieces that have to be reassembled after receipt -- this does add some overhead. I would guess that circumventing gRPC for the transfer of each IPC message would be the route to throughput beyond the 20-40 Gbps that we're able to achieve now. On Fri, Apr 24,
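To make the overhead mentioned above concrete, here is a minimal sketch of splitting one large buffer into pieces and copying them back into a contiguous buffer on the receiving side. It does not reproduce gRPC's actual framing: the piece size and both helper names are hypothetical and chosen only for illustration.

```python
PIECE_SIZE = 16 * 1024  # hypothetical piece size, chosen only for illustration


def split_into_pieces(payload, piece_size=PIECE_SIZE):
    """Slice one large buffer into views of at most piece_size bytes (no copy)."""
    view = memoryview(payload)
    return [view[i:i + piece_size] for i in range(0, len(view), piece_size)]


def reassemble(pieces):
    """Copy every piece into one contiguous buffer; this copy is the extra work."""
    out = bytearray(sum(len(p) for p in pieces))
    offset = 0
    for piece in pieces:
        out[offset:offset + len(piece)] = piece
        offset += len(piece)
    return out


payload = bytes(range(256)) * 400  # 102,400 bytes of sample data
pieces = split_into_pieces(payload)
restored = reassemble(pieces)
```

The receiver touches every byte once more than a transport that could hand over the original buffer directly, which is the kind of cost the message above attributes to reassembly.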

Re: Question regarding Arrow Flight Throughput

2020-04-24 Thread Antoine Pitrou
I'm not sure a new transport for gRPC would change anything. gRPC currently uses HTTP (HTTP/2 I believe), and there's no reason for HTTP to be the culprit here. Regards Antoine. On 24/04/2020 at 20:48, Micah Kornfield wrote: > A couple of questions: > 1. For same node transport would doing

Re: Python is there support for extension types in Parquet?

2020-04-24 Thread Wes McKinney
Extension types will round trip correctly through Parquet so long as the storage type can be roundtripped (as Micah pointed out support for reading all nested types is not yet available). Note for reinforcement that Feather V2 is exactly an Arrow IPC file -- so IPC files could already do this prio

Re: Question regarding Arrow Flight Throughput

2020-04-24 Thread Micah Kornfield
A couple of questions: 1. For same node transport would doing something with Plasma be a reasonable approach? 2. What are the advantages/disadvantages of creating a new transport for gRPC [1] vs building an entirely new backend for Flight? Thanks, Micah [1] https://github.com/grpc/grpc/issues/79

[jira] [Created] (ARROW-8589) ModuleNotFoundError: No module named 'pyarrow._orc'

2020-04-24 Thread ryan (Jira)
ryan created ARROW-8589: --- Summary: ModuleNotFoundError: No module named 'pyarrow._orc' Key: ARROW-8589 URL: https://issues.apache.org/jira/browse/ARROW-8589 Project: Apache Arrow Issue Type: Bug

Re: Question regarding Arrow Flight Throughput

2020-04-24 Thread David Li
Having alternative backends for Flight has been a goal from the start, hence why gRPC is wrapped and generally not exposed to the user. I would be interested in collaborating on an HTTP/1 backend that is accessible from the browser (or via an alternative transport meeting the same requirements, e.g

[jira] [Created] (ARROW-8588) `driver` param removed from `hdfs.connect()`

2020-04-24 Thread Jack Fan (Jira)
Jack Fan created ARROW-8588: --- Summary: `driver` param removed from `hdfs.connect()` Key: ARROW-8588 URL: https://issues.apache.org/jira/browse/ARROW-8588 Project: Apache Arrow Issue Type: Bug

Re: Python is there support for extension types in Parquet?

2020-04-24 Thread Micah Kornfield
Hi Bryan, Extension types aren't explicitly called out, but https://issues.apache.org/jira/browse/ARROW-1644 (and related subtasks) might be a good place to track this. Thanks, Micah On Fri, Apr 24, 2020 at 11:13 AM Bryan Cutler wrote: > I've been trying out IO with Arrow's extension types and I

Python is there support for extension types in Parquet?

2020-04-24 Thread Bryan Cutler
I've been trying out IO with Arrow's extension types and I was able to write a Parquet file, but reading it back causes an error: "pyarrow.lib.ArrowInvalid: Unsupported nested type: ...". Looking at the code for the Parquet reader, it checks nested types and only allows a few specific ones. Is this a k

Re: Question regarding Arrow Flight Throughput

2020-04-24 Thread Antoine Pitrou
Hi Jiajia, I see. I think there are two possible avenues to try and improve this: * better use gRPC in the hope of achieving higher performance. This doesn't seem to be easy, though. I've already tried to change some of the parameters listed here, but didn't get any benefits: https://grpc.gi

RE: Question regarding Arrow Flight Throughput

2020-04-24 Thread Li, Jiajia
Hi Antoine, >The question, though, is: do you *need* those higher speeds on localhost? In >which context are you considering Flight? We want to send large data (in cache) to a local data analytics application. Thanks, Jiajia -Original Message- From: Antoine Pitrou Sent: Saturday

Re: Question regarding Arrow Flight Throughput

2020-04-24 Thread Antoine Pitrou
Hi Jiajia, It's true one should be able to reach higher speeds. For example, I can reach more than 7 GB/s on a simple TCP connection, in pure Python, using only two threads: https://gist.github.com/pitrou/6cdf7bf6ce7a35f4073a7820a891f78e The question, though, is: do you *need* those higher spe
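For readers who want to try a measurement of this kind, here is a minimal sketch of timing a plain TCP transfer on localhost with two threads, one sending and one receiving. It is not the code from the gist linked above; the chunk size, total volume, and function names are assumptions made for illustration, and the resulting number depends heavily on the machine.

```python
import socket
import threading
import time

CHUNK = bytes(1 << 20)  # 1 MiB of zeros per send
N_CHUNKS = 64           # 64 MiB total; raise this for a steadier measurement


def sender(server_sock):
    """Accept one connection and push N_CHUNKS chunks through it."""
    conn, _ = server_sock.accept()
    with conn:
        for _ in range(N_CHUNKS):
            conn.sendall(CHUNK)


def measure_localhost_throughput():
    """Return (bytes_received, seconds) for one localhost TCP transfer."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        thread = threading.Thread(target=sender, args=(server,))
        thread.start()
        received = 0
        buf = bytearray(1 << 20)  # reused receive buffer, avoids per-read allocation
        with socket.create_connection(("127.0.0.1", port)) as client:
            start = time.perf_counter()
            while True:
                n = client.recv_into(buf)
                if n == 0:  # sender closed the connection
                    break
                received += n
            elapsed = time.perf_counter() - start
        thread.join()
    return received, elapsed


received, elapsed = measure_localhost_throughput()
print(f"{received / elapsed / 1e9:.2f} GB/s")
```

Receiving into a preallocated buffer with `recv_into` keeps the loop from allocating a new bytes object on every read, which matters at these transfer rates.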

RE: Question regarding Arrow Flight Throughput

2020-04-24 Thread Li, Jiajia
Hi Antoine, I think the 5 GB/s here is on localhost. Since localhost does not depend on network speed, and I've checked that the CPU is not the bottleneck when running the benchmark, I think Flight can achieve higher throughput. Thanks, Jiajia -Original Message- From: Antoine Pitrou Sent: Friday, April

[jira] [Created] (ARROW-8587) Compilation error when linking arrow-flight-perf-server

2020-04-24 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-8587: -- Summary: Compilation error when linking arrow-flight-perf-server Key: ARROW-8587 URL: https://issues.apache.org/jira/browse/ARROW-8587 Project: Apache Arrow Issu

Re: Strategy for Writing a Large Table?

2020-04-24 Thread Wes McKinney
I recommend going directly via Arrow instead of routing through pandas (or at least only using pandas as an intermediary to convert smaller chunks to Arrow). Tables can be composed from smaller RecordBatch objects (see Table.from_batches) so you don't need to accumulate much non-Arrow data in memor

Strategy for Writing a Large Table?

2020-04-24 Thread Hei Chan
Hi, I am new to Arrow and Parquet. My goal is to decode a 4 GB binary file (packed C structs) and write all records to a file that can be loaded as an R data frame or a pandas DataFrame, so that others can do heavy analysis on the big dataset efficiently (in terms of loading time and running statistic
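As a minimal sketch of the decoding step described above, the following reads fixed-size packed records in bounded chunks with Python's standard `struct` module, so the whole file never has to sit in memory at once. The record layout (`<qdi`: an int64, a float64, and an int32), the field names, and the `iter_record_chunks` helper are hypothetical; the original message does not give the actual struct definition.

```python
import io
import struct

# Hypothetical packed layout: int64 id, float64 price, int32 qty (little-endian, no padding).
RECORD = struct.Struct("<qdi")
FIELDS = ("id", "price", "qty")


def iter_record_chunks(stream, records_per_chunk=1024):
    """Yield dicts of column lists, each covering at most records_per_chunk records."""
    chunk_bytes = RECORD.size * records_per_chunk
    while True:
        buf = stream.read(chunk_bytes)
        if not buf:
            break
        if len(buf) % RECORD.size:
            raise ValueError("trailing partial record in input")
        columns = {name: [] for name in FIELDS}
        for row in RECORD.iter_unpack(buf):
            for name, value in zip(FIELDS, row):
                columns[name].append(value)
        yield columns


# Usage: decode five in-memory records, two at a time.
raw = b"".join(RECORD.pack(i, i * 0.5, i * 10) for i in range(5))
chunks = list(iter_record_chunks(io.BytesIO(raw), records_per_chunk=2))
```

Each yielded chunk is column-oriented, so it can be handed to whatever columnar writer is chosen without first accumulating every record.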

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

2020-04-24 Thread Francois Saint-Jacques
+1 (binding) On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs wrote: > > +1 (binding) > > On 2020. Apr 24., Fri at 1:51, Micah Kornfield > wrote: > > > +1 (binding) > > > > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei wrote: > > > > > +1 (binding) > > > > > > In > > > "[VOTE] Add "trivial" Re

[jira] [Created] (ARROW-8586) Failed to Install arrow From CRAN

2020-04-24 Thread Hei (Jira)
Hei created ARROW-8586: -- Summary: Failed to Install arrow From CRAN Key: ARROW-8586 URL: https://issues.apache.org/jira/browse/ARROW-8586 Project: Apache Arrow Issue Type: Bug Components: R

Re: Contributing to Arrow

2020-04-24 Thread Antoine Pitrou
Hi, On 24/04/2020 at 01:36, karuppayya wrote: > Hi All, > I am interested in contributing to the Arrow project > > I am planning to start with some jiras on Arrow Java component. > I tried looking for jiras with component *Java* and labels *beginner*, > *beginners*, *newbie.* We're not using the

[jira] [Created] (ARROW-8585) [Packaging][Python] Windows wheels fail to build because of link error

2020-04-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8585: -- Summary: [Packaging][Python] Windows wheels fail to build because of link error Key: ARROW-8585 URL: https://issues.apache.org/jira/browse/ARROW-8585 Project: Apa

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-04-24-0

2020-04-24 Thread Krisztián Szűcs
On Fri, Apr 24, 2020 at 12:07 PM Crossbow wrote: > > > Arrow Build Report for Job nightly-2020-04-24-0 > > All tasks: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-24-0 > > Failed Tasks: > - debian-stretch-amd64: > URL: > https://github.com/ursa-labs/crossbow/branc

[jira] [Created] (ARROW-8584) [Packaging][C++] Protobuf link error in debian-stretch build

2020-04-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8584: -- Summary: [Packaging][C++] Protobuf link error in debian-stretch build Key: ARROW-8584 URL: https://issues.apache.org/jira/browse/ARROW-8584 Project: Apache Arrow

[jira] [Created] (ARROW-8583) [C++][Doc] Undocumented parameter in Dataset namespace

2020-04-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8583: -- Summary: [C++][Doc] Undocumented parameter in Dataset namespace Key: ARROW-8583 URL: https://issues.apache.org/jira/browse/ARROW-8583 Project: Apache Arrow

[jira] [Created] (ARROW-8582) [Packaging][Python] macOS wheels occasionally exceed travis build time limit

2020-04-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8582: -- Summary: [Packaging][Python] macOS wheels occasionally exceed travis build time limit Key: ARROW-8582 URL: https://issues.apache.org/jira/browse/ARROW-8582 Projec

[NIGHTLY] Arrow Build Report for Job nightly-2020-04-24-0

2020-04-24 Thread Crossbow
Arrow Build Report for Job nightly-2020-04-24-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-24-0 Failed Tasks: - debian-stretch-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-24-0-github-debian-stretch-amd64 - tes

RE: Question regarding Arrow Flight Throughput

2020-04-24 Thread Li, Jiajia
Hi Wes, Thanks for your reply! Thanks, Jiajia -Original Message- From: Wes McKinney Sent: Friday, April 24, 2020 11:15 AM To: dev Subject: Re: Question regarding Arrow Flight Throughput On Thu, Apr 23, 2020 at 10:02 PM Wes McKinney wrote: > > hi Jiajia, > > See my TODO here > > htt

Re: Question regarding Arrow Flight Throughput

2020-04-24 Thread Antoine Pitrou
The problem with gRPC is that it was designed with relatively small requests and payloads in mind. We're using it for a large data application which it wasn't optimized for. Also, its threading model is inscrutable (yielding those weird benchmark results). However, 5 GB/s is indeed very good i

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

2020-04-24 Thread Krisztián Szűcs
+1 (binding) On 2020. Apr 24., Fri at 1:51, Micah Kornfield wrote: > +1 (binding) > > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei wrote: > > > +1 (binding) > > > > In > > "[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC > > protocol" on Wed, 22 Apr 2020 19:24:09 -0500, > >

[jira] [Created] (ARROW-8581) [C#] Date32/64Array write & read back introduces off-by-one error

2020-04-24 Thread Adam Szmigin (Jira)
Adam Szmigin created ARROW-8581: --- Summary: [C#] Date32/64Array write & read back introduces off-by-one error Key: ARROW-8581 URL: https://issues.apache.org/jira/browse/ARROW-8581 Project: Apache Arrow