Re: Arrow for low-latency streaming of small batches?

2020-06-25 Thread Wes McKinney
Is it feasible to preallocate the memory region where you are writing the record batch? On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn wrote: > > Hi, > > I am investigating Arrow for a project that needs to transfer records from > a producer to one or more consumers in small batches (median batch s

Error while selecting columns from hierarchical parquet file

2020-06-25 Thread Rafael Ladeira
Hi, Is it possible to read just selected columns from a dataframe with hierarchical column levels, passing a tuple to the 'columns' argument? Example: pd.read_parquet('file.parquet', engine='pyarrow', columns=[('level_0_key', 'level_1_key')]) Trying to accomplish this in version '0.17.1' of

[JavaScript] how to set column name after creation?

2020-06-25 Thread Ryan McKinley
Apologies if this is the wrong list or place to ask... What is the best way to update a column name for a Table in javascript? const col = table.getColumnAt(i); col.name = 'new name!' Currently: Cannot assign to 'name' because it is a read-only property Thanks! ryan

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-25 Thread Wes McKinney
I updated the PR to fix some issues with my edits that Antoine pointed out. I can start working on a C++ patch to implement the C++ changes in the next few days if that helps. Given the time urgency of deciding what to do on this, it would be helpful if anyone else could express opinions. I see one

Re: [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-25 Thread Sutou Kouhei
+1 (binding) In "[VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility" on Tue, 23 Jun 2020 13:35:04 -0500, Wes McKinney wrote: > Hi, > > As discussed on the mailing list [1] I would like to add a "bit width" > field to our Decimal metadata to allow for supporting dif

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread Radu Teodorescu
Understood and agreed. My proposal really addresses a number of mechanisms on layer 2 ("Virtual" tables) in your taxonomy (I can adjust interface names accordingly as part of the review process). One additional element I am proposing here is the ability to insert and modify rows in a vectorized

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread Wes McKinney
hi Radu, It's going to be challenging for me to review in detail until after the 1.0.0 release is out, but in general I think there are 3 layers that we need to be talking about: * Materialized in-memory tables * "Virtual" tables, whose in-memory/not-in-memory semantics are not exposed -- permitt

Arrow for low-latency streaming of small batches?

2020-06-25 Thread Chris Osborn
Hi, I am investigating Arrow for a project that needs to transfer records from a producer to one or more consumers in small batches (median batch size is 1) and with low latency. The usual structure for something like this would be a single producer multi-consumer queue*. Is there any sane way to

Re: [DISCUSS] Addition of a feature enum

2020-06-25 Thread David Li
Hey, Sorry for the delay - now that the enum values are power-of-two, I think this is fine for any hypothetical encoding for Flight. In particular, gRPC allows binary headers, so if we wanted to directly send a dummy schema that would be fine, or we could encode it as a bitfield. (There is some limit
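The bitfield encoding David mentions is why power-of-two values matter: distinct flags can be OR-ed into one integer without ambiguity. A minimal sketch, using a hypothetical Python mirror of the proposed feature enum (the member names are assumptions, not the final Schema.fbs contents):

```python
from enum import IntFlag

# Hypothetical mirror of the proposed feature enum; power-of-two
# values let several features be combined into a single bitfield.
class Feature(IntFlag):
    UNUSED = 0
    DICTIONARY_REPLACEMENT = 1
    COMPRESSED_BODY = 2

# Combine features with bitwise OR, test membership with `in`.
flags = Feature.DICTIONARY_REPLACEMENT | Feature.COMPRESSED_BODY
```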

Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread Radu Teodorescu
Here it is as a pull request: https://github.com/apache/arrow/pull/7548 I hope this can be a starter for an active conversation diving into specifics, and I look forward to contributing more design and algorithm ideas as well as concrete code. > O

Re: [DISCUSS] Addition of a feature enum

2020-06-25 Thread Antoine Pitrou
I would be mostly interested in feedback by David Li and other Flight developers, otherwise it's fine to me. Regards Antoine. Le 25/06/2020 à 05:12, Micah Kornfield a écrit : > I've updated the PR. More feedback welcome, I'd like to start a vote by > end-of-week if possible. > > On Wed, Jun

Re: [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-25 Thread Bryan Cutler
+1 On Wed, Jun 24, 2020, 10:38 AM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > +1 (binding) >

Re: [DISCUSS] Incrementing Arrow MetadataVersion from V4 to V5 for 1.0.0 release

2020-06-25 Thread Wes McKinney
On Thu, Jun 25, 2020 at 5:31 AM Antoine Pitrou wrote: > > > Le 25/06/2020 à 12:18, Antoine Pitrou a écrit : > > > > Le 25/06/2020 à 00:40, Wes McKinney a écrit : > >> hi folks, > >> > >> This has come up in some other contexts, but I believe it would be a > >> good idea to increment the version nu

Re: [DISCUSS] Incrementing Arrow MetadataVersion from V4 to V5 for 1.0.0 release

2020-06-25 Thread Antoine Pitrou
Le 25/06/2020 à 12:18, Antoine Pitrou a écrit : > > Le 25/06/2020 à 00:40, Wes McKinney a écrit : >> hi folks, >> >> This has come up in some other contexts, but I believe it would be a >> good idea to increment the version number in Schema.fbs starting with >> 1.0.0 to separate the pre-1.0 and

Re: [DISCUSS] Incrementing Arrow MetadataVersion from V4 to V5 for 1.0.0 release

2020-06-25 Thread Antoine Pitrou
Le 25/06/2020 à 00:40, Wes McKinney a écrit : > hi folks, > > This has come up in some other contexts, but I believe it would be a > good idea to increment the version number in Schema.fbs starting with > 1.0.0 to separate the pre-1.0 and post-1.0 worlds > > https://github.com/apache/arrow/blob

[NIGHTLY] Arrow Build Report for Job nightly-2020-06-25-0

2020-06-25 Thread Crossbow
Arrow Build Report for Job nightly-2020-06-25-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-25-0 Failed Tasks: - debian-stretch-arm64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-25-0-travis-debian-stretch-arm64 - tes

Re: Proposal for the plugin API to support user customized compression codec

2020-06-25 Thread Antoine Pitrou
What is the performance of, say, HW GZip against SW ZSTD? Regards Antoine. On Thu, 25 Jun 2020 07:06:58 + "Xu, Cheng A" wrote: > Thanks Micah and Wes for the reply. W.r.t. the scope, we're working > together with the Parquet community to refine our proposal. > https://www.mail-a

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-25 Thread Antoine Pitrou
Le 25/06/2020 à 00:02, Wes McKinney a écrit : > hi folks, > > (cross-posting to dev@arrow and dev@parquet since there are > stakeholders in both places) > > It seems there are still problems at least with the C++ implementation > of LZ4 compression in Parquet files > > https://issues.apache.or

Re: Proposal for the plugin API to support user customized compression codec

2020-06-25 Thread Micah Kornfield
Hi Cheng Xu, > Since Arrow is more in memory format mostly for intermediate data, I would > expect less consideration in backward compatibility different from on-disk > Parquet format. 1. The Arrow file format is not ephemeral and now supports compressed buffers. 2. Even with other parts of Ar

RE: Proposal for the plugin API to support user customized compression codec

2020-06-25 Thread Xu, Cheng A
Thanks Micah and Wes for the reply. W.r.t. the scope, we're working together with the Parquet community to refine our proposal. https://www.mail-archive.com/dev@parquet.apache.org/msg12463.html This proposal is more general and applies to Arrow (indeed it can be used by native Parquet as well).