Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Paul Taylor
Responding to this comment from GitHub[1]: If we had to make a bet about what % of dictionaries empirically are between 128 and 255 elements, I would bet that the percentage is small. If it turned out that 40% of dictionaries fell in that range then I would agree that this makes sense. I agr
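The size range under discussion can be made concrete with a small sketch. It is illustrative only and not taken from any Arrow implementation: `smallest_index_type` and `INDEX_TYPES` are hypothetical names, and the only assumption is that a dictionary with N entries uses the 0-based indices 0 through N-1. Under that assumption, unsigned 8-bit indices save a byte per value only when the largest index falls between 128 and 255.

```python
# Hypothetical helper, not an Arrow API: pick the narrowest integer type
# whose maximum value can hold the largest 0-based dictionary index.
# Types are listed narrowest first; at equal width the signed type comes
# first, so it is preferred whenever it is large enough.
INDEX_TYPES = [
    ("int8", 2**7 - 1), ("uint8", 2**8 - 1),
    ("int16", 2**15 - 1), ("uint16", 2**16 - 1),
    ("int32", 2**31 - 1), ("uint32", 2**32 - 1),
    ("int64", 2**63 - 1), ("uint64", 2**64 - 1),
]


def smallest_index_type(dictionary_size, allow_unsigned=True):
    """Return the name of the narrowest index type for a dictionary."""
    if dictionary_size < 0:
        raise ValueError("dictionary_size must be non-negative")
    largest_index = max(dictionary_size - 1, 0)
    for name, limit in INDEX_TYPES:
        if name.startswith("u") and not allow_unsigned:
            continue  # signed-only mode skips the unsigned candidates
        if largest_index <= limit:
            return name
    raise OverflowError("dictionary too large for a 64-bit index")
```

With signed indices only, a 200-entry dictionary needs a 16-bit index; allowing unsigned types lets it fit in 8 bits. Outside that narrow band the signed and unsigned choices have the same width.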

Re: [JavaScript] how to set column name after creation?

2020-06-26 Thread Ryan McKinley
excellent thank you: https://github.com/grafana/grafana/pull/25871/files#diff-ad1ca51923ba5a2e01652c1686ae6797R169 On Fri, Jun 26, 2020 at 7:54 AM Brian Hulette wrote: > Hi Ryan, > Here or user@arrow.apache.org is a fine place to ask :) > > The metadata on Table/Column/Field objects are all immu

[RESULT] [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-26 Thread Wes McKinney
The vote carries with 5 binding +1 votes and 1 non-binding +1. I will merge the change and open some JIRAs about the reference implementations adding forward compatibility checks that the bit width they receive is either null or 128 On Thu, Jun 25, 2020 at 4:02 PM Sutou Kouhei wrote: > > +1 (bind

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Wes McKinney
I think that situations where you need uint64 indices are likely to be exceedingly esoteric. I would recommend that the specification advise against use of 64-bit indices at all unless they are actually needed to represent the data (i.e. dictionaries have more than INT32_MAX / UINT32_MAX elements,

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Paul Taylor
If positive integers are expected, I'm in favor of supporting unsigned index types. I was surprised by the Arrow C++ restriction to signed indices in the RAPIDS thread; perhaps it's newer than when I ported the logic in JS. Based on the flatbuffer schemas, dictionary indices could technically be a

Re: Deep copy for ArrayData,Array, Table in C++ API

2020-06-26 Thread Antoine Pitrou
On Fri, 26 Jun 2020 13:56:26 -0400 Radu Teodorescu wrote: > Looks like Concatenate is my best bet if I am looking at putting together > ranges, certainly doesn’t look as neatly packaged as Take, but this might be > the right tool for this job. Yes, you could Slice the array and then Concatena
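The slice-then-concatenate idea can be sketched without Arrow at all. The example below is a stand-in under stated assumptions: plain Python lists play the role of Arrow arrays, and `take_ranges` and `take_indices` are hypothetical helpers, not Arrow functions. It contrasts copying whole row ranges with materializing one index per row, the approach the original question found unwieldy.

```python
def take_ranges(values, ranges):
    """Copy half-open [start, stop) row ranges into one contiguous list."""
    out = []
    for start, stop in ranges:
        if not 0 <= start <= stop <= len(values):
            raise IndexError(f"range ({start}, {stop}) is out of bounds")
        # Slice one range, then append it to the result (the concatenation).
        out.extend(values[start:stop])
    return out


def take_indices(values, indices):
    """Index-array alternative: one explicit index per selected row."""
    return [values[i] for i in indices]


rows = list(range(6000))
wanted = [(1000, 2000), (4000, 5000)]

by_ranges = take_ranges(rows, wanted)
# The same selection expressed as a 2000-element index array.
by_indices = take_indices(
    rows, [i for start, stop in wanted for i in range(start, stop)]
)
```

Both calls select the same rows, but the range version only needs two (start, stop) pairs as input. Python list slices copy their elements, so this sketch shows the shape of the approach rather than its memory behavior.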

Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Micah Kornfield
I think in the interest of not having the spec fork we should probably do this. It is partially our fault for not providing better documentation in Schema.fbs (and potentially more thorough integration tests). Maybe we should explicitly disallow uint64 which provides the biggest headache for the

Re: [DISCUSS] Incrementing Arrow MetadataVersion from V4 to V5 for 1.0.0 release

2020-06-26 Thread Micah Kornfield
I agree I think we have to do this given the number of changes in flight (especially union types). On Fri, Jun 26, 2020 at 7:29 AM Wes McKinney wrote: > I created a JIRA about this > > https://issues.apache.org/jira/browse/ARROW-9231 > > This issue is quite important so please take a look. > > O

Re: [JavaScript] how to set column name after creation?

2020-06-26 Thread Paul Taylor
You can also use the `Field.prototype.clone()` method[1] like this to further reduce the boilerplate: function renameColumn(col, new_name) { return Column.new(col.field.clone(new_name), col.chunks); } 1. https://github.com/apache/arrow/blob/master/js/src/schema.ts#L139-L146 On 6/26/20 7:54

Re: Deep copy for ArrayData,Array, Table in C++ API

2020-06-26 Thread Radu Teodorescu
Looks like Concatenate is my best bet if I am looking at putting together ranges, certainly doesn’t look as neatly packaged as Take, but this might be the right tool for this job. > On Jun 26, 2020, at 1:01 PM, Radu Teodorescu > wrote: > > That is fabulous and pretty much it! > Follow up qu

Re: Deep copy for ArrayData,Array, Table in C++ API

2020-06-26 Thread Radu Teodorescu
That is fabulous and pretty much it! Follow up questions: 1. Is there any efficient way to refer to ranges: say I want to take rows 1000-2000 and 4000-5000, feels unwieldy to have to create an index array of 2000 elements and then also the underlying implementation would be less efficient having

Re: Deep copy for ArrayData,Array, Table in C++ API

2020-06-26 Thread Micah Kornfield
This sounds like the Take kernel? On Friday, June 26, 2020, Radu Teodorescu wrote: > (Lightweight topic this time) > Are there any existing functions for deep copying Array, ArrayData or Table > objects in the C++ API? > Ultimately, I am trying to get a bunch of sparse row ranges from a ranges >

Deep copy for ArrayData,Array, Table in C++ API

2020-06-26 Thread Radu Teodorescu
(Lightweight topic this time) Are there any existing functions for deep copying Array, ArrayData or Table objects in the C++ API? Ultimately, I am trying to get a bunch of sparse row ranges from a ranges into a contiguous new Table - I can see how I can copy Buffer and I can implement it all myse

Re: optimal way to store historical data

2020-06-26 Thread anthony . abate
Also, let me clarify so there is no confusion - There should be no problem creating static / read-only Arrow data files with a 'date to batch' index in the manner I described. The problem I am referring to only becomes an issue if you need to append a new batch on a daily basis -Anthony On Fri

Re: optimal way to store historical data

2020-06-26 Thread anthony . abate
+1 to this. There is a logical way to do this now: if you create a batch per day, you can maintain a separate Arrow file (an index) to map the date to the batch. We do this for indexing via other keys, and I can say it works well for 'large' files (25 GB+). I think unfortunately, doing this via the c
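A minimal sketch of the date-to-batch index described in this message, assuming one record batch per day written in order. A plain dictionary stands in for the separate index file, and `build_date_index` and `batches_for_range` are hypothetical names rather than Arrow APIs.

```python
import datetime as dt


def build_date_index(batch_dates):
    """Map each day to the position of its record batch in the data file."""
    index = {}
    for batch_number, day in enumerate(batch_dates):
        if day in index:
            raise ValueError(f"more than one batch for {day}")
        index[day] = batch_number
    return index


def batches_for_range(index, first_day, last_day):
    """Return the batch numbers covering an inclusive date range."""
    return sorted(
        number for day, number in index.items() if first_day <= day <= last_day
    )


# One batch per day for the last week of June 2020.
days = [dt.date(2020, 6, 22) + dt.timedelta(days=i) for i in range(7)]
date_index = build_date_index(days)
selected = batches_for_range(
    date_index, dt.date(2020, 6, 24), dt.date(2020, 6, 26)
)
```

A reader would then load only the selected batches from the data file. Appending a new day means adding one entry to the index as well as one batch to the data, which is where the follow-up message locates the difficulty.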

Re: optimal way to store historical data

2020-06-26 Thread Dachuan Zhao
+1 Is the dataset the model for that? On Fri, Jun 26, 2020 at 11:42 AM Kirill Lykov wrote: > Hi, > > I wonder what is the best way to represent time series in the arrow. > Maybe someone did a research already about different ways of > representing these data? Or there is a ready-to-use solution

optimal way to store historical data

2020-06-26 Thread Kirill Lykov
Hi, I wonder what is the best way to represent time series in Arrow. Maybe someone has already researched different ways of representing this data? Or is there a ready-to-use solution inside the library? Basically, I need a third dimension to the table which is time. One of the solutio

Re: [JavaScript] how to set column name after creation?

2020-06-26 Thread Brian Hulette
Hi Ryan, Here or user@arrow.apache.org is a fine place to ask :) The metadata on Table/Column/Field objects are all immutable, so doing this right now would require creating a new instance of Table with the field renamed, which takes quite a lot of boilerplate. A helper for renaming a column (or ev

Re: [DISCUSS] Incrementing Arrow MetadataVersion from V4 to V5 for 1.0.0 release

2020-06-26 Thread Wes McKinney
I created a JIRA about this https://issues.apache.org/jira/browse/ARROW-9231 This issue is quite important so please take a look. On Thu, Jun 25, 2020 at 8:53 AM Wes McKinney wrote: > > On Thu, Jun 25, 2020 at 5:31 AM Antoine Pitrou wrote: > > > > > > Le 25/06/2020 à 12:18, Antoine Pitrou a éc

Re: Arrow for low-latency streaming of small batches?

2020-06-26 Thread Chris Osborn
Yes, it would be quite feasible to preallocate a region large enough for several thousand rows for each column, assuming I read from that region while it's still filling in. When that region is full, I could either allocate a new big chunk or loop around if I no longer need the data. I'm now doi
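The "loop around" option mentioned here can be sketched as a fixed-capacity ring over one preallocated column. This is an illustration under stated assumptions, not Arrow code: `ColumnRing` is a hypothetical class, the standard-library `array` module stands in for a preallocated buffer, and the sketch is single-threaded, so it ignores any synchronization a concurrent reader would need.

```python
from array import array


class ColumnRing:
    """Fixed-capacity column that overwrites its oldest rows when full."""

    def __init__(self, capacity, typecode="d"):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # Allocate the whole region once, up front.
        self._buf = array(typecode, [0]) * capacity
        self._capacity = capacity
        self._written = 0  # total rows ever appended

    def append(self, value):
        # Wrap around to the start once the region is full.
        self._buf[self._written % self._capacity] = value
        self._written += 1

    def latest(self, n):
        """Return up to n of the most recent rows, oldest first."""
        n = min(n, self._written, self._capacity)
        start = self._written - n
        return [self._buf[i % self._capacity] for i in range(start, self._written)]


ring = ColumnRing(capacity=4)
for sample in range(6):
    ring.append(float(sample))
```

After six appends into four slots, the two oldest rows have been overwritten and `ring.latest(3)` returns the last three samples. Allocating a fresh chunk instead of wrapping would keep the old rows at the cost of more memory.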

[Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Wes McKinney
hi folks, At the moment, using unsigned integers for dictionary indices/codes isn't exactly forbidden by the metadata [1], which says that the indices must be "positive integers". Meanwhile, the columnar format specification says "When a field is dictionary encoded, the values are represented by

[NIGHTLY] Arrow Build Report for Job nightly-2020-06-26-0

2020-06-26 Thread Crossbow
Arrow Build Report for Job nightly-2020-06-26-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-26-0 Failed Tasks: - centos-7-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-26-0-travis-centos-7-aarch64 - debian-bust