Re: [VOTE] Accept donation of Rust Ballista project

2021-03-23 Thread Francois Saint-Jacques
+1 On Mon, Mar 22, 2021 at 8:33 AM Andrew Lamb wrote: > > +1 > > On Sun, Mar 21, 2021 at 7:08 PM paddy horan wrote: > > > +1 (non-binding) > > > > > > > > From: Sutou Kouhei > > Sent: Sunday, March 21, 2021 4:34:43 PM > > To: dev@arrow.apache.org > > Subject: R

Re: RE: [Go] expose ability to write arrow.Table to JSON

2021-04-23 Thread Francois Saint-Jacques
You can either use the provided server facility found in flight [1], or use stream directly via ipc [2]. You can look at the tests on how to use both facilities. François [1] https://github.com/apache/arrow/tree/master/go/arrow/flight [2] https://github.com/apache/arrow/tree/master/go/arrow/ipc

Re: Join operation on attributes from arrow structs

2020-04-02 Thread Francois Saint-Jacques
They're mapped with the StructType/StructArray, which is also a columnar representation, e.g. one buffer per field in the sub-object. If you have varying/incompatible types, a field will be promoted to a UnionType. François On Thu, Apr 2, 2020 at 12:54 AM Micah Kornfield wrote: > > Hi Hasara, > Th
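The struct layout mentioned in this reply can be sketched in plain Python, without any Arrow dependency. The helper name `to_struct_columns` and the sample records are hypothetical, and Python lists only stand in for the typed buffers and validity bitmap a real struct array would carry.

```python
def to_struct_columns(records, field_names):
    """Transpose row-wise records into one child column per field.

    Returns (validity, children): validity[i] is False when record i is
    None, and children[name] holds the values of that field for all rows.
    """
    validity = [rec is not None for rec in records]
    children = {
        name: [rec.get(name) if rec is not None else None for rec in records]
        for name in field_names
    }
    return validity, children


# Hypothetical sample data: the second record is a null struct.
records = [{"x": 1, "y": "a"}, None, {"x": 3, "y": "c"}]
validity, children = to_struct_columns(records, ["x", "y"])
```

Each field ends up in its own contiguous column, so a scan over `x` never touches the values of `y`.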

Re: Attn: Wes, Re: Masked Arrays

2020-04-06 Thread Francois Saint-Jacques
It does make sense, I would go a little further and make this field/property a single value of the same type as the array. This would allow using any arbitrary sentinel value for unknown values (0 in your suggested case). The end result is zero-copy for R bindings (if stars are aligned). I create
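To illustrate the sentinel idea, here is a minimal sketch, assuming a flat integer array and a single sentinel value of the same type. The function names are hypothetical. The values are left untouched (which is what makes zero-copy possible), and a validity bitmap in Arrow's least-significant-bit-first order is derived only when a consumer asks for one.

```python
def sentinel_to_bitmap(values, sentinel):
    """Build an LSB-first validity bitmap: bit i is 1 when values[i] is valid."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value != sentinel:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def null_count(values, sentinel):
    """Count the slots that hold the sentinel, i.e. the unknown values."""
    return sum(1 for value in values if value == sentinel)


# Hypothetical data: 0 marks an unknown value, as in the suggested case.
values = [5, 0, 7, 0, 9]
bitmap = sentinel_to_bitmap(values, sentinel=0)
```

The trade-off is that the sentinel can no longer be used as a legitimate value in that array.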

Re: [VOTE] Release Apache Arrow 0.17.0 - RC0

2020-04-17 Thread Francois Saint-Jacques
+1 (binding) Verified all sources locally on Ubuntu 18.04 (including Javascript). Verified the binaries, wheels verification matches the one found in https://github.com/apache/arrow/pull/6961 François On Fri, Apr 17, 2020 at 8:12 AM Antoine Pitrou wrote: > > > Hi, > > I tested the sources on Ub

Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

2020-04-24 Thread Francois Saint-Jacques
+1 (binding) On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs wrote: > > +1 (binding) > > On 2020. Apr 24., Fri at 1:51, Micah Kornfield > wrote: > > > +1 (binding) > > > > On Thu, Apr 23, 2020 at 2:35 PM Sutou Kouhei wrote: > > > > > +1 (binding) > > > > > > In > > > "[VOTE] Add "trivial" Re

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
Hello David, I think that what you ask is achievable with the dataset API without much effort. You'd have to insert the pre-buffering at ParquetFileFormat::ScanFile [1]. The top-level Scanner::Scan method is essentially a generator that looks like flatmap(Iterator>). It consumes the fragment in-or

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
stems since on linux it can call `readahead` and/or `madvise`. François On Thu, Apr 30, 2020 at 8:56 AM Francois Saint-Jacques wrote: > > Hello David, > > I think that what you ask is achievable with the dataset API without > much effort. You'd have to insert the pre-bufferi

Re: [DISCUSS] Need for Arrow 0.17.1 patch release (binary only?)

2020-05-07 Thread Francois Saint-Jacques
I'll add https://issues.apache.org/jira/browse/ARROW-8726 to the list. On Tue, May 5, 2020 at 6:52 PM Wes McKinney wrote: > > Sorry I haven't had enough coffee today. > > The patches that still need to be resolved AFAICT are ARROW-8684 and > ARROW-8706 (AKA PARQUET-1857), so it will take a little

Re: [VOTE] Release Apache Arrow 0.17.1 - RC1

2020-05-15 Thread Francois Saint-Jacques
+1 binding, verified sources and binaries locally (no exclusions). On Fri, May 15, 2020 at 10:38 AM Neal Richardson wrote: > > +1 (binding) > > Verification here: https://github.com/apache/arrow/pull/7170 > > Still haven't worked out the Windows source verification job, but > everything else look

Re: [DISCUSS] Add kernel integer overflow handling

2020-06-04 Thread Francois Saint-Jacques
I documented [1] the behaviors by experimentation or by reading the documentation. My experiments were mostly about checking INT64_MAX + 1. My preference would be to use the platform defined behavior by default and provide a safety option that errors. Feel free to add more databases/systems. Fra
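The two behaviors discussed here can be sketched as follows. This is an emulation in Python, whose integers never overflow, so the int64 bounds are applied by hand. The function names are hypothetical, and the "default" shown assumes two's-complement wraparound, which is what most hardware produces.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def add_wrapping(a, b):
    """Emulate int64 two's-complement wraparound (assumed default behavior)."""
    return (a + b - INT64_MIN) % 2**64 + INT64_MIN


def add_checked(a, b):
    """Safety option: raise instead of silently wrapping."""
    result = a + b
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError(f"int64 overflow in {a} + {b}")
    return result


wrapped = add_wrapping(INT64_MAX, 1)  # wraps around to INT64_MIN
```

A checked kernel costs an extra comparison per element, which is one argument for keeping it opt-in.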

Re: Using gdb on a test

2020-06-15 Thread Francois Saint-Jacques
As Antoine said, debug mode is probably the most important configuration. You can also try the `relwithdebinfo` if you're trying to debug the optimized code. I'd also add the following: 1. Building out of conda provides a much better integration with gdb and the system's libstdc++ due to the prett

Re: Generate random arrow table

2020-06-22 Thread Francois Saint-Jacques
Hello, We use this extensively in unit tests, see [1] François [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/random.h On Mon, Jun 22, 2020 at 9:51 AM Kirill Lykov wrote: > > Hi, > > I wonder if there is existing C++ code which allows to generate a > random arrow table by

Re: [DISCUSS][C++] Performance work and compiler standardization for linux

2020-06-22 Thread Francois Saint-Jacques
We should aim to improve the performance of the most widely used *default* packages, which are python pip, python conda and R (all platforms). AFAIK, both pip (manywheel) and conda use gcc on Linux by default. R uses gcc on Linux and mingw (gcc) on Windows. I suppose (haven't checked) that clang is

Re: Generate random arrow table

2020-06-23 Thread Francois Saint-Jacques
> something like RandomTableGenerator before implementing myself one > > > using RandomArrayGenerator. > > > > > > On Mon, Jun 22, 2020 at 4:49 PM Francois Saint-Jacques > > > wrote: > > > > > > > > Hello, > > > > > > > > We

Re: Generate random arrow table

2020-06-23 Thread Francois Saint-Jacques
If you configured CMake to build tests (-DARROW_BUILD_TESTS=ON) and install locally, there should be a `libarrow_testing.so` that you need to link against. What I meant is that this library is _not_ part of pip/conda/dpkg/rpm. François

Re: Feather v2 random access

2020-06-24 Thread Francois Saint-Jacques
Hello Yue, FeatherV2 is just a facade for the Arrow IPC file format. You can find the implementation here [1]. I will try to answer your question with inline comments. On a high level, the file format writes a schema and then multiple "chunks" called RecordBatch. Your lowest level of granularity

Re: Feather v2 random access

2020-06-24 Thread Francois Saint-Jacques
AM Francois Saint-Jacques wrote: > > Hello Yue, > > FeatherV2 is just a facade for the Arrow IPC file format. You can find > the implementation here [1]. I will try to answer your question with > inline comments. On a high level, the file format writes a schema and > then mul

Re: [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-24 Thread Francois Saint-Jacques
+1 (binding)

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Francois Saint-Jacques
OTOH, how do we handle NullType -> UnionType cast conversion? Do we require some convention like the first children ArrayData null bitmap to be set and all tags set to 0? François On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou wrote: > > > Le 24/06/2020 à 18:34, Wes McKinney a écrit : > > On We

Re: [VOTE] Permitting unsigned integers for Arrow dictionary indices

2020-06-30 Thread Francois Saint-Jacques
+1 (binding) On Tue, Jun 30, 2020 at 10:55 AM Neal Richardson wrote: > > +1 (binding) > > On Tue, Jun 30, 2020 at 2:52 AM Antoine Pitrou wrote: > > > > > +1 (binding) > > > > Le 29/06/2020 à 23:59, Wes McKinney a écrit : > > > Hi, > > > > > > As discussed on the mailing list [1], it has been pro

Re: Ursabot Benchmark framework for other languages

2020-08-27 Thread Francois Saint-Jacques
Hello Kazuaki! I recommend you read and take a look at the benchmark sub-library [1] of archery and how it's glued [2]. You will need to implement: - A runner for the framework you intend to use [3] and [4], it also implies capturing the output into a class that implements the "Benchmark" interfa

Re: [C++] 0x00 in Binary type

2020-11-18 Thread Francois Saint-Jacques
I would say at first sight that it's due to your usage of char[] and builder.Append(d) implicitly does a strlen. François On Wed, Nov 18, 2020 at 2:00 PM Ying Zhou wrote: > > Sure! > > BinaryBuilder builder; > char d[] = "\x00\x01\xbf\x5b"; > (void)(builder.Append(d)); > std::shared_ptr array; >
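The strlen effect can be reproduced outside Arrow. The sketch below uses Python's `ctypes` as a stand-in for a C `char[]`; it is not the Arrow builder API. Reading the buffer as a NUL-terminated string stops at the leading 0x00, while reading it with an explicit length keeps all four bytes.

```python
import ctypes

data = b"\x00\x01\xbf\x5b"

# Like `char d[] = "..."` in C: the four bytes plus a trailing NUL.
buf = ctypes.create_string_buffer(data)

# NUL-terminated view: stops at the first 0x00, as strlen would.
as_c_string = buf.value

# Explicit-length view: every byte survives, including the leading 0x00.
with_length = buf.raw[:len(data)]
```

The same reasoning suggests handing binary data to a builder through an interface that takes the length explicitly rather than one that infers it from a terminator.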

Re: [DISCUSS] C++ SO versioning with 1.0.0

2019-07-29 Thread Francois Saint-Jacques
Sounds reasonable to me. On Sat, Jul 20, 2019 at 5:55 AM Sutou Kouhei wrote: > > Hi, > > No more opinions? > > If there are no more opinions, we'll use the current > SO versioning schema committed by > https://github.com/apache/arrow/pull/4801 for 1.0.0. The > current versioning schema is the fol

Re: [VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond

2019-07-29 Thread Francois Saint-Jacques
Do we bump the library version on changes from _any_ language implementation, or just the C++/Java version? François On Fri, Jul 26, 2019 at 3:34 PM Wes McKinney wrote: > > hello, > > As discussed on the mailing list thread [1], Micah Kornfield has > proposed a version scheme for the project to

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

2019-07-29 Thread Francois Saint-Jacques
Hello, if each record has a different size, then I suggest to just use a Struct> where Dim is a struct (or expand in the outer struct). You can probably add your own logic with the recently introduced ExtensionType [1]. François [1] https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b

Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Francois Saint-Jacques
My vote would go with underscore to minimize changes and minimize exceptions to the Google style guide reference. I also suggest that we add this to the linters somehow, if it's not too much trouble. François On Tue, Aug 6, 2019 at 9:35 PM Sutou Kouhei wrote: > > Hi, > > I like hyphens. > > Bec

Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield

2019-08-09 Thread Francois Saint-Jacques
Congrats! well deserved. On Fri, Aug 9, 2019 at 11:12 AM Wes McKinney wrote: > > The Project Management Committee (PMC) for Apache Arrow has invited > Micah Kornfield to become a PMC member and we are pleased to announce > that Micah has accepted. > > Congratulations and welcome!

Re: [DISCUSS] ArrayBuilders with mutable type

2019-08-19 Thread Francois Saint-Jacques
Indeed, I'd expect the `type()` method to not be called in the hot path. François On Mon, Aug 19, 2019 at 10:17 AM Wes McKinney wrote: > > hi Ben, > > On this possibility > > - Make ArrayBuilder::type() virtual. This will be much more expensive for > nested builders and for applications which ne

Re: [DISCUSS] Ternary logic

2019-08-30 Thread Francois Saint-Jacques
I created the ticket https://issues.apache.org/jira/browse/ARROW-6396, I think we can offer both. François On Thu, Aug 29, 2019 at 5:10 PM Ben Kietzman wrote: > > Indeed it's not about sanitizing nulls; it's about how nulls should > interact with boolean (and other) expressions. > > For purpose

Re: Size of c++ libraries

2019-09-04 Thread Francois Saint-Jacques
Hello Ivan, There's a tool called `bloaty` [1] that can tell you the size of a binary object per symbol. Thank you, François [1] https://github.com/google/bloaty On Wed, Sep 4, 2019 at 12:00 PM Ivan Popivanov wrote: > > Have been trying to figure out the binary size of a basic arrow static

Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-06 Thread Francois Saint-Jacques
Congrats to everyone! François On Fri, Sep 6, 2019 at 4:34 AM Kenta Murata wrote: > > Thank you very much everyone! > I'm very happy to join this community. > > 2019年9月6日(金) 12:39 Micah Kornfield : > > > > > Congrats everyone. > > > > On Thu, Sep 5, 2019 at 7:06 PM Ji Liu wrote: > > > > > Congr

Re: Travis CI delays

2019-09-27 Thread Francois Saint-Jacques
Hello, I suggest we tackle https://jira.apache.org/jira/browse/ARROW-5801. For Rust, that would be https://jira.apache.org/jira/browse/ARROW-5809. Once ported to docker/docker-compose, it's trivial to activate github action for the same test (see https://github.com/apache/arrow/pull/5530). As I'm

Re: [VOTE] Release Apache Arrow 0.15.0 - RC2

2019-10-02 Thread Francois Saint-Jacques
+1 (non binding) Source release verified. ARROW_FLIGHT=OFF due to system protobuf. Binary release verified. Ubuntu 18.04 François On Wed, Oct 2, 2019 at 1:18 AM Micah Kornfield wrote: > > +1 (binding) > > On Debian Stretch I ran: dev/release/verify-release-candidate.sh binaries > 0.15.0 2 and i

Re: [C++] The quest for zero-dependency builds

2019-10-10 Thread Francois Saint-Jacques
There's always the route of vendoring some library and not exposing external CMake options. This would achieve the goal of compile-out-of-the-box and enable important features in the basic build. We also simplify dependency requirements (which benefits CI or developers). The downside is following securit

Re: [DISCUSS] Result vs Status

2019-10-19 Thread Francois Saint-Jacques
As mentioned, Result is an improvement for functions which return a single value, e.g. Make/Factory-like. My vote goes to Result for such cases. For multiple return types, we have std::tuple like Antoine proposed. François On Fri, Oct 18, 2019 at 9:19 PM Antoine Pitrou wrote: > > > Le 18/10/2019 à 2

Re: [VOTE] Release Apache Arrow 0.15.1 - RC0

2019-10-31 Thread Francois Saint-Jacques
+1 (non-binding) Ubuntu 18.04 - Source release verified - Binary release verified François On Fri, Oct 25, 2019 at 2:43 PM Krisztián Szűcs wrote: > > Hi, > > I would like to propose the following release candidate (RC0) of Apache > Arrow version 0.15.1. This is a patch release consisting of 36

Re: [VOTE] Release Apache Arrow 0.15.1 - RC0

2019-11-01 Thread Francois Saint-Jacques
nges in > >> > https://github.com/apache/arrow/pull/5600 > >> > + pip3 install -e dev/archery > >> > Obtaining file:///tmp/arrow-0.15.1.7OxLD/apache-arrow-0.15.1/dev/archery > >> > Complete output from command python setup.py

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-08-0

2019-11-08 Thread Francois Saint-Jacques
Lint and Rust failures fixed (https://github.com/apache/arrow/commit/aa9f5c95253ef1fe713c5010f0a8f740ef284109) Gandiva failures fixed (https://github.com/apache/arrow/commit/1d23ec42fd786141b7de58a057d91c74ca19c32e) Centos7 failure fixed (https://github.com/apache/arrow/commit/5a47c5e8c2d5dba5eac52

Re: Parquet cpp status

2019-11-15 Thread Francois Saint-Jacques
The parquet c++ implementation has all the facilities to expose the required information to implement predicate pushdown. The experimental Dataset API does make use of this with parquet. See [1] for an example of the API. Or a real-life usage with the nyc-tlc taxi dataset [2]. The relevant implemen

Re: [C++][Parquet]: Stream API handling of optional fields

2019-11-15 Thread Francois Saint-Jacques
I'm all for it. Created [1] it would also enable an operator[] for arrays of primitive types [2]. [1] https://issues.apache.org/jira/browse/ARROW-7178 [2] https://issues.apache.org/jira/browse/ARROW-6276 On Fri, Nov 15, 2019 at 12:40 AM Micah Kornfield wrote: > > I think there are potentially ot

Arrow sync call November 13 recap

2019-11-18 Thread Francois Saint-Jacques
Attendees: - Projjal Chanda - Uwe Korn - Antoine Pitrou - Prudhvi Porandla - François Saint-Jacques Discussion: - Dataset API is going to be a first candidate for the Result refactor (see https://github.com/apache/arrow/pull/5857) - There's an overlap of dataset::Expression class and gandiva::Node

Re: [DISCUSS][C++] Pointer name aliasing

2019-11-21 Thread Francois Saint-Jacques
This notation is already used in some parts of the codebase [1]. I think it was introduced when absorbing gandiva and then in a draft of the logical operations in the compute module. I have no strong opinion for/against. I find it convenient to reduce typing, but the style guide argues against this.

Re: [DISCUSS][C++] Pointer name aliasing

2019-11-22 Thread Francois Saint-Jacques
I'll revert, some questions: 1. Should we revert only the pointer aliases, or also the Vector/Iterator. 2. Should we revert all modules, i.e. gandiva and compute. François

Re: Unions: storing type_ids or type_codes?

2019-11-26 Thread Francois Saint-Jacques
It seems that the array_union_test.cc does the latter, look at how `expected_types` is constructed. I opened https://issues.apache.org/jira/browse/ARROW-7265 . Wes, is the intended usage of type_ids to allow a producer to pass a subset columns of unions without modifying the type codes? François

Re: Non-chunked large files / hdf5 support

2019-11-26 Thread Francois Saint-Jacques
Hello Maarten, In theory, you could provide a custom mmap-allocator and use the builder facility. Since the array is still in "build-phase" and not sealed, it should be fine if mremap changes the pointer address. This might fail in practice since the allocator is also used for auxiliary data, e.g.

Re: Apache Arrow sync now

2019-11-27 Thread Francois Saint-Jacques
Attendees: - Micah Kornfield, Google - Praveen Kumar, Dremio - Todd Hendricks - François Saint-Jacques, RStudio/Ursa Labs Subject - Bazel. Micah wants feedback on the PR. This is first aimed at developer productivity, notably shorter link time and sandboxed build. As a first PoC, parts of the python

Re: Datasets and Java

2019-11-27 Thread Francois Saint-Jacques
Hello Hongze, The C++ implementation of dataset, notably Dataset, DataSource, DataSourceDiscovery, and Scanner classes are not ready/designed for distributed computing. They don't serialize and they reference by pointer all around, thus I highly doubt that you can implement parts in Java, and some

Re: [ANNOUNCE] New Arrow committer: Joris van den Bossche

2019-12-09 Thread Francois Saint-Jacques
Bravo! On Mon, Dec 9, 2019 at 6:55 AM Wes McKinney wrote: > > On behalf of the Arrow PMC, I'm happy to announce that Joris has > accepted an invitation to become a committer on Apache Arrow. > > Welcome, and thank you for your contributions!

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Francois Saint-Jacques
It seems that LLVM can't auto vectorize. I don't have a debug build, so I can't get the `-debug-only` information from llvm-opt/opt about why it can't vectorize. The buffer address mangling should be hoisted out of the loop (still doesn't enable auto vectorization) [1]. The buffer juggling should b

Re: Arrow sync call December 11 at 12:00 US/Eastern, 17:00 UTC

2019-12-11 Thread Francois Saint-Jacques
Attendees: - Antoine Pitrou, Ursa Labs/RStudio - Francois Saint-Jacques, Ursa Labs/RStudio - Ravindra Pindikura, Dremio - Neville Dipale - Rok Mihevc Subjects: - Arrow 1.0 release: - Neville has been working on the Rust IPC bindings (https://github.com/apache/arrow/pull/6013) - Antoine is worki

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Francois Saint-Jacques
functionality. > > On Wed, Dec 11, 2019 at 10:06 PM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > > > It seems that LLVM can't auto vectorize. I don't have a debug build, > > so I can't get the `-debug-only` information from llvm-opt/opt ab

Re: [Gandiva] question about IR optimization

2019-12-11 Thread Francois Saint-Jacques
Missing [1] link. [1] https://godbolt.org/z/S8tixP On Wed, Dec 11, 2019 at 12:58 PM Francois Saint-Jacques wrote: > > So, llvm _can_ auto-vectorize, I was just missing the `-mtriple` > option [1]. That still requires hoisting the buffer juggling. > > François > > On Wed,

Re: [DISCUSS][C++] Pointer name aliasing

2019-12-19 Thread Francois Saint-Jacques
nk we can probably take an incremental approach of: > 1. Eliminate *Ptr in src/arrow code (discuss similar changes in > parquet/gandiva). > 2. Decide on the Iterator/Vector. > > On Fri, Nov 22, 2019 at 10:47 AM Wes McKinney wrote: > > > hi Francois > > > > On Fri, No

Re: Human-readable version of Arrow Schema?

2020-01-09 Thread Francois Saint-Jacques
The desired goal for this feature is trivial modifications, e.g. within an editor, by data-scientists and researchers. I'd go for the flatbuffer's json representation as it is stable and has native support in almost any language or editor due to the ubiquity of JSON. The C interface schema string

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Francois Saint-Jacques
What's the point of having zero copy if the OS is doing the decompression in kernel (which trumps the zero-copy argument)? You might as well just use parquet without filesystem compression. I prefer to have compression algorithm where the columnar engine can benefit from it [1] than marginally impr

Re: [Format] Array/RowBatch filters

2020-01-24 Thread Francois Saint-Jacques
By filter, you mean a filter expression, or a selection vector/bitmap? On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield wrote: > > One of the things that I think got overlooked in the conversation on having > a slice offset in the C API was a suggestion from Jacques of perhaps > generalizing the

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-02-02-0

2020-02-03 Thread Francois Saint-Jacques
Opened https://github.com/apache/arrow/pull/6342 to silence the OSX jar issue. On Sun, Feb 2, 2020 at 8:31 AM Crossbow wrote: > > > Arrow Build Report for Job nightly-2020-02-02-0 > > All tasks: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-02-0 > > Failed Tasks: > -

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-02-02-0

2020-02-03 Thread Francois Saint-Jacques
Whelp, gmail didn't help with the thread folding. I'll just approve Krisz' patch :). On Mon, Feb 3, 2020 at 8:22 AM Francois Saint-Jacques wrote: > > Opened https://github.com/apache/arrow/pull/6342 to silence the OSX jar issue. > > On Sun, Feb 2, 2020

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-02-03-0

2020-02-03 Thread Francois Saint-Jacques
The debian buster failure seems to be a network issue with github upload, we'll see tomorrow. The gandiva-jar will be gone in the next nightly (https://github.com/apache/arrow/pull/6342). On Mon, Feb 3, 2020 at 8:48 AM Crossbow wrote: > > > Arrow Build Report for Job nightly-2020-02-03-0 > > All

Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-03 Thread Francois Saint-Jacques
+1 Binaries verification didn't have any issues. Sources verification worked with some local environment hiccups François On Mon, Feb 3, 2020 at 8:46 PM Andy Grove wrote: > > +1 (binding) based on running the Rust tests > > Thanks. > > On Thu, Jan 30, 2020 at 8:13 PM Krisztián Szűcs > wrote: >

Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-03 Thread Francois Saint-Jacques
Tested on ubuntu 18.04 for the source release. On Mon, Feb 3, 2020 at 10:07 PM Francois Saint-Jacques wrote: > > +1 > > Binaries verification didn't have any issues. > Sources verification worked with some local environment hiccups > > François > > On Mon, F

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-02-04-0

2020-02-04 Thread Francois Saint-Jacques
This is a first! On Tue, Feb 4, 2020 at 8:47 AM Crossbow wrote: > > > Arrow Build Report for Job nightly-2020-02-04-0 > > All tasks: > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-02-04-0 > > Succeeded Tasks: > - centos-6: > URL: > https://github.com/ursa-labs/crossbo

Re: Arrow doesn't have a MapType

2020-02-07 Thread Francois Saint-Jacques
Arrow does have a Map type [1][2][3]. It is represented as a list of pairs. François [1] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/format/Schema.fbs#L60-L87 [2] https://github.com/apache/arrow/blob/762202418541e843923b8cae640d15b4952a0af6/cpp/src/arrow/type.h
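The "list of pairs" representation can be sketched as three flat sequences: an offsets list plus parallel keys and items. The function names are hypothetical and Python lists stand in for Arrow buffers; the linked Schema.fbs remains the authoritative definition.

```python
def encode_maps(maps):
    """Flatten a list of dicts into offsets + parallel keys/items columns."""
    offsets = [0]
    keys, items = [], []
    for mapping in maps:
        for key, item in mapping.items():
            keys.append(key)
            items.append(item)
        offsets.append(len(keys))  # end of this map's slice of pairs
    return offsets, keys, items


def decode_map(offsets, keys, items, i):
    """Rebuild the i-th map from its slice of the flattened pairs."""
    start, end = offsets[i], offsets[i + 1]
    return dict(zip(keys[start:end], items[start:end]))


# Hypothetical sample data, including an empty map.
maps = [{"a": 1, "b": 2}, {}, {"c": 3}]
offsets, keys, items = encode_maps(maps)
```

Row `i` owns the pairs in the half-open range `offsets[i]` to `offsets[i + 1]`, which is the same scheme a variable-size list uses.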

Re: Arrow Datasets Functionality for Python

2020-02-10 Thread Francois Saint-Jacques
Hello Matthew, The dplyr binding is just syntactic sugar on top of the dataset API. There are no analytics capabilities yet [1], other than the select and the limited projection supported by the dataset API. It looks like it is doing analytics due to properly placed `collect()` calls, which converts

Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-10 Thread Francois Saint-Jacques
; > > > > >>>> >> > > > > > > [1] > > > >>>> >> https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files > > > >>>> >> > > > > > > [2] > > > >>>> >> > > > > > > > > >>>> >>

Re: [VOTE] Adopt Arrow in-process C Data Interface specification

2020-02-13 Thread Francois Saint-Jacques
+1 On Thu, Feb 13, 2020 at 9:08 PM Fan Liya wrote: > > +1 (binding) > > On Thu, Feb 13, 2020 at 11:52 AM Wes McKinney wrote: > > > +1 (binding) > > > > On Tue, Feb 11, 2020 at 4:29 PM Antoine Pitrou wrote: > > > > > > > > > Ah, you're right, it's PR 6040: > > > https://github.com/apache/arrow/p

[DISCUSS] Field reference ambiguity

2020-03-13 Thread Francois Saint-Jacques
Hello, the recent dataset and compute work has forced us to think about schema projection. One problem that surfaced is referencing fields in nested schemas and/or schemas where duplicate column names exist. We currently have (C++) APIs that either pass a vector or a vector to represent fields su

Re: Algorithmic explorations of bitmaps vs. sentinel values

2018-10-17 Thread Francois Saint-Jacques
It seems the code for the naive Scalar example is not friendly to the compiler's auto-vectorization component. If you accumulate in a local state (instead of through the SumState pointer), you'll get different results, at least with clang++ 6.0. benchmark-noavx (only SSE): BM_SumInt32Scalar

Re: Algorithmic explorations of bitmaps vs. sentinel values

2018-10-17 Thread Francois Saint-Jacques
rks to do that (and merge with the > SumState at the end of the function) for thoroughness. Thanks! > On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques > wrote: > > > > It seems the code for the naive Scalar example is not friendly with the > > compiler auto-vectori

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

2018-10-17 Thread Francois Saint-Jacques
One point in favor of separate repositories: vendoring Arrow for a C++ project with git submodules becomes awkward if it's a multi-language monorepo. On Tue, Oct 16, 2018 at 9:22 PM Wes McKinney wrote: > I would also add -- Krisztian's recent work Dockerizing the project is > setting us up to be able to de

Re: [Discuss] Monorepo vs. independent repositories for independent implementations

2018-10-17 Thread Francois Saint-Jacques
Not the nesting, but pulling a lot of unused files. On Wed, Oct 17, 2018 at 12:39 PM Wes McKinney wrote: > Why would one level of directory nesting cause awkwardness (curious)? > > On Wed, Oct 17, 2018, 12:28 PM Francois Saint-Jacques < > fsaintjacq...@networkdump.com> wro

Re: Cast from Array to Array and Array

2018-11-14 Thread Francois Saint-Jacques
Seems like the type combination you're using (int32 -> uint32) and (int32 -> uint64) don't match the following pattern-matching https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L191-L192 which avoid using "safe" cast and revert to the following cast implementation

Re: RFC: Type inference rules

2018-11-30 Thread Francois Saint-Jacques
Hello, With JSON and other "typed" formats (msgpack, protobuf, ...) you need to take unions into account, e.g. {a: "herp", b: 10} {a: true, c: "derp"} The type for `a` would be union. I think we should also evaluate investing in ingesting different schema DSLs (protobuf IDL, json-schema) to avoi
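The union case in this example can be sketched with a small inference pass. The function name `infer_field_types` and the `union<...>` spelling are hypothetical, and Python type names stand in for Arrow types. Each field collects every type it was observed with, and a field seen with more than one type is reported as a union.

```python
def infer_field_types(records):
    """Map each field name to its observed type, or a union of several."""
    seen = {}
    for record in records:
        for name, value in record.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    inferred = {}
    for name, types in seen.items():
        if len(types) == 1:
            inferred[name] = next(iter(types))
        else:
            # Sorted so the result does not depend on record order.
            inferred[name] = "union<" + ", ".join(sorted(types)) + ">"
    return inferred


# The two records from the message above.
records = [{"a": "herp", "b": 10}, {"a": True, "c": "derp"}]
schema = infer_field_types(records)
```

Fields missing from some records (`b` and `c` here) would additionally need to be nullable, which this sketch does not track.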

Re: Reviewing PRs (was: Re: Arrow sync call)

2018-12-12 Thread Francois Saint-Jacques
I'd also suggest that we extend Romain's effort to add labels to all languages, review states, and maybe. While the string labeling with [] works, GitHub search/filtering is not very good compared to filtering by labels. lang-{R,c++,py,java,...} review-{wip,ready} comp-{doc,gandiva,parquet,plasma

Re: How to append to parquet file periodically and read intermediate data - pyarrow.lib.ArrowIOError: Invalid parquet file. Corrupt footer.

2018-12-19 Thread Francois Saint-Jacques
Hello Darren, what Uwe suggests is usually the way to go, your active process writes to a new file every time. Then you have a parallel process/thread that does compaction of smaller files in the background such that you don't have too many files. On Wed, Dec 19, 2018 at 7:59 AM Uwe L. Korn wrot

Re: Arrow pull requests: please limit squashing your commits

2018-12-19 Thread Francois Saint-Jacques
No issue with this. When the final squash is done, which title/body is preserved? On Wed, Dec 19, 2018 at 8:43 AM Wes McKinney wrote: > hi folks, > > As the contributor base has grown, our development styles have grown > increasingly diverse. > > Sometimes contributors are used to working in a

Datum API

2019-01-09 Thread Francois Saint-Jacques
Is there a reason why Datum::ARRAY stores an ArrayData and not an Array? I'm aware there's the `make_array` method to obtain the equivalent, but was wondering if there was a deeper reason.

Re: Arrow Sync call: 17:00 UTC (12p Eastern)

2019-01-23 Thread Francois Saint-Jacques
Notes from today's meeting Attendees: - François Saint-Jacques (Ursa Labs/RStudio) - Wes McKinney (Ursa Labs/RStudio) - Benchmark Project - PR Backlog - Ben Kietzman (Ursa Labs/RStudio) - Neville Dipale - Siddharth Teotia (Dremio) - Andy Grove - Ravindra (Dremio) - Shyam Singh (Dremio) - Li Jin

Re: [Format] Passing selection masks with Arrow record batches

2019-01-28 Thread Francois Saint-Jacques
On Mon, Jan 28, 2019 at 12:53 AM Wes McKinney wrote: > I was having a discussion recently about Arrow and the topic of > server-side filtering vs. client-side filtering came up. > > The basic problem is this: > > If you have a RecordBatch that you wish to filter out some of the > "rows", one way

Re: Git workflow question

2019-01-30 Thread Francois Saint-Jacques
This is also applicable on a per-repository basis by modifying the clone's `.git/config` file instead of the global one in your home directory. On Wed, Jan 30, 2019 at 1:49 PM Antoine Pitrou wrote: > > That will be activated for all repositories, though, not only Arrow? > > Regards > > Antoine. > > > Le 30/

Re: Compute kernels and Gandiva operators

2019-02-13 Thread Francois Saint-Jacques
Hi, I also agree that we should follow a model similar to what you propose. I think the plan is, correct me if I'm wrong Wes, to write the logical plan operators, then write a small execution engine prototype and produce a proper design document out of this experiment. There's also a placeholder t

Re: Flight / gRPC scalability issue

2019-02-21 Thread Francois Saint-Jacques
Can you remind us what's the easiest way to get flight working with grpc? clone + make install doesn't really work out of the box. François On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou wrote: > > Hello, > > I've been trying to saturate several CPU cores using our Flight > benchmark (which sp

Re: Flight / gRPC scalability issue

2019-02-21 Thread Francois Saint-Jacques
al wrote: > > > I like flamegraphs for investigating this sort of problem: > > > > > > https://github.com/brendangregg/FlameGraph > > > > > > There are likely many other techniques for inspecting where time > is being spent but

Re: [VOTE] Release Apache Arrow 0.12.1 RC0

2019-02-22 Thread Francois Saint-Jacques
+1 (non-binding) * Validated sources on Ubuntu 18.04 with cmake 3.10.2 * Validated binaries On Fri, Feb 22, 2019 at 6:33 AM Uwe L. Korn wrote: > +1 (binding) > > * Checked sources on Ubuntu 16.04 with an updated CMake and Gandiva turned > off. > * Verified the uploaded signatures of sources and

Re: Flight / gRPC scalability issue

2019-02-24 Thread Francois Saint-Jacques
; > Regards > > Antoine. > > > Le 21/02/2019 à 18:40, Francois Saint-Jacques a écrit : > > You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to > perf, > > it'll help the unwinding. Sometimes it's better than the stack pointer > > meth

Re: Flaky Travis CI builds on master

2019-02-27 Thread Francois Saint-Jacques
I think we're witnessing multiple issues. 1. Travis seems to be slow (is it an OOM issue?) - https://travis-ci.org/apache/arrow/jobs/499122041#L1019 - https://travis-ci.org/apache/arrow/jobs/498906118#L3694 - https://travis-ci.org/apache/arrow/jobs/499146261#L2316 2. https://issues.apache.or

Re: URI library for C++

2019-02-27 Thread Francois Saint-Jacques
There's a good chance we end up using curl for the dataset project. curl has a new URL API https://github.com/curl/curl/wiki/URL-API , but it requires a recent version (7.62.0, October 2018), which means vendoring. François On Wed, Feb 27, 2019 at 11:06 AM Antoine Pitrou wrote: > > Hello, > > As

Re: URI library for C++

2019-02-27 Thread Francois Saint-Jacques
LS library and who knows what else... > > Regards > > Antoine. > > > On Wed, 27 Feb 2019 11:16:49 -0500 > Francois Saint-Jacques wrote: > > There's a good chance we end up using curl for the dataset project. Curl > > has a new url API https://github.co

Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
Also just created https://issues.apache.org/jira/browse/ARROW-4728 On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura wrote: > > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou wrote: > > > > > > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit : > >> > >> > >>> On Feb 27, 2019, at 1:48 AM, An

Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
> > Thoughts? > > -Micah > > > On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > > > Also just created https://issues.apache.org/jira/browse/ARROW-4728 > > > > On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindiku

Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
an > be tagged with the label > > On Fri, Mar 1, 2019 at 12:45 PM Francois Saint-Jacques > wrote: > > > > I agree with adding a tag/label for this and even marking the failure as > > critical. > > > > > > On Fri, Mar 1, 2019 at 12:18 PM Micah Kornf

Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
Could someone give me write/edit access to confluence? Thank you, François On Fri, Mar 1, 2019 at 3:55 PM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > I'll take this. > > On Fri, Mar 1, 2019 at 3:55 PM Wes McKinney wrote: > >> We could create a pa

Re: Flaky Travis CI builds on master

2019-03-04 Thread Francois Saint-Jacques
On Fri, Mar 1, 2019 at 8:09 PM Francois Saint-Jacques > wrote: > > > > Could someone give me write/edit access to confluence? > > > > Thank you, > > François > > > > On Fri, Mar 1, 2019 at 3:55 PM Francois Saint-Jacques < > > fsaintjacq...@gm

Re: [ANNOUNCE] New Arrow committer: Micah Kornfield

2019-03-08 Thread Francois Saint-Jacques
Congrats, great addition! On Fri, Mar 8, 2019 at 3:12 PM Philipp Moritz wrote: > Congrats Micah! > > On Fri, Mar 8, 2019 at 11:28 AM Wes McKinney wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that Micah has > > accepted an invitation to become a committer on Apache Arrow. > > >

[C++] Failing constructors and internal state

2019-03-08 Thread Francois Saint-Jacques
Greetings, I noted that the current C++ API permits constructing core objects that break their classes' invariants. The following recent issues were affected by this: - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer - ARROW-4774: segfault due to invalid Table with columns of differ
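The thread is about C++, but the general idea of a constructor that cannot report failure versus a factory that validates first can be sketched in a few lines. The following is only an illustration in Python: the `ArrayData` class, its fields, and the `make` factory are hypothetical stand-ins, not Arrow's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArrayData:
    """Hypothetical stand-in for an array descriptor (not Arrow's class)."""
    length: int
    buffer: Optional[bytes]

    @classmethod
    def make(cls, length: int, buffer: Optional[bytes]) -> "ArrayData":
        """Validating factory: refuses to build an object that breaks invariants."""
        if length < 0:
            raise ValueError("length must be non-negative")
        if buffer is None:
            raise ValueError("buffer must not be None")
        if len(buffer) < length:
            raise ValueError("buffer is shorter than the declared length")
        return cls(length, buffer)


# The plain constructor performs no checks, so an invalid object can exist
# and will only fail later, far from where it was built.
unchecked = ArrayData(1, None)

# The factory reports the problem at construction time instead.
checked = ArrayData.make(3, b"abc")
```

Whether such validation belongs in the constructor, in a separate factory, or in an explicit validation call is precisely what the replies in this thread debate; the sketch takes no position on that.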

Re: [C++] Failing constructors and internal state

2019-03-11 Thread Francois Saint-Jacques
then_ pass that to > >> std::make_shared(..., vector_arg, ...). > >> > >> I do not agree with refactoring these methods to use "validating" > >> constructors. Users of these C++ APIs should know what their > >> requirements are, and we pro

Re: Creating Arrays from builders using bitmasks

2019-03-22 Thread Francois Saint-Jacques
Hello Felipe, it's a bit per value, as per the memory layout documentation. François On Fri, Mar 22, 2019 at 10:48 AM Felipe Aramburu wrote: > In the builder base class I see this api > > > https://github.com/apache/arrow/blob/ad1697e5d25eeaff5630421f55b0120f45cf0ce1/cpp/src/arrow/array/builder_b

Re: Creating Arrays from builders using bitmasks

2019-03-22 Thread Francois Saint-Jacques
Actually, this specific method seems to use a byte per value as you questioned. I think it's worth adding documentation and an explicit warning if it confused me. I'll let bkietz chime in to comment on the usage. François On Fri, Mar 22, 2019 at 10:57 AM Francois Saint-Jacques <
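To make the bit-per-value versus byte-per-value distinction concrete, here is a minimal sketch in Python that packs a byte-per-value validity vector into a bit-per-value bitmap. It assumes least-significant-bit-first ordering within each byte, as the Arrow columnar format describes for validity bitmaps. The helper names `pack_validity` and `is_valid` are hypothetical and are not part of any Arrow library.

```python
def pack_validity(byte_flags):
    """Pack one-flag-per-byte validity into an LSB-first, bit-per-value bitmap."""
    bitmap = bytearray((len(byte_flags) + 7) // 8)  # round up to whole bytes
    for i, flag in enumerate(byte_flags):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Read back the validity bit of value i from a packed bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


# Nine values: a byte-per-value vector takes 9 bytes, the bitmap only 2.
flags = [1, 0, 1, 1, 0, 0, 0, 0, 1]
packed = pack_validity(flags)
```

Passing one representation where the other is expected silently misreads the validity of most values, which is why a documentation warning on the method is worthwhile.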

Re: Creating Arrays from builders using bitmasks

2019-03-22 Thread Francois Saint-Jacques
to do this but I am > trying to do things as the documentation suggest which I assumed was the > preferred method of doing this. > > > > On Fri, Mar 22, 2019 at 8:13 AM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > > > Actually, this specific method
