Re: Re: Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

2021-09-19 Thread Rémi Dettai
> > > DataFusion. > > > > It will be fantastic to have an opportunity to communicate with > community > > > > members "face to face". > > > > > > > > Best, > > > > Yijie > > > > > >

Re: [DISCUSS][Rust] Biweekly sync call for arrow/datafusion again?

2021-09-16 Thread Rémi Dettai
I am also very interested in reinstating these events, at least occasionally. I do think that sharing some higher-level goals and ideas in more *informal* discussions could help us understand each other better in our asynchronous work (design documents, issues, PRs). I also agree that no decisio

Re: [Rust] [DataFusion] Move Statistics and Cost Based Optimizations to physical plans

2021-09-11 Thread Rémi Dettai
Thanks Andrew for bringing this PR forward. I would just like to give the big picture that led to this modification. We would like to make DataFusion integrate more efficiently with table formats. I have recently written a design document along those lines [1] that goes through the various ways we

Re: Arrow Rust sync call February 10 at 12:00 US/Eastern, 17:00 UTC

2021-02-13 Thread Rémi Dettai
d fit > your architecture nicely and I think shouldn't be too hard to create the > query from the filters/projection in the datasource scan method to spend > less time in Lambda. > > On Wed, Feb 10, 2021, 18:44 Rémi Dettai wrote: > > > Thanks for the notes Andy. Here is

Re: Arrow Rust sync call February 10 at 12:00 US/Eastern, 17:00 UTC

2021-02-10 Thread Rémi Dettai
Thanks for the notes Andy. Here is the slide deck I presented, for further reference: https://docs.google.com/presentation/d/1uZ5PbazC1zCX24k0Hh-UItddIh9BRvD5GL7NUDgc9eQ/edit?usp=sharing If anyone wants to see how it works in practice and does not have an AWS account to try it out, feel free to re

Re: [Rust][DataFusion] DataFusion Overview / Architecture

2021-02-04 Thread Rémi Dettai
Hi Andrew! The book "How query engines work" (https://leanpub.com/how-query-engines-work) that Andy wrote is pretty great! It documents query engine APIs in Kotlin and not Rust, as it was written during earlier Ballista experiments, but almost all items still apply to DataFusion (feel free t

Re: Introducing Buzz, Arrow powered serverless query engine

2021-01-28 Thread Rémi Dettai
mplementation of Arrow. > > > > On Tue, Jan 26, 2021 at 10:18 AM Rémi Dettai wrote: > > > > > Hi all, > > > > > > I have been following this community for nearly a year now, trying to > > > contribute whenever I could. It was really a great experi

Re: Arrow Rust Sync Call 1/27/2021

2021-01-27 Thread Rémi Dettai
Thanks Andy!! On Wed, Jan 27, 2021 at 18:41, Andy Grove wrote: > Attendees > - Mahmut Bulut > - Remi Dettai > - Andy Grove > - Fernando Herrera > - Jorn Horstmann > - Andrew Lamb > - Jorge Leitao > - Mike Seddon

Introducing Buzz, Arrow powered serverless query engine

2021-01-26 Thread Rémi Dettai
Hi all, I have been following this community for nearly a year now, trying to contribute whenever I could. It was really a great experience and I sure learned a lot. Today, it is my turn to give back to the community by open sourcing the project I started developing a few months ago.

Re: [Rust] Proposed discussion items for the Rust sync up meeting this Wednesday Jan 27, 2021

2021-01-26 Thread Rémi Dettai
Great topics Andrew. To my knowledge, nothing has been decided on them yet. We also agreed last time that it would be nice to go round the table so that each of us has an opportunity to briefly present their use case for the Rust Arrow implementation. Remi On Sun, Jan 24, 2021 at 13:16, Andrew

Re: Announce: Bi-weekly Rust-specific Arrow sync call

2021-01-13 Thread Rémi Dettai
Hi Andy! I am stuck in the waiting room! On Wed, Jan 13, 2021 at 17:58, Andy Grove wrote: > The first of these calls will be starting shortly. I will try and remember > to send reminders in advance for future calls. > > On Mon, Jan 11, 2021 at 4:40 PM Andy Grove wrote: > > > As discussed at

Re: [Rust] Announcement: Rust is now part of the Arrow Flight Integration Test

2021-01-10 Thread Rémi Dettai
Congratulations to all contributors! Rémi On Sat, Jan 9, 2021 at 19:48, David Li wrote: > Congrats to all involved, this is indeed a big milestone! > > Best, > David > > On Sat, Jan 9, 2021, at 13:13, Chao Sun wrote: > > Congrats! this is awesome work! > > > > On Sat, Jan 9, 2021 at 4:28 AM

Re: [DataFusion] Blocking async of async is not async

2020-11-16 Thread Rémi Dettai
>> two thread pools: one for synchronous tasks and one for async tasks > >> > >> I am fairly sure there can be only one global Runtime (because when I > >> tried > >> to explicitly create one when an existing one is present, tokio > >> panic!'

Re: [DataFusion] Blocking async of async is not async

2020-11-13 Thread Rémi Dettai
mplementation, > to give you a sense of the kinds of issues we are hoping to avoid in > DataFusion with using async > > Andrew > > > On Fri, Oct 30, 2020 at 4:28 AM Rémi Dettai wrote: > > > Hi everyone! > > > > If you are reading this, it means that you f

Re: Rust ParquetReader trait

2020-11-12 Thread Rémi Dettai
e way that you've implemented. Just I need to > understand the surface impact for the team. > > Best, > Mahmut > > On Wed, Nov 11, 2020 at 19:06, Rémi Dettai wrote: > > > Hi Mahmut, > > > > The way of implementing sources for Parqu

Re: Rust ParquetReader trait

2020-11-11 Thread Rémi Dettai
Hi Mahmut, The way of implementing sources for Parquet has changed. The new way is to implement the ChunkReader trait. This is simpler (fewer methods to implement) and more efficient (you have more information about the upcoming bytes that will be read). The ParquetReader has been made private as i

Re: Best way to store ragged packet data in Parquet files

2020-11-04 Thread Rémi Dettai
Hi Jason! I guess this question would get a better response on the Parquet mailing list (https://parquet.apache.org/community/). Very interesting remark though. I looked into it and didn't find any obvious explanation. The entire size of the file is taken up by the "data" column as storing df[['data']] yields

[DataFusion] Blocking async of async is not async

2020-10-30 Thread Rémi Dettai
Hi everyone! If you are reading this, it means that you fell into the trap of my catchy (but meaningless) title! This discussion somewhat relates to [1]. DataFusion has recently made its top-level "actions" (collect, write...) async. The problem is that most of the codebase is not async (in partic
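
The problem raised in this message can be illustrated outside of DataFusion. The following is a minimal sketch in Python's asyncio rather than Rust, so nothing in it is DataFusion or tokio API; `blocking_work`, `count_ticks` and `run` are hypothetical names chosen for the illustration. It suggests how a synchronous call made directly inside a coroutine stalls every other task on the event loop, while handing the same call to a thread pool lets the other tasks keep running.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    # Stand-in for synchronous code (e.g. a blocking file read).
    time.sleep(seconds)
    return "done"


async def count_ticks(stop: asyncio.Event) -> int:
    # A cooperative task: counts how often it gets scheduled until told to stop.
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def run(offload: bool) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # yield once so the ticker can start
    if offload:
        # Run the blocking call on the default thread pool; the loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, blocking_work, 0.2)
    else:
        # Call it inline: the event loop cannot schedule anything meanwhile.
        blocking_work(0.2)
    stop.set()
    return await ticker


if __name__ == "__main__":
    print("ticks while blocking inline:", asyncio.run(run(offload=False)))
    print("ticks while offloaded:      ", asyncio.run(run(offload=True)))
```

When the call is made inline, the ticker is scheduled at most once during the whole run; when it is offloaded, the ticker keeps advancing while the work is in progress.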

Re: [DISCUSS] [C++] custom allocator for large objects

2020-07-11 Thread Rémi Dettai
b.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253 > > On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai wrote: > > > Hi Antoine and all ! > > > > Sorry for the delay, I wanted to understand things a bit better before >

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-16 Thread Rémi Dettai
ion or not. Hope this all makes sense. Took me a while to understand how the decoding works ;-) Remi On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou wrote: > > On 05/06/2020 at 17:09, Rémi Dettai wrote: > > I looked into the details of why the decoder could not estimate the > target

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
25, Uwe L. Korn wrote: > > On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote: > > Hi Antoine! > > > I would indeed have expected jemalloc to do that (remap the pages) > > I have no idea about the performance gain this would provide (if any). > > Could be interest

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
e Pitrou wrote: > > On 05/06/2020 at 14:25, Rémi Dettai wrote: > > Hi Uwe! > > > >> As your suggestions don't seem to be specific to Arrow, why not > > contribute them directly to jemalloc? They are much better in reviewing > > allocator code than w

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
. > > Still, when we read a column, we should be able to determine its final > size from the Parquet metadata. Maybe we're passing an information there > not along? > > Best, > Uwe > > On Thu, Jun 4, 2020, at 5:48 PM, Rémi Dettai wrote: > > When creating large arr

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Rémi Dettai
r allocations? On Thu, Jun 4, 2020 at 17:58, Antoine Pitrou wrote: > On Thu, 4 Jun 2020 17:48:16 +0200 > Rémi Dettai wrote: > > When creating large arrays, Arrow uses realloc quite intensively. > > > > I have an example where I read a gzipped parquet column (strings) tha

[DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Rémi Dettai
When creating large arrays, Arrow uses realloc quite intensively. I have an example where I read a gzipped Parquet column (strings) that expands from 8MB to 100+MB when loaded into Arrow. Of course jemalloc cannot anticipate this and every reallocate call above 1MB (the most critical ones) ends up
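
To give a rough sense of the cost being discussed, here is a back-of-the-envelope sketch in Python. It assumes a buffer whose capacity doubles each time it fills up and a worst case in which every realloc has to move the whole buffer; both the growth factor and the `growth_cost` helper are assumptions made for illustration, not a description of Arrow's or jemalloc's actual behaviour.

```python
MB = 1024 * 1024


def growth_cost(start_bytes: int, final_bytes: int, factor: float = 2.0):
    """Return (number of reallocations, bytes copied) to grow a buffer.

    Assumes geometric growth by `factor` and that each reallocation
    copies the full current buffer (the allocator cannot extend in place).
    """
    capacity = start_bytes
    reallocs = 0
    copied = 0
    while capacity < final_bytes:
        copied += capacity  # worst case: the existing contents are moved
        capacity = int(capacity * factor)
        reallocs += 1
    return reallocs, copied


if __name__ == "__main__":
    reallocs, copied = growth_cost(8 * MB, 100 * MB)
    print(f"{reallocs} reallocations, {copied // MB} MB copied")
```

Under these assumptions, growing from 8MB to hold 100MB takes four reallocations (to 16, 32, 64 and 128MB) and copies 120MB in total, which is more than the final payload itself.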

Re: [DISCUSS] Add kernel integer overflow handling

2020-06-04 Thread Rémi Dettai
It makes sense to me that the default behaviour of such a low-level API as the kernels does not do any automagic promotion, but shouldn't this kind of promotion still be requestable by the so-called "system developer" user? Otherwise they would need to materialize a promoted version of each original arra

Re: Arrow sync all at 12pm US-Eastern / 16:00 UTC

2020-05-27 Thread Rémi Dettai
t > Projjal Chanda > Rémi Dettai > Laurent Goujon > Andy Grove > Uwe Korn > Micah Kornfield > Wes McKinney > Rok Mihevc > Neal Richardson > François Saint-Jacques > > Discussion: > * patch queue is growing, please review things > * 1.0 > * Timeline: ta

Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

2020-04-30 Thread Rémi Dettai
Hi! Does your point 1 also apply to the AWS SDK dependency? Currently it seems that it cannot be built in BUNDLED mode. As stated in https://issues.apache.org/jira/browse/ARROW-8565, I struggled a lot to make a static build with the S3 dependency activated! I would really like to help on this bec

Re: Follow up on ARROW-8451, datafusion part of Arrow

2020-04-14 Thread Rémi Dettai
elieved that they were being hindered > by being a part of monorepo, we could create a new repository under > apache/ on GitHub for the part that wants to split into a standalone > GitHub repository. That wouldn't change the governance of that code. > > - Wes > > On Tue, Ap

Follow up on ARROW-8451, datafusion part of Arrow

2020-04-14 Thread Rémi Dettai
This is a follow-up on https://issues.apache.org/jira/browse/ARROW-8451. First, thanks for your answer! It's true that I was also surprised to see all implementations of Arrow mixed up in a single repository! I was really considering the separation of the repositories as a means to separate concer