The new kernels are interesting.  There have been some recent
requests[1] for weighted averages and I think you have some of the
pieces (if not all of them) here.  We also recently plumbed support
for binary aggregates into Acero[2], so having more binary aggregate
kernels would be nice.

Outside of the kernels, I agree with Kou that this probably doesn't
need to be part of the main repo.  There is already some discussion of
splitting the main repo itself[3] in the interest of having smaller
composable pieces rather than a monolith.

We do want tools like this to exist and flourish so I think Benson's
idea of a blog post is nice.  We also have a "powered by" page[4].

As for the library itself:

This appears to be a fairly faithful reproduction of the pandas API
and I imagine it would be quite friendly to those coming from Python.
Since you are using compute kernels directly, you are going to be
limited to operating on what you can fit in memory (though I'm sure
there are plenty of valid use cases in this space).  I think the
primary challenge you will encounter in seeking users is that the
audience of C++ data scientists is pretty small.  You aren't, for
example, going to get a significant performance boost over pandas /
NumPy (they already use C/C++ for the heavy lifting), so the real
benefit will be for those who are stuck using C++ already.

[1] https://github.com/apache/arrow/issues/15103
[2] https://github.com/apache/arrow/pull/15083
[3] https://github.com/apache/arrow/issues/15280
[4] https://arrow.apache.org/powered_by/

On Sun, Jan 22, 2023 at 2:44 AM Adesola Adedewe <ava6...@g.rit.edu> wrote:
>
> Yes, I will. I haven't taken enough time to clean up the README; it
> was generated from my source code with ChatGPT. I will do that later
> in the week.
>
> On Sun, Jan 22, 2023 at 2:36 AM Benson Muite <benson_mu...@emailplus.org>
> wrote:
>
> > On 1/22/23 13:15, Adesola Adedewe wrote:
> > > I'm working on a project where big financial data needs to be loaded,
> > > stored, and manipulated. The data is stored as Parquet. My initial
> > > version had Arrow just load the Parquet data, and I used a basic
> > > std::unordered_map, but this limited me to only one data type. I found
> > > I could make my database more generic with Arrow and get its
> > > performance benefits. Unfortunately my team is mostly Python devs, so
> > > I decided to write a cleaner interface over Arrow, using interfaces
> > > closer to pandas. This enabled us to use fewer lines of code as well,
> > > and still enjoy the benefits. I will write a blog post later; I was
> > > mostly looking for other developers who want to collaborate, or who
> > > may need this as well. Not necessarily to add it to the main library,
> > > but I'm not opposed to that. I also implemented some custom kernels
> > > like covariance, correlation, cumprod, shift, and pct_change.
> > >
> >
> > The context is very helpful. A blog post would certainly alert the
> > wider Arrow developer community to your work. Most developers are
> > overburdened, so explaining a use case and how your library may help
> > them would encourage exploration and review of your repository.
> > Updating the README of your repository would also encourage use.
