Thanks for taking the time to review. It works great for my use case, as I am doing a one-shot load and manipulation of big data, and the data is essentially immutable for the rest of the process's lifetime, read-only after that. But I know from testing and benchmarking that it is limited by the fact that we can't easily perform in-place operations. I'm still a novice at the kernel portion of Arrow, so I'm sure there are opportunities to improve performance.
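To make that limitation concrete, here is a minimal sketch of what I mean (the helper name is hypothetical, not code from the library): Arrow's compute kernels are value-returning, so even a simple transform allocates a new array instead of reusing the input buffers.

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Hypothetical helper: illustrates that an Arrow compute kernel returns a
// new array rather than mutating the input buffers in place.
arrow::Result<std::shared_ptr<arrow::Array>> AddConstant(
    const std::shared_ptr<arrow::Array>& values, int64_t delta) {
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum result,
      arrow::compute::CallFunction(
          "add", {arrow::Datum(values), arrow::Datum(delta)}));
  // `values` is untouched; `result` owns freshly allocated buffers.
  return result.make_array();
}

Every step like this allocates new buffers, which is where I suspect most of the overhead in my benchmarks comes from.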
On Fri, Jan 27, 2023 at 9:53 AM Weston Pace <weston.p...@gmail.com> wrote:
> The new kernels are interesting. There has been some ask recently[1]
> for weighted averages and I think you have some of the pieces (if not
> all of it) here. We also recently plumbed in support for binary
> aggregates into Acero[2] so having more binary aggregate kernels would
> be nice.
>
> Outside of the kernels I agree with Kou that this probably doesn't
> need to be a part of the main repo. There is already some discussion
> of splitting the main repo itself[3] in the interest of having smaller
> composable pieces over monolithic pieces.
>
> We do want tools like this to exist and flourish so I think Benson's
> idea of a blog post is nice. We also have a "powered by" page[4].
>
> As for the library itself:
>
> This appears to be a fairly faithful reproduction of the pandas API
> and I imagine would be quite friendly to those coming from Python.
> Since you are using compute kernels directly you are going to be
> limited to operating on what you can fit in memory (though I'm sure
> there are plenty of valid use cases in this space). I think the
> primary challenge you will encounter in seeking users will be that the
> audience of C++ data scientists is pretty small. You aren't, for
> example, going to get a significant performance boost over pandas /
> numpy (as they use C/C++ for the heavy lifting already) and so the
> real benefit will only be for those that are stuck using C++ already.
>
> [1] https://github.com/apache/arrow/issues/15103
> [2] https://github.com/apache/arrow/pull/15083
> [3] https://github.com/apache/arrow/issues/15280
> [4] https://arrow.apache.org/powered_by/
>
> On Sun, Jan 22, 2023 at 2:44 AM Adesola Adedewe <ava6...@g.rit.edu> wrote:
> >
> > Yes I will. I haven't taken enough time to clean up the README; it was
> > generated from my source code with ChatGPT. I will do that later in the
> > week.
> >
> > On Sun, Jan 22, 2023 at 2:36 AM Benson Muite <benson_mu...@emailplus.org>
> > wrote:
> >
> > > On 1/22/23 13:15, Adesola Adedewe wrote:
> > > > I'm working on a project where big financial data needs to be
> > > > loaded, stored, and manipulated. The data is stored as Parquet. My
> > > > initial version had Arrow just load the Parquet data and I used a
> > > > basic unordered_map, but this limited me to only one data type. I
> > > > found I could make my database more generic with Arrow and gain its
> > > > performance benefits. Unfortunately my team is mostly Python devs,
> > > > so I decided to write a cleaner interface over Arrow, with
> > > > interfaces closer to pandas. This enabled us to use fewer lines of
> > > > code as well, and still enjoy the benefits. I will write a blog post
> > > > later; I was mostly looking for other developers interested in
> > > > collaborating, or who may need this as well. Not necessarily to add
> > > > it to the main library, but I'm not opposed to that. I also
> > > > implemented some custom kernels like covariance, correlation,
> > > > cumprod, shift, and pctchange.
> > >
> > > The context is very helpful. A blog post would certainly alert others
> > > in the Arrow community to your work. Most developers are overburdened,
> > > so explaining a use case and how it may help them would encourage
> > > exploration and review of your repository; I would therefore encourage
> > > a blog post that alerts the wider Arrow developer community to your
> > > work. Updating the README of your repository would also encourage use.
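P.S. On the weighted-average point above: a rough sketch of how it could be composed from the existing "multiply", "sum", and "divide" kernels, assuming double-typed, null-free inputs of equal length (the helper name is hypothetical, not an existing kernel).

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Weighted mean = sum(values * weights) / sum(weights), built from existing
// kernels rather than a dedicated binary aggregate.
arrow::Result<double> WeightedMean(
    const std::shared_ptr<arrow::Array>& values,
    const std::shared_ptr<arrow::Array>& weights) {
  namespace cp = arrow::compute;
  ARROW_ASSIGN_OR_RAISE(arrow::Datum weighted,
                        cp::CallFunction("multiply", {values, weights}));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum numerator, cp::Sum(weighted));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum denominator, cp::Sum(weights));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum ratio,
                        cp::Divide(numerator, denominator));
  // With double inputs, both sums and the ratio are DoubleScalars.
  return ratio.scalar_as<arrow::DoubleScalar>().value;
}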